pub struct Tokenizer { /* private fields */ }
BERT-compatible tokenizer.
Combines vocabulary, pre-tokenization, and WordPiece into a single pipeline.
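For readers unfamiliar with the WordPiece stage, the following is a minimal sketch of the standard greedy longest-match-first algorithm used by BERT-style tokenizers. It is not taken from this crate's source: the function name `wordpiece`, the `HashSet` vocabulary, and the literal `[UNK]` fallback are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Split one pre-tokenized word into WordPiece sub-tokens using greedy
/// longest-match-first lookup. Continuation pieces carry a "##" prefix.
/// Hypothetical helper: returns a single "[UNK]" if any part of the word
/// cannot be matched against the vocabulary.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Try the longest candidate first, shrinking from the right.
        while start < end {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                // Pieces that do not start the word are marked as continuations.
                candidate.insert_str(0, "##");
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // No prefix matched at this position: the whole word is unknown.
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}
```

With a vocabulary containing `un`, `##aff`, and `##able`, the word `unaffable` splits into those three pieces, while a word with no matching prefix collapses to `[UNK]`.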
Implementations
impl Tokenizer
pub fn new<P: AsRef<Path>>(vocab_path: P, do_lower_case: bool) -> Result<Self>
Create a new tokenizer from a vocabulary file.
Arguments
vocab_path: Path to vocab.txt
do_lower_case: Whether to lowercase input text
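As a rough illustration of what a vocabulary file holds, the sketch below parses the conventional BERT `vocab.txt` layout: one token per line, with the zero-based line index used as the token ID. That this crate reads the file in exactly this way is an assumption, and `parse_vocab` is a hypothetical helper, not part of the documented API.

```rust
use std::collections::HashMap;

/// Parse the contents of a BERT-style `vocab.txt`.
/// Assumption: one token per line, token ID = zero-based line index.
/// Hypothetical helper for illustration only.
fn parse_vocab(contents: &str) -> HashMap<String, u32> {
    contents
        .lines()
        .enumerate()
        // Strip trailing whitespace so "\r\n" files yield clean tokens.
        .map(|(id, line)| (line.trim_end().to_string(), id as u32))
        .collect()
}
```

In practice the string would come from something like `std::fs::read_to_string(vocab_path)`, and the resulting map is what a lookup from token to ID would consult.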
pub fn from_vocab(vocab: Vocab, do_lower_case: bool) -> Self
Create a tokenizer from a pre-loaded vocabulary.
pub fn encode(&self, text: &str, max_length: usize) -> Result<Encoding>
Tokenize and encode a single text.
Pipeline: text -> pre-tokenize -> WordPiece -> add [CLS] and [SEP] -> truncate -> pad
Arguments
text: Input text
max_length: Maximum sequence length (including special tokens)
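The tail of the pipeline (add [CLS] and [SEP], truncate, pad) can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: `finalize` is a hypothetical helper, the special-token IDs are passed in explicitly, truncation is assumed to keep a trailing [SEP], and the returned attention mask is an assumption about what an encoding typically carries (the fields of `Encoding` are not shown on this page).

```rust
/// Hypothetical helper mirroring: add [CLS]/[SEP] -> truncate -> pad.
/// Returns (input_ids, attention_mask), or None if `max_length` is too
/// small to hold both special tokens.
fn finalize(
    piece_ids: &[u32],
    max_length: usize,
    cls_id: u32,
    sep_id: u32,
    pad_id: u32,
) -> Option<(Vec<u32>, Vec<u32>)> {
    if max_length < 2 {
        return None;
    }
    // Step 1: wrap the WordPiece IDs in [CLS] ... [SEP].
    let mut ids = Vec::with_capacity(piece_ids.len() + 2);
    ids.push(cls_id);
    ids.extend_from_slice(piece_ids);
    ids.push(sep_id);
    // Step 2: truncate to max_length, keeping [SEP] as the last token
    // (an assumption; the source only states that truncation happens).
    if ids.len() > max_length {
        ids.truncate(max_length);
        ids[max_length - 1] = sep_id;
    }
    // Step 3: pad to max_length; the mask is 1 for real tokens, 0 for padding.
    let mut mask = vec![1u32; ids.len()];
    ids.resize(max_length, pad_id);
    mask.resize(max_length, 0);
    Some((ids, mask))
}
```

Because `max_length` counts the special tokens, at most `max_length - 2` WordPiece IDs survive truncation in this sketch.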
pub fn encode_batch(
    &self,
    texts: &[&str],
    max_length: usize,
) -> Result<Vec<Encoding>>
Encode a batch of texts in parallel.
Uses rayon for parallel tokenization. All encodings are padded to max_length.
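The idea of encoding a batch in parallel while preserving input order can be sketched without external dependencies. Note the substitution: the crate itself uses rayon, whereas this sketch swaps in `std::thread::scope` from the standard library so that it stays self-contained. `encode_batch_sketch`, the `workers` parameter, and the `Vec<u32>` output type are all hypothetical stand-ins.

```rust
use std::thread;

/// Apply `encode_one` to every text on scoped worker threads.
/// The batch is split into contiguous chunks, one thread per chunk, and the
/// results are concatenated in chunk order, so output order matches input.
/// Hypothetical stand-in for a rayon-based implementation.
fn encode_batch_sketch<F>(texts: &[&str], workers: usize, encode_one: F) -> Vec<Vec<u32>>
where
    F: Fn(&str) -> Vec<u32> + Sync,
{
    let workers = workers.max(1);
    // Ceiling division; never zero, because `chunks(0)` would panic.
    let chunk_size = ((texts.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let encode_one = &encode_one;
        // Spawn every worker before joining any, so the chunks run concurrently.
        let handles: Vec<_> = texts
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|&t| encode_one(t)).collect::<Vec<_>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

With rayon, the same shape is usually written as `texts.par_iter().map(...).collect()`, which also preserves input order and leaves scheduling to rayon's thread pool.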
pub fn vocab_size(&self) -> usize
Get the vocabulary size.
Trait Implementations
Auto Trait Implementations
impl Freeze for Tokenizer
impl RefUnwindSafe for Tokenizer
impl Send for Tokenizer
impl Sync for Tokenizer
impl Unpin for Tokenizer
impl UnwindSafe for Tokenizer
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.