pub fn pre_tokenize(text: &str, do_lower_case: bool) -> Vec<String>
Pre-tokenize a text string into word-level tokens.
Applies lowercasing (when do_lower_case is true), accent stripping, CJK character splitting, whitespace splitting, and punctuation splitting, and returns the resulting tokens as owned strings.
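
The steps listed above can be illustrated with a minimal, std-only sketch. This is not the actual implementation: `pre_tokenize_sketch`, `is_cjk`, and `is_punct` are hypothetical names introduced here, the CJK ranges and the ASCII-only punctuation test are simplifying assumptions, and accent stripping is left out because it needs Unicode normalization, which the Rust standard library does not provide.

```rust
/// Hypothetical helper: true for code points in a few common CJK ideograph
/// blocks. The exact ranges used by the real function are an assumption.
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x4E00..=0x9FFF | 0x3400..=0x4DBF | 0xF900..=0xFAFF | 0x20000..=0x2A6DF
    )
}

/// Hypothetical helper: ASCII punctuation only. A fuller implementation
/// would probably also treat Unicode punctuation as a split point.
fn is_punct(c: char) -> bool {
    c.is_ascii_punctuation()
}

/// Sketch of the pre-tokenization pipeline (accent stripping omitted).
fn pre_tokenize_sketch(text: &str, do_lower_case: bool) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();

    for c in text.chars() {
        if c.is_whitespace() {
            // Whitespace ends the current word and is itself discarded.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
        } else if is_cjk(c) || is_punct(c) {
            // CJK characters and punctuation each become their own token.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string());
        } else if do_lower_case {
            // to_lowercase() may yield more than one char for some inputs.
            current.extend(c.to_lowercase());
        } else {
            current.push(c);
        }
    }

    // Flush the trailing word, if any.
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Under these assumptions, calling `pre_tokenize_sketch("Hello, World!", true)` splits the comma and exclamation mark into separate tokens and lowercases the two words, while passing `false` keeps the original casing.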