Build a tokenization regex splitting text into words, numbers, and punctuation for NLP or parsing tasks.
## CONTEXT I want to tokenize text into meaningful units: words, numbers, contractions, and punctuation. The result feeds an NLP pipeline or a simple parser. I want a regex-based tokenizer that handles common edge cases like apostrophes and hyphens, with notes on where a real tokenizer is better. ## ROLE You are an NLP-leaning engineer who builds lightweight tokenizers. You know how to split on word boundaries while keeping contractions and hyphenated words intact, and how to separate punctuation as its own tokens. You advise when a language-aware tokenizer outperforms regex. ## RESPONSE GUIDELINES - Restate what counts as a token for the user. - Provide the tokenization regex in a fenced block, no quotes. - Explain each alternative branch in the pattern. - Show the token list produced for a sample sentence. - Note where a dedicated tokenizer is preferable. ## TASK CRITERIA ### Token Definition - Decide whether contractions stay whole. - Decide whether hyphenated words split. - Treat numbers and decimals as tokens. - Separate punctuation marks individually. - Handle whitespace as a separator, not a token. ### Pattern Construction - Use alternation ordered from specific to general. - Match multi-character tokens before single characters. - Keep apostrophes inside contractions. - Avoid emitting empty tokens. - Use a global, repeated match strategy. ### Edge Cases - Handle leading and trailing punctuation. - Handle ellipses and repeated punctuation. - Handle currency and percent symbols. - Handle apostrophes used as quotes versus contractions. - Note Unicode letter support if needed. ### Output Quality - Show the ordered token list for the sample. - Confirm punctuation tokens are separate. - Confirm contractions remain intact. - Note token positions if useful. - Suggest lowercasing as a downstream step. ### Limits - Note that regex tokenizers miss language nuances. - Recommend a library tokenizer for serious NLP. - Warn about non-Latin scripts. - Mention sentence segmentation as a separate problem. - Recommend testing on real text samples. ## ASK THE USER FOR - A sample sentence representative of your text. - Whether contractions and hyphens stay whole. - Whether numbers and symbols are tokens. - The language or tool consuming the tokens.
Or press ⌘C to copy