How can I adjust the tokenization of words, such as turning off the Americanization of spelling?

Posted April 26, 2017. Tags: americanization, spelling, tokenization, Turning words

1 Answer

By default, the tokenizer used by the English parser (PTBTokenizer) performs various normalizations to bring the input closer to the normalized form of English found in the Penn Treebank. One of these normalizations is the Americanization of spelling variants (such as changing colour to color). Others include changing round parentheses to -LRB- and -RRB-.

Starting with version 1.6.2 of the parser, there is a fairly flexible scheme for options in tokenization style. You can give an option such as this one to turn off Americanization of spelling:

    -tokenizerOptions "americanize=false"

Or this one to change several options:

    -tokenizerOptions "americanize=false,normalizeCurrency=false,unicodeEllipsis=true"

See the documentation of PTBTokenizer for details.

Programmatically, you can do the same things by creating a TokenizerFactory with the appropriate options, such as:

    parse(new DocumentPreprocessor(PTBTokenizerFactory.newWordTokenizerFactory("americanize=false")).getWordsFro
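
As a rough illustration of the option-string format used above, here is a minimal Java sketch that assembles the comma-separated key=value string from a set of flags. The class name TokenizerOptionsSketch and its join method are hypothetical helpers made up for this example; they are not part of the parser. Only the option names (americanize, normalizeCurrency, unicodeEllipsis) and the -tokenizerOptions flag come from the answer itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of the parser): builds the option string
// in the "key=value,key=value" form shown in the answer.
final class TokenizerOptionsSketch {

    // Joins the flags, in insertion order, into a comma-separated string
    // with no spaces, e.g. "americanize=false,normalizeCurrency=false".
    static String join(Map<String, Boolean> flags) {
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, Boolean> entry : flags.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in the order they were added.
        Map<String, Boolean> flags = new LinkedHashMap<>();
        flags.put("americanize", false);
        flags.put("normalizeCurrency", false);
        flags.put("unicodeEllipsis", true);

        String options = join(flags);

        // Print the command-line form of the argument.
        System.out.println("-tokenizerOptions \"" + options + "\"");

        // With the parser on the classpath, the same string could be handed to
        // PTBTokenizerFactory.newWordTokenizerFactory(options), as the answer
        // describes; that call is left out so this sketch runs without the library.
    }
}
```

Note that the quotes around the option string should be plain ASCII double quotes; curly quotes pasted from a web page are not treated as quoting characters by a shell.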