How can I adjust the tokenization of words, such as turning off the Americanization of spelling?
By default, the tokenizer used by the English parser (PTBTokenizer) performs various normalizations to make the input closer to the normalized form of English found in the Penn Treebank. One of these normalizations is the Americanization of spelling variants (such as changing colour to color). Others include changing round parentheses to -LRB- and -RRB-. Starting with version 1.6.2 of the parser, there is a fairly flexible scheme for options in tokenization style. You can give an option such as this one to turn off Americanization of spelling:

    -tokenizerOptions "americanize=false"

Or this one to change several options at once:

    -tokenizerOptions "americanize=false,normalizeCurrency=false,unicodeEllipsis=true"

See the documentation of PTBTokenizer for details. Programmatically, you can do the same things by creating a TokenizerFactory with the appropriate options, such as:

    parse(new DocumentPreprocessor(PTBTokenizerFactory.newWordTokenizerFactory("americanize=false")).getWordsFro