What are the training sets for the different parser models?
For Chinese (and Arabic, German, and "WSJ"), you can look at the included file makeSerialized.csh and easily see exactly which files each model is trained on, in terms of LDC or Negra file numbers. The only opaque case is english{Factored|PCFG}. For results comparable with other work, you should use the WSJ models, which are trained on the standard WSJ sections 2-21, but the english* models should work a bit better on anything other than 1980s WSJ text. english{Factored|PCFG} is currently trained on:

• WSJ sections 1-21
• Genia, as reformatted by Andrew Clegg, using his training split
• 2 English Chinese Translation Treebank and 3 English Arabic Translation Treebank files, backported to the original treebank annotation standards (by us)
• 95 sentences parsed by us (mainly questions and imperatives; a few from recent newswire)

However, this is likely to change in future releases. (In a future release, we'll likely change to the official Genia release, and there are now bigger and better sources of parsed questions.)
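As a minimal illustration of the first point above, you can inspect makeSerialized.csh from the command line to see the training file lists. The grep pattern here is only an assumption about how the training options are written in the script; adjust it to the actual target names:

    # Show lines of the serialization script that mention training files.
    # (Illustrative only; the real option names in makeSerialized.csh may differ.)
    grep -n "train" makeSerialized.csh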