What are the training sets for the different parser models?
For Chinese (and Arabic, German, and "WSJ"), you can look at the included file makeSerialized.csh and easily see exactly which files each model is trained on, in terms of LDC or Negra file numbers. The only opaque case is english{Factored|PCFG}. For results comparable with other work, you should use the WSJ models, which are trained on the standard WSJ sections 2-21, but the english* models should work a bit better on anything other than 1980s WSJ text. english{Factored|PCFG} is currently trained on:

• WSJ sections 1-21
• Genia, as reformatted by Andrew Clegg, using his training split
• 2 English Chinese Translation Treebank and 3 English Arabic Translation Treebank files, backported to the original treebank annotation standards (by us)
• 95 sentences parsed by us (mainly questions and imperatives; a few from recent newswire)

However, this is likely to change in future releases. (In a future release, we'll likely change to the official Genia release, and there are now bigger and better sources of parsed questions.)
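As a minimal illustration of the first point above, you can inspect makeSerialized.csh from the command line to see the training file lists. The grep pattern here is only an assumption about how the training options are written in the script; adjust it to the actual target names:

    # Show lines of the serialization script that mention training files.
    # (Illustrative only; the real option names in makeSerialized.csh may differ.)
    grep -n "train" makeSerialized.csh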