What is the difference between freely available n-gram statistical libraries for tokenization and segmentation, and full morphological analysis?

Tokenization is the process of breaking content into words. In English it is comparatively simple because words are separated by white space, but other languages may not use white space at all, or may join several words into one written unit. Breaking a non-delimited chunk of text into words is called segmentation. Put simply, n-gram statistical libraries are probabilistic machines: they pick word boundaries from sequence frequencies observed in training text, so the results are only rough approximations. Morphological analysis, on the other hand, uses heuristics based on a language’s grammar. Here is an excellent whitepaper comparing the two approaches. For Semitic languages, which attach prepositions and conjunctions to words as prefixes, or Germanic languages, which form compounds, n-gram segmentation is of little practical use.
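
To make the contrast concrete, here is a minimal Python sketch of both ideas. Everything in it is an assumption made for illustration: the word counts, the two-prefix "grammar" for transliterated Hebrew, and the function names `statistical_segment` and `morphological_analyze` are hypothetical and do not come from any particular library. The statistical part uses the simplest possible n-gram model (n = 1, single-word frequencies).

```python
import math

# Hypothetical toy counts; a real statistical library learns these from a corpus.
COUNTS = {"the": 50, "then": 5, "cat": 10, "cats": 3,
          "sat": 8, "at": 20, "a": 30, "hen": 2}
TOTAL = sum(COUNTS.values())


def statistical_segment(text, max_len=8):
    """Return the most probable split of undelimited text under a unigram model."""
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in COUNTS and best[start][1] is not None:
                score = best[start][0] + math.log(COUNTS[piece] / TOTAL)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]  # None when no split into known words exists


# Hypothetical toy grammar: prefixes may only attach in this order.
PREFIXES = [("ve", "CONJ:and"), ("ha", "DET:the")]
STEMS = {"bayit": "NOUN:house", "yeled": "NOUN:boy"}


def morphological_analyze(word, first_allowed=0):
    """Strip prefixes in grammatical order; accept only a known stem at the end."""
    if word in STEMS:
        return [(word, STEMS[word])]
    for i in range(first_allowed, len(PREFIXES)):
        prefix, tag = PREFIXES[i]
        if word.startswith(prefix):
            rest = morphological_analyze(word[len(prefix):], i + 1)
            if rest is not None:
                return [(prefix, tag)] + rest
    return None  # no analysis licensed by the rules


print(statistical_segment("thecatsat"))
print(morphological_analyze("vehabayit"))
print(morphological_analyze("havebayit"))
```

With these toy counts, the statistical function prefers "the cat sat" over "the cats at" only because the product of the counts is higher; change the numbers and the answer changes. The rule-based function instead returns each piece with a grammatical label, and rejects "havebayit" outright because the prefixes appear in an order the toy grammar does not allow.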

Experts123