How does newspaper article and page segmentation work?
The main distinction in newspaper processing is between “page-level access” and “article-level segmentation.” Page-level access is intended to be the equivalent of reading a hard copy of the newspaper: the reader sees the order of the content and the integrity of each article by viewing the page in its original layout. This is a relatively standard process for newspaper digitization projects. It links each OCR text file to its page image, links the page images of an issue to one another in the proper order, and supplies machine-captured metadata to the METS and ALTO files. Article segmentation is a more specialized process that divides (“segments”) the page into its individual components (article, photo, weather box, etc.) so that certain delivery systems can display each component as a separate digital entity while still maintaining its relationship to the parent page. An example of article segmentation can be found at http://www.dimemanews.com/cdm4/browse.php. Select the first paper to view