Digitization by layout recognition and OCR
Digitization is used primarily for conservation and fast retrieval of information.
The digitizing process converts each scanned page into machine-readable text.

The file name of each page contains metadata such as:
- issue number of the newspapers
- initials of the issue
- publication date
- page number
This information is appended to each article during article separation.
Layout recognition also detects the following elements on the page:
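
As an illustration, the metadata could be read from a page file name as sketched below. The source does not specify the naming scheme, so the pattern `<initials>_<issue number>_<YYYYMMDD>_<page number>.<extension>`, the example file name, and the names `PageMetadata` and `parse_page_filename` are all assumptions made for this sketch.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path

# Hypothetical naming scheme (an assumption, not taken from the source):
#   <initials>_<issue number>_<YYYYMMDD>_<page number>.<extension>
FILENAME_PATTERN = re.compile(
    r"^(?P<initials>[A-Za-z]+)_(?P<issue>\d+)_(?P<date>\d{8})_(?P<page>\d+)$"
)


@dataclass(frozen=True)
class PageMetadata:
    """Metadata carried by the file name of one scanned page."""
    initials: str
    issue_number: int
    publication_date: date
    page_number: int


def parse_page_filename(filename: str) -> PageMetadata:
    """Split a page file name into its metadata fields."""
    stem = Path(filename).stem  # drop the directory and the extension
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected page file name: {filename!r}")
    return PageMetadata(
        initials=match["initials"],
        issue_number=int(match["issue"]),
        publication_date=datetime.strptime(match["date"], "%Y%m%d").date(),
        page_number=int(match["page"]),
    )


# Example with a made-up file name that follows the assumed scheme.
meta = parse_page_filename("XY_123_19990315_004.tif")
```

Article separation could then copy the fields of `meta` onto every article found on that page.
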
- columns
- roof lines (kickers, the short lines above a title)
- titles
- subtitles
- leaders
- texts
- picture legends
- photographers
- authors
- photos and graphics (drawings), which must be distinguished from each other
- tables (spread sheets), which must be recognized
- advertisements, which must be recognized
This information is labeled and selected by the layout recognition and is then available to the article separation and the PPS-Finder.
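
The labeled output of the layout recognition could be represented roughly as in the following sketch. The source does not describe the actual data model, so `RegionLabel`, `Region`, `select_regions`, the bounding-box fields, and the sample values are hypothetical names and data chosen for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class RegionLabel(Enum):
    """Element types from the list above (identifiers are assumptions)."""
    COLUMN = "column"
    ROOF_LINE = "roof_line"
    TITLE = "title"
    SUBTITLE = "subtitle"
    LEADER = "leader"
    TEXT = "text"
    PICTURE_LEGEND = "picture_legend"
    PHOTOGRAPHER = "photographer"
    AUTHOR = "author"
    PHOTO = "photo"
    GRAPHIC = "graphic"
    SPREAD_SHEET = "spread_sheet"
    ADVERTISEMENT = "advertisement"


@dataclass(frozen=True)
class Region:
    """One labeled rectangle on a page, with its OCR text if it has any."""
    label: RegionLabel
    x: int
    y: int
    width: int
    height: int
    text: str = ""


def select_regions(regions: Iterable[Region], *labels: RegionLabel) -> List[Region]:
    """Return the regions carrying one of the given labels, in page order."""
    wanted = set(labels)
    return [region for region in regions if region.label in wanted]


# Made-up sample page: one title, one text block, one advertisement.
page_regions = [
    Region(RegionLabel.TITLE, 40, 30, 600, 60, "Sample headline"),
    Region(RegionLabel.TEXT, 40, 100, 290, 700, "Sample body text"),
    Region(RegionLabel.ADVERTISEMENT, 340, 100, 300, 700),
]

# A later step such as article separation might only need the editorial text.
editorial = select_regions(page_regions, RegionLabel.TITLE, RegionLabel.TEXT)
```

Keeping the label separate from the geometry lets a later step filter by element type without inspecting the page image again.
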




