
Digitization by layout recognition and OCR

Digitization is primarily used for conservation and for fast retrieval of information.
The digitizing process converts each scanned page into machine-readable text.

OCR recognition

The file name of each page contains metadata such as:

  • issue number of the newspaper
  • initials of the issue
  • publication date
  • page number

 

This information is appended to each article during article separation.
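As an illustration, the metadata could be read from a page file name roughly as follows. This is only a sketch: the naming scheme `<initials>_<issue>_<YYYYMMDD>_<page>.<ext>`, the example name `NZ_0042_19980315_007.tif`, and the names `PageMetadata`, `parse_page_filename` and `attach_metadata` are assumptions made for this example, not the actual convention used by the software.

```python
import re
from dataclasses import dataclass, asdict
from datetime import date, datetime
from pathlib import PurePath

# Hypothetical naming scheme (an assumption, not the real convention):
#   <initials>_<issue>_<YYYYMMDD>_<page>.<ext>
PAGE_NAME = re.compile(
    r"^(?P<initials>[A-Za-z]+)_(?P<issue>\d+)_(?P<date>\d{8})_(?P<page>\d+)$"
)


@dataclass(frozen=True)
class PageMetadata:
    initials: str
    issue_number: int
    publication_date: date
    page_number: int


def parse_page_filename(filename: str) -> PageMetadata:
    """Extract the four metadata fields from a page file name."""
    stem = PurePath(filename).stem  # strip directory and extension
    match = PAGE_NAME.match(stem)
    if match is None:
        raise ValueError(f"unexpected page file name: {filename!r}")
    return PageMetadata(
        initials=match["initials"],
        issue_number=int(match["issue"]),
        publication_date=datetime.strptime(match["date"], "%Y%m%d").date(),
        page_number=int(match["page"]),
    )


def attach_metadata(article: dict, meta: PageMetadata) -> dict:
    """Return a copy of the article with the page metadata appended."""
    return {**article, **asdict(meta)}
```

With this assumed scheme, `parse_page_filename("NZ_0042_19980315_007.tif")` yields the initials, issue number, publication date and page number as typed fields, and `attach_metadata` copies them onto every article found on that page.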


Further detection criteria on the page:

  • columns
  • roof lines (kickers)
  • titles
  • subtitles
  • leaders (lead paragraphs)
  • texts
  • picture legends
  • photographers
  • authors
  • photos and graphics (drawings) must be distinguished from each other
  • tables must be recognized
  • advertisements must be recognized


This information is labeled and selected by the layout recognition and is then available to the article separation and the PPS-Finder.
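One way to picture the labeled output is as a list of typed page regions that later stages filter by label. The sketch below is purely illustrative: `RegionType`, `Region`, `select_regions` and the pixel-based bounding box are hypothetical names and assumptions, not the data model of the layout recognition or the PPS-Finder.

```python
from dataclasses import dataclass
from enum import Enum


class RegionType(Enum):
    """Labels corresponding to the detection criteria listed above."""
    COLUMN = "column"
    ROOF_LINE = "roof line"
    TITLE = "title"
    SUBTITLE = "subtitle"
    LEADER = "leader"
    TEXT = "text"
    PICTURE_LEGEND = "picture legend"
    PHOTOGRAPHER = "photographer"
    AUTHOR = "author"
    PHOTO = "photo"
    GRAPHIC = "graphic"
    TABLE = "table"  # tabular material
    ADVERTISEMENT = "advertisement"


@dataclass(frozen=True)
class Region:
    kind: RegionType
    x: int          # bounding box in page pixels (assumed unit)
    y: int
    width: int
    height: int
    text: str = ""  # OCR result; empty for photos and graphics


def select_regions(regions, *kinds):
    """Keep only the regions carrying one of the requested labels."""
    wanted = set(kinds)
    return [region for region in regions if region.kind in wanted]
```

A downstream step such as article separation could then call `select_regions(page_regions, RegionType.TITLE, RegionType.TEXT)` to work only with the regions it needs, while advertisements stay labeled but out of the way.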