Research/OCR

From Publication Station
Revision as of 13:49, 4 December 2015 by Andre

OCR (optical character recognition)

tesseract

https://code.google.com/p/tesseract-ocr/

Tesseract is OCR software. It was developed at HP Labs between 1985 and 1995 and is currently developed by Google.

install

Debian:

aptitude install tesseract-ocr

Mac:

using Homebrew, run the commands:

brew install leptonica --with-libtiff
brew install tesseract --all-languages

https://gist.github.com/henrik/1967035
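
After installing on either system, it can help to confirm that the tesseract binary is actually reachable from the shell. The following is only a minimal sketch; the function name check_tesseract is made up for this example and is not part of tesseract.

```shell
# check_tesseract: hypothetical helper (not part of tesseract itself).
# Succeeds if a command named "tesseract" can be found, fails otherwise.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        echo "tesseract found"
    else
        echo "tesseract not found on PATH" >&2
        return 1
    fi
}

# Usage:
# check_tesseract && echo "ready to OCR"
```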


Run

prerequisites

source files should:

  • be in .tiff format
  • have a resolution of at least 300 dpi - otherwise the text recognition will be very sloppy
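
The dpi value stored in a file is not always reliable, so a rough check is to work out the resolution from the pixel width of the scan and the physical width of the page. The sketch below assumes you know both numbers; the function names effective_dpi and meets_300dpi are made up for this example, and the result is rounded to a whole number.

```shell
# effective_dpi PIXELS MILLIMETRES
# Hypothetical helper: prints the scan resolution in dpi, rounded.
# 25.4 is the number of millimetres in one inch.
effective_dpi() {
    awk -v px="$1" -v mm="$2" 'BEGIN { printf "%.0f\n", px * 25.4 / mm }'
}

# meets_300dpi PIXELS MILLIMETRES
# Succeeds only if the rounded resolution is at least 300 dpi.
meets_300dpi() {
    [ "$(effective_dpi "$1" "$2")" -ge 300 ]
}

# Usage: an A4 page is 210 mm wide.
# effective_dpi 2480 210
# meets_300dpi 2480 210 && echo "good enough for OCR"
```

A scan of an A4 page that is only about 1650 pixels wide works out to roughly 200 dpi and should be rescanned at a higher setting.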

command

tesseract input.tiff output

will result in the OCRed text file output.txt; tesseract adds the .txt extension to the output name itself
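
To OCR a whole folder of scans, the same command can be run in a loop. This is a minimal sketch, assuming all source files end in .tiff; the function name ocr_dir and the folder name in the usage line are made up for this example.

```shell
# ocr_dir DIRECTORY
# Hypothetical helper: runs tesseract on every .tiff file in DIRECTORY.
ocr_dir() {
    dir="$1"
    for f in "$dir"/*.tiff; do
        # If nothing matched, the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        # Pass the name without extension; tesseract adds .txt itself.
        tesseract "$f" "${f%.tiff}"
    done
}

# Usage (hypothetical folder name):
# ocr_dir scans
```

Each page.tiff in the folder ends up with a page.txt next to it.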