Research/OCR
From Publication Station
OCR (optical character recognition)
tesseract
https://code.google.com/p/tesseract-ocr/
Tesseract is OCR software. It was developed at HP Labs between 1985 and 1995 and is currently developed by Google.
install
Debian:
aptitude install tesseract-ocr
Mac:
Using Homebrew, run these commands:
brew install leptonica --with-libtiff
brew install tesseract --all-languages
https://gist.github.com/henrik/1967035
Run
prerequisites
source files should be:
- in .tiff format
- at a resolution of at least 300 dpi; otherwise the text recognition will be very sloppy
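The resolution requirement can be checked before running the OCR. The following is a minimal sketch, not part of tesseract: it reads the resolution stored in the header of a classic TIFF file using only the Python standard library. The function names tiff_dpi and is_ocr_ready are hypothetical. It looks only at the first page, does not handle BigTIFF, and trusts whatever value the scanner wrote into the file.

```python
import struct

INCH, CENTIMETER = 2, 3  # values of the TIFF ResolutionUnit tag


def tiff_dpi(data):
    """Return the horizontal resolution (dots per inch) stored in the
    first page of a TIFF given as bytes, or None if it is not recorded."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])  # byte order marker
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a classic TIFF file")
    (ifd,) = struct.unpack(order + "I", data[4:8])  # offset of first directory
    (count,) = struct.unpack(order + "H", data[ifd:ifd + 2])
    x_res, unit = None, INCH  # the TIFF default unit is inches
    for i in range(count):
        entry = data[ifd + 2 + 12 * i: ifd + 14 + 12 * i]
        tag, _type, _n = struct.unpack(order + "HHI", entry[:8])
        if tag == 282:  # XResolution: offset to a numerator/denominator pair
            (offset,) = struct.unpack(order + "I", entry[8:12])
            num, den = struct.unpack(order + "II", data[offset:offset + 8])
            x_res = num / den if den else None
        elif tag == 296:  # ResolutionUnit: a short stored inline
            (unit,) = struct.unpack(order + "H", entry[8:10])
    if x_res is None or unit not in (INCH, CENTIMETER):
        return None
    return x_res * 2.54 if unit == CENTIMETER else x_res


def is_ocr_ready(path, minimum_dpi=300):
    """True if the file at `path` records a resolution of at least `minimum_dpi`."""
    with open(path, "rb") as f:
        dpi = tiff_dpi(f.read())
    return dpi is not None and dpi >= minimum_dpi
```

For example, is_ocr_ready("input.tiff") returns False for a scan saved at 150 dpi, which is a hint to rescan before running tesseract.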
command
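tesseract input.tiff output
This writes the recognized text to output.txt; tesseract adds the .txt extension to the output name itself.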
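To run the command over a whole folder of scans, a small wrapper script can help. This is only a sketch under assumptions: the helper names build_command and ocr_directory are hypothetical, tesseract is assumed to be on the PATH, and each text file is written next to its image. The output argument is given without an extension because tesseract appends .txt to it.

```python
import subprocess
from pathlib import Path


def build_command(image):
    """Return the tesseract command line for one image.

    The second argument is the output base name: page1.tiff is
    recognized into page1.txt, because tesseract appends .txt itself.
    """
    return ["tesseract", str(image), str(image.with_suffix(""))]


def ocr_directory(directory):
    """Run tesseract on every .tif/.tiff file in `directory`.

    Returns the list of text files that tesseract was asked to write.
    """
    written = []
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in {".tif", ".tiff"}:
            continue  # skip anything that is not a TIFF scan
        subprocess.run(build_command(image), check=True)  # raises if tesseract fails
        written.append(image.with_suffix(".txt"))
    return written
```

Calling ocr_directory("scans") processes the files in alphabetical order and stops at the first file that tesseract fails on.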