Difference between revisions of "Research/OCR"
From Publication Station
Revision as of 13:49, 4 December 2015
OCR (optical character recognition)
tesseract
https://code.google.com/p/tesseract-ocr/
Tesseract is OCR software. It was developed at HP Labs between 1985 and 1995 and is currently developed by Google.
install
Debian:
aptitude install tesseract-ocr
Mac:
Using Homebrew, run the commands:
brew install leptonica --with-libtiff
brew install tesseract --all-languages
https://gist.github.com/henrik/1967035
Run
prerequisites
Source files should:
- be in .tiff format
- have a resolution of at least 300 dpi, otherwise the text recognition will be very sloppy
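A minimal sketch of checking the first prerequisite before running OCR. `is_tiff` is a hypothetical helper name, not part of tesseract. It looks only at the file extension and does not verify the resolution, which still has to be checked by other means.

```shell
# is_tiff: hypothetical helper; succeeds if the file name ends in .tif or .tiff
is_tiff() {
    case "$1" in
        *.tiff|*.tif|*.TIFF|*.TIF) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn about any file given on the command line that needs converting first
for f in "$@"; do
    is_tiff "$f" || echo "not a TIFF file: $f" >&2
done
```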
command
tesseract input.tiff output
This writes the recognized text to output.txt; tesseract appends the .txt extension to the output name itself.
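For several scans, the same command can be wrapped in a loop. This is only a sketch: `ocr_base` and `ocr_all` are hypothetical helper names, and the file names in the example call are placeholders.

```shell
# ocr_base: hypothetical helper; strips the extension from a file name,
# because tesseract appends .txt to the output name on its own
ocr_base() {
    printf '%s\n' "${1%.*}"
}

# ocr_all: hypothetical helper; runs tesseract on each file given,
# so page1.tiff is written to page1.txt, and stops at the first failure
ocr_all() {
    for f in "$@"; do
        tesseract "$f" "$(ocr_base "$f")" || return 1
    done
}

# Example call (assumes these files exist):
# ocr_all page1.tiff page2.tiff
```

Deriving the output name from the input name keeps each text file next to its scan and avoids ending up with a doubled extension such as page1.txt.txt.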