Difference between revisions of "Research/OCR"
From Publication Station
Revision as of 13:49, 4 December 2015
OCR (optical character recognition)
tesseract
https://code.google.com/p/tesseract-ocr/
Tesseract is OCR software. It was developed at HP Labs between 1985 and 1995 and is currently developed by Google.
install
Debian:
aptitude install tesseract-ocr
Mac:
Using Homebrew, run the commands:
brew install leptonica --with-libtiff
brew install tesseract --all-languages
https://gist.github.com/henrik/1967035
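After installing, it can help to confirm that the binary is actually on the PATH. A minimal sketch, assuming a POSIX shell; check_tesseract is a hypothetical helper name, not part of tesseract:

```shell
# Hypothetical helper: report the installed tesseract version, or fail.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        # Some versions print the version banner on stderr, so merge streams.
        tesseract --version 2>&1 | head -n 1
    else
        echo "tesseract not found on PATH" >&2
        return 1
    fi
}

# Usage: check_tesseract
```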
Run
prerequisites
Source files should:
- be in .tiff format
- have a resolution of at least 300 dpi; otherwise the text recognition will be very sloppy
command
tesseract input.tiff output
Tesseract appends .txt to the output name itself, so this writes output.txt.
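For a folder of scans, the same command can be wrapped in a loop. This is a rough sketch, assuming a POSIX shell and that every source file ends in .tiff; output_base and ocr_dir are hypothetical helper names and "scans" is a placeholder directory:

```shell
# Strip the file extension to get the output base name;
# tesseract adds ".txt" to that base on its own.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Run tesseract on every .tiff file in the given directory.
ocr_dir() {
    dir="$1"
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue   # the glob matched nothing
        tesseract "$img" "$(output_base "$img")"
    done
}

# Usage: ocr_dir scans
```

Each text file lands next to its source image, so scans/page1.tiff would yield scans/page1.txt.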