Difference between revisions of "Research/OCR"
From Publication Station
Revision as of 13:56, 4 December 2015
OCR (optical character recognition)
tesseract
https://code.google.com/p/tesseract-ocr/
Tesseract is OCR software. It was initially developed at HP Labs between 1985 and 1995; its development is currently sponsored by Google.
It is free software, released under the Apache License.
install
Debian:
aptitude install tesseract-ocr
Mac:
Using Homebrew, run the commands:
brew install leptonica --with-libtiff
brew install tesseract --all-languages
https://gist.github.com/henrik/1967035
Run
prerequisites
source files should:
- be in .tiff format
- have a resolution of at least 300 dpi; otherwise the text recognition will be very sloppy
- contain only one column of text
command
tesseract input.tiff output
will result in an OCRed file named output.txt
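When a whole folder of scans needs to be processed, the same command can be wrapped in a loop. The following is only a minimal sketch: the function name ocr_dir is a hypothetical helper, and it assumes every source file ends in .tiff and already meets the prerequisites above.

```shell
# ocr_dir is a hypothetical helper, not part of tesseract.
# It runs "tesseract <file>.tiff <file>" for every .tiff file in a directory,
# so that scan.tiff produces scan.txt next to it.
ocr_dir() {
    dir="${1:-.}"                    # default to the current directory
    for f in "$dir"/*.tiff; do
        [ -e "$f" ] || continue      # skip when no .tiff files match
        base="${f%.tiff}"            # strip the extension; tesseract adds .txt
        tesseract "$f" "$base"
    done
}

# Usage (assumed directory name):
#   ocr_dir scans
```

The output base name is passed without an extension because tesseract appends .txt itself.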
Languages
By default tesseract is optimized to work with the English language. This behavior can be changed by installing the extra packages required for other languages and by giving tesseract the correct language setting.
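As an illustration, the language is selected with the -l option followed by a three-letter language code. The sketch below uses German (deu) as an assumed example; ocr_lang is a hypothetical helper name, and the Debian package name in the comments assumes the usual tesseract-ocr-<code> naming.

```shell
# On Debian the language data would first be installed with something like:
#   aptitude install tesseract-ocr-deu
# The installed languages can then be listed with:
#   tesseract --list-langs

# ocr_lang is a hypothetical helper, not part of tesseract.
# Arguments: language code, input image, output base name (without .txt).
ocr_lang() {
    lang="$1"
    input="$2"
    output="$3"
    tesseract "$input" "$output" -l "$lang"
}

# Usage (assumed file names):
#   ocr_lang deu input.tiff output
```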