Research/OCR: Difference between revisions
From Publication Station
Latest revision as of 15:28, 4 December 2015
OCR (optical character recognition)
tesseract
https://code.google.com/p/tesseract-ocr/
Tesseract is OCR software. It was initially developed by HP Labs between 1985 and 1995; its development is currently sponsored by Google.
It is free software, released under the Apache License.
install
Debian:
aptitude install tesseract-ocr
Mac:
Using Homebrew, run the commands:
brew install leptonica --with-libtiff
brew install tesseract --all-languages
https://gist.github.com/henrik/1967035
Run
Preparing source files
Source files (images) should:
- be in .png or .tiff format
- use a 1-bit (two-color) color space: black (text) and white (background)
- have a resolution of at least 300 dpi; below that, text recognition will be very sloppy
- contain only one column of text
- contain no images; replace them with a white rectangle
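The black-and-white conversion can be scripted. The following is a minimal sketch, assuming ImageMagick's `convert` is installed (ImageMagick is not mentioned on this page and is only one possible tool). The helper name `prepare_scan` and the 50% threshold are illustrative assumptions; the threshold usually needs tuning per scan. The sketch does not change resolution, which has to be set when scanning.

```shell
# Sketch: convert a scan to a black-and-white TIFF with ImageMagick.
# Assumptions: ImageMagick's `convert` is installed; the 50% threshold is
# only a starting value; prepare_scan is a hypothetical helper name.
prepare_scan() {
    src="$1"
    out="${src%.*}-bw.tiff"
    # Go to grayscale first, then force every pixel to pure black or white.
    convert "$src" -colorspace Gray -threshold 50% -type Bilevel "$out"
}

# Usage: prepare_scan page01.png   # writes page01-bw.tiff
```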
command
tesseract input.tiff output
This writes the recognized text to the file output.txt.
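For a multi-page scan, the same command can be run once per image. A minimal sketch, assuming the pages are stored as separate .tiff files in one directory; `ocr_dir` is a hypothetical helper name, not part of tesseract.

```shell
# Sketch: OCR every .tiff file in a directory, one text file per page.
# ocr_dir is a hypothetical helper name.
ocr_dir() {
    dir="${1:-.}"
    for f in "$dir"/*.tiff; do
        [ -e "$f" ] || continue        # skip when the glob matched nothing
        tesseract "$f" "${f%.tiff}"    # page01.tiff -> page01.txt
    done
}

# Usage: ocr_dir scans/
```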
Languages
By default, tesseract is optimized for English. This behavior can be changed by installing the extra packages required for other languages and passing the matching language setting on the command line.
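As an illustration, the sketch below runs tesseract with German language data. It assumes the Debian naming scheme tesseract-ocr-<code> for language packages (here tesseract-ocr-deu) and a tesseract version that supports `--list-langs`; check both against your installation. `ocr_lang` is a hypothetical helper name.

```shell
# Sketch: OCR with a non-English language.
# Assumptions: Debian language packages are named tesseract-ocr-<code>,
# e.g. tesseract-ocr-deu for German; ocr_lang is a hypothetical helper.
#
# Install the language data once (as root):
#   aptitude install tesseract-ocr-deu
# List the languages tesseract can find:
#   tesseract --list-langs
ocr_lang() {
    lang="$1"; src="$2"; out="$3"
    # -l selects the language data used for recognition.
    tesseract "$src" "$out" -l "$lang"
}

# Usage: ocr_lang deu input.tiff output   # writes output.txt
```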
