Difference between revisions of "Research/OCR"

From Publication Station

Revision as of 13:49, 4 December 2015

OCR (optical character recognition)

Tesseract

https://code.google.com/p/tesseract-ocr/

Tesseract is OCR software. It was developed at HP Labs between 1985 and 1995 and is currently developed by Google.

Install

Debian:

aptitude install tesseract-ocr

Mac:

Using Homebrew, run the following commands:

brew install leptonica --with-libtiff
brew install tesseract --all-languages

https://gist.github.com/henrik/1967035


Run

Prerequisites

Source files should:

  • be in .tiff format
  • have a resolution of at least 300 dpi; otherwise the text recognition will be very sloppy
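Both prerequisites can be checked before running the OCR. The following is only a sketch: check_source is a hypothetical helper name, and it assumes that ImageMagick's identify tool is installed (this page does not cover it) and that the file stores its resolution in pixels per inch.

```shell
# check_source is a hypothetical helper name, not part of Tesseract.
# Assumes ImageMagick's `identify` is available on the PATH.
check_source() {
    file="$1"

    # Prerequisite 1: accept only the .tiff/.tif extensions.
    case "$file" in
        *.tiff|*.tif) ;;
        *) echo "$file: not a TIFF file" >&2; return 1 ;;
    esac

    # Prerequisite 2: %x prints the horizontal resolution; [0] selects
    # the first page. Assumes the value is in pixels per inch.
    res=$(identify -format '%x' "${file}[0]") || return 1
    res=${res%% *}   # drop a trailing unit name, if any
    res=${res%%.*}   # drop a fractional part, if any

    # Reject anything that is not a plain whole number.
    case "$res" in
        ''|*[!0-9]*) echo "$file: could not read resolution" >&2; return 1 ;;
    esac

    if [ "$res" -lt 300 ]; then
        echo "$file: only $res dpi, expected at least 300" >&2
        return 1
    fi
}
```

For example, check_source input.tiff && echo "ready for OCR" prints the message only when both checks pass.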

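Command

tesseract input.tiff output

Tesseract appends .txt to the output name itself, so this command writes the recognized text to output.txt.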
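To process a whole folder of scans, the command can be wrapped in a loop. This is a minimal sketch, not a tested workflow from this page: ocr_dir is a hypothetical helper name, and it assumes every source file ends in .tiff.

```shell
# ocr_dir is a hypothetical helper name, not part of Tesseract.
# Usage: ocr_dir [directory]   (defaults to the current directory)
ocr_dir() {
    dir="${1:-.}"
    for img in "$dir"/*.tiff; do
        # Skip the unexpanded pattern when the folder has no .tiff files.
        [ -e "$img" ] || continue
        # Pass the name without its extension as the output base;
        # Tesseract appends .txt, so scan.tiff yields scan.txt.
        tesseract "$img" "${img%.tiff}"
    done
}
```

Calling ocr_dir scans would then write one .txt file next to each .tiff file in the scans folder.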