Latest revision as of 15:28, 4 December 2015

OCR (optical character recognition)

tesseract

https://code.google.com/p/tesseract-ocr/

Tesseract is OCR software. It was initially developed by HP Labs between 1985 and 1995; its development is currently sponsored by Google.

It is free software, released under the Apache License.

install

Debian:

aptitude install tesseract-ocr

Mac:

Using Homebrew, run the following commands:

brew install leptonica --with-libtiff
brew install tesseract --all-languages

https://gist.github.com/henrik/1967035

Run

Preparing source files

Source files (images) should:

be in .png or .tiff format
use a bilevel (1-bit) color space: black (text) and white (background)
have a resolution of at least 300 dpi; otherwise the text recognition will be very sloppy
contain only one column of text
contain no images; replace them with a white square
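
The first requirement in the list above can be checked before running the OCR. A minimal sketch, assuming a POSIX shell; the helper name `is_supported_format` is made up for illustration, and it only looks at the file name, not at the image data, resolution or color space:

```shell
# Hypothetical helper: succeed only if the file name ends in .png or .tiff,
# the two formats named in the list above (matching is case-sensitive).
is_supported_format() {
  case "$1" in
    *.png|*.tiff) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_supported_format scan.jpg || echo "convert scan.jpg first"` prints the reminder, because .jpg is not one of the listed formats.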

command

tesseract input.tiff output

This writes the recognized text to the file output.txt; tesseract appends the .txt extension itself.
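
For a folder of scans, the same command can be wrapped in a loop. A minimal sketch, assuming a POSIX shell; the helper names `output_base` and `ocr_all` are made up for illustration and are not part of tesseract:

```shell
# Hypothetical helper: strip the last extension, e.g. page1.tiff -> page1
output_base() {
  printf '%s\n' "${1%.*}"
}

# Hypothetical helper: run tesseract on every file given as an argument.
# Only the base name is passed as output, since tesseract adds .txt itself.
ocr_all() {
  for f in "$@"; do
    tesseract "$f" "$(output_base "$f")"
  done
}
```

Called as `ocr_all *.tiff`, this would process each .tiff file in the current directory and write a text file with the same base name next to it.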

Languages

By default, tesseract is set up to recognize English. This behavior can be changed by installing the extra packages required for other languages and by passing the matching language setting.
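
As a sketch of what that setting can look like: tesseract's `-l` option selects the language by its code, for example `deu` for German, provided the language data is installed (on Debian this is typically a package such as tesseract-ocr-deu; treat the exact package name as an assumption for your system). `tesseract --list-langs` shows which languages are available. The helper name `ocr_lang` below is made up for illustration:

```shell
# Hypothetical helper: OCR one image in a given language (default: eng).
# Usage: ocr_lang input.tiff output [language-code]
ocr_lang() {
  tesseract "$1" "$2" -l "${3:-eng}"
}
```

For example, `ocr_lang input.tiff output deu` runs `tesseract input.tiff output -l deu` and writes output.txt.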

