Thursday, August 27, 2009

First Interactions with Tesseract OCR on Ubuntu Linux

I am on Ubuntu 9.04. My primary goal here is to informally evaluate how Tesseract performs on some documents, including handwritten samples. After downloading, building, and installing tesseract as described at http://code.google.com/p/tesseract-ocr/w/list , I ran tesseract over a jpg file containing the sample text.

$> tesseract text_image.jpg result
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
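
The error names the exact file tesseract failed to load, so it is easy to check whether a given tessdata directory contains it. The following is only a minimal sketch: check_tessdata is a hypothetical helper name of my own, not part of tesseract, and it assumes the language file is named <lang>.unicharset as in the error message.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): report whether the
# unicharset file for a language exists in a tessdata directory.
# Assumes the file is named <lang>.unicharset, as in the error above.
check_tessdata() {
    dir="$1"
    lang="$2"
    if [ -f "$dir/$lang.unicharset" ]; then
        echo "found $dir/$lang.unicharset"
        return 0
    fi
    echo "missing $dir/$lang.unicharset" >&2
    return 1
}
```

For the setup in this post it would be called as: check_tessdata /usr/local/share/tessdata eng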

I made the change described in a comment by caitifty dated Feb 28, 2009 (sigh, there is no permalink) at http://code.google.com/p/tesseract-ocr/wiki/ReadMe . The fix is to add (or replace) the tessdata directory at /usr/local/share/tessdata so that the language files named in the error can be found. After this, I get

$> tesseract text_image.jpg result
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:text_image.jpg
IMAGE::read_header:Error:Can't read this image type:text_image.jpg
tesseract:Error:Read of file failed:text_image.jpg
Segmentation fault

It appears that tesseract does not recognize jpg input. So I used ImageMagick's "convert" tool to convert the image to tif format, which tesseract does seem to recognize.

$> convert text_image.jpg text_image.tif

and then

$> tesseract text_image.tif result
Tesseract Open Source OCR Engine
Image has 8 * 3 bits per pixel, and size (1000,171)
Resolution=200

$> cat result.txt

The last command produced the recognized text. The results were quite good.
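
The convert-then-recognize steps above could be wrapped in a small script. This is a rough sketch rather than a tested tool: tif_name and ocr_jpg are hypothetical function names, and it assumes that "convert" and "tesseract" are on the PATH and that tesseract writes its output to <base>.txt as it did above. Unlike the commands above, it names the output after the image instead of using "result".

```shell
#!/bin/sh
# Sketch: convert an image to TIFF with ImageMagick, then OCR it.
# tif_name and ocr_jpg are hypothetical names, not part of either tool.

# Derive the .tif name from an input path: text_image.jpg -> text_image.tif
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert one image, run tesseract on the result, and print the text.
# Assumes tesseract writes its output to <base>.txt.
ocr_jpg() {
    src="$1"
    tif=$(tif_name "$src")
    base="${tif%.tif}"
    convert "$src" "$tif" || return 1
    tesseract "$tif" "$base" || return 1
    cat "$base.txt"
}
```

After sourcing the file, running ocr_jpg text_image.jpg would leave text_image.tif and text_image.txt next to the original image.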
