I am on Ubuntu 9.04. My primary goal here is to informally evaluate how Tesseract performs on some documents, including handwritten samples. After downloading, building and installing tesseract as described at http://code.google.com/p/tesseract-ocr/w/list , I ran tesseract over a jpg file containing the sample text.
$> tesseract text_image.jpg result
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
I made the change suggested in the comment by caitifty on Feb 28, 2009 (sigh, there is no permalink) at http://code.google.com/p/tesseract-ocr/wiki/ReadMe . The change is to replace or add the tessdata directory at /usr/local/share/tessdata . After this, I get
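Before re-running tesseract, it may help to confirm that the file named in the error message is now in place. The following is only a minimal sketch: the helper name has_unicharset and the DATA_DIR variable are my own inventions for illustration, and the default path is simply the one from the error message above.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): succeeds if the
# unicharset file for the given language exists in the given directory.
has_unicharset() {
    dir="$1"
    lang="$2"
    [ -f "$dir/$lang.unicharset" ]
}

# DATA_DIR is an assumed variable name; it defaults to the path
# that appeared in the error message.
DATA_DIR="${DATA_DIR:-/usr/local/share/tessdata}"

if has_unicharset "$DATA_DIR" eng; then
    echo "eng.unicharset found in $DATA_DIR"
else
    echo "eng.unicharset missing from $DATA_DIR"
fi
```

If the file is reported missing, the tessdata directory is probably not where this build expects it.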
$> tesseract text_image.jpg result
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:text_image.jpg
IMAGE::read_header:Error:Can't read this image type:text_image.jpg
tesseract:Error:Read of file failed:text_image.jpg
Segmentation fault
It appears that this build does not recognize jpg input, so I used ImageMagick's "convert" tool to convert the image to tif format, which tesseract does seem to recognize.
$> convert text_image.jpg text_image.tif
and then
$> tesseract text_image.tif result
Tesseract Open Source OCR Engine
Image has 8 * 3 bits per pixel, and size (1000,171)
Resolution=200
$> cat result.txt
The last command produced the recognized text. The results were quite good.
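The convert and tesseract steps above can be wrapped in a small script so that other samples are easier to try. This is a rough sketch rather than a polished tool: the function names tif_name and ocr_image are hypothetical, and it assumes the input file name has an extension and that both convert and tesseract are on the PATH.

```shell
#!/bin/sh
# Sketch: convert an image to tif, run tesseract on it, print the text.
# tif_name and ocr_image are hypothetical helper names.

# Replace the last extension with .tif, e.g. text_image.jpg -> text_image.tif
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

ocr_image() {
    src="$1"
    base="${src%.*}"            # output base name; tesseract appends .txt
    tif="$(tif_name "$src")"
    convert "$src" "$tif" || return 1
    tesseract "$tif" "$base" || return 1
    cat "$base.txt"
}

# Only run when an image is given on the command line.
if [ "$#" -gt 0 ]; then
    ocr_image "$1"
fi
```

Saved as, say, ocr.sh, it would be invoked as "sh ocr.sh text_image.jpg" and would leave text_image.tif and text_image.txt next to the original image.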