Implementation of a Web Based Text Extraction Tool using Open Source Object Models
Date
2017-08
Publisher
INFLIBNET Centre
Abstract
The library in our institute is the repository of all reports and design documents, and it preserves
reports from the past three decades. Present-day scanned reports are all searchable, but scanned
reports that are two decades old are not. These older reports are very important and are constantly
referred to by scientists and engineers for fast breeder design work. It was therefore decided that
all these documents would be scanned and added to the collection. During scanning, it was found
that a skew was being introduced into the documents; the OCR tools did not de-skew the pages
directly and required a pre-processing step to be applied to each document first. An optical
character recognition (OCR) tool is a solution for extracting text from images and scanned
documents. It was therefore decided to develop an OCR tool using open source libraries and to add
the necessary de-skewing methods to make it useful for our task.
This paper discusses what an OCR is, the steps necessary to extract text from image files using an
OCR, the development and evaluation of the OCR tool, and the de-skewing method implemented in
the tool.
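
The abstract does not say which de-skewing method the tool uses. As an illustration only, the sketch below shows one common approach, a projection-profile search: for each candidate angle, the ink pixels are projected onto rows, and the angle that concentrates them into the fewest rows is taken as the skew. The function names, the angle range, the step size and the synthetic test page are all assumptions made for this example and are not taken from the paper.

```python
import math


def projection_score(points, angle_deg):
    """Score a candidate angle: sum of squared row counts after rotation.

    `points` is an iterable of (x, y) ink-pixel coordinates. A sharper row
    profile (ink concentrated in few rows) gives a higher score.
    """
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    rows = {}
    for x, y in points:
        # Vertical coordinate of the pixel after undoing a tilt of theta.
        row = round(y * cos_t - x * sin_t)
        rows[row] = rows.get(row, 0) + 1
    return sum(n * n for n in rows.values())


def estimate_skew(points, max_angle=10.0, step=0.5):
    """Return the candidate angle in degrees with the sharpest row profile.

    Assumption: the skew lies within +/- max_angle degrees.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        score = projection_score(points, angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_page(skew_deg, baselines=(20, 60, 100), width=200):
    """Hypothetical test data: ink pixels on text baselines tilted by skew_deg."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]
```

In a pipeline of the kind the abstract describes, the estimated angle would be used to rotate the scanned page back to horizontal before it is passed to the OCR engine. A real implementation would extract the ink pixels from a binarized scan rather than from synthetic baselines.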
Keywords
De-skewing, Image Files, OCR