C# tips and tricks 21 - Extracting text from an image using Tesseract OCR library for C# (CSharp)
Author: Ankpro Training
Uploaded: 2018-05-21
Views: 59026
How do you extract text from images such as JPG, PNG, or BMP files?
Like our Facebook page / ankprotraining
Code is available below...
What is OCR?
OCR (Optical character recognition) is the recognition of printed or written text characters by a computer. This involves photo scanning of the text character-by-character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII, commonly used in data processing.
Optical Character Recognition, or OCR, is a technology that enables you to convert different types of documents, such as scanned paper documents, PDF files or images captured by a digital camera into editable and searchable data.
What is Tesseract?
Tesseract is an optical character recognition engine for various operating systems.
Tesseract is an OCR engine with support for Unicode and the ability to recognize more than 100 languages out of the box. It can be trained to recognize other languages.
Tesseract Data Files
The sets of trained data files that Tesseract uses are called Tesseract data files.
1. What is OCR?
2. What is Tesseract?
3. What is a Tesseract data file?
Steps for coding
1. Add the Tesseract NuGet package, version 3.0.2.0, to the project.
2. Download the data files for version 3.0.2.0 from the official Tesseract OCR GitHub project:
https://github.com/tesseract-ocr/tess...
3. Search the page for "Data Files for Version 3.02".
4. Find the entry for English and click the link next to it.
5. Unzip the downloaded file with an archive utility such as 7-Zip.
6. Copy the tessdata folder into the project and select all the files inside it.
7. Right-click the selected files and select Properties.
8. Set the "Copy to Output Directory" option to "Copy always".
9. Create an image containing some text with a utility such as Paint, save the file, and copy it into the project.
10. Right-click the image file, select Properties, and set "Copy to Output Directory" to "Copy always".
11. Add a reference to System.Drawing to the project from References.
Using the AForge library can give better accuracy.
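The description does not say which AForge filters are meant. One common preprocessing step before OCR is binarization, which turns a grayscale image into pure black and white. The sketch below is not AForge code. It is a plain C# illustration of the idea on an array of grayscale values, and the names Preprocess and Binarize, as well as the fixed threshold, are assumptions made for this example.

```csharp
using System;

public static class Preprocess
{
    // Hypothetical helper: maps each grayscale value (0-255) to pure
    // black (0) or pure white (255) by comparing it with a fixed threshold.
    public static byte[] Binarize(byte[] gray, byte threshold)
    {
        if (gray == null) throw new ArgumentNullException("gray");

        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
        {
            // Values at or above the threshold become white, the rest black.
            result[i] = gray[i] >= threshold ? (byte)255 : (byte)0;
        }
        return result;
    }
}
```

For example, Preprocess.Binarize(new byte[] { 10, 200, 128 }, 128) yields the values 0, 255, 255. A real pipeline would apply the same idea to the pixels of the image before passing it to Tesseract, and a library filter would normally choose the threshold for you.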
Code:
using System;
using System.Drawing;
using Tesseract;

namespace OCRTesseractDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // Bitmap, TesseractEngine, and Page all hold unmanaged resources,
            // so each one is wrapped in a using statement and disposed promptly.
            using (Bitmap img = new Bitmap("Test1.jpg"))
            // "./tessdata" is the folder copied to the output directory; "eng" selects English.
            using (TesseractEngine engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default))
            // PageSegMode.Auto lets Tesseract work out the page layout on its own.
            using (Page page = engine.Process(img, PageSegMode.Auto))
            {
                string result = page.GetText();
                Console.WriteLine(result);
            }
        }
    }
}
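A common failure with this setup is a tessdata folder or image that was not copied to the output directory. The variation below is a minimal sketch, not the video's code. It checks for both before creating the engine, loads the image with Pix.LoadFromFile instead of System.Drawing, and prints the mean confidence that Tesseract reports. The class name CheckedOcr and the helper DescribeMissingInput are hypothetical names introduced for this example, and it assumes the installed version of the Tesseract package exposes Pix.LoadFromFile and Page.GetMeanConfidence.

```csharp
using System;
using System.IO;
using Tesseract;

namespace OCRTesseractDemo
{
    public static class CheckedOcr
    {
        // Hypothetical helper: returns a message describing the first
        // missing input, or null when both the folder and the image exist.
        public static string DescribeMissingInput(string dataPath, string imagePath)
        {
            if (!Directory.Exists(dataPath)) return "tessdata folder not found: " + dataPath;
            if (!File.Exists(imagePath)) return "image not found: " + imagePath;
            return null;
        }

        public static void Main(string[] args)
        {
            const string dataPath = "./tessdata";
            const string imagePath = "Test1.jpg";

            // Stop early with a readable message instead of an engine exception.
            string problem = DescribeMissingInput(dataPath, imagePath);
            if (problem != null)
            {
                Console.Error.WriteLine(problem);
                return;
            }

            using (var engine = new TesseractEngine(dataPath, "eng", EngineMode.Default))
            using (var img = Pix.LoadFromFile(imagePath))
            using (var page = engine.Process(img))
            {
                Console.WriteLine(page.GetText());
                // GetMeanConfidence returns a value between 0 and 1.
                Console.WriteLine("Mean confidence: {0:P0}", page.GetMeanConfidence());
            }
        }
    }
}
```

A low mean confidence is a hint that the image may need cleaning up before recognition.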