OCR, Language Translator, and NLP

Sep 23

Introduction

Humans can understand the contents of an image simply by looking at it, but computers don’t work the same way.

This is a problem a Data Scientist or Machine Learning Engineer can face while working on an NLP project.

When collecting data, you may find that it comes in different formats and in languages from around the world. In that case, you have several problems to solve before you can start analyzing the dataset. The first is to capture the text from an image, the second is to identify the language of that text, and the third is to translate it into the language of your choice.

Challenges

So how can we solve these three problems through an automated workflow?

First problem: Image Processing

Optical Character Recognition (OCR) is a technology that analyzes an image of a page and converts the characters it finds into machine-readable text that can then be processed. OCR results depend on the quality of the input: cleanly segmented text on a noise-free background gives better results. In the real world this is not always possible, so we need to apply pre-processing techniques to help OCR give better results. In this demo we will be using Tesseract OCR and EasyOCR.
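As a rough illustration of the pre-processing idea, here is a minimal sketch that converts an image to grayscale, binarizes it with Otsu's threshold, and passes the result to Tesseract through the pytesseract wrapper. The choice of Otsu binarization and the helper names (otsu_threshold, ocr_image) are assumptions made for this example; the demo itself may use different pre-processing steps.

```python
def otsu_threshold(histogram):
    """Return the Otsu threshold for a 256-bin grayscale histogram."""
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate level
    sum_bg = 0     # sum of the intensities of those pixels
    for level, count in enumerate(histogram):
        weight_bg += count
        sum_bg += level * count
        weight_fg = total - weight_bg
        if weight_bg == 0 or weight_fg == 0:
            continue  # one class is empty, so there is no split to score
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def ocr_image(path, lang="eng"):
    """Hypothetical helper: grayscale, binarize, then OCR an image file."""
    # Third-party imports stay local so otsu_threshold has no dependencies.
    from PIL import Image
    import pytesseract

    gray = Image.open(path).convert("L")
    threshold = otsu_threshold(gray.histogram())
    # Pixels brighter than the threshold become white, the rest black.
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    return pytesseract.image_to_string(binary, lang=lang).strip()
```

Calling ocr_image on a scanned page requires Pillow, pytesseract, and a local Tesseract installation with the matching language data. EasyOCR could be used in place of the last step, for example with easyocr.Reader(["en"]).readtext(path, detail=0), which returns the recognized strings as a list.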

Second Problem: Language Detection

The second problem is detecting the language of a given piece of text. In this demo, we use a couple of Python packages called langdetect and googletrans. Running more than one language detection tool on the text from the previous step lets us compare their results and identify the original language with more confidence.

NOTE: The language detection algorithm in langdetect is non-deterministic; for text that is either too short or too ambiguous, results may vary from run to run.
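The comparison between detectors could look something like the sketch below. The function names and the simple majority vote are assumptions for illustration, not the demo's actual code. The googletrans wrapper assumes a release where Translator.detect is synchronous; in some newer releases the call is a coroutine and would need to be awaited.

```python
from collections import Counter


def detect_with_langdetect(text):
    """Detect the language code with langdetect."""
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # fix the seed so repeated runs agree
    return detect(text)


def detect_with_googletrans(text):
    """Detect the language code with googletrans (synchronous API assumed)."""
    from googletrans import Translator

    return Translator().detect(text).lang


def consensus_language(text, detectors):
    """Return (language, unanimous) from a list of detector callables."""
    # Lowercase the codes so that e.g. "zh-CN" and "zh-cn" count as one vote.
    votes = [detector(text).lower() for detector in detectors]
    language, count = Counter(votes).most_common(1)[0]
    return language, count == len(votes)
```

A call such as consensus_language(text, [detect_with_langdetect, detect_with_googletrans]) returns the most common language code along with a flag saying whether all detectors agreed. When the flag is False, the text is probably too short or ambiguous and may deserve a manual check.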

Third Problem: Language Translation

The third problem is translating text from one language into the language of your choice. For this, googletrans can be used again.
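A minimal sketch of the translation step follows. The translate_text helper is a hypothetical name introduced here, and the code assumes a googletrans release where Translator.translate is synchronous and returns an object with a text attribute; in some newer releases the call must be awaited instead.

```python
def translate_text(text, dest="en", translator=None):
    """Translate text into the dest language, returning the translated string."""
    if not text.strip():
        return ""  # nothing to translate, so skip the network call
    if translator is None:
        # Imported lazily so a different translator can be passed in instead.
        from googletrans import Translator

        translator = Translator()
    # googletrans auto-detects the source language when src is not given.
    result = translator.translate(text, dest=dest)
    return result.text
```

For example, translate_text("Bonjour le monde", dest="en") would send the text to Google Translate and return the English version. Because googletrans relies on an unofficial web endpoint, calls can fail or be rate-limited, so wrapping them in retry logic is worth considering for larger datasets.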

Demo Video

(In this video, we use a Python package called spaCy to perform a simple Named Entity Recognition (NER) process.)

Ascolta OCR Demo
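For readers who want to try the NER step on translated text, a short sketch with spaCy might look like this. The extract_entities helper and the en_core_web_sm model are assumptions for the example; the video may use a different model or pipeline.

```python
def extract_entities(text, nlp=None):
    """Return (entity text, entity label) pairs found in text."""
    if nlp is None:
        # Requires the model to be installed first:
        #   python -m spacy download en_core_web_sm
        import spacy

        nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]
```

Loading the model is relatively slow, so when processing many documents it makes sense to load it once and pass it in through the nlp argument.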

Conclusion

You have seen a demo that addresses the challenges of working with image data in different languages and translating it into a single language of your choice for NLP processing.

Is it possible to automatically detect text in images in various languages and run NLP on it?

With the right tool(s), the answer is YES.

Technologies Used / Citations