Lesson 1, Chapter 1

Natural Language: Introduction

Yann KIDSHAKER, 17 March 2025

Have you ever wondered how personal assistants like Alexa and Siri work, what rules govern them, and how to build one on your own? If your answer is yes, then great! Let us learn about Natural Language Processing.

Source: analyticsinsight.net

What is NLP (Natural Language Processing)?

  • NLP is the ability of a computer program to understand human language as it is spoken and written, referred to as natural language (e.g. English, French).
  • NLP is an interdisciplinary field spanning Computer Science and Linguistics. Its ultimate goal is to make computers understand and generate human languages as we do, and to do so across all human languages, whereas a person typically knows only two to four.
    Source: clevertap.com

     

  • The field of NLP is commonly divided into three parts:
    1. Speech Recognition: the translation of spoken language into text.
    2. Natural Language Understanding: the computer’s ability to understand what we say.
    3. Natural Language Generation: the generation of natural language by a computer.

How does NLP work?

There are two main phases to Natural Language Processing:

  1. Data Preprocessing
  2. Algorithm Development

Data Preprocessing in NLP

Data preprocessing involves cleaning the text data and preparing it so that a machine learning algorithm can analyze it efficiently. Preprocessing transforms the data so that the model/algorithm can detect features in the text more easily and accurately, which leads to better analysis and, in turn, better performance.

The following basic steps are applied to textual data before it is fed to the NLP model/algorithm:

  1. Tokenization: Breaking text down into smaller units to work with, e.g. splitting it into individual words.
  2. Stop Word Removal: Removing common words that add little meaning to the context, e.g. is, and, of, a, an.
  3. Lemmatization: Reducing the variant (inflected) forms of a word to a single base form, the lemma. E.g. run, running and ran all map to run. Words that are merely similar in meaning, such as dart or scurry, are synonyms rather than forms of the same word, so lemmatization leaves them distinct.
  4. Part-of-speech tagging: Tagging each word with the part of speech it belongs to, such as noun, verb, adverb or adjective.
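
The four steps above can be sketched in a few lines of Python. This is only a toy illustration using the standard library: the stop-word set, the lemma table and the part-of-speech table are tiny hypothetical lookups invented for this example, whereas real projects rely on much larger resources or trained models. The steps run in the order listed above, although many toolkits tag parts of speech before lemmatizing.

```python
import re

# Tiny illustrative resources (assumptions for this sketch, not real lexicons).
STOP_WORDS = {"is", "and", "of", "a", "an", "the", "in", "to"}
LEMMAS = {"running": "run", "ran": "run", "runs": "run", "dogs": "dog", "parks": "park"}
POS_TAGS = {"dog": "NOUN", "park": "NOUN", "run": "VERB", "quickly": "ADV", "big": "ADJ"}


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to its base form when the lookup table knows it."""
    return [LEMMAS.get(t, t) for t in tokens]


def pos_tag(tokens):
    """Attach a part-of-speech tag, or "UNK" for words missing from the lookup."""
    return [(t, POS_TAGS.get(t, "UNK")) for t in tokens]


def preprocess(text):
    """Run the four steps in the order listed in the text."""
    return pos_tag(lemmatize(remove_stop_words(tokenize(text))))


print(preprocess("The big dogs ran quickly in the park"))
# [('big', 'ADJ'), ('dog', 'NOUN'), ('run', 'VERB'), ('quickly', 'ADV'), ('park', 'NOUN')]
```

Note how "dogs" and "ran" are reduced to "dog" and "run", while "the" and "in" disappear entirely.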

Once the data has been preprocessed, an algorithm is developed to model it.

Algorithm Development in NLP

Advances in NLP algorithms can be divided into the following families:

  • Rule-based systems
    • As the name suggests, this approach relies on hand-crafting many domain-specific rules about the language.
    • It can automate simple tasks such as extracting structured data (e.g. dates, names) from unstructured data (e.g. webpages, emails).
    • However, because human languages are complex, rule-based systems are weak at picking up the context of a sentence, tend to be less accurate on open-ended text, and do not generalize well across domains.
  • Classical Machine Learning
    • In this approach, machine learning models learn the rules of the language by themselves rather than having us define the rules for them.
    • The models learn from features of the text, which are designed through feature engineering, e.g. bag of words or parts of speech. A machine learning model such as Naive Bayes can be trained on such features.
    • These models find patterns in the language using training data and then make predictions on unseen data.
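
To make the classical approach concrete, here is a minimal sketch of a Naive Bayes classifier over bag-of-words features, written from scratch with the standard library. The training sentences and the "positive"/"negative" labels are hypothetical data invented for this illustration, and add-one (Laplace) smoothing is an assumption of this sketch; a real project would more likely use a library implementation and far more data.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Bag-of-words features: word counts, ignoring word order."""
    return Counter(text.lower().split())


def train(examples):
    """Count words per label and how often each label occurs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        word_counts[label].update(bag_of_words(text))
        label_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability under Naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word, n in bag_of_words(text).items():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += n * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data, invented for this illustration.
TRAINING_DATA = [
    ("great movie loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("terrible movie hated it", "negative"),
    ("boring story awful acting", "negative"),
]

model = train(TRAINING_DATA)
print(predict(model, "great story"))         # positive
print(predict(model, "awful boring movie"))  # negative
```

Nobody wrote a rule saying that "great" signals a positive review; the model inferred it from the word counts in the training data, and it applies that pattern to sentences it has never seen.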

  • Deep Learning Models
    • Deep learning models are currently the most popular way to perform NLP analysis and research.
    • On many NLP tasks, these models achieve the highest accuracy among machine learning models, typically given enough training data.
    • Unlike classical machine learning models, deep learning models do not need hand-crafted features or feature engineering, because they learn useful features automatically from the training data.
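
Returning to the first family in the list above, the sketch below shows what a rule-based system looks like in practice: hand-written patterns that pull structured data (here dates and e-mail addresses) out of free text. The two regular expressions, the sample message and the address `alice@example.com` are assumptions made for this illustration, not rules taken from any real system.

```python
import re

# Hand-written rules: each pattern encodes one assumption about how the data looks.
DATE_RULE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # e.g. 17/03/2025
EMAIL_RULE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract(text):
    """Return the dates and e-mail addresses matched by the rules above."""
    return {
        "dates": [m.group(0) for m in DATE_RULE.finditer(text)],
        "emails": EMAIL_RULE.findall(text),
    }


message = "Meeting moved to 17/03/2025, contact alice@example.com for details."
print(extract(message))
# {'dates': ['17/03/2025'], 'emails': ['alice@example.com']}

# The same date written differently slips through, since no rule covers it.
print(extract("Meeting moved to 17 March 2025."))
# {'dates': [], 'emails': []}
```

The second call illustrates the weakness noted above: every new way of writing a date needs another hand-written rule, which is why such systems struggle to generalize.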