How can we evaluate the performance of different classification algorithms in NLP (Natural Language Processing), and how do we choose one?

Problem statement

We have a limited batch of 1000 written restaurant reviews. Each review is flagged as positive or negative based on its rating (3 or more stars is considered a positive review).

What we want to do is determine the best classifier for our NLP model, one that predicts whether a new review is positive or negative.


This exercise makes use of the open-source scikit-learn library, which offers Machine Learning capabilities in Python.
Steps for this experiment include:

  1. Importing the dataset with the reviews
  2. Cleaning the dataset
  • Clear special characters from the reviews
  • Lowercase all words
  • Split each review into a list of words
  • Remove all stop words (two cases)
    • including "no" and "not"
    • excluding "no" and "not"
  • Stem all words, keeping only the root of each word to avoid too much sparsity
  3. Creating the Bag of Words model
  • We create a huge sparse matrix through tokenization, where the columns are the words and the rows are the reviews
  4. Selecting the classification algorithm, choosing from:
  • Linear algorithms
    • Logistic Regression
    • SVM (linear)
  • Non-linear algorithms
    • Kernel SVM
    • K-NN
    • Naïve Bayes
    • Decision Tree
    • Random Forest
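The cleaning and Bag of Words steps above can be sketched as follows. The three reviews and the tiny stop-word list here are illustrative stand-ins: the actual setup uses the full 1000-review dataset, a complete stop-word list (e.g. NLTK's, optionally keeping "no" and "not"), and stemming, which is omitted in this sketch.

```python
# Minimal sketch of cleaning + Bag of Words with scikit-learn.
import re
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "Wow... Loved this place!",
    "Crust is not good.",
    "The food was awful, would not go back.",
]

# A tiny illustrative stop-word list; the real experiment uses a full one.
stop_words = {"is", "was", "the", "this", "would"}

def clean(review):
    review = re.sub(r"[^a-zA-Z]", " ", review)   # clear special characters
    words = review.lower().split()                # lowercase + tokenize
    return " ".join(w for w in words if w not in stop_words)

corpus = [clean(r) for r in reviews]

# Bag of Words: rows are reviews, columns are distinct words
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
print(X.shape)  # (3, number_of_distinct_words)
```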


  5. We compute and interpret the Confusion Matrix for each scenario
  6. We evaluate the performance by looking at the following metrics:
  • Accuracy score
  • Precision score (measures exactness)
  • Recall score (measures completeness)
  • F1 score (compromise between exactness and completeness)
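A sketch of how these scores can be computed with scikit-learn for one classifier; the `y_test` and `y_pred` vectors below are toy values standing in for real test labels and predictions (1 = positive review, 0 = negative).

```python
# Confusion matrix and the four evaluation metrics for one classifier.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # true labels (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # classifier output (toy data)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, cols: predicted
print(cm)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))  # exactness
print("Recall   :", recall_score(y_test, y_pred))     # completeness
print("F1       :", f1_score(y_test, y_pred))         # compromise of both
```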


Recall score

Recall (marked in red) is very important since it addresses type II errors (False Negatives).
It ranges from 0 to 1; the higher the value, the better.
Why are these type II errors important?

  • In a legal sense, we find it preferable to have False Negatives:

It is better that ten guilty persons escape than that one innocent suffer
Read more: False Positives and False Negatives

  • In scientific systems, it's just the opposite. With earthquake prediction, for example, we cannot allow the detection of a possible earthquake or other natural disaster to be a False Negative and fail to raise the alarm, with disastrous consequences.


Precision score

Precision (marked in blue) is a metric that addresses type I errors, commonly known as False Positives or False Alarms.
It ranges from 0 to 1; the higher the value, the better.

F1 score

The compromise between Precision and Recall (their harmonic mean) is a good metric for comparing model performance.
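Because the F1 score is the harmonic mean of Precision and Recall, it is pulled toward the weaker of the two, which is why it works well as a single comparison metric. A minimal sketch with illustrative values:

```python
# F1 as the harmonic mean of precision and recall (illustrative values).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.6))  # ≈ 0.72, well below the arithmetic mean of 0.75
print(f1(0.8, 0.8))  # equals 0.8 when both scores agree
```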


(Figure: confusion matrix evaluation data for each classifier)

(Figure: graph of the evaluation scores per classifier)

Interpreting results and Conclusions

  • In both cases the Naive Bayes classifier performs exactly the same, with good overall prediction and almost the best F1 score.
    • easier to train;
    • the model is smaller compared with Random Forest;
    • fragile to overfitting;
  • The SVM (linear) classifier performs very well in the second scenario, but this could be due to:
    • the small dataset;
    • the reviews being written in an expressive, simple and concise manner;
    • it could suffer from overfitting, and once applied to a larger text corpus its performance may drop significantly;
  • Decision Trees and Random Forest perform average to good
    • Question: how will performance be impacted on a bigger text corpus where language is used loosely, or where negations are themselves negated?
    • Random Forests are slower to train;
    • robust against overfitting;
  • Logistic Regression is a linear classifier and thus has little applicability in this case; it was used purely for comparison. Due to language complexity, linear classifiers will perform worse on large and complex bodies of text.

Future thoughts

This article barely scratches the surface of Machine Learning and NLP algorithms, and offers no "ready to use" solution for any given purpose. Its scope is purely academic and general enough to offer a high-level perspective.

What else? What's next?
  • Using a much larger and richer dataset to increase the Accuracy score
  • Tweaking the hyperparameters of the different classification models to produce better predictions
  • Evaluating model performance using k-Fold Cross Validation
  • Optimizing the models to avoid overfitting the data
  • Using different scenarios and evaluating performance when keeping words like ['against', 'below', 'down', 'out', 'off', 'no', 'nor', 'not', 'don', 'ain', 'aren',
    'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'], which are relevant in Sentiment Analysis
  • many more...
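The k-Fold Cross Validation idea mentioned above can be sketched with scikit-learn's `cross_val_score`. The toy corpus and the choice of Multinomial Naive Bayes here are illustrative stand-ins for the article's 1000-review dataset and candidate classifiers:

```python
# Sketch of 10-fold cross validation on a Bag of Words model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy corpus repeated to give each fold enough samples (illustrative only).
corpus = ["loved the food", "great place", "tasty and cheap",
          "not good", "awful service", "never going back"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5  # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(corpus)

# Train/evaluate on 10 different train/test splits instead of a single one.
scores = cross_val_score(MultinomialNB(), X, labels, cv=10)
print("mean accuracy:", scores.mean(), "+/-", scores.std())
```

Averaging the score over 10 splits gives a far less split-dependent estimate of model performance than the single train/test split used in the experiment above.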