Exam : Fake News Detector - Case Study

Simon Weiss

29.03.2021

TBS MSc Artificial Intelligence and Business Analytics


1. Introduction

1.1 Context

Fake news existed long before the advent of the Internet.
It is widely understood to consist of fictitious articles deliberately fabricated to deceive readers. Social media and news outlets publish fake news to increase readership or as part of psychological warfare.

Fake news is currently a hot topic, and many wonder what data scientists can do to detect it and stymie its viral spread.

1.2 Objective

The objective of this notebook and its attached report is to use machine learning to build a fake news detection algorithm that predicts whether a given news article is true or fake, and thereby to study what data science can contribute to fake news detection.
Based on our results, we will develop initial recommendations for any company or organisation trying to counteract fake news.

1.3 Datasets

Our dataset consists of two parts: a training set and a testing set.
We will use the training set to build models that predict which articles in the test dataset are fake.

Each dataset contains five variables:

1.4 Results

Best models according to accuracy:

1) Support Vector Machines: 0.971000
2) Stochastic Gradient Descent: 0.969000
3) Perceptron: 0.962667

2. Exploring the data

2.1 Loading libraries and dataset
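A minimal sketch of this step, assuming pandas and the file names train.csv and test.csv (both are assumptions, not confirmed by the notebook):

```python
import pandas as pd

# Load the two parts of the dataset (file names assumed)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print(train.shape, test.shape)
```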

2.2 Analyze by describing data

Which features are available in the dataset?

Our dataset is imported correctly. Let us check the data information.

We have null values. We will check later whether these empty texts are a problem for our model. For now, let us keep them as NaN.
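A minimal sketch of this check, assuming pandas DataFrames named train and test as above:

```python
# Inspect dtypes and count missing values per column
train.info()
print(train.isnull().sum())
# We leave the missing author/title/text entries as NaN for now
```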

2.3 Target analysis: Check for class imbalance

1 stands for unreliable and 0 for reliable.

From this plot, we can observe that we do not have a class imbalance. There are slightly more unreliable articles, which is not a problem.
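A sketch of this plot, assuming the target column is named label and matplotlib is available:

```python
import matplotlib.pyplot as plt

# Bar chart of class frequencies: 1 = unreliable, 0 = reliable
train["label"].value_counts().plot(kind="bar")
plt.title("Class distribution of the training set")
plt.xlabel("label")
plt.ylabel("count")
plt.show()
```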

2.4 Check for text content

Let us check the content of our three text columns: the author, the title, and the text of the article.
We will use the training dataset for this check, since the test dataset has the same structure.

Articles are quite long, so we print only the first two.

Articles are long and content-rich. We will have to process them before feeding them to a machine learning model.

One author may have written several articles. Let us count articles per author in both the training and test datasets.

We can observe that some authors are much more present than others. Perhaps some authors are particularly prone to writing fake news. We also observe that some of the top authors appear in both the training and test datasets.
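A sketch of this count, assuming an author column in both DataFrames:

```python
# Most prolific authors in each dataset, and the number of distinct authors
print(train["author"].value_counts().head(10))
print(test["author"].value_counts().head(10))
print(train["author"].nunique(), test["author"].nunique())
```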

Let us check the same thing for the title

Some titles appear several times; perhaps some articles are duplicated.
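A sketch of this duplicate check, assuming a title column:

```python
# Titles that occur more than once may point to duplicated articles
title_counts = train["title"].value_counts()
print(title_counts[title_counts > 1].head(10))
```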

2.5 Merge text content

We will merge all the text content we have just analysed into a single column, in both the training and testing datasets.

We convert the merged article text into strings.
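A minimal sketch of the merge, assuming the columns are named author, title and text, and calling the merged column article (an assumed name):

```python
# Concatenate author, title and text into one "article" column,
# replacing missing values with empty strings and forcing str type
for df in (train, test):
    df["article"] = (
        df["author"].fillna("") + " "
        + df["title"].fillna("") + " "
        + df["text"].fillna("")
    ).astype(str)
```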

3. Preprocessing of the articles

We will clean the articles by removing punctuation and special characters, tokenizing them, and removing stopwords, so that we can use them in machine learning models.
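A minimal sketch of this pipeline, assuming NLTK stopwords and a simple regex-based cleaner (the exact cleaning steps in the original notebook may differ):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # drop punctuation and special characters
    tokens = text.lower().split()              # simple whitespace tokenisation
    tokens = [t for t in tokens if t not in stop_words]
    return " ".join(tokens)

train["article"] = train["article"].apply(clean_text)
```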

We apply the same steps to our testing dataset.

4. Article Post-processing EDA

Now that we have cleaned our data and converted it into the right format, we can do some visualisation.

First, let us plot word clouds to grasp the most frequent terms in our articles.

Visualizing the true news with a word cloud.
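A sketch of this word cloud, assuming the wordcloud package and that label 0 marks reliable news:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word cloud of the reliable articles (label == 0)
true_text = " ".join(train.loc[train["label"] == 0, "article"])
wc = WordCloud(width=800, height=400, background_color="white").generate(true_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```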

Let us compare with our testing dataset

Now we will plot the Pareto chart to better visualize the frequencies of the words.
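A sketch of such a Pareto chart, combining a frequency bar chart with a cumulative-share line (the top-20 cutoff is an assumption):

```python
from collections import Counter
import matplotlib.pyplot as plt

# Pareto chart: the 20 most frequent words with their cumulative share
counts = Counter(" ".join(train["article"]).split()).most_common(20)
words, freqs = zip(*counts)
total = sum(freqs)
cumulative = [sum(freqs[: i + 1]) / total for i in range(len(freqs))]

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(words, freqs)
ax.tick_params(axis="x", rotation=45)
ax.twinx().plot(words, cumulative, color="red", marker="o")  # cumulative share
plt.show()
```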

5. Modeling and classifying the test dataset

Now, let us build our machine learning model.
We will create the bag-of-words (BOW) and TF-IDF representations in order to use these models.
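A minimal sketch with scikit-learn, using CountVectorizer for the BOW and TfidfTransformer on top of it:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Bag-of-words counts, then TF-IDF weighting on top of them
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(train["article"])
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
y = train["label"]
```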

Now we will train several machine learning models and compare their results (see the sketch after the list below). Note that because the dataset does not provide labels for its test set, we use predictions on the training set to compare the algorithms with each other.

  1. Random Forest
  2. Decision Tree
  3. KNN
  4. Logistic Regression
  5. Naive Bayes
  6. Stochastic Gradient Descent
  7. Support Vector Machines
  8. Perceptron
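A sketch of this comparison with scikit-learn defaults; LinearSVC stands in for the SVM here, which is an assumption about the original setup:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

models = {
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Stochastic Gradient Descent": SGDClassifier(),
    "Support Vector Machines": LinearSVC(),
    "Perceptron": Perceptron(),
}

for name, model in models.items():
    model.fit(X_tfidf, y)
    acc = accuracy_score(y, model.predict(X_tfidf))  # training-set accuracy
    print(f"{name}: {acc:.4f}")
```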

Our best model is the Support Vector Machine, with an accuracy score of 0.971000.
Let us plot our confusion matrix.
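A sketch of the confusion matrix, reusing the fitted model from the comparison sketch above:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

svm = models["Support Vector Machines"]
cm = confusion_matrix(y, svm.predict(X_tfidf))
ConfusionMatrixDisplay(cm, display_labels=["reliable (0)", "unreliable (1)"]).plot()
plt.show()
```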

Support Vector Machine is a supervised ML algorithm that can be used for both classification and regression. In this algorithm, we plot each data item as a point in n-dimensional space (n = 2 in the illustration below), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best differentiates the two classes.
For SVM, the optimal hyperplane is the one that maximizes the margin to both classes; in other words, the hyperplane whose distance to the nearest element of each class is the largest.
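In standard notation (a hard-margin sketch, with labels recoded as $y_i \in \{-1, +1\}$), the separating hyperplane is $w^\top x + b = 0$, the margin equals $2/\lVert w \rVert$, and the SVM solves:

$$\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \forall i$$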

[Figure: an SVM separating hyperplane maximizing the margin between two classes]

Predict on test data

Now that we have our best model, let us predict on our test dataset.

First, we train our SVM model on the full training dataset.

We apply the same BOW and TF-IDF transformations, fitted on the training data, to the test data.
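A sketch of this final step, reusing the vectorizers fitted earlier; note that we only transform (never re-fit) on the test data:

```python
# Refit the SVM on the full training set, then transform the test articles
# with the vectorizers fitted on the training data
final_svm = LinearSVC()
final_svm.fit(X_tfidf, y)

X_test_tfidf = tfidf.transform(count_vec.transform(test["article"]))
predictions = final_svm.predict(X_test_tfidf)
```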

Export predictions to CSV

We export our classifications in CSV format.
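A sketch of the export; the id column and the file name submission.csv are assumptions:

```python
import pandas as pd

# Write the predictions to disk (column and file names assumed)
pd.DataFrame({"id": test["id"], "label": predictions}).to_csv(
    "submission.csv", index=False
)
```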

6. Interpretation of results

Let us plot our results and compare the term frequencies by label in our training and testing datasets.

Let us plot the frequency of labels first

1 stands for unreliable and 0 for reliable. We have a fairly balanced classification.

Visualizing the true news with a word cloud.

Let us compare with our training dataset

Now we will plot the Pareto chart to better visualize the frequencies of the words.

From these visualisations, we can conclude that a number of keywords are particularly relevant for fake news. We can indeed observe that these words are largely the same in our training and testing datasets.

7. Recommendations

From our notebook, we can conclude and recommend the following: