Simon Weiss
29.03.2021
TBS MSc Artificial Intelligence and Business Analytics
Fake news existed before the advent of the Internet.
It is widely accepted to refer to fictitious articles deliberately fabricated to deceive readers. Social media accounts and news outlets publish fake news to increase readership or as part of psychological warfare.
Fake news is one of the hottest topics in the media, and many are wondering what data scientists can do to detect it and stymie its viral spread.
The objective of this notebook and its attached report is to use machine learning to build a fake news detection algorithm that predicts whether a news article is true or fake, and thereby to study what data science can contribute to fake news detection.
Based on our results, we will formulate initial recommendations for any company or organisation trying to counteract fake news.
Our dataset consists of two parts: a training set and a testing set.
We will use our training set to build our models to predict on the test dataset which articles are considered fake or not.
The training set contains five variables (id, title, author, text, and label); the test set contains the same variables without the label.
Best models according to accuracy:
1) 0.971000 Support Vector Machines
2) 0.969000 Stochastic Gradient Descent
3) 0.962667 Perceptron
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
import os
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Words
import nltk
import re
from wordcloud import WordCloud
from nltk.stem.porter import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#Loading the datasets
train_df = pd.read_csv('./data/training_set22.csv',index_col='Unnamed: 0')
test_df = pd.read_csv('./data/testing_set22.csv',index_col='Unnamed: 0')
Which features are available in the dataset?
print(train_df.columns.values)
['id' 'title' 'author' 'text' 'label']
train_df.head()
id | title | author | text | label | |
---|---|---|---|---|---|
5973 | 5972 | Report: Facebook Pays ’Army’ of Filipinos to P... | Lucas Nolan | Facebook reportedly pays an army of Filipino c... | 0 |
11304 | 11303 | WikiLeaks: Hillary Clinton knew Saudi, Qatar w... | wmw_admin | SEE ALSO: Why Hillary Clinton is Responsible f... | 1 |
3797 | 3796 | Apple iPhone, Once a Status Symbol in China, L... | Paul Mozur | HONG KONG — Since 2010, Yu Kai has followed... | 0 |
11197 | 11196 | U.F.C. Sells Itself for $4 Billion - The New Y... | Michael J. de la Merced | SAN FRANCISCO — Ultimate Fighting Champions... | 0 |
1490 | 1489 | At the Mosul Front: Traps, Smoke Screens and S... | Bryan Denton and Michael R. Gordon | Mr. Denton, a Times photographer, and Mr. Gord... | 0 |
test_df.head()
id | title | author | text | |
---|---|---|---|---|
3971 | 24770 | WATCH: Diamond And Silk Slam Obama, Kerry For ... | Deborah Danan | TEL AVIV — The YouTube sensations known as ... |
2219 | 23018 | Elite Soccer Clubs Sign Gamers to Compete in E... | Jack Williams | Every now and again, Koen Weijland feels like ... |
677 | 21476 | #NotMyPresident? #DrainTheSwamp? The REAL Revo... | Corbett | Podcast: Play in new window | Download | Embed... |
3226 | 24025 | Election 2016 Current Votes in 8 Battleground ... | smith.j | For decades, media networks, newspapers, and j... |
789 | 21588 | Bus Driver Pulls Over to Stop Suicidal Woman f... | Breitbart TV | An Ohio bus driver’s quick thinking last month... |
Our datasets were imported correctly. Let us check the data information.
train_df.info()
print(' ')
print('=*='*20)
print(' ')
test_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15000 entries, 5973 to 2933
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      15000 non-null  int64
 1   title   14600 non-null  object
 2   author  13599 non-null  object
 3   text    14974 non-null  object
 4   label   15000 non-null  int64
dtypes: int64(2), object(3)
memory usage: 703.1+ KB

=*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*=

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4000 entries, 3971 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      4000 non-null   int64
 1   title   3908 non-null   object
 2   author  3600 non-null   object
 3   text    3995 non-null   object
dtypes: int64(1), object(3)
memory usage: 156.2+ KB
#Searching for null values.
train_df.isna().sum()
id           0
title      400
author    1401
text        26
label        0
dtype: int64
test_df.isna().sum()
id          0
title      92
author    400
text        5
dtype: int64
We have null values. We will check whether those empty texts become a problem for our model later on. For now, let us keep them as NA.
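If those missing values later break a step, one conservative option is to replace them with empty strings rather than dropping rows; a minimal sketch on a toy frame mimicking our columns:

```python
import pandas as pd

# toy frame mimicking our columns, with a missing author, title and text
df = pd.DataFrame({
    "title": ["A headline", None],
    "author": [None, "Jane Doe"],
    "text": ["Some body text", None],
})

# replace NA in the text columns with empty strings
df[["title", "author", "text"]] = df[["title", "author", "text"]].fillna("")

print(df.isna().sum().sum())  # 0 missing values remain
```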
train_df['label'].value_counts()
1    7508
0    7492
Name: label, dtype: int64
1 stands for unreliable and 0 for reliable.
plt.hist(train_df['label'], color = 'coral', align='mid')
plt.ylabel('Count', fontsize=14)
col_names=['Reliable','Unreliable']
x_values = np.arange(0, len(col_names))
plt.xticks(x_values, col_names)
plt.xlabel('Label', fontsize=14)
plt.title('Count of news per label', fontsize = 16)
Text(0.5, 1.0, 'Count of news per label')
From this plot, we can observe that there is no class imbalance. There are very slightly more unreliable news articles, which is not a problem.
Let us check the content of our three text columns: the author, the title, and the body of the article.
We will use the training dataset for this check, since the test dataset has the same structure.
t = train_df["author"].to_list()
for i in range(5):
    print('Author of article '+str(i+1)+': '+t[i])
Author of article 1: Lucas Nolan
Author of article 2: wmw_admin
Author of article 3: Paul Mozur
Author of article 4: Michael J. de la Merced
Author of article 5: Bryan Denton and Michael R. Gordon
t = train_df["title"].to_list()
for i in range(5):
    print('Title of article '+str(i+1)+': '+t[i])
Title of article 1: Report: Facebook Pays ’Army’ of Filipinos to Police ’Offensive’ Material - Breitbart
Title of article 2: WikiLeaks: Hillary Clinton knew Saudi, Qatar were funding ISIS – but still took their money for Foundation
Title of article 3: Apple iPhone, Once a Status Symbol in China, Loses Its Luster - The New York Times
Title of article 4: U.F.C. Sells Itself for $4 Billion - The New York Times
Title of article 5: At the Mosul Front: Traps, Smoke Screens and Suicide Bombers - The New York Times
Articles are quite long, so we print only the first two.
t = train_df["text"].to_list()
for i in range(2):
    print('Content of article '+str(i+1)+': '+t[i])
Content of article 1: Facebook reportedly pays an army of Filipino content curators to police content on the social network’s platform. [The Daily Mail reports that Facebook has been hiring young Filipinos to act as content curators on the Facebook platform. The workers reportedly work grueling shifts for little pay and decide on whether or not content on Facebook should be removed or allowed. Facebook founder Mark Zuckerberg stated earlier this month that the social media company, worth $435 billion, would be adding another 3, 000 content moderators to the team of 4, 500 it already employs and pledged to “improve the process for [reporting content] quickly. ” It was discovered during an investigation by the Mail on Sunday that Facebook outsources much of its content policing to the professional services firm Accenture. “We hire college graduates, experienced hires, provide intensive training and pay competitive wages,” Accenture claimed. When asked for comment, Facebook said, “We’ve built a global network of operations centres to work so that we have people in the right country with the right language and cultural skills to review reports. We recognise this work can be difficult, which is why our contracts with partners stipulate that wellness and psychological support must be provided. ” Content of article 2: SEE ALSO: Why Hillary Clinton is Responsible for US Failures in Libya and Syria According to FOX News , FBI sources have said that ‘indictments are likely’ for the Clinton Foundation investigation. One only wonders how this latest Assange revelation will factor into the wider investigation – as it goes right to the heart of the national security and foreign policy – two things which Clinton trades heavily on in her campaigning. 
Assange went on to explain the deep ramifications of this latest criminal allegation against Clinton and her family foundation: “All serious analysts know, and even the US government has agreed, that some Saudi figures have been supporting ISIS and funding ISIS, but the dodge has always been that it is some “rogue” princes using their oil money to do whatever they like, but actually the government disapproves. But that email says that it is the government of Saudi Arabia, and the government of Qatar that have been funding ISIS.” During their 25-minute interview filmed at the Ecuadorian Embassy in London, Assange and Pilger discussed the obvious conflict of interest between Clinton as Secretary of State, the Clinton Foundation and Gulf monarchies who financed them. The following is an excerpt from the interview transcript: John Pilger: The Saudis, the Qataris, the Moroccans, the Bahrainis, particularly the first two, are giving all this money to the Clinton Foundation, while Hillary Clinton is secretary of state, and the State Department is approving massive arms sales, particularly Saudi Arabia. Julian Assange: Under Hillary Clinton – and the Clinton emails reveal a significant discussion of it – the biggest-ever arms deal in the world was made with Saudi Arabia: more than $80 billion. During her tenure, the total arms exports from the US doubled in dollar value. JP: Of course, the consequence of that is that this notorious jihadist group, called ISIL or ISIS, is created largely with money from people who are giving money to the Clinton Foundation? JA: Yes. Watch a brief preview of the interview here : Courtesy Peter Myers
Articles are quite long and contain a lot of text. We will have to process them before feeding them to a machine learning model.
One author may have written several articles. Let us count the articles per author in both the training and test datasets.
train_df['author'].value_counts().head(n=20)
Pam Key                178
admin                  150
Jerome Hudson          114
John Hayward           108
Charlie Spiering       106
Warner Todd Huston      97
Katherine Rodriguez     92
Daniel Nussbaum         88
Jeff Poor               84
Ian Hanchett            77
Breitbart News          76
Trent Baker             74
Bob Price               74
Charlie Nash            71
AWR Hawkins             71
Breitbart London        70
Starkman                66
Ben Kew                 66
Pakalert                66
Alex Ansary             63
Name: author, dtype: int64
test_df['author'].value_counts().head(n=20)
Pam Key                45
Jerome Hudson          42
admin                  40
Daniel Nussbaum        29
Charlie Spiering       27
Charlie Nash           26
Joel B. Pollak         25
Katherine Rodriguez    24
Warner Todd Huston     24
Ian Hanchett           24
Breitbart News         23
AWR Hawkins            23
Breitbart London       19
Tom Ciccotta           19
BareNakedIslam         19
John Hayward           18
EdJenner               18
Ben Kew                17
The Saker              17
Chris Tomlinson        17
Name: author, dtype: int64
# Plotting a bar graph of the number of articles per author
author_count = train_df['author'].value_counts()
author_count = author_count.head(10)
plt.figure(figsize=(10,5))
sns.barplot(x=author_count.index, y=author_count.values, alpha=0.8)
plt.title('Top 10 authors in the training dataset')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Name', fontsize=12)
plt.show()
# Plotting a bar graph of the number of articles per author
author_count = test_df['author'].value_counts()
author_count = author_count.head(10)
plt.figure(figsize=(10,5))
sns.barplot(x=author_count.index, y=author_count.values, alpha=0.8)
plt.title('Top 10 authors in the testing dataset')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Name', fontsize=12)
plt.show()
We can observe that some authors are far more prolific than others. Some authors may be particularly prone to writing fake news. We also observe that several top authors appear in both the training and test datasets.
Let us check the same thing for the titles.
train_df['title'].value_counts().head(n=20)
Will Barack Obama Delay Or Suspend The Election If Hillary Is Forced Out By The New FBI Email Investigation?    4
Michael Moore Owes Me $4.99    4
The Dark Agenda Behind Globalism And Open Borders    4
Get Ready For Civil Unrest: Survey Finds That Most Americans Are Concerned About Election Violence    4
Let’s Be Clear – A Vote For Warmonger Hillary Clinton Is A Vote For World War 3    4
Las imágenes libres de derechos más destacadas de la semana    4
Brother of Clinton’s Campaign Chair is an Active Foreign Agent on the Saudi Arabian Payroll    3
Televisión: lo más visto ayer    3
Break the Silence or Support Self-Determination? In Syria, the Answer Should be Obvious    3
The Pathologization of Dissent    3
Not Just Hillary, Entire Obama Administration Exposed for Using Private Email to Avoid FOIA Requests    3
President Elect Trump – A New Era of Unpredictability Awaits    3
FDA Found Manipulating The Media In Favor Of Big Pharma    3
Trump warns of World War III if Clinton is elected    3
The U.S./Turkey Plan For “Seizing, Holding, And Occupying” Syrian Territory In Raqqa    3
Dennis Kucinich’s Extraordinary Warning on D.C.’s Think Tank Warmongers    3
WWN’s Horoscopes    3
What to Cook This Week - The New York Times    3
Woman Arrested On Own Property After Her Land Was Stolen By DAPL    3
Biden Blames “Lazy American Women” For The Economy: “They Sit Around Doing Nothing, Only Hillary Can Force Them To Work”    3
Name: title, dtype: int64
test_df['title'].value_counts().head(n=20)
The De Facto US/Al Qaeda Alliance    2
TRUMP CALLS FOR TEACHING ‘PATRIOTISM’ IN SCHOOLS    2
When Slaveholders Controlled the Government—An Interview with Matthew Karp    2
One of the Most Undervalued Storable Survival Foods    2
Thousands Of Buffalo Appear At Site Of Standing Rock Protest [Watch]    2
The Modern History of ‘Rigged’ US Elections    2
Get Ready For Civil Unrest: Survey Finds That Most Americans Are Concerned About Election Violence    2
Massive Voter Fraud In Texas    2
CETA: Canada Has Challenged The EU’s Chemical Regulations 21 Times    2
BREAKING: Racketeering indictment of Hillary Clinton now ‘likely’ as FOIA for Datto backup device reveals FBI possesses ALL the incriminating emails    2
2016 Shows Record Number of Refugee Deaths in Mediterranean    1
Everyone now officially an artisan    1
Colombian Opposition to Peace Deal Feeds Off Gay Rights Backlash - The New York Times    1
’Saturday Night Live’ Writer Tweets Barron Trump Will Be America’s ‘First Homeschool Shooter’    1
Snow reports from around the Northland (Duluth) through 9 a.m. Saturday    1
In Cranes’ Shadow, Los Angeles Strains to See a Future With Less Sprawl - The New York Times    1
Netanyahu: U.S.-Israel Alliance is ’About to Get Stronger’    1
Nature: The Ultimate Cure    1
Christo’s Newest Project: Walking on Water - The New York Times    1
Mexico Ends Its Soccer Frustration on U.S. Soil - The New York Times    1
Name: title, dtype: int64
Some titles appear several times; perhaps some articles appear more than once.
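One quick way to test that suspicion is pandas' `duplicated`; a sketch on a toy frame (the real check would run on `train_df` with the same arguments):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Same headline", "Same headline", "Unique headline"],
    "text":  ["same body",     "same body",     "other body"],
})

# rows whose title AND text both repeat an earlier row
n_dupes = df.duplicated(subset=["title", "text"]).sum()
print(n_dupes)  # 1
```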
We will merge all the text content we have just analysed into a single column, in both the training and test datasets.
#We will join title, text and author (separated by spaces) to create the article feature
train_df['article'] = train_df['title']+" "+train_df['text']+" "+train_df['author']
#We will join title, text and author (separated by spaces) to create the article feature
test_df['article'] = test_df['title']+" "+test_df['text']+" "+test_df['author']
#Creating the final Dataframe with article and label.
train_df_final = train_df[['article','label']]
# Create empty label target column in our test dataset
test_df['label']=""
#Creating the final Dataframe test with article and label.
test_df_final = test_df[['article','label']]
Convert the article column to string type
train_df_final['article'] = train_df_final['article'].astype(str)
test_df_final['article'] = test_df_final['article'].astype(str)
We will clean the articles by lower-casing them, removing punctuation and special characters, tokenizing them, and removing stop words, so that they can be used by machine learning models.
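The cleaning steps just listed can be sketched as a single helper; `clean_article` and its tiny stop-word list are illustrative only, while the notebook itself uses NLTK's English stop-word list below:

```python
import re

# a tiny stop-word list for illustration; the notebook uses nltk's English list
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is"}

def clean_article(text):
    text = text.lower()                    # lower-case
    text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation / special chars
    tokens = text.split()                  # tokenise on whitespace
    # drop stop words and single characters
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(clean_article("The WikiLeaks e-mails, according to FOX News!"))
# ['wikileaks', 'mails', 'according', 'fox', 'news']
```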
#Converting to lower case
train_df_final['article'] = train_df_final['article'].apply(lambda x: x.lower())
test_df_final['article'] = test_df_final['article'].apply(lambda x: x.lower())
train_df_final['article'].head()
5973     report: facebook pays ’army’ of filipinos to p...
11304    wikileaks: hillary clinton knew saudi, qatar w...
3797     apple iphone, once a status symbol in china, l...
11197    u.f.c. sells itself for $4 billion - the new y...
1490     at the mosul front: traps, smoke screens and s...
Name: article, dtype: object
#Removing punctuation
import string
def punctuation_removal(messy_str):
    clean_list = [char for char in messy_str if char not in string.punctuation]
    clean_str = ''.join(clean_list)
    return clean_str
train_df_final['article'] = train_df_final['article'].apply(punctuation_removal)
test_df_final['article'] = test_df_final['article'].apply(punctuation_removal)
train_df_final['article'].head()
5973     report facebook pays ’army’ of filipinos to po...
11304    wikileaks hillary clinton knew saudi qatar wer...
3797     apple iphone once a status symbol in china los...
11197    ufc sells itself for 4 billion the new york t...
1490     at the mosul front traps smoke screens and sui...
Name: article, dtype: object
test_df_final['article'].head()
3971    watch diamond and silk slam obama kerry for ’s...
2219    elite soccer clubs sign gamers to compete in e...
677     notmypresident draintheswamp the real revoluti...
3226    election 2016 current votes in 8 battleground ...
789     bus driver pulls over to stop suicidal woman f...
Name: article, dtype: object
#Tokenize the articles, removing punctuation and stopwords in the process
X = []
stop_words = set(nltk.corpus.stopwords.words("english"))
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
for par in train_df_final["article"].values:
    tmp = []
    sentences = nltk.sent_tokenize(par)
    for sent in sentences:
        sent = sent.lower()
        tokens = tokenizer.tokenize(sent)
        filtered_words = [w.strip() for w in tokens if w not in stop_words and len(w) > 1]
        tmp.extend(filtered_words)
    X.append(tmp)
train_df_final["article"]=X
train_df_final["article"]=train_df_final["article"].astype(str)
train_df_final.head()
article | label | |
---|---|---|
5973 | ['report', 'facebook', 'pays', 'army', 'filipi... | 0 |
11304 | ['wikileaks', 'hillary', 'clinton', 'knew', 's... | 1 |
3797 | ['apple', 'iphone', 'status', 'symbol', 'china... | 0 |
11197 | ['ufc', 'sells', 'billion', 'new', 'york', 'ti... | 0 |
1490 | ['mosul', 'front', 'traps', 'smoke', 'screens'... | 0 |
We apply the same steps to our testing dataset.
#Tokenize the articles, removing punctuation and stopwords in the process
X = []
stop_words = set(nltk.corpus.stopwords.words("english"))
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
for par in test_df_final["article"].values:
    tmp = []
    sentences = nltk.sent_tokenize(par)
    for sent in sentences:
        sent = sent.lower()
        tokens = tokenizer.tokenize(sent)
        filtered_words = [w.strip() for w in tokens if w not in stop_words and len(w) > 1]
        tmp.extend(filtered_words)
    X.append(tmp)
test_df_final["article"]=X
test_df_final["article"]=test_df_final["article"].astype(str)
test_df_final.head()
article | label | |
---|---|---|
3971 | ['watch', 'diamond', 'silk', 'slam', 'obama', ... | |
2219 | ['elite', 'soccer', 'clubs', 'sign', 'gamers',... | |
677 | ['notmypresident', 'draintheswamp', 'real', 'r... | |
3226 | ['election', '2016', 'current', 'votes', 'batt... | |
789 | ['bus', 'driver', 'pulls', 'stop', 'suicidal',... |
Now that we have cleaned the data and converted it into the right format, we can create some visualisations.
First, let us plot word clouds to grasp the most frequent terms in the articles.
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in train_df_final.article])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Visualizing the true news with Wordcloud.
train_df_final[train_df_final['label']==0]
article | label | |
---|---|---|
5973 | ['report', 'facebook', 'pays', 'army', 'filipi... | 0 |
3797 | ['apple', 'iphone', 'status', 'symbol', 'china... | 0 |
11197 | ['ufc', 'sells', 'billion', 'new', 'york', 'ti... | 0 |
1490 | ['mosul', 'front', 'traps', 'smoke', 'screens'... | 0 |
6758 | ['frank', 'gaffney', 'tillerson', 'incoherent'... | 0 |
... | ... | ... |
18667 | ['right', 'left', 'partisan', 'writing', 'miss... | 0 |
9954 | ['hirakhand', 'express', 'train', 'derails', '... | 0 |
19494 | ['lirr', 'train', 'crashed', 'going', 'twice',... | 0 |
19662 | ['putin', 'allegations', 'russian', 'meddling'... | 0 |
2933 | ['donald', 'trump', 'delta', 'air', 'lines', '... | 0 |
7492 rows × 2 columns
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in train_df_final[train_df_final['label']==0].article])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in train_df_final[train_df_final['label']==1].article ])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Let us compare with our testing dataset
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in test_df_final.article ])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Now we will plot a Pareto chart to better visualise the word frequencies.
#Whitespace tokenizer used to split the articles into words
from nltk import tokenize
token_space = tokenize.WhitespaceTokenizer()
import seaborn as sns
import nltk
def pareto(text, column_text, quantity):
    all_words = ' '.join([text for text in text[column_text]])
    token_phrase = token_space.tokenize(all_words)
    frequency = nltk.FreqDist(token_phrase)
    df_frequency = pd.DataFrame({"Word": list(frequency.keys()),
                                 "Frequency": list(frequency.values())})
    df_frequency = df_frequency.nlargest(columns="Frequency", n=quantity)
    plt.figure(figsize=(12,8))
    ax = sns.barplot(data=df_frequency, x="Word", y="Frequency", color='blue')
    ax.set(ylabel="Count")
    plt.show()
#The 20 most frequent words in reliable articles
pareto(train_df_final[train_df_final['label']==0], "article", 20)
#The 20 most frequent words in unreliable articles
pareto(train_df_final[train_df_final['label']==1], "article", 20)
#The 20 most frequent words in the test dataset
pareto(test_df_final, "article", 20)
Now, let us build our machine learning models.
We will create the bag of words (BOW) and TF-IDF representations in order to use those models.
from sklearn.feature_extraction.text import CountVectorizer
#Creating the bag of words
bow_article = CountVectorizer().fit(train_df_final['article'])
article_vect = bow_article.transform(train_df_final['article'])
#TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(article_vect)
news_tfidf = tfidf_transformer.transform(article_vect)
print(news_tfidf.shape)
(15000, 159882)
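As a side note, the two-step CountVectorizer + TfidfTransformer pipeline used above is, with default settings, equivalent to scikit-learn's single TfidfVectorizer; a small sketch on toy documents:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = ["facebook pays content curators",
        "clinton foundation funding questions",
        "facebook content moderators hired"]

# two-step pipeline, as in the notebook: raw counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# single-step equivalent
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```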
#We hold out 20% of the data to evaluate the models.
from sklearn.model_selection import train_test_split
X = news_tfidf
y = train_df_final['label']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now we will train several machine learning models and compare their results. Note that because the dataset does not provide labels for its test set, we must hold out part of the training data to compare the algorithms with one another.
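Our labels are nearly balanced, so a plain random split is fine; when classes are skewed, passing `stratify=y` to `train_test_split` preserves the class ratio in both halves. A sketch on hypothetical imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy imbalanced labels: 90 zeros, 10 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 9:1 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(y_te.sum(), len(y_te))  # 2 positives out of 20 test samples
```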
#Helper to plot a normalised confusion matrix for our two classes (used after training)
from sklearn.metrics import confusion_matrix
def plot_confusion(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    matrix_proportions = cm / cm.sum(axis=1, keepdims=True)
    names = ['Reliable', 'Unreliable']
    confusion_df = pd.DataFrame(matrix_proportions, index=names, columns=names)
    plt.figure(figsize=(5,5))
    sns.heatmap(confusion_df, annot=True, annot_kws={"size": 12}, cmap='YlGnBu',
                cbar=False, square=True, fmt='.2f')
    plt.ylabel('True Value', fontsize=14)
    plt.xlabel('Predicted Value', fontsize=14)
    plt.tick_params(labelsize=12)
# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
def build_best_machine_learning_models(X_train, Y_train, X_test, Y_test):
    #Stochastic Gradient Descent (SGD):
    #works on the continuous feature values produced by the vectorisation
    sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
    sgd.fit(X_train, Y_train)
    Y_pred = sgd.predict(X_test)
    acc_sgd = accuracy_score(Y_test, Y_pred)
    #Random Forest:
    random_forest = RandomForestClassifier(n_estimators=100)
    random_forest.fit(X_train, Y_train)
    Y_pred = random_forest.predict(X_test)
    acc_random_forest = accuracy_score(Y_test, Y_pred)
    #Logistic Regression:
    logreg = LogisticRegression()
    logreg.fit(X_train, Y_train)
    Y_pred = logreg.predict(X_test)
    acc_log = accuracy_score(Y_test, Y_pred)
    # K Nearest Neighbors:
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    acc_knn = accuracy_score(Y_test, Y_pred)
    # Naive Bayes:
    multinb = MultinomialNB()
    multinb.fit(X_train, Y_train)
    Y_pred = multinb.predict(X_test)
    acc_multinb = accuracy_score(Y_test, Y_pred)
    #Perceptron:
    perceptron = Perceptron(max_iter=5)
    perceptron.fit(X_train, Y_train)
    Y_pred = perceptron.predict(X_test)
    acc_perceptron = accuracy_score(Y_test, Y_pred)
    # Linear Support Vector Machine:
    linear_svc = LinearSVC()
    linear_svc.fit(X_train, Y_train)
    Y_pred = linear_svc.predict(X_test)
    acc_linear_svc = accuracy_score(Y_test, Y_pred)
    # Decision Tree:
    decision_tree = DecisionTreeClassifier()
    decision_tree.fit(X_train, Y_train)
    Y_pred = decision_tree.predict(X_test)
    acc_decision_tree = accuracy_score(Y_test, Y_pred)
    # Which is the best model?
    results = pd.DataFrame({
        'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
                  'Random Forest', 'Naive Bayes', 'Perceptron',
                  'Stochastic Gradient Descent',
                  'Decision Tree'],
        'Score': [acc_linear_svc, acc_knn, acc_log,
                  acc_random_forest, acc_multinb, acc_perceptron,
                  acc_sgd, acc_decision_tree]})
    result_df = results.sort_values(by='Score', ascending=False)
    result_df = result_df.set_index('Score')
    return result_df

results_df = build_best_machine_learning_models(X_train, Y_train, X_test, Y_test)
results_df
Score | Model |
---|---|
0.971000 | Support Vector Machines |
0.969000 | Stochastic Gradient Descent |
0.962667 | Perceptron |
0.956000 | Logistic Regression |
0.926333 | Decision Tree |
0.913667 | Random Forest |
0.880667 | KNN |
0.793000 | Naive Bayes |
Our best model is the Support Vector Machine, with an accuracy score of 0.971.
Let us plot the confusion matrix.
linear_svc = LinearSVC()
clf=linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
print (classification_report(Y_test, Y_pred))
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1471
           1       0.97      0.97      0.97      1529

    accuracy                           0.97      3000
   macro avg       0.97      0.97      0.97      3000
weighted avg       0.97      0.97      0.97      3000
The Support Vector Machine is a supervised ML algorithm which can be used for both classification and regression.
In this algorithm we plot each data item as a point in n-dimensional space (n being the number of features), with the value of each feature being the value of a particular coordinate.
We then perform classification by finding the hyperplane that best differentiates the two classes.
For SVM, the chosen hyperplane is the one that maximises the margin to both classes: in other words, the hyperplane whose distance to the nearest element of each class is the largest.
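A toy illustration of that idea, assuming nothing beyond scikit-learn: two linearly separable 2-D clusters, where LinearSVC fits a separating hyperplane and classifies points by the sign of the decision function:

```python
import numpy as np
from sklearn.svm import LinearSVC

# two well-separated 2-D clusters (class 0 lower-left, class 1 upper-right)
rng = np.random.RandomState(0)
X0 = rng.randn(20, 2) + [-2, -2]
X1 = rng.randn(20, 2) + [2, 2]
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)

clf = LinearSVC().fit(X, y)

# points on opposite sides of the hyperplane get opposite decision signs
print(clf.predict([[-2, -2], [2, 2]]))        # [0 1]
print(clf.decision_function([[-2, -2]]) < 0)  # [ True]
```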
Now that we have our best model, let us predict on the test dataset.
First we retrain our SVM model on the entire training set.
linear_svc = LinearSVC()
linear_svc.fit(X, y)
LinearSVC()
We apply the same BOW and TF-IDF transformations. Note that we must reuse the vectorizer and transformer fitted on the training data; fitting new ones on the test set would produce an incompatible vocabulary.
#Transform the test articles with the bag of words fitted on the training set
article_vect_test = bow_article.transform(test_df_final['article'])
#TF-IDF, reusing the transformer fitted on the training counts
news_tfidf = tfidf_transformer.transform(article_vect_test)
print(news_tfidf.shape)
X_test = news_tfidf
(4000, 159882)
Y_pred = linear_svc.predict(X_test)
test_df_final['label']=Y_pred
test_df_final
article | label | |
---|---|---|
3971 | ['watch', 'diamond', 'silk', 'slam', 'obama', ... | 1 |
2219 | ['elite', 'soccer', 'clubs', 'sign', 'gamers',... | 0 |
677 | ['notmypresident', 'draintheswamp', 'real', 'r... | 1 |
3226 | ['election', '2016', 'current', 'votes', 'batt... | 1 |
789 | ['bus', 'driver', 'pulls', 'stop', 'suicidal',... | 0 |
... | ... | ... |
3504 | ['rio', 'olympics', 'today', 'us', 'swimmers',... | 0 |
3294 | ['jim', 'mattis', 'says', 'us', 'shoulder', 's... | 0 |
286 | ['colombian', 'opposition', 'peace', 'deal', '... | 0 |
65 | ['six', 'gulf', 'protectors', 'arrested', 'cha... | 1 |
5 | ['keiser', 'report', 'meme', 'wars', 'e99542',... | 1 |
4000 rows × 2 columns
We export our classification to a CSV file.
test_df_final.to_csv("article_classification.csv",index=True)
Let us plot our results and compare the terms frequency according to label in our training dataset and testing dataset
Let us plot the frequency of labels first
test_df_final['label'].value_counts()
0    2011
1    1989
Name: label, dtype: int64
1 stands for unreliable and 0 for reliable. The predicted classes are fairly well balanced.
Visualizing the true news with Wordcloud.
test_df_final[test_df_final['label']==0]
article | label | |
---|---|---|
2219 | ['elite', 'soccer', 'clubs', 'sign', 'gamers',... | 0 |
789 | ['bus', 'driver', 'pulls', 'stop', 'suicidal',... | 0 |
2719 | ['millions', 'risk', 'deportation', 'justices'... | 0 |
3681 | ['40', 'japan', 'confirmed', 'dead', 'earthqua... | 0 |
720 | ['good', 'cop', 'dead', 'cop', 'says', 'allege... | 0 |
... | ... | ... |
1114 | ['macy', 'close', '100', 'stores', 'erivals', ... | 0 |
4418 | ['jobs', 'donald', 'trump', 'celebrates', 'mas... | 0 |
3504 | ['rio', 'olympics', 'today', 'us', 'swimmers',... | 0 |
3294 | ['jim', 'mattis', 'says', 'us', 'shoulder', 's... | 0 |
286 | ['colombian', 'opposition', 'peace', 'deal', '... | 0 |
2011 rows × 2 columns
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in test_df_final[test_df_final['label']==0].article])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in test_df_final[test_df_final['label']==1].article ])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Let us compare with our training dataset
Now we will plot the pareto chart to better visualize the frequencies of the words.
#The 20 most frequent words in articles predicted reliable
pareto(test_df_final[test_df_final['label']==0], "article", 20)
#The 20 most frequent words in articles predicted unreliable
pareto(test_df_final[test_df_final['label']==1], "article", 20)
From these visualisations, we can conclude that a number of keywords are particularly indicative of fake news. Indeed, we observe that these words are the same in our training and test datasets.
From our notebook we can conclude and recommend: