Simon Weiss
29.03.2021
TBS MSc Artificial Intelligence and Business Analytics
Fake news existed before the advent of the Internet.
It is widely accepted to refer to fictitious articles deliberately fabricated to deceive readers. Social media accounts and news outlets publish fake news to increase readership or as part of psychological warfare.
Fake news is one of the hottest topics in the media, and many are wondering what data scientists can do to detect it and stymie its viral spread.
The objective of this notebook and its attached report is to use machine learning to build a fake news detection algorithm that predicts whether a news article is true or fake, and thereby to study what data science can contribute to fake news detection.
Based on our results, we will formulate initial recommendations for any company or organisation trying to counteract fake news.
Our dataset consists of two parts: a training set and a testing set.
We will use our training set to build our models to predict on the test dataset which articles are considered fake or not.
The training set contains five variables (id, title, author, text, and label); the test set contains the same variables without the label.
Best models according to accuracy:
1) 0.971000 Support Vector Machines
2) 0.969000 Stochastic Gradient Descent
3) 0.962667 Perceptron
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
import os
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Words
import nltk
import re
from wordcloud import WordCloud
from nltk.stem.porter import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#Loading the datasets
train_df = pd.read_csv('./data/training_set22.csv',index_col='Unnamed: 0')
test_df = pd.read_csv('./data/testing_set22.csv',index_col='Unnamed: 0')
Which features are available in the dataset?
print(train_df.columns.values)
['id' 'title' 'author' 'text' 'label']
train_df.head()
id | title | author | text | label | |
---|---|---|---|---|---|
5973 | 5972 | Report: Facebook Pays ’Army’ of Filipinos to P... | Lucas Nolan | Facebook reportedly pays an army of Filipino c... | 0 |
11304 | 11303 | WikiLeaks: Hillary Clinton knew Saudi, Qatar w... | wmw_admin | SEE ALSO: Why Hillary Clinton is Responsible f... | 1 |
3797 | 3796 | Apple iPhone, Once a Status Symbol in China, L... | Paul Mozur | HONG KONG — Since 2010, Yu Kai has followed... | 0 |
11197 | 11196 | U.F.C. Sells Itself for $4 Billion - The New Y... | Michael J. de la Merced | SAN FRANCISCO — Ultimate Fighting Champions... | 0 |
1490 | 1489 | At the Mosul Front: Traps, Smoke Screens and S... | Bryan Denton and Michael R. Gordon | Mr. Denton, a Times photographer, and Mr. Gord... | 0 |
test_df.head()
id | title | author | text | |
---|---|---|---|---|
3971 | 24770 | WATCH: Diamond And Silk Slam Obama, Kerry For ... | Deborah Danan | TEL AVIV — The YouTube sensations known as ... |
2219 | 23018 | Elite Soccer Clubs Sign Gamers to Compete in E... | Jack Williams | Every now and again, Koen Weijland feels like ... |
677 | 21476 | #NotMyPresident? #DrainTheSwamp? The REAL Revo... | Corbett | Podcast: Play in new window | Download | Embed... |
3226 | 24025 | Election 2016 Current Votes in 8 Battleground ... | smith.j | For decades, media networks, newspapers, and j... |
789 | 21588 | Bus Driver Pulls Over to Stop Suicidal Woman f... | Breitbart TV | An Ohio bus driver’s quick thinking last month... |
Our datasets were imported correctly. Let us check the data information.
train_df.info()
print(' ')
print('=*='*20)
print(' ')
test_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15000 entries, 5973 to 2933
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      15000 non-null  int64
 1   title   14600 non-null  object
 2   author  13599 non-null  object
 3   text    14974 non-null  object
 4   label   15000 non-null  int64
dtypes: int64(2), object(3)
memory usage: 703.1+ KB

=*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*==*=

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4000 entries, 3971 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      4000 non-null   int64
 1   title   3908 non-null   object
 2   author  3600 non-null   object
 3   text    3995 non-null   object
dtypes: int64(1), object(3)
memory usage: 156.2+ KB
#Searching for null values.
train_df.isna().sum()
id           0
title      400
author    1401
text        26
label        0
dtype: int64
test_df.isna().sum()
id          0
title      92
author    400
text        5
dtype: int64
We have null values. We will check whether those empty texts become a problem for our model later on. For now, let us keep them as NA.
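If those missing values later break a step, one conservative option is to replace them with empty strings rather than dropping rows; a minimal sketch on a toy frame mimicking our columns:

```python
import pandas as pd

# toy frame mimicking our columns, with a missing author, title and text
df = pd.DataFrame({
    "title": ["A headline", None],
    "author": [None, "Jane Doe"],
    "text": ["Some body text", None],
})

# replace NA in the text columns with empty strings
df[["title", "author", "text"]] = df[["title", "author", "text"]].fillna("")

print(df.isna().sum().sum())  # 0 missing values remain
```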
train_df['label'].value_counts()
1    7508
0    7492
Name: label, dtype: int64
1 stands for unreliable and 0 for reliable.
plt.hist(train_df['label'], color = 'coral', align='mid')
plt.ylabel('Count', fontsize=14)
col_names=['Reliable','Unreliable']
x_values = np.arange(0, len(col_names))
plt.xticks(x_values, col_names)
plt.xlabel('Label', fontsize=14)
plt.title('Count of news per label', fontsize = 16)
Text(0.5, 1.0, 'Count of news per label')
From this plot, we can observe that there is no class imbalance. There are very slightly more unreliable news articles, which is not a problem.
Let us check the content of our three text columns: the author, the title, and the body of the article.
We will use the training dataset for this check, since the test dataset has the same structure.
t = train_df["author"].to_list()
for i in range(5):
    print('Author of article '+str(i+1)+': '+t[i])
Author of article 1: Lucas Nolan
Author of article 2: wmw_admin
Author of article 3: Paul Mozur
Author of article 4: Michael J. de la Merced
Author of article 5: Bryan Denton and Michael R. Gordon
t = train_df["title"].to_list()
for i in range(5):
    print('Title of article '+str(i+1)+': '+t[i])
Title of article 1: Report: Facebook Pays ’Army’ of Filipinos to Police ’Offensive’ Material - Breitbart
Title of article 2: WikiLeaks: Hillary Clinton knew Saudi, Qatar were funding ISIS – but still took their money for Foundation
Title of article 3: Apple iPhone, Once a Status Symbol in China, Loses Its Luster - The New York Times
Title of article 4: U.F.C. Sells Itself for $4 Billion - The New York Times
Title of article 5: At the Mosul Front: Traps, Smoke Screens and Suicide Bombers - The New York Times
Articles are quite long, so we print only the first two.
t = train_df["text"].to_list()
for i in range(2):
    print('Content of article '+str(i+1)+': '+t[i])
Content of article 1: Facebook reportedly pays an army of Filipino content curators to police content on the social network’s platform. [The Daily Mail reports that Facebook has been hiring young Filipinos to act as content curators on the Facebook platform. The workers reportedly work grueling shifts for little pay and decide on whether or not content on Facebook should be removed or allowed. Facebook founder Mark Zuckerberg stated earlier this month that the social media company, worth $435 billion, would be adding another 3, 000 content moderators to the team of 4, 500 it already employs and pledged to “improve the process for [reporting content] quickly. ” It was discovered during an investigation by the Mail on Sunday that Facebook outsources much of its content policing to the professional services firm Accenture. “We hire college graduates, experienced hires, provide intensive training and pay competitive wages,” Accenture claimed. When asked for comment, Facebook said, “We’ve built a global network of operations centres to work so that we have people in the right country with the right language and cultural skills to review reports. We recognise this work can be difficult, which is why our contracts with partners stipulate that wellness and psychological support must be provided. ” Content of article 2: SEE ALSO: Why Hillary Clinton is Responsible for US Failures in Libya and Syria According to FOX News , FBI sources have said that ‘indictments are likely’ for the Clinton Foundation investigation. One only wonders how this latest Assange revelation will factor into the wider investigation – as it goes right to the heart of the national security and foreign policy – two things which Clinton trades heavily on in her campaigning. 
Assange went on to explain the deep ramifications of this latest criminal allegation against Clinton and her family foundation: “All serious analysts know, and even the US government has agreed, that some Saudi figures have been supporting ISIS and funding ISIS, but the dodge has always been that it is some “rogue” princes using their oil money to do whatever they like, but actually the government disapproves. But that email says that it is the government of Saudi Arabia, and the government of Qatar that have been funding ISIS.” During their 25-minute interview filmed at the Ecuadorian Embassy in London, Assange and Pilger discussed the obvious conflict of interest between Clinton as Secretary of State, the Clinton Foundation and Gulf monarchies who financed them. The following is an excerpt from the interview transcript: John Pilger: The Saudis, the Qataris, the Moroccans, the Bahrainis, particularly the first two, are giving all this money to the Clinton Foundation, while Hillary Clinton is secretary of state, and the State Department is approving massive arms sales, particularly Saudi Arabia. Julian Assange: Under Hillary Clinton – and the Clinton emails reveal a significant discussion of it – the biggest-ever arms deal in the world was made with Saudi Arabia: more than $80 billion. During her tenure, the total arms exports from the US doubled in dollar value. JP: Of course, the consequence of that is that this notorious jihadist group, called ISIL or ISIS, is created largely with money from people who are giving money to the Clinton Foundation? JA: Yes. Watch a brief preview of the interview here : Courtesy Peter Myers
Articles are quite long and contain a lot of text. We will have to process them before feeding them to a machine learning model.
One author may have written several articles. Let us count the articles per author in both the training and test datasets.
train_df['author'].value_counts().head(n=20)
Pam Key                178
admin                  150
Jerome Hudson          114
John Hayward           108
Charlie Spiering       106
Warner Todd Huston      97
Katherine Rodriguez     92
Daniel Nussbaum         88
Jeff Poor               84
Ian Hanchett            77
Breitbart News          76
Trent Baker             74
Bob Price               74
Charlie Nash            71
AWR Hawkins             71
Breitbart London        70
Starkman                66
Ben Kew                 66
Pakalert                66
Alex Ansary             63
Name: author, dtype: int64
test_df['author'].value_counts().head(n=20)
Pam Key                45
Jerome Hudson          42
admin                  40
Daniel Nussbaum        29
Charlie Spiering       27
Charlie Nash           26
Joel B. Pollak         25
Katherine Rodriguez    24
Warner Todd Huston     24
Ian Hanchett           24
Breitbart News         23
AWR Hawkins            23
Breitbart London       19
Tom Ciccotta           19
BareNakedIslam         19
John Hayward           18
EdJenner               18
Ben Kew                17
The Saker              17
Chris Tomlinson        17
Name: author, dtype: int64
# Plotting a bar graph of the number of articles per author
author_count = train_df['author'].value_counts()
author_count = author_count.head(10)
plt.figure(figsize=(10,5))
sns.barplot(x=author_count.index, y=author_count.values, alpha=0.8)
plt.title('Top 10 authors in the training dataset')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Name', fontsize=12)
plt.show()
# Plotting a bar graph of the number of articles per author
author_count = test_df['author'].value_counts()
author_count = author_count.head(10)
plt.figure(figsize=(10,5))
sns.barplot(x=author_count.index, y=author_count.values, alpha=0.8)
plt.title('Top 10 authors in the testing dataset')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Name', fontsize=12)
plt.show()
We can observe that some authors are far more prolific than others. Some authors may be particularly prone to writing fake news. We also observe that several top authors appear in both the training and test datasets.
Let us check the same thing for the titles.
train_df['title'].value_counts().head(n=20)
Will Barack Obama Delay Or Suspend The Election If Hillary Is Forced Out By The New FBI Email Investigation?    4
Michael Moore Owes Me $4.99    4
The Dark Agenda Behind Globalism And Open Borders    4
Get Ready For Civil Unrest: Survey Finds That Most Americans Are Concerned About Election Violence    4
Let’s Be Clear – A Vote For Warmonger Hillary Clinton Is A Vote For World War 3    4
Las imágenes libres de derechos más destacadas de la semana    4
Brother of Clinton’s Campaign Chair is an Active Foreign Agent on the Saudi Arabian Payroll    3
Televisión: lo más visto ayer    3
Break the Silence or Support Self-Determination? In Syria, the Answer Should be Obvious    3
The Pathologization of Dissent    3
Not Just Hillary, Entire Obama Administration Exposed for Using Private Email to Avoid FOIA Requests    3
President Elect Trump – A New Era of Unpredictability Awaits    3
FDA Found Manipulating The Media In Favor Of Big Pharma    3
Trump warns of World War III if Clinton is elected    3
The U.S./Turkey Plan For “Seizing, Holding, And Occupying” Syrian Territory In Raqqa    3
Dennis Kucinich’s Extraordinary Warning on D.C.’s Think Tank Warmongers    3
WWN’s Horoscopes    3
What to Cook This Week - The New York Times    3
Woman Arrested On Own Property After Her Land Was Stolen By DAPL    3
Biden Blames “Lazy American Women” For The Economy: “They Sit Around Doing Nothing, Only Hillary Can Force Them To Work”    3
Name: title, dtype: int64
test_df['title'].value_counts().head(n=20)
The De Facto US/Al Qaeda Alliance    2
TRUMP CALLS FOR TEACHING ‘PATRIOTISM’ IN SCHOOLS    2
When Slaveholders Controlled the Government—An Interview with Matthew Karp    2
One of the Most Undervalued Storable Survival Foods    2
Thousands Of Buffalo Appear At Site Of Standing Rock Protest [Watch]    2
The Modern History of ‘Rigged’ US Elections    2
Get Ready For Civil Unrest: Survey Finds That Most Americans Are Concerned About Election Violence    2
Massive Voter Fraud In Texas    2
CETA: Canada Has Challenged The EU’s Chemical Regulations 21 Times    2
BREAKING: Racketeering indictment of Hillary Clinton now ‘likely’ as FOIA for Datto backup device reveals FBI possesses ALL the incriminating emails    2
2016 Shows Record Number of Refugee Deaths in Mediterranean    1
Everyone now officially an artisan    1
Colombian Opposition to Peace Deal Feeds Off Gay Rights Backlash - The New York Times    1
’Saturday Night Live’ Writer Tweets Barron Trump Will Be America’s ‘First Homeschool Shooter’    1
Snow reports from around the Northland (Duluth) through 9 a.m. Saturday    1
In Cranes’ Shadow, Los Angeles Strains to See a Future With Less Sprawl - The New York Times    1
Netanyahu: U.S.-Israel Alliance is ’About to Get Stronger’    1
Nature: The Ultimate Cure    1
Christo’s Newest Project: Walking on Water - The New York Times    1
Mexico Ends Its Soccer Frustration on U.S. Soil - The New York Times    1
Name: title, dtype: int64
Some titles appear several times; perhaps some articles appear more than once.
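One quick way to test that suspicion is pandas' `duplicated`; a sketch on a toy frame (the real check would run on `train_df` with the same arguments):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Same headline", "Same headline", "Unique headline"],
    "text":  ["same body",     "same body",     "other body"],
})

# rows whose title AND text both repeat an earlier row
n_dupes = df.duplicated(subset=["title", "text"]).sum()
print(n_dupes)  # 1
```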
We will merge all the text content we have just analysed into a single column, in both the training and test datasets.
#We will join title, text and author (separated by spaces) to create the article feature
train_df['article'] = train_df['title']+" "+train_df['text']+" "+train_df['author']
#We will join title, text and author (separated by spaces) to create the article feature
test_df['article'] = test_df['title']+" "+test_df['text']+" "+test_df['author']
#Creating the final Dataframe with article and label.
train_df_final = train_df[['article','label']]
# Create empty label target column in our test dataset
test_df['label']=""
#Creating the final Dataframe test with article and label.
test_df_final = test_df[['article','label']]
Convert the article column to string type
train_df_final['article'] = train_df_final['article'].astype(str)
test_df_final['article'] = test_df_final['article'].astype(str)
We will clean the articles by lower-casing them, removing punctuation and special characters, tokenizing them, and removing stop words, so that they can be used by machine learning models.
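The cleaning steps just listed can be sketched as a single helper; `clean_article` and its tiny stop-word list are illustrative only, while the notebook itself uses NLTK's English stop-word list below:

```python
import re

# a tiny stop-word list for illustration; the notebook uses nltk's English list
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is"}

def clean_article(text):
    text = text.lower()                    # lower-case
    text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation / special chars
    tokens = text.split()                  # tokenise on whitespace
    # drop stop words and single characters
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(clean_article("The WikiLeaks e-mails, according to FOX News!"))
# ['wikileaks', 'mails', 'according', 'fox', 'news']
```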
#Converting to lower case
train_df_final['article'] = train_df_final['article'].apply(lambda x: x.lower())
test_df_final['article'] = test_df_final['article'].apply(lambda x: x.lower())
train_df_final['article'].head()
5973     report: facebook pays ’army’ of filipinos to p...
11304    wikileaks: hillary clinton knew saudi, qatar w...
3797     apple iphone, once a status symbol in china, l...
11197    u.f.c. sells itself for $4 billion - the new y...
1490     at the mosul front: traps, smoke screens and s...
Name: article, dtype: object
#Removing punctuation
import string
def punctuation_removal(messy_str):
    clean_list = [char for char in messy_str if char not in string.punctuation]
    clean_str = ''.join(clean_list)
    return clean_str
train_df_final['article'] = train_df_final['article'].apply(punctuation_removal)
test_df_final['article'] = test_df_final['article'].apply(punctuation_removal)
train_df_final['article'].head()
5973     report facebook pays ’army’ of filipinos to po...
11304    wikileaks hillary clinton knew saudi qatar wer...
3797     apple iphone once a status symbol in china los...
11197    ufc sells itself for 4 billion the new york t...
1490     at the mosul front traps smoke screens and sui...
Name: article, dtype: object
test_df_final['article'].head()
3971    watch diamond and silk slam obama kerry for ’s...
2219    elite soccer clubs sign gamers to compete in e...
677     notmypresident draintheswamp the real revoluti...
3226    election 2016 current votes in 8 battleground ...
789     bus driver pulls over to stop suicidal woman f...
Name: article, dtype: object
#Tokenize the articles, removing punctuation and stopwords in the process
X = []
stop_words = set(nltk.corpus.stopwords.words("english"))
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
for par in train_df_final["article"].values:
    tmp = []
    sentences = nltk.sent_tokenize(par)
    for sent in sentences:
        sent = sent.lower()
        tokens = tokenizer.tokenize(sent)
        filtered_words = [w.strip() for w in tokens if w not in stop_words and len(w) > 1]
        tmp.extend(filtered_words)
    X.append(tmp)
train_df_final["article"]=X
train_df_final["article"]=train_df_final["article"].astype(str)
train_df_final.head()
article | label | |
---|---|---|
5973 | ['report', 'facebook', 'pays', 'army', 'filipi... | 0 |
11304 | ['wikileaks', 'hillary', 'clinton', 'knew', 's... | 1 |
3797 | ['apple', 'iphone', 'status', 'symbol', 'china... | 0 |
11197 | ['ufc', 'sells', 'billion', 'new', 'york', 'ti... | 0 |
1490 | ['mosul', 'front', 'traps', 'smoke', 'screens'... | 0 |
We apply the same steps to our testing dataset.
#Tokenize the articles, removing punctuation and stopwords in the process
X = []
stop_words = set(nltk.corpus.stopwords.words("english"))
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
for par in test_df_final["article"].values:
    tmp = []
    sentences = nltk.sent_tokenize(par)
    for sent in sentences:
        sent = sent.lower()
        tokens = tokenizer.tokenize(sent)
        filtered_words = [w.strip() for w in tokens if w not in stop_words and len(w) > 1]
        tmp.extend(filtered_words)
    X.append(tmp)
test_df_final["article"]=X
test_df_final["article"]=test_df_final["article"].astype(str)
test_df_final.head()
article | label | |
---|---|---|
3971 | ['watch', 'diamond', 'silk', 'slam', 'obama', ... | |
2219 | ['elite', 'soccer', 'clubs', 'sign', 'gamers',... | |
677 | ['notmypresident', 'draintheswamp', 'real', 'r... | |
3226 | ['election', '2016', 'current', 'votes', 'batt... | |
789 | ['bus', 'driver', 'pulls', 'stop', 'suicidal',... |
Now that we have cleaned the data and converted it into the right format, we can create some visualisations.
First, let us plot word clouds to grasp the most frequent terms in the articles.
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in train_df_final.article])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Visualizing the true news with Wordcloud.
train_df_final[train_df_final['label']==0]
article | label | |
---|---|---|
5973 | ['report', 'facebook', 'pays', 'army', 'filipi... | 0 |
3797 | ['apple', 'iphone', 'status', 'symbol', 'china... | 0 |
11197 | ['ufc', 'sells', 'billion', 'new', 'york', 'ti... | 0 |
1490 | ['mosul', 'front', 'traps', 'smoke', 'screens'... | 0 |
6758 | ['frank', 'gaffney', 'tillerson', 'incoherent'... | 0 |
... | ... | ... |
18667 | ['right', 'left', 'partisan', 'writing', 'miss... | 0 |
9954 | ['hirakhand', 'express', 'train', 'derails', '... | 0 |
19494 | ['lirr', 'train', 'crashed', 'going', 'twice',... | 0 |
19662 | ['putin', 'allegations', 'russian', 'meddling'... | 0 |
2933 | ['donald', 'trump', 'delta', 'air', 'lines', '... | 0 |
7492 rows × 2 columns
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in train_df_final[train_df_final['label']==0].article])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in train_df_final[train_df_final['label']==1].article ])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Let us compare with our testing dataset
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in test_df_final.article ])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Now we will plot a Pareto chart to better visualise the word frequencies.
#Whitespace tokenizer used to split the articles into words
from nltk import tokenize
token_space = tokenize.WhitespaceTokenizer()
import seaborn as sns
import nltk
def pareto(text, column_text, quantity):
    all_words = ' '.join([text for text in text[column_text]])
    token_phrase = token_space.tokenize(all_words)
    frequency = nltk.FreqDist(token_phrase)
    df_frequency = pd.DataFrame({"Word": list(frequency.keys()),
                                 "Frequency": list(frequency.values())})
    df_frequency = df_frequency.nlargest(columns="Frequency", n=quantity)
    plt.figure(figsize=(12,8))
    ax = sns.barplot(data=df_frequency, x="Word", y="Frequency", color='blue')
    ax.set(ylabel="Count")
    plt.show()
#The 20 most frequent words in reliable articles
pareto(train_df_final[train_df_final['label']==0], "article", 20)
#The 20 most frequent words in unreliable articles
pareto(train_df_final[train_df_final['label']==1], "article", 20)
#The 20 most frequent words in the test dataset
pareto(test_df_final, "article", 20)
Now, let us build our machine learning models.
We will create the bag of words (BOW) and TF-IDF representations in order to use those models.
from sklearn.feature_extraction.text import CountVectorizer
#Creating the bag of words
bow_article = CountVectorizer().fit(train_df_final['article'])
article_vect = bow_article.transform(train_df_final['article'])
#TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(article_vect)
news_tfidf = tfidf_transformer.transform(article_vect)
print(news_tfidf.shape)
(15000, 159882)
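As a side note, the two-step CountVectorizer + TfidfTransformer pipeline used above is, with default settings, equivalent to scikit-learn's single TfidfVectorizer; a small sketch on toy documents:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = ["facebook pays content curators",
        "clinton foundation funding questions",
        "facebook content moderators hired"]

# two-step pipeline, as in the notebook: raw counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# single-step equivalent
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```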
#We hold out 20% of the data to evaluate the models.
from sklearn.model_selection import train_test_split
X = news_tfidf
y = train_df_final['label']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now we will train several machine learning models and compare their results. Note that because the dataset does not provide labels for its test set, we must hold out part of the training data to compare the algorithms with one another.
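Our labels are nearly balanced, so a plain random split is fine; when classes are skewed, passing `stratify=y` to `train_test_split` preserves the class ratio in both halves. A sketch on hypothetical imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy imbalanced labels: 90 zeros, 10 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 9:1 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(y_te.sum(), len(y_te))  # 2 positives out of 20 test samples
```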
#Helper to plot a normalised confusion matrix for our two classes (used after training)
from sklearn.metrics import confusion_matrix
def plot_confusion(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    matrix_proportions = cm / cm.sum(axis=1, keepdims=True)
    names = ['Reliable', 'Unreliable']
    confusion_df = pd.DataFrame(matrix_proportions, index=names, columns=names)
    plt.figure(figsize=(5,5))
    sns.heatmap(confusion_df, annot=True, annot_kws={"size": 12}, cmap='YlGnBu',
                cbar=False, square=True, fmt='.2f')
    plt.ylabel('True Value', fontsize=14)
    plt.xlabel('Predicted Value', fontsize=14)
    plt.tick_params(labelsize=12)
# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
def build_best_machine_learning_models(X_train, Y_train, X_test, Y_test):
    #Stochastic Gradient Descent (SGD):
    #works on the continuous feature values produced by the vectorisation
    sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
    sgd.fit(X_train, Y_train)
    Y_pred = sgd.predict(X_test)
    acc_sgd = accuracy_score(Y_test, Y_pred)
    #Random Forest:
    random_forest = RandomForestClassifier(n_estimators=100)
    random_forest.fit(X_train, Y_train)
    Y_pred = random_forest.predict(X_test)
    acc_random_forest = accuracy_score(Y_test, Y_pred)
    #Logistic Regression:
    logreg = LogisticRegression()
    logreg.fit(X_train, Y_train)
    Y_pred = logreg.predict(X_test)
    acc_log = accuracy_score(Y_test, Y_pred)
    # K Nearest Neighbors:
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    acc_knn = accuracy_score(Y_test, Y_pred)
    # Naive Bayes:
    multinb = MultinomialNB()
    multinb.fit(X_train, Y_train)
    Y_pred = multinb.predict(X_test)
    acc_multinb = accuracy_score(Y_test, Y_pred)
    #Perceptron:
    perceptron = Perceptron(max_iter=5)
    perceptron.fit(X_train, Y_train)
    Y_pred = perceptron.predict(X_test)
    acc_perceptron = accuracy_score(Y_test, Y_pred)
    # Linear Support Vector Machine:
    linear_svc = LinearSVC()
    linear_svc.fit(X_train, Y_train)
    Y_pred = linear_svc.predict(X_test)
    acc_linear_svc = accuracy_score(Y_test, Y_pred)
    # Decision Tree:
    decision_tree = DecisionTreeClassifier()
    decision_tree.fit(X_train, Y_train)
    Y_pred = decision_tree.predict(X_test)
    acc_decision_tree = accuracy_score(Y_test, Y_pred)
    # Which is the best model?
    results = pd.DataFrame({
        'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
                  'Random Forest', 'Naive Bayes', 'Perceptron',
                  'Stochastic Gradient Descent',
                  'Decision Tree'],
        'Score': [acc_linear_svc, acc_knn, acc_log,
                  acc_random_forest, acc_multinb, acc_perceptron,
                  acc_sgd, acc_decision_tree]})
    result_df = results.sort_values(by='Score', ascending=False)
    result_df = result_df.set_index('Score')
    return result_df

results_df = build_best_machine_learning_models(X_train, Y_train, X_test, Y_test)
results_df
Score | Model |
---|---|
0.971000 | Support Vector Machines |
0.969000 | Stochastic Gradient Descent |
0.962667 | Perceptron |
0.956000 | Logistic Regression |
0.926333 | Decision Tree |
0.913667 | Random Forest |
0.880667 | KNN |
0.793000 | Naive Bayes |
Our best model is the Support Vector Machine, with an accuracy score of 0.971.
Let us plot the confusion matrix.
linear_svc = LinearSVC()
clf=linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
print (classification_report(Y_test, Y_pred))
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1471
           1       0.97      0.97      0.97      1529

    accuracy                           0.97      3000
   macro avg       0.97      0.97      0.97      3000
weighted avg       0.97      0.97      0.97      3000
The Support Vector Machine is a supervised ML algorithm which can be used for both classification and regression.
In this algorithm we plot each data item as a point in n-dimensional space (n being the number of features), with the value of each feature being the value of a particular coordinate.
We then perform classification by finding the hyperplane that best differentiates the two classes.
For SVM, the chosen hyperplane is the one that maximises the margin to both classes: in other words, the hyperplane whose distance to the nearest element of each class is the largest.
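A toy illustration of that idea, assuming nothing beyond scikit-learn: two linearly separable 2-D clusters, where LinearSVC fits a separating hyperplane and classifies points by the sign of the decision function:

```python
import numpy as np
from sklearn.svm import LinearSVC

# two well-separated 2-D clusters (class 0 lower-left, class 1 upper-right)
rng = np.random.RandomState(0)
X0 = rng.randn(20, 2) + [-2, -2]
X1 = rng.randn(20, 2) + [2, 2]
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)

clf = LinearSVC().fit(X, y)

# points on opposite sides of the hyperplane get opposite decision signs
print(clf.predict([[-2, -2], [2, 2]]))        # [0 1]
print(clf.decision_function([[-2, -2]]) < 0)  # [ True]
```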
Now that we have our best model, let us predict on the test dataset.
First we retrain our SVM model on the entire training set.
linear_svc = LinearSVC()
linear_svc.fit(X, y)
LinearSVC()
We apply the same BOW and TF-IDF transformations. Note that we must reuse the vectorizer and transformer fitted on the training data; fitting new ones on the test set would produce an incompatible vocabulary.
#Transform the test articles with the bag of words fitted on the training set
article_vect_test = bow_article.transform(test_df_final['article'])
#TF-IDF, reusing the transformer fitted on the training counts
news_tfidf = tfidf_transformer.transform(article_vect_test)
print(news_tfidf.shape)
X_test = news_tfidf
(4000, 159882)
Y_pred = linear_svc.predict(X_test)
test_df_final['label']=Y_pred
test_df_final
article | label | |
---|---|---|
3971 | ['watch', 'diamond', 'silk', 'slam', 'obama', ... | 1 |
2219 | ['elite', 'soccer', 'clubs', 'sign', 'gamers',... | 0 |
677 | ['notmypresident', 'draintheswamp', 'real', 'r... | 1 |
3226 | ['election', '2016', 'current', 'votes', 'batt... | 1 |
789 | ['bus', 'driver', 'pulls', 'stop', 'suicidal',... | 0 |
... | ... | ... |
3504 | ['rio', 'olympics', 'today', 'us', 'swimmers',... | 0 |
3294 | ['jim', 'mattis', 'says', 'us', 'shoulder', 's... | 0 |
286 | ['colombian', 'opposition', 'peace', 'deal', '... | 0 |
65 | ['six', 'gulf', 'protectors', 'arrested', 'cha... | 1 |
5 | ['keiser', 'report', 'meme', 'wars', 'e99542',... | 1 |
4000 rows × 2 columns
We export our classification to a CSV file.
test_df_final.to_csv("article_classification.csv",index=True)
Let us plot our results and compare the terms frequency according to label in our training dataset and testing dataset
Let us plot the frequency of labels first
test_df_final['label'].value_counts()
0    2011
1    1989
Name: label, dtype: int64
1 stands for unreliable and 0 for reliable. The predicted classes are fairly well balanced.
Visualizing the true news with Wordcloud.
test_df_final[test_df_final['label']==0]
article | label | |
---|---|---|
2219 | ['elite', 'soccer', 'clubs', 'sign', 'gamers',... | 0 |
789 | ['bus', 'driver', 'pulls', 'stop', 'suicidal',... | 0 |
2719 | ['millions', 'risk', 'deportation', 'justices'... | 0 |
3681 | ['40', 'japan', 'confirmed', 'dead', 'earthqua... | 0 |
720 | ['good', 'cop', 'dead', 'cop', 'says', 'allege... | 0 |
... | ... | ... |
1114 | ['macy', 'close', '100', 'stores', 'erivals', ... | 0 |
4418 | ['jobs', 'donald', 'trump', 'celebrates', 'mas... | 0 |
3504 | ['rio', 'olympics', 'today', 'us', 'swimmers',... | 0 |
3294 | ['jim', 'mattis', 'says', 'us', 'shoulder', 's... | 0 |
286 | ['colombian', 'opposition', 'peace', 'deal', '... | 0 |
2011 rows × 2 columns
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in test_df_final[test_df_final['label']==0].article])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
%matplotlib inline
from wordcloud import WordCloud
all_words = ' '.join([text for text in test_df_final[test_df_final['label']==1].article ])
wordcloud = WordCloud(width= 800, height= 500,
max_font_size = 110,
collocations = False).generate(all_words)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Let us compare with our training dataset
Now we will plot the pareto chart to better visualize the frequencies of the words.
#The 20 most frequent words in articles predicted reliable
pareto(test_df_final[test_df_final['label']==0], "article", 20)
#The 20 most frequent words in articles predicted unreliable
pareto(test_df_final[test_df_final['label']==1], "article", 20)
From these visualisations, we can conclude that a number of keywords are particularly indicative of fake news. Indeed, we observe that these words are the same in our training and test datasets.
From our notebook we can conclude and recommend: