Natural language processing (NLP) has become popular lately for its potential to automate many tasks that still require human workers, and for helping data scientists build a better understanding of text. Today, I will show an example of how to use web scraping and NLP to pull news articles from the internet, summarize them, and analyze their sentiment.
There has long been a belief that the news focuses more on negative issues than on positive ones because negativity triggers people's sense of fear and draws them in. See one recent scientific study showing such an effect here. In this project, I wanted to see if some readily available NLP libraries would capture that negative bias for a set of web articles scraped from CNN.com.
The newspaper package was used to scrape the news articles from the web. The articles were then cleaned and summarized with the BART-large-CNN model from the Hugging Face Transformers library. The sentiment of the title, body, and summary was computed with the TextBlob package. TextBlob uses a Naive Bayes classifier trained on movie reviews, so I treat any result from analyzing news articles as a loose baseline for an initial analysis. Seaborn was used to visualize the data. The analysis is shared below as a Jupyter notebook. It suggests the NLP summary captures the sentiment of the articles better than the original titles.
I compared the sentiment of each article's body text against its summary and title. Surprisingly, the NLP sentiment package I used found the articles to be neutral to positive; I had always assumed the news skewed negative. Maybe a custom classifier should be trained?
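If I do end up training one, TextBlob ships a trainable Naive Bayes classifier. Below is a minimal sketch, assuming a handful of made-up labeled headlines stand in for a real training set:

# A minimal sketch of a custom TextBlob classifier; the training headlines are made up.
from textblob.classifiers import NaiveBayesClassifier

train = [('Markets rally as economy rebounds', 'pos'),
         ('Community celebrates new park opening', 'pos'),
         ('Storm leaves thousands without power', 'neg'),
         ('Officials warn of rising crime rates', 'neg')]

news_classifier = NaiveBayesClassifier(train)
# Classify an unseen headline; a real classifier would need far more training data.
print(news_classifier.classify('Wildfire forces evacuations across the county'))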
You can download the code here.
See the code and the results below in the embedded Jupyter notebook.
Scraping and Sentiment Analysis of CNN Articles Using NLP
Import the necessary packages.
import newspaper
import transformers
import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob
import seaborn as sns
Crawl the CNN website using the newspaper web crawling package.
# link = 'https://www.cnn.com'
# # Scans the webpage and finds all the links on it.
# page_features = newspaper.build(link, language='en', memoize_articles=False)
# # Initialize a list for article titles and text.
# title_text = list()
# # The page_features object contains article objects that are initialized with links to the web pages.
# for article in page_features.articles:
#     try:
#         # Each article must be downloaded, then parsed individually.
#         # This loads the text and title from the webpage to the object.
#         article.download()
#         article.parse()
#         # Keep the text, title and URL from the article and append to a list.
#         title_text.append({'title': article.title,
#                            'body': article.text,
#                            'url': article.url})
#     except Exception:
#         # If the download fails for any reason, continue the loop.
#         print("Article Download Failed.")
# # Save as a dataframe to avoid excessive calls on the web page.
# articles_df = pd.DataFrame.from_dict(title_text)
# articles_df.to_csv(r'CNN_Articles_Oct15_21.csv')
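As an aside, the same package can also fetch a single page without crawling the whole site. A rough sketch, using a placeholder URL rather than a real article link:

# Sketch: download and parse one article with newspaper's Article class (placeholder URL).
from newspaper import Article

single_article = Article('https://www.cnn.com/2021/10/15/example-article')
single_article.download()   # fetch the HTML
single_article.parse()      # extract the title and body text
print(single_article.title)
print(single_article.text[:200])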
Check and clean the article data
# Load the dataframe from the checkpoint.
articles_df = pd.read_csv(r'CNN_Articles_Oct15_21.csv')
# Drop any NaNs and the leftover CSV index column from the dataframe.
articles_df = articles_df.dropna().iloc[:,1:]
# Plot the distribution of body and title length.
fig, (ax1,ax2) = plt.subplots(1,2)
articles_df['title'].apply(lambda x: len(x)).hist(ax=ax1)
ax1.set_title('Title Length Distribution')
articles_df['body'].apply(lambda x: len(x)).hist(ax = ax2)
ax2.set_title('Body Length Distribution')
# Get the character length of each article.
len_df = articles_df.applymap(lambda x:len(x))
# Flag articles where the title is longer than the body (for example, video articles).
len_df['title_gt_body'] = len_df['title'] > len_df['body']
print(len_df['title_gt_body'].sum()/len_df.shape[0])
# Flag non-English (Spanish and Arabic) articles that were downloaded.
len_df['spanish'] = articles_df['url'].astype(str).str.contains('cnnespanol|arabic')
print(len_df['spanish'].sum()/len_df.shape[0])
# Combine the two flags into a single mask of articles to drop.
len_df['mask'] = len_df['title_gt_body']|len_df['spanish']
print(len_df['mask'].sum()/len_df.shape[0])
# Finish the cleaning, remove bad samples.
article_df_clean = articles_df[~len_df['mask']]
len_clean = len_df[~len_df['mask']]
# Plot histograms of the article body and title lengths after cleaning.
fig, (ax1,ax2) = plt.subplots(1,2)
len_df[~len_df['mask']]['title'].hist(ax= ax1)
ax1.set_title('Title Length Distribution \n after Cleaning')
len_df[~len_df['mask']]['body'].hist(ax= ax2)
ax2.set_title('Body Length Distribution \n after Cleaning')
0.0
0.052901023890784986
0.052901023890784986
[Histograms of title and body length after cleaning]
The titles appear to be roughly normally distributed around 60 characters, with a slight skew toward higher character counts. The body length is centered around 5,000 characters, but some articles are exceptionally long. Cleaning the data had very little effect on the distributions.
Use NLP to summarize the articles and save the results.
# Set up a pipeline with a PyTorch/Hugging Face NLP summarization model.
# BART-large-CNN is a transformer fine-tuned on the CNN/DailyMail news dataset; it accepts up to 1,024 tokens of input.
smr_bart = transformers.pipeline(task="summarization", model="facebook/bart-large-cnn")
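Note that the 1,024 limit is counted in model tokens, not whitespace words, so the word-based truncation in the loop below is only an approximation. A quick sketch of checking the token count with the pipeline's own tokenizer, assuming the first cleaned article is a representative sample:

# Sketch: compare the whitespace word count to the tokenizer's token count for one article.
sample_body = article_df_clean['body'].iloc[0]
token_ids = smr_bart.tokenizer.encode(sample_body, truncation=True, max_length=1024)
print(len(sample_body.split()), 'words ->', len(token_ids), 'tokens (capped at 1024)')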
# Initialize a list for the summaries.
summary_list = list()
# Get a summary for each article.
for ind,x in article_df_clean.iterrows():
    # Split the text into words.
    body = x['body']
    body = body.split()
    try:
        # Check the word count; only analyze up to the first 750 words.
        if len(body)>750:
            body = body[0:750]
        # Put the words back together into one string for the NLP pipeline.
        body = ' '.join(body)
        # Calculate the NLP summary using the model pipeline.
        # Make the summary roughly as long as the title.
        # (max_length is counted in tokens while len(title) is characters, so this is approximate.)
        summary = smr_bart(body, max_length=len(x['title']))[0]['summary_text']
        summary_list.append({'index': ind, 'summary': summary})
    except Exception:
        # If there are any failures, print the index for debugging later.
        print('Failure on Index# ' + str(ind))
# Make the summaries into a dataframe.
summary_df = pd.DataFrame.from_dict(summary_list).set_index(keys='index')
# Merge the summaries back with the articles.
article_summary_df = pd.merge(summary_df,article_df_clean,
left_index=True,right_index=True,how='left')
article_summary_df.to_pickle(r'CNN_Articles_wSummaries_Oct15_21.pkl')
Load the checkpoint and analyze the results.
article_summary_df = pd.read_pickle(r'CNN_Articles_wSummaries_Oct15_21.pkl')
Use the TextBlob package to estimate the polarity of the title, body, and summary.
polarity_df = article_summary_df[['title','body','summary']].\
applymap(lambda x: TextBlob(x).sentiment.polarity)
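As a quick sanity check of the scale, TextBlob's polarity is a float between -1 (most negative) and +1 (most positive). A small sketch with made-up sentences:

# Sketch: illustrate the polarity scale on a few made-up sentences.
for sentence in ['The rescue was a wonderful success.',
                 'The flooding caused terrible damage.',
                 'The meeting was held on Tuesday.']:
    print(sentence, '->', TextBlob(sentence).sentiment.polarity)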
Display a histogram of the polarity calculations.
polarity_df.hist()
[Histograms of title, body, and summary polarity]
Surprisingly, the title and body polarities appear to be distributed slightly positive, while the summary polarities appear to be distributed slightly negative.
Print summary statistics of the polarity dataframe.
print(polarity_df.describe())
             title        body     summary
count   531.000000  531.000000  531.000000
mean      0.032714    0.108741    0.078354
std       0.252437    0.088247    0.176488
min      -1.000000   -0.475000   -0.750000
25%       0.000000    0.059436    0.000000
50%       0.000000    0.103515    0.065000
75%       0.056250    0.156731    0.172421
max       1.000000    0.550000    0.800000
On average, the title, body, and summary polarities are all slightly positive.
Plot a grid of the results.
g = sns.PairGrid(polarity_df)
g.map_upper(sns.regplot)
g.map_lower(sns.regplot)
g.map_diag(sns.histplot, kde=True)
[Pair grid of title, body, and summary polarity with regression fits]
The squared Pearson correlation coefficient (r²) suggests fit quality.
print(polarity_df.corr().pow(2))
            title      body   summary
title    1.000000  0.055119  0.015707
body     0.055119  1.000000  0.130003
summary  0.015707  0.130003  1.000000
The correlation between the summary and the body (r² ≈ 0.13) is more than twice as strong as the correlation between the title and the body (r² ≈ 0.055). This suggests that the BART summaries capture the sentiment of the article body better than the titles do.
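To check whether the body-summary relationship is more than noise, one could also look at p-values. A sketch using SciPy (an extra dependency not imported above):

# Sketch: Pearson r and p-value for the body-summary polarity pair (requires scipy).
from scipy.stats import pearsonr

r, p = pearsonr(polarity_df['body'], polarity_df['summary'])
print('body vs. summary: r^2 =', round(r**2, 3), ', p-value =', p)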