Scraping and Sentiment Analysis of CNN Articles Using NLP

Natural language processing (NLP) has become popular lately for its potential to automate many tasks that currently require human workers, and for helping data scientists better understand text. Today, I will show an example of how to apply web scraping and NLP to pull news articles from the internet, summarize them, and look at their sentiment.

There has long been a belief that the news focuses more on negative stories than on positive ones, because negative stories play on people's sense of fear and draw them in. See one recent scientific study showing such an effect here. In this project, I wanted to see whether some readily available NLP libraries would capture that negative bias in a set of web articles scraped from CNN.com.

The newspaper package was used to scrape the news articles from the web. The articles were then cleaned and summarized using the BART-large-CNN model from the Hugging Face Transformers library. The sentiment of the title, body, and summary was computed with the TextBlob package. TextBlob's sentiment models were not built for news text (its naive Bayes analyzer, for example, is trained on movie reviews), so I treat any result on news articles as a loose baseline for an initial analysis. Seaborn was used to visualize the data. The analysis is shared below as a Jupyter notebook. It suggests that the NLP summary captures the sentiment of the articles better than the original titles do.
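As a rough illustration of the pipeline described above, here is a minimal sketch; it is not the notebook's actual code. The function names (`fetch_article`, `clean_text`, `make_summarizer`, `textblob_polarity`, `score_article`), the whitespace-only cleaning step, and the summary length limits are my own assumptions. The `facebook/bart-large-cnn` checkpoint name and the `TextBlob(...).sentiment.polarity` call are the standard ways to reach the model and score mentioned in the text. The heavy libraries are imported inside the functions that need them, so the scoring logic can be tried out with stand-in functions.

```python
import re
from typing import Callable, Dict


def fetch_article(url: str) -> Dict[str, str]:
    """Download and parse one article with the newspaper package."""
    from newspaper import Article  # installed as newspaper3k

    article = Article(url)
    article.download()
    article.parse()
    return {"title": article.title, "body": article.text}


def clean_text(text: str) -> str:
    """Assumed cleaning step: collapse whitespace left over from scraping."""
    return re.sub(r"\s+", " ", text).strip()


def make_summarizer(model_name: str = "facebook/bart-large-cnn") -> Callable[[str], str]:
    """Build a summarize(text) function backed by a Transformers pipeline."""
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_name)

    def summarize(text: str) -> str:
        # truncation=True cuts off input beyond the model's length limit;
        # the max/min summary lengths are illustrative choices.
        result = summarizer(text, max_length=130, min_length=30, truncation=True)
        return result[0]["summary_text"]

    return summarize


def textblob_polarity(text: str) -> float:
    """TextBlob polarity score, from -1.0 (negative) to 1.0 (positive)."""
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


def score_article(
    article: Dict[str, str],
    summarize: Callable[[str], str],
    polarity: Callable[[str], float],
) -> Dict[str, float]:
    """Score the title, the cleaned body, and the generated summary."""
    body = clean_text(article["body"])
    summary = summarize(body)
    return {
        "title": polarity(article["title"]),
        "body": polarity(body),
        "summary": polarity(summary),
    }
```

With the three packages installed and network access available, `score_article(fetch_article(url), make_summarizer(), textblob_polarity)` would return one row of scores for a given article `url`. Passing the summarizer and scorer in as arguments also makes it easy to swap in a different sentiment model later.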

I compared the sentiment of each article's body text against that of its summary and its title. Surprisingly, the NLP sentiment package I used rated the articles as neutral to positive, even though I had always thought the news was negative. Maybe a custom classifier should be trained on news text?
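One simple way to make this comparison concrete is to measure how far the title and summary polarities sit from the body polarity, averaged over all articles. The sketch below is an assumption about how that could be done, not the metric used in the notebook. The helper names, the 0.05 neutrality threshold, and the two toy rows are made up purely for illustration and are not results from the scraped articles.

```python
from statistics import mean
from typing import Dict, List


def mean_abs_gap(rows: List[Dict[str, float]], field: str, reference: str = "body") -> float:
    """Average absolute difference between one field's polarity and the reference's."""
    return mean(abs(row[field] - row[reference]) for row in rows)


def label(polarity: float, threshold: float = 0.05) -> str:
    """Bucket a polarity score in [-1, 1]; the threshold is an arbitrary choice."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Toy polarity rows, invented for this example only.
toy_rows = [
    {"title": 0.0, "body": 0.10, "summary": 0.15},
    {"title": -0.2, "body": 0.05, "summary": 0.0},
]

title_gap = mean_abs_gap(toy_rows, "title")
summary_gap = mean_abs_gap(toy_rows, "summary")
print(f"title vs body gap:   {title_gap:.3f}")
print(f"summary vs body gap: {summary_gap:.3f}")
print([label(row["body"]) for row in toy_rows])
```

A smaller gap means that field tracks the body's sentiment more closely, which is one way to read the claim that summaries capture article sentiment better than titles.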

You can download the code here.

See the code and the results below in the embedded Jupyter notebook.

Summarize_Article_Check_Polarity
