
Find Most Frequent Words in an Article - Python Web Scraping with NLTK

This project involves a few Python libraries such as NLTK, Requests, and Requests-HTML. It also reinforces some fundamental usage of Python's data types such as the dictionary and the tuple.

What to achieve: 

Find an online article that you are interested in, scrape its text content, and save it as a txt file. From this text file, we do the word-frequency counting so that we can get a list of the 10 most frequently appearing words in the article.

First, we need to pip install the necessary libraries (requests, requests-html, and nltk), then import them:

import requests
from requests_html import HTML
import nltk
from nltk.corpus import stopwords

The webpage we are going to scrape has the following URL:

url = 'https://www.marketwatch.com/story/based-on-19-bear-markets-in-the-last-140-years-heres-where-the-current-downturn-may-end-says-bank-of-america-11651847842'

Next, with this URL and the help of the Requests library, we can scrape the whole HTML content of the page and save it to an html file:

def save_html_to_file(url, filename):
    # download the page and write its HTML to a local file
    response = requests.get(url)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return filename
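
If the plain requests.get() call comes back empty or with an error page, note that some sites only respond properly to requests that look like they come from a browser; passing a User-Agent header is a common workaround (the header value below is only an example, adjust as needed):

# Optional fallback: some sites require a browser-like User-Agent header
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})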

Now, with the whole HTML content saved to a file, we can read it and retrieve all the text content of the webpage.

# All the article text is inside the element with the id "js-article__body"
def retrieve_text(from_file, to_file):
    
    with open(from_file, 'r', encoding='utf-8') as f:
        html_file = f.read()
        
        # use requests_html -> HTML to parse and retrieve the content
        html_parsed = HTML(html=html_file)
        article_body = html_parsed.find('#js-article__body', first=True)

        # All the text is inside "p" tags
        paragraphs = article_body.find('p') 
        text = ''
        for p in paragraphs:
            text = text +"\n" + p.text

        # save the text to a file for further analysis
        with open(to_file, 'w', encoding='utf-8') as f:
            f.write(text)
    return text

After defining the methods for retrieving all the text content from the URL, let's call them and get the text content:

text = retrieve_text(save_html_to_file(url, 'article.html'), 'text.txt')

This is the text content scraped from the web page:

At nearly the halfway mark in a volatile year of trading, the S&P 500 index is down, but not out to the point of an official bear market yet.

According to a widely followed definition, a bear market occurs when a market or security is down 20% or more from a recent high. The S&P 500 SPX, -0.57%, is off 13.5% from a January high of 4,796, which for now, just means correction territory, often defined as a 10% drop from a recent high. The battered Nasdaq Composite COMP, -1.40%, meanwhile, is currently down 23% from a November 2021 high.

That S&P bear market debate is raging nonetheless, with some strategists and observers saying the S&P 500 is growling just like a bear market should. Wall Street banks like Morgan Stanley have been saying the market is getting close to that point.

Read: A secular bear market is here, says this money manager. These are the key steps for investors to take now.

But should the S&P 500 officially enter the bear’s lair, Bank of America strategists led by Michael Hartnett, have calculated just how long the pain could last. Looking at a history of 19 bear markets over the past 140 years, they found the average price decline was 37.3% and the average duration about 289 days.

While “past performance is no guide to future performance,” Hartnett and the team say the current bear market would end Oct. 19 of this year, with the S&P 500 at 3,000 and the Nasdaq Composite at 10,000. Check out their chart below:

The “good news” is that many stocks have already reached this point, with 49% of Nasdaq constituents more than 50% below their 52-week highs, and 58% of the Nasdaq more than 37.3% down, with 77% of the index in a bear market. More good news? “Bear markets are quicker than bull markets,” say the strategists.

The bank’s latest weekly data released on Friday, showed another $3.4 billion coming out of stocks, $9.1 billion from bonds and $14 billion from cash. They note many of those moves were “risk off” headed into the recent Federal Reserve meeting.

While the Fed tightened policy as expected again this week, uncertainty over whether its stance is any less hawkish than previously believed, along with concerns that the central bank may not be able to tighten policy without triggering an economic downturn, left stocks dramatically weaker on Thursday, with more selling under way on Friday.

The strategists offer up one final factoid that may also give investors some comfort. Hartnett and the team noted that for every $100 invested in equities over the past year or so, only $3 has been redeemed.

As well, the $1.1 trillion that has flowed into equities since January 2021 had an average entry point of 4,274 on the S&P 500, meaning those investors are “underwater but only somewhat,” said Hartnett and the team.

Since the text has been saved to a file, we can open the file and retrieve the content anytime we want for analysis.

with open('text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

Now let's create a method to get the most frequent words from the text. Before that, however, we should get to know the concept of stop words.

Stop words are the most commonly used words in a language, e.g. “a”, “the”, “is”, “in”, "and", etc. They usually carry little useful information relative to the context, so in text mining and natural language processing (NLP) they are often removed from the corpus.
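
For a quick look at what NLTK provides (assuming the stopwords corpus has already been downloaded, see the note further below):

# Peek at NLTK's English stop word list
english_stops = set(stopwords.words('english'))
print(len(english_stops))           # a couple of hundred short, common words
print(sorted(english_stops)[:5])    # e.g. ['a', 'about', 'above', 'after', 'again']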

# list top 10 frequent words in the article
def get_most_frequent_words(text, num_of_words):

    # get all words in a list in lower case
    words = text.split() 
    lower_words = [w.lower() for w in words if w.isalpha()] 

    # get stop words set from NLTK so that those can be removed from the 'lower_words' list
    stops = set(stopwords.words('english'))
    word_dict = {}
    for w in lower_words:
        if w not in stops:
            word_dict[w] = word_dict.get(w, 0) +1  ### see NOTE1 below

    words_list = list()
    for k,v in word_dict.items():
        # create a tuple with k,v position reversed, save to the words_list
        words_list.append((v, k))

    # sort the list. 
    sorted_words = sorted(words_list, reverse=True)[:num_of_words] ### see NOTE2 below
    print(sorted_words) 
    return sorted_words

get_most_frequent_words(text, 10)

The output should be like this:
[(8, 'market'), (8, 'bear'), (4, 'nasdaq'), (3, 'strategists'), (3, 'recent'), (3, 'investors'), (3, 'hartnett'), (3, 'billion'), (3, 'average'), (2, 'year')]

*In case the stop words corpus is not available, run nltk.download('stopwords') anywhere before this method to download the stopwords package first.*
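
For example, a one-time download right after the imports is enough; later runs will find the corpus locally:

# Download the stop words corpus (only needed once per environment)
nltk.download('stopwords')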

From the top 10 most frequent words, we can get a sense of the article's main points: it is an article about the recent bear-market analysis.

Note1: The Python dictionary has a very useful method, get(). Its syntax is dictionary.get(keyname, value). It works like this: first it checks whether the key is already in the dictionary; if yes, it returns the value of that key, otherwise it returns the default value passed to the method. This is a very common pattern for counting frequencies with a dictionary and its get() method, as sketched below.
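
A minimal sketch of that counting idiom, with a few made-up words:

# Count word occurrences with dict.get(key, default)
counts = {}
for w in ['bear', 'market', 'bear']:
    counts[w] = counts.get(w, 0) + 1   # 0 is used when the word is not in the dict yet
print(counts)   # {'bear': 2, 'market': 1}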

Note2: In Python, tuples are compared lexicographically, element by element: the first item of the first tuple is compared to the first item of the second tuple; if they are equal, the second items are compared, then the third, and so on. For this project, we use the built-in sorted() function to sort the list; since the list's items are tuples whose first element is an integer, the tuples get sorted by those integers first. For a descending order of the sorted_words list, use reverse=True. A small illustration follows.
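
Here is how a few (count, word) tuples sort:

# Tuples compare element by element, so the counts decide the order first
pairs = [(8, 'market'), (8, 'bear'), (4, 'nasdaq'), (2, 'year')]
print(sorted(pairs, reverse=True))
# [(8, 'market'), (8, 'bear'), (4, 'nasdaq'), (2, 'year')]

Note that when two words have the same count, reverse=True also puts them in reverse alphabetical order, which is why 'market' comes before 'bear' in the output above.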

Nice.

If you have any comments, I will be glad to see them below.

To get the source code from GitHub, click here
