
Scrape Top Box Office Movies Worldwide with Python Requests-html - Data Science Project

For today's web scraping project, I use boxofficemojo.com. 

Tools used: Pandas, Requests-HTML, and Jupyter Notebook.

This website doesn't have an API for developers, so I'm going to use Python's Requests library to fetch the HTML content, and the Requests-HTML library to parse it and extract the movie data we need.

import requests
import os.path
import pandas as pd
from requests_html import HTML

BoxOfficeMojo.com has a very simple URL construction format. Its worldwide top box office pages have a common base URL:

https://www.boxofficemojo.com/year/world/

then a four-digit year is appended at the end. For example, to view the top box office movie list for 2022, go to:

https://www.boxofficemojo.com/year/world/2022

My plan is to save each year's dataset into its own CSV file, then use Pandas for further analysis across all the years.

Step 1: Create a function that takes a 'year' parameter. This function returns two things:

  • a year-specific file path for saving the data to a CSV file later
  • a year-specific URL for scraping that year's movie list

def create_file_path_and_url_by_year(year=2022):
    url = f'https://www.boxofficemojo.com/year/world/{year}'
    # Get the directory that this script lives in
    dir_n = os.path.dirname(__file__)

    # Make a directory named 'data' for the dataset
    path = os.path.join(dir_n, 'data')
    os.makedirs(path, exist_ok=True)

    # Assemble the year-specific file path inside that folder
    filepath = os.path.join(path, f'movie_{year}.csv')

    return filepath, url
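As a quick sanity check (a hypothetical example; the exact path depends on where the script lives):

filepath, url = create_file_path_and_url_by_year(2019)
print(url)       # https://www.boxofficemojo.com/year/world/2019
print(filepath)  # <script directory>/data/movie_2019.csv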

Step 2: Once the URL is created, we need to fetch the page and retrieve its HTML content. I used the Requests module for this part of the task. The HTML content can be saved for further analysis, but for this project I only need to return it so that I can use it later.

def url_to_html(year, save=False):
    filepath, url = create_file_path_and_url_by_year(year)

    # Use the Requests module to fetch the whole page's HTML content
    r = requests.get(url)
    if r.status_code == 200:
        # 'save' is a placeholder; for this project we only return the HTML
        return r.text, filepath
    return None, filepath
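To smoke-test the fetch (my example, not from the original post):

html_content, filepath = url_to_html(2019)
if html_content is not None:
    print(html_content[:100])  # the first 100 characters of the page HTML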
	

Now comes the fun part.

Step 3: With the HTML content handy, we can now parse it and extract whatever we need from it. For parsing HTML, the Beautiful Soup library is often used, but for this project I prefer the Requests-HTML library. If you don't already have it, install it first with pip (pip install requests-html). We'll use its HTML class for parsing the HTML this time.

Now let's see the code.

def parse_and_extract_and_save(year):
    html_content, filepath = url_to_html(year)
    if html_content is not None:
        # Use the HTML class to parse the content
        scraped_html = HTML(html=html_content)
        table = scraped_html.find('#table', first=True)

        # SEE NOTE BELOW
        inner_table = table.find('table', first=True)

        # Get all rows from the tr tags
        rows = inner_table.find('tr')

        # The first row happens to be the column titles
        headers = rows[0].text.split('\n')
        # This is the output =>
        # ['Rank', 'Release Group', 'Worldwide', 'Domestic', '%', 'Foreign', '%']

        # Use a list comprehension to collect the text from each row,
        # then append it to a list called 'data'
        data = []
        for row in rows[1:]:
            td_s_in_each_row = [td.text for td in row.find('td')]
            data.append(td_s_in_each_row)

        # Use Pandas to convert the list into a DataFrame,
        # then save the DataFrame to a CSV file
        df = pd.DataFrame(data, columns=headers)
        df.to_csv(filepath, index=False)

## NOTE:    In the browser there appear to be 3 inner table elements within this #table element, but some of them are generated by JavaScript. We can run the script in Python's interactive mode (with -i) to test what the raw HTML really contains. When we call table.find('table'), the result is just one table from the HTML source; the missing two are actually generated by JavaScript.
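For example, assuming the script is saved as scrape.py (the filename is my assumption), run 'python -i scrape.py' and inspect things in the REPL:

>>> html_content, filepath = url_to_html(2019)
>>> scraped_html = HTML(html=html_content)
>>> table = scraped_html.find('#table', first=True)
>>> len(table.find('table'))   # only ONE table exists in the raw HTML
1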

## NOTE:    We can see the same thing with the browser dev tools. In the dev tools, press Ctrl+Shift+P to open the Run command panel, search for 'disable', and choose 'Disable JavaScript' from the list of options that pops up. After refreshing the page, the Elements tab will show only the raw HTML content. This time only ONE table element is shown; the other two JavaScript-generated tables are gone.

Step 4: Now we can test our code. Let's choose 2019 for the year and call parse_and_extract_and_save(2019). A new file, movie_2019.csv, should be created under the 'data' folder; that folder is created inside create_file_path_and_url_by_year() the first time it runs.

parse_and_extract_and_save(2019)

To get the box office data for each year from 2012 through 2022, we can run a for loop like this:

for year in range(2012, 2023):
    parse_and_extract_and_save(year)

Voila! We've scraped all the top box office listings into our CSV files. We can use Pandas to analyze these files for many different purposes. To easily visualize and analyze the dataset, Jupyter Notebook is handy: open a notebook, import Pandas, load the CSV file you need back into a DataFrame, and you can view the dataset as a table.
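For example, a minimal notebook-cell sketch (the cleanup step assumes the money columns come through as text like '$1,234,567'):

import pandas as pd

df = pd.read_csv('data/movie_2019.csv')

# Strip '$' and ',' from the 'Worldwide' column so it can be
# treated as numbers rather than strings (assumption about the raw data)
df['Worldwide'] = pd.to_numeric(
    df['Worldwide'].str.replace('[$,]', '', regex=True), errors='coerce')

df.sort_values('Worldwide', ascending=False).head()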

However, for the purpose of this project, I won't go beyond this.


 

This is the GitHub repo for this project.

Connect with me on LinkedIn

