

Scrape Ted Talks and Videos Dataset from ted.com - A Data Science Project

This is part of the Python code I used to scrape all the talks from Ted.com.

Ted.com is a wonderful website where people can find inspiration and entertainment through thousands of short, powerful talks. As of today (May 1, 2022), there are 5,033 talks on ted.com spread across 154 pages, so to scrape every talk on the site I need to visit each page and scrape it.

  Data to be collected from each talk: 
        => title, speaker, posted_on date, talk_link
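For illustration, each scraped record will end up as a small dictionary shaped like the one below (the values are made-up placeholders, not real data):

# Hypothetical example of one record (placeholder values only)
{
    'title': 'An example talk title',
    'speaker': 'Example Speaker',
    'posted_on': 'May 2022',
    'talk_link': '/talks/an_example_talk_title'
}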
 
from requests_html import AsyncHTMLSession
import json
import asyncio

# Make a list to store all talks later
all_talks = []

def get_all_talks_pages_urls():
    links = []
    for i in range(1, 155):   # pages 1 through 154
        link = f'https://www.ted.com/talks?page={i}'
        links.append(link)
    return links

Scraping one page is one task: for each page, collect every '.col' element, extract the info about each talk, and store it in a dictionary.
 
*In order to run all tasks in one event loop, asession is passed around as the shared AsyncHTMLSession.*

async def get_all_talks_from_one_page(url, asession):
    a_re = await asession.get(url)
    # The grid that holds all talks on the page; each 'div.col' is one talk
    row = a_re.html.find('div.row.row-sm-4up.row-lg-6up.row-skinny', first=True)
    all_cols = row.find('div.col')

    for col in all_cols:
        speaker = col.find('h4', first=True).text
        a_tag = col.find('div.media__message a.ga-link', first=True)
        title = a_tag.text
        talk_link = a_tag.attrs['href']
        post_date = col.find('span.meta__val', first=True).text

        talk_dict = {
            'title': title,
            'speaker': speaker,
            'posted_on': post_date,
            'talk_link': talk_link
        }
        all_talks.append(talk_dict)
    return all_talks

Now create the top-level coroutine that opens a new session and runs all the tasks with asyncio.gather().

async def scrape_all_talks():
    asession = AsyncHTMLSession()
    links = get_all_talks_pages_urls()
    tasks = (get_all_talks_from_one_page(link, asession) for link in links) # generator
    return await asyncio.gather(*tasks)

Run the top-level coroutine using asyncio.run()
 
 **Note:** 
If you use a Jupyter Notebook, there is already an event loop running, so the following await call will work directly. In a regular Python script, this part should be asyncio.run(scrape_all_talks()) instead.

await scrape_all_talks()
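
For a standalone script, a minimal entry point could look like the sketch below (the __main__ guard is just standard Python convention, not something specific to this project):

import asyncio

if __name__ == '__main__':
    # Outside Jupyter there is no event loop running yet, so start one here
    asyncio.run(scrape_all_talks())
    print(f'Collected {len(all_talks)} talks')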

Save the talks dataset to a json file

with open('ted_talks.json', 'w') as file:
    json.dump(all_talks, file, indent=4)
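
As an optional sanity check (my addition here, assuming pandas is installed), the saved file can be read back into a DataFrame to confirm the records look right:

import pandas as pd

# Read the JSON array of talk records back into a DataFrame
df = pd.read_json('ted_talks.json')
print(df.shape)
print(df.head())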
