This is part of the Python code I used to scrape all talks from Ted.com. Ted.com is a wonderful website where people can find inspiration and entertainment through thousands of short, powerful talks.

As of today (May 1, 2022), there are 5,033 talks on ted.com, spread across 154 pages. In order to scrape every talk on the site, I need to visit each page and scrape it.
Data to be collected from each talk:
=> title, speaker, posted_on date, talk_link
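For reference, here is the shape each record will take once scraped; the values below are placeholders I made up for illustration, not real data:

```python
# Hypothetical example of one record; field values are placeholders
{
    'title': '<talk title>',
    'speaker': '<speaker name>',
    'posted_on': '<date the talk was posted>',
    'talk_link': '/talks/<talk-slug>',  # TED links are relative paths
}
```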
```python
from requests_html import AsyncHTMLSession
import json
import asyncio
```
Make a list up front to store all the talks in later:
```python
all_talks = []
```
```python
def get_all_talks_pages_urls():
    links = []
    for i in range(1, 155):  # pages 1 through 154
        link = f'https://www.ted.com/talks?page={i}'
        links.append(link)
    return links
```
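A quick sanity check (my addition, not in the original post) to confirm the generated URLs look right:

```python
links = get_all_talks_pages_urls()
print(len(links))   # 154
print(links[0])     # https://www.ted.com/talks?page=1
print(links[-1])    # https://www.ted.com/talks?page=154
```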
Scraping one page is one task. For each page, collect the '.col' info about each talk and store it in a dictionary. *In order to run all tasks in one event loop, the AsyncHTMLSession is passed around as the `asession` parameter.*
```python
async def get_all_talks_from_one_page(url, asession):
    a_re = await asession.get(url)
    # The grid row that contains every talk card on the page
    row = a_re.html.find('div.row.row-sm-4up.row-lg-6up.row-skinny', first=True)
    all_cols = row.find('div.col')
    for col in all_cols:
        speaker = col.find('h4', first=True).text
        a_tag = col.find('div.media__message a.ga-link', first=True)
        title = a_tag.text
        talk_link = a_tag.attrs['href']  # a relative path like /talks/<slug>
        post_date = col.find('span.meta__val', first=True).text
        talk_dict = {
            'title': title,
            'speaker': speaker,
            'posted_on': post_date,
            'talk_link': talk_link,
        }
        all_talks.append(talk_dict)
    return all_talks
```
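Before launching all 154 pages, it can help to try a single page on its own. A minimal sketch, assuming a working network connection and using AsyncHTMLSession's built-in run() helper (which executes coroutine functions on the session's event loop); `sample` is my own name, not from the original code:

```python
# Try the scraper on page 1 only
asession = AsyncHTMLSession()
sample = asession.run(
    lambda: get_all_talks_from_one_page('https://www.ted.com/talks?page=1', asession)
)[0]
print(sample[:3])  # peek at the first few talk dictionaries
```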
Now create the top-level coroutine, which opens a new session and runs all the page tasks with asyncio.gather():
```python
async def scrape_all_talks():
    asession = AsyncHTMLSession()
    links = get_all_talks_pages_urls()
    tasks = (get_all_talks_from_one_page(link, asession) for link in links)  # generator of coroutines
    return await asyncio.gather(*tasks)
```
Run the top-level coroutine. **Note:** if you use a Jupyter Notebook, an event loop is already running, so awaiting the coroutine directly (as in the sketch below) will work. In a standalone Python script, this part should instead be asyncio.run(scrape_all_talks()).
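A minimal sketch of both invocations; the `results` name is mine, not from the original code:

```python
# In a Jupyter Notebook an event loop is already running, so just await it:
results = await scrape_all_talks()

# In a standalone Python script, use asyncio.run() instead:
# results = asyncio.run(scrape_all_talks())
```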
Save the talks dataset to a JSON file:
```python
with open('ted_talks.json', 'w') as file:
    json.dump(all_talks, file, indent=4)
```
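As an optional check (my addition), the file can be read back to confirm the record count matches the roughly 5,033 talks mentioned above:

```python
with open('ted_talks.json') as file:
    talks = json.load(file)
print(len(talks))  # expect about 5033
```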