

Scrape Ted Talks and Videos Dataset from ted.com - A Data Science Project

This is part of the Python code I used to scrape all the talks from Ted.com.

Ted.com is a wonderful website where people can find inspiration and entertainment through thousands of short, powerful talks. As of today (May 1, 2022), there are 5,033 talks on ted.com spread across 154 pages, so to scrape every talk on the site I need to visit each page and scrape it.

  Data to be collected from each talk: 
        => title, speaker, posted_on date, talk_link
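For illustration, each scraped record will end up as a small dictionary shaped like the one below (the values are made-up placeholders, not real data):

# Hypothetical example of one record (placeholder values only)
{
    'title': 'An example talk title',
    'speaker': 'Example Speaker',
    'posted_on': 'May 2022',
    'talk_link': '/talks/an_example_talk_title'
}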
 
from requests_html import AsyncHTMLSession
import json
import asyncio

# Make a list to store all talks later
all_talks = []

def get_all_talks_pages_urls():
    links = []
    for i in range(1, 155):   # pages 1 through 154
        link = f'https://www.ted.com/talks?page={i}'
        links.append(link)
    return links

Scraping one page is one task: for each page, collect every '.col' element, extract the info about each talk, and store it in a dictionary.
 
*In order to run all tasks in one event loop, asession is passed around as the shared AsyncHTMLSession.*

async def get_all_talks_from_one_page(url, asession):
    a_re = await asession.get(url)
    # The grid that holds all talks on the page; each 'div.col' is one talk
    row = a_re.html.find('div.row.row-sm-4up.row-lg-6up.row-skinny', first=True)
    all_cols = row.find('div.col')

    for col in all_cols:
        speaker = col.find('h4', first=True).text
        a_tag = col.find('div.media__message a.ga-link', first=True)
        title = a_tag.text
        talk_link = a_tag.attrs['href']
        post_date = col.find('span.meta__val', first=True).text

        talk_dict = {
            'title': title,
            'speaker': speaker,
            'posted_on': post_date,
            'talk_link': talk_link
        }
        all_talks.append(talk_dict)
    return all_talks

Now create the top-level coroutine that opens a new session and runs all the tasks with asyncio.gather().

async def scrape_all_talks():
    asession = AsyncHTMLSession()
    links = get_all_talks_pages_urls()
    tasks = (get_all_talks_from_one_page(link, asession) for link in links) # generator
    return await asyncio.gather(*tasks)

Run the top-level coroutine using asyncio.run()
 
 **Note:** 
If you use a Jupyter Notebook, there is already an event loop running, so the following await call will work directly. In a regular Python script, this part should be asyncio.run(scrape_all_talks()) instead.

await scrape_all_talks()
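
For a standalone script, a minimal entry point could look like the sketch below (the __main__ guard is just standard Python convention, not something specific to this project):

import asyncio

if __name__ == '__main__':
    # Outside Jupyter there is no event loop running yet, so start one here
    asyncio.run(scrape_all_talks())
    print(f'Collected {len(all_talks)} talks')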

Save the talks dataset to a json file

with open('ted_talks.json', 'w') as file:
    json.dump(all_talks, file, indent=4)
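
As an optional sanity check (my addition here, assuming pandas is installed), the saved file can be read back into a DataFrame to confirm the records look right:

import pandas as pd

# Read the JSON array of talk records back into a DataFrame
df = pd.read_json('ted_talks.json')
print(df.shape)
print(df.head())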
