YouTube Videos Metadata & Comments Scraper
I have been getting many requests to write a script to scrape YouTube video metadata and comments. So instead of replying to everyone separately, I thought of writing a blog post so that it would be easy for everyone to go through, and it could serve as a reference in the future as well.
I have written two different scripts, though both share almost the same code. One scrapes data using the ID of a YouTube video, and the other scrapes data for the top 10 videos in the search results for any term.
First of all, we need a DEVELOPER_KEY for the YouTube Data API for this script to work. You can grab one here.
I am using an external library called pafy to fetch some metadata about YouTube videos. You can find more details about it here.
After successful scraping, I store all that data in a CSV file, using Python's built-in csv library.
Now it's time to do some scraping:
Import all required libraries into our file.
from apiclient.discovery import build
from apiclient.errors import HttpError
from oauth2client.tools import argparser
import pafy
import csv
Now it's time to add our developer key and build the youtube client.
DEVELOPER_KEY = "#AddYourDeveloperKey"
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"
pafy.set_api_key("#AddYourAPIKey")
youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                developerKey=DEVELOPER_KEY)
We will take the YouTube video ID as input and turn it into a full YouTube URL.
videoId = raw_input("ID of youtube video : \n")
url = "https://www.youtube.com/watch?v=" + videoId
Requesting metadata from pafy:
video = pafy.new(url)
It's time to get all the comments from that YouTube video and save them into a list. The maximum number of results you can get per request is 100, so if a video has more than 100 comments we need to call the same function repeatedly, following the nextPageToken, to get all the comments.
results = youtube.commentThreads().list(
    part="snippet",
    maxResults=100,
    videoId=videoId,
    textFormat="plainText"
).execute()
totalResults = int(results["pageInfo"]["totalResults"])
count = 0
nextPageToken = ''
comments = []
further = True
first = True
while further:
    halt = False
    if first == False:
        print "."
        try:
            results = youtube.commentThreads().list(
                part="snippet",
                maxResults=100,
                videoId=videoId,
                textFormat="plainText",
                pageToken=nextPageToken
            ).execute()
            totalResults = int(results["pageInfo"]["totalResults"])
        except HttpError, e:
            print "An HTTP error %d occurred:\n%s" % (e.resp.status, e.content)
            halt = True
    if halt == False:
        count += totalResults
        for item in results["items"]:
            comment = item["snippet"]["topLevelComment"]
            author = comment["snippet"]["authorDisplayName"]
            text = comment["snippet"]["textDisplay"]
            comments.append([author, text])
        if totalResults < 100:
            further = False
            first = False
        else:
            further = True
            first = False
            try:
                nextPageToken = results["nextPageToken"]
            except KeyError, e:
                print "A KeyError occurred: %s" % (e)
                further = False
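The page-following pattern above can also be sketched as a small generator. This is just an illustrative Python 3 sketch, not part of the original script: fetch_page is a hypothetical stand-in for the commentThreads().list(...).execute() call, and the response dicts mimic the API's shape ("items" plus an optional "nextPageToken").

```python
def iter_comment_pages(fetch_page):
    """Yield items from every page, following nextPageToken until exhausted.

    fetch_page(page_token) is assumed to return a dict shaped like a
    YouTube API response: {"items": [...], "nextPageToken": "..."},
    with "nextPageToken" absent on the last page.
    """
    token = None
    while True:
        page = fetch_page(token)
        for item in page.get("items", []):
            yield item
        token = page.get("nextPageToken")
        if not token:
            break

# Usage with a fake two-page "API" standing in for the real client:
pages = {
    None: {"items": [{"author": "a"}, {"author": "b"}], "nextPageToken": "p2"},
    "p2": {"items": [{"author": "c"}]},
}
comments = list(iter_comment_pages(lambda tok: pages[tok]))
```

The generator keeps the paging bookkeeping in one place, so the caller only sees a flat stream of comment items.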
Now it's time to add all the data to our CSV file.
add_data(videoId, video.title, video.description, video.author,
         video.published, video.viewcount, video.duration, video.likes,
         video.dislikes, video.rating, video.category, comments)
The following code is used to append our data to a CSV file.
def add_data(vID, title, description, author, published, viewcount,
             duration, likes, dislikes, rating, category, comments):
    data = [vID, title, description, author, published, viewcount,
            duration, likes, dislikes, rating, category, comments]
    with open("scraper.csv", "a") as fp:
        wr = csv.writer(fp, dialect='excel')
        wr.writerow(data)
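To sanity-check what lands in the file, you can write a row the same way add_data does and read it straight back with csv.reader. This is a self-contained Python 3 sketch; the file name scraper_demo.csv and the sample row are just for illustration.

```python
import csv

# Write one row using the same excel dialect as add_data, then read it
# back to verify the columns survive the round trip.
row = ["dQw4w9WgXcQ", "Sample title", "A description", "Some Author"]
with open("scraper_demo.csv", "w", newline="") as fp:
    csv.writer(fp, dialect="excel").writerow(row)

with open("scraper_demo.csv", newline="") as fp:
    rows = list(csv.reader(fp))
```

Note that in Python 3 the file should be opened with newline="" so the csv module handles line endings itself; the Python 2 script above opens the file in plain append mode instead.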
This way, we can get all the metadata and comments of a YouTube video.
Now, a simple extension of this script is to get the same data for the top 10 search results.
For this, I take the search term as input and call the YouTube API for the search results. From those results, I take the top 10 video IDs and run the above script on each of them to get all the required data.
searchTerm = raw_input("Term you want to Search : \n")
search_response = youtube.search().list(
    q=searchTerm,
    part="id,snippet",
    maxResults=30
).execute()
count = 0
for search_result in search_response.get("items", []):
    if search_result["id"]["kind"] == "youtube#video":
        if count < 10:
            vID = search_result["id"]["videoId"]
            get_data(vID)
            count += 1
        else:
            break
    else:
        continue
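The filtering step in that loop can be isolated into a small, testable helper. This is a hypothetical Python 3 sketch: top_video_ids is a name I made up, and the item dicts only mimic the shape of the API's search results (a search can also return channels and playlists, which is why the kind check matters).

```python
def top_video_ids(items, n=10):
    """Return the first n videoId values, skipping channel/playlist results."""
    ids = []
    for item in items:
        if item["id"]["kind"] == "youtube#video":
            ids.append(item["id"]["videoId"])
            if len(ids) == n:
                break
    return ids

# Fake search items mixing video results with a channel result:
items = [
    {"id": {"kind": "youtube#video", "videoId": "v1"}},
    {"id": {"kind": "youtube#channel", "channelId": "c1"}},
    {"id": {"kind": "youtube#video", "videoId": "v2"}},
]
ids = top_video_ids(items)
```

Keeping the filter pure like this makes it easy to check the top-10 logic without hitting the API at all.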
You can check out the full scripts in this repo here. Fork or star it if you like it.
You can mail me at email@example.com for any queries or doubts regarding this.