YouTube Videos Metadata & Comments Scraper


I have been getting many requests to write a script to scrape YouTube video metadata and comments. So instead of replying separately to everyone, I thought of creating a blog post that would be easy for everyone to go through and would serve as a reference in the future as well.

I have made two different scripts, though both share almost the same code. One scrapes data using the ID of a YouTube video, and the other scrapes data for the top 10 videos in the search results for any term.

First of all, we need a DEVELOPER_KEY for the YouTube Data API for this script to work. You can grab one here.

I am using an external library called pafy to download some metadata about YouTube videos. You can find more details about it here.

After successful scraping, I store all of that data in a CSV file, so I have imported the csv library.

Now it's time to do some scraping:

Import all required libraries into our file.

from apiclient.discovery import build   # from the google-api-python-client package
from apiclient.errors import HttpError
from oauth2client.tools import argparser
import pafy   # metadata downloader for YouTube videos
import csv    # to write the scraped data to a CSV file

Now it's time to add our developer key and build the youtube service object.

DEVELOPER_KEY = "#AddYourDeveloperKey"
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"
pafy.set_api_key("#AddYourAPIKey")

youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION, developerKey=DEVELOPER_KEY)
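
Before going further, it helps to confirm that the key and service object actually work. Here is a minimal smoke test, not part of the original script: it fetches the snippet of a placeholder video ID and prints its title.

# Optional sanity check (not part of the scraper itself):
# fetch the snippet of a placeholder video ID and print its title.
check = youtube.videos().list(part="snippet", id="dQw4w9WgXcQ").execute()
print check["items"][0]["snippet"]["title"]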

We take the YouTube video ID as input and turn it into a full YouTube watch URL.

videoId = raw_input("ID of youtube video : \n")
url = "https://www.youtube.com/watch?v=" + videoId

Now we request the video metadata from pafy:

video = pafy.new(url)
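
The video object now carries the metadata as plain attributes. As a quick sketch, here are a few of the same pafy fields the script stores later:

# Peek at a few of the pafy metadata fields used below.
print video.title      # video title
print video.author     # uploader's channel name
print video.viewcount  # total view count
print video.duration   # length as HH:MM:SS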

It's time to get all the comments from that YouTube video and save them into a list. The API returns at most 100 results per request, so if a video has more than 100 comments we have to repeat the same call, passing the returned nextPageToken each time, until there are no pages left.

results = youtube.commentThreads().list(
    part="snippet",
    maxResults=100,
    videoId=videoId,
    textFormat="plainText"
).execute()

comments = []
while True:
	# Save the author and text of every top-level comment on this page.
	for item in results["items"]:
		comment = item["snippet"]["topLevelComment"]
		author = comment["snippet"]["authorDisplayName"]
		text = comment["snippet"]["textDisplay"]
		comments.append([author, text])
	# No nextPageToken in the response means this was the last page.
	if "nextPageToken" not in results:
		break
	print "."  # progress indicator for long comment threads
	try:
		results = youtube.commentThreads().list(
			part="snippet",
			maxResults=100,
			videoId=videoId,
			textFormat="plainText",
			pageToken=results["nextPageToken"]
		).execute()
	except HttpError, e:
		print "An HTTP error %d occurred:\n%s" % (e.resp.status, e.content)
		break

Now it's time to add all the data to our CSV file.

add_data(videoId, video.title, video.description, video.author, video.published, video.viewcount, video.duration, video.likes, video.dislikes, video.rating, video.category, comments)

The following function appends all of our data as one row to a CSV file (in the full script it is defined before it is called):

def add_data(vID, title, description, author, published, viewcount, duration, likes, dislikes, rating, category, comments):
	# Append one row per video; the comments list ends up in a single cell.
	data = [vID, title, description, author, published, viewcount, duration, likes, dislikes, rating, category, comments]
	with open("scraper.csv", "a") as fp:
		wr = csv.writer(fp, dialect='excel')
		wr.writerow(data)
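
If you want to check what has been scraped so far, reading the file back is straightforward with csv.reader. A minimal sketch, assuming scraper.csv was produced by add_data above:

# Print the ID and title of every video scraped so far.
with open("scraper.csv", "r") as fp:
	for row in csv.reader(fp):
		print row[0], row[1]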

This way we can get all the data and comments of a YouTube video.

A simple extension of this script is to fetch the same data for the top 10 search results of any term.

For this, I take the search term as input and call the YouTube API for the search results. From those results I take the top 10 video IDs and run the script above on each of them (wrapped in a function called get_data) to fetch all the required data.

searchTerm = raw_input("Term you want to Search : \n")
search_response = youtube.search().list(
    q=searchTerm,
    part="id,snippet",
    maxResults=30
).execute()
count = 0
for search_result in search_response.get("items", []):
	# Search results also include channels and playlists; keep videos only.
	if search_result["id"]["kind"] != "youtube#video":
		continue
	if count >= 10:
		break
	vID = search_result["id"]["videoId"]
	get_data(vID)  # run the metadata-and-comments scraper above
	count += 1

You can check out the full scripts in this repo here. Fork it or star it if you like it.

You can mail me at sunil@suniltatipelly.in for any queries or doubts regarding this.

Sunil Tatipelly
