<h1 id="sunil-tatipelly">Sunil Tatipelly</h1>
<p><em>Supposedly Engineer. Major Geek. Food Freak. Proud IITian. Quirkyalone.</em></p>
<hr />
<h1 id="sonyliv-live-streams-notifier"><a href="http://suniltatipelly.in/sonyliv-live-streams-notifier">SonyLiv Live Streams Notifier</a></h1>
<p><em>Published 2018-01-08</em></p>
<p>Being a huge fan of cricket, with a roomie who is a football freak, it was always a pain in the ass to open SonyLiv and check whether a stream had started. Then it struck me: why shouldn’t I write a script that notifies me and my roomie whenever any stream starts on SonyLiv? With that thought, I started going through the Network tab in Chrome Developer Tools and was able to capture the request through which all current live streams are fetched.</p>
<p>I started the script with a request to the URL I captured from the Network tab, which returns all the live streams currently running on the website as JSON. I used <code class="language-plaintext highlighter-rouge">pynotify</code> to notify me about newly added live streams. After notifying, I save the stream data into a <code class="language-plaintext highlighter-rouge">csv</code> file so that no streams are repeated, and I use <code class="language-plaintext highlighter-rouge">schedule</code> to run the check every 5 minutes for newly added streams.</p>
<p>You can also filter the notifications by sport type. Just add a new line to check <code class="language-plaintext highlighter-rouge">if contentGenre == 'Football'</code>, or whatever your required sport is.</p>
<p>You can add new functions to notify anyone by calling them from the <code class="language-plaintext highlighter-rouge">extractSonyLivDataLive</code> function. A rough sketch of the overall flow is shown below.</p>
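<p>Below is a minimal sketch of the polling loop, not the actual script. The endpoint URL and the JSON field names (<code class="language-plaintext highlighter-rouge">title</code>, <code class="language-plaintext highlighter-rouge">contentGenre</code>) are placeholders for whatever the captured request actually returns, and <code class="language-plaintext highlighter-rouge">notify-send</code> stands in for <code class="language-plaintext highlighter-rouge">pynotify</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv
import os
import time

import requests
import schedule

STREAMS_URL = "https://example.com/sonyliv/live"   # placeholder: use the URL captured from the Network tab
SEEN_FILE = "streams.csv"

def load_seen():
    # Streams we have already notified about, so none are repeated
    if not os.path.exists(SEEN_FILE):
        return set()
    with open(SEEN_FILE) as fp:
        return set(row[0] for row in csv.reader(fp) if row)

def check_streams():
    seen = load_seen()
    streams = requests.get(STREAMS_URL).json()   # assumed to be a list of stream objects
    with open(SEEN_FILE, "a") as fp:
        writer = csv.writer(fp)
        for stream in streams:
            # Optional sport filter, e.g.: if stream["contentGenre"] != "Football": continue
            if stream["title"] not in seen:
                os.system('notify-send "SonyLiv Live" "' + stream["title"] + '"')
                writer.writerow([stream["title"]])

schedule.every(5).minutes.do(check_streams)
while True:
    schedule.run_pending()
    time.sleep(1)
</code></pre></div></div>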
<h3 id="requirements">Requirements</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>requests
csv
schedule
pynotify
</code></pre></div></div>
<h3 id="how-to-run">How To Run</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python script.py
</code></pre></div></div>
<h3 id="response">Response</h3>
<p><img src="https://github.com/Sunil02324/SonyLiv-Live-Notifier/blob/master/sample.png?raw=true" alt="Alt Text" /></p>
<hr />
<h3 id="source-on-github--sonyliv-live-notifier">Source on Github : <a href="https://github.com/Sunil02324/SonyLiv-Live-Notifier">SonyLiv Live Notifier</a></h3>
<p>Please comment your views/suggestions below.</p>
<hr />
<h1 id="youtube-videos-metadata--comments-scraper"><a href="http://suniltatipelly.in/youtube-video-metadata-and-comments-scraper">Youtube Videos Metadata & Comments Scraper</a></h1>
<p><em>Published 2017-04-05</em></p>
<p>I have been getting many requests to write a script that scrapes YouTube video metadata and comments. So instead of replying to everyone separately, I thought of creating a blog post that would be easy for everyone to go through and would serve as a future reference.</p>
<p>I have made 2 different scripts, though both include almost the same code. One scrapes data using the ID of a YouTube video, and the other scrapes data for the top 10 videos on the search results page for any term.</p>
<p>First of all, we need a <code class="language-plaintext highlighter-rouge">DEVELOPER_KEY</code> for the YouTube Data API for this script to work. You can grab one <a href="https://console.developers.google.com" target="_blank">here</a>.</p>
<p>I am using an external library called <code class="language-plaintext highlighter-rouge">pafy</code> to download some metadata about YouTube videos. You can find more details about it <a href="https://pythonhosted.org/Pafy/" target="_blank">here</a>.</p>
<p>After successful scraping, I store all the data in a CSV file, so I have also imported the <code class="language-plaintext highlighter-rouge">csv</code> library.</p>
<p>Now it’s time to do some scraping:</p>
<p>Import all required libraries into our file.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">apiclient.discovery</span> <span class="kn">import</span> <span class="n">build</span>
<span class="kn">from</span> <span class="nn">apiclient.errors</span> <span class="kn">import</span> <span class="n">HttpError</span>
<span class="kn">from</span> <span class="nn">oauth2client.tools</span> <span class="kn">import</span> <span class="n">argparser</span>
<span class="kn">import</span> <span class="nn">pafy</span>
<span class="kn">import</span> <span class="nn">csv</span>
</code></pre></div></div>
<p>Now it’s time to add our developer key and build the YouTube service.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DEVELOPER_KEY</span> <span class="o">=</span> <span class="s">"#AddYourDeveloperKey"</span>
<span class="n">YOUTUBE_API_SERVICE_NAME</span> <span class="o">=</span> <span class="s">"youtube"</span>
<span class="n">YOUTUBE_API_VERSION</span> <span class="o">=</span> <span class="s">"v3"</span>
<span class="n">pafy</span><span class="p">.</span><span class="n">set_api_key</span><span class="p">(</span><span class="s">"#AddYourAPIKey"</span><span class="p">)</span>
<span class="n">youtube</span> <span class="o">=</span> <span class="n">build</span><span class="p">(</span><span class="n">YOUTUBE_API_SERVICE_NAME</span><span class="p">,</span> <span class="n">YOUTUBE_API_VERSION</span><span class="p">,</span><span class="n">developerKey</span><span class="o">=</span><span class="n">DEVELOPER_KEY</span><span class="p">)</span>
</code></pre></div></div>
<p>We take the YouTube video ID as input and turn it into a full YouTube URL.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">videoId</span> <span class="o">=</span> <span class="nb">raw_input</span><span class="p">(</span><span class="s">"ID of youtube video : </span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"https://www.youtube.com/watch?v="</span> <span class="o">+</span> <span class="n">videoId</span>
</code></pre></div></div>
<p>Requesting the metadata from <code class="language-plaintext highlighter-rouge">pafy</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">video</span> <span class="o">=</span> <span class="n">pafy</span><span class="p">.</span><span class="n">new</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div></div>
<p>It’s time to get all the comments from that YouTube video and save them into a list. The default maximum you can get per request is 100 results, so if a video has more than 100 comments we need to call the same function repeatedly, following the page tokens, to get all the comments.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">youtube</span><span class="p">.</span><span class="n">commentThreads</span><span class="p">().</span><span class="nb">list</span><span class="p">(</span>
<span class="n">part</span><span class="o">=</span><span class="s">"snippet"</span><span class="p">,</span>
<span class="n">maxResults</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
<span class="n">videoId</span><span class="o">=</span><span class="n">videoId</span><span class="p">,</span>
<span class="n">textFormat</span><span class="o">=</span><span class="s">"plainText"</span>
<span class="p">).</span><span class="n">execute</span><span class="p">()</span>
<span class="n">totalResults</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">totalResults</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="s">"pageInfo"</span><span class="p">][</span><span class="s">"totalResults"</span><span class="p">])</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">nextPageToken</span> <span class="o">=</span> <span class="s">''</span>
<span class="n">comments</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">further</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">first</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">while</span> <span class="n">further</span><span class="p">:</span>
<span class="n">halt</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">if</span> <span class="n">first</span> <span class="o">==</span> <span class="bp">False</span><span class="p">:</span>
<span class="k">print</span> <span class="s">"."</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">youtube</span><span class="p">.</span><span class="n">commentThreads</span><span class="p">().</span><span class="nb">list</span><span class="p">(</span>
<span class="n">part</span><span class="o">=</span><span class="s">"snippet"</span><span class="p">,</span>
<span class="n">maxResults</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
<span class="n">videoId</span><span class="o">=</span><span class="n">videoId</span><span class="p">,</span>
<span class="n">textFormat</span><span class="o">=</span><span class="s">"plainText"</span><span class="p">,</span>
<span class="n">pageToken</span><span class="o">=</span><span class="n">nextPageToken</span>
<span class="p">).</span><span class="n">execute</span><span class="p">()</span>
<span class="n">totalResults</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="s">"pageInfo"</span><span class="p">][</span><span class="s">"totalResults"</span><span class="p">])</span>
<span class="k">except</span> <span class="n">HttpError</span><span class="p">,</span> <span class="n">e</span><span class="p">:</span>
<span class="k">print</span> <span class="s">"An HTTP error %d occurred:</span><span class="se">\n</span><span class="s">%s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">resp</span><span class="p">.</span><span class="n">status</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">halt</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">if</span> <span class="n">halt</span> <span class="o">==</span> <span class="bp">False</span><span class="p">:</span>
<span class="n">count</span> <span class="o">+=</span> <span class="n">totalResults</span>
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="s">"items"</span><span class="p">]:</span>
<span class="n">comment</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="s">"snippet"</span><span class="p">][</span><span class="s">"topLevelComment"</span><span class="p">]</span>
<span class="n">author</span> <span class="o">=</span> <span class="n">comment</span><span class="p">[</span><span class="s">"snippet"</span><span class="p">][</span><span class="s">"authorDisplayName"</span><span class="p">]</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">comment</span><span class="p">[</span><span class="s">"snippet"</span><span class="p">][</span><span class="s">"textDisplay"</span><span class="p">]</span>
<span class="n">comments</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="n">author</span><span class="p">,</span><span class="n">text</span><span class="p">])</span>
<span class="k">if</span> <span class="n">totalResults</span> <span class="o"><</span> <span class="mi">100</span><span class="p">:</span>
<span class="n">further</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">first</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">further</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">first</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">nextPageToken</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="s">"nextPageToken"</span><span class="p">]</span>
<span class="k">except</span> <span class="nb">KeyError</span><span class="p">,</span> <span class="n">e</span><span class="p">:</span>
<span class="k">print</span> <span class="s">"An KeyError error occurred: %s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">e</span><span class="p">)</span>
<span class="n">further</span> <span class="o">=</span> <span class="bp">False</span>
</code></pre></div></div>
<p>Now it’s time to add all the data to our CSV file.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">add_data</span><span class="p">(</span><span class="n">videoId</span><span class="p">,</span><span class="n">video</span><span class="p">.</span><span class="n">title</span><span class="p">,</span><span class="n">video</span><span class="p">.</span><span class="n">description</span><span class="p">,</span><span class="n">video</span><span class="p">.</span><span class="n">author</span><span class="p">,</span><span class="n">video</span><span class="p">.</span><span class="n">published</span><span class="p">,</span><span class="n">video</span><span class="p">.</span><span class="n">viewcount</span><span class="p">,</span> <span class="n">video</span><span class="p">.</span><span class="n">duration</span><span class="p">,</span> <span class="n">video</span><span class="p">.</span><span class="n">likes</span><span class="p">,</span> <span class="n">video</span><span class="p">.</span><span class="n">dislikes</span><span class="p">,</span><span class="n">video</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span><span class="n">video</span><span class="p">.</span><span class="n">category</span><span class="p">,</span><span class="n">comments</span><span class="p">)</span>
</code></pre></div></div>
<p>The following function appends our data to the CSV file.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add_data</span><span class="p">(</span><span class="n">vID</span><span class="p">,</span><span class="n">title</span><span class="p">,</span><span class="n">description</span><span class="p">,</span><span class="n">author</span><span class="p">,</span><span class="n">published</span><span class="p">,</span><span class="n">viewcount</span><span class="p">,</span> <span class="n">duration</span><span class="p">,</span> <span class="n">likes</span><span class="p">,</span> <span class="n">dislikes</span><span class="p">,</span><span class="n">rating</span><span class="p">,</span><span class="n">category</span><span class="p">,</span><span class="n">comments</span><span class="p">):</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="n">vID</span><span class="p">,</span><span class="n">title</span><span class="p">,</span><span class="n">description</span><span class="p">,</span><span class="n">author</span><span class="p">,</span><span class="n">published</span><span class="p">,</span><span class="n">viewcount</span><span class="p">,</span> <span class="n">duration</span><span class="p">,</span> <span class="n">likes</span><span class="p">,</span> <span class="n">dislikes</span><span class="p">,</span><span class="n">rating</span><span class="p">,</span><span class="n">category</span><span class="p">,</span><span class="n">comments</span><span class="p">]</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"scraper.csv"</span><span class="p">,</span> <span class="s">"a"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
<span class="n">wr</span> <span class="o">=</span> <span class="n">csv</span><span class="p">.</span><span class="n">writer</span><span class="p">(</span><span class="n">fp</span><span class="p">,</span> <span class="n">dialect</span><span class="o">=</span><span class="s">'excel'</span><span class="p">)</span>
<span class="n">wr</span><span class="p">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<p>This way we can get all the metadata and comments of a YouTube video.</p>
<p>A simple extension of this script is to get the same data for the top 10 search results.</p>
<p>For this I take the search term as input and call the YouTube API for the search results. From those results I take the top 10 video IDs and run the above steps (wrapped in a <code class="language-plaintext highlighter-rouge">get_data</code> function) on each of them.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">searchTerm</span> <span class="o">=</span> <span class="nb">raw_input</span><span class="p">(</span><span class="s">"Term you want to Search : </span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="n">search_response</span> <span class="o">=</span> <span class="n">youtube</span><span class="p">.</span><span class="n">search</span><span class="p">().</span><span class="nb">list</span><span class="p">(</span>
<span class="n">q</span><span class="o">=</span><span class="n">searchTerm</span><span class="p">,</span>
<span class="n">part</span><span class="o">=</span><span class="s">"id,snippet"</span><span class="p">,</span>
<span class="n">maxResults</span><span class="o">=</span><span class="mi">30</span>
<span class="p">).</span><span class="n">execute</span><span class="p">()</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">search_result</span> <span class="ow">in</span> <span class="n">search_response</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"items"</span><span class="p">,</span> <span class="p">[]):</span>
<span class="k">if</span> <span class="n">search_result</span><span class="p">[</span><span class="s">"id"</span><span class="p">][</span><span class="s">"kind"</span><span class="p">]</span> <span class="o">==</span> <span class="s">"youtube#video"</span><span class="p">:</span>
<span class="k">if</span> <span class="n">count</span> <span class="o"><</span><span class="mi">10</span><span class="p">:</span>
<span class="n">vID</span> <span class="o">=</span> <span class="n">search_result</span><span class="p">[</span><span class="s">"id"</span><span class="p">][</span><span class="s">"videoId"</span><span class="p">]</span>
<span class="n">get_data</span><span class="p">(</span><span class="n">vID</span><span class="p">)</span>
<span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">continue</span>
</code></pre></div></div>
<p>You can check out the full scripts in this repo <a href="https://github.com/Sunil02324/Youtube-Meta-Data-Comments-Scraper">here</a>. Fork it, or star it if you like it.</p>
<p>You can mail me at <a href="mailto:sunil@suniltatipelly.in">sunil@suniltatipelly.in</a> for any queries or doubts regarding this.</p>
<hr />
<h1 id="mit-opencourseware-scraper"><a href="http://suniltatipelly.in/mit-ocw-scraper">MIT OpenCourseWare Scraper</a></h1>
<p><em>Published 2017-02-09</em></p>
<p>MIT OpenCourseWare (MIT OCW) is an initiative of the Massachusetts Institute of Technology (MIT) to put all of the educational materials from its undergraduate- and graduate-level courses online, freely and openly available to anyone, anywhere. MIT OpenCourseWare is a large-scale, web-based publication of MIT course materials.</p>
<p>Recently, while going through the <a href="https://ocw.mit.edu/index.htm">website</a>, I had the idea of getting all the course materials and their details offline. So I checked how the course details are populated on the <code class="language-plaintext highlighter-rouge">Search by Topic</code> page, as it is easier to scrape the search page than the other ones.</p>
<p>I won’t write up a full walkthrough of how the script runs, but I made it as simple as possible, so you should be able to follow all the steps easily. The sketch below shows the basic idea.</p>
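<p>Here is a minimal sketch of the approach, not the actual script: fetch a listing page and pull out course titles and links. The entry URL and the link pattern are assumptions; inspect the real <code class="language-plaintext highlighter-rouge">Search by Topic</code> page to see how the course details are actually populated.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://ocw.mit.edu/courses/"   # assumed entry point

resp = requests.get(LISTING_URL)
soup = BeautifulSoup(resp.text, "html.parser")

for link in soup.find_all("a"):
    href = link.get("href", "")
    # Assumption: course pages live under /courses/; adjust to the real markup
    if href.startswith("/courses/") and link.get_text(strip=True):
        print(link.get_text(strip=True) + " -> https://ocw.mit.edu" + href)
</code></pre></div></div>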
<p>You can start the script by running <code class="language-plaintext highlighter-rouge">python script.py</code></p>
<p>You can check out the repo <a href="https://github.com/Sunil02324/MIT-OpenCourseWare-Scraper">here</a>. Fork it, or star it if you like it.</p>
<p>You can mail me at <a href="mailto:sunil@suniltatipelly.in">sunil@suniltatipelly.in</a> for any queries or doubts regarding this.</p>
<hr />
<p>Below is the output of the script :</p>
<p><img class="image" src="http://suniltatipelly.in/assets/images/posts/mit1.png" alt="Alt Text" style="display:block;margin:0 auto;" /></p>
<hr />
<h1 id="facebook-conversation-id-retriever"><a href="http://suniltatipelly.in/facebook-conversation-id-retriever">Facebook Conversation ID Retriever</a></h1>
<p><em>Published 2016-11-27</em></p>
<p>This is a support script to the <a href="http://suniltatipelly.in/facebook-messages-counter/">Facebook Messages Counter</a> I posted earlier. It can be used to get the conversation ID from the profile ID.</p>
<p>I used the same Graph API v2.2 that I used in the previous post to retrieve the conversation ID. You can use the same access token generated from the Graph API Explorer. If you haven’t generated one yet, you can do so <a href="https://developers.facebook.com/tools/explorer/145634995501895?method=GET&path=me%2Finbox&version=v2.2">here</a>.</p>
<p>Replace the <code class="language-plaintext highlighter-rouge"><access_token></code> in the <code class="language-plaintext highlighter-rouge">url</code> in the script with your access token. Also specify the user ID of the person in <code class="language-plaintext highlighter-rouge">userId</code>.</p>
<p>Then start running the script with <code class="language-plaintext highlighter-rouge">python script.py</code></p>
<p>The required conversation ID is printed if the conversation has been initiated before.</p>
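<p>The idea, in short: list your inbox threads via the Graph API and find the one whose participants include the given user ID. Below is a minimal sketch of that logic; the field names follow the v2.2 conversation docs and should be verified against your own response, and the full script is embedded after it:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

access_token = "<access_token>"   # from the Graph API Explorer
userId = "<profile-id>"           # profile/user ID of the person

url = "https://graph.facebook.com/v2.2/me/inbox?access_token=" + access_token
data = requests.get(url).json()

for thread in data['data']:
    # Each inbox thread lists its participants under 'to'
    for person in thread['to']['data']:
        if person['id'] == userId:
            print("Conversation ID : " + thread['id'])
</code></pre></div></div>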
<p>The final script is :</p>
<script src="https://gist.github.com/Sunil02324/6248ffaf6ee139535fafccc4035c1f0d.js"></script>
<hr />
<h3 id="sample-">Sample :</h3>
<p><img class="image" src="http://suniltatipelly.in/assets/images/posts/messenger1.PNG" alt="Alt Text" style="display:block;margin:0 auto;" /></p>
<hr />
<p>You can check out the repo <a href="https://github.com/Sunil02324/Facebook-Conversation-ID-Retriever">here</a>. Fork it, or star it if you like it.</p>
<p>You can mail me at <a href="mailto:sunil@suniltatipelly.in">sunil@suniltatipelly.in</a> for any queries or doubts regarding this.</p>
<hr />
<h1 id="facebook-messages-counter"><a href="http://suniltatipelly.in/facebook-messages-counter">Facebook Messages Counter</a></h1>
<p><em>Published 2016-11-26</em></p>
<p>I always had this thought of getting the count of all the messages I sent & received in my group chat with friends, and of logging those messages automatically every day, but I never made it into a script. With my exams going on and in no mood for studying, I finally wrote it.</p>
<p>I used the Graph API v2.2 to get the conversation details. <a href="https://developers.facebook.com/docs/graph-api/reference/v2.2/conversation">This</a> is the link to the docs of the Graph API. I used the <a href="https://developers.facebook.com/tools/explorer/145634995501895/">Graph API Explorer</a> to check whether I was getting all the required details from the API call.</p>
<p>The most important thing needed for this script is the initial URL to which the request must be sent to retrieve the details. That URL can be obtained by selecting any of your conversations from <a href="https://developers.facebook.com/tools/explorer/145634995501895/?method=GET&path=me%2Finbox&version=v2.2">here</a>. Please note that the conversation ID is different from the profile/user ID.</p>
<p>After selecting the conversation ID, a request is made from the explorer and the response is displayed. Copy the URL from the <code class="language-plaintext highlighter-rouge">Get Code</code> button at the bottom right of the explorer. It looks similar to <code class="language-plaintext highlighter-rouge">https://graph.facebook.com/v2.2/<conversationID>?access_token=###</code></p>
<p>For each request I was only getting 25 messages, along with the URL for the next request. So I wrote a while loop which terminates when the boolean <code class="language-plaintext highlighter-rouge">loop</code> is set to <code class="language-plaintext highlighter-rouge">False</code>.</p>
<p>From the initial response, I save lists of the members’ names and profile IDs, and the total number of members in the group chat. Once the lists are generated, I initialize a messages-count array with zeros.</p>
<p>I also keep updating the variable <code class="language-plaintext highlighter-rouge">url</code> with the <code class="language-plaintext highlighter-rouge">next</code> URL I get from each response. In each response, after parsing each message, I check the sender ID of the message and update the corresponding entry in the messages-count list. This process is repeated till all the requests are finished.</p>
<p>Once that is done, I run a for loop to print the name of each sender and the number of messages sent by them.</p>
<p>After making the above changes, the script looked something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests
import json
from time import sleep
## We had proxy enabled in our institute. Comment the below lines if not needed.
http_proxy = "http://host:port"
https_proxy = "https://host:port"
proxyDict = {
"http" : http_proxy,
"https" : https_proxy
}
## Select any of your conversation ID from here https://developers.facebook.com/tools/explorer/145634995501895/?method=GET&path=me%2Finbox&version=v2.2
## Replace the <conversationID> and access_token from your Graph API explorer
##URL for the converstaion : https://graph.facebook.com/v2.2/<conversationID>?access_token=###
## Please note that Profile ID and Cnversation ID are not same.
url = '###'
first = True
loop = True
requests_count = 0
peoples = []
ids = []
count = 0
resp = requests.get(url,proxies=proxyDict)
data = resp.json()
mums = data['to']['data']
for mum in mums:
peoples.append(mum['name'])
ids.append(mum['id'])
count += 1
sleep(1)
messages_count = [0]*count
##Used Sleep several times to escape from the API rate limiting.
##There might be some cases of Access Token Expiration. In that case, decrease the time of sleep.
while loop:
try:
print requests_count
resp = requests.get(url,proxies=proxyDict)
if first:
data = resp.json()
messages = data['comments']['data']
for message in messages:
for i in range(0,count):
if message['from']['id'] == ids[i]:
messages_count[i] = int(messages_count[i]) + 1
break
url = data['comments']['paging']['next']
first = False
requests_count += 1
sleep(0.5)
else:
data1 = resp.json()
if(data1['data']):
messages = data1['data']
for message in messages:
for i in range(0,count):
if message['from']['id'] == ids[i]:
messages_count[i] = int(messages_count[i]) + 1
break
url = data1['paging']['next']
requests_count += 1
sleep(0.5)
else:
loop = False
except IOError as e:
print "Socket error. Sleeping for 2 seconds"
sleep(2)
continue
except requests.exceptions.ConnectionError as e:
print "Proxy Error. Sleeping for 2 seconds"
sleep(2)
continue
## Printing the Member names and there messages count respectively.
for i in range(0,count):
print "Messages Sent By " + str(peoples[i]) + " : " + str(messages_count[i])
</code></pre></div></div>
<p>I ran the script. It was going well for some time, but then I started getting various exceptions that forced me to run the script again and again. I changed the script to handle those exceptions as well, so that it wouldn’t stop before the end.</p>
<p>Some of the exceptions I got are :</p>
<ul>
  <li>API call limit exceeded. I handled this exception by making the script sleep for 20 seconds.</li>
  <li>Response does not contain ‘data’. This exception arises when there are no more messages in the conversation. I made the script break once this arises.</li>
  <li>Socket and proxy errors. As I was behind a proxy, I had to handle these exceptions too.</li>
</ul>
<hr />
<p>And the final script looked like this.</p>
<script src="https://gist.github.com/Sunil02324/a22ca9587c453027c6b7404901d18a4b.js"></script>
<hr />
<p>Below is the image of the result of my group chat with my friends.</p>
<p><img class="image" src="http://suniltatipelly.in/assets/images/posts/messenger.PNG" alt="Alt Text" style="display:block;margin:0 auto;" /></p>
<p>If your chat count crosses ours, do share it along with this blog and tag me, or mail me.</p>
<hr />
<p>This script can be used both to count the messages and to log them. Only slight changes are needed to log the messages of the conversation; one possible tweak is sketched below.</p>
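<p>For example (a sketch, assuming the <code class="language-plaintext highlighter-rouge">message</code> field of the v2.2 response holds the text): inside the loop over <code class="language-plaintext highlighter-rouge">messages</code> above, append each message to a log file instead of only counting it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Inside the loop over `messages` from the responses above:
with open("chat_log.txt", "a") as log:
    for message in messages:
        # 'message' is assumed to hold the text of the message in the v2.2 response
        log.write(message['from']['name'] + " : " + message.get('message', '') + "\n")
</code></pre></div></div>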
<p>I will update a new script to get the conversation ID from the given Profile ID later.</p>
<p>You can check out the repo <a href="https://github.com/Sunil02324/Facebook-Messages-Counter">here</a>. Fork it, or star it if you like it.</p>
<p>You can mail me at <a href="mailto:sunil@suniltatipelly.in">sunil@suniltatipelly.in</a> for any queries or doubts regarding this.</p>
<hr />
<h1 id="how-to-report-bugs-effectively"><a href="http://suniltatipelly.in/how-to-report-bugs-effectively">How to Report Bugs Effectively</a></h1>
<p><em>Published 2016-11-07</em></p>
<h2 id="introduction">Introduction</h2>
<p>Anybody who has written software for public use will probably have received at least one bad bug report. Reports that say nothing (“It doesn’t work!”); reports that make no sense; reports that don’t give enough information; reports that give wrong information. Reports of problems that turn out to be user error; reports of problems that turn out to be the fault of somebody else’s program; reports of problems that turn out to be network failures.</p>
<p>There’s a reason why technical support is seen as a horrible job to be in, and that reason is bad bug reports. However, not all bug reports are unpleasant: I maintain free software, when I’m not earning my living, and sometimes I receive wonderfully clear, helpful, informative bug reports.</p>
<p>In this essay I’ll try to state clearly what makes a good bug report. Ideally I would like everybody in the world to read this essay before reporting any bugs to anybody. Certainly I would like everybody who reports bugs to me to have read it.</p>
<p>In a nutshell, the aim of a bug report is to enable the programmer to see the program failing in front of them. You can either show them in person, or give them careful and detailed instructions on how to make it fail. If they can make it fail, they will try to gather extra information until they know the cause. If they can’t make it fail, they will have to ask you to gather that information for them.</p>
<p>In bug reports, try to make very clear what are actual facts (“I was at the computer and this happened”) and what are speculations (“I think the problem might be this”). Leave out speculations if you want to, but don’t leave out facts.</p>
<p>When you report a bug, you are doing so because you want the bug fixed. There is no point in swearing at the programmer or being deliberately unhelpful: it may be their fault and your problem, and you might be right to be angry with them, but the bug will get fixed faster if you help them by supplying all the information they need. Remember also that if the program is free, then the author is providing it out of kindness, so if too many people are rude to them then they may stop feeling kind.</p>
<h2 id="it-doesnt-work">“It doesn’t work.”</h2>
<p>Give the programmer some credit for basic intelligence: if the program really didn’t work at all, they would probably have noticed. Since they haven’t noticed, it must be working for them. Therefore, either you are doing something differently from them, or your environment is different from theirs. They need information; providing this information is the purpose of a bug report. More information is almost always better than less.</p>
<p>Many programs, particularly free ones, publish their list of known bugs. If you can find a list of known bugs, it’s worth reading it to see if the bug you’ve just found is already known or not. If it’s already known, it probably isn’t worth reporting again, but if you think you have more information than the report in the bug list, you might want to contact the programmer anyway. They might be able to fix the bug more easily if you can give them information they didn’t already have.</p>
<p>This essay is full of guidelines. None of them is an absolute rule. Particular programmers have particular ways they like bugs to be reported. If the program comes with its own set of bug-reporting guidelines, read them. If the guidelines that come with the program contradict the guidelines in this essay, follow the ones that come with the program!</p>
<p>If you are not reporting a bug but just asking for help using the program, you should state where you have already looked for the answer to your question. (“I looked in chapter 4 and section 5.2 but couldn’t find anything that told me if this is possible.”) This will let the programmer know where people will expect to find the answer, so they can make the documentation easier to use.</p>
<h2 id="show-me">“Show me.”</h2>
<p>One of the very best ways you can report a bug is by showing it to the programmer. Stand them in front of your computer, fire up their software, and demonstrate the thing that goes wrong. Let them watch you start the machine, watch you run the software, watch how you interact with the software, and watch what the software does in response to your inputs.</p>
<p>They know that software like the back of their hand. They know which parts they trust, and they know which parts are likely to have faults. They know intuitively what to watch for. By the time the software does something obviously wrong, they may well have already noticed something subtly wrong earlier which might give them a clue. They can observe everything the computer does during the test run, and they can pick out the important bits for themselves.</p>
<p>This may not be enough. They may decide they need more information, and ask you to show them the same thing again. They may ask you to talk them through the procedure, so that they can reproduce the bug for themselves as many times as they want. They might try varying the procedure a few times, to see whether the problem occurs in only one case or in a family of related cases. If you’re unlucky, they may need to sit down for a couple of hours with a set of development tools and really start investigating. But the most important thing is to have the programmer looking at the computer when it goes wrong. Once they can see the problem happening, they can usually take it from there and start trying to fix it.</p>
<h2 id="show-me-how-to-show-myself">“Show me how to show myself.”</h2>
<p>This is the era of the Internet. This is the era of worldwide communication. This is the era in which I can send my software to somebody in Russia at the touch of a button, and he can send me comments about it just as easily. But if he has a problem with my program, he can’t have me standing in front of it while it fails. “Show me” is good when you can, but often you can’t.</p>
<p>If you have to report a bug to a programmer who can’t be present in person, the aim of the exercise is to enable them to reproduce the problem. You want the programmer to run their own copy of the program, do the same things to it, and make it fail in the same way. When they can see the problem happening in front of their eyes, then they can deal with it.</p>
<p>So tell them exactly what you did. If it’s a graphical program, tell them which buttons you pressed and what order you pressed them in. If it’s a program you run by typing a command, show them precisely what command you typed. Wherever possible, you should provide a verbatim transcript of the session, showing what commands you typed and what the computer output in response.</p>
<p>Give the programmer all the input you can think of. If the program reads from a file, you will probably need to send a copy of the file. If the program talks to another computer over a network, you probably can’t send a copy of that computer, but you can at least say what kind of computer it is, and (if you can) what software is running on it.</p>
<h2 id="works-for-me-so-what-goes-wrong">“Works for me. So what goes wrong?”</h2>
<p>If you give the programmer a long list of inputs and actions, and they fire up their own copy of the program and nothing goes wrong, then you haven’t given them enough information. Possibly the fault doesn’t show up on every computer; your system and theirs may differ in some way. Possibly you have misunderstood what the program is supposed to do, and you are both looking at exactly the same display but you think it’s wrong and they know it’s right.</p>
<p>So also describe what happened. Tell them exactly what you saw. Tell them why you think what you saw is wrong; better still, tell them exactly what you expected to see. If you say “and then it went wrong”, you have left out some very important information.</p>
<p>If you saw error messages then tell the programmer, carefully and precisely, what they were. They are important! At this stage, the programmer is not trying to fix the problem: they’re just trying to find it. They need to know what has gone wrong, and those error messages are the computer’s best effort to tell you that. Write the errors down if you have no other easy way to remember them, but it’s not worth reporting that the program generated an error unless you can also report what the error message was.</p>
<p>In particular, if the error message has numbers in it, do let the programmer have those numbers. Just because you can’t see any meaning in them doesn’t mean there isn’t any. Numbers contain all kinds of information that can be read by programmers, and they are likely to contain vital clues. Numbers in error messages are there because the computer is too confused to report the error in words, but is doing the best it can to get the important information to you somehow.</p>
<p>At this stage, the programmer is effectively doing detective work. They don’t know what’s happened, and they can’t get close enough to watch it happening for themselves, so they are searching for clues that might give it away. Error messages, incomprehensible strings of numbers, and even unexplained delays are all just as important as fingerprints at the scene of a crime. Keep them!</p>
<p>If you are using Unix, the program may have produced a core dump. Core dumps are a particularly good source of clues, so don’t throw them away. On the other hand, most programmers don’t like to receive huge core files by e-mail without warning, so ask before mailing one to anybody. Also, be aware that the core file contains a record of the complete state of the program: any “secrets” involved (maybe the program was handling a personal message, or dealing with confidential data) may be contained in the core file.</p>
<h2 id="so-then-i-tried---">“So then I tried . . .”</h2>
<p>There are a lot of things you might do when an error or bug comes up. Many of them make the problem worse. A friend of mine at school deleted all her Word documents by mistake, and before calling in any expert help, she tried reinstalling Word, and then she tried running Defrag. Neither of these helped recover her files, and between them they scrambled her disk to the extent that no Undelete program in the world would have been able to recover anything. If she’d only left it alone, she might have had a chance.</p>
<p>Users like this are like a mongoose backed into a corner: with its back to the wall and seeing certain death staring it in the face, it attacks frantically, because doing something has to be better than doing nothing. This is not well adapted to the type of problems computers produce.</p>
<p>Instead of being a mongoose, be an antelope. When an antelope is confronted with something unexpected or frightening, it freezes. It stays absolutely still and tries not to attract any attention, while it stops and thinks and works out the best thing to do. (If antelopes had a technical support line, it would be telephoning it at this point.) Then, once it has decided what the safest thing to do is, it does it.</p>
<p>When something goes wrong, immediately stop doing anything. Don’t touch any buttons at all. Look at the screen and notice everything out of the ordinary, and remember it or write it down. Then perhaps start cautiously pressing “OK” or “Cancel”, whichever seems safest. Try to develop a reflex reaction - if a computer does anything unexpected, freeze.</p>
<p>If you manage to get out of the problem, whether by closing down the affected program or by rebooting the computer, a good thing to do is to try to make it happen again. Programmers like problems that they can reproduce more than once. Happy programmers fix bugs faster and more efficiently.</p>
<h2 id="i-think-the-tachyon-modulation-must-be-wrongly-polarised">“I think the tachyon modulation must be wrongly polarised.”</h2>
<p>It isn’t only non-programmers who produce bad bug reports. Some of the worst bug reports I’ve ever seen come from programmers, and even from good programmers.</p>
<p>I worked with another programmer once, who kept finding bugs in his own code and trying to fix them. Every so often he’d hit a bug he couldn’t solve, and he’d call me over to help. “What’s gone wrong?” I’d ask. He would reply by telling me his current opinion of what needed to be fixed.</p>
<p>This worked fine when his current opinion was right. It meant he’d already done half the work and we were able to finish the job together. It was efficient and useful.</p>
<p>But quite often he was wrong. We would work for some time trying to figure out why some particular part of the program was producing incorrect data, and eventually we would discover that it wasn’t, that we’d been investigating a perfectly good piece of code for half an hour, and that the actual problem was somewhere else.</p>
<p>I’m sure he wouldn’t do that to a doctor. “Doctor, I need a prescription for Hydroyoyodyne.” People know not to say that to a doctor: you describe the symptoms, the actual discomforts and aches and pains and rashes and fevers, and you let the doctor do the diagnosis of what the problem is and what to do about it. Otherwise the doctor dismisses you as a hypochondriac or crackpot, and quite rightly so.</p>
<p>It’s the same with programmers. Providing your own diagnosis might be helpful sometimes, but always state the symptoms. The diagnosis is an optional extra, and not an alternative to giving the symptoms. Equally, sending a modification to the code to fix the problem is a useful addition to a bug report but not an adequate substitute for one.</p>
<p>If a programmer asks you for extra information, don’t make it up! Somebody reported a bug to me once, and I asked him to try a command that I knew wouldn’t work. The reason I asked him to try it was that I wanted to know which of two different error messages it would give. Knowing which error message came back would give a vital clue. But he didn’t actually try it - he just mailed me back and said “No, that won’t work”. It took me some time to persuade him to try it for real.</p>
<p>Using your intelligence to help the programmer is fine. Even if your deductions are wrong, the programmer should be grateful that you at least tried to make their life easier. But report the symptoms as well, or you may well make their life much more difficult instead.</p>
<h2 id="thats-funny-it-did-it-a-moment-ago">“That’s funny, it did it a moment ago.”</h2>
<p>Say “intermittent fault” to any programmer and watch their face fall. The easy problems are the ones where performing a simple sequence of actions will cause the failure to occur. The programmer can then repeat those actions under closely observed test conditions and watch what happens in great detail. Too many problems simply don’t work that way: there will be programs that fail once a week, or fail once in a blue moon, or never fail when you try them in front of the programmer but always fail when you have a deadline coming up.</p>
<p>Most intermittent faults are not truly intermittent. Most of them have some logic somewhere. Some might occur when the machine is running out of memory, some might occur when another program tries to modify a critical file at the wrong moment, and some might occur only in the first half of every hour! (I’ve actually seen one of these.)</p>
<p>Also, if you can reproduce the bug but the programmer can’t, it could very well be that their computer and your computer are different in some way and this difference is causing the problem. I had a program once whose window curled up into a little ball in the top left corner of the screen, and sat there and sulked. But it only did it on 800x600 screens; it was fine on my 1024x768 monitor.</p>
<p>The programmer will want to know anything you can find out about the problem. Try it on another machine, perhaps. Try it twice or three times and see how often it fails. If it goes wrong when you’re doing serious work but not when you’re trying to demonstrate it, it might be long running times or large files that make it fall over. Try to remember as much detail as you can about what you were doing to it when it did fall over, and if you see any patterns, mention them. Anything you can provide has to be some help. Even if it’s only probabilistic (such as “it tends to crash more often when Emacs is running”), it might not provide direct clues to the cause of the problem, but it might help the programmer reproduce it.</p>
<p>Most importantly, the programmer will want to be sure of whether they’re dealing with a true intermittent fault or a machine-specific fault. They will want to know lots of details about your computer, so they can work out how it differs from theirs. A lot of these details will depend on the particular program, but one thing you should definitely be ready to provide is version numbers. The version number of the program itself, and the version number of the operating system, and probably the version numbers of any other programs that are involved in the problem.</p>
<h2 id="so-i-loaded-the-disk-on-to-my-windows---">“So I loaded the disk on to my Windows . . .”</h2>
<p>Writing clearly is essential in a bug report. If the programmer can’t tell what you meant, you might as well not have said anything.</p>
<p>I get bug reports from all around the world. Many of them are from non-native English speakers, and a lot of those apologise for their poor English. In general, the bug reports with apologies for their poor English are actually very clear and useful. All the most unclear reports come from native English speakers who assume that I will understand them even if they don’t make any effort to be clear or precise.</p>
<ul>
<li>Be specific. If you can do the same thing two different ways, state which one you used. “I selected Load” might mean “I clicked on Load” or “I pressed Alt-L”. Say which you did. Sometimes it matters.</li>
<li>Be verbose. Give more information rather than less. If you say too much, the programmer can ignore some of it. If you say too little, they have to come back and ask more questions. One bug report I received was a single sentence; every time I asked for more information, the reporter would reply with another single sentence. It took me several weeks to get a useful amount of information, because it turned up one short sentence at a time.</li>
<li>Be careful of pronouns. Don’t use words like “it”, or references like “the window”, when it’s unclear what they mean. Consider this: “I started FooApp. It put up a warning window. I tried to close it and it crashed.” It isn’t clear what the user tried to close. Did they try to close the warning window, or the whole of FooApp? It makes a difference. Instead, you could say “I started FooApp, which put up a warning window. I tried to close the warning window, and FooApp crashed.” This is longer and more repetitive, but also clearer and less easy to misunderstand.</li>
<li>Read what you wrote. Read the report back to yourself, and see if you think it’s clear. If you have listed a sequence of actions which should produce the failure, try following them yourself, to see if you missed a step.</li>
</ul>
<h2 id="summary">Summary</h2>
<ul>
<li>The first aim of a bug report is to let the programmer see the failure with their own eyes. If you can’t be with them to make it fail in front of them, give them detailed instructions so that they can make it fail for themselves.</li>
<li>In case the first aim doesn’t succeed, and the programmer can’t see it failing themselves, the second aim of a bug report is to describe what went wrong. Describe everything in detail. State what you saw, and also state what you expected to see. Write down the error messages, especially if they have numbers in.</li>
<li>When your computer does something unexpected, freeze. Do nothing until you’re calm, and don’t do anything that you think might be dangerous.</li>
<li>By all means try to diagnose the fault yourself if you think you can, but if you do, you should still report the symptoms as well.</li>
<li>Be ready to provide extra information if the programmer needs it. If they didn’t need it, they wouldn’t be asking for it. They aren’t being deliberately awkward. Have version numbers at your fingertips, because they will probably be needed.</li>
<li>Write clearly. Say what you mean, and make sure it can’t be misinterpreted.</li>
<li>Above all, be precise. Programmers like precision.</li>
</ul>
<p>Please comment your views/suggestions below.</p>
<hr />
<h1 id="how-to-prevent-web-scraping"><a href="http://suniltatipelly.in/how-to-prevent-scraping">How to Prevent Web Scraping</a></h1>
<p><em>Published 2016-10-14</em></p>
<h1 id="a-guide-to-prevent-webscraping">A guide to prevent Webscraping</h1>
<p><strong>(Or at least making it harder)</strong></p>
<hr />
<p><strong>Essentially, hindering scraping means that you need to make it difficult for scripts and machines to get the wanted data from your website, while not making it difficult for real users and search engines</strong>.</p>
<p>Unfortunately this is hard, and you will need to make trade-offs between preventing scraping and degrading the accessibility for real users and search engines.</p>
<p>In order to hinder scraping (also known as <em>Webscraping</em>, <em>Screenscraping</em>, <em>web data mining</em>, <em>web harvesting</em>, or <em>web data extraction</em>), it helps to know how these scrapers work, and what prevents them from working well, and this is what this answer is about.</p>
<p>Generally, these scraper programs are written in order to extract specific information from your site, such as articles, search results, product details, or in your case, artist and album information. Usually, people scrape websites for specific data in order to reuse it on their own site (and make money out of your content!), to build alternative frontends for your site (such as mobile apps), or even just for private research or analysis purposes.</p>
<p>Essentially, there are various types of scraper, and each works differently:</p>
<ul>
<li>
<p>Spiders, such as <a href="http://googlebot.com">Google’s bot</a> or website copiers like <a href="http://www.httrack.com">HTtrack</a>, which visit your website, and recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with a HTML parser to extract the desired data from each page.</p>
</li>
<li>
    <p>Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (regex) to extract the desired data, usually glued together with a shell script. These are the simplest kind of scraper, and also the most fragile kind (<a href="http://example.com">don’t ever try to parse HTML with regex!</a>). These are thus the easiest kind of scraper to break and screw with.</p>
</li>
<li>
<p>HTML scrapers and parsers, such as ones based on <a href="http://jsoup.org">Jsoup</a>, <a href="http://scrapy.org/">Scrapy</a>, and many others. Similar to shell-script regex based ones, these work by extracting data from your pages based on patterns in your HTML, usually ignoring everything else.</p>
<p>So, for example: If your website has a search feature, such a scraper might submit a HTTP request for a search, and then get all the result links and their titles from the results page HTML, sometimes hundreds of times for hundreds of different searches, in order to specifically get only search result links and their titles. These are the most common.</p>
</li>
<li>
<p>Screenscrapers, based on eg. <a href="http://www.seleniumhq.org/">Selenium</a> or <a href="http://phantomjs.org">PhantomJS</a>, which actually open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:</p>
<ul>
<li>
<p>Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using a HTML parser to extract the desired data or text. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.</p>
</li>
<li>
<p>Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.</p>
</li>
</ul>
    <p>Browser-based screenscrapers are harder to deal with, as they run scripts, render HTML, and can behave like a real human browsing your site.</p>
</li>
<li>
    <p>Webscraping services such as <a href="http://scrapinghub.com/">ScrapingHub</a> or <a href="https://www.kimonolabs.com/">Kimono</a>. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use. These services sometimes use large networks of proxies and ever-changing IP addresses to get around limits and blocks, so they are especially problematic.</p>
<p>Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and people who pay them to do so) may not be bothered to scrape your website.</p>
</li>
<li>
<p>Embedding your website in other sites’ pages with <a href="https://en.wikipedia.org/wiki/Framing_(World_Wide_Web)">frames</a>, and embedding your site in mobile apps.</p>
<p>While not technically scraping, this is also a problem, as mobile apps (Android and iOS) can embed your website, and even inject custom CSS and JavaScript, thus completely changing the appearance of your site, and only showing the desired information, such as the article content itself or the list of search results, and hiding things like headers, footers, or ads.</p>
</li>
<li>
<p>Human copy-and-paste: People will copy and paste your content in order to use it elsewhere. Unfortunately, there’s not much you can do about this.</p>
</li>
</ul>
<p>There is a lot of overlap between these different kinds of scraper, and many scrapers will behave similarly, even though they use different technologies and methods to get your content.</p>
<p>This collection of tips is mostly my own ideas, various difficulties that I’ve encountered while writing scrapers, as well as bits of information and ideas from around the interwebs.</p>
<h2 id="how-to-prevent-scraping">How to prevent scraping</h2>
<p>Some general methods to detect and deter scrapers:</p>
<h3 id="monitor-your-logs--traffic-patterns-limit-access-if-you-see-unusual-activity">Monitor your logs & traffic patterns; limit access if you see unusual activity</h3>
<p>Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.</p>
<p>Specifically, some ideas:</p>
<ul>
<li>
<p><strong>Rate limiting:</strong></p>
<p>Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers and make them ineffective. You could also show a captcha if actions are completed faster than a real user plausibly could. (A minimal rate-limiting sketch appears after this list.)</p>
</li>
<li>
<p><strong>Detect unusual activity:</strong></p>
<p>If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.</p>
</li>
<li>
<p><strong>Don’t just monitor & rate limit by IP address - use other indicators too:</strong></p>
<p>If you do block or rate limit, don’t just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:</p>
<ul>
<li>
<p>How fast users fill out forms, and where on a button they click;</p>
</li>
<li>
<p>You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc; you can use this to identify users.</p>
</li>
<li>
<p>HTTP headers and their order, especially the User-Agent.</p>
</li>
</ul>
<p>As an example, if you get many requests from a single IP address, all using the same User-Agent and screen size (determined with JavaScript), and the user (the scraper, in this case) always clicks the button in the same way and at regular intervals, it’s probably a screen scraper. You can then temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address); this way you won’t inconvenience real users on that IP address, eg. in the case of a shared internet connection.</p>
<p>You can also take this further: if you can identify similar requests even though they come from different IP addresses, that is indicative of distributed scraping (a scraper using a botnet or a network of proxies), and you can block them too. Again, be careful not to inadvertently block real users.</p>
<p>This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.</p>
<p>Related questions on Security Stack Exchange:</p>
<ul>
<li>
<p><a href="http://security.stackexchange.com/questions/81302/how-to-uniquely-identify-users-with-the-same-external-ip-address">How to uniquely identify users with the same external IP address?</a> for more details, and</p>
</li>
<li>
<p><a href="http://security.stackexchange.com/questions/96377/why-do-people-use-ip-address-bans-when-ip-addresses-often-change">Why do people use IP address bans when IP addresses often change?</a> for info on the limits of these methods.</p>
</li>
</ul>
</li>
<li>
<p><strong>Instead of temporarily blocking access, use a Captcha:</strong></p>
<p>The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time, however using a Captcha may be better, see the section on Captchas further down.</p>
</li>
</ul>
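<p>To make the rate-limiting idea concrete, here is a minimal sketch in Python, assuming a Flask app. The <code class="language-plaintext highlighter-rouge">X-Screen</code> header is a hypothetical value your own JavaScript would report, and the thresholds are placeholders to tune for your site:</p>
<pre><code class="language-python"># Minimal sliding-window rate limiter (sketch, not production code).
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 10   # look at the last 10 seconds of activity
MAX_REQUESTS = 5      # allow at most 5 searches per window

hits = defaultdict(deque)  # key -> timestamps of recent requests

def client_key():
    # Combine several indicators, not just the IP address.
    return (
        request.remote_addr,
        request.headers.get("User-Agent", ""),
        request.headers.get("X-Screen", ""),  # hypothetical, set by your JS
    )

@app.route("/search")
def search():
    window = hits[client_key()]
    now = time.time()
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # forget requests outside the window
    window.append(now)
    if len(window) > MAX_REQUESTS:
        abort(429)  # better: redirect to a captcha instead of a hard block
    return "search results here"
</code></pre>
<p>Because the key combines the IP address with other indicators, one scraper on a shared connection doesn’t get every user of that connection blocked.</p>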
<h3 id="require-registration--login">Require registration & login</h3>
<p>Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.</p>
<ul>
<li>If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.</li>
</ul>
<p>In order to avoid scripts creating many accounts, you should:</p>
<ul>
<li>
<p>Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address. (A sketch of this step follows the list.)</p>
</li>
<li>
<p>Require a captcha to be solved during registration / account creation, again to prevent scripts from creating accounts.</p>
</li>
</ul>
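<p>A minimal sketch of the email-verification step, using in-memory dicts in place of a real database; the activation URL and <code class="language-plaintext highlighter-rouge">example.com</code> domain are placeholders:</p>
<pre><code class="language-python"># Sketch: one account per email address, inactive until the emailed
# activation token is visited.
import secrets

pending = {}    # token -> email awaiting verification
accounts = {}   # email -> {"active": bool}

def register(email):
    if email in accounts:
        raise ValueError("Only one account per email address.")
    token = secrets.token_urlsafe(32)  # unguessable activation token
    pending[token] = email
    accounts[email] = {"active": False}
    # A real app would email this link rather than return it:
    return "https://example.com/activate?token=" + token

def activate(token):
    email = pending.pop(token, None)
    if email is None:
        return False  # unknown or already-used token
    accounts[email]["active"] = True
    return True
</code></pre>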
<p>Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.</p>
<h3 id="block-access-from-cloud-hosting-and-scraping-service-ip-addresses">Block access from cloud hosting and scraping service IP addresses</h3>
<p>Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or Google App Engine, or from VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services. You can also block access from IP addresses used by scraping services.</p>
<p>Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.</p>
<p>Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.</p>
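<p>As a sketch of how such a check could work: AWS publishes its IP ranges as JSON, and Python’s standard-library <code class="language-plaintext highlighter-rouge">ipaddress</code> module can test membership. Loading the list once at startup is a simplification; in practice you would cache and refresh it, and add other providers’ lists:</p>
<pre><code class="language-python"># Sketch: flag requests coming from cloud-provider IP ranges.
import ipaddress

import requests

def load_aws_networks():
    data = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json").json()
    return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

AWS_NETWORKS = load_aws_networks()

def is_cloud_ip(addr):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in AWS_NETWORKS)

# Inside a request handler you might then do:
#     if is_cloud_ip(request.remote_addr):
#         show_captcha()   # or rate limit, or serve fake data
</code></pre>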
<h3 id="make-your-error-message-nondescript-if-you-do-block">Make your error message nondescript if you do block</h3>
<p>If you do block / limit access, you should ensure that you don’t tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:</p>
<ul>
<li>
<p>Too many requests from your IP address, please try again later.</p>
</li>
<li>
<p>Error, User Agent header not present !</p>
</li>
</ul>
<p>Instead, show a friendly error message that doesn’t tell the scraper what caused it. Something like this is much better:</p>
<ul>
<li>Sorry, something went wrong. You can contact support via <code class="language-plaintext highlighter-rouge">helpdesk@example.com</code>, should the problem persist.</li>
</ul>
<p>This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don’t block and thus cause legitimate users to contact you.</p>
<h3 id="use-captchas-if-you-suspect-that-your-website-is-being-accessed-by-a-scraper">Use Captchas if you suspect that your website is being accessed by a scraper.</h3>
<p>Captchas (“Completely Automated Public Turing test to tell Computers and Humans Apart”) are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.</p>
<p>As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn’t a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.</p>
<p>Things to be aware of when using Captchas:</p>
<ul>
<li>
<p>Don’t roll your own, use something like Google’s <a href="https://www.google.com/recaptcha/intro/index.html">reCaptcha</a>: It’s a lot easier than implementing a captcha yourself, it’s more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it’s also a lot harder for a scripter to solve than a simple image served from your site. (A server-side verification sketch follows this list.)</p>
</li>
<li>
<p>Don’t include the solution to the captcha in the HTML markup: I’ve actually seen one website which had the solution for the captcha <em>in the page itself</em> (although quite well hidden), thus making it pretty useless. Don’t do something like this. Again, use a service like reCaptcha, and you won’t have this kind of problem (if you use it properly).</p>
</li>
<li>
<p>Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.</p>
</li>
</ul>
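<p>For reference, verifying a reCaptcha response server-side is a single POST to Google’s <code class="language-plaintext highlighter-rouge">siteverify</code> endpoint; a minimal sketch, with the secret key as a placeholder:</p>
<pre><code class="language-python"># Sketch: confirm a submitted g-recaptcha-response with Google before
# serving the protected content.
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder

def captcha_passed(recaptcha_response, client_ip):
    result = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={
            "secret": RECAPTCHA_SECRET,
            "response": recaptcha_response,
            "remoteip": client_ip,
        },
    ).json()
    return result.get("success", False)
</code></pre>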
<h3 id="serve-your-text-content-as-an-image">Serve your text content as an image</h3>
<p>You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.</p>
<p>However, this is bad for screen readers, search engines, performance, and pretty much everything else. It’s also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it’s also easy to circumvent with some OCR, so don’t do it.</p>
<p>You can do something similar with CSS sprites, but that suffers from the same problems.</p>
<h3 id="dont-expose-your-complete-dataset">Don’t expose your complete dataset:</h3>
<p>If feasible, don’t provide a way for a script / bot to get all of your dataset. As an example: You have a news site with lots of individual articles. You could make those articles accessible only by searching for them via the on-site search, and, if you don’t have a list of <em>all</em> the articles on the site and their URLs anywhere, those articles will only be reachable through the search feature. This means that a script wanting to get all the articles off your site will have to search for every possible phrase which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.</p>
<p>This will be ineffective if:</p>
<ul>
<li>The bot / script does not want / need the full dataset anyway.</li>
<li>Your articles are served from a URL which looks something like <code class="language-plaintext highlighter-rouge">example.com/article.php?articleId=12345</code>. This (and similar schemes) will allow scrapers to simply iterate over all the <code class="language-plaintext highlighter-rouge">articleId</code>s and request all the articles that way (the sketch after this list shows just how little code that takes).</li>
<li>There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.</li>
<li>Searching for something like “and” or “the” can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results).</li>
<li>You need search engines to find your content.</li>
</ul>
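<p>To see why guessable URLs undermine this approach, here is essentially the entire scraper that a sequential <code class="language-plaintext highlighter-rouge">articleId</code> scheme enables - a sketch using the hypothetical URL pattern from the list above:</p>
<pre><code class="language-python"># The whole "scraper" that sequential article IDs make possible:
# just walk the ID space.
import requests

def save_article(article_id, html):
    with open("article-%d.html" % article_id, "w", encoding="utf-8") as f:
        f.write(html)

for article_id in range(1, 100000):
    r = requests.get("https://example.com/article.php",
                     params={"articleId": article_id})
    if r.status_code == 200:
        save_article(article_id, r.text)
</code></pre>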
<h3 id="dont-expose-your-apis-endpoints-and-similar-things">Don’t expose your APIs, endpoints, and similar things:</h3>
<p>Make sure you don’t expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data, it is trivial to look at the network requests from the page, figure out where those requests are going, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described in the obfuscation section below.</p>
<h2 id="to-deter-html-parsers-and-scrapers">To deter HTML parsers and scrapers:</h2>
<p>Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers too.</p>
<h3 id="frequently-change-your-html">Frequently change your HTML</h3>
<p>Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a <code class="language-plaintext highlighter-rouge">div</code> with an id of <code class="language-plaintext highlighter-rouge">article-content</code>, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site and extract the content text of the <code class="language-plaintext highlighter-rouge">article-content</code> div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.</p>
<p>If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.</p>
<ul>
<li>
<p>You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your <code class="language-plaintext highlighter-rouge">div.article-content</code> becomes something like <code class="language-plaintext highlighter-rouge">div.a4c36dda13eaf0</code>, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use <code class="language-plaintext highlighter-rouge">div.[any-14-characters]</code> to find the desired div instead. Beware of other similar holes too. (A sketch of automated rotation appears at the end of this section.)</p>
</li>
<li>
<p>If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every <code class="language-plaintext highlighter-rouge">div</code> inside a <code class="language-plaintext highlighter-rouge">div</code> which comes after a <code class="language-plaintext highlighter-rouge">h1</code> is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup to your HTML, periodically and randomly, eg. adding extra <code class="language-plaintext highlighter-rouge">div</code>s or <code class="language-plaintext highlighter-rouge">span</code>s. With modern server side HTML processing, this should not be too hard.</p>
</li>
</ul>
<p>Things to be aware of:</p>
<ul>
<li>
<p>It will be tedious and difficult to implement, maintain, and debug.</p>
</li>
<li>
<p>You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.</p>
</li>
<li>
<p>Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. <a href="https://code.google.com/archive/p/boilerpipe/">Boilerpipe</a> does exactly this.</p>
</li>
</ul>
<p>Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.</p>
<p>See also <a href="http://stackoverflow.com/questions/30361740/">How to prevent crawlers depending on XPath from getting page contents</a> for details on how this can be implemented in PHP.</p>
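<p>One way to automate the rotation, sketched below: derive class names from a secret plus the current week number, and call the same function when generating both your HTML and your CSS so they always stay in sync. Varying the length as well avoids the fixed-pattern hole mentioned above. The secret here is a placeholder:</p>
<pre><code class="language-python"># Sketch: CSS class names that rotate (and change length) every week.
import datetime
import hashlib

SECRET = "change-me"  # placeholder, keep this out of version control

def weekly_class(base):
    week = datetime.date.today().isocalendar()[1]
    digest = hashlib.sha256(
        ("%s:%s:%d" % (SECRET, base, week)).encode()
    ).hexdigest()
    # Vary the length too, so scrapers can't match on a fixed pattern.
    return "c" + digest[: 10 + week % 5]

# Use weekly_class("article-content") in both templates and the
# generated stylesheet; both rotate together every week.
print(weekly_class("article-content"))  # e.g. "c4c36dda13e" this week
</code></pre>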
<h3 id="change-your-html-based-on-the-users-location">Change your HTML based on the user’s location</h3>
<p>This is sort of similar to the previous tip. If you serve different HTML based on your user’s location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it’s actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.</p>
<h3 id="frequently-change-your-html-actively-screw-with-the-scrapers-by-doing-so-">Frequently change your HTML, actively screw with the scrapers by doing so !</h3>
<p>An example: You have a search feature on your website, located at <code class="language-plaintext highlighter-rouge">example.com/search?query=somesearchquery</code>, which returns the following HTML:</p>
<pre><code class="language-HTML"><div class="search-result">
<h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class"search-result-link" href="/stories/stack-overflow-has-become-the-most-popular">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)
</code></pre>
<p>As you may have guessed, this is easy to scrape: all a scraper needs to do is hit the search URL with a query and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also <strong>leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper.</strong> Here’s how the search results page could be changed:</p>
<pre><code class="language-HTML"><div class="the-real-search-result">
<h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class"the-real-search-result-link" href="/stories/stack-overflow-has-become-the-most-popular">Read more</a>
</div>
<div class="search-result" style="display:none">
<h3 class="search-result-title">Visit example.com now, for all the latest Stack Overflow related news !</h3>
<p class="search-result-excerpt">EXAMPLE.COM IS SO AWESOME, VISIT NOW! (Real users of your site will never see this, only the scrapers will.)</p>
<a class"search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)
</code></pre>
<p>This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they’re hidden with CSS.</p>
<h3 id="screw-with-the-scraper-insert-fake-invisible-honeypot-data-into-your-page">Screw with the scraper: Insert fake, invisible honeypot data into your page</h3>
<p>Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:</p>
<pre><code class="language-HTML"><div class="search-result" style="display:none">
<h3 class="search-result-title">This search result is here to prevent scraping</h3>
<p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
Note that clicking the link below will block access to this site for 24 hours.</p>
<a class"search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)
</code></pre>
<p>A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won’t visit the link. A genuine and desirable spider such as Google’s will not visit the link either, because you disallowed <code class="language-plaintext highlighter-rouge">/scrapertrap/</code> in your robots.txt (don’t forget this!).</p>
<p>You can make your <code class="language-plaintext highlighter-rouge">scrapertrap.php</code> do something like block access for the IP address that visited it, or force a captcha for all subsequent requests from that IP. (A Python equivalent is sketched after the list below.)</p>
<ul>
<li>
<p>Don’t forget to disallow your honeypot (<code class="language-plaintext highlighter-rouge">/scrapertrap/</code>) in your robots.txt file so that search engine bots don’t fall into it.</p>
</li>
<li>
<p>You can / should combine this with the previous tip of changing your HTML frequently.</p>
</li>
<li>
<p>Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. You may also want to consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a <code class="language-plaintext highlighter-rouge">style</code> attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.</p>
</li>
<li>
<p>Beware that malicious people can post something like <code class="language-plaintext highlighter-rouge">[img]http://yoursite.com/scrapertrap/scrapertrap.php[/img]</code> on a forum (or elsewhere), and thus DoS legitimate users when they visit that forum and their browsers hit your honeypot URL. Thus, the previous tip of changing the URL is doubly important, and you could also check the Referer.</p>
</li>
</ul>
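<p>The <code class="language-plaintext highlighter-rouge">scrapertrap.php</code> above could be any server-side handler; here is a hypothetical Python / Flask equivalent, including the Referer check suggested in the last point (the <code class="language-plaintext highlighter-rouge">example.com</code> domain is a placeholder):</p>
<pre><code class="language-python"># Sketch of a honeypot handler: visiting the trap link blocks that IP
# for 24 hours. Remember: Disallow /scrapertrap/ in robots.txt.
import time

from flask import Flask, abort, request

app = Flask(__name__)

blocked = {}  # ip -> unix time at which the block expires
BLOCK_SECONDS = 24 * 3600

@app.before_request
def enforce_block():
    if time.time() < blocked.get(request.remote_addr, 0):
        abort(403)  # or show a captcha instead of a hard block

@app.route("/scrapertrap/scrapertrap.php")  # path mirrors the example above
def scraper_trap():
    # The Referer check guards against hotlinked [img] abuse (see above).
    if "example.com" not in request.headers.get("Referer", ""):
        return "", 204
    blocked[request.remote_addr] = time.time() + BLOCK_SECONDS
    return "", 204
</code></pre>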
<h3 id="serve-fake-and-useless-data-if-you-detect-a-scraper">Serve fake and useless data if you detect a scraper</h3>
<p>If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don’t know that they’re being screwed with.</p>
<p>As an example: on a news website, if you detect a scraper, instead of blocking access, just serve up fake, <a href="https://en.wikipedia.org/wiki/Markov_chain#Markov_text_generators">randomly generated</a> articles; this will poison the data the scraper gets. If you make your fake data or articles indistinguishable from the real thing, you’ll make it hard for scrapers to get what they want, namely the actual, real articles.</p>
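<p>A Markov text generator is only a few lines; a minimal sketch which you would seed with your real articles so the fakes look plausible (the seed file name is an assumption):</p>
<pre><code class="language-python"># Minimal Markov-chain text generator: serve its output to detected
# scrapers in place of genuine content.
import random
from collections import defaultdict

def build_chain(text):
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, length=60):
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        word = random.choice(chain[word] or list(chain))
        output.append(word)
    return " ".join(output)

corpus = open("real-articles.txt", encoding="utf-8").read()  # assumed seed
print(generate(build_chain(corpus)))
</code></pre>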
<h3 id="dont-accept-requests-if-the-user-agent-is-empty--missing">Don’t accept requests if the User Agent is empty / missing</h3>
<p>Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.</p>
<p>If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)</p>
<p>It’s trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.</p>
<h3 id="dont-accept-requests-if-the-user-agent-is-a-common-scraper-one-blacklist-ones-used-by-scrapers">Don’t accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers</h3>
<p>In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:</p>
<ul>
<li>“Mozilla” (Just that, nothing else. I’ve seen a few scrapers using just that. A real browser will never use only that)</li>
<li>“Java 1.7.43_u43” (By default, Java’s HttpURLConnection uses something like this.)</li>
<li>“BIZCO EasyScraping Studio 2.0”</li>
<li>“wget”, “curl”, “libcurl”, … (Wget and cURL are sometimes used for basic scraping)</li>
</ul>
<p>If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.</p>
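<p>A sketch combining this tip with the previous one, assuming Flask; the blacklist entries are the lazy User-Agent strings listed above:</p>
<pre><code class="language-python"># Sketch: reject requests with a missing or known-scraper User-Agent.
from flask import Flask, abort, request

app = Flask(__name__)

# Exact lazy UAs; real browsers send e.g. "Mozilla/5.0 (...)", never "Mozilla".
UA_BLACKLIST = {"mozilla", "wget", "curl", "libcurl"}

@app.before_request
def filter_user_agent():
    ua = request.headers.get("User-Agent", "").strip().lower()
    if not ua:
        abort(403)  # no User-Agent at all: no real browser does this
    if ua in UA_BLACKLIST or ua.startswith("java"):
        abort(403)  # or show a captcha, or serve fake data instead
</code></pre>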
<h3 id="check-the-referer-header">Check the Referer header</h3>
<p>Adding on to the previous item, you can also check the <a href="https://en.wikipedia.org/wiki/HTTP_referer">Referer</a> header (yes, it’s Referer, not Referrer), as lazily written scrapers may not send it, or always send the same thing (sometimes “google.com”). As an example, if the user comes to an article page from an on-site search results page, check that the Referer header is present and points to that search results page.</p>
<p>Beware that:</p>
<ul>
<li>
<p>Real browsers don’t always send it either;</p>
</li>
<li>
<p>It’s trivial to spoof.</p>
</li>
</ul>
<p>Again, as an additional measure against poorly written scrapers it may be worth implementing.</p>
<h3 id="if-it-doesnt-request-assets-css-images-its-not-a-real-browser">If it doesn’t request assets (CSS, images), it’s not a real browser.</h3>
<p>A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won’t, as they are only interested in the actual pages and their content.</p>
<p>You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.</p>
<p>Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.</p>
<h3 id="use-and-require-cookies-use-them-to-track-user-and-scraper-actions">Use and require cookies; use them to track user and scraper actions.</h3>
<p>You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers, however it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.</p>
<p>For example: when the user performs a search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it’s probably a scraper. (A sketch of this follows below.)</p>
<p>Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.</p>
<p>Note that if you use JavaScript to set and retrieve the cookie, you’ll block scrapers which don’t run JavaScript, since they can’t retrieve and send the cookie with their request.</p>
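<p>A sketch of the search-cookie idea above, assuming Flask; the threshold of 20 opened results is an arbitrary placeholder:</p>
<pre><code class="language-python"># Sketch: tag each search with a cookie, then count how many result
# pages that cookie opens; opening nearly all of them smells like a bot.
import uuid
from collections import defaultdict

from flask import Flask, make_response, request

app = Flask(__name__)

views = defaultdict(int)  # search cookie -> result pages opened

@app.route("/search")
def search():
    resp = make_response("results page here")
    resp.set_cookie("search_id", uuid.uuid4().hex)  # unique per search
    return resp

@app.route("/result/<int:result_id>")
def result(result_id):
    sid = request.cookies.get("search_id")
    if sid:
        views[sid] += 1
        if views[sid] > 20:  # opened (nearly) every result: likely a scraper
            return "please solve this captcha", 429
    return "result %d" % result_id
</code></pre>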
<h3 id="use-javascript--ajax-to-load-your-content">Use JavaScript + Ajax to load your content</h3>
<p>You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.</p>
<p>Be aware of:</p>
<ul>
<li>
<p>Using JavaScript to load the actual content will degrade user experience and performance.</p>
</li>
<li>
<p>Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.</p>
</li>
<li>
<p>A programmer writing a scraper who knows what they’re doing can discover the endpoints where the content is loaded from and use them.</p>
</li>
</ul>
<h3 id="obfuscate-your-markup-network-requests-from-scripts-and-everything-else">Obfuscate your markup, network requests from scripts, and everything else.</h3>
<p>If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64, or more complex with multiple layers of obfuscation, bit-shifting, and maybe even encryption), and then decode and display it on the client after fetching via Ajax. This means that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm. (A server-side sketch follows the list below.)</p>
<ul>
<li>
<p>If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, eg. by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.</p>
</li>
<li>
<p>You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using a HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).</p>
</li>
<li>
<p>You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.</p>
</li>
</ul>
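<p>A server-side sketch combining the two ideas above - a page-embedded session key plus a base64-obfuscated payload - assuming Flask; your page’s JavaScript would send the key back with the Ajax call and decode the blob with <code class="language-plaintext highlighter-rouge">atob()</code> before displaying it:</p>
<pre><code class="language-python"># Sketch: the /data endpoint is useless without loading the page first,
# and its payload is base64-obfuscated rather than plain JSON.
import base64
import json
import secrets

from flask import Flask, abort, jsonify, render_template_string, request

app = Flask(__name__)

valid_keys = set()

@app.route("/")
def page():
    key = secrets.token_urlsafe(16)
    valid_keys.add(key)
    # The key is embedded in the page for your JS to send back later.
    return render_template_string(
        "<script>var sessionKey = {{ key|tojson }};</script>", key=key
    )

@app.route("/data")
def data():
    if request.args.get("key") not in valid_keys:
        abort(403)  # caller never loaded the page: likely a scraper
    payload = json.dumps({"articles": ["..."]})  # placeholder content
    return jsonify({"blob": base64.b64encode(payload.encode()).decode()})
</code></pre>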
<p>There are several disadvantages to doing something like this, though:</p>
<ul>
<li>
<p>It will be tedious and difficult to implement, maintain, and debug.</p>
</li>
<li>
<p><strong>It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data</strong>. (Most simple HTML parsers don’t run JavaScript though)</p>
</li>
<li>
<p>It will make your site nonfunctional for real users if they have JavaScript disabled.</p>
</li>
<li>
<p>Performance and page-load times will suffer.</p>
</li>
</ul>
<h2 id="non-technical">Non-Technical:</h2>
<h3 id="your-hosting-provider-may-provide-bot---and-scraper-protection">Your hosting provider may provide bot - and scraper protection:</h3>
<p>For example, CloudFlare provides anti-bot and anti-scraping protection which you just need to enable, and so does AWS. There is also mod_evasive, an Apache module which lets you implement rate-limiting easily.</p>
<h3 id="tell-people-not-to-scrape-and-some-will-respect-it">Tell people not to scrape, and some will respect it</h3>
<p>You should tell people not to scrape your site, eg. in your Terms of Service. Some people will actually respect that, and won’t scrape data from your website without permission.</p>
<h3 id="find-a-lawyer">Find a lawyer</h3>
<p>They know how to deal with copyright infringement, and can send a cease-and-desist letter. The DMCA is also helpful in this regard.</p>
<p>This is the approach Stack Overflow and Stack Exchange use.</p>
<h3 id="make-your-data-available-provide-an-api">Make your data available, provide an API:</h3>
<p>This may seem counterproductive, but you could make your data easily available and require attribution and a link back to your site. Maybe even charge $$$ for it.</p>
<p>Again, Stack Exchange provides an API, but with attribution required.</p>
<h2 id="miscellaneous">Miscellaneous:</h2>
<ul>
<li>
<p>Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, so you will need to find compromises.</p>
</li>
<li>
<p>Don’t forget your mobile site and apps: If you have a mobile version of your site, beware that scrapers can also scrape that. If you have a mobile app, that can be screen scraped too, and network traffic can be inspected to figure out the REST endpoints it uses.</p>
</li>
<li>
<p>If you serve a special version of your site for specific browsers, eg. a cut-down version for older versions of Internet Explorer, don’t forget that scrapers can scrape that, too.</p>
</li>
<li>
<p>Use these tips in combination, pick what works best for you.</p>
</li>
<li>
<p>Scrapers can scrape other scrapers: If there is one website which shows content scraped from your website, other scrapers can scrape from that scraper’s website.</p>
</li>
</ul>
<h2 id="whats-the-most-effective-way-">What’s the most effective way ?</h2>
<p>In my experience of writing scrapers and helping people write scrapers on Stack Overflow, the most effective methods are:</p>
<ul>
<li>
<p>Changing the HTML markup frequently</p>
</li>
<li>
<p>Honeypots and fake data</p>
</li>
<li>
<p>Using obfuscated JavaScript, AJAX, and Cookies</p>
</li>
<li>
<p>Rate limiting and scraper detection and subsequent blocking.</p>
</li>
</ul>
<h2 id="further-reading">Further reading:</h2>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Web_scraping">Wikipedia’s article on Web scraping</a>. Many details on the technologies involved and the different types of web scraper, general information on how webscraping is done, as well as a look at the legalities of scraping.</li>
</ul>
<p><strong>Good luck on the perilous journey of protecting your content…</strong></p>
<p>Please comment your views/suggestions below.</p>suniltatipellyA guide to prevent WebscrapingChai Bisket Articles Downloader2016-04-03T22:50:00+00:002016-04-03T22:50:00+00:00http://suniltatipelly.in/chai-bisket-articles-downloader<p>I want to make this blog short. Being a regular follower of Chai-Bisket, I always wanted to have an offline copy of their articles.</p>
<p>Initially I thought of writing a Python script to download them, but later changed my mind and made a Chrome extension which creates a “Download” button in the articles.</p>
<p><img class="image" src="http://suniltatipelly.in/assets/images/posts/chaibisket1.png" alt="Alt Text" /></p>
<hr />
<h3 id="source-on-github--chai-bisket-articles-downloader">Source on Github : <a href="https://github.com/Sunil02324/Chai-Bisket-Articles-Downloader">Chai Bisket Articles Downloader</a></h3>
<p>Please comment your views/suggestions below.</p>suniltatipellyI want to make this blog short. Being a regular follower of Chai-Bisket, I always wanted to have an offline copy of their articles.Samosa Clips Auto Downloader2016-03-04T22:48:00+00:002016-03-04T22:48:00+00:00http://suniltatipelly.in/samosa-clips-downloader<p>Recently, while scrolling through my news feed on Facebook, an article on <a href="http://getsamosa.com/">Samosa App</a> in <a href="http://chaibisket.com/samosa-hyderabad-startup/">Chai Bisket</a> came to my notice. I went through the article, found the concept of their startup very interesting, and opened their website to see how it actually works.</p>
<p>I was surprised to see their huge collection of movie dialogues, ringtones and many others.</p>
<p>Seeing that, I got a very bad idea: downloading all those clips to my laptop. Without any delay, I started to look at their source code, and to my surprise they were using JSON requests to render those clips from the server. That made my efforts even easier.</p>
<p>Then I started writing the Python script to download those clips from their site. In each request, I was able to download 15 clips.</p>
<p>To avoid any unnecessary attention and blockage, I made the script run every 10 seconds. I kept the script running for 2 hours and was able to download nearly 1500 clips from their site.</p>
<h3 id="snapshots">Snapshots:</h3>
<p><img class="image" src="http://suniltatipelly.in/assets/images/posts/samosa1.png" alt="Alt Text" />
<img class="image" src="http://suniltatipelly.in/assets/images/posts/samosa2.png" alt="Alt Text" /></p>
<h3 id="steps-to-avoid-them">Steps to avoid them:</h3>
<ul>
<li>Using authorisation or access token for requests</li>
<li>Limiting the number of requests from each user</li>
<li>Blocking unauthorised requests</li>
</ul>
<hr />
<h3 id="code">Code:</h3>
<script src="https://gist.github.com/Sunil02324/86a1843e7dedb4b56373.js"></script>
<hr />
<h3 id="source-on-github--samosa-clips-auto-downloader">Source on Github : <a href="https://github.com/Sunil02324/Samosa-Clips-Auto-Downloader">Samosa Clips Auto Downloader</a></h3>
<p>Please comment your views/suggestions below.</p>suniltatipellyRecently, while scrolling through my news feed on Facebook, an article on Samosa App in Chai Bisket came to my notice. I went through the article, found the concept of their startup very interesting, and opened their website to see how it actually works.Instagram Image Downloader2016-02-21T22:50:00+00:002016-02-21T22:50:00+00:00http://suniltatipelly.in/instagram-image-downloader<p>My friend has a very bad habit of stalking people on Instagram. You can find him online on Instagram all the time. One day he came to me and asked me to write code to download images from Instagram.</p>
<p>I went through the source code of Instagram and saw that images are loaded the same way on all pages. So I used BeautifulSoup (bs4) in Python to extract the image URL from a given post URL. I also added some functionality, like downloading images from multiple URLs at a time and specifying the download location.</p>
<hr />
<h3 id="usage">Usage:</h3>
<p>insta.py [OPTION] [URL]</p>
<h3 id="options">Options:</h3>
<p>-u [Instagram URL] Download single photo from Instagram URL</p>
<p>-f [File path] Download Instagram photo(s) using file list</p>
<p>-h, --help Show this help message</p>
<h3 id="example">Example:</h3>
<p>python insta.py -u https://instagram.com/p/xxxxx</p>
<p>python insta.py -f /home/username/filelist.txt</p>
<hr />
<h3 id="code">Code:</h3>
<script src="https://gist.github.com/Sunil02324/91d002d433e56c726a9f.js"></script>
<hr />
<h3 id="source-on-github--instagram-image-downloader">Source on Github : <a href="https://github.com/Sunil02324/Instagram-Image-Downloader">Instagram Image Downloader</a></h3>
<p>Please comment your views/suggestions below.</p>suniltatipellyMy friend has a very bad habit of stalking people on Instagram. You can find him online on Instagram all the time. One day he came to me and asked me to write code to download images from Instagram.