So you need to extract some information from a website, but the website doesn’t provide an API or any other mechanism to access that information programmatically. Doing it manually (i.e. copy and paste) would be tedious. This is where web scraping tools and techniques come in. They let you transform unstructured data on the web into structured data that can be stored and analyzed in a database or spreadsheet. Web scraping can also be used to automate searching, accessing and aggregating specific kinds of information from otherwise disparate sources (mashups). It is conceptually similar to the web crawling and indexing techniques used by most search engines. Here I’ll focus on scraping websites using a Python package called Scrapy.
Scrapy is an application framework for crawling websites and extracting structured data. It provides built-in features that make parsing web documents painless (some familiarity with XPath, which is used to navigate through markup elements and attributes, will help you get up and running with Scrapy quickly). Using Scrapy, developers can also scale their applications by leveraging the following features:
-Abstractions: asynchronous network I/O, data processing pipelines
-HTTP/web: retries, throttling, backoff, concurrency, cookie/form handling
-Infrastructure: crawling queues, monitoring
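As a rough illustration of the HTTP features listed above, here is a sketch of how some of them might be tuned in a Scrapy project's settings.py. The setting names are Scrapy's own, but the values are arbitrary assumptions chosen for illustration, and whether a given setting (AutoThrottle in particular) is available depends on the Scrapy version you install.

```python
# settings.py (sketch): the values below are illustrative assumptions,
# not recommendations.

# Concurrency: upper bound on requests Scrapy performs in parallel.
CONCURRENT_REQUESTS = 8

# Throttling: seconds to wait between requests to the same site.
DOWNLOAD_DELAY = 0.5

# Retries: re-issue failed requests up to RETRY_TIMES additional times.
RETRY_ENABLED = True
RETRY_TIMES = 2

# Cookie handling: keep cookies between requests, like a browser session.
COOKIES_ENABLED = True

# Backoff: let the AutoThrottle extension adapt the delay to server load
# (assumes a Scrapy version that ships this extension).
AUTOTHROTTLE_ENABLED = True
```

Lower concurrency and a non-zero delay make a crawl slower but gentler on the target site, which is usually the safer starting point.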
Installing Scrapy (on Windows 7)
The installation steps assume that you have Python 2.7 and pip (the Python package manager) installed.
1. Install OpenSSL
Download and install the Visual C++ 2008 redistributables and the OpenSSL build that matches your Windows architecture (I use the regular OpenSSL 1.0.1e 32-bit build; do not use the light one) from http://slproweb.com/products/Win32OpenSSL.html. Add OpenSSL to the system path by adding its bin directory (e.g. C:\OpenSSL-Win32\bin) to the PATH environment variable from the Control Panel.
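Once the PATH variable has been updated, a quick way to confirm the change is to check it from Python in a newly opened command prompt. The helper below is a minimal sketch: dir_on_path is a hypothetical name, and the directory is the example location mentioned above, so adjust it if you installed OpenSSL elsewhere.

```python
import os


def dir_on_path(directory, path_value=None, sep=os.pathsep):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def norm(p):
        # Ignore surrounding whitespace/quotes, trailing slashes and letter
        # case (Windows paths are case-insensitive).
        return p.strip().strip('"').rstrip("\\/").lower()

    wanted = norm(directory)
    return any(norm(entry) == wanted
               for entry in path_value.split(sep) if entry.strip())


if __name__ == "__main__":
    # Assumed install location; change it to match your system.
    print(dir_on_path("C:\\OpenSSL-Win32\\bin"))
```

Changes made in the Control Panel only apply to command prompts opened afterwards, so run the check from a new console window.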
2. Install Scrapy dependency components
Download and install the following precompiled Windows binary packages for Python from http://www.lfd.uci.edu/~gohlke/pythonlibs/:
pywin32, Twisted, zope.interface, lxml, pyOpenSSL
Note: You may encounter a ‘Python Version 2.7 required which was not found in the registry’ error message when installing the components above. This is a known issue with the Python Windows installer, which can be easily fixed using the following registry patch: http://stackoverflow.com/a/9131949. If you encounter an “unable to find vcvarsall.bat” error message, re-check that all components have been successfully installed. Also make sure that you consistently use either 32-bit or 64-bit installation packages.
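To re-check the components, one option is a small script that tries to import each of them and reports whether the interpreter is 32-bit or 64-bit. This is only a sketch: missing_modules and python_bits are hypothetical helper names, and the import names in the list are my assumption of how each package above is imported (they differ from the package names in two cases).

```python
import importlib
import struct

# Import names for the packages listed above. pywin32 is imported as
# win32api and pyOpenSSL as OpenSSL; the others match their package names.
DEPENDENCIES = ["win32api", "twisted", "zope.interface", "lxml", "OpenSSL"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def python_bits():
    """Return 32 or 64, based on the running interpreter's pointer size."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Python is %d-bit" % python_bits())
    print("Missing modules: %s" % (missing_modules(DEPENDENCIES) or "none"))
```

If the script reports a 32-bit interpreter, every binary package above should be the 32-bit build as well, and likewise for 64-bit.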
3. Install Scrapy
Launch a command prompt and type “pip install Scrapy”.
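After pip finishes, you can confirm that Scrapy is importable with a short check like the one below. It is a sketch, not part of Scrapy itself: installed_version is a hypothetical helper name, and it relies on the module exposing a __version__ attribute.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Fall back to "unknown" for modules that do not expose __version__.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("Scrapy is not importable; re-check the steps above")
    else:
        print("Scrapy version: %s" % version)
```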