Guest post by Vincent Granville
Tales (http://en.wikipedia.org/wiki/Thales) is a block-tolerant (it works around IP blocking) web scraper (http://en.wikipedia.org/wiki/Web_scraping) that runs on top of AWS and Rackspace. Tales is designed to be easy to deploy, configure, and manage. With Tales you can scrape tens or even hundreds of domains concurrently.
Tales is simple, light, and reliable, and has been tested in production environments scraping more than 200 million URLs.
With Tales you can do web monitoring, research, build aggregators, and more.
Tales is designed to scrape the web continuously, even when the domain being scraped blocks the scraper's server IP; it works around this problem by creating a new server and then moving the scraper to it (failover).
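The failover idea can be sketched roughly as follows. This is a conceptual illustration only, written in Python for brevity; `provision_server` and the `Scraper` class are hypothetical names, not the actual Tales API:

```python
# Conceptual sketch of IP-blocking failover -- not the actual Tales code.
# provision_server() stands in for an AWS/Rackspace call that boots a
# fresh server and returns its IP.
import itertools

_counter = itertools.count(1)


def provision_server():
    """Hypothetical stand-in for a cloud API call returning a new IP."""
    return "10.0.0.%d" % next(_counter)


class Scraper:
    def __init__(self):
        self.ip = provision_server()

    def fetch(self, url, blocked_ips):
        # If the target domain has blocked our current IP, fail over:
        # provision a new server and continue from its fresh IP.
        while self.ip in blocked_ips:
            self.ip = provision_server()
        return "fetched %s from %s" % (url, self.ip)
```

The key point is that blocking is treated as a routine event: the scraper simply keeps moving to fresh IPs instead of stopping.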
It's very easy to code the scraper instructions, called Templates. Once the templates are ready, all you need to do is push the code to git (git push origin), and the live nodes (servers) will grab the code and recompile themselves.
You can also have several git branches with different configurations and templates -- environments. This gives you the ability to run tests on a separate set of servers.
In the dashboard you can also kill processes, delete servers, and look at critical errors.
There is a centralized log database that keeps a record of the activity and errors that happen in the system. The logging system saves the error information, the server where the error occurred, and other useful data.
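A centralized error log of this kind boils down to records like the ones below. This is a minimal sketch; the field names and the `CRITICAL` prefix convention are illustrative assumptions, not the actual Tales schema:

```python
# Minimal sketch of a centralized error log -- field names are
# illustrative assumptions, not the actual Tales schema.
import time

LOG = []  # stand-in for the centralized log database


def log_error(server_ip, error, detail=""):
    LOG.append({
        "timestamp": int(time.time() * 1000),  # when it happened
        "server": server_ip,                   # server where the error occurred
        "error": error,                        # error message
        "detail": detail,                      # other useful data
    })


def critical_errors():
    """What a dashboard query for critical errors might look like."""
    return [e for e in LOG if e["error"].startswith("CRITICAL")]
```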
One of the ideas behind Tales was that it should be able to grab data (scrape), back it up (backup), and then shut down to minimize costs (shutdown).
If you want to continue scraping, you can simply create a new node, run the restore backup class, and start the scraper again.
Data is backed up to AWS S3. The gzip file name carries a timestamp (Date.getTime()), the IP of the server that put the dump there, and the file name. Inside the gzip file there is a plain SQL dump. The idea behind the backups is that you could run map/reduce jobs on those SQL dumps -- I will add support for AWS EMR soon.
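Assuming a simple underscore delimiter, building and parsing such a backup key might look like the sketch below. The delimiter and the `.sql.gz` suffix are assumptions for illustration, not the exact Tales naming scheme:

```python
# Illustrative sketch of an S3 backup key built from a timestamp,
# the uploading server's IP, and the file name. The "_" delimiter
# and ".sql.gz" suffix are assumptions, not the exact Tales format.
import time


def backup_key(server_ip, name, ts=None):
    # Date.getTime()-style milliseconds since the epoch.
    ts = int(time.time() * 1000) if ts is None else ts
    return "%d_%s_%s.sql.gz" % (ts, server_ip, name)


def parse_backup_key(key):
    ts, server_ip, name = key[:-len(".sql.gz")].split("_", 2)
    return int(ts), server_ip, name
```

Keeping the timestamp and server IP in the key means a restore job (or a later map/reduce job) can pick dumps by time range without opening each file.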
You can also store the data in MongoDB and Solr, which come prepackaged in all the nodes.
Tales is designed to keep track of updates to the data that you scrape. For instance, if a Twitter user changes his location from "CR" to "SF", Tales will keep "CR" and store "SF"; it keeps a log of the changes.
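Attribute versioning of this kind can be sketched as follows. This is a conceptual illustration, not the actual TalesDB API; `Record` and `update_attribute` are hypothetical names:

```python
# Conceptual sketch of attribute versioning -- not the actual TalesDB API.
# Each attribute keeps its full history, so old values are never lost.
import time


class Record:
    def __init__(self):
        self.history = {}  # attribute name -> list of (timestamp, value)

    def update_attribute(self, name, value, ts=None):
        ts = int(time.time() * 1000) if ts is None else ts
        versions = self.history.setdefault(name, [])
        # Only log a new version when the value actually changed.
        if not versions or versions[-1][1] != value:
            versions.append((ts, value))

    def current(self, name):
        """Latest known value of an attribute."""
        return self.history[name][-1][1]
```

With this model, asking "where was this user located last year?" is just a lookup in the attribute's history list.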
This is very useful if you want to run regressions, do some math, or see how the data evolves over time.
- Deploying to the Cloud
- How to install Tales
- Tales workflow sample
- Template structure
- Pushing templates
- Config file
- Database design
- MongoDB and Solr
- TalesDB updateAttribute vs addAttribute
Read more at https://github.com/calufa/tales-core