Monday, November 21, 2011

Anemone - web crawler

Hello Guys,
         Anemone is a free, multi-threaded Ruby web spider framework. It is useful for collecting information about websites: you give it a starting URL and it crawls the site from there, visiting every page it can reach. With Anemone you can write a task to generate statistics on a site just by giving it the URL. Anemone uses Nokogiri for HTML and XML parsing.

Let's look at a simple example so you can get an idea of how it works.
 
First of all, install the anemone gem:
gem install anemone

This will install anemone along with its dependencies, robots and nokogiri.
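
If your project uses Bundler, you can add the gem to your Gemfile instead (a small sketch; adjust to your own setup):

# Gemfile
gem 'anemone'

Then run bundle install before using the rake task below.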

require 'anemone'

desc "crawl the website data at initial level"
task :crawl_website => :environment do
  Anemone.crawl("http://priyankapathak.wordpress.com/") do |anemone|
    anemone.on_every_page do |page|
      puts page.url
      # store the visited pages in file system or db
    end
  end
end

As the example above shows, Anemone takes a starting URL such as 'http://priyankapathak.wordpress.com' and begins tracing every page it finds. If you want to keep the traced pages, add code inside the block to store them in a database or on the file system. Invoke the task with rake crawl_website --trace
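
The "store the visited pages" comment above can be fleshed out in many ways. Here is a minimal sketch that writes each page's raw HTML to tmp/crawl (the task name, output directory, and file-naming scheme are just assumptions for the example):

require 'anemone'
require 'fileutils'
require 'digest/md5'

desc "crawl the website and save each page to disk"
task :save_website => :environment do
  # hypothetical output directory for the sketch
  out_dir = Rails.root.join('tmp', 'crawl')
  FileUtils.mkdir_p(out_dir)

  Anemone.crawl("http://priyankapathak.wordpress.com/") do |anemone|
    anemone.on_every_page do |page|
      # use an MD5 of the URL as a safe, unique file name
      file_name = Digest::MD5.hexdigest(page.url.to_s) + ".html"
      File.open(File.join(out_dir, file_name), 'w') do |f|
        f.write(page.body)
      end
    end
  end
end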

There are many other built-in methods available with Anemone, like the following (a combined sketch appears after the list):
  • after_crawl - run a block on the PageHash (a data structure of all the crawled pages) after the crawl is finished
  • focus_crawl - use a block to select which links to follow on each page
  • on_every_page - run a block on each page as it is encountered
  • on_pages_like - given one or more regex patterns, run a block on every page with a matching URL
  • skip_links_like - given one or more regex patterns, skip any link that matches a pattern
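
Here is a minimal sketch combining several of these callbacks in one crawl (the patterns and host name are only illustrative):

require 'anemone'

Anemone.crawl("http://priyankapathak.wordpress.com/") do |anemone|
  # skip feed and PDF links entirely
  anemone.skip_links_like(/\/feed\//, /\.pdf$/)

  # only follow links that stay on the same host
  anemone.focus_crawl do |page|
    page.links.select { |link| link.host == "priyankapathak.wordpress.com" }
  end

  # run a block only on pages whose URL matches a pattern
  anemone.on_pages_like(/\/2011\//) do |page|
    puts "2011 post: #{page.url}"
  end

  # after the crawl, report how many pages were visited
  anemone.after_crawl do |pages|
    puts "Crawled #{pages.size} pages in total."
  end
end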
If you find this Ruby web spider interesting and want more information, follow the links below.
