Monday, November 21, 2011

Anemone - web crawler

Hello Guys,
         Anemone is a free, multi-threaded Ruby web spider framework. It is useful for collecting information about websites: you give it a starting URL and it crawls the site from there, visiting every page it can reach. With Anemone you can write a task to generate statistics on a site just by giving it the URL. Anemone uses Nokogiri for HTML and XML parsing.

Let's look at a simple example so you can get an idea of how it works.
 
First of all, install the anemone gem:
gem install anemone

This will install anemone along with its dependencies, robots and nokogiri.
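
If your project uses Bundler, you can add the gem to your Gemfile instead (a small sketch; adjust to your own setup):

# Gemfile
gem 'anemone'

Then run bundle install before using the rake task below.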

require 'anemone'

desc "crawl the website data at initial level"
task :crawl_website => :environment do
  Anemone.crawl("http://priyankapathak.wordpress.com/") do |anemone|
    anemone.on_every_page do |page|
      puts page.url
      # store the visited pages in file system or db
    end
  end
end

As the example above shows, Anemone takes a starting URL such as 'http://priyankapathak.wordpress.com' and begins tracing every page it finds. If you want to keep the traced pages, add code inside the block to store them in a database or on the file system. Invoke the task with rake crawl_website --trace
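
The "store the visited pages" comment above can be fleshed out in many ways. Here is a minimal sketch that writes each page's raw HTML to tmp/crawl (the task name, output directory, and file-naming scheme are just assumptions for the example):

require 'anemone'
require 'fileutils'
require 'digest/md5'

desc "crawl the website and save each page to disk"
task :save_website => :environment do
  # hypothetical output directory for the sketch
  out_dir = Rails.root.join('tmp', 'crawl')
  FileUtils.mkdir_p(out_dir)

  Anemone.crawl("http://priyankapathak.wordpress.com/") do |anemone|
    anemone.on_every_page do |page|
      # use an MD5 of the URL as a safe, unique file name
      file_name = Digest::MD5.hexdigest(page.url.to_s) + ".html"
      File.open(File.join(out_dir, file_name), 'w') do |f|
        f.write(page.body)
      end
    end
  end
end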

There are many other built-in methods available with Anemone, like the following (a combined sketch appears after the list):
  • after_crawl - run a block on the PageHash (a data structure of all the crawled pages) after the crawl is finished
  • focus_crawl - use a block to select which links to follow on each page
  • on_every_page - run a block on each page as it is encountered
  • on_pages_like - given one or more regex patterns, run a block on every page with a matching URL
  • skip_links_like - given one or more regex patterns, skip any link that matches a pattern
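
Here is a minimal sketch combining several of these callbacks in one crawl (the patterns and host name are only illustrative):

require 'anemone'

Anemone.crawl("http://priyankapathak.wordpress.com/") do |anemone|
  # skip feed and PDF links entirely
  anemone.skip_links_like(/\/feed\//, /\.pdf$/)

  # only follow links that stay on the same host
  anemone.focus_crawl do |page|
    page.links.select { |link| link.host == "priyankapathak.wordpress.com" }
  end

  # run a block only on pages whose URL matches a pattern
  anemone.on_pages_like(/\/2011\//) do |page|
    puts "2011 post: #{page.url}"
  end

  # after the crawl, report how many pages were visited
  anemone.after_crawl do |pages|
    puts "Crawled #{pages.size} pages in total."
  end
end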
If you find this Ruby web spider interesting and want more information, follow the links below.
