Hello Guys,
Anemone is a free, multi-threaded Ruby web spider framework, useful for collecting information about websites. It crawls a site starting from a seed URL. With Anemone you can write a task to generate statistics on a site just by giving it the URL. Anemone uses Nokogiri for HTML and XML parsing.
Let's look at a simple example so you can get an idea of how it works.
First of all, install the anemone gem:
gem install anemone
It will install anemone along with dependencies robots, nokogiri.
require 'anemone'

desc "crawl the website data at initial level"
task :crawl_website => :environment do
  Anemone.crawl("http://priyankapathak.wordpress.com/") do |anemone|
    anemone.on_every_page do |page|
      puts page.url
      # store the visited pages in file system or db
    end
  end
end
In the above example, Anemone takes 'http://priyankapathak.wordpress.com' as the starting domain and traces every page. If you want to store the traced pages, just add code to write them to a database or the file
system. Invoke the above task with rake crawl_website --trace
There are many other built-in methods available with Anemone, such as:
- after_crawl - run a block on the PageHash (a data-structure of all the crawled pages) after the crawl is finished
- focus_crawl - use a block to select which links to follow on each page
- on_every_page - run a block on each page as they are encountered
- on_pages_like - given one or more RegEx patterns, run a block on every page with a matching URL
- skip_links_like - given one or more RegEx patterns, skip any link whose URL matches a pattern
If you find this Ruby web spider interesting and want more information, simply follow the links below.