Writing a spider to spider an existing
spider's site may seem convoluted, but it can prove useful when you're looking
for location-based services. This hack walks through creating a framework for
full-site spidering, including additional filters to lessen your
load.
In this hack, you'll learn how to write a spider that crawls
the Yahoo! group of portals. The choice of Yahoo! was
obvious: because it is one of the largest Internet portals in existence, it can
serve as an ideal example of how one goes about
writing a portal spider.
But before we get to the gory details of code, let's define
what exactly a portal spider is. While many may argue with such a
classification, I maintain that a portal spider
is a script that automatically downloads all documents from a preselected range
of URLs found on the portal's site or a group of sites, as is the case with
Yahoo!. A portal spider's main job is to walk from one document to another,
extract URLs from downloaded HTML, process said URLs, and go to another
document, repeating the cycle until it runs out of URLs to visit. Once you
create code that describes such basic behavior, you can add additional
functionality, turning your general portal spider into a specialized one.
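To make that cycle concrete, here is a minimal sketch of the core loop, written with LWP::UserAgent and HTML::LinkExtor. The starting URL, the user-agent string, and the yahoo.com host check are only placeholders for this illustration; the real yspider.pl later in this hack adds the filtering and bookkeeping this sketch leaves out.

#!/usr/bin/perl -w
# A minimal sketch of the crawl cycle: fetch a page, extract its links,
# queue any yahoo.com URLs we haven't seen yet, and repeat until the
# queue is empty.
use strict;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $ua = LWP::UserAgent->new( agent => 'yspider-sketch/0.1' );

my @queue = ('http://www.yahoo.com/');     # preselected starting point
my %seen  = map { $_ => 1 } @queue;

while ( my $url = shift @queue ) {
    my $response = $ua->get($url);
    next unless $response->is_success;
    next unless $response->content_type eq 'text/html';

    # Pull every link out of the downloaded HTML, resolved against $url.
    my $extractor = HTML::LinkExtor->new( undef, $url );
    $extractor->parse( $response->decoded_content );

    for my $link ( $extractor->links ) {
        my ( $tag, %attrs ) = @$link;
        next unless $tag eq 'a' and $attrs{href};

        my $uri = URI->new( $attrs{href} )->canonical;
        next unless $uri->scheme and $uri->scheme eq 'http';
        next unless $uri->host   and $uri->host =~ /(?:^|\.)yahoo\.com$/i;

        next if $seen{$uri}++;             # skip URLs we've already queued
        push @queue, "$uri";               # otherwise, visit it later
    }
    print "visited $url; queue now holds ", scalar @queue, " URLs\n";
}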
Although writing a script that walks from one Yahoo! page to
another sounds simple, it isn't, because there is no general pattern followed by
all Yahoo! sites or sections within those sites. Furthermore, Yahoo! is not a
single site with a nice link layout that can be described using a simple
algorithm and a classic data structure. Instead, it is a collection of well over
30 thematic sites, each with its own document layout, naming conventions, and
peculiarities in page design and URL patterns. For example, if you check links
to the same directory section on different Yahoo! sites, you will find that some
of them begin with http://www.yahoo.com/r, some
begin with http://uk.yahoo.com/r/hp/dr, and
others begin with http://kr.yahoo.com.
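One rough way to cope with that variety, rather than enumerating every prefix, is to accept any URL whose host ends in yahoo.com and ignore the path prefix entirely. The helper below is only an illustration of that idea, not part of yspider.pl:

#!/usr/bin/perl -w
use strict;

# Match any http URL on yahoo.com or one of its regional subdomains
# (www.yahoo.com, uk.yahoo.com, kr.yahoo.com, ...), regardless of which
# path prefix (/r, /r/hp/dr, and so on) that particular site happens to use.
sub is_yahoo_url {
    my $url = shift;
    return $url =~ m{^http://(?:[\w-]+\.)*yahoo\.com(?:/|$)}i;
}

print is_yahoo_url('http://uk.yahoo.com/r/hp/dr') ? "yes\n" : "no\n";  # yes
print is_yahoo_url('http://kr.yahoo.com')         ? "yes\n" : "no\n";  # yes
print is_yahoo_url('http://www.example.com/')     ? "yes\n" : "no\n";  # no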
If you try to look for patterns, you will soon find yourself
writing long if/elsif/else sections that are hard to maintain and need
to be rewritten every time Yahoo! makes a small change to one of its sites. If
you follow that route, you will soon discover that you need to write hundreds of
lines of code to describe every kind of behavior you want to build into your
spider.
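To see why that route hurts, compare a hypothetical hardcoded chain with the same checks kept as a list of patterns. The URLs, the patterns, and the follow() helper here are purely illustrative, but the second form grows by one line per new case instead of a new branch of logic:

#!/usr/bin/perl -w
use strict;

# A hypothetical helper standing in for "queue this URL for download".
sub follow { print "would follow $_[0]\n" }

my $url = 'http://uk.yahoo.com/r/hp/dr/example';   # sample URL, for illustration

# Brittle: one branch per observed pattern, rewritten whenever Yahoo! changes.
if    ( $url =~ m{^http://www\.yahoo\.com/r/} )     { follow($url) }
elsif ( $url =~ m{^http://uk\.yahoo\.com/r/hp/dr} ) { follow($url) }
elsif ( $url =~ m{^http://kr\.yahoo\.com/} )        { follow($url) }
# ...and so on, one new branch for every new site or section...

# Simpler: keep the patterns as data and loop over them.
my @allowed = (
    qr{^http://www\.yahoo\.com/r/},
    qr{^http://uk\.yahoo\.com/r/hp/dr},
    qr{^http://kr\.yahoo\.com/},
);
follow($url) if grep { $url =~ $_ } @allowed;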
This is particularly frustrating to programmers who expect to
write code that uses elegant algorithms and nicely structured data. The hard
truth about portals is that you cannot expect elegance and ease of spidering.
Instead, prepare yourself for a lot of detective work and writing (and throwing
away) chunks of code in a hit-and-miss fashion. Portal spiders are written in an organic, unstructured
way, and the only rule you should follow is to keep things simple and add
specific functionality only once you have the general behavior working.
Okay, with taxonomy and general advice behind us, we can get to
the gist of the matter. The spider in this hack is a relatively simple tool for
crawling Yahoo! sites. It makes no assumptions about the layout of the sites; in
fact, it makes almost no assumptions whatsoever and can easily be adapted to
other portals or even groups of portals. You can use it as a framework for
writing specialized spiders.
Save the following code to a file called yspider.pl: