Spider the Yahoo Catalog

Writing a spider to spider an existing spider's site may seem convoluted, but it can prove useful when you're looking for location-based services. This hack walks through creating a framework for full-site spidering, including additional filters to lessen your load.
In this hack, you'll learn how to write a spider that crawls the Yahoo! group of portals. The choice of Yahoo! was obvious; because it is one of the largest Internet portals in existence, it can serve as an ideal example of how one goes about writing a portal spider.
But before we get to the gory details of code, let's define what exactly a portal spider is. While many may argue with such a classification, I maintain that a portal spider is a script that automatically downloads all documents from a preselected range of URLs found on the portal's site or a group of sites, as is the case with Yahoo!. A portal spider's main job is to walk from one document to another, extract URLs from downloaded HTML, process said URLs, and go to another document, repeating the cycle until it runs out of URLs to visit. Once you create code that describes such basic behavior, you can add additional functionality, turning your general portal spider into a specialized one.
Although writing a script that walks from one Yahoo! page to another sounds simple, it isn't, because there is no general pattern followed by all Yahoo! sites or sections within those sites. Furthermore, Yahoo! is not a single site with a nice link layout that can be described using a simple algorithm and a classic data structure. Instead, it is a collection of well over 30 thematic sites, each with its own document layout, naming conventions, and peculiarities in page design and URL patterns. For example, if you check links to the same directory section on different Yahoo! sites, you will find that some of them begin with http://www.yahoo.com/r, some begin with http://uk.yahoo.com/r/hp/dr, and others begin with http://kr.yahoo.com.

If you try to look for patterns, you will soon find yourself writing long if/ elsif/else sections that are hard to maintain and need to be rewritten every time Yahoo! makes a small change to one of its sites. If you follow that route, you will soon discover that you need to write hundreds of lines of code to describe every kind of behavior you want to build into your spider.

This is particularly frustrating to programmers who expect to write code that uses elegant algorithms and nicely structured data. The hard truth about portals is that you cannot expect elegance and ease of spidering. Instead, prepare yourself for a lot of detective work and writing (and throwing away) chunks of code in a hit-and-miss fashion. Portal spiders are written in an organic, unstructured way, and the only rule you should follow is to keep things simple and add specific functionality only once you have the general behavior working.

Okaywith taxonomy and general advice behind us, we can get to the gist of the matter. The spider in this hack is a relatively simple tool for crawling Yahoo! sites. It makes no assumptions about the layout of the sites; in fact, it makes almost no assumptions whatsoever and can easily be adapted to other portals or even groups of portals. You can use it as a framework for writing specialized spiders.

Save the following code to a file called yspider.pl: