Yahoo Slurp robots.txt wildcard support
Yahoo!'s crawler, Slurp, is without a doubt the most active spider when it comes to requesting robots.txt files. For a while the search engine was also notorious for misinterpreting site architecture, print versions of pages and session ids.
Examples of the new wildcard directive syntax (a combined robots.txt sketch follows this list):
- Remove session ids, which cause duplicate pages to be indexed and can eventually get the affected pages filtered out: Disallow: /*?sessionid
- Keep the crawler away from print pages, PDF exports and the like: Disallow: /*_print*.html
- Use "$" at the end of a pattern to anchor the match to the end of the URL, so the pattern is not treated as a mere prefix. For example, Disallow: /*.pdf$ blocks crawling of any file on your site whose URL ends in ".pdf". Without the "$", Slurp would skip every URL containing ".pdf" anywhere in its path.
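Putting the three rules together, a robots.txt aimed at Slurp might look like the sketch below. The session id and print-page patterns are illustrative and assume those URL formats; adapt them to the way your own site builds its URLs.

  User-agent: Slurp
  # Keep session-id URLs out of the index to avoid duplicate pages
  Disallow: /*?sessionid
  # Skip print versions of pages
  Disallow: /*_print*.html
  # Block PDF exports; the trailing $ anchors the match to the end of the URL
  Disallow: /*.pdf$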
Not really a breakthrough, but not a mere gimmick either: some companies may find the Yahoo Slurp wildcards useful for keeping client data and confidential internal information out of the index.
This new feature comes only a few months after Yahoo upgraded its crawler and launched new tools such as Site Explorer, and it tends to confirm that the Sunnyvale company wants to improve its communication with the webmaster community. That strategy has long paid off for Mountain View based Google. It is doubtful that Yahoo will generate the same enthusiasm, but its improved communication and new webmaster-oriented tools and upgrades are welcome all the same.