Googlebot/2.1 Mozilla/5.0 not obeying robots.txt
Last Updated: 2004-11-02 03:09:49
Is Google in for another November surprise? The new Googlebot user agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" has been heavily crawling sites over the past few months. We have seen no correlation between visits from this user agent and pages getting indexed in Google. In fact, a few of our sites that record every user agent show that all indexing still comes from the old Googlebot user agent "Googlebot/2.1 (+http://www.google.com/bot.html)". Further, we have noticed that the new Googlebot Mozilla user agent not only fails to obey robots.txt, it also seems to request the portions of the site disallowed in robots.txt more often than the portions that are not disallowed.
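Readers who want to check their own logs for the same pattern could start from something like the following Python sketch. It is not the tooling we used; it assumes an access log in Combined Log Format and matches the two user-agent strings quoted above exactly. The function name `tally_googlebot_hits` and the result labels are our own inventions for illustration.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# The two user-agent strings quoted in the article.
NEW_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
OLD_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"

# Assumption: Combined Log Format, i.e.
# ... "METHOD path PROTOCOL" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def tally_googlebot_hits(log_lines, robots_txt):
    """Count hits per (bot, allowed/disallowed) pair.

    log_lines  -- iterable of raw access-log lines
    robots_txt -- the text of the site's robots.txt
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a request line we understand
        ua = match.group("ua")
        if ua == NEW_UA:
            bot = "mozilla_googlebot"
        elif ua == OLD_UA:
            bot = "classic_googlebot"
        else:
            continue  # some other visitor
        # Both agents should be bound by rules addressed to "Googlebot" or "*".
        allowed = rules.can_fetch("Googlebot", match.group("path"))
        counts[(bot, "allowed" if allowed else "disallowed")] += 1
    return counts
```

A high `("mozilla_googlebot", "disallowed")` count relative to `("mozilla_googlebot", "allowed")` would match what we describe above. Note that a user-agent string can be forged, so a tally like this shows what a visitor claims to be, not who it actually is.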
Cloaking detection aside, we believe that the new Googlebot Mozilla user agent is mostly a new spam filter, but not for spam in the traditional sense; Google already does a reasonable job of eliminating that. Rather, we believe the new Mozilla Googlebot user agent is a sort of type-matching system that identifies certain patterns within a site and classifies them as acceptable or unacceptable. The new bot could be trying to emulate a real person by checking whether a site has "undesirable" features. From this it could rate the indexing of the rest of the site based on any undesirable features it finds.
Why crawl the files listed in robots.txt more heavily? The purpose of robots.txt is to let webmasters tell a search engine spider not to visit certain portions of a website. As webmasters, we get most of our traffic from search engines, yet our robots.txt file tells the search engines that we do not want them sending any traffic to these portions of our site. We don't want these portions indexed. Why not? Perhaps they have no value to the user. Perhaps the bandwidth cost of search engines crawling these sections exceeds their value to the user. Maybe the disallowed portions only provide backend functions for the site. Or perhaps we are trying to keep the search engines away from portions of our site that hint at its underlying structure and reveal patterns that indicate what type of website it is. Thus, if a search engine knows the content you don't want it to see, that content may hint at the quality of the information in the portions of your site that you do want it to see.
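To make the mechanics concrete, here is a minimal Python sketch of how a well-behaved spider is expected to read a robots.txt file. The file contents, directory names, and the helper `split_by_robots` are hypothetical and chosen only for illustration; they are not taken from any of our sites.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the directory names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""


def split_by_robots(robots_txt, paths, agent="Googlebot"):
    """Return (allowed, disallowed) lists of paths for the named crawler."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    allowed, disallowed = [], []
    for path in paths:
        # A compliant spider skips every path for which can_fetch() is False.
        if rules.can_fetch(agent, path):
            allowed.append(path)
        else:
            disallowed.append(path)
    return allowed, disallowed
```

Calling `split_by_robots(ROBOTS_TXT, ["/index.html", "/cgi-bin/mail.cgi"])` puts `/index.html` in the allowed list and `/cgi-bin/mail.cgi` in the disallowed list. The file is only a request: nothing in it can stop a spider that chooses to fetch the disallowed paths anyway, which is the behavior this article is about.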
Google claims to have indexed over 4 billion web pages. A good portion of these indexed pages are duplicate content, or other types of content that are of no real use to an actual user. Why index all of this content? Why not create better quality control to prevent over-indexing the web? If the "undesirable content" could be separated from the "quality content" and removed from the search results, wouldn't the Google user have a better searching experience? Is this the purpose of the new Mozilla Googlebot user agent? Use the contact form to send us your thoughts and experiences with the new Mozilla Googlebot user agent. We'll post them right here.
Gregg Website Tools