Gregg website development, website maintenance and e-commerce solutions



Googlebot/2.1 Mozilla/5.0 not obeying robots.txt
Last Updated: 2004-11-02 03:09:49

Is Google in for another November surprise? The new Googlebot user agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" has been heavily crawling sites over the past few months. We have seen no correlation between visits from this user agent and pages getting indexed in Google. In fact, on the few of our sites that record every user agent, all indexing still appears to come from the old Googlebot user agent "Googlebot/2.1 (+http://www.google.com/bot.html)". Further, we have noticed that the new Googlebot Mozilla user agent not only fails to obey robots.txt, it also seems to access the portions of a site that are disallowed in robots.txt more often than the portions that are not disallowed.
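To check this kind of observation on your own site, you can count hits per Googlebot variant in the access log. The following is only a minimal sketch, not the logging we used: it assumes the server writes the Apache "combined" log format, and the function names `classify` and `tally` are made up for illustration. Only the two user agent strings come from the observations above.

```python
import re
from collections import Counter

# Assumed Apache "combined" log layout (an assumption, not from the article):
# ... "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def classify(user_agent):
    """Label a user agent string as one of the two Googlebot variants, or other."""
    if user_agent.startswith("Mozilla/5.0") and "Googlebot/2.1" in user_agent:
        return "mozilla-googlebot"   # the new user agent
    if user_agent.startswith("Googlebot/2.1"):
        return "classic-googlebot"   # the old user agent
    return "other"


def tally(lines):
    """Count requests per user agent class; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[classify(match.group("ua"))] += 1
    return counts
```

Calling `tally(open("access.log"))` on a real log (the file name is hypothetical) gives a per-variant request count that can be compared against the dates on which pages actually appear in the index.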

Cloaking detection aside, we believe that the new Googlebot Mozilla user agent is mostly a new spam filter, but not for spam in the traditional sense; Google already does a reasonable job of eliminating that. Rather, we believe the new Mozilla Googlebot user agent is a kind of pattern-matching system that identifies certain patterns within a site and classifies them as acceptable or unacceptable. The new bot could be trying to emulate a real visitor by checking whether a site has "undesirable" features. It could then rate the indexing of the rest of the site based on any undesirable features it finds.

Why hit the files listed in robots.txt more heavily? The purpose of robots.txt is to let webmasters tell a search engine spider not to visit certain portions of a website. For most webmasters, the majority of traffic comes from search engines, yet our robots.txt file tells the search engines that we do not want them sending any traffic to these portions of our site. We don't want these portions indexed. Why not? Perhaps they have no value to the user. Perhaps the bandwidth cost of search engines crawling these sections exceeds their value to the user. Maybe the disallowed portions only provide backend functions for the site. Or perhaps we are trying not to expose portions of the site that hint at its underlying structure and reveal patterns that indicate what type of website it is. Thus, if a search engine sees the content you don't want it to see, that content may hint at the quality of the information in the portions of your site that you do want it to see.
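For reference, here is roughly what a well-behaved spider is expected to do with robots.txt before requesting a page. This is a small sketch using Python's standard `urllib.robotparser`; the robots.txt rules and the paths are hypothetical examples, not taken from any of our sites, and `disallowed_hits` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""


def disallowed_hits(robots_txt, user_agent, paths):
    """Return the requested paths that robots.txt disallows for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]
```

Feeding this function the paths a crawler actually requested (for example, pulled from an access log) shows how many of its requests fall inside disallowed sections. A spider that obeys robots.txt should produce an empty list.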

Google claims to have indexed over 4 billion web pages. A good portion of these indexed pages are duplicate content, or other content that is of no real use to an actual user. Why index all of it? Why not build better quality control to prevent over-indexing the web? If the "undesirable content" could be separated from the "quality content" and removed from the search results, wouldn't the Google user have a better search experience? Is this the purpose of the new Mozilla Googlebot user agent? Use the contact form to send us your thoughts on and experiences with the new Mozilla Googlebot user agent. We'll post them right here.


We're located in the High Desert, Southern California
Copyright © 2004, Gregg Website Development, Privacy Policy
PO Box 400308, Hesperia, CA 92340