Unhandled Perception
From the mind of a developer.

Friday, March 07, 2008

How Googlebot interacts with your website

If I'm [googlebot] indexing for regular web search, and I see links to MP3s and videos, I probably won't download those. Similarly, if I see a JPG, I will treat it differently than an HTML or PDF link. For instance, JPG is much less likely to change frequently than HTML, so I will check the JPG for changes less often to save bandwidth. Meanwhile, if I'm looking for links as Google Scholar, I'm going to be far more interested in the PDF article than the JPG file. Downloading doodles (like JPGs) and videos of skateboarding dogs is distracting for a scholar—do you agree?
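
To make that concrete, here is a rough sketch of how a crawler might turn a link's file extension into a fetch decision. Everything in it is my own assumption for illustration: the function names, the extension lists, the priority values, and the recheck intervals are hypothetical, and the quoted post gives no actual numbers.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical recheck intervals in days; the quoted post gives no real figures.
RECHECK_DAYS = {"html": 1, "htm": 1, "pdf": 7, "jpg": 30, "jpeg": 30}

# Hypothetical list of media types a regular web-search crawl would skip.
SKIP_FOR_WEB_SEARCH = {"mp3", "avi", "mpg"}


def link_extension(url):
    """Return the lowercase file extension of a URL path, defaulting to 'html'."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1]
    return ext.lstrip(".").lower() or "html"


def plan_fetch(url, crawler="web"):
    """Return a fetch plan for a link, or None if the link is skipped."""
    ext = link_extension(url)
    if crawler == "web" and ext in SKIP_FOR_WEB_SEARCH:
        return None  # regular web search probably won't download these
    priority = 1
    if crawler == "scholar":
        # A scholar-style crawl cares far more about the PDF than the JPG.
        if ext == "pdf":
            priority = 10
        elif ext in ("jpg", "jpeg"):
            priority = 0
    return {"priority": priority, "recheck_days": RECHECK_DAYS.get(ext, 7)}
```

Calling plan_fetch("http://example.com/song.mp3") returns None, while a JPG link gets a longer recheck interval than an HTML page.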
---
After actually downloading a file, I use the Content-Type header to check whether it really is HTML, an image, text, or something else. If it's a special data type like a PDF file, Word document, or Excel spreadsheet, I'll make sure it's in the valid format and extract the text content. Maybe it has a virus; you never know. If the document or data type is really garbled, there's usually not much to do besides discard the content.
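
The Content-Type check described above could look something like the following. This is only a minimal sketch under my own assumptions: classify and media_type are hypothetical helpers, and the "valid format" test is reduced to comparing the first bytes of the file against well-known signatures (PDF files begin with %PDF-, and legacy Word and Excel files use the OLE2 container signature). Text extraction and virus scanning are not shown.

```python
# File signatures used as a cheap "is this really the declared format" check.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # legacy .doc / .xls container
MAGIC = {
    "application/pdf": b"%PDF-",
    "application/msword": OLE2_MAGIC,
    "application/vnd.ms-excel": OLE2_MAGIC,
}


def media_type(content_type_header):
    """Strip parameters such as charset and return the bare media type."""
    return content_type_header.split(";", 1)[0].strip().lower()


def classify(content_type_header, body):
    """Decide how to treat a downloaded body based on its Content-Type header."""
    mtype = media_type(content_type_header)
    if mtype in MAGIC:
        # Special document types: keep only if the bytes match the signature.
        return "document" if body.startswith(MAGIC[mtype]) else "discard"
    if mtype == "text/html":
        return "html"
    if mtype.startswith("image/"):
        return "image"
    if mtype.startswith("text/"):
        return "text"
    return "other"
```

A response labelled application/pdf whose body does not start with %PDF- ends up as "discard", which matches the "not much to do besides discard the content" case in the quote.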

A very interesting read on what Googlebot does when it accesses your website, and how it decides what to download.

Link:
http://googlewebmastercentral.blogspot.com/...
