Sandbox Effect for Google, MSN, Yahoo
Search engine listing delays, which have come to be called the Google Sandbox
effect, are real in practice at each of the four top tier search engines in one
form or another. MSN, it seems, has the shortest indexing delay at 30 days. This
article is the second in a series following the spiders through a brand new web
site that first went live on May 11, 2005 under a newly purchased domain name.
First Case Study Article
Previously we looked at the first 35 days and detailed the crawling behavior of
Googlebot, Teoma, MSNbot and Slurp as they traversed the pages of this new site.
We discovered that each robot spider displays distinctly different behavior in
crawling frequency and similarly differing indexing patterns.
For reference, about 15 to 20 new pages are added to the site daily, and each is
linked from the home page for a day. Site structure is non-traditional, with no
categories and a linking structure tied to author pages listing their articles,
as well as a "related articles" index that links to relevant pages containing
similar content.
So let's review where each spider stands, comparing pages crawled with pages
indexed for each engine.
The AskJeeves spider, Teoma, has crawled most of the pages on the site, yet has
indexed no pages 60 days later at this writing. This is clearly a site aging
delay that's modeled on Google's Sandbox behavior. The Teoma spider from Ask.com
has crawled more pages on this site than any other engine over a 60 day period,
but it appears to be tired of crawling, as it has not returned since July 13,
its first break in 60 days.
In the first two days, Googlebot gobbled up 250 pages and then didn't return
until 60 days later. It has not indexed even a single page in the 60 days since
that initial crawl. But Googlebot is showing a renewed interest in crawling the
site since the first article in this crawling case study was published on
several high traffic sites. Now Googlebot is looking at a few pages each day, so
far no more than about 20 pages, at a decidedly lackluster pace: a true "crawl"
that will keep it occupied for years if it continues that slowly.
MSNbot crawled timidly for the first 45 days, looking over 30 to 50 pages daily,
and not at all until it found a robots.txt file. We had neglected to post that
file to the site for a week, then bobbled the ball as we changed site structure
and failed to implement robots.txt in new subdomains until day 25, and even then
MSNbot didn't return until day 30. If little else were discovered about initial
crawls and indexing, we have seen that MSNbot relies heavily on the robots.txt
file and that proper implementation of that file will speed crawling.
MSNbot is now crawling with enthusiasm at anywhere between 200 and 800 pages
daily. As a matter of fact, we had to add a "crawl-delay" directive to the
robots.txt file after MSNbot began hitting 6 pages per second last week. The MSN
index now shows 4905 pages 60 days into this experiment. Cached pages change
weekly. MSNbot has apparently found that it likes how we changed the page
structure to include a new feature which links to questions from several other
article pages.
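
For readers who have not used the "crawl-delay" directive, here is a minimal
sketch of what such a robots.txt entry might look like, checked with Python's
standard urllib.robotparser module. The 10 second value and the exact file
contents are assumptions for illustration only; the actual settings used on the
case study site are not given here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: slow MSNbot down, leave everything else unrestricted.
# The delay of 10 seconds is an assumed value, not the one used on the site.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def parse_robots(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)
msn_delay = parser.crawl_delay("msnbot")       # delay requested of MSNbot
other_delay = parser.crawl_delay("Googlebot")  # None: no delay for other bots
print("msnbot delay:", msn_delay)
print("Googlebot delay:", other_delay)
```

The directive only asks the crawler to wait between requests; pages remain open
to crawling, so it slows a bot down without shutting it out.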
Slurp is strangely inactive, then hyperactive, for alternating periods of time.
The Yahoo crawler will look at 40 pages one day and then 4000 the next, then
simply look at the home page for a few days, then jump back in for 3000 pages
the next day and go back to only reviewing robots.txt for two days. Consistency
is not a curse suffered by Slurp. Yahoo now shows 6 pages in its index; one is
an error page and another is an "index/of" page, as we have not posted a home
page to several subdomains. But Slurp has easily crawled 15,000 pages to date.
Lessons learned in the first 60 days on a new site follow:
1) Google crawls 250 pages on first discovery of links to the site. Then it
doesn't return until it finds more links, and then it crawls slowly. Google has
failed to index the new domain for 60 days.
2) Yahoo looks for error pages and, once it finds bad links, will crawl them
ceaselessly until you tell it to stop. Then it won't crawl at all for weeks,
until it crawls heavily one day and lightly the next in random fashion.
3) MSNbot requires a robots.txt file and, once it decides it likes your site,
may crawl too fast, requiring "crawl-delay" instructions in that robots.txt
file. Implement it immediately.
4) Bad bots can strain resources and hit too many pages too quickly until you
tell them to stay out. We banned 3 bots outright after they slammed our servers
for a day or two. We noted that "aipbot" crawled first, then "BecomeBot" came
along, and then "Pbot" from Picsearch.com crawled heavily looking for image
files we don't have. Bad bots, stay out. It is best to implement robots.txt
exclusions for all but the top engines if their crawlers strain your server
resources. We considered excluding the Chinese search engine Baidu.com when it
began crawling heavily early on. We don't expect much traffic from China, but
why exclude one billion people? Especially since Google is rumored to be
considering a possible purchase of Baidu.com as an entry to the Chinese market.
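
The robots.txt exclusions described in lesson 4 might be sketched as follows.
The bot names are the ones quoted above, used here as user-agent tokens; that is
an assumption, since the exact user-agent strings these crawlers send may
differ, and build_robots_txt is a hypothetical helper written for this example.
Exclusions of this kind only work for bots that choose to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Names as quoted in the article; real user-agent tokens may differ (assumption).
BANNED_BOTS = ["aipbot", "BecomeBot", "Pbot"]


def build_robots_txt(banned):
    """Build robots.txt text that shuts out each banned bot and allows the rest."""
    blocks = [f"User-agent: {name}\nDisallow: /\n" for name in banned]
    # An empty Disallow line means "nothing is disallowed" for all other bots.
    blocks.append("User-agent: *\nDisallow:\n")
    return "\n".join(blocks)


robots_txt = build_robots_txt(BANNED_BOTS)
print(robots_txt)

# Verify the rules behave as intended before posting the file.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
blocked = {name: not parser.can_fetch(name, "/") for name in BANNED_BOTS}
googlebot_allowed = parser.can_fetch("Googlebot", "/")
print(blocked, googlebot_allowed)
```

Checking the generated file with a parser before posting it is a cheap way to
avoid accidentally shutting out the engines you do want.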
The bottom line is that we've discovered all engines seem to delay indexing of
new domain names for at least thirty days. Google so far has delayed indexing
THIS new domain for 60 days since first crawling it. AskJeeves has crawled
thousands of pages while indexing none of them. MSN indexes faster than the
other engines but requires a robots.txt file. Yahoo's Slurp has crawled on again,
off again for 60 days, but has indexed only six of the 15,000 or more pages
crawled to date.
We seem to have settled that there is a clear indexing delay, but whether this
site specifically is "Sandboxed" and whether delays apply universally is less
clear. Many webmasters claim that they have been indexed fully within 30 days of
first posting a new domain. We'd love to see others track spiders through new
sites following launch to document their results publicly so that indexing and
crawling behavior are proven.
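
For anyone who wants to try this kind of tracking, one possible starting point
is a short script that tallies spider visits per day from a server access log.
The sketch below assumes the common Apache "combined" log format, matches the
four spiders by simple substrings of the user-agent field, and runs on a few
made-up sample lines; none of it is taken from the logs or tools used for this
case study.

```python
import re
from collections import Counter

# Substrings assumed to identify each spider in the user-agent field.
BOT_TOKENS = ["Googlebot", "msnbot", "Slurp", "Teoma"]

# Capture the date from [dd/Mon/yyyy:...] and the final quoted field (the agent).
LOG_PATTERN = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]*\].*"(?P<agent>[^"]*)"\s*$'
)


def count_bot_hits(lines):
    """Return a Counter keyed by (day, bot) with the number of requests seen."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in BOT_TOKENS:
            if token.lower() in agent:
                counts[(match.group("day"), token)] += 1
                break
    return counts


# Made-up sample lines for illustration only (not real log data).
SAMPLE_LOG = [
    '192.0.2.1 - - [11/May/2005:10:00:00 -0500] "GET /a.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.2 - - [11/May/2005:10:00:05 -0500] "GET /robots.txt HTTP/1.0" 200 64 "-" "msnbot/1.0"',
    '192.0.2.1 - - [11/May/2005:10:01:00 -0500] "GET /b.html HTTP/1.1" 200 640 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.3 - - [12/May/2005:09:00:00 -0500] "GET / HTTP/1.0" 200 1024 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '192.0.2.9 - - [12/May/2005:09:30:00 -0500] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

hits = count_bot_hits(SAMPLE_LOG)
for (day, bot), total in sorted(hits.items()):
    print(day, bot, total)
```

Daily counts like these, set beside the number of pages each engine reports in
its index, would give the crawled versus indexed comparison used throughout
this article.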
Mike Banks Valentine is a search engine optimization specialist
who operates WebSite101 Ecommerce Tutorial
and will continue reports of this case study chronicling search indexing of
Publish101 Article Resource. http://www.seoptimism.com/SEO_Contact.htm