
2005 Analysis of Google's Sandbox

Tags: sandbox

Google's infamous and arguably mis-titled "Sandbox Effect" has been an observed phenomenon since early 2004. Although many continue to argue and debate the causes and effects of this unusual algorithmic element, there is virtually no debate about its existence. At one time, the best explanation of the Sandbox was:

"The penalty or devaluation in the Google SERPs of sites with SEO efforts begun after March of 2004."

However, with time, observation and refinement, a new, more detailed and accurate definition can be made:

"The observed phenomenon of a site whose rankings in the Google SERPs are vastly, negatively disparate from its rank in other search engines (including Yahoo!, MSN & Teoma) and in Google's own allin: results for the same queries."

This penalization system is known to be unpredictable and particularly difficult to analyze or understand due to the ways in which it behaves. This article attempts to sum up the experience of many SEOs and websites in the field that have fallen under the effect of the sandbox. I have had the particular privilege of analyzing many dozens of sites affected by the filter thanks (primarily) to e-mail contacts with users ranking particularly high on the sandbox detection tool here at SEOmoz. Although I cannot reveal most of these sources by URL or name, the observed effects should be familiar to many in the SEO business who have started optimization of new websites since March of 2004.

List of Observed "Sandbox" Phenomena

The Sandbox Effect has been noted to affect many unique aspects of rankings in the SERPs. This list is a collection of the most commonly mentioned and obvious factors weighing into the observations.

The Sandbox is known to affect...

  1. ...entire top-level domains, rather than simply web pages, directories or sub-domains.
  2. ...a higher percentage of newly registered (post-2003) websites than those registered prior to that year. There are, however, several examples that show exceptions to this rule.
  3. ...most commonly, those websites which have had search optimization tactics performed on them, specifically on-page optimization of text, titles, meta data, etc. and external link building efforts. There are exceptions to this rule as well, most often from sites which have received a high number of external links over a short period of time.
  4. ...websites primarily in the English language. While reports exist of sandbox-like effects in some other languages, the filter is notably absent from Italian- and Dutch-language websites targeting searches at Google.it and Google.nl.
  5. ...rankings for all levels of difficulty. Despite rumors that the sandbox only targets highly competitive keyword phrases, the most heavily "boxed" sites I've reviewed could not rank successfully for even the most non-competitive terms. Several sites even had unique terms found on virtually no other sites prior to the existence of the domain, and yet were outranked by pages at other sites mentioning or linking to them.
  6. ...rankings only at the Google search engine. Sites that are most heavily penalized will often be ranking in the top 3 results at the other major search engines (Yahoo!, MSN, AskJeeves), yet will be ranked in the 100's or worse at Google.
  7. ...rankings only in standard SERPs. A search using allinanchor:, allintext:, allintitle:, or allinurl: will return the site in its 'normal' position. This effect is also perceived when using the "-asdf trick", where the search phrase is preceded by 16-20 negative modifiers such as -asdf. See an example here. A minimal sketch of this rank-disparity check follows this list.
  8. ...low quality, affiliate, spam and sites carrying AdSense more often than those without these features. This could very well be the intended result of the system, and would therefore be only natural. However, these sites certainly are not universally affected, nor are they alone in being affected - the most prominent examples of sandboxed sites are often purely "white-hat", natural, organic sites.
  9. ...commercial and private sector sites only. There has never been a reported case of a .gov, .mil, .edu or other official use TLD affected by the sandbox.
  10. ...rankings for anywhere from 1 month to more than 1 year. Examples have been shown to me of sites that seem to 'never' escape the sandbox, though these sites are often of the "low quality" variety described above.
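
To make the detection described in item 7 concrete, below is a minimal, hypothetical sketch of the rank-disparity check. It assumes you have already collected ordered result lists for the same query from the standard SERP and from a restricted query such as allinanchor: (by whatever means you normally check rankings); none of the function names come from any real tool, and the "not found" rank of 1000 is an arbitrary placeholder.

```python
# Hypothetical sketch: quantify the rank disparity described in item 7.
# The result lists are plain inputs gathered however you already check
# rankings; nothing here calls a real Google API.

from typing import List, Optional

def rank_of(domain: str, results: List[str]) -> Optional[int]:
    """Return the 1-based position of the first result on `domain`, or None."""
    for position, url in enumerate(results, start=1):
        if domain in url:
            return position
    return None

def sandbox_disparity(domain: str,
                      standard_serp: List[str],
                      allin_serp: List[str],
                      not_found_rank: int = 1000) -> int:
    """Difference between the site's standard rank and its allin*: rank.
    A large positive number matches the 'sandboxed' pattern described above:
    visible under allinanchor:/allintitle: but buried (or absent) normally."""
    standard = rank_of(domain, standard_serp) or not_found_rank
    restricted = rank_of(domain, allin_serp) or not_found_rank
    return standard - restricted

# Example with made-up data: ranked #2 under allinanchor:, missing from the
# standard results collected -> disparity of 998, a strong sandbox signal.
if __name__ == "__main__":
    allin_results = ["http://competitor.com/", "http://example-new-site.com/page"]
    standard_results = ["http://competitor.com/", "http://another.com/"]
    print(sandbox_disparity("example-new-site.com", standard_results, allin_results))
```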

The sandbox has also been observed to typically release sites into "normal" rankings en masse, which is to say that there have been virtually no examples of a single site "escaping" by itself. It appears that certain updates in Google's search engine release many sites all at once. Speculation about this centers around Google wishing to avoid the appearance of manually reviewing sites one by one, although other reasons have been proposed as well.

Technological Explanations for the Sandbox

Several theories have evolved over time to explain how Google flags websites to be put into the sandbox, and why the effect is not universal. The following are either extremely popular, or have stood up to most evidence and appear to be logical explanations:

Over-Optimization Flagging

Many suspect that Google initially identifies websites to be "sandboxed" by analyzing commonly optimized components like the backlink structure, on-page stuffing of keyword terms or phrases, and the rate at which inbound links come to the site. There can be little doubt that, through careful analysis, Google has an excellent idea of what natural text and link structures look like, a very good idea of what these look like on spam sites, and can thus distinguish between the two. Many unique criteria have been mentioned with regard to what can trigger these situations:

  1. Rate of Inbound Links

    As mentioned in the recent Google patent, the rate at which new links to a website or page are found can be measured and compared against historical data to determine whether the page/site has become particularly relevant or whether this is an effort to spam. The key to using this data would be comparing how a popular page that is picked up by the blogosphere or news websites differs from a page that simply has purchased or created thousands of links in order to manipulate the search engines. There are example sites that appear to have fallen afoul of this portion of the sandbox filter despite using purely natural techniques (a rough sketch of this kind of rate and anchor-text analysis follows this list).
  2. Over-Optimized Anchor Text

    Similar to the rate of inbound links above, Google also has a very good idea of what constitutes natural anchor text across the structure of many dozens, hundreds or thousands of links. When these structures appear over-optimized, which is to say particularly focused on specific commercial phrases or terms, it is suspected that this can trigger "sandboxing".
  3. On-Page Over Optimization

    Keyword Stuffing or over-targeting of particular terms on the page or across the site has been named as a possible culprit for being sandboxed. This particular rationale is often used to explain why so many 'SEOd' sites get filtered, while many non-optimized sites do not (although there are plenty of exceptions on both sides).
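
For illustration only, here is a rough sketch of the kind of link-rate and anchor-text signals described in items 1 and 2. The thresholds, field names and function names are all invented for the example; nothing is known about Google's actual criteria.

```python
# Hypothetical illustration of two signals discussed above: the rate of new
# inbound links and the concentration of anchor text. All limits are invented.

from collections import Counter
from typing import List, Tuple

def links_per_week(link_days_sorted: List[int]) -> float:
    """Average new links per week, given sorted link-discovery dates in days."""
    if len(link_days_sorted) < 2:
        return float(len(link_days_sorted))
    span_weeks = max((link_days_sorted[-1] - link_days_sorted[0]) / 7.0, 1.0)
    return len(link_days_sorted) / span_weeks

def anchor_concentration(anchors: List[str]) -> float:
    """Fraction of inbound links that share the single most common anchor text."""
    if not anchors:
        return 0.0
    most_common = Counter(a.lower().strip() for a in anchors).most_common(1)[0][1]
    return most_common / len(anchors)

def looks_over_optimized(link_days: List[int], anchors: List[str],
                         rate_limit: float = 200.0,
                         anchor_limit: float = 0.6) -> Tuple[bool, str]:
    """Flag a profile whose link velocity or anchor-text focus exceeds the
    (made-up) limits; a real system would compare against historical data."""
    rate = links_per_week(sorted(link_days))
    conc = anchor_concentration(anchors)
    if rate > rate_limit:
        return True, f"link velocity {rate:.0f}/week exceeds {rate_limit:.0f}"
    if conc > anchor_limit:
        return True, f"{conc:.0%} of anchors are identical (limit {anchor_limit:.0%})"
    return False, "within the invented thresholds"
```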

Commercial Keyword Targeting

There have been suggestions, although these have largely lacked solid evidence, that by targeting specific, commercial search terms, your website may be more likely to fall afoul of the filter. However, there have been so many examples of sites filtered despite targeting non-commercial phrases and largely non-competitive terms that my personal opinion is that the sandbox does not discriminate based on the targeted terms/phrases.

Natural Text Analysis

Many patents and white papers have been written by the major search engines on the subject of analyzing and differentiating natural, human-written text from computer-aided or generated text, commonly used by spam websites. This has led many to believe that Google is conducting this analysis with an eye towards "sandboxing" non-natural text. Luckily, for those SEOs who write their own content, this should be a relatively easy problem to work around, as false positives from automated text analysis are highly unlikely. Sadly, experience has shown that many sites whose text is entirely human-written, never duplicated and of generally high quality still experience the sandbox phenomenon. My personal opinion on this issue is that text analysis of any kind is not to blame for the sandbox, though low quality text could predispose a site to penalties or make it more difficult to "escape" the sandbox.
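
As a toy stand-in for the kind of text analysis discussed above, the sketch below flags only the crudest machine-generated pattern: heavy repetition of identical phrases. The threshold is invented, and real systems presumably use far more sophisticated language models.

```python
# Toy heuristic: flag text where an unusually high share of 3-word phrases
# repeat, a crude marker of template- or machine-generated copy.

import re
from collections import Counter

def repeated_trigram_ratio(text: str) -> float:
    """Share of 3-word sequences that occur more than once in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(trigrams)

# Invented threshold: pages where more than ~30% of trigrams repeat would be
# worth a second look; hand-written prose rarely repeats itself that heavily.
def looks_generated(text: str, threshold: float = 0.3) -> bool:
    return repeated_trigram_ratio(text) > threshold
```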

Manual Review

Thanks to Henk Van Ess' Search Bistro and his exposé therein of Eval.Google, the theory that new websites which gain a significant number of links, are sent a certain quantity of visitors, or trigger some other set of parameters are flagged by Google for manual review has skyrocketed. This belief may not be unfounded, however, as a Google representative at SES NYC, Craig Manning, pointed out that Google will indeed review sites like ChristopherReeve.org or Tsunami.Blogspot.com to check whether the large number of links and high rankings they have achieved are indeed warranted. Craig noted that this was a way to keep low quality sites from ranking well via link-bombing techniques.

Many in the industry, however, reject the manual review idea behind the sandbox as being too convenient and not efficient enough for Google. They point to Google's overarching portrayal of its technology as completely fair and automated, which would certainly preclude human judgments from affecting the search landscape as completely as the sandbox phenomenon does.

Manual review is certainly an explanation that fits all the facts: it accounts for the inconsistencies in the application of the sandbox, the widely varying escape times, and even the proclivity of sites exhibiting common SEO traits (optimized pages, links, etc.) to be more susceptible to penalization.

Major Myths, Red Herrings & Exceptions

For every rule there is an exception, and with Google's sandbox, where few hard rules exist, this is even more true. However, it's important to point out certain exemptions and factors that are outside the normal understanding of the phenomenon. A short list follows:

  • Escaping the Sandbox in only a few weeks - I have never directly observed this phenomenon, but it has been reported to me once, and appeared in the forums another time. For the site I have knowledge of, I can say that it was of very high quality in terms of design, content and usability. I have no way of knowing whether these factors influenced its quick "escape".
  • Extensions - It had been reported that .org websites were less vulnerable to sandboxing than .com, .net or other domains (.info, .tv, etc.). However, based on my experience, and the direct experience of SEOmoz.org, I can say with near certainty that this is not an accurate representation of reality. The only domains never observed to be affected are .mil (military), .gov (government) and .edu (educational).
  • "Trusted Links" to Escape - Rumors have abounded that specific links from places like DMOZ, .gov or .edu websites, or even major news sites like CNN, Reuters or the AP can "release" a site from the sandbox. While these types of links can correlate with a high quality site (occasionally an indicator that the sandbox stint will be shorter), they do not dictate an immediate release. I've personally seen several sites obtain links like these and remain in the "box" for several months afterwards.
  • Have an "in" at Google - Myths have been circulating in the SEO forums that suggest having a relationship with "someone" at Google can get you un-sandboxed. I believe this to be largely false, with the singular exception of a colleague who specifically showed his website to Matt Cutts after a session in NYC and had it promptly un-"boxed" about 2 weeks after the show. Whether this was a direct result of the conversation or simple coincidence, I cannot be sure. I do admit to eavesdropping while waiting in line to speak with Matt (I couldn't help it!). My guess is that Google wrongly penalized the site and rectified the mistake. I'm guessing the $1500 ticket to attend an SES conference was a bargain for the site owner.

Possible Solutions & Suggestions

Although many suggestions have emerged as to how to prevent a new site from entering Google's sandbox, very few have panned out. Several, such as the recommendation to use a subdomain of an existing TLD, have met with mixed success, while others, including the "don't get links" advice, are self-defeating. The best pieces of advice I've seen, drawn from the new sites that have dodged the sandbox, are listed below:

  1. Target "Topical Phenomena" & a Non-Commercial Audience - If you know that you're building a great website that's going to earn a lot of links, the #1 piece of advice I can give is to target your site towards a non-profit/educational audience at first. I'd also highly recommend targeting something newsworthy and interesting to a massive audience. For example, if you're creating a site on real estate in Boston, start with a news/blog site that tracks trends and information about the real estate market, rather than pushing a single service. Offer a tool like Google Maps integrated with the MLS or Craigslist real estate listings or another great piece of information that would be likely to earn lots of natural incoming links.

    The idea behind this strategy is to legitimize the link gain you'll experience at the beginning of the site and, with some luck, avoid the sandbox by being a "topical phenomenon". This strategy is difficult and takes not only hard work and dedication, but out-of-the-box thinking. However, if you're shooting for high traffic and high links quickly, this is the best way to dodge the sandbox.
  2. Build Natural Links & Avoid Getting Blogrolled - One of the most common elements suspected of sandboxing completely "natural" sites is their addition to blogrolls. These links are sitewide, on domains that frequently have many thousands of pages in Google's index, and it appears on the surface that they can cause the link problems that lead to sandboxing. The best way to avoid this is to watch your logs for referring URLs and request to be removed from any blogrolls that are sending traffic to you (a rough log-scanning sketch follows this list). With some luck, the sympathetic blogger will understand and remove you. It seems ridiculous to have to go to these lengths to avoid sandboxing, but in the commercial reality of the web, it may, in fact, help you in both the short and long run. Naturally, if you aren't running a blog on your site, it's much easier to "stay off the rolls", but you also miss inclusion in great blog directories and traffic sources that can earn you high quality links (i.e. Technorati, Blogwise, etc).
  3. Get Noticed in the News - Although this is exceptionally difficult, being picked up by major news services and syndicated to online news portals and newspapers is a great way to avoid the sandbox. It seems that the legitimacy of these link sources can actually help to actively prevent sandboxing. This effect has been noted to me on 2 occasions, and fits with the "topical phenomena" theory.
  4. Build Exceptional Quality Sites - This tip seems highly suspicious, but it also fits the facts. The sites that have been "escaping" Google's filter are often those that are outstanding sources of information for their target group, offer top-notch usability and information architecture, and employ the highest level of professional design. If your site looks like a Fortune 500 website, you're on the right track. It could be simple coincidence that these types of sites have not been "boxed", or it could be the manual review system in play. In either case, this method of site building isn't just good for the sandbox, it also will build links quickly, achieve consumer trust and be a more profitable venture overall. There's simply no reason not to attempt this.
  5. Don't Rely on Google's Traffic - If you know that your site will likely be sandboxed, you can always opt to dodge Google entirely and instead shoot for traffic from other sources. The best way to do this is to target highly or moderately competitive terms at Yahoo!, MSN & Ask that get thousands of searches each day at those engines. Although your link building and site building will require great strength to compete, it's much faster and easier to target these engines with a good site, than to go after Google. You can also look for alternative traffic sources like ads on the top SERPs at Google for the terms, traffic from alternative search sources like Wikipedia, Technorati or topical communities (blogs & forums). If you opt for this methodology, get creative, but don't get sloppy. You still want to build smart and naturally - after all, the Google sandbox is finite in length, and eventually, you will want to "escape".
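
Referenced in tip #2 above, here is a rough, hypothetical sketch of the log-scanning approach: it reads an Apache/nginx combined-format access log and lists referring domains that show up with many distinct referring URLs, which is what a sitewide blogroll link tends to look like. The log path and the threshold are placeholders.

```python
# Scan a combined-format access log for referring domains whose links appear
# across many distinct URLs (a likely sitewide/blogroll pattern).

import re
from collections import defaultdict
from urllib.parse import urlparse

# Matches the request, status, bytes and referrer fields of a combined log line.
LOG_LINE = re.compile(r'"(?P<request>[^"]*)" \d{3} \S+ "(?P<referrer>[^"]*)"')

def sitewide_referrers(log_path: str, min_distinct_urls: int = 20):
    """Map referring domain -> set of distinct referring URLs seen in the log,
    keeping only domains with at least `min_distinct_urls` (placeholder limit)."""
    seen = defaultdict(set)
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LOG_LINE.search(line)
            if not match:
                continue
            referrer = match.group("referrer")
            if referrer in ("", "-"):
                continue
            domain = urlparse(referrer).netloc
            if domain:
                seen[domain].add(referrer)
    return {d: urls for d, urls in seen.items() if len(urls) >= min_distinct_urls}

if __name__ == "__main__":
    # Placeholder log path; point this at your own server's access log.
    for domain, urls in sitewide_referrers("/var/log/apache2/access.log").items():
        print(f"{domain}: {len(urls)} distinct referring URLs (possible blogroll)")
```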

Predictions for the Future and Conclusions

The Google Sandbox's continued existence and impact on the SERPs is difficult for those outside of Google to assess. There are those who would argue that Google's SERPs have become more spammy and those who argue they have become less so, with folks just as divided on whether relevancy has increased or decreased. What has emerged from the last 18 months of study, observation and testing is that the Sandbox is almost certainly designed to help reduce the amount of spam and the manipulation of links to boost rankings in Google's search engine.

As such, it would appear that the best way to avoid the penalty is to avoid techniques common to spam and manipulation. Sadly, since Google is erring on the side of less spam, many legitimate websites are being wiped out of the SERPs for long periods. It's important both to avoid over-diagnosis of the Sandbox and to be aware of the filter's qualities to make identification easier. Although many in the search community advise "sitting back and waiting", I personally do not approve of this approach. While it is important not to obsess over Google's rankings during a sandbox period, it's also important to experiment, grow and push your site and brand. "Sitting back and waiting" is never good advice in the web promotion space.

The future likely holds more of the same from Google and the sandbox. Despite webmasters' frustration, my opinion is that Google's engineers and quality raters are pleased with the success of the filter and are not planning to remove it in the near future. For the long term, however, I predict that much more sophisticated spam filters and link analysis techniques will emerge to replace the sandbox. The current sloppiness of the filter means that many websites that Google would like to have in their results are being caught improperly, and filtration is on a constant evolution at the search engines.

Several people have also predicted that Yahoo! or MSN may take up similar techniques to help stop spam. This would seriously undermine new SEO/SEM campaigns, but it is a possibility. My recommendation is not to discount this possibility, and to launch projects, or at least holding sites and their promotional efforts, ASAP. The web environment right now is still relatively friendly to new sites, but will certainly become more competitive and unforgiving with time, no matter what search engine filters exist.

Additional Resources & Tools:

  • The Sandbox Detection Tool (login required)
  • SEOmoz's own experience in the Sandbox
  • Original Sandbox article from SEOmoz
  • Updated Sandbox Theories from 02/05
Courtesy of SEOmoz

