Definitive Guide to Removing All Google Analytics Spam

This is a PROVEN WORKING SOLUTION for how to remove referral spam in your Google Analytics. Bothered by referral spam from auto-seo-service.com, resell-seo-services.com or another domain that redirects to semalt.com? Implement a Valid Hostname filter as described in #2 below.

A Valid Hostname Filter is still the best technique for dealing with ‘ghost referrals’. The Spam Crawler Filters need to be updated on a daily basis to be effective as view filters — they would be more useful when used in a Custom Segment to clean up reporting after-the-fact.

This article still describes the techniques, but I refer you to carloseo.com for up-to-date filter expressions. Carlos and I have shared a lot over the years in our battle for clean reporting, and I know I am leaving you in good hands.
For updated filter expressions: https://carloseo.com/removing-google-analytics-spam/

Mike Sullivan

How to Prevent and Remove Spam – Overview

  1. BEFORE YOU START: Make an Unfiltered View!
  2. Implement a Valid Hostname Filter to eliminate ghost visits
  3. Implement Spam Crawler Filters to eliminate the targeted spam visits
  4. Create a Custom Segment with these filters to use for reporting

All the information you need (and more!) is provided below in this step-by-step guide. Filter expressions are updated within a day or two as needed.

Urban Myths and Bad Advice:

  • DO NOT use the Referral Exclusion List – Why?
  • Google Analytics bounce rate DOES NOT affect search rankings
  • Using .htaccess rules or WordPress plugins will NOT eliminate any of the ghost referrals

This is a long article — I have included a lot of background and detail to explain why this solution is so effective, and to dispel a number of urban myths around referral spam. Heed the advice to set up an unfiltered view before you begin — there is no recovery from a bad filter (filtered traffic is gone forever); do not risk your analytics to a typo.


Background: The Many Faces of Referral Spam

The problem of fake referrals in Google Analytics has changed significantly over the past few years. In 2014, we had some bots that visited your website and left referrals in your analytics. In December 2014, the attacks began taking advantage of a weakness in Google’s new Measurement Protocol that allowed direct attacks on the Google Analytics tracking servers without having to actually visit your website. I tagged these non-existent visits as ‘ghost’ spam, and the label stuck. This attack was a lot easier than crawling the web looking for new websites. There were a lot of different types of attacks, from many referral sources, leading to a lot of confusion in the industry.

Others joined in on the fun in March 2015, and the spammers enjoyed playing with new domains and techniques. We have had fake organic search terms and fake events injected into our analytics, too. Enterprising individuals popped up offering hundreds or even thousands of visits from real webmasters obtained using these techniques.

At the beginning of 2016, there were still lots of players pushing through Google’s defenses, even if only for a few days. As of June, 2016, the trend continued with most ghost spam changing in a few days, but the crawlers seem to run for months.

In late 2016, the spam evolved again, this time focused on inserting a fake Language, and using a rotating series of fake and real sources. Another blitz also used valid hostnames on some of the traffic, indicating the spammer is working to get around the common protections people have deployed.

As of summer 2017, Google routinely stops new spam attacks within a day, but the spammers persist.


BEFORE YOU BEGIN: Create an Unfiltered View

Before you start hacking away at your Google Analytics settings, the best practice for implementing new filters starts like this – create an UNFILTERED VIEW, and a TEST VIEW.

1. Make sure you always have an Unfiltered view in your property — that has absolutely no filters. This will ensure you always have the raw, unmodified data should things go wrong. There is no ‘undo’ for a bad filter.

2. Don’t create new filters directly in your main view. Create a new Test view that mirrors your main view in every other respect, and then add the filter(s) there first. Watch it for a few days and compare with the Unfiltered view to make sure it is doing what it should.

3. If you’re happy with the new filter based on this test, then go ahead and add the ‘existing filter’ to your main view.


1. New Website?

In early 2015, I coined the term “ghost referral” to identify the worst offenders because they actually NEVER VISIT YOUR SITE. Using some software magic, they post fake hits to Google’s tracking service using a random series of tracking IDs. When they pick a series of numbers that includes your tracking ID, Google records a referral visit from their source in your reports, even though they know nothing about your website and never visited it. A number of people have seen ‘traffic’ to Google Analytics accounts that have never been used…

When you create a Google Analytics Account, you also get a Property and a View. The Property gives you the tracking ID (e.g. UA-1234567-1) that you use in the code snippet on your website. You can create 50 Properties in your Account, and they are given -1, -2, -3, … -50 extensions. Most ghost referral spam hits the default ‘-1’ Property, although some are now hitting the -2 and -3 properties as well.

You can significantly reduce the ghost spam simply by creating and using a second, third or fourth (or tenth) Property. You don’t have to actually use them all. Caution: changing your tracking code on your website will leave the historical data in the old property, so this is really only useful for new websites, or if you are willing to abandon your old data.
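To make the numbering concrete, here is a minimal Python sketch that pulls the property number out of a tracking ID and checks it against the suffixes mentioned above. The function names and the `COMMONLY_TARGETED` set are my own illustration, not anything provided by Google Analytics.

```python
import re

# Shape of a tracking ID such as "UA-1234567-1": account number, then property number.
TRACKING_ID = re.compile(r"^UA-(\d+)-(\d+)$")

# Property suffixes this article reports as being hit by ghost spam (an assumption
# based on the text above, not an official list).
COMMONLY_TARGETED = {1, 2, 3}


def property_index(tracking_id: str) -> int:
    """Return the property number (the trailing -N part) of a tracking ID."""
    match = TRACKING_ID.match(tracking_id)
    if match is None:
        raise ValueError(f"not a UA tracking ID: {tracking_id!r}")
    return int(match.group(2))


def is_commonly_targeted(tracking_id: str) -> bool:
    """True if the property number is one the ghost spam is known to hit."""
    return property_index(tracking_id) in COMMONLY_TARGETED
```

With these definitions, `is_commonly_targeted("UA-1234567-1")` is true while `is_commonly_targeted("UA-1234567-10")` is false, which is the whole idea behind using a later property for a new website.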


2. Implement a Valid Hostname Filter for Ghost Visits

THIS IS THE SINGLE MOST EFFECTIVE SOLUTION TO ELIMINATE FAKE SPAM TRAFFIC!

“Ghost” traffic never actually visits your website — it is injected into the Google Analytics tracking servers and appears in their reports. Javascript filters, WordPress plugins and .htaccess methods are useless at blocking the traffic because there is no traffic to your website. You have no choice but to create a Google Analytics filter to exclude them because they ONLY exist in Google Analytics. The biggest problem with this ghost traffic is that they change as quickly as they appear, so you could be continuously building filters for them.

Source versus Hostname

Real visits to your website from a referral link have TWO server names available: the Source that the link is from, and the Hostname that the landing page is pointing to (your server). In almost all cases, the Hostname should be your server, regardless of where the traffic came from.

For example, here is a sample of the Source and Referral Path (page with the link on it) pointing to this article. Notice the Hostname is always my server.

[Screenshot: Source, Referral Path and Hostname for referrals to this article]

Ghost visits send traffic to a random series of tracking ID numbers — they don’t know your server name! They use blank (“(not set)“) or fake hostname values (like ‘google.com’). That means you can eliminate ALL of them simply by filtering to INCLUDE only the valid hostname — your server.

A. Identify Your Valid Hostnames

STEP CAREFULLY.  Valid hostnames are websites that you have configured to use your Google Analytics tracking ID (e.g. UA-12345678-1). They may include ecommerce shopping carts or telephone call tracking services linked from your website.

Start with a multi-year report showing just hostnames (Audience > Technology > Network > hostname), then identify the valid ones: the servers that you REMEMBER configuring with your tracking ID (hint: google.com is NOT one of them).

UPDATE: if you have alternate domains that redirect to your main website domain, do NOT include those redirected domains. If you can type in one hostname/URL and it changes to display a page on a different domain, then it is NOT a valid hostname.

FYI – googleweblight is a new service from Google that serves your pages to mobile networks in some parts of the world. It usually appears with your hostname in front, and it’s ok.

I do not have any tracking codes installed on google.com, mozilla.org, huffingtonpost.com or any of the other sites that appear in the report. I never configured my tracking code ON those sites — they are ghost visits!
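If you export the hostname report, a few lines of Python can do the first pass of sorting for you. This is only a sketch: the `VALID_ROOTS` values are the example domains used later in this article, the rows and session counts are made up for illustration, and `split_hostnames` is a hypothetical helper name.

```python
# Root domains you remember configuring with your tracking ID.
# These are example values; substitute your own.
VALID_ROOTS = ("analyticsedge.com", "youtube.com", "fastspring.com")


def split_hostnames(report_rows):
    """Split (hostname, sessions) rows into valid and suspect lists."""
    valid, suspect = [], []
    for hostname, sessions in report_rows:
        # A substring check, so subdomains like www. and help. count as valid.
        if any(root in hostname for root in VALID_ROOTS):
            valid.append((hostname, sessions))
        else:
            suspect.append((hostname, sessions))
    return valid, suspect


# Illustrative rows only; the session counts are invented.
rows = [
    ("www.analyticsedge.com", 1200),
    ("help.analyticsedge.com", 85),
    ("(not set)", 40),
    ("google.com", 12),
    ("mozilla.org", 3),
    ("huffingtonpost.com", 2),
]

valid, suspect = split_hostnames(rows)
for hostname, sessions in suspect:
    print(f"suspect hostname: {hostname} ({sessions} sessions)")
```

Treat the suspect list as a prompt for review rather than a verdict: a shopping cart or call tracking service you forgot about would land there too.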

IMPORTANT: If you see GOAL CONVERSIONS or REVENUE from (not set) hostnames, you need to dig into why. Maybe they come from Event-based call logging and are not associated with a pageview (which has a hostname value). You may need to adjust your filters and/or tracking code snippets.

B. Create the Filter Expression

Create a filter expression that captures all of the domains that you consider to be valid. TEST, TEST, TEST! Then move to production when you are sure you have it all. You may find it easier to play with a Custom Segment, so you can see the effect of your filter without risking any data loss. See #4 below.

Many people have a problem composing the filter expression because it is Regex (regular expressions), so let’s keep it really simple in this case.

For your filter expression, simply enter your valid hostname. If you have more than one, separate them by a vertical bar ( | ). If you have a third-party payment service like checkout.shopify.com, you may need to enter it as well.

Note: if you can’t see the “Include” radio button in the Filter page, look BELOW the Exclude section (which is expanded when it is selected). When you select Include, the Exclude section will collapse and the Include section will expand as in the image.

It is not necessary to enter all of the subdomains (like www and help) – Regex will perform a partial match by default, so I keep the expression shorter by simplifying to just the root domains.

Note: in proper Regex, you should ‘escape’ the dots (\.), but since a dot matches any character and the likelihood and impact of a false match is negligible, I sometimes leave them out to keep it simple.

analyticsedge.com|youtube.com|fastspring.com

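IMPORTANT: do NOT start or end the expression with a vertical bar ( | ); use them only between domains. A leading or trailing bar adds an empty alternative, and an empty alternative matches every hostname.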
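You can try the expression in Python before trusting it with your data. This is a sketch under one assumption: that Python's `re.search` behaves like the Google Analytics filter for these simple patterns (both do partial matching, but they are different regex engines). The `is_included` helper and the `analyticsedgeXcom.example` hostname are made up for the demonstration.

```python
import re

# The example expression from this article, with unescaped dots.
expression = "analyticsedge.com|youtube.com|fastspring.com"

# The same expression with the dots escaped, as proper Regex would have it.
escaped = r"analyticsedge\.com|youtube\.com|fastspring\.com"


def is_included(hostname: str, pattern: str = expression) -> bool:
    """True if the pattern is found anywhere in the hostname (a partial match)."""
    return re.search(pattern, hostname) is not None


# Subdomains match without being listed, and ghost hostnames do not.
print(is_included("www.analyticsedge.com"))
print(is_included("google.com"))
print(is_included("(not set)"))

# An unescaped dot matches any character, so this contrived hostname slips through...
print(is_included("analyticsedgeXcom.example"))
# ...and is rejected once the dots are escaped.
print(is_included("analyticsedgeXcom.example", escaped))

# A trailing vertical bar adds an empty alternative that matches everything.
print(is_included("google.com", expression + "|"))
```

The last line is the one to remember: with a stray bar at either end, the Include filter lets every hostname through and the ghost visits come right back.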


3. Implement Spam Crawler Filters

Some spammers actually crawl the web and visit your site, and others have figured out what your hostname is, so the Valid Hostname filter won’t keep everything out. For these, you will need to specifically exclude their visits by naming them in a filter.

Note: if you are technically capable, you could block these sources using classic spam blocking techniques like using .htaccess rules. To learn a little more about these alternatives, you can read the article by Carlos Escalera.

Do NOT visit the referring site, since this is an invitation to get a virus or Trojan infection on your computer, or otherwise satisfy the desire of the spammer. I recommend you do a quick Google search first, to see if you can trust it. Spammers are quickly identified, and you’ll usually see indications in the first page of search results.

Creating a New Filter

You can exclude them from your reports in Google Analytics by creating a filter. You identify a “unique signature” that identifies them (and only them), and then create a filter based on that.

Most spam can be eliminated by filtering on Campaign Source. Most people try filtering on Referral, and that filter doesn’t always work because some spammers have used utm codes to stuff values into the Source and Medium, imitating a referral. Note that some of the spam now requires you to filter on the Language Settings field (Spam Crawler 5 below).

Read Google’s instructions on making filters.

The latest filter expression I recommend is linked at the top of this article. Note that I take some shortcuts in my expressions to save space (there is a limit to the number of characters). I have not yet found any false matches for valid referrals in any of the web properties I have worked with, but you should be cautious of being too aggressive. I have seen some people recommend filtering on simple words like buy|cheap|motor|money|seo, which will simply match far too many valid domains to be recommended.
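To see why the simple-word approach is risky, here is a small Python sketch comparing it with an expression that names specific spam domains. The `targeted` expression is built only from the spam domains mentioned at the top of this article and is not my recommended filter; the "legitimate" sources are illustrative examples I picked because they contain those common words, and `excluded` is a hypothetical helper.

```python
import re

# The over-aggressive expression some people recommend.
too_broad = re.compile(r"buy|cheap|motor|money|seo")

# A targeted expression naming specific spam domains (illustration only).
targeted = re.compile(r"semalt\.com|auto-seo-service\.com|resell-seo-services\.com")

# Illustrative sources: ordinary-looking referrers versus known spam sources.
legit_sources = ["bestbuy.com", "motortrend.com", "moneysavingexpert.com"]
spam_sources = ["semalt.com", "auto-seo-service.com", "resell-seo-services.com"]


def excluded(pattern, sources):
    """Return the sources an Exclude filter with this pattern would throw away."""
    return [source for source in sources if pattern.search(source)]


print("broad filter drops legit:", excluded(too_broad, legit_sources))
print("targeted filter drops legit:", excluded(targeted, legit_sources))
print("targeted filter drops spam:", excluded(targeted, spam_sources))
```

The broad expression throws away every one of the ordinary referrers in this sample, while the targeted one removes only the named spam sources. It also misses semalt.com, which contains none of those words, so it is both too aggressive and incomplete.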

As you discover new spammers you will have to add to the filters. Remember: these filters will exclude everything that matches, so be careful with your expressions, and TEST, TEST, TEST first.


4. Create a Custom Segment

To eliminate spam immediately from all of your reports, even historical reporting, you need to use a Custom Segment. If you have prepared the filter expressions above, you’ve already done all the hard work. If you skipped to this section, go back and start at Step 1.

Start with a copy of my segment from the Google Analytics Solution Gallery [2016-12-22], and modify it to suit, or follow these instructions:

In Google Analytics, open your Reporting view, and click +Add Segment.

Then click New Segment and enter a name like “All Users (No Spam)“. If you have multiple websites in your account, you should include the website in the name, like “All Users (AnalyticsEdge)“.

Select the Advanced > Conditions tab on the left. Create a new entry for the valid hostnames:

  • Sessions > Include
  • Hostname > matches regex: your valid hostnames expression (#2 above)

Then click + Add Filter and add the expressions for the Spam Crawlers:

  • [+Add Filter]
  • Sessions > Exclude
  • Source > matches regex: spam crawler expression #1    OR
  • Sessions > Exclude
  • Source > matches regex: spam crawler expression #2    OR
  • repeat for the rest of the filter expressions

Save and Apply the Segment

The easiest way to test is to use your new segment in combination with the default All Users segment, comparing the Sessions counts. You can find your new segment listed in the Custom grouping. You can select BOTH your new segment AND the All Users segment to compare.
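The logic of the segment, and the comparison against All Users, can be sketched in Python as follows. Everything here is illustrative: the session records are invented, the spam expression is a short placeholder rather than the full crawler expressions, and `in_no_spam_segment` is a hypothetical name for the combined rule (valid hostname AND NOT a spam source).

```python
import re

# Include condition: your valid hostnames expression (example domains).
valid_hostnames = re.compile(r"analyticsedge\.com|youtube\.com|fastspring\.com")

# Exclude condition: placeholder standing in for the spam crawler expressions.
spam_sources = re.compile(r"semalt\.com|auto-seo-service\.com|resell-seo-services\.com")


def in_no_spam_segment(session) -> bool:
    """A session stays if its hostname is valid and its source is not spam."""
    has_valid_hostname = valid_hostnames.search(session["hostname"]) is not None
    is_spam_source = spam_sources.search(session["source"]) is not None
    return has_valid_hostname and not is_spam_source


# Invented sample sessions covering each case.
sessions = [
    {"hostname": "www.analyticsedge.com", "source": "google"},       # real visit
    {"hostname": "www.analyticsedge.com", "source": "semalt.com"},   # crawler spam
    {"hostname": "(not set)", "source": "resell-seo-services.com"},  # ghost, no hostname
    {"hostname": "google.com", "source": "(direct)"},                # ghost, fake hostname
    {"hostname": "help.analyticsedge.com", "source": "(direct)"},    # real visit
]

all_users = len(sessions)
no_spam = sum(in_no_spam_segment(s) for s in sessions)
print(f"All Users: {all_users} sessions, All Users (No Spam): {no_spam} sessions")
```

The gap between the two counts is the spam your segment removed. If the no-spam count looks too low, check the hostname expression first, since a missing valid hostname silently drops real sessions.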

Sharing your Segment Definition With Coworkers

Google Analytics segments are normally account-specific, but a new feature allows you to share one with other people who have access to the same Google Analytics view. When editing the segment, click the link in the upper right corner and allow Collaborators to apply/edit the segment.


5. Turn On Google’s Bots & Spiders Option

Google Analytics has a simple checkbox you can use to exclude easy-to-identify bots and spiders, but you have to enable it for every View you use. In your Google Analytics Admin section, navigate to each View you use, select View Settings, and check the box to Exclude all hits from known bots and spiders.

This feature has recently started affecting referral spam as well (e.g. horoskop-baran.pl / referral), so TURN IT ON!

 

What made me an expert about spam?

My name is Mike Sullivan; I am a Google Analytics Certified professional, and a Top Contributor in the Google Analytics community forum. I have been working extensively with the Google Analytics API since 2010, providing customized reporting solutions. I founded Analytics Edge in 2013, making a suite of free and inexpensive Excel report automation add-ins and connectors.

Spam was hounding my customers, so I dug into the problem with all the tools at my disposal and thought I’d share what I learned. I wrote this Definitive Guide, coining the term “ghost referrals”, to help resolve the confusion surrounding the various spam types and the different techniques required to deal with them.

I hope this article has helped you.