Definitive Guide to Removing All Google Analytics Spam

This is a PROVEN WORKING SOLUTION to remove referral spam in your Google Analytics. Updated for 2021: Bothered by referral spam from the likes of bottraffic.live, bot-traffic.xyz or other domains? Implement a Custom Segment as described in #3 below or, if you catch the spam traffic on the day it happens, a Spam Crawler Filter (#4).

A lot has changed since the peak of Google Analytics referral spam in 2016. In those days, a Valid Hostname Filter emerged as the best technique for dealing with ‘ghost referrals’, where fake spam traffic was easily identified by incorrect values in the hostname field. Since then, spammers learned that they could get the Google Analytics tracking codes from real web sites and populate the fake traffic with the appropriate hostname. This means that Spam Crawler Filters became the preferred solution. To be effective, they need to be updated on the day the spam occurs…something that usually does not happen. That leaves a Custom Segment as the only remaining solution to clean up reporting after-the-fact, and that will continue to be your fallback solution.

This article describes each of these techniques. The actual filter expressions will change over time and tend to be property-specific, so I will not publish or maintain a list like I used to. Right now, it is one incident…we’ll see if it continues.

Other techniques have evolved, but they require setting up special tracking hits to identify real traffic to your website. If spam becomes ‘a thing’ in 2021 like it was in 2016, I will elaborate more on those options, but for most people, the Custom Segment will work for now.

Mike Sullivan

Do NOT visit the spam site, since this is an invitation to get a virus or Trojan infection on your computer, or otherwise satisfy the desire of the spammer. I recommend you do a quick Google search first, to see if you can trust a questionable source. Spammers are quickly identified, and you’ll usually see indications in the first page of search results or in a Twitter search. You don’t have to read any of it — identify it as spam, and then don’t give them your time or attention. Just focus on getting rid of it.

How to Prevent and Remove Spam – Overview

  • New property? Consider Google Analytics 4 first
  • Create a Custom Segment to use for reporting
  • Implement Spam Crawler Filters to get rid of spam on the day it occurs
  • Implement a Valid Hostname Filter (optional, no longer effective)

All the information you need (and more!) is provided below in this step-by-step guide.

Avoid Urban Myths and Bad Advice:

  • DO NOT use the Referral Exclusion List – Why?
  • DO NOT waste your time with .htaccess rules or WordPress plugins
  • Google Analytics bounce rate DOES NOT affect search rankings

This is a long article — I have included a lot of background and detail to explain why this solution is so effective, and to dispel a number of urban myths around referral spam. Heed the advice to set up an unfiltered view before you create any view filters — there is no recovery from a bad filter (filtered traffic is gone forever); do not risk your analytics to a typo.


Background: The Many Faces of Referral Spam

The problem of fake referrals in Google Analytics has changed significantly over the past few years. In 2014, we had some bots from semalt and buttons-for-website that visited your website and left fake referrals in your analytics. In December 2014, the attacks began taking advantage of a weakness in Google’s new Measurement Protocol that allowed direct attacks on the Google Analytics tracking servers without having to actually visit your website. This is a lot easier than crawling the web looking for new websites. There were a lot of different types of attacks, from many referral sources, leading to a lot of confusion in the industry.

[Side note: the new Measurement Protocol for Google Analytics 4 includes new security secrets to make it much harder for spammers to take advantage of that attack vector.]

Ranksonic joined in on the fun in March 2015, and the spammers enjoyed playing with new domains and techniques. We have had fake organic search terms (www.get-free-social-traffic.com) and fake events (event-tracking) injected into our analytics, too. Enterprising individuals popped up on Fiverr offering hundreds or even thousands of visits from real webmasters obtained using these techniques.

At the beginning of 2016, there were still lots of players pushing through Google’s defenses, even if only for a few days. As of June 2016, the trend continued, with most ghost spam changing every few days while the crawlers seemed to run for months.

In late 2016, the spam evolved again (ILoveVitaly), this time focused on inserting a fake Language, and using a rotating series of fake and real sources. Another blitz also used valid hostnames on some of the traffic, indicating the spammer is working to get around the common protections people have deployed.

As of summer 2017, Google routinely stopped new spam attacks within a day, but the spammers persisted. Google finally got the better of them by fall 2017 and it was mostly quiet until early 2021, when a single-day event seemed to get past all of their protections and wreaked havoc on many GA accounts around the globe.


1. New Website?

In 2021, the default is a new Google Analytics 4 property. Even though GA 4 is still considered by many to be a work in progress, you would be well advised to start your new site with GA 4 because that is clearly the future for GA. View filters cannot be applied to GA 4, but it is inherently more protected from spam by design, and you can filter the reports if needed.

When you create a classic ‘Universal Analytics’ Google Analytics Account, you also get a Property and a View. The Property gives you the Universal Analytics tracking id (e.g. UA-1234567-1) that you use in the code snippet on your website. You can create 50 Properties in your Account, and they are given -1, -2, -3, … -50 extensions.

In early 2015, I coined the term “ghost referral” to identify the worst offenders because they actually NEVER VISITED YOUR SITE. Using some software magic, they posted fake hits to Google’s tracking service using a random series of tracking IDs. When they picked a series of numbers that includes your tracking ID, Google recorded a referral visit from their source in your reports, even though they knew nothing about your website and never visited it. A number of people have seen ‘traffic’ to Google Analytics properties that have never been used…


2. Turn On Google’s Bots & Spiders Option

Google Analytics (Universal Analytics) has a simple checkbox you can use to exclude easy-to-identify bots and spiders, but you have to enable it for every View you use. In your Google Analytics Admin section, navigate to each View you use, select View Settings, and check the box to Exclude all hits from known bots and spiders.

This feature has recently started affecting referral spam as well (e.g. horoskop-baran.pl / referral), so TURN IT ON!

 


3. Create a Custom Segment

To eliminate spam immediately from all of your reports, even historical reporting, you need to use a Custom Segment. You need to prepare a filter expression to eliminate the spam that you have received, and it can be as simple as combining all of the spam sources in a list separated by vertical bars:

e.g.  bottraffic|bot-traffic

Looking for updated filter expressions? Check out: carloseo.com

Read Google’s instructions on making filter expressions. Important: do NOT start or end the expression with a vertical bar (i.e.  bottraffic|bot-traffic| ).

You may notice that I take some shortcuts in my expressions to save space and complexity (there is also a limit to the number of characters you can use), like dropping the ‘.live’ domain extension, but you should be cautious about shortening them too aggressively. I have seen some people recommend filtering on simple words like buy|cheap|motor|money|seo, which may match valid traffic in the future.

As you discover new spam entries you will have to add to the filters. Remember: these filters will exclude everything that matches, so be careful with your expressions, and TEST, TEST, TEST.
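One way to TEST before touching Google Analytics is to try the expression against a handful of source names offline. Here is a minimal sketch in Python, assuming that "matches regex" behaves as an unanchored partial match. Python's regex engine is not the one Google uses, so treat this as a sanity check only; the sample sources other than the two spam domains are made up for illustration.

```python
import re

def matches(expression: str, source: str) -> bool:
    """Return True if the expression matches anywhere in the source (partial match)."""
    return re.search(expression, source) is not None

spam_expression = "bottraffic|bot-traffic"

# Hypothetical source values, for illustration only.
samples = ["bottraffic.live", "bot-traffic.xyz", "google", "news.example.com"]

for source in samples:
    verdict = "excluded" if matches(spam_expression, source) else "kept"
    print(f"{source}: {verdict}")

# A trailing vertical bar adds an empty alternative, which matches every source.
print(matches("bottraffic|bot-traffic|", "google"))  # True

# An over-short keyword can catch a legitimate (hypothetical) source.
print(matches("seo", "seoul-travel-blog.example"))  # True
```

The last two checks show why a stray vertical bar or a very short keyword is dangerous: both end up excluding traffic you wanted to keep.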

In Google Analytics, open your Reporting view, and click +Add Segment.

Then click New Segment and enter a name like “All Users (No Spam)“. Segments are owned by the login account, so if you use anything website-specific in your expression, include the website name in the title like “All Users (analyticsedge.com)“.

eliminate-spam-create-segment-new

This image includes a Valid Hostname filter (discussed below) and a Spam Crawler Filter.

Select the Advanced > Conditions tab on the left. Create a new entry for the Spam Crawlers:

  • Sessions > Exclude
  • Source > matches regex > spam crawler expression #1    OR
  • Sessions > Exclude
  • Source > matches regex > spam crawler expression #2    OR
  • repeat for any further filter expressions

Save and Apply the Segment

TESTING – The easiest way to test is to use your new segment in combination with the default All Users segment, comparing the Sessions counts. You can find your new segment listed in the Custom grouping. You can select BOTH your new segment AND the All Users segment to compare.

Sharing your Segment Definition With Coworkers

Google Analytics segments are normally account-specific, but a new feature allows you to share a segment with other people who have access to the same Google Analytics view. When editing the segment, click the link in the upper right corner and allow Collaborators to apply/edit the segment.


4. Implement Spam Crawler Filters

Google Analytics allows you to create View Filters that filter the traffic before it gets into your view. Usually this is used to isolate a subdomain or specific portion of a website or type of traffic to a view of its own, but it can also be used to prevent future repeat spam offenders from getting included in your reports. Note that this only works for future traffic, and Google’s defenses tend to prevent repeat offenders, so this approach is really of limited value these days.

BEFORE YOU BEGIN: Create an Unfiltered View

Before you start hacking away at your Google Analytics filter settings, the best practice for implementing new filters starts like this – create an UNFILTERED VIEW, and a TEST VIEW.

1. Make sure you always have an Unfiltered view in your property — that has absolutely no filters. This will ensure you always have the raw, unmodified data should things go wrong. There is no ‘undo’ for a bad filter.

2. Don’t create new filters directly in your main view. Create a new Test view that mirrors your main view in every other respect, and then add the filter(s) there first. Watch it for a few days and compare with the Unfiltered view to make sure it is doing what it should.

3. If you’re happy with the new filter based on this test, then go ahead and add the ‘existing filter’ to your main view.

CREATING THE SPAM FILTER

Some spammers actually crawl the web and visit your site, and others have figured out what your hostname is, so the Valid Hostname filter won’t keep that stuff out. For these, you will need to specifically exclude their visits by naming them in a filter. Filters ONLY work on future visits, but I have found GA will reprocess a view’s data at the end of the day, so a new filter can remove spam if you catch it on the day it occurred.

Creating a New Filter

You can exclude them from your reports in Google Analytics by creating a filter. You identify a “unique signature” that identifies them (and only them), and then create a filter based on that.

Most spam can be eliminated by filtering on Campaign Source. Most people try filtering on Referral, and that filter doesn’t always work because some spammers have used utm codes to stuff values into the Source and Medium, imitating a referral. Note that some of the spam now requires you to filter on the Language Settings field (Spam Crawler 5 below).

Looking for updated filter expressions? Check out: carloseo.com

Read Google’s instructions on making filters.

As you discover new spammers you will have to add to the filters. Remember: these filters will exclude everything that matches, so be careful with your expressions, and test with a segment first.
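As the list of spammers grows, a single expression can outgrow the character limit, which is why you may end up with expression #1, #2 and so on. The sketch below shows one way to assemble them in Python. The `build_expressions` helper and the 255-character default are my assumptions for illustration, so check the limit that actually applies in your account.

```python
MAX_CHARS = 255  # assumed limit, not confirmed here; verify it in your own account

def build_expressions(fragments, max_chars=MAX_CHARS):
    """Join fragments with '|' into as few expressions as fit within max_chars.

    Dots are escaped so they only match a literal dot. A single fragment
    longer than max_chars is still emitted on its own (and will be too long).
    """
    expressions = []
    current = ""
    for fragment in fragments:
        escaped = fragment.replace(".", r"\.")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) > max_chars and current:
            # Adding this fragment would overflow, so start a new expression.
            expressions.append(current)
            current = escaped
        else:
            current = candidate
    if current:
        expressions.append(current)
    return expressions

spam_sources = ["bottraffic.live", "bot-traffic.xyz"]
for number, expression in enumerate(build_expressions(spam_sources), start=1):
    print(f"spam crawler expression #{number}: {expression}")
```

Because the bars are only ever placed between fragments, the result never starts or ends with a vertical bar.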


5. Implement a Valid Hostname Filter for Ghost Visits

Most spam in Google Analytics since 2017 uses valid hostnames, but just in case they go back to their old ways, here’s what worked back in 2015-2016:

“Ghost” traffic never actually visits your website — it is injected into the Google Analytics tracking servers and appears in their reports. Because of that, Javascript filters, WordPress plugins and .htaccess methods are useless at blocking the traffic because there is no traffic to your website. You have no choice but to create a Google Analytics filter or segment to exclude them because they ONLY exist in Google Analytics. The biggest problem with this ghost traffic is that they change as quickly as they appear, so you could be continuously building filters for them. Instead, look for some characteristic that they all share.

Source versus Hostname

Real visits to your website from a referral link have TWO server names available: the Source that the link is from, and the Hostname that the landing page is pointing to (your server). In most cases, the Hostname should always be your server, regardless of where the traffic came from.

For example, here is a sample of the Source and Referral Path (page with the link on it) pointing to this article. Notice the Hostname is always my server.

referral-source-link-hostname-page

Ghost visits send traffic to a random series of tracking ID numbers — they don’t know your server name! They use blank (“(not set)“) or fake hostname values (like ‘google.com’). That means you can eliminate ALL of them simply by filtering to INCLUDE only the valid hostname — your server.
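To see the effect of an INCLUDE filter on the hostname, here is a small Python sketch that applies one to a few rows of the kind you might export from a hostname report. The rows, the session counts and the `split_by_hostname` helper are all invented for illustration; only the idea of keeping rows whose hostname matches your own server comes from this article.

```python
import re

# Include expression for the valid hostname (dot escaped to match a literal dot).
VALID_HOSTNAMES = r"analyticsedge\.com"

# Hypothetical exported rows: hostname, source and session count.
rows = [
    {"hostname": "www.analyticsedge.com", "source": "google", "sessions": 120},
    {"hostname": "(not set)", "source": "spam.example", "sessions": 40},
    {"hostname": "google.com", "source": "spam.example", "sessions": 15},
]

def split_by_hostname(rows, include_expression):
    """Split rows into those an include filter would keep and those it would drop."""
    kept, dropped = [], []
    for row in rows:
        if re.search(include_expression, row["hostname"]):
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped

kept, ghosts = split_by_hostname(rows, VALID_HOSTNAMES)
print(sum(r["sessions"] for r in kept), sum(r["sessions"] for r in ghosts))
# prints: 120 55
```

Nothing in the filter names a spammer. Rows are dropped purely because the hostname is not yours, which is why the ghost sources can change daily without the filter needing an update.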

A. Identify Your Valid Hostnames

STEP CAREFULLY.  Valid hostnames are websites that you have configured to use your Google Analytics tracking ID (e.g. UA-12345678-1). They may include ecommerce shopping carts or telephone call tracking services linked from your website.

Start with a multi-year report showing just hostnames (Audience > Technology > Network > hostname), then identify the valid ones: the servers you REMEMBER configuring with your tracking ID (hint: google.com is NOT one of them).

UPDATE: if you have alternate domains that redirect to your main website domain, do NOT include those redirected domains. If you can type in one hostname/URL and it changes to display a page on a different domain, then it is NOT a valid hostname.

FYI – googleweblight is a new service from Google that serves your pages to mobile networks in some parts of the world. It usually appears with your hostname in front, and it’s ok.

I do not have any tracking codes installed on google.com, mozilla.org, huffingtonpost.com or any of the other sites that appear in the report. I never configured my tracking code ON those sites — they are ghost visits!

IMPORTANT: If you see GOAL CONVERSIONS or REVENUE from (not set) hostnames, you need to dig into why. Maybe they are Event-based call logging and are not associated with a pageview (which has a hostname value). You may need to adjust your filters and/or tracking code snippets.

B. Create the Filter Expression

Create a filter expression that captures all of the domains that you consider to be valid. TEST, TEST, TEST! Then move to production when you are sure you have it all. You may find it easier to play with an Advanced Segment, so you can see the effect of your filter without risking any data loss. See #3 above.

Many people have a problem composing the filter expression because it is Regex (regular expressions), so let’s keep it really simple in this case.

For your filter expression, simply enter your valid hostname. If you have more than one, separate them by a vertical bar ( | ). If you have a third-party payment service like checkout.shopify.com, you may need to enter it as well.

Note: if you can’t see the “Include” radio button in the Filter page, look BELOW the Exclude section (which is expanded when it is selected). When you select Include, the Exclude section will collapse and the Include section will expand as in the image.

It is not necessary to enter all of the subdomains (like www and help) – Regex will perform a partial match by default, so I keep the expression shorter by simplifying to just the root domains.

Note: in proper Regex, you should ‘escape’ the dots (\.), but since a dot matches any character and the likelihood and impact of a false match is negligible, I sometimes leave them out to keep it simple.

analyticsedge.com|youtube.com|fastspring.com

IMPORTANT: do NOT start or end the expression with a vertical bar ( | ); use them only between domains.
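If you want to check the partial-match and unescaped-dot behavior for yourself, the following Python sketch tries the example expression against a few hostnames. It assumes Python's `re.search` is a fair stand-in for how the filter matches, which is my assumption and not something Google documents in those terms. The odd-looking test hostname is invented to trigger the false match.

```python
import re

expression = "analyticsedge.com|youtube.com|fastspring.com"

def is_valid_hostname(hostname: str, expr: str = expression) -> bool:
    """Return True if the include expression matches anywhere in the hostname."""
    return re.search(expr, hostname) is not None

# Partial matching means subdomains are covered without listing them.
print(is_valid_hostname("www.analyticsedge.com"))   # True
print(is_valid_hostname("help.analyticsedge.com"))  # True

# Ghost hostnames do not match and would be filtered out.
print(is_valid_hostname("(not set)"))   # False
print(is_valid_hostname("google.com"))  # False

# An unescaped dot matches any character, so this made-up hostname slips through.
print(is_valid_hostname("analyticsedgexcom.example"))  # True

# Escaping the dots closes that gap.
escaped = expression.replace(".", r"\.")
print(is_valid_hostname("analyticsedgexcom.example", escaped))  # False
```

The false match needs a very contrived hostname, which is why leaving the dots unescaped is a low-risk shortcut, but the escaped form costs only a few extra characters.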


What made me an expert about spam?

My name is Mike Sullivan; I have an engineering degree, and worked for decades in I.T. and software development. I was a Google Analytics Certified professional, and a Top Contributor in the Google Analytics community forum when spam became a problem. I have been working extensively with the Google Analytics API since 2010, providing customized reporting solutions. I founded Analytics Edge in 2013, making a suite of free and inexpensive Excel report automation add-ins and connectors.

Spam was hounding my customers, so I dug into the problem with all the tools at my disposal and thought I’d share what I learned. I wrote this Definitive Guide, coining the term “ghost referrals”, to help resolve the confusion surrounding the various spam types and the different techniques required to deal with them.

I hope this article has helped you.