Data Sampling Problems In Annual Reports

Google Analytics uses statistical sampling to improve responsiveness for large queries. There is nothing wrong with statistical sampling — it will usually produce numbers close to the actual, but not always. It starts to display problems with long term trend charts typically shown in annual reports. The Analytics Edge connectors for Google Analytics include a feature that can dramatically reduce sampling errors in your Excel reports.



There Is (Almost) Nothing Wrong With Sampling

First of all, sampling is a tried and true technique, and there are plenty of ways to get more data. Sampling is triggered in a query involving over 500,000 sessions, and the accuracy suffers if you go higher. By itself, sampling really isn’t much of a problem. The numbers display an estimate of the traffic based on a statistical sample. If your query is simple, you get a pretty good estimate.

Problem arise when you include more than 1 dimension in your query — Google’s algorithms tend to flatten the curves. This is especially visible when you create long-term trend reports of something like Source  (time is one dimension, Source is the second). At this time of year, you see a lot of that in annual website reports. If you have traffic that is very dynamic, with peaks and valleys from month-to-month, you will see that sampled reports tend to level-off those peaks and valleys. In the worst case, you may see a flat line over the whole report, which is simply misleading (a.k.a. “evil”).

Little Bites

If you have over 40,000 sessions per month on your website, then a query involving the entire year will push you into the danger zone. To reduce sampling error, the easiest method is to use a shorter time period. Making 12 monthly queries is the direct approach, especially since you want to report monthly values in your trend chart anyways.

minimize-sampling-optionThe Analytics Edge connectors for Google Analytics (both Free and Pro versions) include a checkbox on the Options tab that does this automatically for you. When you refresh the query, it will break the time period of the query into months, run separate queries for each month, and join the results together. Even on very high traffic sites, you will never get a flat trend line because each monthly period is sampled independently.

Ultimate Accuracy


refresh-macroIf you are not interested in a trend but need the absolute most accurate total possible, then include the Date in your query and check the minimize sampling option. Separate queries will be run for each day in the date range, providing the most accurate numbers available (without a Premium account).

The Analytics Edge Core Add-in can remove the Date column, combine the duplicate rows (totalling the metrics), sort the results, and keep the top 10 list you are interested in; all before it writes the results to your worksheet [this is what simple Excel report automation is all about]. Note that when you use this technique, you shouldn’t use averages or rate metrics since the sum of the daily averages is a meaningless measure. Instead, you would query the components of the average (e.g. bounce rate is bounces / sessions) and use Analytics Edge to calculate the new average (sum of daily bounces / sum of daily sessions).

The free Excel report Annual Website Report demonstrates both of these approaches. If you have a unique reporting challenge, don’t be afraid to ask how Analytics Edge can solve it.