Reading Sitemaps to Get a Full Listing of Pages

When analyzing website traffic, there are times you might want a full listing of the website’s pages…even those that never get any traffic or impressions. Google Analytics and Google Search Console will only report on pages were there is data, so these unrecognized pages continue to be overlooked. This article shows how the Analytics Edge Core Add-in can read the website’s sitemap files to extract a full listing of pages.

In most websites, a sitemap.xml file is created to tell search engines what pages are available on your site and when they were last updated. This file is a simple XML file and must be openly available for the search engines (and you) to read. If you open one in your browser, you may actually see a stylized interpretation of the contents, not the actual XML.

The actual contents of the file is XML markup language (structured text), and the main sitemap.xml file may actually contain a list of other sitemap files that contain pages from different sections of your website. The structure is consistent, though, so that the search engines can read them — each URL is enclosed by <loc></loc> tags.

Reading Text Files From The Web

This is one of those ‘did you know?‘ moments where I tell you that the Read Text File function in the Analytics Edge Core Add-in can ‘read’ web content with an ftp or http address. Simply reference the web address, and the add-in will pull down the contents of the URL. If we read the sitemap.xml file, we will get a blob of text that could appear in a number of forms. Sometimes the <loc> tags and URLs will be nicely formatted on their own lines, but optimized sites will deliver the whole file as one really long line.

 

The Analytics Edge Core Add-in has a number of text transformation functions that can be used to extract the data. I have included a workbook with example macros to get you started — they should work, but since every site is a little different, you might need to tweak things for your site.  Download the workbook: sitemap-contents.xlsx

Here is a quick run through of how it works:

First Macro: Get the Sitemap

The workbook uses a Setup worksheet so you can easily change the sitemap you want to process, so the macro starts by reading that value.

Then it uses the Read Text File function to read that web file. Since the file is not a table with a header row, we need to select appropriate option.

Since the function is usually used to read comma or tab-separated files, you can pick a delimiter. To keep things simple, choose New line so you ‘preserve’ the line feeds in the original file.

Extracting the URLs

Since each URL we want is identified by <loc>…</loc> tags, to get the URLs out of the file, one per row, we can use the Split function to split what we have into rows using the ‘<loc>’ tag. There may be some extra rows containing other data, so we can keep only the URLs we want by Filtering for the </loc> tags. Finally, we can Split into columns on the </loc> tag and use Arrange to keep only the first column (the content between the tags).

One extra step, just in case, is to strip any leading or trailing spaces or tab characters from the URL using the Convert function. Then we can use Write to Worksheet to deliver our results into Excel.

Sitemaps Containing Sitemaps

Sometimes the sitemap file contains a list of other sitemaps — this is typical in WordPress sites, grouping web pages by date or type.

In this case, we can simply reference the results and feed the list of URLs to a second macro that uses a Repeat Macro to loop through all of the sitemaps.

This macro does all the same things as the first macro, with just a few changes.

Instead of starting by reading the Setup worksheet to get the sitemap URL, it uses a Repeat Macro to read (and process) each line of the Sitemaps list downloaded by the first macro.

Then, because some URLs may not actually contain any content (sitemaps can contain dead references or special lookup functions that may not work), we include a Filter function that stops the macro loop if there is no content to process. This prevents the repeat macro from stopping on an error, and it just goes on to the next URL in the list.

Finally, the repeat macro ends with Append to Worksheet so the results from one sitemap URL are appended to the previous one.

List of Web Pages Extracted from the Sitemap

Your final result is a simple list of web pages…which should be all of the web pages available on your website. You can compare this list with the results from Google Analytics of Google Search connectors to see which pages are virtually never seen.

Now you can get a report of pages with no impressions. Analytics Edge makes it possible.