How To Redact or Remove PII In Piwik Pro Analytics

On my LinkedIn teaser post about data minimization and redacting Personally Identifiable Information (PII) in Piwik Pro Analytics with the solution discussed in this blog, Piotr Korzeniowski, the current CEO of Piwik Pro, affirmed that Piwik Pro permits the collection and storage of personal data and PII, including PII that is collected through the URL when it is present there.

However, there is an exception for entities in the healthcare industry bound by HIPAA regulations. In such cases, where the definition of PII is much broader and is termed “Protected Health Information” (PHI), a signed Business Associate Agreement (BAA) with Piwik Pro is required. Note that this provision is available only to non-core Piwik Pro customers.

In that post, Piotr Korzeniowski presented solid points, including that embracing the principle of data minimization and the redaction of PII is the right thing to do. 

Nonetheless, I believe that certain types of PII are too sensitive to collect, and this applies not only to Piwik Pro but to other vendors as well. Information such as credit card details, passwords, and Social Security numbers (SSN) should not be transmitted to whatever web analytics tool you’re using. Working with developers to keep such data out of browser URLs in the first place is paramount.

When “sensitive” PII data points are present in the page URLs collected by any analytics tool deployed on your website, there is a risk of accidentally exposing visitors’ personal data to external individuals, for example when multiple agencies work in the same Piwik Pro Analytics property or web analytics tool.

With all this covered, the redaction solution and methodologies detailed in this article hold substantial value for your organization’s privacy strategy around Piwik Pro Analytics if any of the following apply: your business is a covered entity subject to privacy legislation or guidelines that prevent the exposure of PII to analytics vendors’ servers; your organization champions privacy and advocates for data minimization; or you simply do not want external agencies to see PII collected through page URLs in your Piwik Pro Analytics property.

Allow me a moment to clarify the title of this article. Firstly, this piece focuses on redacting Personally Identifiable Information (PII) within the Page URL dimension of Piwik Pro Analytics, as illustrated below:

The “Page URL” dimension is commonly used in Piwik Pro Analytics reports created in the Piwik Pro tool, Google Sheets, or Looker Studio alongside various “metrics” and “dimensions”.

Furthermore, I will introduce a Piwik Pro privacy configuration that enables you to “Redact the User IP Address.” I must emphasize exercising caution when employing this secondary solution since Piwik Pro offers an alternative option to “Anonymize IP Address,” which yields a less significant impact compared to full IP redaction. I am working on creating a comprehensive article delving into the nuances of IP Anonymization/Masking and IP Redaction.

It is essential to understand that implementing PII redaction does not necessarily equate to staying privacy compliant. Instead, this process represents a single facet of the technical, practical and organizational measures employed to demonstrate a commitment to user privacy and regulatory compliance.

Within the confines of this article, I will unveil a JavaScript-based solution that caters to the redaction of more than eight distinct types of PII, which may appear in over forty (40) distinct variations. The solution’s versatility allows for customization, including new PII types or variations specific to your business. However, to keep things clear, I will first explain the concept of PII and give some examples.

Additionally, if you prefer to handle PII in Piwik Pro Analytics by removal (an approach I do not entirely endorse) instead of redaction, a supplementary solution is included in this article.

As for Google Analytics (GA4) users, a considerable spoiler awaits you in this blog post, an announcement likely to excite you.

What Is Personally Identifiable Information (PII), and What Are Some Examples?

To refresh our understanding, let’s take a moment to delve into the concept of PII. PII is an acronym for “Personally Identifiable Information.” It refers to data that can singularly, or in combination, be employed to directly identify, establish contact with, or precisely locate an individual. An intriguing aspect of PII lies in its nuanced definition, which varies across diverse privacy regulations such as GDPR, DPA, HIPAA, LGPD, and others. Despite these disparities, common threads unite them. Hence, seeking guidance from legal or privacy experts is recommended to legally interpret what Personally Identifiable Information means within the context of the privacy guidelines covering your business.

Several instances of PII have been universally agreed upon within the digital realm. These include but are not limited to the following:

  • Email addresses
  • Home or mailing addresses
  • Phone numbers
  • Accurate geographic coordinates (like GPS data, with exceptions noted)
  • Complete names or usernames
  • Social Security numbers (SSN)
  • Details of credit or debit cards

For instance, approximately eight distinct types of personally identifiable information are detectable in the URL below.

https://example.com/page?email=test@example.com&TEL=123-456-7890&PAssword=secretpassword&mob=234d45er&firstname=John&lname=Doe&address=123+Main+St&postcode=12345&po%20box=livinglife&fn=mary&state=AZ&drive=lambogini&lat=ideyforyou&lon=makeiask&amexcard=374245455400126&visa=4263982640269299&mastercard1=2222420000001113&mastercard2=2223000048410010&ss=050-64-8120

In the following sections of this article, I will delve further into the techniques for redacting these sensitive data variables, explicitly focusing on the application within Piwik Pro analytics.

Identifying PII Leaks in Piwik Pro Analytics URLs: How to Proceed

The process encompasses several approaches: the easy, the seamless, and the more intricate method. However, I won’t delve into the intricacies of each technique; instead, I will provide you with a generic and popular way of doing this within the Piwik Pro Analytics user interface.

One widely adopted approach involves navigating to the “pages” performance report housed within the “Behaviour” report category, situated in the “Reports” section of Piwik Pro Analytics.

Subsequently, you can input a fragment of your PII directly into the search field or leverage “quick filters”, which offer the capability to use regular expressions (regex) for refining your report filter conditions.

Using the “Quick Filters” feature makes it possible to use regex patterns to find pages with potential PII leaks quickly.

This action will unveil pages where user-identifiable information was present in the URL and collected via the page URL dimension.
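As a rough illustration of the kind of regex that could be used for this search, the sketch below checks a few made-up URLs against a hypothetical pattern. The pattern and the sample URLs are assumptions for demonstration only, the key list is borrowed from the redaction code shared later in this article, and whether the quick filter accepts this exact syntax is something to confirm in your own account.

```javascript
// Hypothetical pattern: flags URLs that contain an email address
// (raw "@" or encoded "%40") or a PII-looking query key.
var piiPattern = /(@|%40)|[?&](email|tel|telephone|phone|mobile|password|passwd|firstname|lastname|fname|lname|address|postcode|zipcode|zip)=/i;

// Made-up URLs used only to try the pattern out
var sampleUrls = [
  'https://example.com/page?email=test@example.com',
  'https://example.com/checkout?step=2&TEL=123-456-7890',
  'https://example.com/blog/analytics-tips'
];

// Keep only the URLs that look like they carry PII
var flagged = sampleUrls.filter(function (url) {
  return piiPattern.test(url);
});

console.log(flagged);
```

Running this in a browser console or Node.js before pasting the pattern into a report filter is a quick way to see which of your own URLs it would catch.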

What Data Can Be Redacted Using This Methodology? 

The JavaScript-based solution that we’ve created allows you to redact various categories of PII data that might traverse into your Piwik Pro Analytics via the strings in the page URLs. This encompasses:

  • Email addresses (whether encoded or decoded)
  • Phone numbers
  • Passwords
  • Names
  • Addresses
  • Geographical coordinates
  • Postal codes
  • Credit card particulars (Mastercard, Amex, and Visa)
  • Social Security numbers

With a comprehensive understanding of the particulars, let’s delve into the solution and its implementation.

PII Redaction in Piwik Pro: The Solution and How It Works

For most Piwik Pro implementations, particularly those seeking to enhance data minimization and preemptively secure sensitive information from exposure, I strongly endorse using Version 1 of this redaction methodology. This approach is well-suited for redacting PII within your Piwik Pro analytics property.

You will be using the code below for this approach. However, I recommend reading the instructions beneath the code to understand how to complete the implementation.

function redactURL() {
  var url = window.location.href;

  // Redact emails
  url = url.replace(/(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b|\b[A-Za-z0-9._%+-]+%40[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)/gi, '[REDACTED_EMAIL]');

  // Redact phone numbers
  url = url.replace(/\b(\+\d{1,2}\s?)?(\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b/gi, '[REDACTED_PHONE]');

  // Redact phone numbers two (with equal sign)
  url = url.replace(/(\btel=|telephone=|phone=|mobile=|mob=)([^&\s]*)/gi, '$1[REDACTED_PHONE2]');

  // Redact passwords
  url = url.replace(/(\bpassword=|passwd=|pass=)([^&\s]*)/gi, '$1[REDACTED_PASSWORD]');

  // Redact names
  url = url.replace(/(\bfirstname=|lastname=|fullname=|fname=|lname=|surname=|first_name=|last_name=|fn=|ln=|username=)([^&\s]*)/gi, '$1[REDACTED_NAME]');

  // Redact addresses
  url = url.replace(/(\baddress=|street=|road=|drive=|pobox=|po%20box=|po_box=|address_street=|address_name=)([^&\s]*)/gi, '$1[REDACTED_ADDRESS]');

  // Redact other location information
  url = url.replace(/(\baddress_country_code=|address_state=|lat=|lon=)([^&\s]*)/gi, '$1[REDACTED_LOCATION]');

  // Redact zip codes
  url = url.replace(/(\bpostcode=|zipcode=|zip=|address_zip=)([^&\s]*)/gi, '$1[REDACTED_ZIPCODE]');

  // Redact credit card (Visa/MC), keeping the "=", "," or ";" in front of the number
  url = url.replace(/([=,;]\s*)(?:\d{4}[-\s+]*){3}\d{4}(?=$|[,;:/?&#])/g, '$1[REDACTED_PAYMENT_CARD]');

  // Redact credit card (Amex), keeping the delimiter in front of the number
  url = url.replace(/([=,;]\s*)\d{4}[-\s+]*\d{6}[-\s+]*\d{5}[-\s+]*(?=$|[,;:/?&#])/g, '$1[REDACTED_PAYMENT_CARD]');

  // Redact social security number, keeping the delimiter in front of the number
  url = url.replace(/([=,;]\s*)\d{3}[-\s+]*\d{2}[-\s+]*\d{4}(?=$|[,;:/?&#])/g, '$1[REDACTED_SSN]');

  return url;
}

// Update The URL in Piwik Pro
window._paq = window._paq || [];
_paq.push(["setCustomUrl", redactURL()]);


// Example usage
var redactedURL = redactURL();

console.log('Redacted URL:', redactedURL);

To integrate this code, place the redaction code above the Piwik Pro tracking configuration script, following the visual guidance outlined below.

Instrumentation 1: Placing the redaction JavaScript code in the same script tag as the Piwik Pro config tag but above it.

Instrumentation 2: Placing the JavaScript code above the Piwik Pro config tag, but not in the same script tag.

If you implemented Piwik Pro through a website plugin or a Google Tag Manager (GTM) template, the approach differs slightly. For the plugin case, I suggest copying the Piwik Pro PII redaction JavaScript code and placing it as high as possible above the Piwik Pro tracking script injected by the plugin. For the GTM template case, use tag sequencing or tag priority to run the JavaScript code in a tag that fires before the template tag, so that the redaction solution is active before the Piwik Pro tracking script initializes on the website.

Customizing The Code, Which You Probably Won’t Need:

Flexibility is embedded within the JavaScript code, permitting customization by appending newly identified query keys unique to your business and not included in the JavaScript code.
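To make this concrete, here is a minimal sketch of how extra keys could be handled. The helper name buildKeyRule and the keys member_id and loyalty_no are made up for illustration and are not part of the original code. The pattern matches a key name anywhere it appears, so prefixed variants such as user_member_id would also be caught, and keys containing regex special characters would need escaping first.

```javascript
// Hypothetical helper: builds a redaction rule from a list of query keys.
// Keys are inserted into the pattern as-is, so they must not contain
// regex special characters.
function buildKeyRule(keys, label) {
  var pattern = new RegExp('((?:' + keys.join('|') + ')=)([^&\\s]*)', 'gi');
  return function (url) {
    // Keep the key and "=" ($1), swap the value for a placeholder
    return url.replace(pattern, '$1[' + label + ']');
  };
}

// "member_id" and "loyalty_no" are made-up business-specific keys
var redactCustomKeys = buildKeyRule(['member_id', 'loyalty_no'], 'REDACTED_CUSTOM');

console.log(redactCustomKeys('https://example.com/account?member_id=98765&tab=orders'));
```

The same effect can be had by simply adding your key to one of the existing alternations in the redaction code; the helper only keeps custom keys in one place.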

Testing If The PII Redaction Solution Works In Piwik Pro Analytics:

Using the Piwik Pro Analytics “Tracker Debugger”, you can test in real time whether your solution works.

Navigate to the “Settings” view, and find “Tracker Debugger” under “Personal Tools”. Click to open.

In the “Tracker Debugger”, locate your visit and check whether the redaction is working. While testing the instrumentation, make sure PII is actually present in the URL.

Alternatively, use the pages performance report to see if the solution works.
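Before checking in the interface, it can also help to try a few rules against a sample URL locally. The sketch below is a trimmed-down copy of three rules from the redaction code, rewritten to take the URL as an argument so it can run in a browser console or Node.js. The function name redactSample and the test URL are assumptions for illustration.

```javascript
// Trimmed-down copy of three rules from the redaction code,
// taking the URL as an argument instead of reading window.location.href
function redactSample(url) {
  // Emails, raw ("@") or encoded ("%40")
  url = url.replace(/(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b|\b[A-Za-z0-9._%+-]+%40[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)/gi, '[REDACTED_EMAIL]');
  // Passwords
  url = url.replace(/(\bpassword=|passwd=|pass=)([^&\s]*)/gi, '$1[REDACTED_PASSWORD]');
  // Names (subset of the keys in the full code)
  url = url.replace(/(\bfirstname=|lastname=|fname=|lname=)([^&\s]*)/gi, '$1[REDACTED_NAME]');
  return url;
}

// Made-up URL containing three kinds of PII and one harmless parameter
var testUrl = 'https://example.com/page?email=test@example.com&PAssword=secretpassword&firstname=John&plan=pro';
var result = redactSample(testUrl);

console.log(result);
```

If the logged URL still shows any of the original values, the rule for that key needs adjusting before you rely on the Tracker Debugger check.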

What other solutions exist?

Suppose your privacy needs also extend to the redaction of IP addresses. In that case, you should explore the extra privacy settings that I’ll be covering in the following section.

Lastly, for rare scenarios where you might need to remove PII from the URL rather than redact it (redaction is what I always recommend), I encourage you to continue reading to uncover an alternative strategy tailored to such needs.

Redacting PII in Piwik Pro: IP Address Redaction (Masking/Anonymization Included)

If you also need to redact IP addresses in Piwik Pro, you can combine the code I shared earlier with a built-in Piwik Pro privacy feature that stops Piwik Pro from accessing and logging user IP information. That is what this section of the article is about.

It’s essential to recognize that while this approach ensures a stricter privacy layer, it also eliminates the availability of location-related data in your analytics property. Exploring Piwik Pro’s IP “masking/anonymization” capabilities is advisable and mostly recommended instead of the IP redaction approach. 

Piwik Pro IP masking is a privacy setting that masks/anonymizes part of the user’s IP address while doing its best to preserve location data in Piwik Pro Analytics to a considerable extent, depending on your configuration.

Example of when “Level 2” masking/anonymization of website visitors’ IP addresses is activated.
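To picture what masking part of an address means, here is a purely illustrative sketch. It assumes that a masking level of N zeroes the last N octets of an IPv4 address; that is my simplification for demonstration, not a description of how Piwik Pro implements the setting, and the function name maskIPv4 is made up.

```javascript
// Illustrative only: assumes masking level N zeroes the last N octets
// of an IPv4 address. Not Piwik Pro's actual implementation.
function maskIPv4(ip, level) {
  var octets = ip.split('.');
  if (octets.length !== 4 || level < 0 || level > 4) {
    throw new Error('Expected an IPv4 address and a level between 0 and 4');
  }
  return octets.map(function (octet, index) {
    // Octets at or beyond position (4 - level) are blanked out
    return index >= 4 - level ? '0' : octet;
  }).join('.');
}

console.log(maskIPv4('203.0.113.42', 1)); // "203.0.113.0"
console.log(maskIPv4('203.0.113.42', 2)); // "203.0.0.0"
```

The more octets are blanked, the less precise any location lookup based on the address can be, which is the trade-off the masking level controls.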

Here is how Piwik Pro Analytics stores and processes the user IP address when redaction is activated.

To turn off IP address collection or mask them for anonymization in Piwik Pro, go to the “Administration” view.

Select your Piwik Pro project and go to the “Privacy” tab. Scroll down to the “IP addresses” section.

You can choose to disable IP address collection, which prevents Piwik Pro Analytics from collecting and storing the user’s IP address if you need to redact it.

* Be careful not to enable this at the global settings level if it’s only intended for a specific Piwik Pro property.

I prefer using IP anonymization/masking rather than full redaction. However, the choice depends on your organization’s needs and the privacy regulations your business must follow.

Piwik Pro Analytics provides an easy way to mask IP addresses. You can turn on IP address data collection and enable the “Mask IP addresses” setting. Additionally, you can choose the masking level and select which location data gets collected before applying the masking.

To check if your configuration is functioning properly, you can utilize the “Tracker Debugger” feature in Piwik Pro. This feature lets you view real-time data collection and ensure everything works correctly. It’s important to keep in mind that neither this configuration nor the PII redaction code mentioned in our blog post applies retroactively.

Preferring PII Removal Over Redaction in Piwik Pro: (The Promised Supplementary Solution)

I acknowledge that this article focuses on the redaction of Personally Identifiable Information (PII) within URLs, thus preventing its transmission to Piwik Pro servers and concealing it from visibility within your reports, mainly via the page URL dimension. However, should your requirement lean toward completely removing this PII rather than redaction, the code provided below should help identify and remove the PII from the URL.

I personally advocate for and recommend the redaction approach over complete removal. This preference stems from the fact that redaction keeps you aware of the presence of PII in the URL, even if it wasn’t captured because it got redacted. 

This awareness further provides useful insights into the nature of the PII type, empowering collaboration with your marketing and development teams to leverage alternative strategies to avert the presence of personally identifiable information in the user’s browser right from the outset.

On the other hand, adopting the removal approach results in a transformed and cleaned-up URL devoid of any visible redacted PII. However, this path bears the consequence of rendering you unaware of any PII leakage within the browser URL, and it can also be problematic in cases where PII appears as a search term in the page URL.
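The difference between the two approaches is easiest to see side by side. The sketch below applies a single simplified rule of each kind to a made-up URL; the patterns are cut down to one key (lname) for illustration and are not the full solutions from this article.

```javascript
// Made-up URL with one PII parameter between two harmless ones
var sample = 'https://example.com/search?q=shoes&lname=Doe&page=2';

// Redaction: keep the key, swap the value for a placeholder
var redacted = sample.replace(/(lname=)([^&\s]*)/gi, '$1[REDACTED_NAME]');

// Removal: drop the key and its value entirely
// (simplified: only handles a parameter that is not first in the query string)
var removed = sample.replace(/&lname=[^&#]*/i, '');

console.log(redacted);
console.log(removed);
```

In the redacted URL the lname key is still visible, which tells you a name was being passed in the URL; in the removed URL nothing hints that PII was ever there.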

For reference, presented below is the code required to implement the removal approach.

// Get the current page's URL
var currentURL = window.location.href;

// List of query string keys to check for
var queryKeys = [
  "tel", "telephone", "phone", "mobile", "mob", "password", "passwd", "pass", "firstname",
  "lastname", "fullname", "fname", "lname", "surname", "first_name", "last_name",
  "fn", "ln", "username", "address", "street", "road", "drive", "pobox",
  "po%20box", "po_box", "address_street", "address_name", "address_country_code",
  "address_state", "state", "lat", "lon", "postcode", "zipcode", "zip", "address_zip"
];

// Email regex pattern
var emailRegex = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b|\b[A-Za-z0-9._%+-]+%40[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Phone number regex pattern
var phoneRegex = /\b\d{10,}\b/g;

// Credit card regex patterns (the leading "=", "," or ";" is captured so it can be kept)
var visaMastercardRegex = /([=,;]\s*)(?:\d{4}[-\s+]*){3}\d{4}(?=$|[,;:/?&#])/g;
var amexRegex = /([=,;]\s*)\d{4}[-\s+]*\d{6}[-\s+]*\d{5}[-\s+]*(?=$|[,;:/?&#])/g;

// Social security regex pattern
var ssnRegex = /([=,;]\s*)\d{3}[-\s+]*\d{2}[-\s+]*\d{4}(?=$|[,;:/?&#])/g;

// Function to remove a whole query parameter (key and value) from the URL.
// The lookahead requires the key to be complete, so "tel" does not clip "telephone".
function removeQueryString(url, key) {
  var regex = new RegExp('([?&])' + key + '(?:=[^&#]*)?(?=&|#|$)', 'gi');
  return url.replace(regex, '$1');
}

// Function to remove matches of a regex pattern from the URL
// (pass '$1' as the replacement to keep a captured delimiter)
function removeRegexMatches(url, regex, replacement) {
  return url.replace(regex, replacement || '');
}

// Function to tidy separators left behind after parameters are removed
function tidyQueryString(url) {
  return url.replace(/&{2,}/g, '&').replace(/\?&/, '?').replace(/[?&](?=#|$)/, '');
}

// Keep the URL's original case: key matching uses the "i" flag
// and the email pattern covers both cases
var transformedURL = currentURL;

// Iterate through the query keys and remove them from the URL if present
for (var i = 0; i < queryKeys.length; i++) {
  var key = queryKeys[i];
  transformedURL = removeQueryString(transformedURL, key);
}

// Remove email matches from the URL
transformedURL = removeRegexMatches(transformedURL, emailRegex);

// Remove phone number matches from the URL
transformedURL = removeRegexMatches(transformedURL, phoneRegex);

// Remove credit card matches from the URL
transformedURL = removeRegexMatches(transformedURL, visaMastercardRegex, '$1');
transformedURL = removeRegexMatches(transformedURL, amexRegex, '$1');

// Remove social security matches from the URL
transformedURL = removeRegexMatches(transformedURL, ssnRegex, '$1');

// Tidy up leftover "&&", "?&" and trailing separators
transformedURL = tidyQueryString(transformedURL);

// Update The URL in Piwik Pro
window._paq = window._paq || [];
_paq.push(["setCustomUrl", transformedURL]);

// Log the new URL without the specified query keys, emails, phone numbers, credit card numbers, and social security numbers
console.log(transformedURL);

Instrumentation should follow a similar approach as outlined in the redaction methodology use case.

Concluding Insights and an Unveiled Secret

At last, we’ve reached this article’s reflection and concluding part. So far, we’ve explored Piwik Pro’s standpoint on PII collection through URLs or any other method. We delved into the fundamental notion of Personally Identifiable Information (PII) and a well-known method for identifying whether it is present in your Piwik Pro Analytics property.

The article gave us a platform to take a detailed look at redacting sensitive user data in Piwik Pro and the PII types that the JavaScript-based solution will help redact. We shared and explained the different code versions we’ve developed, each serving specific purposes, and highlighted how you can customize the solution shared in this blog post.

In our pursuit of equipping you with an adept approach to handling the PII of your website users and ensuring data transmission to Piwik Pro servers aligns with data minimization principles, we not only expounded on the process of redaction but also presented an alternative strategy involving PII removal. Yet, as elaborated in our earlier discourse, I discourage adopting the latter path (the removal methodology).

Piwik Pro offers many privacy settings, allowing you to ensure compliance with privacy regulations and compliant management of user data within Piwik Pro Analytics. At DumbData, we’ve also taken strides to cover some privacy innovations of Piwik Pro by crafting an article series on seamlessly integrating Piwik Pro with consent management platforms such as Osano, CookieYes, and CookieFirst.

And now, the promised spoiler for Google Analytics 4 users: the JavaScript code in this article identifies more types of PII, in more variations, than the solutions that are out there, and with a little customization it can also be used in Google Analytics 4.

If you’re eager to learn how to set this up and the intricacies behind it, fear not! I’m prepared to reveal how you can customize and utilize the JavaScript code shared here to redact Personally Identifiable Information or PHI within Google Analytics 4 (with an emphasis on PII conveyed through page URLs). 

Should this revelation intrigue you, and if you want to explore further insights about Piwik Pro, why not subscribe to the DumbData newsletter? I’ll ensure you’ll be the first to learn when this insightful feature goes live. Additionally, you’ll remain up-to-date on our latest website publications and valuable additions to the DumbData resource hub.

Until then, stay happy collecting and using data the compliant way.
