Analyze the causes of a Personally Identifiable Information (PII) leak

In this scenario, you receive an alert that sensitive consumer data (specifically first and last names) appears in a view visible to the entire organization.

This information is intended only for specific functional purposes, such as account creation, invoicing, and shipping. However, through a series of transformations and the creation of an analytics view, the Personally Identifiable Information (PII) leaks into a broader analytics schema.

In this tutorial, you use data lineage to trace the flow of sensitive data back to the process that moves it from a trusted to a non-trusted location.

Get started

To complete the use case, first set up the environment and run the data transformations. Use the prerequisites and setup page to connect a remote repository to Dataform. This repository contains the code necessary to set up the dataset and transform the data.

After you set up the environment, use BigQuery and Lineage Explorer to identify where PII crosses a security boundary.

Analyze personal information leak with Lineage Explorer

After you prepare the dataset, trace the personal information leak using the BigQuery Lineage tab.

In this example, you trace the user_email column from the public view back to its source:

  1. In the Google Cloud console, go to the BigQuery page.
  2. Use the search field to find the order_status_stats table.
  3. Click the Lineage tab.
  4. In the Lineage Explorer pane, do the following:
    1. In the Column Level Lineage section, select the user_email column name from the list.
    2. In the Direction section, select the Upstream direction.
    3. Click Apply.
  5. Follow the graph back one step. The graph shows that the user_email column is pulled from the status_counts_by_user_v intermediate view.
  6. Click the process node between the view and its upstream dependencies. The process node shows that a join operation occurs between anonymized order data and a table containing identity information.
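
The kind of join that the process node describes can be reproduced in a small, self-contained sketch. The following Python example uses an in-memory SQLite database rather than BigQuery, and it is only an illustration: the users and orders_anonymized tables, their columns, and the view definition are hypothetical stand-ins. The actual SQL for status_counts_by_user_v lives in the Dataform repository and may differ.

```python
import sqlite3

# In-memory database used only for illustration; the tutorial itself runs on BigQuery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical restricted table that holds identity information.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_email TEXT);

    -- Hypothetical anonymized order data: a user key, but no identity columns.
    CREATE TABLE orders_anonymized (order_id INTEGER, user_id INTEGER, status TEXT);

    INSERT INTO users VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO orders_anonymized VALUES
        (10, 1, 'shipped'), (11, 1, 'shipped'), (12, 2, 'pending');

    -- Assumed shape of the intermediate view: the join reattaches
    -- identity information to the anonymized rows.
    CREATE VIEW status_counts_by_user_v AS
    SELECT u.user_email, o.status, COUNT(*) AS order_count
    FROM orders_anonymized AS o
    JOIN users AS u ON u.user_id = o.user_id
    GROUP BY u.user_email, o.status;
""")

cursor = conn.execute(
    "SELECT user_email, status, order_count "
    "FROM status_counts_by_user_v ORDER BY user_email, status"
)
# Column names that the view exposes to anyone who can query it.
exposed_columns = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(exposed_columns)
print(rows)
```

Although orders_anonymized contains no identity columns, every consumer of the view sees user_email, because the join key links each order back to the restricted table.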

The lineage proves that personal information crosses from a restricted functional table into a broader analytics schema, where unauthorized users can see it.
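
The upstream trace that you perform in Lineage Explorer can also be thought of as a graph traversal. The following Python sketch walks a hand-written, hypothetical set of column-level lineage edges. It does not call the Data Lineage API. Only order_status_stats, status_counts_by_user_v, and user_email come from this tutorial; the dataset names and the source tables are placeholders.

```python
from collections import deque

# Hypothetical column-level lineage: each key is a downstream column and each
# value lists the upstream columns it is derived from ("dataset.table.column").
UPSTREAM = {
    "analytics.order_status_stats.user_email": [
        "analytics.status_counts_by_user_v.user_email",
    ],
    "analytics.status_counts_by_user_v.user_email": [
        "restricted.users.user_email",
    ],
    "analytics.status_counts_by_user_v.status": [
        "staging.orders_anonymized.status",
    ],
}

# Placeholder for the datasets that are allowed to hold PII.
RESTRICTED_DATASETS = {"restricted"}


def trace_upstream(column, edges):
    """Return every upstream column reachable from `column`, breadth-first."""
    found = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in found:
                found.append(parent)
                queue.append(parent)
    return found


def restricted_sources(column, edges, restricted):
    """Return the upstream columns that live in a restricted dataset."""
    return [
        upstream
        for upstream in trace_upstream(column, edges)
        if upstream.split(".")[0] in restricted
    ]


start = "analytics.order_status_stats.user_email"
print(trace_upstream(start, UPSTREAM))
print(restricted_sources(start, UPSTREAM, RESTRICTED_DATASETS))
```

Any column that restricted_sources returns marks a point where data from a restricted dataset reaches the widely visible table, which is the same boundary crossing that the lineage graph reveals.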

For more information about visualizing data with the data lineage graph, see Lineage graph view.