In this scenario, you receive an alert that sensitive consumer data (specifically first and last names) appears in a view visible to the entire organization.
This information is intended only for specific functional purposes, such as account creation, invoicing, and shipping. However, through a series of transformations and the creation of an analytics view, the personally identifiable information (PII) leaks into a broader analytics schema.
In this tutorial, you use data lineage to trace the flow of sensitive data back to the process that moves it from a trusted to a non-trusted location.
Get started
To complete the use case, first set up the environment and run the data transformations. Use the prerequisites and setup page to connect a remote repository to Dataform. This repository contains the code necessary to set up the dataset and transform the data.
After you set up the environment, use BigQuery and Lineage Explorer to identify where PII crosses a security boundary.
Analyze personal information leak with Lineage Explorer
After you prepare the dataset, trace the personal information leak using the BigQuery Lineage tab.
In this example, you trace the user_email column from the public view back to its source:
- In the Google Cloud console, go to the BigQuery page.
- Use the search field to find the order_status_stats table.
- Click the Lineage tab.
- In the Lineage Explorer pane, do the following:
  - In the Column Level Lineage section, select the user_email column name from the list.
  - In the Direction section, select the Upstream direction.
  - Click Apply.
- Follow the graph back one step. The graph shows that the email is pulled from the status_counts_by_user_v intermediate view.
- Click the process node between the view and its upstream dependencies. The process node shows that a join operation occurs between anonymized order data and a table containing identity information.
The lineage proves that personal information crosses from a restricted functional table into a broader analytics schema, where unauthorized users can see it.
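The upstream trace that you follow in the graph can also be pictured as a small graph walk. The following Python sketch is illustrative only: it does not call BigQuery or the Data Lineage API, and the dataset names (analytics, restricted), the two source tables (users, orders_anonymized), the status_count column, and the find_boundary_crossings helper are all hypothetical. Only order_status_stats, status_counts_by_user_v, and user_email come from this tutorial.

```python
from collections import deque

# Hypothetical column-level lineage edges. Each key is a
# "dataset.table.column" that maps to the columns it is derived from.
# Dataset names and the two source tables are assumptions for illustration.
UPSTREAM = {
    "analytics.order_status_stats.user_email": [
        "analytics.status_counts_by_user_v.user_email",
    ],
    "analytics.status_counts_by_user_v.user_email": [
        "restricted.users.email",
    ],
    "analytics.status_counts_by_user_v.status_count": [
        "analytics.orders_anonymized.status",
    ],
}

# Assumed set of datasets that are allowed to hold PII.
TRUSTED_DATASETS = {"restricted"}


def dataset_of(column):
    """Return the dataset part of a "dataset.table.column" name."""
    return column.split(".", 1)[0]


def find_boundary_crossings(start, upstream, trusted):
    """Walk upstream from `start` and collect (source, target) edges
    where data flows from a trusted dataset into a non-trusted one."""
    crossings = []
    seen = {start}
    queue = deque([start])
    while queue:
        target = queue.popleft()
        for source in upstream.get(target, []):
            if dataset_of(source) in trusted and dataset_of(target) not in trusted:
                crossings.append((source, target))
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return crossings


crossings = find_boundary_crossings(
    "analytics.order_status_stats.user_email", UPSTREAM, TRUSTED_DATASETS
)
for source, target in crossings:
    print(f"{source} -> {target}")
```

In this sketch, the only edge reported is the one that feeds the intermediate view from the assumed identity table, which mirrors the join that the process node reveals in Lineage Explorer.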
For more information about visualizing data with the data lineage graph, see Lineage graph view.