How to build a centralized and secure Data Catalog on Google Cloud Platform

Time is money. Organizing your data in a Data Catalog is one of the first steps toward giving visibility into an organization's data and saving employees the time they spend looking for it.

My main data work is focused on providing data solutions using GCP. In this article, I am going to focus on the GCP Data Catalog service: a fully managed and highly scalable data discovery and metadata management service.

When building a Data Catalog, the metadata that describes the data is the most fundamental piece. The goal is to make the data searchable so employees are able to find the information they are looking for and, if they don't have access to the data itself, at least learn the specific information that we do want them to know about that data.

Some of the information we want to capture in the metadata includes:

  • Who created a table or who is responsible for a file
  • When that information was generated and when it was updated
  • How much storage it occupies
  • Whether it contains sensitive information and, if so, what kind; and so on.

Another aspect of a Data Catalog is setting permissions that grant access to identities. Here are some examples of permission roles (see the sketch after this list):

  • A Data Analyst can discover data in the GCP project but not access data in that project. This is granted through the Data Catalog Viewer role, which provides metadata read access to all data assets and read access to all Tag Templates and Tags.
  • A Data Curator can discover and access data in GCP projects, which is granted through the Project Viewer role. They can also create and attach tags using tag templates that reside in any of the projects (Data Catalog TagTemplate User role) and edit the attached tags (Data Catalog Tag Editor role).
  • A Data Governor can discover and access data in projects, also granted through the Project Viewer role. In addition, they can access, create, and edit Tag Templates in both projects thanks to the Data Catalog TagTemplate Owner role, which enables creating, updating, and deleting tag templates and their associated tags.
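
As a sketch of what such a grant can look like in practice, the snippet below binds the Data Catalog Viewer role to an analysts group at the project level using the Resource Manager client library. The project ID and group address are hypothetical placeholders, and the same read-modify-write pattern applies to the other roles above.

    # Hedged sketch: grant roles/datacatalog.viewer on a project to a group.
    # "my-project" and the group email are hypothetical placeholders.
    from google.cloud import resourcemanager_v3

    client = resourcemanager_v3.ProjectsClient()
    resource = "projects/my-project"

    # Read-modify-write of the project's IAM policy.
    policy = client.get_iam_policy(request={"resource": resource})
    policy.bindings.add(
        role="roles/datacatalog.viewer",  # metadata read access (Data Analyst)
        members=["group:data-analysts@example.com"],
    )
    client.set_iam_policy(request={"resource": resource, "policy": policy})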

Find the Datasets you want to make discoverable

To start using Data Catalog, you just need to open the GCP console and enable the Data Catalog API. Once the API is enabled, Google Cloud data sources become discoverable automatically. The data that is immediately discoverable includes BigQuery (datasets, tables, and BQML models), Pub/Sub, and Dataproc Metastore. If you use Cloud Data Loss Prevention (DLP), the DLP scan results will also be available in Data Catalog.
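
For example, once the API is enabled you can look up the automatically created entry for a BigQuery table by its resource name, without ingesting anything yourself. A minimal sketch with the Python client library (the project, dataset, and table names are hypothetical):

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Look up the entry Data Catalog auto-created for a BigQuery table.
    # The project/dataset/table below are hypothetical placeholders.
    entry = client.lookup_entry(
        request={
            "linked_resource": (
                "//bigquery.googleapis.com/projects/my-project"
                "/datasets/sales/tables/orders"
            )
        }
    )
    print(entry.name, entry.type_)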

Apart from Google Cloud data sources, you can also integrate on-premises data sources into your Data Catalog, such as the typical RDBMSs (MySQL, PostgreSQL, SQL Server, Teradata, etc.), BI tools (Looker, Qlik, and Tableau), and Hive data sources.

Once all our metadata is ingested, we can use Data Catalog for two main purposes:

  • Make our (meta)data discoverable through search.
  • Enrich our data with additional business metadata through tags.

How to search for data

To discover data, use Data Catalog's search bar, which surfaces all the assets an employee has metadata-level access to. These searches can be quite simple: searching for a substring of characters returns all the assets related in some way to that substring. For more advanced queries, we have to stick to the Data Catalog search syntax, whose qualifiers (operators) can be found in Data Catalog's documentation.
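
The same searches can be run programmatically. A minimal sketch with the Python client library; the project ID and query are hypothetical, and the qualified query uses the documented type= and name: operators:

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Search scope: which projects (or organizations) to search in.
    scope = datacatalog_v1.SearchCatalogRequest.Scope(
        include_project_ids=["my-project"],  # hypothetical project
    )

    # Qualified query: tables whose name contains "orders".
    results = client.search_catalog(
        request={"scope": scope, "query": "type=table name:orders"}
    )
    for result in results:
        print(result.relative_resource_name, result.linked_resource)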

Use entry groups

Entries are contained in an entry group, which is a set of logically related entries together with IAM policies that specify the users who can create, edit, and view those entries.

We can create entry groups with custom entries (e.g., entries for sources such as MySQL, PostgreSQL, or SQL Server databases, or for other data lakes and warehouses such as Redshift, Teradata, etc.). We can also create entry groups formed by Google Cloud Storage (GCS) file sets that we define.
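
Creating an entry group is a single call. A sketch, assuming a hypothetical project, location, and entry group ID:

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Parent project/location and IDs are hypothetical placeholders.
    entry_group = client.create_entry_group(
        parent="projects/my-project/locations/us-central1",
        entry_group_id="on_prem_sources",
        entry_group=datacatalog_v1.EntryGroup(
            display_name="On-premises sources",
            description="Custom entries for MySQL and PostgreSQL tables",
        ),
    )
    print(entry_group.name)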

A GCS file set can only be formed from objects in a single bucket, but an entry group can contain more than one file set. File sets support (complex) wildcards in their definition. For example, we could define a file set by specifying the pattern gs://data-emy/2021_*/**/file?.[xls,csv]. The wildcards mean:

  • * — Match any number of characters at that directory level.
  • ** — Match any number of characters across directory boundaries.
  • ? — Match a single character. E.g. gs://bucket/??.txt only matches objects with two characters followed by .txt.
  • [] — Match any of the range of characters.

With all this, we could say that the specified pattern includes any CSV or XLS file named file? (file1, file2, …, file9) inside any subdirectory of those folders in the bucket whose names start with “2021_”.
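
Registering that pattern as a file-set entry could look like the sketch below. The entry group name and entry ID are hypothetical; the bucket pattern is the article's example.

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Hypothetical entry group created beforehand (see the earlier sketch).
    entry_group_name = (
        "projects/my-project/locations/us-central1/entryGroups/gcs_filesets"
    )

    # A FILESET entry defined by the wildcard pattern from the text.
    entry = client.create_entry(
        parent=entry_group_name,
        entry_id="files_2021",
        entry=datacatalog_v1.Entry(
            display_name="2021 files",
            type_=datacatalog_v1.EntryType.FILESET,
            gcs_fileset_spec=datacatalog_v1.GcsFilesetSpec(
                file_patterns=["gs://data-emy/2021_*/**/file?.[xls,csv]"]
            ),
        ),
    )
    print(entry.name)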

Use Tags & Templates

Data Catalog tags are business metadata that make information discoverable and complete for the rest of the users. These tags can be applied at the column or the table level.

Tags can also be created automatically when a Cloud Data Loss Prevention (DLP) job runs. The auto-generated tag will contain one field per InfoType found (sensitive information types such as email addresses or credit card numbers).

To achieve metadata consistency, the Data Governor can create tag templates. These templates are made up of metadata fields: key-value pairs whose values can be of type string, double, boolean, enumeration, or datetime, and which can be marked as required or optional.
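
A sketch of creating such a template with the Python client; the template ID, field names, project, and location are hypothetical examples:

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    template = datacatalog_v1.TagTemplate(
        display_name="Data governance",
        fields={
            "owner": datacatalog_v1.TagTemplateField(
                display_name="Owner",
                type_=datacatalog_v1.FieldType(
                    primitive_type=datacatalog_v1.FieldType.PrimitiveType.STRING
                ),
                is_required=True,  # must be filled in every tag
            ),
            "has_pii": datacatalog_v1.TagTemplateField(
                display_name="Contains PII",
                type_=datacatalog_v1.FieldType(
                    primitive_type=datacatalog_v1.FieldType.PrimitiveType.BOOL
                ),
            ),
        },
    )
    template = client.create_tag_template(
        parent="projects/my-project/locations/us-central1",  # hypothetical
        tag_template_id="data_governance",
        tag_template=template,
    )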

Data Catalog tag templates allow our Data Curator users to create homogeneous tags with all the information (metadata) we need for a robust and consistent Data Catalog.
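
Attaching a tag based on that template to an entry is then one more call. A sketch with hypothetical names; setting the Tag's column field instead attaches the tag at the column level:

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Hypothetical names; the template comes from the previous sketch.
    entry_name = (
        "projects/my-project/locations/us-central1"
        "/entryGroups/gcs_filesets/entries/files_2021"
    )
    template_name = (
        "projects/my-project/locations/us-central1/tagTemplates/data_governance"
    )

    tag = client.create_tag(
        parent=entry_name,
        tag=datacatalog_v1.Tag(
            template=template_name,
            fields={
                "owner": datacatalog_v1.TagField(
                    string_value="data-team@example.com"
                ),
                "has_pii": datacatalog_v1.TagField(bool_value=False),
            },
            # column="email",  # uncomment to attach at the column level
        ),
    )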

Use policy tags to restrict access to BigQuery table columns

Policy tags apply to BigQuery and give you control over access to BigQuery tables at the column level. By applying policy tags, we can hide BigQuery columns from users who shouldn't see that information. To define fine-grained column access on a BigQuery table, we need to carry out three steps (see the sketch after the list):

  1. Define a taxonomy in Data Catalog and create our policy tags inside it. Defining a taxonomy means creating the different security levels in which the policy tags will reside.
  2. Apply the policy tags to the columns we want to restrict access to. At this point we are controlling who has access to the data itself, not to the metadata, so this step is carried out in BigQuery, not in Data Catalog. Policy tags are applied by editing the schema of the corresponding table and attaching the tags to the specific columns.
  3. Control access by granting the IAM “Fine-Grained Reader” role at the taxonomy level. This role exists specifically and solely to cover this use case: restricting access at the BigQuery column level. Note: column-level security is enforced in addition to existing dataset ACLs; a user needs both the dataset permission and the policy tag permission to access data protected by column-level security.
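
A sketch of steps 1 and 2 with the Python clients; the project, location, dataset, table, and column names are hypothetical, and step 3 boils down to an IAM grant on the taxonomy:

    from google.cloud import bigquery, datacatalog_v1

    # Step 1: a taxonomy with one policy tag (names are hypothetical).
    ptm = datacatalog_v1.PolicyTagManagerClient()
    taxonomy = ptm.create_taxonomy(
        parent="projects/my-project/locations/us",
        taxonomy=datacatalog_v1.Taxonomy(
            display_name="Sensitivity",
            activated_policy_types=[
                datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
            ],
        ),
    )
    high = ptm.create_policy_tag(
        parent=taxonomy.name,
        policy_tag=datacatalog_v1.PolicyTag(display_name="High"),
    )

    # Step 2: attach the policy tag to a column by updating the table schema.
    bq = bigquery.Client()
    table = bq.get_table("my-project.sales.orders")  # hypothetical table
    new_schema = []
    for field in table.schema:
        if field.name == "email":  # hypothetical sensitive column
            field = bigquery.SchemaField(
                field.name,
                field.field_type,
                mode=field.mode,
                policy_tags=bigquery.PolicyTagList(names=[high.name]),
            )
        new_schema.append(field)
    table.schema = new_schema
    bq.update_table(table, ["schema"])

    # Step 3 (not shown): grant roles/datacatalog.categoryFineGrainedReader
    # on the taxonomy (or an individual policy tag) to the allowed users.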

A good practice is to define a solid taxonomy that can serve as the basis for all your BigQuery tables. With this article, you should be able to get started with Google Data Catalog and make data easy to find in your organization.

