This repository guides users through the fundamental concepts and practical implementation of Google Cloud Dataplex, GCP's intelligent data fabric.

janaom/dataplex-tutorial

Dataplex

Dataplex is Google Cloud's intelligent data fabric that unifies distributed data across data lakes, data warehouses, and data marts. It provides centralized data management, governance, and discovery without moving or duplicating data.

Key capabilities:

🏗️ Organize data using Lakes → Zones → Assets hierarchy

🔍 Automatic metadata discovery and cataloging

✅ Data quality monitoring and validation

🔗 Data lineage tracking across BigQuery and Cloud Storage

🔐 Unified security and access control

This tutorial will guide you through setting up and using Dataplex to manage your data estate on GCP.

I am using the same BQ table as in the previous tutorial, bq-data-masking-example.

Data Catalog

Data Catalog in Dataplex provides a unified discovery platform that helps both technical and non-technical users quickly find and access data across the organization through searchable metadata. It ensures data quality consistency and regulatory compliance while reducing unnecessary costs, and enables organizations to trace data lineage to understand where data originated, how it was transformed, and who used it. Users can add rich text table descriptions, assign data stewards for metadata management, and establish clear ownership to improve trust and confidence in data assets. Additionally, Data Catalog integrates with Sensitive Data Protection to automatically identify and tag sensitive data using tag templates, centralizing governance and reducing search friction across the organization.

Entry details

Here you can find all the key information about your data: entry type, platform, system, creation time, last modification time, labels, description, contacts, and fully qualified name. Here is an example for the table users under the dataset bq_data_masking_demo.

image
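These entry details can also be fetched from the command line. A minimal sketch using the Data Catalog lookup command, reusing this tutorial's project, dataset, and table names:

```shell
# Look up the catalog entry for the users table by its linked resource name.
# The resource path reuses this tutorial's project/dataset/table names.
gcloud data-catalog entries lookup \
  '//bigquery.googleapis.com/projects/elt-project-482220/datasets/bq_data_masking_demo/tables/users'
```

The response should include the same fields shown above, such as the entry type, system, timestamps, and fully qualified name.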

Overview lets you provide an additional description for the entry. Usage shows how the entry is being used. Aspects lets you attach structured metadata to an entry.

image

The number of tags shows the tags associated with columns in the table. In our case, this table has 4 tags on columns that mask sensitive data.

image

Data Profile

Dataplex Universal Catalog makes it easier to understand and analyze your data by automatically profiling your BigQuery tables.

Profiling is like getting a detailed health report for your data. It gives you key statistics, such as common values, how the data is spread out (distribution), and how many entries are missing (null counts). This information speeds up your analysis.

Data profiling automatically detects sensitive information and lets you set access control policies. It recommends data quality check rules to ensure your data stays reliable.

Here is an example of Data Profile of users table.

image
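A profile scan like this can also be created and run from the CLI. A sketch, assuming a scan ID of users-profile-scan and the europe-west1 location (both my choices, not taken from the repo):

```shell
# Create a data profile scan over the users table.
gcloud dataplex datascans create data-profile users-profile-scan \
  --location=europe-west1 \
  --data-source-resource="//bigquery.googleapis.com/projects/elt-project-482220/datasets/bq_data_masking_demo/tables/users"

# Trigger an on-demand run; results appear under the scan's job history.
gcloud dataplex datascans run users-profile-scan --location=europe-west1
```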

Data Lineage

Data Lineage in Dataplex provides automated tracking of data flows across your Google Cloud environment. It visualizes how data moves between BigQuery tables, Cloud Storage buckets, and other GCP services, showing upstream sources and downstream consumers. This enables teams to understand data dependencies, assess the impact of schema changes, ensure compliance, and troubleshoot data quality issues by tracing data back to its origin.

20260202_173055

In this example, I created additional tables and joined them to demonstrate data lineage and trace data origins. On the right, you can see the BigQuery queries used to join these tables. Here is an example:

CREATE OR REPLACE TABLE `elt-project-482220.bq_data_masking_demo.users_with_purchases` AS
SELECT 
  u.id, 
  u.first_name, 
  u.last_name, 
  p.item, 
  p.amount, 
  p.purchase_date
FROM `elt-project-482220.bq_data_masking_demo.users` AS u
INNER JOIN `elt-project-482220.bq_data_masking_demo.user_purchases` AS p 
  ON u.id = p.user_id;

You can also view detailed information about each table, filter connections by column name, and use upstream/downstream directions with time range filters.
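The lineage graph shown in the UI is backed by the Data Lineage API, so the same links can be queried programmatically. A hedged sketch using the searchLinks method, assuming europe-west1 as the location (lineage data lives in the region where the jobs ran); the fully qualified name follows the bigquery:project.dataset.table convention:

```shell
ACCESS_TOKEN=$(gcloud auth print-access-token)

# Search for lineage links where the users table is the source.
curl --request POST \
  "https://datalineage.googleapis.com/v1/projects/elt-project-482220/locations/europe-west1:searchLinks" \
  --header "Authorization: Bearer $ACCESS_TOKEN" \
  --header 'Content-Type: application/json' \
  --data '{
    "source": {
      "fullyQualifiedName": "bigquery:elt-project-482220.bq_data_masking_demo.users"
    }
  }'
```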

Glossaries

Use a business glossary to establish a standardized vocabulary for your data assets, which reduces ambiguity and improves data discovery and governance across your organization. By creating a common language for data using Dataplex Universal Catalog business glossary, you can achieve the following:

Define a clear hierarchy of business categories and terms.

Link concepts using synonyms and show relationships between terms.

Search for data resources based on business concepts, not just technical names.

Dataplex Universal Catalog business glossary helps streamline data discovery and reduce ambiguity, resulting in better governance, more accurate analysis, and faster insights.
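Glossaries can be created in the UI or via the Dataplex API, using the same REST style as the aspect-type example later in this tutorial. A sketch; the glossary ID and location are my assumptions:

```shell
ACCESS_TOKEN=$(gcloud auth print-access-token)

# Create a business glossary (the ID and location are illustrative).
curl --request POST \
  "https://dataplex.googleapis.com/v1/projects/elt-project-482220/locations/europe-west1/glossaries?glossaryId=customer-transaction-data-glossary" \
  --header "Authorization: Bearer $ACCESS_TOKEN" \
  --header 'Content-Type: application/json' \
  --data '{
    "displayName": "Customer Transaction Data Glossary",
    "description": "Standardized vocabulary for customer transaction data assets."
  }'
```

Categories and terms are then created as child resources under the glossary.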

Here is an example of 'Customer Transaction Data Glossary'.

image

When you connect terms to the tables, they appear in Related entries.

image

And in Glossary Terms in table Details in Dataplex.

image

Business terms can also be connected under the table schema in Dataplex.

image

Dataplex Lake/Zone Concept

Dataplex organizes data using a hierarchical structure: Lakes contain Zones, and Zones contain Assets (data stored in BigQuery or Cloud Storage).

Lake: A lake is a logical domain that groups related data together based on business function, department, or data domain (e.g., "Customer Data Lake", "Finance Lake", "Marketing Lake"). Lakes provide:

High-level organizational boundaries
Centralized governance and access control
A way to manage related datasets as a unified domain

Zone: Zones are subdivisions within a lake that organize data by processing stage, data quality level, or functional area. Common zone patterns include:

Raw Zone - Ingested data in original format, unprocessed
Curated Zone - Cleaned, validated, and transformed data
Processed/Analytics Zone - Business-ready data for analytics and reporting

Zones can be either:

Raw zones - contain unstructured or semi-structured data (typically Cloud Storage)
Curated zones - contain structured, quality-controlled data (typically BigQuery tables)

Asset: Assets are the actual data resources (BigQuery datasets/tables or Cloud Storage buckets) attached to zones.

Benefits:

Clear data lifecycle management (raw → curated → analytics)
Consistent governance policies applied at lake or zone level
Better data discovery and organization
Simplified access control management
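The Lake → Zone → Asset hierarchy can also be provisioned with gcloud. A sketch assuming a europe-west1 lake; the lake, zone, and asset names are illustrative:

```shell
# Create a lake (its location constrains everything attached under it).
gcloud dataplex lakes create my-data-mesh \
  --location=europe-west1 \
  --display-name="My data mesh"

# Add a curated zone for BigQuery data and a raw zone for Cloud Storage.
gcloud dataplex zones create curated-zone \
  --lake=my-data-mesh --location=europe-west1 \
  --type=CURATED --resource-location-type=SINGLE_REGION \
  --discovery-enabled

gcloud dataplex zones create raw-zone \
  --lake=my-data-mesh --location=europe-west1 \
  --type=RAW --resource-location-type=SINGLE_REGION \
  --discovery-enabled

# Attach assets: a BigQuery dataset and a GCS bucket.
gcloud dataplex assets create bq-asset \
  --lake=my-data-mesh --zone=curated-zone --location=europe-west1 \
  --resource-type=BIGQUERY_DATASET \
  --resource-name=projects/elt-project-482220/datasets/bq_data_masking_demo \
  --discovery-enabled

gcloud dataplex assets create gcs-asset \
  --lake=my-data-mesh --zone=raw-zone --location=europe-west1 \
  --resource-type=STORAGE_BUCKET \
  --resource-name=projects/elt-project-482220/buckets/elt-prod \
  --discovery-enabled
```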

A lake named My data mesh was created here as an example.

image

Then two zones were created: a Curated Zone for the BQ dataset and a Raw Zone for the GCS bucket.

image

There are options to see the details of the Zone and send alerts based on specific rules.

image

Data Quality

Dataplex Universal Catalog lets you define and measure the quality of the data in your BigQuery tables. You can automate the data scanning, validate data against defined rules, and log alerts if your data doesn't meet quality requirements. Auto data quality lets you manage data quality rules and deployments as code, improving the integrity of data production pipelines.

You can use predefined quality rules or build custom rules.

Dataplex Universal Catalog provides monitoring, troubleshooting, and Cloud Logging alerting that's integrated with auto data quality. More about Data Quality scans can be found in the Dataplex documentation.

Creating and using a data quality scan consists of the following steps:

Define data quality rules
Configure rule execution
Analyze data quality scan results
Set up monitoring and alerting
Troubleshoot data quality failures

Here is an example where we scan the users table. First, there is an option to schedule the scans.

image

Then we have to configure validation rules. Use SQL to create your own rules or use built-in rule types.

image

Here is an example of built-in rules.

image

It is possible to export scan results to a BigQuery table, as well as receive notification reports via email.

image

All details and rules are visible inside the scan.

image

Each scan is visible in Dataplex.

image

Each job provides detailed results.

image

Each scan is visible in the Dataplex Data Quality view for the table.

image

Results are published in a BQ table as well.

image

And here is the AutoDQ email report.

image

Custom Aspect Types

To create custom Aspect types, run these commands in your Cloud Shell terminal:

ACCESS_TOKEN=$(gcloud auth print-access-token)

# Replace your-project-id and your-location (e.g. europe-west1)
curl --request POST \
  "https://dataplex.googleapis.com/v1/projects/your-project-id/locations/your-location/aspectTypes?aspectTypeId=data-stewardship-info" \
  --header "Authorization: Bearer $ACCESS_TOKEN" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
    "displayName": "Data Stewardship Information",
    "description": "Metadata related to data ownership and stewardship.",
    "metadataTemplate": {
      "name": "DataStewardshipTemplate",
      "type": "record",
      "recordFields": [
        {
          "name": "data_owner_email",
          "type": "string",
          "annotations": {
            "displayName": "Data Owner Email",
            "description": "Email address of the data owner."
          },
          "index": 1,
          "constraints": { "required": true }
        },
        {
          "name": "steward_team",
          "type": "string",
          "annotations": {
            "displayName": "Stewardship Team",
            "description": "Team responsible for data stewardship."
          },
          "index": 2
        },
        {
          "name": "last_reviewed_date",
          "type": "datetime",
          "annotations": {
            "displayName": "Last Reviewed Date",
            "description": "Date when the data asset was last reviewed for governance."
          },
          "index": 3
        }
      ]
    }
  }' \
  --compressed

This creates a new Aspect type.

Screenshot 2026-02-19 171807

Check Template inside the details.

Screenshot 2026-02-19 171821

Example: Glossaries, Aspects, Terms

Here is an example demonstrating how elements connect within Dataplex.

20260219_213521

Within the Dataplex Business Glossary, we have established the 'Financial Transaction Data Glossary'. This glossary is organized into two primary categories (or sub-glossaries):

'Raw Financial Transaction Data - Source System Layer'
'Transformed Financial Transaction Data - Analytics Layer'
Screenshot (183) Screenshot (184)

The first one focuses on the raw data. You can see all the fields (terms) pulled directly from that source table, like ID1 and String1.

Screenshot (185)

We use dbt for data transformation. Following this process, the second category reflects a new table created from the raw data.

Screenshot (186)

When examining any specific term, you will find detailed information: a description, an overview, attached Aspects (like my 'Data Stewardship Info'), related entries, and related terms. For example, a term in the transformed layer is linked as a 'Related Term' to its corresponding field, such as ID1, in the raw data.

Screenshot (187)

Next, we will see the transformed data table.

Screenshot (188)1

As you can see here, the asset is enriched with critical governance information: the data governor, business owner, data classification, data lifecycle, whether the data is encrypted, whether it contains PII, the data owner's email, and more.

Screenshot (189)1

Additionally, the Glossary Terms panel illustrates the direct semantic connections, showing exactly which business terms are mapped to this particular table.

Screenshot (190)1

Example: Lakes, Zones

20260219_223005

Under the Manage tab in Dataplex, I have configured one Data Lake containing two distinct assets.

Screenshot (191)1

This Data Lake is segmented into two Zones:

bq-data-prod: A curated zone designed for a BigQuery dataset.
raw-data-gcs: A raw zone designated for a Cloud Storage (GCS) bucket.
Screenshot (192)1

The details view aggregates metrics from both zones, such as the total discovered file size.

Screenshot (193)1

Within each zone, specific assets are registered. For example, the bq-data-prod curated zone contains the BigQuery dataset named fin-data-prod.

Screenshot (194)1

This view also displays the discovery configuration and status, showing that 10,000 rows have been discovered in this asset.

Screenshot (195)1

Inside the raw-data-gcs zone we have the GCS bucket elt-prod.

Screenshot (196)1

The discovered data size for this bucket is 718.72 KiB.

Screenshot (197)1

❗Important info

Critical Location Requirement: Location configuration in Dataplex follows a strict hierarchy. When you create a Data Lake, you must specify its location (e.g., europe-west1), and this choice determines what data you can include. Zones within that lake can be configured as either regional (e.g., europe-west1) or multi-regional (e.g., EU), but their underlying data assets must align with the zone's location constraints.

Important: If your lake is in europe-west1 and you create a zone in that same region, you cannot attach assets from the broader EU multi-region. Attempting to add a BigQuery dataset with location EU to a europe-west1 zone will fail with the error: BigQuery dataset location EU is invalid, allowed regions are {EUROPE-WEST1}. The zone's initial location setting strictly defines which regional or multi-regional assets can be discovered and managed.

Always pay attention to location compatibility: you cannot attach an aspect created in europe-west1 (Belgium) to a BigQuery table created in EU. You'll need to recreate the aspect in the EU multi-region to match your table's location. The same applies to glossaries and terms: if they are created in europe-west1, they won't be visible for a table located in the EU multi-region.
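Before attaching assets, aspects, or terms, it's worth verifying the actual locations involved. A quick sketch using this tutorial's dataset and bucket names:

```shell
# BigQuery dataset location (e.g. EU vs europe-west1).
bq show --format=prettyjson elt-project-482220:bq_data_masking_demo | grep '"location"'

# GCS bucket location.
gcloud storage buckets describe gs://elt-prod --format="value(location)"
```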

Screenshot 2026-02-20 135615
