
Google Data Fusion

Introduction

Google Data Fusion is a fully managed, cloud-native data integration service that enables users to efficiently build and manage ETL/ELT data pipelines. It is designed to streamline data engineering tasks for enterprise users and is built on top of the open-source project CDAP.

Key features and benefits of Google Data Fusion include a visual, code-free pipeline designer, a large library of prebuilt connectors and transformations, and native integration with Google Cloud services such as Cloud Storage, BigQuery, and Dataproc.

Overall, Google Data Fusion simplifies data engineering tasks and enables users to focus on data analytics and deriving insights for better customer service and operational efficiency.

Use Cases

Google Cloud Data Fusion is a powerful tool for building and managing data pipelines, and its use cases are diverse and expanding. Here are some of the most common and impactful ways organizations are using it:

  • Data Integration and ETL: consolidating data from on-premises and cloud sources into a warehouse such as BigQuery for analytics.

  • Data Migration and Modernization: moving data out of legacy systems into Google Cloud as part of a modernization effort.

  • Real-time Data Processing and Analytics: building streaming pipelines that process events as they arrive.

These are just a few examples, and the possibilities for using Data Fusion are vast. Its flexibility, scalability, and ease of use make it a valuable tool for organizations looking to unlock the potential of their data.

Pricing

Google Data Fusion is offered in three pricing tiers: Developer, Basic, and Enterprise. Pricing is charged per instance per hour, and the Dataproc clusters that execute your pipelines are billed separately.

Comparing the tiers at a high level: the Developer tier is intended for low-cost experimentation and pipeline development, the Basic tier covers standard batch pipeline development and execution, and the Enterprise tier adds production-oriented capabilities such as streaming pipelines and higher availability.
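
The tier is selected when you create an instance. Below is a minimal sketch; the --edition flag and its accepted values are assumptions about the gcloud CLI, so confirm them with gcloud data-fusion instances create --help before running:

    # Create a Basic-tier instance; --edition (and its accepted values) is an
    # assumption -- verify against `gcloud data-fusion instances create --help`.
    gcloud data-fusion instances create [INSTANCE_NAME] \
      --region=[REGION] \
      --edition=basic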

How to set up Data Fusion:

Prerequisites:

  1. Google Cloud Platform Account: Ensure that you have a Google Cloud Platform (GCP) account. If you don’t have one, you can sign up on the Google Cloud website.
  2. Enable the Cloud Data Fusion API: In the GCP Console, navigate to APIs & Services > Dashboard. Search for “Cloud Data Fusion API” and enable it (a command-line alternative is sketched after this list).
  3. Install and Configure the Google Cloud SDK: Install the Google Cloud SDK on your local machine. After installation, run gcloud init to set up your credentials and project.
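
As a command-line alternative to the console steps above, the API can also be enabled with the SDK once it is installed; a minimal sketch (datafusion.googleapis.com is the standard service name for the Cloud Data Fusion API):

    # Initialize the SDK: sign in and pick the project to work in
    gcloud init

    # Enable the Cloud Data Fusion API for the active project
    gcloud services enable datafusion.googleapis.com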

Steps to Set Up Google Cloud Data Fusion:

  1. Create a Cloud Storage Bucket: Create a Cloud Storage bucket to store the artifacts and metadata required by Cloud Data Fusion. Replace [BUCKET_NAME] with your desired bucket name.

    gsutil mb -l [REGION] gs://[BUCKET_NAME]
    
    
  2. Create a Cloud Data Fusion Instance: Use the following command to create a Cloud Data Fusion instance. Replace [INSTANCE_NAME] with your desired instance name, and [REGION] with the region where you want to deploy the instance.

    gcloud data-fusion instances create [INSTANCE_NAME] \
      --region=[REGION] \
      --zone=[ZONE] \
      --network=[NETWORK_NAME] \
      --subnet=[SUBNET_NAME] \
      --bucket-uri=gs://[BUCKET_NAME]/[DIRECTORY]
    
    
    • --zone: Specify the zone for the instance.
    • --network: Specify the VPC network name.
    • --subnet: Specify the subnet within the network.
    • --bucket-uri: Specify the Cloud Storage bucket URI.
  3. Access Cloud Data Fusion UI: After the instance is created, you can access the Cloud Data Fusion UI using the generated endpoint. Navigate to the URL displayed in the command output (a sketch for extracting just the URL follows this list).

    gcloud data-fusion instances describe [INSTANCE_NAME] --region=[REGION]
    
    
  4. Connect to the Cloud Data Fusion UI: Open the provided URL in a web browser to access the Cloud Data Fusion UI. Log in using your Google Cloud credentials.

  5. Explore and Create Pipelines: Once you’re in the Cloud Data Fusion UI, you can start exploring and creating ETL pipelines using the visual interface.

Remember to replace placeholders like [REGION], [BUCKET_NAME], and [INSTANCE_NAME] with values appropriate for your project.
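
As a small convenience, the UI URL from step 3 can be extracted directly with gcloud's --format flag; this sketch assumes the instance resource exposes the UI address in a serviceEndpoint field, so check the field name against your own describe output:

    # Print only the Data Fusion UI endpoint; serviceEndpoint is an assumed
    # field name -- confirm it in the full `describe` output.
    gcloud data-fusion instances describe [INSTANCE_NAME] \
      --region=[REGION] \
      --format="value(serviceEndpoint)"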

How to set up a simple ETL with Google Data Fusion (with the GCP console):

In this demo we’ll set up a simple ETL pipeline to import, transform, and load a CSV file into a Google BigQuery table.

Step 1: Go to the BigQuery console page and create a dataset and table.
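
If you prefer the command line over the console, the same dataset and table can be created with the bq tool; the names (demo_dataset, demo_table) and the three-column schema below are hypothetical:

    # Create a dataset and a destination table; names and schema are hypothetical
    bq mk --dataset [PROJECT_ID]:demo_dataset
    bq mk --table [PROJECT_ID]:demo_dataset.demo_table name:STRING,age:INTEGER,city:STRING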

Step 2: Go to your Google Data Fusion instance

Step 3: Prepare your CSV file on GCS
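
A small sample file can be created locally and uploaded with gsutil; the file name, contents, and bucket path are hypothetical and match the table schema sketched in Step 1:

    # Write a tiny sample CSV (hypothetical data) and upload it to Cloud Storage
    printf 'name,age,city\nAlice,30,Hanoi\nBob,25,Da Nang\n' > customers.csv
    gsutil cp customers.csv gs://[BUCKET_NAME]/input/customers.csv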

Step 4: Start designing your first pipeline

Step 5: Deploy and Run your pipeline

And that’s it, you’re done. Congratulations on your first data pipeline with Google Data Fusion.