Redshift batch source

This page describes how to load data from an Amazon Redshift instance into Google Cloud with Cloud Data Fusion. The Redshift source connector lets you sync tables from your Redshift dataset to your destination, such as BigQuery. The connector also lets you create a configurable SQL query.

Before you begin

  • Cloud Data Fusion versions 6.9.0 and later support the Redshift source.
  • When you configure the Redshift source connector, you can select an existing, reusable connection, or create a new, one-time connection. For more information, see Manage connections. When you reuse a connection, note the following:

    • You don't have to provide credentials.
    • The existing connection provides the schema and table name information that's used to generate the import query.

Configure the plugin

  1. Go to the Cloud Data Fusion web interface and click Studio.

  2. Check that Data Pipeline - Batch is selected (not Realtime).

  3. In the Source menu, click Redshift. The Redshift node appears in your pipeline. If you don't see the Redshift source on the Studio page, deploy the Redshift source connector from the Cloud Data Fusion Hub.

  4. To configure the source, go to the Redshift node and click Properties.

  5. Enter the following properties. For a complete list, see Properties.

    1. Enter a label for the Redshift node, for example, Redshift tables.
    2. Enter the connection details. You can set up a new, one-time connection or use an existing, reusable connection.

      New connection

      To add a one-time connection to Redshift, follow these steps:

      1. Keep Use connection turned off.
      2. In the JDBC driver name field, enter the name of the driver. Redshift supports two types of JDBC drivers: CData and Amazon. For more information, see Upload a JDBC driver.
      3. In the Host field, enter the endpoint of the Redshift cluster, for example, cdf-redshift-new.example-endpoint.eu-west-1.redshift.amazonaws.com.
      4. Optional: In the Port field, enter a database port number, for example, 5439.
      5. If your Redshift database requires authentication, do the following:

        1. In the Username field, enter the username for connecting to the database.
        2. In the Password field, enter the password for the database.
        3. Optional: In the Arguments field, enter key-value arguments. To use the CData driver, provide the connection arguments, such as RTK or OEMKey, if applicable.
        4. In the Name field, enter a name, for example, SN-PC-Source-01-01-2024.
        5. Enter the target database name in the Database field, for example, datafusiondb.

      Reusable connection

      To reuse an existing connection, follow these steps:

      1. Turn on Use connection.
      2. Click Browse connections.
      3. Click the connection name.

      4. Optional: If a connection doesn't exist and you want to create a new reusable connection, click Add connection and refer to the steps under New connection on this page.

    3. In the Import query field, enter a query using the schema and table names from your Redshift source, for example, SELECT * FROM "public"."users". A sketch for checking a query before you use it follows this procedure.

    4. Optional: Enter Advanced properties, such as a bounding query or number of splits. For all property descriptions, see Properties.

  6. Optional: Click Validate and address any errors found.

  7. Click Close. Properties are saved and you can continue to build your data pipeline in the Cloud Data Fusion web interface.
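
Before you paste an import query into the plugin, you can run the same statement against Redshift from any SQL client to confirm that it returns the rows you expect. The following query is only a sketch: the column names and the filter are hypothetical, and the LIMIT clause keeps the check inexpensive.

    SELECT id, name, signup_date
    FROM "public"."users"
    WHERE signup_date >= '2024-01-01'
    LIMIT 100

After the query returns the expected results, remove the LIMIT clause and use the statement as the Import query value.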

Properties

Property | Supports macros for automation | Required property | Description
Label | No | Yes | The name of the node in your data pipeline.
Use connection | No | No | Browse for a connection to the source. If Use connection is turned on, you don't need to provide credentials.
Connection | Yes | Yes | Name of the connection to use. If Use connection is selected, this field appears. Database and table information is provided by the connection.
JDBC driver name | Yes | Yes | Name of the JDBC driver to use. If Use connection isn't selected, this field appears.
Host | Yes | Yes | The endpoint of the Amazon Redshift cluster. If Use connection isn't selected, this field appears.
Port | Yes | No | Port that Redshift is running on. If Use connection isn't selected, this field appears.
Username | Yes | No | User identity for connecting to the specified database. If Use connection isn't selected, this field appears.
Password | Yes | No | Password to use to connect to the specified database. If Use connection isn't selected, this field appears.
Connection arguments | Yes | No | A list of arbitrary string key-value pairs to pass to the JDBC driver as connection arguments, for drivers that might need additional configuration. If Use connection isn't selected, this field appears.
Reference name | No | Yes | Uniquely identifies this source for lineage, annotating metadata, and other services.
Database | Yes | Yes | Redshift database name. To select data, click Browse database.
Import query | Yes | Yes | The SELECT query to use to import data from the specified table.
Bounding query | Yes | No | SQL query that returns the min and max values from the splitBy field, for example, SELECT MIN(id), MAX(id) FROM table. Not required if numSplits is set to one.
Split column | Yes | No | Field name that's used to generate splits. Not required if numSplits is set to one.
Number of splits | Yes | No | Number of splits to generate.
Size | Yes | No | The number of rows to fetch at a time per split. A larger fetch size can result in a faster import, at the cost of higher memory usage. If unspecified, the default is 1000.
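
The split-related properties work together: the Split column names a column, the Bounding query returns that column's minimum and maximum values, and the Import query defines the rows to read. As a sketch, for a hypothetical "public"."orders" table with a numeric order_id column, the matching values could be:

    -- Import query
    SELECT * FROM "public"."orders"

    -- Bounding query (returns the MIN and MAX of the split column, order_id)
    SELECT MIN(order_id), MAX(order_id) FROM "public"."orders"

The worked example in the Best practices section on this page shows how to choose the number of splits and the fetch size for a table of a given size.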

Data type mappings

The following table lists Redshift data types and the corresponding CDAP types:

Redshift data type | CDAP schema data type
bigint | long
boolean | boolean
character | string
character varying | string
date | date
double precision | double
geometry | bytes
hllsketch | string
integer | int
json | string
numeric(precision, scale) / decimal(precision, scale) | decimal
numeric with precision 0 | string
real | float
smallint | int
super | string
text | string
time [ (p) ] without time zone | time
time [ (p) ] with time zone | string
timestamp [ (p) ] without time zone | timestamp
timestamp [ (p) ] with time zone | timestamp
varbyte | bytes
xml | string
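
For example, a hypothetical Redshift table like the following would be read with the CDAP types shown in the comments, based on the mappings above:

    CREATE TABLE public.users (
        id          BIGINT,                  -- CDAP type: long
        name        CHARACTER VARYING(256),  -- CDAP type: string
        is_active   BOOLEAN,                 -- CDAP type: boolean
        balance     NUMERIC(18, 2),          -- CDAP type: decimal
        score       DOUBLE PRECISION,        -- CDAP type: double
        signup_date DATE,                    -- CDAP type: date
        created_at  TIMESTAMP                -- CDAP type: timestamp
    );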

Best practices

The following best practices apply when you connect to a Redshift cluster from Google Cloud.

Use IP address allowlists

To prevent access from unauthorized sources and restrict access to specific IP addresses, enable access controls on the Redshift cluster.

If you use Redshift access controls, follow these steps to access the cluster from Cloud Data Fusion:

  1. Obtain the external IP addresses of the services or machines on Google Cloud that must connect to the Redshift cluster, such as the proxy server IP (see Viewing IP addresses). For Dataproc clusters, obtain the IP addresses of all master and worker nodes.
  2. Add those IP addresses to an allowlist by creating inbound rules for them in the Redshift cluster's security groups.

  3. Add the connection properties in Wrangler and test them:

    1. Open the Cloud Data Fusion instance in the web interface.
    2. Click Wrangler > Add connection and create a new connection for Redshift.
    3. Enter all connection properties.
    4. Click Test connection and resolve any issues.
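
For an additional check from an allowlisted Google Cloud machine, you can connect to the cluster with any SQL client and run a lightweight query. The statement below is a sketch that assumes nothing about your schema:

    SELECT current_database(), current_user, version()

If this query succeeds from the allowlisted address but the Wrangler test fails, the problem is more likely in the connection properties than in the security group rules.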

To create multiple splits, use bounding queries

To extract data from Redshift in multiple splits and distribute the load uniformly across the nodes of a multi-node cluster, configure a bounding query in the Redshift source connector properties.

  1. In your Cloud Data Fusion pipeline on the Studio page, go to the Redshift node and click Properties.
  2. In the Advanced properties, specify the following:

    1. Enter the number of splits to create.
    2. Enter the fetch size for each split.
    3. Enter a bounding query to apply to the multi-node Redshift cluster.
    4. Enter the Split column field name.

For example, assume you have the following use case:

  • You have a table that contains 10 million records.
  • It has a unique ID column called id.
  • The Redshift cluster has 4 nodes.
  • Objective: To take advantage of the cluster's potential, you plan to generate multiple splits. To achieve this, use the following property configurations:

    • In the Bounding query field, enter the following query:

       SELECT MIN(id), MAX(id) FROM tableName 
      

      In this query, id is the name of the column where the splits are applied.

    • In the Split column field, enter the column name, id.

    • Enter the number of splits and the fetch size. These properties are interconnected: you can calculate the number of splits based on a fetch size, or the other way around. For this example, enter the following values.

    • In the Number of splits field, enter 40. In this example, where the table has ten million records, creating 40 splits results in each split containing 250,000 records.

    • In the Fetch size field, enter 250,000.
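
Conceptually, the connector uses the MIN and MAX values returned by the bounding query to divide the id range into 40 contiguous sub-ranges, one per split. The exact per-split statements are generated by the plugin and can differ; assuming id values run from 1 to 10,000,000, each split reads roughly the equivalent of the following:

    SELECT * FROM tableName WHERE id >= 1      AND id < 250001   -- split 1
    SELECT * FROM tableName WHERE id >= 250001 AND id < 500001   -- split 2
    -- ...and so on through split 40, each covering about 250,000 rows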
