This page describes how to load data from an Amazon Redshift instance into Google Cloud with Cloud Data Fusion. The Redshift source connector lets you sync tables from your Redshift dataset to a destination such as BigQuery. The connector also lets you configure the SQL query that's used to import the data.
Before you begin
- Cloud Data Fusion versions 6.9.0 and later support the Redshift source.
- When you configure the Redshift source connector, you can select an existing, reusable connection, or create a new, one-time connection. For more information, see Manage connections. When you reuse a connection, note the following:
  - You don't have to provide credentials.
  - The existing connection provides the schema and table name information that's used to generate the import query.
Configure the plugin
- Go to the Cloud Data Fusion web interface and click Studio.
- Check that Data Pipeline - Batch is selected (not Realtime).
- In the Source menu, click Redshift. The Redshift node appears in your pipeline. If you don't see the Redshift source on the Studio page, deploy the Redshift source connector from the Cloud Data Fusion Hub.
- To configure the source, go to the Redshift node and click Properties.
- Enter the following properties. For a complete list, see Properties.
  - Enter a label for the Redshift node, for example, `Redshift tables`.
  - Enter the connection details. You can set up a new, one-time connection, or an existing, reusable connection.
New connection

To add a one-time connection to Redshift, follow these steps:
- Keep Use connection turned off.
- In the JDBC driver name field, enter the name of the driver. Redshift supports two types of JDBC drivers: CData and Amazon. For more information, see Upload a JDBC driver.
- In the Host field, enter the endpoint of the Redshift cluster, for example, `cdf-redshift-new.example-endpoint.eu-west-1.redshift.amazonaws.com`.
- Optional: In the Port field, enter a database port number, for example, `5439`.
- If your Redshift database requires authentication, do the following:
  - In the Username field, enter the username for the database.
  - In the Password field, enter the password for the database.
- Optional: In the Arguments field, enter key-value arguments. To use the CData driver, provide the connection arguments, such as RTK or OEMKey, if applicable.
- In the Name field, enter a name, for example, `SN-PC-Source-01-01-2024`.
- In the Database field, enter the target database name, for example, `datafusiondb`.
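With the Amazon driver, the host, port, and database values that you enter are typically assembled into a JDBC connection URL of the following form. This is shown only as an illustration, using the example values from the preceding steps; your endpoint will differ:

```
jdbc:redshift://cdf-redshift-new.example-endpoint.eu-west-1.redshift.amazonaws.com:5439/datafusiondb
```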
Reusable connection

To reuse an existing connection, follow these steps:
- Turn on Use connection.
- Click Browse connections.
- Click the connection name.
- Optional: If a connection doesn't exist and you want to create a new reusable connection, click Add connection and refer to the steps in the New connection tab on this page.
- In the Import query field, enter a query using the schema and table names from your Redshift source, for example, `Select * from "public"."users"` (see the example query after these steps).
- Optional: Enter Advanced properties, such as a bounding query or number of splits. For all property descriptions, see Properties.
- Optional: Click Validate and address any errors found.
- Click Close. Properties are saved and you can continue to build your data pipeline in the Cloud Data Fusion web interface.
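The import query can be any SELECT statement that your Redshift cluster can run, so you can project columns or filter rows instead of importing an entire table. The following is a minimal sketch that assumes a hypothetical `public.users` table with `id`, `email`, and `signup_date` columns:

```sql
-- A minimal sketch: import only selected columns and recent rows from an
-- assumed "public"."users" table. Table and column names are illustrative.
SELECT id, email, signup_date
FROM "public"."users"
WHERE signup_date >= '2024-01-01';
```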
Properties
| Property | Supports macros for automation | Required property | Description |
| --- | --- | --- | --- |
| Label | No | Yes | The name of the node in your data pipeline. |
| Use connection | No | No | Browse for a connection to the source. If Use connection is turned on, you don't need to provide credentials. |
| Connection | Yes | Yes | Name of the connection to use. This field appears if Use connection is selected. Database and table information is provided by the connection. |
| JDBC driver name | Yes | Yes | Name of the JDBC driver to use. This field appears if Use connection isn't selected. |
| Host | Yes | Yes | The endpoint of the Amazon Redshift cluster. This field appears if Use connection isn't selected. |
| Port | Yes | No | Port that Redshift is running on. This field appears if Use connection isn't selected. |
| Username | Yes | No | User identity for connecting to the specified database. This field appears if Use connection isn't selected. |
| Password | Yes | No | Password to use to connect to the specified database. This field appears if Use connection isn't selected. |
| Connection arguments | Yes | No | A list of arbitrary string key-value pairs passed to the JDBC driver as connection arguments, for drivers that might need additional configuration. This field appears if Use connection isn't selected. |
| Reference name | No | Yes | Uniquely identifies this source for lineage, annotating metadata, and other services. |
| Database | Yes | Yes | Redshift database name. To select data, click Browse database. |
| Import query | Yes | Yes | The `SELECT` query to use to import data from the specified table. |
| Bounding query | Yes | No | SQL query that returns the min and max values of the `splitBy` field, for example, `SELECT MIN(id), MAX(id) FROM table`. Not required if `numSplits` is set to one. |
| Split column | Yes | No | Field name that's used to generate splits. Not required if `numSplits` is set to one. |
| Number of splits | Yes | No | Number of splits to generate. |
| Fetch size | Yes | No | The number of rows to fetch at a time per split. A larger fetch size can result in faster imports, with the tradeoff of higher memory usage. If unspecified, the default is `1000`. |
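For properties that support macros, you can defer a value until runtime by using Cloud Data Fusion macro syntax, `${...}`. As a hedged sketch, an import query parameterized with hypothetical runtime arguments named `source_schema` and `source_table` might look like this:

```sql
-- A sketch of a macro-parameterized import query. source_schema and
-- source_table are hypothetical runtime argument names supplied when the
-- pipeline runs; the placeholders are resolved before the query is sent to Redshift.
SELECT * FROM "${source_schema}"."${source_table}"
```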
Data type mappings
The following table is a list of Redshift data types with corresponding CDAP types:
| Redshift data type | CDAP schema data type |
| --- | --- |
| bigint | long |
| boolean | boolean |
| character | string |
| character varying | string |
| date | date |
| double precision | double |
| geometry | bytes |
| hllsketch | string |
| integer | int |
| json | string |
| numeric(precision, scale) / decimal(precision, scale) | decimal |
| numeric with precision 0 | string |
| real | float |
| smallint | int |
| super | string |
| text | string |
| time [ (p) ] without time zone | time |
| time [ (p) ] with time zone | string |
| timestamp [ (p) ] without time zone | timestamp |
| timestamp [ (p) ] with time zone | timestamp |
| varbyte | byte |
| xml | string |
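As an illustration, the following hypothetical Redshift table definition shows, in comments, the CDAP schema type that each column maps to according to the preceding table (the table and column names are assumptions):

```sql
-- A hypothetical Redshift table; the comments show the CDAP schema type each
-- column maps to, per the mapping table above.
CREATE TABLE public.orders (
  order_id    bigint,                      -- long
  customer_id integer,                     -- int
  amount      numeric(10, 2),              -- decimal
  is_paid     boolean,                     -- boolean
  notes       character varying(256),      -- string
  created_at  timestamp without time zone  -- timestamp
);
```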
Best practices
The following best practices apply when you connect to a Redshift cluster from Google Cloud.
Use IP address allowlists
To prevent access from unauthorized sources and restrict access to specific IP addresses, enable access controls on the Redshift cluster.
If you use Redshift access controls, follow these steps to access the cluster from Cloud Data Fusion:
- Obtain the external IP addresses of the services or machines on Google Cloud that must connect to the Redshift cluster, such as the Proxy Server IP (see Viewing IP addresses). For Dataproc clusters, obtain the IP addresses of all master and worker nodes.
- Add the IP addresses to an allowlist in the security groups by creating inbound rules for the Google Cloud machine IP addresses.
- Add the connection properties in Wrangler and test them:
  - Open the Cloud Data Fusion instance in the web interface.
  - Click Wrangler > Add connection and create the new connection for Redshift.
  - Enter all connection properties.
  - Click Test connection and resolve any issues.
To create multiple splits, use bounding queries
To create multiple splits and distribute the extraction load uniformly across the nodes of a multi-node Redshift cluster, configure a bounding query in the Redshift source connector properties.
- In your Cloud Data Fusion pipeline on the Studio page, go to the Redshift node and click Properties.
- In the Advanced properties, specify the following:
  - Enter the number of splits to create.
  - Enter the fetch size for each split.
  - Enter a bounding query to apply to the multi-node Redshift cluster.
  - Enter the Split column field name.
For example, assume you have the following use case:
- You have a table that contains 10 million records.
- The table has a unique ID column called `id`.
- The Redshift cluster has 4 nodes.
- Objective: to take advantage of the cluster's potential, you plan to generate multiple splits. To achieve this, use the following property configurations:
  - In the Bounding query field, enter the following query: `SELECT MIN(id), MAX(id) FROM tableName`. In this query, `id` is the name of the column where the splits are applied.
  - In the Split column field, enter the column name, `id`.
  - Enter the number of splits and fetch size. These properties are interconnected, letting you calculate splits based on a fetch size, or the other way around. For this example, enter the following:
    - In the Number of splits field, enter `40`. Because the table has ten million records, creating 40 splits results in each split containing 250,000 records.
    - In the Fetch size field, enter `250,000`.
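As a sanity check for these values, you can derive the split boundaries and the number of splits implied by a given fetch size directly in Redshift. The following is a sketch that uses the illustrative `tableName` and `id` names from the example above:

```sql
-- A sketch: compute the bounding values and the number of splits implied by a
-- 250,000-row fetch size. tableName and id are the illustrative names from the
-- example above. 10,000,000 rows / 250,000 rows per split = 40 splits.
SELECT
  MIN(id)                   AS min_id,
  MAX(id)                   AS max_id,
  CEIL(COUNT(*) / 250000.0) AS suggested_number_of_splits
FROM tableName;
```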
What's next
- Look through the Cloud Data Fusion plugins.