Database batch source

This page provides guidance about configuring the Database batch source plugin in Cloud Data Fusion.

You can use this generic source whenever you need to read from a database. For example, you might use it to create daily snapshots of a generic database table and write the output to BigQuery.

Configure the plugin

  1. Go to the Cloud Data Fusion web interface and click Studio.
  2. Check that Data Pipeline - Batch is selected (not Realtime).
  3. In the Source menu, click Database.
  4. To configure the plugin, hold the pointer over the plugin node and click Properties.
  5. Enter the following properties. For more information about each property, see Properties.

    1. Enter a Label for the database node, such as database tables.
    2. Enter the connection details. You can set up a new, one-time connection, or use an existing, reusable connection.

      New connection

      To add a one-time connection to the database, follow these steps:

      1. Keep Use connection turned off.
      2. Enter the following connection properties:
        1. In the JDBC driver name field, enter the name of the driver, if there is one. Otherwise, keep No JDBC plugins selected.
        2. In the Connection string field, enter the JDBC connection string, including the database name (see the example after these steps).
        3. Optional: If your database requires authentication, enter the database username and password.
        4. Optional: If your JDBC driver needs additional configuration, in the Connection arguments field, enter key-value arguments for the connection.
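
      For example, a one-time connection to a MySQL database might use values like the following. The driver name, host, database name, user, and arguments here are hypothetical, and the exact connection string format depends on your JDBC driver, so check your driver's documentation:

        JDBC driver name: mysql
        Connection string: jdbc:mysql://mysql.example.com:3306/employees
        Username: dbreader
        Connection arguments: useSSL=true;connectTimeout=10000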

      Reusable connection

      To reuse an existing connection, follow these steps:

      1. Turn on Use connection.
      2. Click Browse connections.
      3. Select the connection.

      4. Optional: If a connection doesn't exist and you want to create a new, reusable connection, click Add connection and follow the steps in the New connection section on this page.

    3. Optional: To test connectivity and populate the output schema, click Get schema. This schema is used in place of the schema returned by the query. It must match the schema that the query returns, except that you can mark fields as nullable and it can contain a subset of the fields.

    4. In the Import query field, enter a SELECT query for importing data from the specified table, for example, select id, name, email, phone from users;. If you generate more than one split, the query must contain the $CONDITIONS string (see the example after this procedure).

    5. Optional: In the Bounding query field, enter a query that returns the minimum and maximum values of the split-by field, for example, SELECT MIN(id), MAX(id) FROM table.

    6. Optional: In the Split-by field name field, enter the name of the field used to generate splits.

    7. Optional: In the Number of splits to generate field, enter a number, for example, 2.

    8. Optional: In the Fetch size field, enter a number, for example, 1000.

    9. Optional: Enter Advanced properties, such as changing the case of the column names.

  6. Optional: Click Validate and address any errors found.

  7. Click Close. Properties are saved and you can continue to build your data pipeline in the Cloud Data Fusion Studio.
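
As a sketch of how the Import query, Bounding query, and Split-by field name properties work together, the following queries assume a hypothetical users table with a numeric id column:

  -- Import query: $CONDITIONS is replaced at runtime with range
  -- conditions on the split-by field.
  SELECT id, name, email, phone FROM users WHERE $CONDITIONS;

  -- Bounding query: returns the minimum and maximum values of the
  -- split-by field (id), which bound the generated splits.
  SELECT MIN(id), MAX(id) FROM users;

With Split-by field name set to id and Number of splits to generate set to 2, each split reads a different range of id values.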

Properties

Label
Macro enabled: No. Required: Yes.
The name of the node in your data pipeline.

Use connection
Macro enabled: No. Required: No.
Browse for a connection to the source. If Use connection is selected, you don't need to provide credentials.

Connection
Macro enabled: Yes. Required: Yes.
Name of the connection to use. This field appears only when Use connection is selected. Database and table information is provided by the connection.

JDBC driver name
Macro enabled: Yes. Required: Yes.
The JDBC driver to use. Default is No JDBC plugins.

Connection string
Macro enabled: Yes. Required: Yes.
JDBC connection string, including the database name.

Username
Macro enabled: Yes. Required: No.
User identity for connecting to the database. Required for databases that need authentication; optional for databases that don't.

Password
Macro enabled: Yes. Required: No.
Password to use to connect to the specified database. Required for databases that need authentication; optional for databases that don't.

Connection arguments
Macro enabled: Yes. Required: No.
A list of arbitrary string tag/value pairs as connection arguments. For JDBC drivers that need additional configuration, these arguments are passed to the driver in the format key1=value1;key2=value2.

Reference name
Macro enabled: No. Required: Yes.
A name that uniquely identifies this source for lineage and annotating metadata. It's usually the name of the table or view.

Get schema
Macro enabled: No. Required: No.
The schema of records output by the source. It's used in place of the schema that's returned by the query. It must match the schema that the query returns, except that it lets you mark fields as nullable and it can contain a subset of the fields.

Import query
Macro enabled: Yes. Required: Yes.
The SELECT query for importing data from the specified table. You can specify an arbitrary number of columns to import, or import all columns using *. The query should contain the $CONDITIONS string, for example, SELECT * FROM table WHERE $CONDITIONS. At runtime, the $CONDITIONS string is replaced with limits on the split-by field derived from the Bounding query (see the example after this table). The $CONDITIONS string isn't required if the Number of splits to generate field is set to 1.

Bounding query
Macro enabled: Yes. Required: No.
The bounding query that returns the minimum and maximum values of the split column, for example, SELECT MIN(id), MAX(id) FROM table. Not required if the number of splits to generate is set to 1.

Split-by field name
Macro enabled: Yes. Required: No.
The field name used to generate splits. Not required if the number of splits to generate is set to 1.

Number of splits to generate
Macro enabled: Yes. Required: No.
Number of splits to generate. Default is 1.

Fetch size
Macro enabled: Yes. Required: No.
The number of rows to fetch at a time per split. A larger fetch size can make the import faster, at the cost of higher memory usage. Default is 1000.

Enable auto-commit
Macro enabled: No. Required: No.
Whether to enable auto-commit for queries run by this source. Keep this set to False unless you use a JDBC driver that causes an error when the commit operation is run. Default is False.

Column name case
Macro enabled: Yes. Required: No.
Sets the letter case of the column names returned from the query. You can choose uppercase or lowercase. By default, or for any other input, the column names aren't modified, and the letter case returned from the database is used. Setting this property makes column name capitalization predictable across different databases, but it might cause column name conflicts if multiple column names differ only in case. Default is No change.

Transaction isolation level
Macro enabled: Yes. Required: No.
The transaction isolation level for queries run by this source. For more information, see setTransactionIsolation(). The Phoenix JDBC driver throws an exception if the Phoenix database doesn't have transactions enabled and this field is set to a level that requires them. For drivers like that, set this field to TRANSACTION_NONE. Default is TRANSACTION_SERIALIZABLE.

Pattern to replace
Macro enabled: No. Required: No.
The pattern to replace in the field names of the table, typically used together with the Replace with property. If the Replace with property isn't set, the pattern is removed from the field names.

Replace with
Macro enabled: No. Required: No.
The string that replaces the matched pattern in the field names. You must also configure the Pattern to replace field.

Output schema
Macro enabled: No. Required: No.
Specifies the schema that's output. Only columns defined in the schema are included in the output record.

Best practices

Check whether a more specific plugin is available for your database. For example, if you have an Oracle database source, use the Oracle database batch source plugin instead, because it's designed to work with the Oracle schema.
