Use MLTransform to scale data

Scaling data is an important preprocessing step for training machine learning (ML) models, because it helps to ensure that all features have a similar weight or influence on the model. The following are benefits of scaling data:

Improved convergence of gradient descent algorithms: Many machine learning algorithms, such as linear regression and neural networks, use gradient descent to optimize their parameters. Gradient descent iteratively moves the parameters of the model in the direction that reduces the loss function. If the features aren't scaled, features with larger ranges can have a much larger impact on the gradient, making it difficult for the model to converge. Scaling the features helps to ensure that all features contribute comparably to the gradient, which can lead to faster and more stable convergence.
Uniformity in features: If one feature has a much larger range than the other features, it can dominate the model and make it difficult for the model to learn from the other features. This lack of uniformity can cause poor performance and biased predictions. Scaling the features brings all of the features into a similar range.

To scale your dataset using Apache Beam, use MLTransform with one of the following transforms:

ScaleTo01: Calculates the minimum and maximum of an entire dataset, and then scales the dataset between 0 and 1 based on the minimum and maximum values.
ScaleToZScore: Calculates the mean and variance of an entire dataset, and then scales the dataset so that it has a mean of 0 and a variance of 1.
ScaleByMinMax: Scales the data in a dataset into the range [min_value, max_value], which you specify as input parameters.
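The three calculations can be sketched in plain Python, outside of Apache Beam. This is only an illustration of the arithmetic, not the MLTransform implementation: the function names and the `sample` list are hypothetical, the z-score version assumes the population standard deviation, and all three assume that the values in a column are not all equal.

```python
import statistics


def scale_to_01(values):
    """Map the smallest value to 0 and the largest value to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def scale_to_z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def scale_by_min_max(values, min_value, max_value):
    """Map the smallest value to min_value and the largest to max_value."""
    return [
        min_value + unit * (max_value - min_value)
        for unit in scale_to_01(values)
    ]


sample = [2.0, 4.0, 6.0]  # hypothetical column
print(scale_to_01(sample))                  # [0.0, 0.5, 1.0]
print(scale_to_z_score(sample))             # centered on 0
print(scale_by_min_max(sample, 1.0, 10.0))  # [1.0, 5.5, 10.0]
```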

For each data processing transform, MLTransform runs in both write mode and read mode. For more information about using MLTransform, see Preprocess data with MLTransform in the Apache Beam documentation.

MLTransform in write mode

When MLTransform is in write mode, it produces artifacts, such as minimum, maximum, and variance, for different data processing transforms. Reusing these artifacts ensures that you apply the same values, and the same preprocessing transforms, when you train your model, when you serve it in production, and when you test its accuracy.

MLTransform in read mode

In read mode, MLTransform uses the artifacts generated in write mode to scale the entire dataset.
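The split between the two modes can be pictured with a small standalone sketch. It assumes a hypothetical JSON file named `artifacts.json` and hypothetical helper functions; MLTransform stores its artifacts in its own format, so treat this only as an illustration of the idea of computing statistics once and reusing them later.

```python
import json
import os
import tempfile


def apply_scaling(values, artifacts):
    """Scale values with previously computed minimum and maximum."""
    span = artifacts["max"] - artifacts["min"]
    return [(v - artifacts["min"]) / span for v in values]


def write_mode(values, artifact_dir):
    """Compute the statistics, save them, and scale the input."""
    artifacts = {"min": min(values), "max": max(values)}
    with open(os.path.join(artifact_dir, "artifacts.json"), "w") as f:
        json.dump(artifacts, f)
    return apply_scaling(values, artifacts)


def read_mode(values, artifact_dir):
    """Load the saved statistics and scale new input with them."""
    with open(os.path.join(artifact_dir, "artifacts.json")) as f:
        artifacts = json.load(f)
    return apply_scaling(values, artifacts)


artifact_dir = tempfile.mkdtemp(prefix="scaling_sketch_")
train_scaled = write_mode([10, 20, 30], artifact_dir)
serve_scaled = read_mode([15, 40], artifact_dir)
print(train_scaled)  # [0.0, 0.5, 1.0]
print(serve_scaled)  # [0.25, 1.5]
```

The second call never recomputes the minimum and maximum, which is why 40 lands outside the range of 0 to 1.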

Import the required modules

To use MLTransform, install tensorflow_transform and the Apache Beam SDK version 2.53.0 or later.

pip install 'apache_beam[interactive]>=2.53.0' --quiet
pip install tensorflow-transform --quiet

import os
import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ScaleTo01
from apache_beam.ml.transforms.tft import ScaleByMinMax
from apache_beam.ml.transforms.tft import ScaleToZScore

artifact_location_scale_to_01 = tempfile.mkdtemp(prefix='scale_to_01_')
artifact_location_scale_to_zscore = tempfile.mkdtemp(prefix='scale_to_zscore_')
artifact_location_scale_by_min_max = tempfile.mkdtemp(prefix='scale_by_min_max_')

# data used in MLTransform's write mode
data = [
  {'int_feature_1': 11, 'int_feature_2': -10},
  {'int_feature_1': 34, 'int_feature_2': -33},
  {'int_feature_1': 5, 'int_feature_2': -63},
  {'int_feature_1': 12, 'int_feature_2': -38},
  {'int_feature_1': 32, 'int_feature_2': -65},
  {'int_feature_1': 63, 'int_feature_2': -21},
]

# data used in MLTransform's read mode
test_data = [
  {'int_feature_1': 29, 'int_feature_2': -20},
  {'int_feature_1': -5, 'int_feature_2': -11},
  {'int_feature_1': 5, 'int_feature_2': -44},
  {'int_feature_1': 29, 'int_feature_2': -12},
  {'int_feature_1': 20, 'int_feature_2': -53},
  {'int_feature_1': 70, 'int_feature_2': -8},
]

Scale the data between 0 and 1

Scale the data so that it's in the range of 0 to 1. To scale the data, the transform calculates minimum and maximum values on the whole dataset, and then performs the following calculation:

x = (x - x_min) / (x_max - x_min)

To scale the data, use the ScaleTo01 data processing transform in MLTransform.

# MLTransform in write mode.
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(data)

  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(
          write_artifact_location=artifact_location_scale_to_01
      ).with_transform(ScaleTo01(columns=['int_feature_1', 'int_feature_2']))
  )

  transformed_pcoll | "Print" >> beam.Map(print)

Row(int_feature_1=array([0.10344828], dtype=float32), int_feature_2=array([1.], dtype=float32))
Row(int_feature_1=array([0.5], dtype=float32), int_feature_2=array([0.58181816], dtype=float32))
Row(int_feature_1=array([0.], dtype=float32), int_feature_2=array([0.03636364], dtype=float32))
Row(int_feature_1=array([0.12068965], dtype=float32), int_feature_2=array([0.4909091], dtype=float32))
Row(int_feature_1=array([0.46551725], dtype=float32), int_feature_2=array([0.], dtype=float32))
Row(int_feature_1=array([1.], dtype=float32), int_feature_2=array([0.8], dtype=float32))

In this dataset, the following are the minimum and maximum values for the columns:

int_feature_1: 5 and 63
int_feature_2: -65 and -10

In the output for the column int_feature_1, the data is scaled between 0 and 1 by using the values 5 and 63. 5 is scaled to 0, and 63 is scaled to 1. The remaining values are scaled between 0 and 1 by using the formula x = (x - x_min) / (x_max - x_min).
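The int_feature_1 values in the output can be reproduced with a few lines of plain Python. This is a sketch of the arithmetic only, with illustrative variable names; it uses 64-bit floats, so the results agree with the float32 output up to rounding.

```python
# int_feature_1 values from the write-mode data
int_feature_1 = [11, 34, 5, 12, 32, 63]

x_min, x_max = min(int_feature_1), max(int_feature_1)  # 5 and 63

# Apply (x - x_min) / (x_max - x_min) to every value.
scaled = [(x - x_min) / (x_max - x_min) for x in int_feature_1]

for x, s in zip(int_feature_1, scaled):
    print(f"{x} -> {s:.6f}")
```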

# MLTransform in read mode
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(test_data)

  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(
          read_artifact_location=artifact_location_scale_to_01)
  )

  transformed_pcoll | "Print" >> beam.Map(print)

Row(int_feature_1=array([0.41379312], dtype=float32), int_feature_2=array([0.8181818], dtype=float32))
Row(int_feature_1=array([-0.1724138], dtype=float32), int_feature_2=array([0.9818182], dtype=float32))
Row(int_feature_1=array([0.], dtype=float32), int_feature_2=array([0.38181818], dtype=float32))
Row(int_feature_1=array([0.41379312], dtype=float32), int_feature_2=array([0.96363634], dtype=float32))
Row(int_feature_1=array([0.25862068], dtype=float32), int_feature_2=array([0.21818182], dtype=float32))
Row(int_feature_1=array([1.1206896], dtype=float32), int_feature_2=array([1.0363636], dtype=float32))

MLTransform learned in write mode that int_feature_1 ranges from 5 to 63.

In read mode, when it encounters 29 in test_data for int_feature_1, it scales it by using the following formula:

 (value - min) / (max - min)

The following calculation shows the formula with the values:

 (29 - 5) / (63 - 5) = 0.41379312

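Twenty-nine is scaled based on the minimum and maximum values generated in write mode. Because those values aren't recalculated in read mode, test values that fall outside the range of 5 to 63, such as -5 and 70, are scaled to values below 0 and above 1.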
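To see which test values land outside the range of 0 to 1, the read-mode arithmetic can be repeated for both columns. The sketch below hard-codes the write-mode minimum and maximum values; the `train_ranges` dictionary and the `scale_row` helper are illustrative names, not part of MLTransform.

```python
# Minimum and maximum of each column in the write-mode data.
train_ranges = {"int_feature_1": (5, 63), "int_feature_2": (-65, -10)}

test_data = [
    {"int_feature_1": 29, "int_feature_2": -20},
    {"int_feature_1": -5, "int_feature_2": -11},
    {"int_feature_1": 5, "int_feature_2": -44},
    {"int_feature_1": 29, "int_feature_2": -12},
    {"int_feature_1": 20, "int_feature_2": -53},
    {"int_feature_1": 70, "int_feature_2": -8},
]


def scale_row(row, ranges):
    """Scale every column of a row with the stored minimum and maximum."""
    return {
        name: (value - ranges[name][0]) / (ranges[name][1] - ranges[name][0])
        for name, value in row.items()
    }


scaled_rows = [scale_row(row, train_ranges) for row in test_data]

for scaled in scaled_rows:
    # Columns whose scaled value falls outside of [0, 1].
    outside = [name for name, v in scaled.items() if not 0.0 <= v <= 1.0]
    rounded = {name: round(v, 6) for name, v in scaled.items()}
    print(rounded, "outside [0, 1]:", outside)
```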

Scale by using the z-score

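As with ScaleTo01, use ScaleToZScore to scale the values by using the z-score. In write mode, the transform calculates the mean and variance of the whole dataset and uses them to scale each value.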

# MLTransform in write mode
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(data)

  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(
          write_artifact_location=artifact_location_scale_to_zscore
      ).with_transform(ScaleToZScore(columns=['int_feature_1', 'int_feature_2']))
  )

  transformed_pcoll | "Print" >> beam.Map(print)

Row(int_feature_1=array([-0.76950264], dtype=float32), int_feature_2=array([1.401755], dtype=float32))
Row(int_feature_1=array([0.3974355], dtype=float32), int_feature_2=array([0.2638597], dtype=float32))
Row(int_feature_1=array([-1.0739213], dtype=float32), int_feature_2=array([-1.2203515], dtype=float32))
Row(int_feature_1=array([-0.7187662], dtype=float32), int_feature_2=array([0.01649117], dtype=float32))
Row(int_feature_1=array([0.2959626], dtype=float32), int_feature_2=array([-1.3192989], dtype=float32))
Row(int_feature_1=array([1.8687923], dtype=float32), int_feature_2=array([0.8575442], dtype=float32))

# MLTransform in read mode
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(test_data)

  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(
          read_artifact_location=artifact_location_scale_to_zscore)
  )

  transformed_pcoll | "Print" >> beam.Map(print)

Row(int_feature_1=array([0.14375328], dtype=float32), int_feature_2=array([0.9070179], dtype=float32))
Row(int_feature_1=array([-1.5812857], dtype=float32), int_feature_2=array([1.3522812], dtype=float32))
Row(int_feature_1=array([-1.0739213], dtype=float32), int_feature_2=array([-0.28035107], dtype=float32))
Row(int_feature_1=array([0.14375328], dtype=float32), int_feature_2=array([1.3028076], dtype=float32))
Row(int_feature_1=array([-0.31287467], dtype=float32), int_feature_2=array([-0.7256144], dtype=float32))
Row(int_feature_1=array([2.2239475], dtype=float32), int_feature_2=array([1.5007024], dtype=float32))
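The z-score outputs for int_feature_1 can be checked by hand with the formula (value - mean) / standard deviation. The sketch below uses the population standard deviation, which is an assumption that happens to reproduce the numbers shown above; the variable names are illustrative.

```python
import statistics

# int_feature_1 values from the write-mode and read-mode data
train = [11, 34, 5, 12, 32, 63]
test = [29, -5, 5, 29, 20, 70]

# Statistics come from the write-mode data only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)  # population standard deviation

z_train = [(x - mean) / std for x in train]
z_test = [(x - mean) / std for x in test]

print(f"mean={mean:.4f} std={std:.4f}")
print([round(z, 6) for z in z_train])
print([round(z, 6) for z in z_test])
```

The first write-mode value works out to about -0.7695 and the first read-mode value to about 0.1438, in line with the printed rows.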

Scale by using ScaleByMinMax

Use ScaleByMinMax to scale your data into the range [min_value, max_value].

min_value = 1
max_value = 10

# MLTransform in write mode
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(data)

  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(
          write_artifact_location=artifact_location_scale_by_min_max
      ).with_transform(
          ScaleByMinMax(
              columns=['int_feature_1', 'int_feature_2'],
              min_value=min_value,
              max_value=max_value))
  )

  transformed_pcoll | "Print" >> beam.Map(print)

Row(int_feature_1=array([1.9310346], dtype=float32), int_feature_2=array([10.], dtype=float32))
Row(int_feature_1=array([5.5], dtype=float32), int_feature_2=array([6.2363634], dtype=float32))
Row(int_feature_1=array([1.], dtype=float32), int_feature_2=array([1.3272727], dtype=float32))
Row(int_feature_1=array([2.086207], dtype=float32), int_feature_2=array([5.418182], dtype=float32))
Row(int_feature_1=array([5.1896553], dtype=float32), int_feature_2=array([1.], dtype=float32))
Row(int_feature_1=array([10.], dtype=float32), int_feature_2=array([8.200001], dtype=float32))

# MLTransform in read mode
with beam.Pipeline() as pipeline:
  data_pcoll = pipeline | "CreateData" >> beam.Create(test_data)

  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(
          read_artifact_location=artifact_location_scale_by_min_max)
  )

  transformed_pcoll | "Print" >> beam.Map(print)

Row(int_feature_1=array([4.7241383], dtype=float32), int_feature_2=array([8.363636], dtype=float32))
Row(int_feature_1=array([-0.5517242], dtype=float32), int_feature_2=array([9.836364], dtype=float32))
Row(int_feature_1=array([1.], dtype=float32), int_feature_2=array([4.4363637], dtype=float32))
Row(int_feature_1=array([4.7241383], dtype=float32), int_feature_2=array([9.672727], dtype=float32))
Row(int_feature_1=array([3.3275862], dtype=float32), int_feature_2=array([2.9636364], dtype=float32))
Row(int_feature_1=array([11.086206], dtype=float32), int_feature_2=array([10.327272], dtype=float32))
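The ScaleByMinMax outputs for int_feature_1 are consistent with first scaling each value to the range of 0 to 1 and then stretching it to [min_value, max_value]. The following sketch reproduces that arithmetic in plain Python; the helper name is illustrative, and the formula is inferred from the output above rather than taken from the transform's source.

```python
min_value, max_value = 1, 10

# int_feature_1 values from the write-mode and read-mode data
train = [11, 34, 5, 12, 32, 63]
test = [29, -5, 5, 29, 20, 70]

# Minimum and maximum come from the write-mode data only.
x_min, x_max = min(train), max(train)


def scale_by_min_max(x):
    """Scale x to [0, 1] with x_min and x_max, then map to the target range."""
    unit = (x - x_min) / (x_max - x_min)
    return min_value + unit * (max_value - min_value)


print([round(scale_by_min_max(x), 6) for x in train])
print([round(scale_by_min_max(x), 6) for x in test])
```

As with ScaleTo01, test values outside the range seen in write mode, such as -5 and 70, fall outside [min_value, max_value].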
