
Transfer Learning for the Audio Domain with TensorFlow Lite Model Maker

Copyright 2024 The AI Edge Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

In this Colab notebook, you'll learn how to use TensorFlow Lite Model Maker to train a custom audio classification model.

The Model Maker library uses transfer learning to simplify the process of training a TensorFlow Lite model using a custom dataset. Retraining a TensorFlow Lite model with your own custom dataset reduces the amount of training data and time required.

This notebook is part of the codelab Customize an Audio model and deploy on Android.

You'll use a custom birds dataset and export a TFLite model that can be used on a phone, a TensorFlow.js model that can be used for inference in the browser, and a SavedModel version that you can use for serving.

Installing dependencies

    sudo apt -y install libportaudio2
    pip install tflite-model-maker

Import TensorFlow, Model Maker and other libraries

Among the dependencies you need, the main ones are TensorFlow and Model Maker. The others are for audio manipulation, playback, and visualization.

    import tensorflow as tf
    import tflite_model_maker as mm
    from tflite_model_maker import audio_classifier
    import os

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    import itertools
    import glob
    import random

    from IPython.display import Audio, Image
    from scipy.io import wavfile

    print(f"TensorFlow Version: {tf.__version__}")
    print(f"Model Maker Version: {mm.__version__}")

The Birds dataset

The Birds dataset is an educational collection of songs from 5 types of birds:

White-breasted Wood-Wren
House Sparrow
Red Crossbill
Chestnut-crowned Antpitta
Azara's Spinetail

The original audio came from Xeno-canto, a website dedicated to sharing bird sounds from all over the world.

Let's start by downloading the data.

    birds_dataset_folder = tf.keras.utils.get_file(
        'birds_dataset.zip',
        'https://storage.googleapis.com/laurencemoroney-blog.appspot.com/birds_dataset.zip',
        cache_dir='./',
        cache_subdir='dataset',
        extract=True)

Explore the data

The audio files are already split into train and test folders. Inside each split folder there is one folder per bird, named after its bird_code.
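The folder layout described above (one split folder, then one subfolder per label) can be summarized with a few lines of standard-library Python. This is a minimal sketch: `count_clips_per_label` is a hypothetical helper, and the temporary tree with empty `.wav` files only stands in for the real dataset so the example runs on its own.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_clips_per_label(split_dir):
    """Count .wav files under split_dir/<label>/ for each label folder."""
    counts = Counter()
    for wav_path in Path(split_dir).glob('*/*.wav'):
        # The parent folder name is the label (the bird_code in this dataset).
        counts[wav_path.parent.name] += 1
    return dict(counts)


# Build a tiny stand-in tree so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for label, n_files in [('houspa', 2), ('redcro', 1)]:
        label_dir = Path(tmp) / 'train' / label
        label_dir.mkdir(parents=True)
        for i in range(n_files):
            (label_dir / f'clip_{i}.wav').touch()
    demo_counts = count_clips_per_label(Path(tmp) / 'train')

print(demo_counts)
```

Pointing the helper at the real train or test folder should give a quick view of how balanced the classes are.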

All audio files are mono with a 16 kHz sample rate.

For more information about each file, you can read the metadata.csv file. It lists the author, license, and other details for every file. You won't need to read it yourself for this tutorial.
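If you want to confirm the audio format stated above for a given file, the standard-library `wave` module can report the channel count and sample rate. The sketch below is illustrative only: `describe_wav` is a hypothetical helper, and it writes one second of silence to a temporary file so that it runs without the dataset.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, 'rb') as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return channels, sample_rate, duration


# Write one second of 16-bit silence so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'silence.wav')
    with wave.open(demo_path, 'wb') as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(16000)    # 16 kHz
        wav.writeframes(b'\x00\x00' * 16000)
    demo_info = describe_wav(demo_path)

print(demo_info)
```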

    # @title [Run this] Util functions and data structures.

    data_dir = './dataset/small_birds_dataset'

    # Maps each folder name (bird_code) to a readable species name.
    bird_code_to_name = {
        'wbwwre1': 'White-breasted Wood-Wren',
        'houspa': 'House Sparrow',
        'redcro': 'Red Crossbill',
        'chcant2': 'Chestnut-crowned Antpitta',
        'azaspi1': "Azara's Spinetail",
    }

    # One reference photo per species; the trailing comments credit the source.
    birds_images = {
        'wbwwre1': 'https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Henicorhina_leucosticta_%28Cucarachero_pechiblanco%29_-_Juvenil_%2814037225664%29.jpg/640px-Henicorhina_leucosticta_%28Cucarachero_pechiblanco%29_-_Juvenil_%2814037225664%29.jpg',  # Alejandro Bayer Tamayo from Armenia, Colombia
        'houspa': 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/House_Sparrow%2C_England_-_May_09.jpg/571px-House_Sparrow%2C_England_-_May_09.jpg',  # Diliff
        'redcro': 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Red_Crossbills_%28Male%29.jpg/640px-Red_Crossbills_%28Male%29.jpg',  # Elaine R. Wilson, www.naturespicsonline.com
        'chcant2': 'https://upload.wikimedia.org/wikipedia/commons/thumb/6/67/Chestnut-crowned_antpitta_%2846933264335%29.jpg/640px-Chestnut-crowned_antpitta_%2846933264335%29.jpg',  # Mike's Birds from Riverside, CA, US
        'azaspi1': 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Synallaxis_azarae_76608368.jpg/640px-Synallaxis_azarae_76608368.jpg',  # https://www.inaturalist.org/photos/76608368
    }

    test_files = os.path.abspath(os.path.join(data_dir, 'test/*/*.wav'))


    def get_random_audio_file():
        test_list = glob.glob(test_files)
        random_audio_path = random.choice(test_list)
        return random_audio_path


    def show_bird_data(audio_path):
        # The second positional argument of wavfile.read is mmap, not a file
        # mode, so the path is passed on its own.
        sample_rate, audio_data = wavfile.read(audio_path)

        # The bird code is the name of the folder that contains the file.
        bird_code = audio_path.split('/')[-2]
        print(f'Bird name: {bird_code_to_name[bird_code]}')
        print(f'Bird code: {bird_code}')
        display(Image(birds_images[bird_code]))

        plttitle = f'{bird_code_to_name[bird_code]} ({bird_code})'
        plt.title(plttitle)
        plt.plot(audio_data)
        display(Audio(audio_data, rate=sample_rate))


    print('functions and data structures created')

Playing some audio

To get a better understanding of the data, let's listen to a random audio file from the test split.

    random_audio = get_random_audio_file()
    show_bird_data(random_audio)

Training the Model

When using Model Maker for audio, you have to start with a model spec. This is the base model from which your new model will extract information to learn about the new classes. It also affects how the dataset is transformed to respect the model spec's parameters, such as sample rate and number of channels.

YAMNet is an audio event classifier trained on the AudioSet dataset to predict audio events from the AudioSet ontology.

Its input is expected to be 16 kHz audio with 1 channel.

You don't need to do any resampling yourself. Model Maker takes care of that for you.

frame_length decides how long each training sample is. In this case it is 6 * EXPECTED_WAVEFORM_LENGTH samples.
frame_step decides how far apart the training samples start. In this case, the ith sample starts 3 * EXPECTED_WAVEFORM_LENGTH samples after the (i-1)th sample, so consecutive samples overlap by half.

These values are set to work around some limitations of real-world datasets.

For example, in the bird dataset, birds don't sing all the time. They sing, rest and sing again, with noises in between. Having a long frame would help capture the singing, but setting it too long will reduce the number of samples for training.

    spec = audio_classifier.YamNetSpec(
        keep_yamnet_and_custom_heads=True,
        frame_step=3 * audio_classifier.YamNetSpec.EXPECTED_WAVEFORM_LENGTH,
        frame_length=6 * audio_classifier.YamNetSpec.EXPECTED_WAVEFORM_LENGTH)
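To get a feel for the trade-off between frame length and number of training samples, the sketch below counts how many windows a single recording yields. It rests on several assumptions that are not stated in this tutorial: that EXPECTED_WAVEFORM_LENGTH is 15600 samples, that trailing audio shorter than a full window is dropped, and `count_frames` itself is a hypothetical helper, not part of Model Maker.

```python
# Assumption: YamNetSpec.EXPECTED_WAVEFORM_LENGTH is 15600 samples
# (0.975 s at 16 kHz). Check the value in your installed version.
EXPECTED_WAVEFORM_LENGTH = 15600
SAMPLE_RATE = 16000


def count_frames(num_samples, frame_length, frame_step):
    """Number of full windows of frame_length taken every frame_step samples.

    Assumes trailing samples that do not fill a whole window are dropped.
    """
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_step


clip_samples = 30 * SAMPLE_RATE  # a hypothetical 30-second recording

# Long, half-overlapping windows, as configured in the spec above.
long_frames = count_frames(clip_samples,
                           frame_length=6 * EXPECTED_WAVEFORM_LENGTH,
                           frame_step=3 * EXPECTED_WAVEFORM_LENGTH)
# Short, non-overlapping windows for comparison.
short_frames = count_frames(clip_samples,
                            frame_length=EXPECTED_WAVEFORM_LENGTH,
                            frame_step=EXPECTED_WAVEFORM_LENGTH)
print(long_frames, short_frames)
```

Under these assumptions the longer window produces noticeably fewer samples from the same recording, which is the cost the text describes.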

Loading the data

Model Maker has an API to load the data from a folder and convert it to the format the model spec expects.

The train and test splits are based on the folders. The validation dataset will be created as 20% of the train split.

    train_data = audio_classifier.DataLoader.from_folder(
        spec, os.path.join(data_dir, 'train'), cache=True)
    train_data, validation_data = train_data.split(0.8)

    test_data = audio_classifier.DataLoader.from_folder(
        spec, os.path.join(data_dir, 'test'), cache=True)
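The idea behind the 80/20 train and validation split can be illustrated without Model Maker. The following is only a sketch of the concept: `split_items` is a hypothetical helper, the file names are made up, and the library's own `split` method may order or shuffle the data differently.

```python
import random


def split_items(items, fraction, seed=0):
    """Shuffle items and split them into two lists at the given fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


clips = [f'clip_{i}.wav' for i in range(10)]  # hypothetical file names
train_clips, validation_clips = split_items(clips, 0.8)
print(len(train_clips), len(validation_clips))
```

Every item lands in exactly one of the two lists, so the validation set stays unseen during training.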

Training the model

The audio_classifier module has a create method that creates a model and immediately starts training it.

You can customize many parameters. For more information, see the documentation.

On this first try you'll use all the default configurations and train for 100 epochs.

    batch_size = 128
    epochs = 100

    print('Training the model')
    model = audio_classifier.create(
        train_data,
        spec,
        validation_data,
        batch_size=batch_size,
        epochs=epochs)

The accuracy looks good, but it's important to run the evaluation step on the test data and verify that your model achieves good results on unseen data.

    print('Evaluating the model')
    model.evaluate(test_data)

Understanding your model

When training a classifier, it's useful to see the confusion matrix. It shows, for each true class, how often the classifier predicted each of the classes on the test data.

Model Maker already creates the confusion matrix for you.

    def show_confusion_matrix(confusion, test_labels):
        """Normalize the confusion matrix by row and plot it."""
        # keepdims=True keeps the row sums as a column vector, so each row
        # is divided by its own total.
        confusion_normalized = (
            confusion.astype("float") / confusion.sum(axis=1, keepdims=True))
        axis_labels = test_labels
        ax = sns.heatmap(
            confusion_normalized, xticklabels=axis_labels, yticklabels=axis_labels,
            cmap='Blues', annot=True, fmt='.2f', square=True)
        plt.title("Confusion matrix")
        plt.ylabel("True label")
        plt.xlabel("Predicted label")


    confusion_matrix = model.confusion_matrix(test_data)
    show_confusion_matrix(confusion_matrix.numpy(), test_data.index_to_label)
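For readers who want to see what a row-normalized confusion matrix contains, here is a small plain-Python sketch. It is not how Model Maker computes its matrix; `confusion_counts` and `normalize_rows` are hypothetical helpers and the labels are made up.

```python
def confusion_counts(true_labels, predicted_labels, num_classes):
    """Build a num_classes x num_classes matrix; rows are true classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its own sum so every non-empty row adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Hypothetical labels for a 3-class problem.
true_labels = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
predicted_labels = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]

counts = confusion_counts(true_labels, predicted_labels, num_classes=3)
rates = normalize_rows(counts)
print(counts)
print(rates)
```

After normalization, the diagonal entry of each row is the fraction of that class's examples that were classified correctly.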

Testing the model [Optional]

You can try the model on a sample audio from the test dataset just to see the results.

First you get the serving model.

    serving_model = model.create_serving_model()

    print(f'Model\'s input shape and type: {serving_model.inputs}')
    print(f'Model\'s output shape and type: {serving_model.outputs}')

Coming back to the random audio you loaded earlier:

    # This picks a new random file. To reuse the file loaded earlier,
    # comment out the line below.
    random_audio = get_random_audio_file()
    show_bird_data(random_audio)

The model created has a fixed input window.

For a given audio file, you'll have to split it into windows of the expected size. The last window might need to be padded with zeros.

    # The second positional argument of wavfile.read is mmap, not a file
    # mode, so the path is passed on its own.
    sample_rate, audio_data = wavfile.read(random_audio)

    # Scale the 16-bit integer samples to floats.
    audio_data = np.array(audio_data) / tf.int16.max
    input_size = serving_model.input_shape[1]

    # Non-overlapping windows of input_size samples; the last one is zero-padded.
    split_audio_data = tf.signal.frame(
        audio_data, input_size, input_size, pad_end=True, pad_value=0)

    print(f'Test audio path: {random_audio}')
    print(f'Original size of the audio data: {len(audio_data)}')
    print(f'Number of windows for inference: {len(split_audio_data)}')
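The windowing step can also be written out in plain Python to make the padding behavior explicit. This is a sketch of the idea rather than a replacement for `tf.signal.frame`; `to_float` and `split_into_windows` are hypothetical helpers and the sample values are made up.

```python
INT16_MAX = 32767  # largest value of a signed 16-bit sample


def to_float(samples):
    """Scale 16-bit integer samples to roughly the range [-1.0, 1.0]."""
    return [sample / INT16_MAX for sample in samples]


def split_into_windows(samples, window_size):
    """Cut samples into non-overlapping windows, zero-padding the last one."""
    windows = []
    for start in range(0, len(samples), window_size):
        window = list(samples[start:start + window_size])
        window += [0.0] * (window_size - len(window))  # pad the short tail
        windows.append(window)
    return windows


# Ten made-up 16-bit samples, split into windows of four.
raw = [0, 16384, -16384, 32767, -32767, 8192, 0, 4096, 2048, 1024]
windows = split_into_windows(to_float(raw), window_size=4)
print(len(windows), windows[-1])
```

Ten samples with a window size of four give three windows, and the last window carries two real samples followed by two zeros.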

You'll loop over all the windows and apply the model to each one.

The model you've just trained has 2 outputs: the original YAMNet output and the output of the head you've just trained. This is important because the real-world environment is more complicated than just bird sounds. You can use YAMNet's output to filter out irrelevant audio. For example, in the birds use case, if YAMNet is not classifying a window as Birds or Animals, the output from your model may be an irrelevant classification.

Below, both outputs are printed to make their relationship easier to understand. Most of the mistakes your model makes occur when YAMNet's prediction is not related to your domain (e.g., birds).

    print(random_audio)

    results = []
    print('Result of the window ith:  your model class -> score,  (spec class -> score)')
    for i, data in enumerate(split_audio_data):
        yamnet_output, inference = serving_model(data)
        results.append(inference[0].numpy())
        result_index = tf.argmax(inference[0])
        spec_result_index = tf.argmax(yamnet_output[0])
        t = spec._yamnet_labels()[spec_result_index]
        result_str = f'Result of the window {i}: ' \
            f'\t{test_data.index_to_label[result_index]} -> {inference[0][result_index].numpy():.3f}, ' \
            f'\t({spec._yamnet_labels()[spec_result_index]} -> {yamnet_output[0][spec_result_index]:.3f})'
        print(result_str)

    results_np = np.array(results)
    mean_results = results_np.mean(axis=0)
    result_index = mean_results.argmax()
    print(f'Mean result: {test_data.index_to_label[result_index]} -> {mean_results[result_index]}')
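One way to act on the two outputs is to average your model's scores only over windows that YAMNet considers relevant. The sketch below shows that idea in plain Python under stated assumptions: the label names in `RELEVANT_YAMNET_LABELS` are assumed to exist in the YAMNet class map, `average_relevant_scores` is a hypothetical helper, and the scores are made up.

```python
# Assumption: these label names exist in the YAMNet class map; verify them
# against the labels your model spec actually returns.
RELEVANT_YAMNET_LABELS = {'Bird', 'Animal'}


def average_relevant_scores(window_scores, yamnet_top_labels,
                            relevant_labels=RELEVANT_YAMNET_LABELS):
    """Average per-class scores over windows whose YAMNet label is relevant.

    Returns None when no window passes the filter.
    """
    kept = [scores for scores, label in zip(window_scores, yamnet_top_labels)
            if label in relevant_labels]
    if not kept:
        return None
    num_classes = len(kept[0])
    return [sum(scores[c] for scores in kept) / len(kept)
            for c in range(num_classes)]


# Hypothetical outputs for three windows and two custom classes.
window_scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
yamnet_top_labels = ['Bird', 'Speech', 'Animal']

mean_scores = average_relevant_scores(window_scores, yamnet_top_labels)
best_class = max(range(len(mean_scores)), key=mean_scores.__getitem__)
print(mean_scores, best_class)
```

In this made-up example the second window is ignored because YAMNet labeled it as speech, so it does not pull the average toward the wrong class.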

Exporting the model

The last step is exporting your model to be used on embedded devices or in the browser.

The export method exports both formats for you.

    models_path = './birds_models'
    print(f'Exporting the TFLite model to {models_path}')

    model.export(models_path, tflite_filename='my_birds_model.tflite')

You can also export the SavedModel version for serving or for use in a Python environment.

    model.export(
        models_path,
        export_format=[mm.ExportFormat.SAVED_MODEL, mm.ExportFormat.LABEL])

Next Steps

You did it.

Now your new model can be deployed on mobile devices using the TFLite AudioClassifier Task API.

You can also try the same process with your own data and different classes. For details, see the documentation for Model Maker for Audio Classification.

Also learn from the end-to-end reference apps for Android and iOS.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-05-28 UTC.
