Transfer Learning for the Audio Domain with TensorFlow Lite Model Maker
Copyright 2024 The AI Edge Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
In this Colab notebook, you'll learn how to use the TensorFlow Lite Model Maker to train a custom audio classification model.
The Model Maker library uses transfer learning to simplify the process of training a TensorFlow Lite model using a custom dataset. Retraining a TensorFlow Lite model with your own custom dataset reduces the amount of training data and time required.
You'll use a custom birds dataset and export a TFLite model that can be used on a phone, a TensorFlow.js model that can be used for inference in the browser, and a SavedModel version that you can use for serving.
Import TensorFlow, Model Maker and other libraries
Among the required dependencies, you'll use TensorFlow and Model Maker. The others are for audio manipulation, playback, and visualization.
The audio files are already split into train and test folders. Inside each split folder there is one folder per bird, named after its bird_code.
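To make the folder layout concrete, here is a minimal sketch that counts the clips per split and per bird. It assumes the layout described above (`<data_dir>/<split>/<bird_code>/*.wav`); the helper name `count_clips` is hypothetical and not part of Model Maker.

```python
from collections import Counter
from pathlib import Path


def count_clips(data_dir):
    """Count .wav files per (split, bird_code) under data_dir/<split>/<bird_code>/."""
    counts = Counter()
    for wav_path in Path(data_dir).glob('*/*/*.wav'):
        bird_code = wav_path.parent.name          # e.g. 'houspa'
        split = wav_path.parent.parent.name       # e.g. 'train' or 'test'
        counts[(split, bird_code)] += 1
    return counts
```

Calling `count_clips('./dataset/small_birds_dataset')` after the dataset is downloaded should return one entry per split and bird, which is a quick way to spot classes with very few recordings.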
All audio files are mono with a 16 kHz sample rate.
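If you bring your own recordings, you may want to check that they match this format. The sketch below uses only Python's standard `wave` module and assumes uncompressed PCM WAV files; the helper names `describe_wav` and `is_mono_16k` are hypothetical.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), 'rb') as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


def is_mono_16k(path, expected_rate=16000):
    """True if the file has one channel and is sampled at expected_rate."""
    channels, sample_rate, _ = describe_wav(path)
    return channels == 1 and sample_rate == expected_rate
```

Running `is_mono_16k` over every file in a folder is a cheap sanity check before training.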
For more information about each file, you can read the metadata.csv file. It lists each file's author, license, and some additional information. You won't need to read it yourself in this tutorial.
# @title [Run this] Util functions and data structures.

# Imports needed by the helpers in this cell.
import glob
import os
import random

import matplotlib.pyplot as plt
from IPython.display import Audio, Image, display
from scipy.io import wavfile

data_dir = './dataset/small_birds_dataset'

bird_code_to_name = {
    'wbwwre1': 'White-breasted Wood-Wren',
    'houspa': 'House Sparrow',
    'redcro': 'Red Crossbill',
    'chcant2': 'Chestnut-crowned Antpitta',
    'azaspi1': "Azara's Spinetail",
}

birds_images = {
    'wbwwre1': 'https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Henicorhina_leucosticta_%28Cucarachero_pechiblanco%29_-_Juvenil_%2814037225664%29.jpg/640px-Henicorhina_leucosticta_%28Cucarachero_pechiblanco%29_-_Juvenil_%2814037225664%29.jpg',  # Alejandro Bayer Tamayo from Armenia, Colombia
    'houspa': 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/House_Sparrow%2C_England_-_May_09.jpg/571px-House_Sparrow%2C_England_-_May_09.jpg',  # Diliff
    'redcro': 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Red_Crossbills_%28Male%29.jpg/640px-Red_Crossbills_%28Male%29.jpg',  # Elaine R. Wilson, www.naturespicsonline.com
    'chcant2': 'https://upload.wikimedia.org/wikipedia/commons/thumb/6/67/Chestnut-crowned_antpitta_%2846933264335%29.jpg/640px-Chestnut-crowned_antpitta_%2846933264335%29.jpg',  # Mike's Birds from Riverside, CA, US
    'azaspi1': 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Synallaxis_azarae_76608368.jpg/640px-Synallaxis_azarae_76608368.jpg',  # https://www.inaturalist.org/photos/76608368
}

test_files = os.path.abspath(os.path.join(data_dir, 'test/*/*.wav'))


def get_random_audio_file():
    # Pick one WAV file at random from the test split.
    test_list = glob.glob(test_files)
    random_audio_path = random.choice(test_list)
    return random_audio_path


def show_bird_data(audio_path):
    # wavfile.read takes the path only; its second argument is mmap, not a file mode.
    sample_rate, audio_data = wavfile.read(audio_path)

    # The parent folder name is the bird code.
    bird_code = audio_path.split('/')[-2]
    print(f'Bird name: {bird_code_to_name[bird_code]}')
    print(f'Bird code: {bird_code}')
    display(Image(birds_images[bird_code]))

    plttitle = f'{bird_code_to_name[bird_code]} ({bird_code})'
    plt.title(plttitle)
    plt.plot(audio_data)
    display(Audio(audio_data, rate=sample_rate))


print('functions and data structures created')
Playing some audio
To get a better understanding of the data, let's listen to a random audio file from the test split.
When using Model Maker for audio, you have to start with a model spec. This is the base model from which your new model will extract information to learn about the new classes. It also determines how the dataset is transformed to match the spec's parameters, such as sample rate and number of channels.
YAMNet is an audio event classifier trained on the AudioSet dataset to predict audio events from the AudioSet ontology.
Its input is expected to be at 16 kHz and with 1 channel.
You don't need to do any resampling yourself. Model Maker takes care of that for you.
frame_length decides how long each training sample is. In this case, EXPECTED_WAVEFORM_LENGTH * 3s.
frame_step decides how far apart the training samples are. In this case, the ith sample starts EXPECTED_WAVEFORM_LENGTH * 6s after the (i-1)th sample.
These values are set to work around some limitations of real-world datasets.
For example, in the bird dataset, birds don't sing all the time. They sing, rest and sing again, with noises in between. Having a long frame would help capture the singing, but setting it too long will reduce the number of samples for training.
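The trade-off between frame length and number of training samples can be checked with a little arithmetic. The sketch below is not Model Maker code: it assumes that only full frames are kept, that lengths are expressed in audio samples, and that one YAMNet input window is 15600 samples at 16 kHz (the `WAVEFORM_LENGTH` constant is an assumption, as is the 30-second clip).

```python
def frame_starts(num_samples, frame_length, frame_step):
    """Start offsets of every full frame that fits in a clip of num_samples."""
    if frame_length <= 0 or frame_step <= 0:
        raise ValueError('frame_length and frame_step must be positive')
    if num_samples < frame_length:
        return []
    return list(range(0, num_samples - frame_length + 1, frame_step))


SAMPLE_RATE = 16000      # from the text: input is expected at 16 kHz
WAVEFORM_LENGTH = 15600  # assumption: samples in one YAMNet input window

clip_samples = 30 * SAMPLE_RATE  # a hypothetical 30-second recording
for length_multiple, step_multiple in [(3, 3), (6, 3), (6, 6)]:
    starts = frame_starts(
        clip_samples,
        length_multiple * WAVEFORM_LENGTH,
        step_multiple * WAVEFORM_LENGTH)
    print(f'length x{length_multiple}, step x{step_multiple}: {len(starts)} frames')
```

Longer frames or larger steps both reduce the number of frames a clip yields, which is the effect described above.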
The audio_classifier has the create method, which creates a model and immediately starts training it.
You can customize many parameters; for details, see the documentation.
On this first try you'll use all the default configurations and train for 100 epochs.
batch_size = 128
epochs = 100

print('Training the model')
model = audio_classifier.create(
    train_data,
    spec,
    validation_data,
    batch_size=batch_size,
    epochs=epochs)
The accuracy looks good, but it's important to run the evaluation step on the test data and verify that your model achieves good results on unseen data.
print('Evaluating the model')
model.evaluate(test_data)
Understanding your model
When training a classifier, it's useful to see the confusion matrix. The confusion matrix gives you detailed knowledge of how your classifier is performing on test data.
Model Maker already creates the confusion matrix for you.
import matplotlib.pyplot as plt
import seaborn as sns


def show_confusion_matrix(confusion, test_labels):
    """Normalize a confusion matrix by row and plot it as a heatmap."""
    # keepdims=True keeps the row sums as a column, so each row is divided by its own total.
    confusion_normalized = confusion.astype("float") / confusion.sum(axis=1, keepdims=True)
    axis_labels = test_labels
    ax = sns.heatmap(
        confusion_normalized, xticklabels=axis_labels, yticklabels=axis_labels,
        cmap='Blues', annot=True, fmt='.2f', square=True)
    plt.title("Confusion matrix")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


confusion_matrix = model.confusion_matrix(test_data)
show_confusion_matrix(confusion_matrix.numpy(), test_data.index_to_label)
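To see what the heatmap values mean, here is a small dependency-free sketch of a row-normalized confusion matrix. The labels are made up for illustration, and the helper names are hypothetical; Model Maker computes the actual counts for you.

```python
def confusion_counts(true_labels, predicted_labels, num_classes):
    """counts[i][j] = number of samples with true class i predicted as class j."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for true_index, predicted_index in zip(true_labels, predicted_labels):
        counts[true_index][predicted_index] += 1
    return counts


def normalize_rows(counts):
    """Divide each row by its total so every row with samples sums to 1."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Made-up labels for three classes, purely for illustration.
true_labels = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
predicted_labels = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]
for row in normalize_rows(confusion_counts(true_labels, predicted_labels, 3)):
    print([f'{value:.2f}' for value in row])
```

Each row corresponds to a true label, so the diagonal entry is the fraction of that class's test samples that were classified correctly.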
Testing the model [Optional]
You can try the model on a sample audio from the test dataset just to see the results.
First you get the serving model.
serving_model = model.create_serving_model()

print(f'Model\'s input shape and type: {serving_model.inputs}')
print(f'Model\'s output shape and type: {serving_model.outputs}')
Coming back to the random audio you loaded earlier
# This picks a new random test file; comment out the next line to reuse the file loaded earlier.
random_audio = get_random_audio_file()
show_bird_data(random_audio)
The model created has a fixed input window.
For a given audio file, you'll have to split it in windows of data of the expected size. The last window might need to be filled with zeros.
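# wavfile.read takes the path only; its second argument is mmap, not a file mode.
sample_rate, audio_data = wavfile.read(random_audio)

# Scale 16-bit PCM samples to the [-1, 1] range.
audio_data = np.array(audio_data) / tf.int16.max
input_size = serving_model.input_shape[1]

# Non-overlapping windows of input_size samples; the last one is zero-padded.
split_audio_data = tf.signal.frame(audio_data, input_size, input_size, pad_end=True, pad_value=0)

print(f'Test audio path: {random_audio}')
print(f'Original size of the audio data: {len(audio_data)}')
print(f'Number of windows for inference: {len(split_audio_data)}')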
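The windowing step above can be sketched without TensorFlow to show what happens to the last, shorter window. This is an illustration in plain Python lists under the same settings (window size equal to step, zero padding at the end); `split_into_windows` is a hypothetical helper, not a library function.

```python
def split_into_windows(samples, window_size, pad_value=0.0):
    """Split samples into consecutive non-overlapping windows, padding the last one."""
    if window_size <= 0:
        raise ValueError('window_size must be positive')
    windows = []
    for start in range(0, len(samples), window_size):
        window = list(samples[start:start + window_size])
        # Fill a short final window up to window_size.
        window.extend([pad_value] * (window_size - len(window)))
        windows.append(window)
    return windows


print(split_into_windows([0.1, 0.2, 0.3, 0.4, 0.5], window_size=2))
```

A clip whose length is not a multiple of the window size therefore produces one extra window that is partly silence.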
You'll loop over all the split audio and apply the model for each one of them.
The model you've just trained has two outputs: the original YAMNet output and the output of the head you've just trained. This matters because real-world audio contains much more than bird sounds. You can use the YAMNet output to filter out irrelevant audio. In the birds use case, for example, if YAMNet is not classifying a window as Birds or Animals, your model's output for that window may be an irrelevant classification.
Both outputs are printed below to make their relationship easier to see. Most of your model's mistakes occur when YAMNet's prediction is not related to your domain (e.g., birds).
print(random_audio)

results = []
print('Result of the window ith: your model class -> score, (spec class -> score)')
for i, data in enumerate(split_audio_data):
    # The serving model returns the YAMNet scores and the custom head's scores.
    yamnet_output, inference = serving_model(data)
    results.append(inference[0].numpy())
    result_index = tf.argmax(inference[0])
    spec_result_index = tf.argmax(yamnet_output[0])
    result_str = f'Result of the window {i}: ' \
        f'\t{test_data.index_to_label[result_index]} -> {inference[0][result_index].numpy():.3f}, ' \
        f'\t({spec._yamnet_labels()[spec_result_index]} -> {yamnet_output[0][spec_result_index]:.3f})'
    print(result_str)

# Average the custom head's scores over all windows to get one prediction per file.
results_np = np.array(results)
mean_results = results_np.mean(axis=0)
result_index = mean_results.argmax()
print(f'Mean result: {test_data.index_to_label[result_index]} -> {mean_results[result_index]}')
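One way to act on the two outputs is to average your model's scores only over windows that YAMNet considers relevant. The sketch below is a hypothetical post-processing step, not part of Model Maker: the per-window data is made up, and the label names in `RELEVANT` are assumptions that should be checked against the YAMNet class map.

```python
def mean_scores(score_rows):
    """Element-wise mean of equally sized score lists."""
    count = len(score_rows)
    return [sum(column) / count for column in zip(*score_rows)]


def classify_clip(window_results, relevant_labels):
    """Average custom-head scores over windows whose YAMNet label is relevant.

    window_results: list of (yamnet_label, custom_scores) pairs, one per window.
    Returns (best_index, mean_scores), or (None, []) if no window is relevant.
    """
    kept = [scores for label, scores in window_results if label in relevant_labels]
    if not kept:
        return None, []
    means = mean_scores(kept)
    best_index = max(range(len(means)), key=means.__getitem__)
    return best_index, means


# Hypothetical per-window outputs: (YAMNet top label, scores from the custom head).
RELEVANT = {'Bird', 'Animal'}  # assumed label names
windows = [
    ('Bird', [0.3, 0.6, 0.1]),
    ('Speech', [0.98, 0.01, 0.01]),
    ('Animal', [0.3, 0.5, 0.2]),
]
best_index, means = classify_clip(windows, RELEVANT)
print(f'Best class index: {best_index}, mean scores: {means}')
```

In this made-up example, the window labeled as speech would otherwise dominate the plain average and change the predicted class, so dropping it keeps the result tied to the windows that actually contain animal sounds.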
Exporting the model
The last step is exporting your model to be used on embedded devices or on the browser.
The export method exports both formats for you.
models_path = './birds_models'
print(f'Exporting the TFLite model to {models_path}')

model.export(models_path, tflite_filename='my_birds_model.tflite')
You can also export the SavedModel version for serving or for use in a Python environment.
Last updated 2026-05-28 UTC.