A ResNet-based Language Identification using Spectrograms
From scratch — for beginners in the field!
Part IV : Employing the transfer learning technique with a CNN to classify languages
We have peeled back the different layers of our project and reached its core: developing a model that distinguishes languages with the help of our Spectrogram dataset. Minute pattern differences in the Spectrogram images, which humans cannot easily perceive, can be recognised and learnt in no time by a machine, as you will witness in the process below.
This is the final Part of a 4-Part series —
(i) Creating your own dataset for Data Science and AI
(ii) Speech Data Exploration through Spectrograms
(iii) Audio Spectrogram Generation and
(iv) A ResNet-based Language Identification using Spectrograms.
The series will guide you to construct your own Deep Learning Project with an in-depth understanding.
We shall employ the technique of transfer learning for our use case here. Transfer learning is the method of reusing a model already trained on one machine learning problem for other, similar problems.
The PyTorch library we have been using so far houses many such pre-trained Torchvision models which can be used on similar image classification problems, and it provides an easy interface through which we can use the trained models in our project. Of the many available models, ResNet18 is used in our classification problem. You can also experiment with other pre-trained models or develop a new neural network model altogether.
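For instance, loading a pre-trained model takes a single line (pretrained=True is the older torchvision API used here; newer versions use a weights argument):

```python
import torchvision.models as models

# Download ResNet18 with weights pre-trained on ImageNet.
resnet18 = models.resnet18(pretrained=True)
print(resnet18.fc)  # the final fully connected layer, replaced later for 6 classes
```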
You can find the paper discussing ResNets in depth here.
As always our modus operandi is to approach the solution one step at a time.
Step 1 : Downloading the dataset
The model building and execution were done on a Kaggle kernel, chiefly to exploit its GPU availability. The GPU Kaggle offers makes it easier and faster for the model to converge to results.
1. Import the required libraries
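A representative set of imports for this notebook might look like the following (not exhaustive; adjust to your own cells):

```python
import os
import tarfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as tt
from PIL import Image
from torch.utils.data import Dataset, DataLoader
```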
2. Obtain the file ID of the gzipped dataset file from Google Drive. To do this, we’ll have to locate the file we want to download in Google Drive.
Now that we have located the file ‘lid-ds-spec.tar.gz’ that we want to download, right click on it and click on ‘Get shareable link’.
The file ID can be obtained from the shareable link: it is the portion that spans between ‘/d/’ and ‘/view’, as in https://drive.google.com/file/d/<FILE_ID>/view?usp=sharing.
3. Download the file into Kaggle from Google Drive
The file ID we obtained in the previous step is passed in the file_id parameter, while the root parameter specifies the directory into which the file is downloaded. When ‘.’ is specified as root, Kaggle’s default download path, /kaggle/working, is used. The file will be available in that working directory once the download completes.
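The parameters described above match torchvision’s built-in Google Drive helper; a minimal sketch (replace <FILE_ID> with the ID copied from your link):

```python
from torchvision.datasets.utils import download_file_from_google_drive

# Download the dataset archive from Google Drive into /kaggle/working.
download_file_from_google_drive(file_id='<FILE_ID>',
                                root='.',
                                filename='lid-ds-spec.tar.gz')
```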
4. Extract the contents of the zipped file into ‘./data’.
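This can be done with Python’s standard tarfile module:

```python
import tarfile

# Extract the gzipped tar archive into './data'.
with tarfile.open('./lid-ds-spec.tar.gz', 'r:gz') as tar:
    tar.extractall(path='./data')
```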
Step 2 : Split the entire dataset into Train, Validation and Test data
In order to train and test the model successfully, we need to split the dataset into train, validation and test sets. The training data is used to fit the model. The validation set is used during training to evaluate the model, tune its hyper-parameters and adjust its course of learning. Once the model is fully trained, the test data gives a final measure of the model’s performance.
1. Create a dataframe for enabling the split of data
To help with the dataset split, we first create a dataframe ‘data_df’ holding the image names and their corresponding labels, covering all the image data under ‘./data/image-lid/train’.
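A minimal sketch, assuming each file name encodes its language label (e.g. ‘de_0001.png’); adapt the parsing to your own naming scheme:

```python
import os
import pandas as pd

DATA_DIR = './data/image-lid/train'

# Build a dataframe of image file names and their language labels.
files = os.listdir(DATA_DIR)
data_df = pd.DataFrame({
    'image': files,
    'label': [f.split('_')[0] for f in files],  # assumed naming scheme
})
print(data_df.head())
```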
2. Prepare the dataframes with the split image names
The split function from the NumPy library can be used to create the train, validation and test dataframes, as shown below. ‘df.sample(frac=1)’ shuffles the dataframe, and the index list passed to np.split then divides it into a 60% train set, a 20% validation set and a 20% test set. Our input data of 30,000 images is thus split into 18,000 train, 6,000 validation and 6,000 test images.
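A sketch of the split:

```python
import numpy as np

# Shuffle, then split 60/20/20 at indices 0.6*len and 0.8*len.
train_df, val_df, test_df = np.split(
    data_df.sample(frac=1, random_state=42),
    [int(0.6 * len(data_df)), int(0.8 * len(data_df))],
)
print(len(train_df), len(val_df), len(test_df))  # 18000 6000 6000
```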
Step 3 : Augment the training and validation data
We introduce data augmentation at this point of data preparation. Without variation in the data, the model might resort to rote learning (overfitting), and we do not want that. Data augmentation introduces variability into the input data: it essentially provides different, randomly altered versions of each image. It is also useful when little data is available.
Torchvision’s transforms.Compose method chains together all the transforms we intend to perform on the images. Resizing, horizontal flipping and random erasing are the transforms chosen for the training data; you can experiment with the other transformation techniques available here to find which combination yields better results. ToTensor converts an image to a tensor, a multi-dimensional matrix of elements of a single data type, which is what the model operates on. We also apply Normalize, which normalises the tensors with a mean and standard deviation. As we are using the pre-trained ResNet18, we use the stats (mean and standard deviation) of the dataset it was trained on, ImageNet. You can find the default stats here.
Note that the validation dataset has only the Resize, ToTensor and Normalize transforms applied, since it is used only for evaluation and not for training; only the transforms needed to keep the validation data consistent with the training data are performed on it. Both pipelines are sketched below.
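A sketch of both transform pipelines, assuming a 224×224 input size (the sizes and the exact transform list are choices you can tweak; RandomErasing works on tensors, so it is placed after ToTensor):

```python
import torchvision.transforms as tt

# Mean and standard deviation of ImageNet, on which ResNet18 was pre-trained.
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

train_tfms = tt.Compose([
    tt.Resize((224, 224)),
    tt.RandomHorizontalFlip(),
    tt.ToTensor(),                 # PIL image -> float tensor in [0, 1]
    tt.Normalize(*imagenet_stats),
    tt.RandomErasing(),            # tensor-level transform, hence last
])

valid_tfms = tt.Compose([
    tt.Resize((224, 224)),
    tt.ToTensor(),
    tt.Normalize(*imagenet_stats),
])
```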
Step 4 : Load the train and validation data into GPU memory
As we have already discussed, training the model is processor heavy, so we have to load the data into Kaggle’s GPU memory for efficient performance.
The helper class LanguageIdentifyDataset accepts a dataframe and, for the image names in it, applies the aforementioned transforms. Each image is opened using the Image module from Pillow before being transformed. You can find more information about the definition and structure of a class here.
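A minimal sketch of such a dataset class (the original implementation may differ in detail):

```python
import os
from PIL import Image
from torch.utils.data import Dataset

classes = sorted(data_df['label'].unique())   # the six language labels

class LanguageIdentifyDataset(Dataset):
    """Opens each image listed in the dataframe and applies the transforms."""
    def __init__(self, df, data_dir, transform):
        self.df, self.data_dir, self.transform = df, data_dir, transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(os.path.join(self.data_dir, row['image'])).convert('RGB')
        return self.transform(img), classes.index(row['label'])

train_ds = LanguageIdentifyDataset(train_df, './data/image-lid/train', train_tfms)
val_ds = LanguageIdentifyDataset(val_df, './data/image-lid/train', valid_tfms)
```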
Once we have our transformed datasets train_ds and val_ds, we can prepare batches of data to load into the GPU. The torch.utils.data.DataLoader class comes in handy for this purpose: it produces batches of data and loads them in parallel using multiprocessing workers. The batch_size argument sets the number of images in every batch, and shuffle is set to True for the training data so that our model learns from as many different combinations of data as possible.
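For example (a batch size of 128 is an assumption; tune it to fit your GPU memory):

```python
from torch.utils.data import DataLoader

batch_size = 128  # assumed value; adjust to your GPU

train_dl = DataLoader(train_ds, batch_size, shuffle=True,
                      num_workers=2, pin_memory=True)
val_dl = DataLoader(val_ds, batch_size, num_workers=2, pin_memory=True)
```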
We can now load the batched data into GPU memory using another helper class, DeviceDataLoader.
Before we can do that, we have to make sure that the GPU environment has been turned on in the notebook settings. If it is on, the function get_default_device will return ‘cuda’.
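A common implementation of these helpers (the author’s version may differ slightly):

```python
import torch

def get_default_device():
    """Pick the GPU if available, else the CPU."""
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def to_device(data, device):
    """Move tensor(s) to the chosen device."""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader:
    """Wrap a DataLoader so each batch is moved to the device on the fly."""
    def __init__(self, dl, device):
        self.dl, self.device = dl, device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)

device = get_default_device()
train_dl = DeviceDataLoader(train_dl, device)
val_dl = DeviceDataLoader(val_dl, device)
```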
With this, batches of train_dl and val_dl are loaded into the GPU memory and await execution.
Step 5: Define the ResNet model
Before defining the ResNet model itself, we must first define a few functions used to train the image batches.
The user-defined class MulticlassImageClassificationBase inherits from nn.Module, which is the base class for all neural network models in PyTorch.
The class contains three important methods, training_step, validation_step and validation_epoch_end, which define what is to be done during training, during validation and at the end of every epoch respectively.
training_step takes in a batch of training data, applies whatever model we define and stores the output predictions in ‘out’. The loss between the predicted outputs and the true labels is then calculated using cross entropy.
Cross entropy quantifies the difference between two probability distributions. Though there are several loss functions, cross entropy is used here because ours is a multi-class classification problem.
validation_step performs the same steps, but on a validation batch. An additional accuracy calculation is performed, with which we can see whether accuracy improves with each epoch during validation. The code below defines the method to calculate accuracy.
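A standard implementation:

```python
import torch

def accuracy(outputs, labels):
    # Fraction of predictions whose highest-scoring class matches the label.
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))
```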
In validation_epoch_end, the per-batch losses and accuracies are stacked and averaged at the end of every epoch.
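Putting the three methods together, the base class might look like this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MulticlassImageClassificationBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                       # forward pass
        return F.cross_entropy(out, labels)      # training loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        return {'val_loss': loss.detach(), 'val_acc': accuracy(out, labels)}

    def validation_epoch_end(self, outputs):
        # Stack the per-batch losses and accuracies and average them.
        losses = torch.stack([x['val_loss'] for x in outputs]).mean()
        accs = torch.stack([x['val_acc'] for x in outputs]).mean()
        return {'val_loss': losses.item(), 'val_acc': accs.item()}
```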
The next important function to define is fit_one_cycle, which lays out the order in which the model is executed in each epoch, along with the respective hyper-parameters. A sketch of the function follows the explanation below.
The main work is done by model.training_step, which operates as explained above. The loss is calculated at the end of each batch, and gradients are computed accordingly by loss.backward().
During this computation the gradients can grow beyond a certain limit and cause a condition known as exploding gradients. This is mitigated by gradient clipping: any gradient whose norm exceeds a pre-defined threshold is scaled back down to it.
The weights are then updated based on the gradients, moving towards the minimum of the loss. Like the gradients, the weights can also end up growing too large. To penalise this effect, we employ weight decay, which shrinks the weights slightly at every update (multiplying them by a factor just below 1) to contain them.
Another important hyper-parameter handled here is the learning rate: the step size of the changes applied to the model at each update. The importance of this factor is detailed here. There are several approaches to scheduling the learning rate; we follow the one-cycle learning rate scheduler, so the learning rate is decreased as the learning starts converging over time.
opt_func is the optimiser function, called with the model parameters, the maximum learning rate for our one-cycle scheduler and the weight decay. At the end of each training epoch, validation is also run using the validation_step defined above to obtain intermediate results after every epoch.
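A sketch of fit_one_cycle together with an evaluate helper, following the common one-cycle training recipe (details such as the default arguments are assumptions):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def evaluate(model, val_loader):
    model.eval()
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit_one_cycle(epochs, max_lr, model, train_loader, val_loader,
                  weight_decay=0, grad_clip=None, opt_func=torch.optim.Adam):
    history = []
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # One-cycle schedule: the learning rate rises, then decays as training converges.
    sched = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            if grad_clip is not None:
                # Gradient clipping: rescale gradients whose norm exceeds the threshold.
                nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()
            sched.step()
        result = evaluate(model, val_loader)   # validate after every epoch
        history.append(result)
        print(f"Epoch {epoch}: {result}")
    return history
```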
Once these methods are created, we can define our model, the LidModel class. It inherits from the class explained above, MulticlassImageClassificationBase.
This class specifies the network to be used. As we saw earlier, we use ResNet18, which gave better results than the other ResNet models for our particular use case. The network produces num_ftrs output features, depending on its hidden layers. Since we have only six output language classes, we replace the final fully connected layer with a linear layer that maps num_ftrs down to 6. The sigmoid applied in the forward method squashes the outputs of self.network into non-negative values.
In this project we also use a tactic called freezing: while the model is frozen, the pre-trained parameters are kept fixed and only the newly added layer is trained. When we unfreeze the model, all the parameters become trainable again. This method is applied to fine-tune our model; a sketch of the class with freeze/unfreeze is below.
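A sketch of the model class (the sigmoid in forward follows the description above):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LidModel(MulticlassImageClassificationBase):
    def __init__(self, num_classes=6):
        super().__init__()
        self.network = models.resnet18(pretrained=True)
        num_ftrs = self.network.fc.in_features
        # Replace the final fully connected layer: num_ftrs -> 6 languages.
        self.network.fc = nn.Linear(num_ftrs, num_classes)

    def forward(self, xb):
        # Squash the outputs into (0, 1), i.e. non-negative values.
        return torch.sigmoid(self.network(xb))

    def freeze(self):
        # Keep the pre-trained weights fixed; train only the new final layer.
        for param in self.network.parameters():
            param.requires_grad = False
        for param in self.network.fc.parameters():
            param.requires_grad = True

    def unfreeze(self):
        # Make every layer trainable again for fine-tuning.
        for param in self.network.parameters():
            param.requires_grad = True
```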
Step 6: Load and evaluate the model with validation dataset
The model itself must also be loaded into GPU memory, so that the data batches can be trained with it. This is done similarly to the data loading, using the piece of code below.
We can calculate an initial loss and accuracy by evaluating the untrained model on our validation data.
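Using the helpers defined earlier:

```python
# Instantiate the model and move its weights to the GPU.
model = to_device(LidModel(), device)
```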
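For example:

```python
# Evaluate the untrained model once to get a baseline loss and accuracy.
history = [evaluate(model, val_dl)]
print(history)
```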
As we can see, we obtain an initial accuracy of about 13%.
Step 7: Model learning
Now that all the functions have been defined, training the model is just a matter of calling the fit function with our chosen hyper-parameters.
The training can be repeated with varying hyper-parameters and with the freeze/unfreeze technique applied. In our case, after 18 epochs with different hyper-parameters, the accuracy converged to 90%. The score changes can be easily visualised with a line plot.
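A sketch of the training calls; the hyper-parameter values here are illustrative, not the exact ones used in the original runs:

```python
epochs = 6
max_lr = 0.001
grad_clip = 0.1
weight_decay = 1e-4

model.freeze()    # first train only the new final layer
history += fit_one_cycle(epochs, max_lr, model, train_dl, val_dl,
                         weight_decay=weight_decay, grad_clip=grad_clip)

model.unfreeze()  # then fine-tune the whole network with a smaller learning rate
history += fit_one_cycle(epochs, max_lr / 10, model, train_dl, val_dl,
                         weight_decay=weight_decay, grad_clip=grad_clip)
```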
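For instance:

```python
import matplotlib.pyplot as plt

# Plot validation accuracy against the number of epochs.
accuracies = [r['val_acc'] for r in history]
plt.plot(accuracies, '-x')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy vs. no. of epochs')
plt.show()
```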
Step 8: Verify the results
Once training and validation are complete, we test our trained model on the test data. The test data is transformed in the same way as the validation data and loaded into GPU memory.
As we can see, the trained model gives 90% accuracy on the test data. By fine-tuning the model even further, we could obtain better accuracies; however, that is beyond the scope of this project.
We can view a sample image from the test dataset with its actual and predicted labels below.
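Reusing the validation transforms and loader helpers from earlier:

```python
from torch.utils.data import DataLoader

# Transform the test split like the validation data and load it onto the GPU.
test_ds = LanguageIdentifyDataset(test_df, './data/image-lid/train', valid_tfms)
test_dl = DeviceDataLoader(
    DataLoader(test_ds, batch_size, num_workers=2, pin_memory=True), device)

print(evaluate(model, test_dl))
```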
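A sketch of a single-image prediction helper:

```python
import matplotlib.pyplot as plt
import torch

def predict_image(img, model, classes):
    # Add a batch dimension, run the model and pick the most likely class.
    xb = to_device(img.unsqueeze(0), device)
    preds = model(xb)
    _, idx = torch.max(preds, dim=1)
    return classes[idx.item()]

img, label = test_ds[0]
plt.imshow(img.permute(1, 2, 0).clamp(0, 1))   # reorder CxHxW -> HxWxC for display
print('Actual:', classes[label],
      '| Predicted:', predict_image(img, model, classes))
```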
Step 9 : Save the model and hyper-parameters
Jovian is a code-hosting platform with versioning and many other features. The code for this project is hosted there, and the platform also allows logging the hyper-parameters and results of every model version.
We can therefore save our model weights and log the hyper-parameter values and results using the code below.
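A sketch using the jovian library; the file name and the logged keys are illustrative:

```python
import jovian
import torch

# Save the trained weights (file name is illustrative).
torch.save(model.state_dict(), 'lid-resnet18.pth')

# Log hyper-parameters and final results for this model version.
jovian.log_hyperparams({'arch': 'resnet18', 'epochs': epochs, 'max_lr': max_lr,
                        'grad_clip': grad_clip, 'weight_decay': weight_decay})
jovian.log_metrics({'val_acc': history[-1]['val_acc'],
                    'val_loss': history[-1]['val_loss']})
jovian.commit(project='lid-resnet', outputs=['lid-resnet18.pth'])
```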
That’s all, folks!! Follow the above steps and you will have successfully created your own Deep Learning project.
The link to all my Jovian notebooks can be found here. In case of queries and feedback, reach me here. Hope this helps!