Using CNNs and RNNs for Music Genre Recognition

This model passes the input spectrogram through CNN and RNN branches in parallel, concatenates their outputs, and sends the result through a dense layer with softmax activation to perform the classification, as shown below.

Parallel CNN-RNN Model

The convolutional block of the model consists of a 2D convolution layer followed by a 2D max-pooling layer. This is in contrast to the CRNN model, which uses 1D convolution and max-pooling layers. There are 5 such convolution/max-pooling blocks, and the final output is flattened into a tensor of shape (None, 256).

The recurrent block starts with a 2D max-pooling layer of pool size (4, 2) to reduce the size of the spectrogram before the recurrent operation; this feature reduction was done primarily to speed up processing. The reduced image is fed into a bidirectional GRU with 64 units, whose output is a tensor of shape (None, 128).

The outputs from the convolutional and recurrent blocks are then concatenated, resulting in a tensor of shape (None, 384). Finally, a dense layer with softmax activation produces the class probabilities.

The model was trained using the RMSProp optimizer with a learning rate of 0.0005, and the loss function was categorical cross-entropy. Training ran for 50 epochs, and the learning rate was reduced whenever the validation accuracy plateaued for at least 10 epochs.

The figure below shows the loss and accuracy curves from this model, which reached a validation accuracy of around 51%. Both models have very similar overall accuracies, which is quite interesting, but their class-wise performance is very different: the parallel CNN-RNN model performs better on the Experimental, Folk, Hip-Hop, and Instrumental genres. Ensembling the two models should therefore produce even better results.

One question I asked myself was why the accuracy is only around 51%.
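The architecture above can be sketched in Keras as follows. The input spectrogram size (128, 512, 1), the per-block filter counts, the kernel sizes, and the pool sizes are assumptions chosen so that the branch output shapes match those quoted in the text ((None, 256) and (None, 128), concatenated to (None, 384)); the number of genres (8) is likewise an assumption, not something the post states exactly.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Mel spectrogram input; the (128, 512, 1) shape is an assumption.
inputs = keras.Input(shape=(128, 512, 1))

# Convolutional block: five Conv2D + MaxPooling2D pairs, then Flatten.
# Filter counts and pool sizes are assumed; they shrink the 128x512 input
# down to 1x4x64 so that flattening yields 256 features.
x = inputs
for filters, pool in [(16, (2, 2)), (32, (2, 2)), (64, (2, 2)),
                      (64, (4, 4)), (64, (4, 4))]:
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool)(x)
cnn_out = layers.Flatten()(x)                       # -> (None, 256)

# Recurrent block: (4, 2) max pooling to shrink the spectrogram first,
# then a bidirectional GRU with 64 units (64 per direction -> 128).
y = layers.MaxPooling2D(pool_size=(4, 2))(inputs)   # -> (None, 32, 256, 1)
y = layers.Reshape((32, 256))(y)                    # time steps x features
rnn_out = layers.Bidirectional(layers.GRU(64))(y)   # -> (None, 128)

# Concatenate both branches and classify with a softmax dense layer.
merged = layers.Concatenate()([cnn_out, rnn_out])   # -> (None, 384)
outputs = layers.Dense(8, activation="softmax")(merged)

model = keras.Model(inputs, outputs)
```

Running the two branches in parallel lets the dense layer see both the local time-frequency patterns picked up by the convolutions and the longer-range temporal structure summarized by the GRU.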
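The training setup described above can be sketched like this. The tiny Sequential model and random data are stand-ins so the snippet is self-contained; the reduction factor of the learning-rate callback is an assumption, as the post only states that the rate was reduced on a 10-epoch validation-accuracy plateau.

```python
import numpy as np
from tensorflow import keras

# Stand-in model; the real one is the parallel CNN-RNN network above.
model = keras.Sequential([
    keras.layers.Input(shape=(16,)),
    keras.layers.Dense(8, activation="softmax"),
])

# RMSProp at a learning rate of 0.0005 with categorical cross-entropy,
# as described in the text.
model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=0.0005),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Reduce the learning rate when validation accuracy plateaus for at
# least 10 epochs; the factor of 0.5 is an assumed value.
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_accuracy",
    factor=0.5,
    patience=10,
)

# Dummy data just to exercise the pipeline; real inputs are spectrograms,
# and the post trained for 50 epochs rather than the 2 used here.
x = np.random.rand(64, 16).astype("float32")
y = keras.utils.to_categorical(np.random.randint(0, 8, 64), 8)
history = model.fit(x, y, validation_split=0.25, epochs=2,
                    callbacks=[reduce_lr], verbose=0)
```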
