INTRODUCTION:
Music classification has become an essential task in the digital era, where streaming platforms and music libraries continue to grow rapidly. Traditional approaches, such as manual tagging and metadata-based categorization, are inefficient, subjective, and fail to scale to diverse and multilingual datasets. This creates a pressing need for automated, intelligent systems capable of classifying music with high accuracy and efficiency.

Over the years, researchers have explored various machine learning techniques for Music Information Retrieval (MIR). Early models, including k-Nearest Neighbors (k-NN), Support Vector Machines (SVMs), and Decision Trees, relied heavily on handcrafted features such as tempo, pitch, and rhythm. While these methods achieved moderate success, they lacked robustness when applied to large, complex, real-world datasets.

More recently, deep learning has transformed audio analysis by automatically learning hierarchical representations from raw audio. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been widely adopted for music classification. RNNs in particular are well suited to sequential data such as audio signals: they capture temporal dependencies in rhythm, melody, and harmony, offering improved genre recognition compared to static models.

The objective of this research is to design and implement a music genre classification system using RNNs integrated with feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCCs), chroma, and spectral analysis. The system is deployed as a web-based application using Flask, enabling users to upload audio tracks and receive real-time predictions. The study aims to enhance classification accuracy, scalability, and usability while addressing the limitations of traditional methods.
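To make the MFCC feature extraction step concrete, the following is a minimal NumPy-only sketch of the standard MFCC pipeline (framing, windowing, power spectrum, mel filterbank, log compression, DCT). It is an illustration, not the system's actual implementation, which would typically use a dedicated audio library such as librosa; the frame size, hop length, filter count, and the 440 Hz test tone are all illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=22050, n_fft=512, hop=256, n_mels=26, n_coeffs=13):
    # Frame the signal, apply a Hamming window, take the power spectrum.
    frames = [signal[s:s + n_fft] * np.hamming(n_fft)
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), n_fft)) ** 2 / n_fft
    # Mel-filter, log-compress, then apply a DCT-II to decorrelate.
    mel_energies = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n + 0.5)[None, :]
                 * np.arange(n_coeffs)[:, None])
    return mel_energies @ dct.T  # shape: (n_frames, n_coeffs)

# Example: MFCCs of a one-second 440 Hz test tone.
sr = 22050
t = np.linspace(0.0, 1.0, sr, endpoint=False)
feats = mfcc(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(feats.shape)  # one 13-coefficient vector per frame
```

A sequence of such frame-level vectors is exactly the kind of input an RNN consumes, one time step per frame.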
LITERATURE REVIEW:
Tzanetakis and Cook (2002) pioneered automatic music genre classification using timbral, rhythmic, and pitch-based features such as MFCCs, spectral centroid, and zero-crossing rate. They developed the GTZAN dataset, which became a standard benchmark for later studies. This research demonstrated that combining low-level audio features with statistical classifiers could achieve efficient and objective music categorization, marking a significant milestone in the early stages of Music Information Retrieval (MIR).

Li, Ogihara, and Li (2003) enhanced classification performance by implementing Support Vector Machines (SVMs) and feature fusion methods. Their work showed that SVMs outperform traditional models such as k-NN and Decision Trees in handling multidimensional feature spaces, resulting in improved accuracy and better generalization across datasets. This research established SVM as a strong baseline for music genre recognition in the early 2000s.

Bergstra and colleagues (2006) introduced ensemble learning techniques such as AdaBoost and Random Forests for genre recognition. Their research emphasized the advantage of combining multiple weak learners to reduce overfitting and improve model generalization. This approach helped achieve more stable results across varying datasets and inspired further exploration into ensemble and hybrid learning techniques in audio classification.

Panagakis, Kotropoulos, and Arce (2009) proposed Sparse Representation-based Classification (SRC) for audio signals, focusing on the robustness of music classification under noisy or overlapping genre conditions. This method effectively captured discriminative representations of musical timbre and rhythm, providing a more noise-resistant approach for MIR systems. Their study contributed to the shift toward more efficient and robust feature representation models.
Dieleman and Schrauwen (2014) marked a major transition from traditional machine learning to deep learning by applying Convolutional Neural Networks (CNNs) directly on spectrograms. Their end-to-end framework learned hierarchical audio patterns automatically, removing the dependency on handcrafted feature extraction. This work proved that CNNs could successfully learn both timbral and temporal patterns in raw audio data, influencing future studies in music and audio processing.

Choi, Fazekas, Sandler, and Cho (2017) further advanced deep learning applications in music classification by combining CNN and RNN layers. Their Convolutional Recurrent Neural Network (CRNN) captured both spatial and temporal dependencies within music signals, enabling superior accuracy and context-aware classification. The CRNN model achieved state-of-the-art performance on multi-genre datasets and became a foundation for modern real-time genre recognition systems.

Dhakal, Rahman, and Kalita (2020) implemented transfer learning using pretrained CNN architectures such as VGG16 and ResNet50 to classify music genres efficiently. Their approach demonstrated that leveraging pretrained models significantly reduces training time while maintaining high accuracy. This innovation bridged the gap between limited dataset availability and high-performance models, making genre classification more accessible for research and deployment.

Pathak and Singh (2022) proposed a hybrid CNN-LSTM architecture that combined convolutional and recurrent layers to classify multilingual and regional music genres effectively. Their model captured both local feature hierarchies and long-term temporal dependencies, improving classification for culturally diverse datasets. This study was particularly significant for Indian and global multilingual music systems.

Jha and Kumar (2023) designed a real-time, web-based music genre classification system using TensorFlow and Flask. Their model integrated deep learning with a user-friendly web interface, allowing instant genre prediction for uploaded audio files. This work demonstrated the practical implementation of deep learning in real-world applications, bridging the gap between research and user-interactive systems.
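A Flask deployment of the kind described by Jha and Kumar (2023), and proposed for the present system, can be sketched as a single upload-and-predict endpoint. This is a hedged illustration: the `/predict` route, the genre label list, and the stubbed `predict_genre` function are hypothetical stand-ins for the trained RNN and its feature extraction step, which the actual system would invoke here.

```python
from flask import Flask, request, jsonify

# Hypothetical label set (the ten GTZAN genres are assumed for illustration).
GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def predict_genre(audio_bytes):
    # Placeholder: the real system would extract MFCC/chroma/spectral
    # features from the upload and run the trained RNN here.
    return GENRES[len(audio_bytes) % len(GENRES)]

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a multipart form upload under the field name "file".
    if "file" not in request.files:
        return jsonify(error="no audio file uploaded"), 400
    audio = request.files["file"].read()
    return jsonify(genre=predict_genre(audio))

# To serve locally: app.run(debug=True)
```

A client would POST an audio file to `/predict` and receive a JSON genre label in response, which is the real-time interaction pattern the deployed web application provides.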
METHODOLOGY:
Swati Badachi*
Dayanand Savakar
Padma Yadahalli
10.5281/zenodo.17328334