Deep Spectrogram Learning for Gunshot Classification: A Comparative Study of CNN Architectures and Time-Frequency Representations

Bibliographic Details
Published in: Journal of Imaging vol. 11, no. 8 (2025), p. 281-307
Main Author: Doungpaisan, Pafan
Other Authors: Khunarsa, Peerapol
Published: MDPI AG
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3244042760
003 UK-CbPIL
022 |a 2313-433X 
024 7 |a 10.3390/jimaging11080281  |2 doi 
035 |a 3244042760 
045 2 |b d20250101  |b d20251231 
100 1 |a Doungpaisan, Pafan  |u Faculty of Industrial Technology and Management, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand; pafan.d@itm.kmutnb.ac.th 
245 1 |a Deep Spectrogram Learning for Gunshot Classification: A Comparative Study of CNN Architectures and Time-Frequency Representations 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a Gunshot sound classification plays a crucial role in public safety, forensic investigations, and intelligent surveillance systems. This study evaluates the performance of deep learning models in classifying firearm sounds by analyzing twelve time–frequency spectrogram representations, including Mel, Bark, MFCC, CQT, Cochleagram, STFT, FFT, Reassigned, Chroma, Spectral Contrast, and Wavelet. The dataset consists of 2148 gunshot recordings from four firearm types, collected in a semi-controlled outdoor environment under multi-orientation conditions. To leverage advanced computer vision techniques, all spectrograms were converted into RGB images using perceptually informed colormaps. This enabled the application of image processing approaches and fine-tuning of pre-trained Convolutional Neural Networks (CNNs) originally developed for natural image classification. Six CNN architectures—ResNet18, ResNet50, ResNet101, GoogLeNet, Inception-v3, and InceptionResNetV2—were trained on these spectrogram images. Experimental results indicate that CQT, Cochleagram, and Mel spectrograms consistently achieved high classification accuracy, exceeding 94% when paired with deep CNNs such as ResNet101 and InceptionResNetV2. These findings demonstrate that transforming time–frequency features into RGB images not only facilitates the use of image-based processing but also allows deep models to capture rich spectral–temporal patterns, providing a robust framework for accurate firearm sound classification. 
651 4 |a United States--US 
653 |a Accuracy 
653 |a Datasets 
653 |a Deep learning 
653 |a Wavelet transforms 
653 |a Law enforcement 
653 |a Color imagery 
653 |a Racial profiling 
653 |a Artificial neural networks 
653 |a Audio recordings 
653 |a Public safety 
653 |a Small arms 
653 |a Computer vision 
653 |a Automation 
653 |a Machine learning 
653 |a Image processing 
653 |a Representations 
653 |a Pattern recognition 
653 |a Sound 
653 |a Comparative studies 
653 |a Artificial intelligence 
653 |a Fourier transforms 
653 |a Spectrograms 
653 |a Time-frequency analysis 
653 |a Neural networks 
653 |a Classification 
653 |a Image classification 
653 |a Surveillance 
653 |a Gun violence 
653 |a Acoustics 
653 |a Murders & murder attempts 
653 |a Surveillance systems 
700 1 |a Khunarsa, Peerapol  |u Faculty of Science and Technology, Uttaradit Rajabhat University, Uttaradit 53000, Thailand 
773 0 |t Journal of Imaging  |g vol. 11, no. 8 (2025), p. 281-307 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3244042760/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3244042760/fulltextwithgraphics/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3244042760/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch
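
As a practical gloss on the pipeline described in the abstract (field 520), the Python sketch below illustrates the spectrogram-to-RGB step: compute a log-mel spectrogram and render it through a perceptually uniform colormap into a fixed-size RGB image suitable for a pretrained CNN. The file name, sample rate, mel-band count, and the choice of the "magma" colormap are illustrative assumptions, not details taken from the paper.

import numpy as np
import librosa
import matplotlib.cm as cm
from PIL import Image

def gunshot_to_rgb(wav_path: str, sr: int = 22050, size=(224, 224)) -> Image.Image:
    """Load a recording, compute a log-mel spectrogram, and map it to RGB.
    Parameter choices are illustrative, not taken from the paper."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # Normalize to [0, 1] so the colormap spans the full dynamic range.
    norm = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)

    # Apply a perceptually uniform colormap; drop the alpha channel.
    rgb = cm.magma(norm)[..., :3]  # shape: (n_mels, frames, 3), floats in [0, 1]

    # Flip so low frequencies sit at the bottom, the usual spectrogram orientation.
    rgb = np.flipud(rgb)
    img = Image.fromarray((rgb * 255).astype(np.uint8))

    # Resize to the CNN input resolution (224x224 for the ResNet family).
    return img.resize(size, Image.BILINEAR)

# Example: gunshot_to_rgb("shot_001.wav").save("shot_001.png")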
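
A companion sketch of the fine-tuning step the abstract describes: an ImageNet-pretrained ResNet18 (the smallest of the six architectures in the study) has its 1000-way classification head replaced with a four-way head for the four firearm types. The directory layout, batch size, learning rate, and epoch count are assumptions for illustration only.

import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 4  # four firearm types, per the abstract

# Standard ImageNet preprocessing applied to the RGB spectrogram images.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumes the spectrogram PNGs are sorted into one folder per class
# (a hypothetical layout, not described in the record).
train_set = datasets.ImageFolder("spectrograms/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Load pretrained weights and replace the 1000-way ImageNet head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Fine-tune end to end; epoch count is an arbitrary placeholder.
model.train()
for epoch in range(10):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()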