In this paper, we present a new technique to extract a noise-robust representation of speech signals called the spectrotemporal power spectrum. To take temporal information into account, the time differences of features of adjacent speech frames are appended to the initial features. Speaker emotion recognition based on speech features and classification techniques; article PDF available in International Journal of Computer Network and Information Security 7. In this work, we have investigated the performance of 2D Gabor features, known as spectrotemporal features, for speaker recognition.
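The appending of frame-to-frame time differences mentioned above can be illustrated with a minimal sketch. The central difference and the edge handling (repeating the first and last frame) are assumptions made for illustration; the source only states that differences of adjacent frames are appended. The function name `append_deltas` is hypothetical.

```python
def append_deltas(frames):
    """Append first-order time differences to each frame's static features.

    frames: list of equal-length lists, one list of floats per speech frame.
    Assumption: a central difference (next - previous) / 2 is used, and the
    first and last frames are repeated at the edges.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]       # repeat first frame at the start
        nxt = frames[min(t + 1, n - 1)]    # repeat last frame at the end
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(frames[t]) + delta)
    return out


# Three frames with two static features each.
static = [[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]]
augmented = append_deltas(static)
```

Each output frame has twice the original dimensionality: the static values followed by their time differences.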
As a starting point, the properties of LSTF features 1 are evaluated. Spectrotemporal refers most commonly to audition, where the neuron's response depends on frequency versus time, while spatiotemporal refers to vision, where the neuron's response depends on spatial location versus time. Gabor features have been used mainly for automatic speech recognition (ASR), where they have yielded improvements. As suggested by, the STRF can be effectively modelled by two-dimensional (2D) Gabor functions. This paper proposes a novel feature type for the recognition of emotion from speech. Methods for capturing spectrotemporal modulations in automatic speech recognition. Improved deep speaker feature learning for text-dependent. The speech databases that are used for the ASR experiments aiming at the analyses of either intrinsic or extrinsic factors in speech are presented in section 2. The corresponding power normalized spectrum (PNS) is then. Separable spectrotemporal Gabor filter bank features.
The PASCAL CHiME speech separation and recognition challenge, comput. Second, signals are converted to spectrotemporal Gabor features that resemble cortical speech representations and which have been shown to improve ASR in noisy conditions. Existing automatic speech recognition (ASR) systems use the spectral or temporal features of speech. The following page provides an overview of publications, including books, journal papers, conference proceedings as well as dissertations and research reports, published by researchers of the Fraunhofer Institute for Digital Media Technology IDMT. This paper presents results from recent studies utilizing spectrotemporal Gabor features for different tasks in automatic speech recognition 20, 21, 23 and is structured as follows. In contrast to the previously mentioned approaches and other models in the.
The microphone signal was fed to a custom-made, real-time song recognizer that detected the first stereotypic syllable of song motifs using a two. Methods for capturing spectrotemporal modulations in. First, triangular filters can be replaced with Gabor filters, a compactly supported. The resulting features showed a close resemblance to the STRFs of cortical neurons in the auditory system. Gabor analysis of auditory midbrain receptive fields.
Localized spectrotemporal features for automatic speech. On the relevance of auditory-based Gabor features for deep. Hierarchical spectrotemporal features for robust speech recognition, Xavier Domont 1,2, Martin Heckmann 1, Frank Joublin 1, Christian Goerick 1. Spectrotemporal Gabor filterbank features for acoustic event detection. For text-independent speaker identification, a prominent combination is to use Gaussian mixture models (GMM) for classification while relying on mel-frequency cepstral coefficients (MFCC) as features.
Noise robust automatic speech recognition based on. Spectrotemporal Gabor features for speaker recognition, Howard Lei, Bernd T. Automatic speech emotion recognition using machine. A hierarchical framework for spectrotemporal feature. Therefore, the purpose of this study is to assess the integration of sparse speech as a function of listener age, where the speech snippets are variously isolated in both the time and frequency domains, as well as in ear of presentation. As speech spectral peaks constitute the regions of high-SNR (signal-to-noise ratio) values in the speech spectrogram, we.
In 14,15, a 2D Gabor filter bank was applied to mel-spectrograms. A summary of features from the viewpoint of their physical interpretation. The spectrotemporal GBFB feature extraction incorporates a mel filterbank to mimic the frequency mapping of the basilar membrane (BM) in the inner ear. Spectrotemporal Gabor features based on auditory knowledge have improved word accuracy for automatic speech recognition in the presence of noise. Spectrotemporal analysis of speech using 2D Gabor filters. For further research please use our database Fraunhofer-Publica. Localized spectrotemporal Gabor features for automatic speech recognition: the STRF of cortical neurons and early auditory features.
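Applying a 2D Gabor filter to a mel-spectrogram, as described above, can be sketched as follows. This is an illustration under stated assumptions, not the filter bank of any cited work: the Gaussian envelope, the zero padding, the use of only the real part, and all parameter names (`omega_k`, `omega_n`, `sigma_k`, `sigma_n`) are choices made here for the example.

```python
import cmath
import math


def gabor_kernel(omega_k, omega_n, half_k, half_n, sigma_k, sigma_n):
    """Complex 2D Gabor kernel: Gaussian envelope times a complex sinusoid.

    omega_k / omega_n: spectral / temporal modulation frequencies
    (radians per channel / per frame). Assumption: Gaussian envelope.
    """
    kernel = []
    for k in range(-half_k, half_k + 1):          # spectral (channel) axis
        row = []
        for n in range(-half_n, half_n + 1):      # temporal (frame) axis
            envelope = math.exp(-0.5 * ((k / sigma_k) ** 2 + (n / sigma_n) ** 2))
            carrier = cmath.exp(1j * (omega_k * k + omega_n * n))
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_spectrogram(spec, kernel):
    """Correlate a spectrogram (channels x frames) with the real part of the
    kernel, using zero padding so the output has the same size as the input."""
    n_ch, n_fr = len(spec), len(spec[0])
    half_k, half_n = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * n_fr for _ in range(n_ch)]
    for c in range(n_ch):
        for t in range(n_fr):
            acc = 0.0
            for i, row in enumerate(kernel):
                cc = c + i - half_k
                if not 0 <= cc < n_ch:
                    continue
                for j, g in enumerate(row):
                    tt = t + j - half_n
                    if 0 <= tt < n_fr:
                        acc += spec[cc][tt] * g.real
            out[c][t] = acc
    return out


# Toy "mel-spectrogram": 5 channels x 7 frames with a single impulse.
spec = [[0.0] * 7 for _ in range(5)]
spec[2][3] = 1.0
kernel = gabor_kernel(omega_k=0.0, omega_n=math.pi / 2,
                      half_k=1, half_n=2, sigma_k=1.0, sigma_n=1.5)
response = filter_spectrogram(spec, kernel)
```

With a purely temporal modulation (`omega_k = 0`), the filter responds to changes along the time axis only; a full filter bank would repeat this over a grid of spectral and temporal modulation frequencies.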
Biomimetic multiresolution analysis for robust speaker. PDF: Speaker emotion recognition based on speech features. DNN-based speech recognition greatly benefits from spectrotemporal Gabor features. In the present report, we explore the use of a multiresolution analysis for robust speaker verification. Our representation is simple, effective, and computationally efficient. The resulting outputs of the Gabor filters were concatenated into two-dimensional vectors and used as features in speech recognition experiments. New features to improve speaker recognition efficiency. In order to improve the performance of automatic speech recognition (ASR) systems a number of di. A 9-frame temporal context is taken on the ETSI features along with their first-, second-, and third-order dynamic features, resulting in an input feature dimensionality of 468. Modelling, feature extraction and effects of clinical environment: a thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy, Sheeraz Memon B. Fundamentals of Speaker Recognition introduces speaker identification, speaker verification, speaker audio event classification, speaker detection, speaker tracking and more. Informative spectrotemporal bottleneck features for noise.
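The 468-dimensional input mentioned above is consistent with stacking 9 frames of 52 values each (468 = 9 x 52), where 52 would be static features plus three orders of dynamic features (4 x 13). The figure of 13 static coefficients per frame is an assumption inferred from this arithmetic and is not stated in the source. A minimal sketch of the context stacking, with a hypothetical helper `stack_context`:

```python
def stack_context(frames, context=9):
    """Concatenate each frame with its neighbours in a symmetric window.

    frames: list of equal-length feature lists, one per frame.
    Assumption: edge frames are repeated where the window leaves the signal.
    """
    half = context // 2
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - half, t + half + 1):
            window.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(window)
    return stacked


N_STATIC = 13   # assumption: static coefficients per frame
N_ORDERS = 4    # static + first-, second-, third-order dynamic features
per_frame = N_STATIC * N_ORDERS

# Dummy feature frames; frame t is filled with the value t for easy checking.
frames = [[float(t)] * per_frame for t in range(20)]
stacked = stack_context(frames, context=9)
input_dim = len(stacked[0])
```

Under these assumptions `input_dim` comes out as 9 x 52, matching the dimensionality quoted in the text.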
These physiologically and psychoacoustically motivated features employ spectrotemporal information inherent to the speech signal. Multistream spectrotemporal features for robust speech. Generally, the feature extraction schemes for speaker recognition can be categorized into linear predictive cepstral coefficients. Gabor filterbank features for robust speech recognition. The spectrotemporal receptive field or spatiotemporal receptive field (STRF) of a neuron represents which types of stimuli excite or inhibit that neuron. The proposed scheme is carefully optimized to be particularly sensitive to the information-rich spectrotemporal attributes of the signal while maintaining robustness.
The technical problems are rigorously defined, and a complete picture is made of the relevance of the discussed algorithms and their usage in building a comprehensive. Spectrotemporal Gabor filterbank features for acoustic event. Other biologically inspired spectrotemporal speech features, e. Proceedings of Triennial Forum Acusticum, September, 2002. Robust speech recognition based on spectrotemporal. This concept of spectrotemporal modulation decomposition has inspired many approaches in various engineering topics, such as using spectrotemporal modulation features for speaker recognition 12, robust speech recognition 18, voice activity detection 10, and sound. Physiologically motivated feature extraction methods based on 2D Gabor filters have already been used successfully in robust automatic speech recognition. For both ETSI and MFCC, the 9-frame context window and the 468-dimensional feature representations achieved the. Spectrotemporal analysis of speech using 2D Gabor filters, in Proceedings of Interspeech 2007, pp. Mel-frequency cepstrum coefficients (MFCC) and modulation. These features are then combined to obtain joint spectrotemporal features, which are used for a posterior-based speech recognition system. In previous work, we generated robust spectrotemporal features that incorporated the power normalized cepstral coefficient (PNCC) algorithm. Feature extraction, Gabor filterbank, robust speech recognition.
In Kleinschmidt, 2002a the usage of 2-dimensional Gabor. In this paper, a gammatone filterbank is used and a comparison is made between GBFB with a mel filterbank (GBFB mel features) and GBFB with a gammatone filterbank (GBFB gamm features). In this paper we investigate the applicability of spectro. Similar techniques are widely used in the visual domain. An overview of text-independent speaker recognition. For this study, an SER system based on different classifiers and different feature extraction methods is developed. Spectrotemporal Gabor features improve recognition results in all acoustic conditions under consideration compared with mel-frequency cepstral coefficients. The theoretical definition, the categorization of affective states, and the modalities of emotion expression are presented. A comprehensive treatment of such spectrotemporal integration of speech as it relates to aging is lacking.
Speaker recognition: introduction, measurement of speaker characteristics, construction of speaker models, decision and performance, applications. This lecture is based on Rosenberg et al. Part of the Lecture Notes in Computer Science book series (LNCS, volume 8509). Original speaker recognition systems used the average output of several analog filters to perform matching, often with the aid of humans in the loop. Spectrotemporal Gabor features as a front end for automatic speech recognition, PACS reference 43. Feature extraction techniques in speaker recognition. The author tries to use a 2D spectrogram image instead of 1D information. Spectrotemporal power spectrum features for noise robust. This capability of DNNs to learn task-oriented features can be utilized to learn speaker-discriminative features as well. In this work we built an LSTM-based speaker recognition system on a dataset collected from Coursera lectures. PDF: Spectrotemporal Gabor filterbank features for acoustic. We explored different Gabor feature implementations, along with different speaker recognition approaches, on ROSSI 1 and NIST SRE08. Array-based spectrotemporal masking for automatic speech. The features are derived from a long-term spectrotemporal representation of speech. Experimental results with the Berlin emotional speech database show that the proposed.
Index terms: speaker recognition, Gaussian mixture model, feature extraction, expectation maximization, TIMIT database. The joint spectrotemporal features adaptively capture. Designing of Gabor filters for spectrotemporal feature extraction to. Neurophysiological studies suggest that the responses of neurons in the primary auditory cortex of mammals are tuned to specific spectrotemporal patterns (Theunissen 2001). Gabor filters with high temporal modulation encode the most relevant information. Robust automatic speech recognition and modeling of auditory. Speaker recognition, or more broadly speech recognition, has been an active area of research for the past two decades. Multistream spectrotemporal features for robust speech recognition, Sherry Y. Spectrotemporal Gabor features for speaker recognition. Selection and enhancement of Gabor filters for automatic.
Introduction, measurement of speaker characteristics. Temporal response characteristics of ICC neurons can be interpreted by four parameters of the temporal Gabor model eq. Meyer, and Nikki Mirghafori, International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704, USA. Abstract: In this work, we have investigated the performance of 2D Gabor features, known as spectrotemporal features, for speaker recognition. This technique is based on applying a simple 2D filter to the speech spectrogram to highlight the movements of spectral peaks. Algorithms for the automatic detection and recognition of acoustic. Speech/non-speech discrimination based on spectrotemporal. Optimization of Gabor features for text-independent. Exploring spectrotemporal features in end-to-end convolutional. Spectrotemporal directional derivative features for. A measure of phoneme similarity is proposed to quantify class separability. This response characteristic of neurons can be described by the so-called STRF.
PDF: Normalization of spectrotemporal Gabor filter bank features. Together, the peak latency and response duration determine the locality and width of the TRF. Communication Systems and Networks, School of Electrical and Computer Engineering. Spectrotemporal Gabor features as a front end for automatic speech recognition. Novel gammatone filterbank based spectrotemporal features. This chapter presents a comparative study of speech emotion recognition (SER) systems. Combining binaural and cortical features for robust speech. These Gabor features implement the idea of a complex, second-order spectrotemporal feature extractor by considering combinations of temporal and spectral transitions as the template for desired speech elements. Short-term spectral features, as the name suggests, are computed from short frames of about 20-30.