Wyniki wyszukiwania - BazTech

1

Recognition of speaker’s age group and gender for a large database of telephone-recorded voices

Staroniewicz Piotr

Vibrations in Physical Systems

|

2022

|

Vol. 33, nr 2

art. no. 2022203

EN

The paper presents the results of the automatic recognition of age group and gender of speakers performed for the large SpeechDAT(E) acoustic database for the Polish language, containing recordings of 1000 speakers (486 males/514 females) aged 12 to 73, recorded in telephone conditions. Three age groups were recognised for each gender. Mel Frequency Cepstral Coefficients (MFCC) were used to describe the recognized signals parametrically. Among the classification methods tested in this study, the best results were obtained for the SVM (Support Vector Machines) method.

2

Fully automated speaker identification and intelligibility assessment in dysarthria disease using auditory knowledge

Kadi K. L., Selouani S. A., Boudraa B., Boudraa M.

Biocybernetics and Biomedical Engineering

|

2016

|

Vol. 36, no. 1

233--247

EN

Millions of children and adults suffer from acquired or congenital neuro-motor communication disorders that can affect their speech intelligibility. The automatically characterization of speech impairment can contribute to improve the patient's life quality, and assist experts in assessment and treatment design. In this paper, we present new approaches to improve the analysis and classification of disordered speech. First, we propose an automatic speaker recognition approach especially adapted to identify dysarthric speakers. Secondly, we suggest a method for the automatic assessment of the dysarthria severity level. For this purpose, a model simulating the external, middle and inner parts of the ear is presented. This ear model provides relevant auditory-based cues that are combined with the usual Mel-Frequency Cepstral Coefficients (MFCC) to represent atypical speech utterances. The experiments are carried out by using data of both Nemours and Torgo databases of dysarthric speech. Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs) and hybrid GMM/SVM systems are tested and compared in the context of dysarthric speaker identification and assessment. The experimental results achieve a correct speaker identification rate of 97.2% which can be considered promising for this novel approach; also the existing assessment systems are outperformed with a 93.2% correct classification rate of dysarthria severity levels.

3

Charakterystyki behawioralne w biometrii

Kamińska D., Sapiński T., Pelikant A.

Elektronika : konstrukcje, technologie, zastosowania

|

2015

|

Vol. 56, nr 10

38-40

PL

W artykule omówiono sposoby pozyskiwania, przetwarzania i reprezentacji sygnałów audio w celu prowadzenia dalszych analiz związanych zarówno z semantyką wypowiedzi, jak również z cechami behawioralnymi mówcy. Przyjęto, że analiza danych powinna być prowadzona możliwie blisko miejsca ich przechowywania, np. w komercyjnych serwerach baz danych z wykorzystaniem enkapsulacji klas obiektowych do elementów programistycznych relacyjnego serwera. Poza wykorzystaniem reprezentacji sygnału za pomocą wektorów wyrażonych w skalach cepstralnych, ważnym elementem analizy jest zastosowanie algorytmów dopasowania strumieni wektorów danych – Spring DTW. W przypadku analizy stanów emocjonalnych do wzmocnienia procesu klasyfikacji zastosowano komitety klasyfikatorów działających na różnych zestawach atrybutów, a analizę odniesiono do modelu Plutchika.

EN

The article describes methods of acquisition, processing and representation of audio signals for the purpose of further analysis associated with both the semantics of expression, as well as behavioral characteristics of the speaker. It is assumed that the data analysis should be carried out as close to the place of storage, eg. in commercial database servers using the encapsulation of object classes to relational server software components. In addition to using a representation of a signal as vectors in cepstral scale, an important part of the analysis is to apply matching algorithms - Spring DTW. In order to enhance the analysis of emotional states classification committees consiting of classifiers operating on different sets of attributes were used. Emotion detection was based on Plutchik’s wheel.

4

Evaluation of speech corpora for speech and speaker recognition systems

Ślimok J., Kotas J.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 6

373--375

EN

Creating advanced speech processing and speech recognition techniques involves the need of working with real voice samples. Access to various speech corpora is extremely helpful in such a situation. Having this type of resources available during the development process, it is possible to detect errors quicker, as well as estimate algorithm parameters better. Selecting a proper voice sample set is a key element in the development of a speech processing application. Each speech corpus has been adapted to support different aspects of speech processing. The goal of this paper is to present available speech corpora. Each of them is shown in the form of a table. The tables contain the description of features helpful in choosing a suitable set of voice samples.

PL

Tworzenie zaawansowanych technik przetwarzania oraz rozpoznawania mowy wiąże się z koniecznością pracy z rzeczywistymi próbkami głosu. Dostęp do różnorodnych zbiorów sygnałów mowy jest w tej sytuacji niezwykle pomocny. Posiadając tego typu zasoby, możliwe jest szybsze wykrywanie błędów, jak również lepsze oszacowanie parametrów algorytmów. Celem niniejszego artukułu jest zaprezentowanie dostępnych zbiorów próbek głosu. Dostępne bazy mowy różnią się między sobą między innym jakością, warunkami nagrywania oraz możliwymi zastosowaniami. Część baz zawiera rejestrowane rozmowy telefoniczne, z kolei inne zawierają wypowiedzi zarejestrowane przy użyciu wielu mikrofonów wysokiej jakości. Wykorzystywanie publicznych baz danych ma jeszcze jedną ważną zaletę - umożliwia porównywanie algorytmów stworzonych przez różne ośrodki badawcze, wykorzystujące tę samą metodologię. Uzyskiwane wyniki są prezentowane w postaci benchmarków, co umożliwia szybkie porównywanie opracowanych rozwiązań. Z tego powodu, wybór odpowiedniej bazy mowy jest kluczowy z punktu widzenia skuteczności działania systemu. Każdy ze zbiorów został przedstawiony w formie tabeli. Tabele zawierają opis cech pomocnych podczas wyboru odpowiedniego zbioru próbek głosu.

5

Classification of speech intelligibility in Parkinson's disease

Khan T., Westin J., Dougherty M.

Biocybernetics and Biomedical Engineering

|

2014

|

Vol. 34, no. 1

35--45

EN

A problem in the clinical assessment of running speech in Parkinson's disease (PD) is to track underlying deficits in a number of speech components including respiration, phonation, articulation and prosody, each of which disturbs the speech intelligibility. A set of 13 features, including the cepstral separation difference and Mel-frequency cepstral coefficients were computed to represent deficits in each individual speech component. These features were then used in training a support vector machine (SVM) using n-fold cross validation. The dataset used for method development and evaluation consisted of 240 running speech samples recorded from 60 PD patients and 20 healthy controls. These speech samples were clinically rated using the Unified Parkinson's Disease Rating Scale Motor Examination of Speech (UPDRS-S). The classification accuracy of SVM was 85% in 3 levels of UPDRS-S scale and 92% in 2 levels with the average area under the ROC (receiver operating characteristic) curves of around 91%. The strong classification ability of selected features and the SVM model supports suitability of this scheme to monitor speech symptoms in PD.

6

Cepstral separation difference: A novel approach for speech impairment quantification in Parkinson's disease

Khan T., Westin J., Dougherty M.

Biocybernetics and Biomedical Engineering

|

2014

|

Vol. 34, no. 1

25--34

EN

This paper introduces a novel approach, Cepstral Separation Difference (CSD), for quantification of speech impairment in Parkinson's disease (PD). CSD represents a ratio between the magnitudes of glottal (source) and supra-glottal (filter) log-spectrums acquired using the source-filter speech model. The CSD-based features were tested on a database consisting of 240 clinically rated running speech samples acquired from 60 PD patients and 20 healthy controls. The Guttmann (μ2) monotonic correlations between the CSD features and the speech symptom severity ratings were strong (up to 0.78). This correlation increased with the increasing textual difficulty in different speech tests. CSD was compared with some non-CSD speech features (harmonic ratio, harmonic-to-noise ratio and Mel-frequency cepstral coefficients) for speech symptom characterization in terms of consistency and reproducibility. The high intra-class correlation coefficient (>0.9) and analysis of variance indicates that CSD features can be used reliably to distinguish between severity levels of speech impairment. Results motivate the use of CSD in monitoring speech symptoms in PD.

7

Improving speech processing based on phonetics and phonology of Polish language

Kłosowski P.

Przegląd Elektrotechniczny

|

2013

|

R. 89, nr 8

303--307

EN

The article presents methods of improving speech processing based on phonetics and phonology of Polish language. The new presented method for speech recognition was based on detection of distinctive acoustic parameters of phonemes in Polish language. Distinctivity has been assumed as the most important selection of parameters, which have represented objects from recognized classes. Speech recognition is widely used in telecommunications applications.

PL

W artykule zaprezentowano metody usprawnienia przetwarzania mowy wykorzystując do tego celu wiedzę z zakresu fonetyki I fonologii języka polskiego. Przedstawiona innowacyjna metoda automatycznego rozpoznawania mowy polega na detekcji akustycznych parametrów dystynktywnych fonemów mowy polskiej. O dystynktywności cech decydują parametry niezbędne do klasyfikacji fonemów.

8

Comparison of speaker dependent and speaker independent emotion recognition

Rybka J., Janicki A.

International Journal of Applied Mathematics and Computer Science

|

2013

|

Vol. 23, no. 4

797--808

EN

This paper describes a study of emotion recognition based on speech analysis. The introduction to the theory contains a review of emotion inventories used in various studies of emotion recognition as well as the speech corpora applied, methods of speech parametrization, and the most commonly employed classification algorithms. In the current study the EMO-DB speech corpus and three selected classifiers, the k-Nearest Neighbor (k-NN), the Artificial Neural Network (ANN) and Support Vector Machines (SVMs), were used in experiments. SVMs turned out to provide the best classification accuracy of 75.44% in the speaker dependent mode, that is, when speech samples from the same speaker were included in the training corpus. Various speaker dependent and speaker independent configurations were analyzed and compared. Emotion recognition in speaker dependent conditions usually yielded higher accuracy results than a similar but speaker independent configuration. The improvement was especially well observed if the base recognition ratio of a given speaker was low. Happiness and anger, as well as boredom and neutrality, proved to be the pairs of emotions most often confused.

9

Detection of disfluencies in speech signal

Barczewska K., Igras M.

Challenges of Modern Technology

|

2013

|

Vol. 4, no. 2

3--10

EN

During public presentations or interviews, speakers commonly and unconsciously abuse interjections or filled pauses that interfere with speech fluency and negatively affect listeners impression and speech perception. Types of disfluencies and methods of detection are reviewed. Authors carried out a survey which results indicated the most adverse elements for audience. The article presents an approach to automatic detection of the most common type of disfluencies - filled pauses. A base of patterns of filled pauses (prolongated I, prolongated e, mm, Im, xmm, using SAMPA notation) was collected from 72 minutes of recordings of public presentations and interviews of six speakers (3 male, 3 female). Statistical analysis of length and frequency of occurrence of such interjections in recordings are presented. Then, each pattern from training set was described with mean values of first and second formants (F1 and F2). Detection was performed on test set of recordings by recognizing the phonemes using the two formants with efficiency of recognition about 68%. The results of research on disfluencies in speech detection may be applied in a system that analyzes speech and provides feedback of imperfections that occurred during speech in order to help in oratorical skills training. A conceptual prototype of such an application is proposed. Moreover, a base of patterns of most common disfluencies can be used in speech recognition systems to avoid interjections during speech-to-text transcription.

10

Amplitude and Frequency Modulation in Speaker Recognition Systems

Ciota Z.

International Journal of Microelectronics and Computer Science

|

2012

|

Vol. 3, nr 2

41-44

EN

The paper presents a review of the nowadays methods of voice vector extraction, applied in such speech processing, like person identification and emotion recognition. A special attention was held on mixed time-frequency analysis based on temporary frequency approach. The methods of calculation of time - frequency voice characterization were also described. The most important building blocks of identification and recognition of speakers have been presented. The characterization of feature vectors suitable for identification and verification in microcomputer systems was described. Components and appropriate method of speech identification based on the long-term spectra vectors were discussed.

11

Diagnostic significance of phase spectrum in acoustic analysis of pathological voice

Samborska-Owczarek A.

Pomiary Automatyka Kontrola

|

2010

|

R. 56, nr 12

1547-1550

EN

The paper regards the possibility of using new numerical features extracted from the phase spectrum of a speech signal for voice quality estimation in acoustic analysis for medical purposes. This novel approach does not require detection or estimation of the fundamental frequency and works on all types of speech signal: euphonic, dysphonic and aphonic as well. The experiment results presented in the paper are very promising: the developed F0-independant voice features are strongly correlated with two voice quality indicators: grade of hoarseness G (r>0.8) and roughness R (r>0.75) from GIRBAS scale, and exceed the standard voice parameters: jitter and shimmer.

PL

Artykuł dotyczy możliwości ekstrakcji cech numerycznych z widma fazowego sygnału mowy w celu wykorzystania w analizie akustycznej na potrzeby medyczne. Podejście to umożliwia uzależnienie analizy akustycznej od zawodnych metod wykrywania/wyznaczania częstotliwości podstawowej (tonu krtaniowego) i dzięki temu przeznaczone jest do badania wszystkich typów sygnału mowy (również afonicznych). Wyniki eksperymentu są bardzo obiecujące - proponowane cechy Ph1 i Ph2 są silnie skorelowane z dwoma kategoriami percepcyjnymi: stopniem chrypki (r>0.8) oraz szorstkością głosu (r>0.75) ze skali GIRBAS, wykazując silniejsze znaczenie diagnostyczne niż znane i stosowane od dawna wskaźniki jitter i shimmer. Proponowane podejście oprócz skuteczności charakteryzuje się szeregiem dodatkowych korzyści: algorytm metody z powodu niskiej złożoności jest szybki i niekosztowny, interpretacja matematyczna jest prosta i jednoznaczna oraz spójna z obserwowanym obrazem widma fazowego głosu. Ponadto uniezależnienie od detekcji częstotliwości podstawowej sprawia, że algorytm jest deterministyczny oraz efektywny dla każdego typu sygnału mowy.

12

Polish phoneme statistics obtained on large set of written texts

Ziółko B., Gałka J., Ziółko M.

Computer Science

|

2009

|

Vol. 10

97-106

EN

The phonetical statistics were collected from several Polish corpora. The paper is a summary of the data which are phoneme n-grams and some phenomena in the statistics. Triphone statistics apply context-dependent speech units which have an important role in speech recognition systems and were never calculated for a large set of Polish written texts. The standard phonetic alphabet for Polish, SAMPA, and methods of providing phonetic transcriptions are described.

PL

W niniejszej pracy zaprezentowano opis statystyk głosek języka polskiego zebranych z dużej liczby tekstów. Triady głosek pełnią istotną rolę w rozpoznawaniu mowy. Omówiono obserwacje dotyczące zebranych statystyk i przedstawiono listy najpopularniejszych elementów.

13

Automatyczna ocena zaburzeń emisji głosu będących wynikiem procesów neurodegeneracyjnych w oparciu o analizę wyizolowanych głosek

Orzechowski T., Chmurzyńska K., Radkowski P.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

|

2006

|

T. 10, z. 3

91-97

PL

Prezentowane wyniki stanowią początek badań nad automatyczną klasyfikacją głosu. W niniejszej pracy zarysowano teoretyczne podstawy fizjologiczne głosu, patologiczne zmiany w mowie powodowane dyzartrią, następnie scharakteryzowano dobór materiału lingwistycznego pod względem miejsca i sposobu artykulacji w systemie fonetycznym języka polskiego. Kolejne miejsce w pracy zajmuje opis rejestracji i wstępnej analizy głosu badanych (zmiany w realizacji głosek, natężenie głosek wymawianych wielokrotnie w izolacji, analiza widma dźwięków ciągłych). Zjawiska słyszane w badaniu subiektywnym patologa mowy, bądź neurologa zostały potwierdzone precyzyjnym badaniem obiektywnym. Uzyskane parametry pozwalają na sparametryzowanie wyników badań, umożliwiające kompleksową klasyfikację. Pozwoli to również na dokładną ocenę progresji choroby, niemożliwą w klasycznym badaniu subiektywnym.

EN

This paper presents results of preliminary research of voice pathological changes caused by dysarthria. Computer analysis of voice may lead to identification of parameters correlated with neurological diseases. The selection of linguistic material was characterized according to the place and manner of articulation in the phonetic system of Polish. Results of clinical examination allowed to determine simple markers of neurodegenerative diseases, which will serve as a basis for construction of objective examination model.

14

Blind deconvolution of timely correlated sources by gradient descent search

Kasprzak W.

Image Processing & Communications

|

2003

|

Vol. 9, no 1-2

33-52

EN

In multichannel blind decon volution (MB D) the goal is to calculate possibly scaled and delayed estimates of source signals from their convoluted mixtures, using approximate knowledge of the source characteristics oniy. Nearly all of the solutions to MBD proposed so far require from source signals to be pairwise statistically independent and to be timely not correlated. In practice, this can only be satisfied by specific synthetic signals. In this paper we describe how to modify gradient-based iterative algorithms in order to perform the MBD task on timely correlated sources. Implementation issues are discussed and specific tests on synthetic and real 2-D images are documented.

15

Pathological speech recognition based on images generated by the Kohonen neural network

Kapusta M., Gajer M., Shomali A.

Image Processing & Communications

|

1999

|

Vol. 5, no 2

11-17

EN

The nature of speech signal is very complicated, that causes that its visualisation and further analysis, without some initial pre-processing, is very complicated and doesn't always bring the desired effects. Speech signal in most cases is represented by videograms. The analysis of these forms of signal visualisation is not easy because of difficulties in their interpretation. In this article the usage of Kohonen neural network for visualising speech signals uttered by children with a cleft palate was proposed. Speech signal is converted to its spectrum matrices representation, which constitutes the input for Kohonen neural network. Further a method for generating a simplified form of speech signal (a poly-line figure) based on the network's output was presented. In addition a method for pathological speech signal recognition was presented. Test results based on utterances obtained from children with a cleft palate were presented.