Results found: 27

Search results
Searched in keywords: speech processing
EN
With global life expectancy rising every year, ageing-associated diseases are becoming an increasingly important problem. Very often, successful treatment relies on early diagnosis. In this work, the issue of Parkinson's disease (PD) diagnostics is tackled. It is particularly important, as there are no certain antemortem methods of diagnosing PD, meaning that the presence of the disease can only be confirmed after the patient's death. In our work, we propose a non-invasive approach to PD recognition based on the classification of raw speech recordings using deep learning models. The core of the method is an audio classifier using knowledge transfer from a pretrained speech representation model, wav2vec 2.0. The model was tested on a group of 38 PD patients and 10 healthy persons above the age of 50. A dataset of speech recordings acquired using a smartphone recorder was constructed, and the recordings were labelled as PD/non-PD, with the severity of the disease additionally rated using the Hoehn-Yahr scale. We then benchmarked the classification performance against baseline methods. Additionally, we present an assessment of human-level performance obtained with neurology professionals.
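As a rough illustration of the transfer setup described above, the sketch below stacks a linear classification head on a pretrained wav2vec 2.0 backbone. It is a minimal sketch assuming the HuggingFace transformers library and the facebook/wav2vec2-base checkpoint; the paper's actual backbone variant, pooling and training procedure are not specified here.

```python
# Sketch: PD/non-PD classification of raw audio via transfer from wav2vec 2.0.
# The checkpoint, mean pooling and linear head below are assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class PDClassifier(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.head = nn.Linear(self.backbone.config.hidden_size, n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio
        hidden = self.backbone(waveform).last_hidden_state  # (batch, frames, dim)
        pooled = hidden.mean(dim=1)                         # average over time
        return self.head(pooled)                            # (batch, n_classes)

model = PDClassifier()
logits = model(torch.randn(1, 16000))  # 1 s of dummy audio
```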
EN
In order to design a stable and reliable voice communication system, it is essential to know how many resources are necessary for conveying quality content. These parameters may include objective quality of service (QoS) metrics, such as available bandwidth, bit error rate (BER), delay and latency, as well as subjective quality of experience (QoE) related to user expectations. QoE is expressed as clarity of speech and the ability to interpret voice commands with adequate mean opinion score (MOS) grades. This paper describes a quality evaluation study of a two-way speech transmission system using broadband over power line - power line communication (BPL-PLC) technology in an operating underground mine. We investigate how different features of the available wired medium can affect end-user quality. The described study covered two types of coupling (capacitive and inductive), two transmission modes (modes 1 and 11), and four language sets of speech samples (American English, British English, German, and Polish) encoded at three different bit rates (8, 16, and 24 kbps). Our findings can aid both researchers working on low-bit-rate coding and compression, signal processing and speech perception, and professionals active in the mining and oil industry.
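For illustration only: MOS grades in a study like this are aggregated per test condition. A minimal sketch assuming pandas, with hypothetical ratings (the paper's data are not reproduced):

```python
# Sketch: aggregating 1-5 listener opinion scores into MOS per test condition.
import pandas as pd

ratings = pd.DataFrame({
    "coupling": ["capacitive", "capacitive", "inductive", "inductive"],
    "bitrate_kbps": [8, 24, 8, 24],
    "score": [3, 4, 4, 5],          # hypothetical opinion scores
})
# MOS = mean opinion score per (coupling, bit rate) cell
mos = ratings.groupby(["coupling", "bitrate_kbps"])["score"].mean()
print(mos)
```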
EN
The paper presents the results of automatic recognition of speaker age group and gender, performed on the large SpeechDAT(E) acoustic database for the Polish language, containing recordings of 1000 speakers (486 males/514 females) aged 12 to 73, recorded in telephone conditions. Three age groups were recognized for each gender. Mel Frequency Cepstral Coefficients (MFCC) were used to describe the recognized signals parametrically. Among the classification methods tested in this study, the best results were obtained with the SVM (Support Vector Machines) method.
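A minimal sketch of the MFCC-plus-SVM pipeline named above, assuming librosa and scikit-learn; the feature settings and file names are illustrative, not taken from the paper:

```python
# Sketch: MFCC features summarized per recording, classified with an SVM.
import librosa
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=8000)            # telephone-band audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # summarize frame-level MFCCs with their mean and std over time
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# X: one feature vector per recording; y: age-group labels (0/1/2)
X = np.array([mfcc_features(p) for p in ["a.wav", "b.wav"]])  # hypothetical files
y = np.array([0, 2])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
```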
EN
Today’s human-computer interaction systems have a broad variety of applications in which automatic human emotion recognition is of great interest. The literature contains many different, more or less successful forms of these systems. This work emerged as an attempt to clarify which speech features are the most informative, which classification structure is the most convenient for this type of task, and the degree to which the results are influenced by database size, quality and the cultural characteristics of a language. The research is presented as a case study on Slavic languages.
EN
Although emotions and learning based on emotional reactions are individual-specific, the main features are consistent among all people. Depending on a person's emotional state, various physical and physiological changes can be observed in pulse and breathing, blood flow velocity, hormonal balance, voice properties, facial expression and hand movements. The diversity, magnitude and degree of these changes are shaped by different emotional states. Acoustic analysis, an objective evaluation method, is used to determine emotional state from the characteristics of a person's voice. In this study, the reflection of anxiety disorder in people's voices was investigated through acoustic parameters. The study is a cross-sectional case-control study. Voice recordings were obtained from healthy people and patients. With acoustic analysis, 122 acoustic parameters were obtained from these voice recordings. The relation of these parameters to the anxious state was investigated statistically. According to the results obtained, 42 acoustic parameters change in the anxious state. In the anxious state, the subglottic pressure increases and the vocalization of the vowels decreases. The change in the MFCC parameter in the anxious state indicates that people can perceive this situation while listening to speech. It has also been shown that text reading is effective in triggering the emotions. These findings show that the voice changes in the anxious state and that the acoustic parameters are influenced by it. For this reason, acoustic analysis can be used as an expert decision support system for the diagnosis of anxiety.
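The statistical screening described above can be illustrated with a two-sample test on a single parameter. A minimal sketch assuming scipy, with hypothetical jitter values; the paper's 122 parameters and exact test choices are not reproduced:

```python
# Sketch: testing whether one acoustic parameter differs between groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
jitter_controls = rng.normal(0.010, 0.002, size=40)  # hypothetical jitter values
jitter_patients = rng.normal(0.014, 0.003, size=40)

# Mann-Whitney U makes no normality assumption about the parameter distribution
u, p = stats.mannwhitneyu(jitter_patients, jitter_controls, alternative="two-sided")
print(f"U={u:.1f}, p={p:.4f}")  # p < 0.05 would flag the parameter as anxiety-sensitive
```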
EN
Speech emotion recognition is an important part of human-machine interaction studies. The acoustic analysis method is used for emotion recognition through speech. An emotion does not cause changes in all acoustic parameters. Rather, the acoustic parameters affected by emotion vary depending on the emotion type. In this context, the emotion-based variability of acoustic parameters is still a current field of study. The purpose of this study is to investigate which acoustic parameters fear affects and the extent of its influence. For this purpose, various acoustic parameters were obtained from speech recordings containing fear and neutral emotions. The variation of these parameters across emotional states was analyzed using statistical methods, and the parameters affected by the fear emotion, together with the degree of influence, were determined. According to the results obtained, the majority of acoustic parameters that fear affects vary with the data used. However, it has been demonstrated that formant frequencies, mel-frequency cepstral coefficients, and jitter parameters can define the fear emotion independently of the data used.
EN
Millions of children and adults suffer from acquired or congenital neuro-motor communication disorders that can affect their speech intelligibility. The automatic characterization of speech impairment can contribute to improving patients' quality of life and assist experts in assessment and treatment design. In this paper, we present new approaches to improve the analysis and classification of disordered speech. First, we propose an automatic speaker recognition approach especially adapted to identifying dysarthric speakers. Secondly, we suggest a method for the automatic assessment of the dysarthria severity level. For this purpose, a model simulating the external, middle and inner parts of the ear is presented. This ear model provides relevant auditory-based cues that are combined with the usual Mel-Frequency Cepstral Coefficients (MFCC) to represent atypical speech utterances. The experiments are carried out using data from both the Nemours and Torgo databases of dysarthric speech. Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs) and hybrid GMM/SVM systems are tested and compared in the context of dysarthric speaker identification and assessment. The experimental results achieve a correct speaker identification rate of 97.2%, which can be considered promising for this novel approach; the existing assessment systems are also outperformed, with a 93.2% correct classification rate of dysarthria severity levels.
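The GMM side of the identification task can be sketched as one model per speaker, scored by log-likelihood. A minimal sketch assuming scikit-learn; the auditory ear model and the hybrid GMM/SVM systems from the paper are not reproduced:

```python
# Sketch: GMM-based speaker identification over MFCC-like frame features.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker: dict[str, np.ndarray]) -> dict:
    # one GMM per speaker, fitted on that speaker's feature frames
    return {spk: GaussianMixture(n_components=8, covariance_type="diag").fit(f)
            for spk, f in features_per_speaker.items()}

def identify(models: dict, frames: np.ndarray) -> str:
    # pick the speaker whose GMM gives the highest average log-likelihood
    return max(models, key=lambda spk: models[spk].score(frames))

rng = np.random.default_rng(1)
models = train_speaker_models({"spk_a": rng.normal(0, 1, (500, 13)),
                               "spk_b": rng.normal(1, 1, (500, 13))})
print(identify(models, rng.normal(1, 1, (200, 13))))  # expected: "spk_b"
```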
EN
Creating advanced speech processing and speech recognition techniques requires working with real voice samples. Access to various speech corpora is extremely helpful in such a situation. With this type of resource available during the development process, it is possible to detect errors more quickly and to estimate algorithm parameters more accurately. Selecting a proper voice sample set is a key element in the development of a speech processing application. Each speech corpus has been adapted to support different aspects of speech processing. The goal of this paper is to present the available speech corpora. Each of them is shown in the form of a table. The tables contain descriptions of features helpful in choosing a suitable set of voice samples.
PL
Creating advanced speech processing and speech recognition techniques requires working with real voice samples. Access to diverse collections of speech signals is extremely helpful in this situation. With resources of this kind, errors can be detected more quickly and algorithm parameters estimated more accurately. The goal of this article is to present the available collections of voice samples. The available speech databases differ, among other things, in quality, recording conditions and possible applications. Some databases contain recorded telephone conversations, while others contain utterances recorded using many high-quality microphones. Using public databases has one more important advantage: it enables the comparison of algorithms created by different research centres using the same methodology. The obtained results are presented in the form of benchmarks, which makes it possible to compare the developed solutions quickly. For this reason, the choice of an appropriate speech database is crucial for the effectiveness of the system. Each of the collections is presented in the form of a table. The tables contain descriptions of features helpful in choosing a suitable set of voice samples.
EN
This paper presents an analysis of issues related to the fixed-point implementation of speech signal processing for biometric purposes. To prepare the automatic speaker identification system and to carry out experimental tests, we used the Matlab computing environment and the development software for Texas Instruments digital signal processors, namely Code Composer Studio (CCS). The tested speech signals were processed with the TMS320C5515 processor. The paper examines limitations associated with the operation of the realized embedded system, demonstrates the advantages and disadvantages of the technique of automatic software conversion from Matlab to CCS, and shows the impact of the fixed-point representation on speech identification effectiveness.
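The fixed-point constraint can be illustrated by simulating Q15 quantization, the 16-bit format typical of C55x-class DSPs. A minimal NumPy sketch; the paper's Matlab-to-CCS conversion itself is not shown:

```python
# Sketch: simulating Q15 (16-bit signed, 15 fractional bits) quantization.
import numpy as np

def to_q15(x: np.ndarray) -> np.ndarray:
    # scale to 16-bit signed integers, saturating at the format limits
    return np.clip(np.round(x * 32768), -32768, 32767).astype(np.int16)

def from_q15(q: np.ndarray) -> np.ndarray:
    return q.astype(np.float64) / 32768

signal = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s, 440 Hz tone
err = signal - from_q15(to_q15(signal))
print(f"max quantization error: {np.abs(err).max():.2e}")  # on the order of 2**-15
```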
10
Classification of speech intelligibility in Parkinson's disease
EN
A problem in the clinical assessment of running speech in Parkinson's disease (PD) is to track underlying deficits in a number of speech components, including respiration, phonation, articulation and prosody, each of which disturbs speech intelligibility. A set of 13 features, including the cepstral separation difference and Mel-frequency cepstral coefficients, was computed to represent deficits in each individual speech component. These features were then used to train a support vector machine (SVM) using n-fold cross validation. The dataset used for method development and evaluation consisted of 240 running speech samples recorded from 60 PD patients and 20 healthy controls. These speech samples were clinically rated using the Unified Parkinson's Disease Rating Scale Motor Examination of Speech (UPDRS-S). The classification accuracy of the SVM was 85% for 3 levels of the UPDRS-S scale and 92% for 2 levels, with an average area under the ROC (receiver operating characteristic) curve of around 91%. The strong classification ability of the selected features and the SVM model supports the suitability of this scheme for monitoring speech symptoms in PD.
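The evaluation scheme named above can be sketched with scikit-learn's cross validation. A minimal sketch with hypothetical features and labels matching only the dataset's dimensions (240 samples, 13 features); the paper's actual features are not reproduced:

```python
# Sketch: n-fold cross-validated SVM on a 13-dimensional speech feature set.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(240, 13))       # 240 samples x 13 features, as in the paper
y = rng.integers(0, 3, size=240)     # hypothetical 3-level UPDRS-S labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross validation
print(f"mean accuracy: {scores.mean():.2f}")
```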
EN
This paper introduces a novel approach, Cepstral Separation Difference (CSD), for the quantification of speech impairment in Parkinson's disease (PD). CSD represents a ratio between the magnitudes of the glottal (source) and supra-glottal (filter) log-spectra acquired using the source-filter speech model. The CSD-based features were tested on a database consisting of 240 clinically rated running speech samples acquired from 60 PD patients and 20 healthy controls. The Guttmann (μ2) monotonic correlations between the CSD features and the speech symptom severity ratings were strong (up to 0.78). This correlation increased with increasing textual difficulty in the different speech tests. CSD was compared with some non-CSD speech features (harmonic ratio, harmonic-to-noise ratio and Mel-frequency cepstral coefficients) for speech symptom characterization in terms of consistency and reproducibility. The high intra-class correlation coefficient (>0.9) and analysis of variance indicate that CSD features can be used reliably to distinguish between severity levels of speech impairment. The results motivate the use of CSD in monitoring speech symptoms in PD.
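The source-filter separation underlying CSD can be illustrated with cepstral liftering: split the cepstrum at a quefrency cutoff into a low part (vocal tract filter) and a high part (glottal source). A minimal NumPy sketch with an assumed 2 ms cutoff; the paper's exact CSD formula is not reproduced:

```python
# Sketch: splitting a frame's log-spectrum into source and filter parts.
import numpy as np

def cepstral_split(frame: np.ndarray, sr: int, cutoff_ms: float = 2.0):
    log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    cep = np.fft.irfft(log_spec)            # real cepstrum of the frame
    n_cut = int(sr * cutoff_ms / 1000)      # quefrency cutoff in samples
    low, high = cep.copy(), cep.copy()
    low[n_cut:-n_cut] = 0.0                 # low quefrency -> vocal tract (filter)
    high[:n_cut] = 0.0
    high[-n_cut:] = 0.0                     # high quefrency -> glottal source
    # back to log-spectral magnitudes of each component
    return np.fft.rfft(high).real, np.fft.rfft(low).real

src, filt = cepstral_split(np.random.default_rng(3).normal(size=1024), sr=16000)
```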
12
Improving speech processing based on phonetics and phonology of Polish language
EN
The article presents methods of improving speech processing based on the phonetics and phonology of the Polish language. The newly presented speech recognition method is based on the detection of distinctive acoustic parameters of phonemes in the Polish language. Distinctiveness was adopted as the most important criterion for selecting the parameters that represent objects from the recognized classes. Speech recognition is widely used in telecommunications applications.
PL
The article presents methods of improving speech processing using knowledge of the phonetics and phonology of the Polish language. The presented innovative method of automatic speech recognition relies on detecting the distinctive acoustic parameters of the phonemes of Polish speech. The distinctiveness of features is determined by the parameters necessary for phoneme classification.
EN
This paper describes a study of emotion recognition based on speech analysis. The introduction to the theory contains a review of emotion inventories used in various studies of emotion recognition, as well as the speech corpora applied, methods of speech parametrization, and the most commonly employed classification algorithms. In the current study, the EMO-DB speech corpus and three selected classifiers, the k-Nearest Neighbor (k-NN), the Artificial Neural Network (ANN) and Support Vector Machines (SVMs), were used in experiments. SVMs turned out to provide the best classification accuracy of 75.44% in the speaker-dependent mode, that is, when speech samples from the same speaker were included in the training corpus. Various speaker-dependent and speaker-independent configurations were analyzed and compared. Emotion recognition in speaker-dependent conditions usually yielded higher accuracy than a similar but speaker-independent configuration. The improvement was especially evident when the baseline recognition rate of a given speaker was low. Happiness and anger, as well as boredom and neutrality, proved to be the pairs of emotions most often confused.
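The speaker-independent configuration discussed above corresponds to grouped cross validation, where no speaker appears in both training and test folds. A minimal sketch assuming scikit-learn, with hypothetical features; only the corpus dimensions (535 utterances, 10 speakers, 7 emotions) follow EMO-DB:

```python
# Sketch: speaker-independent evaluation via grouped cross validation.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(535, 20))            # hypothetical feature vectors
y = rng.integers(0, 7, size=535)          # 7 emotion classes, as in EMO-DB
speakers = rng.integers(0, 10, size=535)  # EMO-DB has 10 speakers

# each fold holds out whole speakers, never splitting one across train/test
scores = cross_val_score(SVC(), X, y, groups=speakers, cv=GroupKFold(n_splits=10))
print(f"speaker-independent accuracy: {scores.mean():.2f}")
```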
14
Detection of disfluencies in speech signal
EN
During public presentations or interviews, speakers commonly and unconsciously abuse interjections or filled pauses, which interfere with speech fluency and negatively affect listeners' impressions and speech perception. Types of disfluencies and methods of their detection are reviewed. The authors carried out a survey whose results indicated the elements most adverse for the audience. The article presents an approach to automatic detection of the most common type of disfluency: filled pauses. A base of patterns of filled pauses (prolonged I, prolonged e, mm, Im, xmm, using SAMPA notation) was collected from 72 minutes of recordings of public presentations and interviews of six speakers (3 male, 3 female). A statistical analysis of the length and frequency of occurrence of such interjections in the recordings is presented. Then, each pattern from the training set was described with the mean values of the first and second formants (F1 and F2). Detection was performed on a test set of recordings by recognizing the phonemes using the two formants, with a recognition efficiency of about 68%. The results of this research on the detection of disfluencies in speech may be applied in a system that analyzes speech and provides feedback on imperfections that occurred during the speech, in order to help in oratorical skills training. A conceptual prototype of such an application is proposed. Moreover, a base of patterns of the most common disfluencies can be used in speech recognition systems to avoid interjections during speech-to-text transcription.
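F1/F2 estimation of the kind used above is commonly done via LPC root finding. A minimal sketch assuming librosa; the pattern-matching thresholds for filled pauses are not shown:

```python
# Sketch: estimating the first two formants of a frame from LPC roots.
import librosa
import numpy as np

def first_two_formants(frame: np.ndarray, sr: int) -> tuple[float, float]:
    a = librosa.lpc(frame, order=int(sr / 1000) + 2)    # common order heuristic
    roots = [r for r in np.roots(a) if np.imag(r) > 0]  # one root per resonance
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    freqs = [f for f in freqs if f > 90.0]              # drop near-DC artefacts
    return freqs[0], freqs[1]                           # F1, F2 estimates

rng = np.random.default_rng(5)
f1, f2 = first_two_formants(rng.normal(size=400), sr=16000)
```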
EN
The paper analyzes the estimation of the fundamental frequency from a real speech signal obtained by recording the speaker in a real acoustic environment and coded with the MP3 method. The estimation was performed by the peak-picking algorithm with implemented parametric cubic convolution (PCC) interpolation. The efficiency of PCC was tested for the Catmull-Rom, Greville, and Greville two-parametric kernels. Based on the MSE, the window giving optimal results was chosen.
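Sub-sample refinement of a picked autocorrelation peak can be sketched as follows; here a simple three-point parabolic fit stands in for the paper's parametric cubic convolution kernels (Catmull-Rom, Greville):

```python
# Sketch: F0 estimation by peak picking on the autocorrelation, with
# sub-sample refinement of the peak position.
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int, fmin=60.0, fmax=400.0) -> float:
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)          # lag search range
    k = lo + int(np.argmax(ac[lo:hi]))               # integer-lag peak
    y0, y1, y2 = ac[k - 1], ac[k], ac[k + 1]
    delta = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2)     # sub-sample peak offset
    return sr / (k + delta)

sr = 16000
t = np.arange(1024) / sr
print(estimate_f0(np.sin(2 * np.pi * 123.0 * t), sr))  # close to 123 Hz
```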
EN
The paper presents a review of current methods of voice feature vector extraction applied in speech processing tasks such as person identification and emotion recognition. Special attention was given to mixed time-frequency analysis based on the instantaneous frequency approach. Methods of calculating time-frequency voice characteristics are also described. The most important building blocks of speaker identification and recognition are presented, together with a characterization of feature vectors suitable for identification and verification in microcomputer systems. Components and a suitable method of speaker identification based on long-term spectrum vectors are discussed.
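A long-term average spectrum (LTAS) vector of the kind mentioned above can be computed with Welch's method. A minimal sketch assuming scipy; the cosine comparison is an illustrative choice, not the paper's method:

```python
# Sketch: long-term average spectrum as a per-utterance speaker feature.
import numpy as np
from scipy.signal import welch

def ltas(signal: np.ndarray, sr: int) -> np.ndarray:
    freqs, psd = welch(signal, fs=sr, nperseg=1024)   # time-averaged spectrum
    return 10 * np.log10(psd + 1e-12)                 # log power per frequency bin

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(6)
a, b = rng.normal(size=48000), rng.normal(size=48000)  # hypothetical utterances
print(cosine(ltas(a, 16000), ltas(b, 16000)))
```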
PL
The article proposes a new real-time method for speaker segmentation in telephone calls. We assume that, thanks to access to the equipment of one of the parties (e.g., the operator's office of an emergency call service center), it is possible to add a digital watermark to the operator's utterances. The presented procedure can serve as a pre-processing stage in automatic speaker recognition tasks. Its effective operation was tested in the Matlab / Simulink environment against various acoustic backgrounds.
EN
In this paper, a new real-time method for speaker segmentation in telephone calls is proposed. We assume that, due to access to the equipment of one side (e.g., an operator office of the emergency call service center), there is a possibility to add a digital watermark to the operator's utterances. The presented procedure can serve as a pre-processing stage for automatic speaker recognition. It has been tested with various acoustic backgrounds using the Matlab / Simulink environment.
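The watermarking idea can be sketched as adding a shared low-level pseudorandom sequence to one party's frames and detecting it by correlation; frames where the watermark is detected are attributed to the operator. An illustrative scheme only, not the authors' algorithm:

```python
# Sketch: embedding and detecting a pseudorandom watermark per frame.
import numpy as np

rng = np.random.default_rng(7)
N = 16000                                       # 1 s frame at 16 kHz
WATERMARK = rng.choice([-1.0, 1.0], size=N)     # sequence shared with the detector

def embed(frame: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    return frame + alpha * WATERMARK            # low-level additive mark

def detect(frame: np.ndarray, threshold: float = 3.0) -> bool:
    # normalized correlation against the known sequence; threshold ~3 sigma
    score = np.dot(frame, WATERMARK) / np.sqrt(len(frame))
    return score / (frame.std() + 1e-12) > threshold

speech = rng.normal(scale=0.1, size=N)          # stand-in for an operator frame
print(detect(embed(speech)), detect(speech))    # expected: True False
```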
18
Signal processing of voice in case of patients after stroke
PL
The article concerns changes in voice parameters in people who have suffered a stroke. Signal processing may be useful for monitoring the progress of hospitalisation and rehabilitation. Voice recordings of a dozen or so patients were made, and then all samples were analysed using a variety of algorithms.
EN
The paper focuses on human voice changes resulting from stroke. We suppose that signal processing could be useful for monitoring hospitalisation and rehabilitation progress. The voices of several patients were recorded; afterwards, the analysis of these samples was performed using various algorithms.
EN
The paper concerns the possibility of using new numerical features extracted from the phase spectrum of a speech signal for voice quality estimation in acoustic analysis for medical purposes. This novel approach does not require detection or estimation of the fundamental frequency and works on all types of speech signal: euphonic, dysphonic and aphonic. The experimental results presented in the paper are very promising: the developed F0-independent voice features are strongly correlated with two voice quality indicators, grade of hoarseness G (r>0.8) and roughness R (r>0.75) from the GIRBAS scale, and outperform the standard voice parameters, jitter and shimmer.
PL
The article concerns the possibility of extracting numerical features from the phase spectrum of a speech signal for use in acoustic analysis for medical purposes. This approach makes acoustic analysis independent of unreliable methods of detecting/estimating the fundamental frequency (glottal tone) and is therefore suited to examining all types of speech signal (including aphonic ones). The results of the experiment are very promising: the proposed features Ph1 and Ph2 are strongly correlated with two perceptual categories, grade of hoarseness (r>0.8) and voice roughness (r>0.75) from the GIRBAS scale, showing stronger diagnostic significance than the long-used jitter and shimmer indices. Apart from its effectiveness, the proposed approach has a number of additional advantages: owing to its low complexity, the algorithm is fast and inexpensive; its mathematical interpretation is simple, unambiguous and consistent with the observed image of the phase spectrum of the voice. Moreover, independence from fundamental frequency detection makes the algorithm deterministic and effective for every type of speech signal.
EN
Phonetic statistics were collected from several Polish corpora. The paper summarizes the data, which are phoneme n-grams, and discusses some phenomena observed in the statistics. Triphone statistics concern context-dependent speech units, which play an important role in speech recognition systems and had never been calculated for a large set of Polish written texts. The standard phonetic alphabet for Polish, SAMPA, and methods of providing phonetic transcriptions are described. A minimal triphone-counting sketch follows this entry.
PL
This work presents a description of statistics of Polish speech sounds collected from a large number of texts. Triads of speech sounds play an important role in speech recognition. Observations concerning the collected statistics are discussed, and lists of the most common elements are presented.
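The triphone statistics described in the entry above reduce to counting phoneme triads over transcriptions. A minimal sketch assuming whitespace-separated SAMPA strings; the corpora and transcription rules from the paper are not reproduced:

```python
# Sketch: counting phoneme triads (triphones) in SAMPA transcriptions.
from collections import Counter

def triphone_counts(transcriptions: list[str]) -> Counter:
    counts: Counter = Counter()
    for utterance in transcriptions:
        phones = utterance.split()
        # every run of three consecutive phonemes is one triphone
        counts.update(zip(phones, phones[1:], phones[2:]))
    return counts

# hypothetical SAMPA transcriptions of two Polish words
corpus = ["k o t", "k o t e k"]
print(triphone_counts(corpus).most_common(3))
```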