Search results - BazTech

1

Lossy coding impact on speech recognition with convolutional neural networks

Kucharski Mateusz

Vibrations in Physical Systems

|

2022

|

Vol. 33, nr 3

art. no. 2022302

EN

This paper presents research on the impact of lossy coding on speech recognition with convolutional neural networks. For this purpose, the Google Speech Commands dataset, containing utterances of 30 words, was encoded using the four most common all-purpose codecs: mp3, aac, wma and ogg. A convolutional neural network was trained on part of the original files and later tested on the rest of the files, as well as on their counterparts encoded with different codecs and bitrates. The same network model was also trained on mp3-encoded data, which had caused the biggest loss in effectiveness of the previous network. Results show that lossy coding does affect speech recognition, especially at low bitrates.

2

HMM-based phoneme speech recognition system for the control and command of industrial robots

Naik Adwait

Technical Transactions

|

2021

|

Vol. 118, iss. 1

art. no. e2021002

EN

In recent years, the integration of human-robot interaction with speech recognition has gained a lot of pace in the manufacturing industries. Conventional methods to control the robots include semi-autonomous, fully-autonomous, and wired methods. Operating through a teaching pendant or a joystick is easy to implement but is not effective when the robot is deployed to perform complex repetitive tasks. Speech and touch are natural ways of communicating for humans, and speech recognition, being the best option, is a heavily researched technology. In this study, we aim at developing a stable and robust speech recognition system to allow humans to communicate with machines (a robotic arm) in a seamless manner. This paper investigates the potential of the linear predictive coding technique to develop a stable and robust HMM-based phoneme speech recognition system for applications in robotics. Our system is divided into three segments: a microphone array, a voice module, and a robotic arm with three degrees of freedom (DOF). To validate our approach, we performed experiments with simple and complex sentences for various robotic activities such as manipulating a cube and pick-and-place tasks. Moreover, we also analyzed the test results to rectify problems including accuracy and recognition score.

3

Speech-Based Vehicle Movement Control Solution

Kaur Gurpreet, Srivastava Mohit, Kumar Amod

Journal of Telecommunications and Information Technology

|

2021

|

nr 3

72--77

EN

The article describes a speech-based robotic prototype designed to aid the movement of elderly or handicapped individuals. Mel frequency cepstral coefficients (MFCC) are used for the extraction of speech features and a deep belief network (DBN) is trained for the recognition of commands. The prototype was tested in a real-world environment and achieved an accuracy rate of 87.4%.
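The mel frequency scale behind the MFCC features mentioned in this abstract can be illustrated with a short sketch. It assumes the commonly used approximation mel = 2595 * log10(1 + f / 700). The function names are hypothetical and nothing here is taken from the article's own implementation.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (common 2595/700 approximation)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centres(low_hz: float, high_hz: float, n_filters: int) -> list:
    """Centre frequencies (Hz) of n_filters filters spaced evenly on the mel scale.

    Illustrative only: a full MFCC pipeline would also need framing,
    windowing, an FFT, the triangular filters themselves, a log and a DCT.
    """
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_filters)]
```

Because the spacing is even in mels, the centre frequencies are denser at low frequencies than at high ones, which is the property MFCC-style features rely on.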

4

Development of Speaker Voice Identification Using Main Tone Boundary Statistics for Applying To Robot-Verbal Systems

Amirgaliyev Yedilkhan, Musabayev Timur, Yedilkhan Didar, Wojcik Waldemar, Amirgaliyeva Zhazira

International Journal of Electronics and Telecommunications

|

2020

|

Vol. 66, No. 3

583--588

EN

This paper presents a basic speaker identification system. It discusses the application and use of voice interfaces, in particular speaker voice identification in communication between robots and humans. An information system for automatic speaker identification by voice, intended for robot-verbal systems, is described. Algorithms and machine learning libraries were reviewed, and the one most appropriate according to the required criteria, ALGLIB, was selected. The performance of the identification model was assessed for different sets of the fundamental voice tone. The percentage of incorrectly classified speaker identification cases was used as the accuracy criterion.

5

Czech parliament meeting recordings as ASR training data

Krůza Jan Oldřich

Annals of Computer Science and Information Systems

|

2020

|

Vol. 21

185--188

EN

I present a way to leverage the stenographed recordings of the Czech parliament meetings for purposes of training a speech-to-text system. The article presents a method for scraping the data, acquiring word-level alignment and selecting reliable parts of the imprecise transcript. Finally, I present an ASR system trained on these and other data.

6

Agentowa struktura wielomodalnego interfejsu do Narodowej Platformy Cyberbezpieczeństwa, część 2

Kasprzak Włodzimierz, Szynkiewicz Wojciech, Stefańczyk Maciej, Dudek Wojciech, Figat Maksym, Węgierek Maciej, Seredyński Dawid, Zieliński Cezary

Pomiary Automatyka Robotyka

|

2019

|

R. 23, nr 4

5--18

PL

Ten dwuczęściowy artykuł przedstawia interfejs do Narodowej Platformy Cyberbezpieczeństwa (NPC). Wykorzystuje on gesty i komendy wydawane głosem do sterowania pracą platformy. Ta część artykułu przedstawia strukturę interfejsu oraz sposób jego działania, ponadto prezentuje zagadnienia związane z jego implementacją. Do specyfikacji interfejsu wykorzystano podejście oparte na agentach upostaciowionych, wykazując że podejście to może być stosowane do tworzenia nie tylko systemów robotycznych, do czego było wykorzystywane wielokrotnie uprzednio. Aby dostosować to podejście do agentów, które działają na pograniczu środowiska fizycznego i cyberprzestrzeni, należało ekran monitora potraktować jako część środowiska, natomiast okienka i kursory potraktować jako elementy agentów. W konsekwencji uzyskano bardzo przejrzystą strukturę projektowanego systemu. Część druga tego artykułu przedstawia algorytmy wykorzystane do rozpoznawania mowy i mówców oraz gestów, a także rezultaty testów tych algorytmów.

EN

This two-part paper presents an interface to the National Cybersecurity Platform that uses gestures and voice commands as the means of interaction between the operator and the platform. Cyberspace and its underlying infrastructure are vulnerable to a broad range of risks stemming from diverse cyber-threats. The main role of this interface is to support security analysts and operators controlling the visualisation of cyberspace events such as incidents or cyber-attacks, especially when manipulating graphical information. The main visualisation control modalities are gesture- and voice-based commands. Thus the design of gesture recognition and speech recognition modules is provided. The speech module is also responsible for speaker identification in order to limit access to trusted users only, registered with the visualisation control system. This part of the paper focuses on the structure and the activities of the interface, while the second part concentrates on the algorithms employed for the recognition of gestures, voice commands and speakers.

7

Agentowa struktura wielomodalnego interfejsu do Narodowej Platformy Cyberbezpieczeństwa, część 1

Kasprzak Włodzimierz, Szynkiewicz Wojciech, Stefańczyk Maciej, Dudek Wojciech, Figat Maksym, Węgierek Maciej, Seredyński Dawid, Zieliński Cezary

Pomiary Automatyka Robotyka

|

2019

|

R. 23, nr 3

41--54

PL

Ten dwuczęściowy artykuł przedstawia interfejs do Narodowej Platformy Cyberbezpieczeństwa (NPC). Wykorzystuje on gesty i komendy wydawane głosem do sterowania pracą platformy. Ta część artykułu przedstawia strukturę interfejsu oraz sposób jego działania, ponadto prezentuje zagadnienia związane z jego implementacją. Do specyfikacji interfejsu wykorzystano podejście oparte na agentach upostaciowionych, wykazując że podejście to może być stosowane do tworzenia nie tylko systemów robotycznych, do czego było wykorzystywane wielokrotnie uprzednio. Aby dostosować to podejście do agentów, które działają na pograniczu środowiska fizycznego i cyberprzestrzeni, należało ekran monitora potraktować jako część środowiska, natomiast okienka i kursory potraktować jako elementy agentów. W konsekwencji uzyskano bardzo przejrzystą strukturę projektowanego systemu. Część druga tego artykułu przedstawia algorytmy wykorzystane do rozpoznawania mowy i mówców oraz gestów, a także rezultaty testów tych algorytmów.

EN

This two-part paper presents an interface to the National Cybersecurity Platform that uses gestures and voice commands as the means of interaction between the operator and the platform. Cyberspace and its underlying infrastructure are vulnerable to a broad range of risks stemming from diverse cyber-threats. The main role of this interface is to support security analysts and operators controlling the visualisation of cyberspace events such as incidents or cyber-attacks, especially when manipulating graphical information. The main visualisation control modalities are gesture- and voice-based commands. Thus the design of gesture recognition and speech recognition modules is provided. The speech module is also responsible for speaker identification in order to limit access to trusted users only, registered with the visualisation control system. This part of the paper focuses on the structure and the activities of the interface, while the second part concentrates on the algorithms employed for the recognition of gestures, voice commands and speakers.

8

Zastosowanie uczenia maszynowego w budowie interfejsu sterowanego głosem na przykładzie odtwarzacza muzyki

Basiakowski Jakub

Journal of Computer Sciences Institute

|

2019

|

Vol. 13

302--309

PL

Poniższy artykuł przedstawia wyniki badań wpływu zastosowania uczenia maszynowego w budowie interfejsu sterowanego głosem. Do analizy wykorzystane zostały dwa różne modele: jednokierunkowa sieć neuronowa zawierająca jedną warstwę ukrytą oraz bardziej skomplikowana konwolucyjna sieć neuronowa. Dodatkowo wykonane zostało porównanie modeli użytych w celu realizacji badań pod względem jakości oraz przebiegu treningu.

EN

The following paper presents the results of research on the impact of machine learning in the construction of a voice-controlled interface. Two different models were used for the analysis: a feedforward neural network containing one hidden layer and a more complicated convolutional neural network. In addition, the applied models were compared in terms of quality and the course of training.

9

Badania sterowania głosowego modelu centrali sterującej inteligentnego budynku

Majchrowicz J., Śmietański W.

Przegląd Mechaniczny

|

2018

|

nr 9

46--48

PL

W artykule zaprezentowano badania dwóch systemów sterowania głosowego w zakresie komend dedykowanych dla inteligentnego budynku. Opisano implementację rozpoznawania mowy opartą na platformach Google Cloud Speech API i BitVoicer. Przeprowadzono badania w celu weryfikacji poprawności działania sterowania głosowego i określono dalsze możliwości rozwoju.

EN

The article presents a study of two voice control systems for commands dedicated to an intelligent building. An implementation of speech recognition based on the Google Cloud Speech API and BitVoicer platforms is described. Tests were carried out to verify the correctness of voice control, and further development options were identified.

10

Using full covariance matrix for CMU Sphinx-III speech recognition system

Płonkowski M., Urbanovich P.

Przegląd Elektrotechniczny

|

2018

|

R. 94, nr 7

102--104

EN

In this article, the authors propose a hybrid system in which the full covariance matrix is used only at the initial stage of training. At the later stage of training, the number of covariance matrices increases significantly, which, combined with rounding errors, causes problems with matrix inversion. Therefore, when the share of matrices with a determinant of 0 exceeds 1%, the system switches to a model using diagonal covariance matrices. Thanks to this, the hybrid system achieved a result better by about 11%.

PL

W niniejszym artykule autorzy zaproponowali system hybrydowy, w którym pełna macierz kowariancji wykorzystywana jest tylko w początkowym etapie procedury treningowej. W dalszym etapie uczenia, znacząco wzrasta liczba macierzy kowariancji, co w połączeniu z błędami zaokrąglania powoduje problemy z odwróceniem tego typu macierzy. Dlatego też, gdy liczba macierzy o wyznaczniku równym 0 przekracza 1%, system przechodzi do modelu wykorzystującego macierze diagonalne. Dzięki temu system hybrydowy osiągnął wynik lepszy o około 11%.
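The switching rule described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' Sphinx-III code: the function names, the pure-Python determinant and the tolerance `tol` used to decide that a determinant counts as 0 are all hypothetical. Only the 1% threshold comes from the abstract.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute value in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def singular_fraction(covariances, tol=1e-12):
    """Share of matrices whose determinant is (numerically) zero.

    tol is an assumed tolerance; the abstract only says "determinant of 0".
    """
    singular = sum(1 for c in covariances if abs(determinant(c)) <= tol)
    return singular / len(covariances)


def choose_covariance_type(covariances, threshold=0.01, tol=1e-12):
    """Return "diagonal" once more than `threshold` of matrices are singular."""
    if singular_fraction(covariances, tol) > threshold:
        return "diagonal"
    return "full"


def to_diagonal(cov):
    """Keep only the variances on the main diagonal."""
    n = len(cov)
    return [[cov[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]
```

A diagonal matrix with positive variances is always invertible, which is why falling back to it avoids the inversion problems mentioned in the abstract, at the cost of ignoring correlations between features.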

11

Genetic Algorithm for Combined Speaker and Speech Recognition using Deep Neural Networks

Kaur G., Srivastava M., Kumar A.

Journal of Telecommunications and Information Technology

|

2018

|

nr 2

23--31

EN

Huge growth is observed in the speech and speaker recognition field due to many artificial intelligence algorithms being applied. Speech is used to convey messages via the language being spoken, emotions, gender and speaker identity. Many real applications in healthcare are based upon speech and speaker recognition, e.g. a voice-controlled wheelchair helps control the chair. In this paper, we use a genetic algorithm (GA) for combined speaker and speech recognition, relying on optimized Mel Frequency Cepstral Coefficient (MFCC) speech features, and classification is performed using a Deep Neural Network (DNN). In the first phase, feature extraction using MFCC is executed. Then, feature optimization is performed using GA. In the second phase training is conducted using DNN. Evaluation and validation of the proposed work model is done by setting a real environment, and efficiency is calculated on the basis of such parameters as accuracy, precision rate, recall rate, sensitivity, and specificity. Also, this paper presents an evaluation of such feature extraction methods as linear predictive coding coefficient (LPCC), perceptual linear prediction (PLP), mel frequency cepstral coefficients (MFCC) and relative spectra filtering (RASTA), with all of them used for combined speaker and speech recognition systems. A comparison of different methods based on existing techniques for both clean and noisy environments is made as well.

12

Management of IoT Devices in a Smart Home through the Application of an Interactive Mirror

Majchrowicz M., Hufnagiel M.

Image Processing & Communications

|

2017

|

Vol. 22, no. 4

43--50

EN

Internet of Things (IoT) devices are a big part of the concept that electronics manufacturers and many others call the Smart Home. The authors of this paper decided to examine it and, as a result of this analysis, propose and implement (in the form of a working prototype) a system that can manage different kinds of devices. The main objective of the project presented in the paper is a device that looks like a mirror, an object familiar to most people, and that also serves as an interactive center: a place to obtain information about the devices that surround us and various parameters coming from sensors. The authors have prepared a prototype that is intended to be the central point of the apartment and to give users control over IoT devices.

13

Korpus mowy angielskiej do celów multimodalnego automatycznego rozpoznawania mowy

Szykulski M., Bratoszewski P., Kotus J., Czyżewski A., Kostek B.

Przegląd Telekomunikacyjny + Wiadomości Telekomunikacyjne

|

2016

|

nr 8-9

1129--1132, CD

PL

W referacie zaprezentowano audiowizualny korpus mowy zawierający 31 godzin nagrań mowy w języku angielskim. Korpus dedykowany jest do celów automatycznego audiowizualnego rozpoznawania mowy. Korpus zawiera nagrania wideo pochodzące z szybkoklatkowej kamery stereowizyjnej oraz dźwięk zarejestrowany przez matrycę mikrofonową i mikrofon komputera przenośnego. Dzięki uwzględnieniu nagrań zarejestrowanych w warunkach szumowych korpus może być wykorzystany do badania wpływu zakłóceń na skuteczność rozpoznawania mowy.

EN

An audiovisual corpus containing 31 hours of English speech recordings is presented. The new corpus was created in order to assist the development of audiovisual speech recognition (AVSR) systems. The corpus includes high-framerate stereoscopic video streams and audio recorded by both a microphone array and a microphone built into a mobile computer. Owing to the inclusion of recordings made in noisy conditions, the corpus can be used to assess the robustness of speech recognition systems in the presence of acoustic noise.

14

Czy można rozmawiać z robotem spawalniczym?

Rogowski A.

Przegląd Spawalnictwa

|

2016

|

R. 88, nr 1

5--8

PL

W artykule przedstawiono zagadnienia związane ze sterowaniem głosowym robotami przemysłowymi, w tym robotami spawalniczymi. Omówiono celowość wykorzystania automatycznego rozpoznawania mowy w robotyce, potencjalny zakres zastosowań oraz specyficzne wymagania dotyczące aplikacji sterowania głosowego związanych z robotami przemysłowymi. W szczególności skoncentrowano się na głosowym wspomaganiu programowania robotów przez uczenie. Poruszone zostało zagadnienie definiowania języka komend głosowych oraz różne aspekty integracji systemu rozpoznawania mowy z układem sterowania robota przemysłowego. Rozważania poparto przykładami ze zrealizowanej implementacji sterowania głosowego robotem Movemaster.

EN

This paper deals with various aspects of voice control systems that could be applied to industrial robots, particularly in welding applications. It discusses the usefulness of voice-based human-machine interfaces, potential areas of application, and restrictions as well as specific requirements regarding these systems. In particular, it focuses on speech-aided teach-in robot programming. A separate section is dedicated to the issue of voice command language description. Integration of the speech recognition system with the robot controller is also broadly discussed. The description of these issues is illustrated by an example of a practically implemented voice control system applied to the educational robot Movemaster.

15

The use of pitch in Large-Vocabulary Continuous Speech Recognition System

Płonkowski M., Urbanovich P.

Przegląd Elektrotechniczny

|

2016

|

R. 92, nr 8

78--81

EN

In this article, the authors normalize the speech signal using the publicly available AN4 database. The authors added a normalization procedure that uses the pitch of the voice to the algorithm for calculating the MFCC coefficients. As demonstrated by empirical tests, the authors were able to improve the speech recognition accuracy rate by about 20%.

PL

W niniejszym artykule autorzy normalizują sygnał mowy wykorzystując publicznie dostępną bazę danych AN4. Autorzy dodali do algorytmu obliczania współczynników MFCC, procedurę normalizacji, wykorzystującą wysokość tonu głosu. Jak wynika z przeprowadzonych testów, autorzy uzyskali poprawę dokładności rozpoznawania mowy o około 20%.

16

Kaldi Toolkit in Polish whispery speech recognition

Kozierski P., Sadalla T., Drgas Sz., Dąbrowski A., Horla D.

Przegląd Elektrotechniczny

|

2016

|

R. 92, nr 11

301--304

EN

In this paper, the automatic speech recognition task is presented. The toolkits, libraries and prepared speech corpus that were used are described. The obtained results suggest that using different acoustic models for normal speech and whispered speech can reduce the word error rate. The optimal training steps have also been selected. Thanks to additional simulations, it was found that the corpus used (over 9 hours of normal speech and the same amount of whispery speech) is definitely too small and must be enlarged in the future.

PL

W artykule przedstawiono automatyczne rozpoznawanie mowy. Wykorzystane narzędzia, biblioteki i korpus opisano w artykule. Uzyskane wyniki wskazują, że wykorzystując różne modele akustyczne dla mowy zwykłej i szeptanej uzyskuje się polepszenie skuteczności rozpoznawania mowy. W wyniku wykonanych badań wskazano również optymalną kolejność kroków treningu. Dzięki dodatkowym obliczeniom stwierdzono, że użyty korpus (ponad 9 godzin zwykłej mowy i drugie tyle szeptu) jest zdecydowanie za mały do dobrego wytrenowania systemu rozpoznawania mowy i w przyszłości musi zostać powiększony.
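The word error rate mentioned in the English abstract is usually defined as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch of that standard definition follows. It is not the scoring tool used in the article, and the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.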

17

Speech Recognition in an Enclosure with a Long Reverberation Time

Kociński J., Ozimek E.

Archives of Acoustics

|

2016

|

Vol. 41, No. 2

255--264

EN

The aim of this work was to measure subjective speech intelligibility in an enclosure with a long reverberation time and to compare these results with objective parameters. Impulse Responses (IRs) were first determined with a dummy head at different measurement points of the enclosure. The following objective parameters were calculated with Dirac 4.1 software: Reverberation Time (RT), Early Decay Time (EDT), weighted Clarity (C50) and Speech Transmission Index (STI). For the chosen measurement points, the IRs were convolved with the Polish Sentence Test (PST) and logatome tests. The PST was presented against a background of babble noise, and the speech reception threshold (SRT, i.e. the SNR yielding 50% speech intelligibility) was evaluated for those points. The relationship of sentence and logatome recognition to STI was determined. It was found that the final SRT data are well correlated with the STI and can be expressed by a psychometric function. The difference between the SRT determined without reverberation and in reverberant conditions appeared to be a good measure of the effect of reverberation on speech intelligibility in a room. In addition, speech intelligibility with and without the use of the sound amplification system installed in the enclosure was compared.
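The definition of SRT as the SNR giving 50% intelligibility can be illustrated with a logistic psychometric function. The logistic form, the slope parameter and the function names below are assumptions made for illustration, and the abstract does not state which function the authors fitted.

```python
import math


def intelligibility(snr_db: float, srt_db: float, slope_per_db: float) -> float:
    """Logistic psychometric function (assumed form).

    Returns the proportion of correctly recognized speech at snr_db.
    By construction the value is 0.5 at snr_db == srt_db, and
    slope_per_db is the slope of the curve at that point.
    """
    return 1.0 / (1.0 + math.exp(-4.0 * slope_per_db * (snr_db - srt_db)))


def snr_for_intelligibility(target: float, srt_db: float, slope_per_db: float) -> float:
    """Inverse of intelligibility(): the SNR giving a target proportion in (0, 1)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    return srt_db + math.log(target / (1.0 - target)) / (4.0 * slope_per_db)
```

With this parameterization, a shift of the whole curve along the SNR axis, for example between anechoic and reverberant conditions, shows up directly as a change in `srt_db`.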

18

Laughter Classification Using Deep Rectifier Neural Networks with a Minimal Feature Subset

Gosztolya G., Beke A., Neuberger T., Tóth L.

Archives of Acoustics

|

2016

|

Vol. 41, No. 4

669--682

EN

Laughter is one of the most important paralinguistic events, and it has specific roles in human conversation. The automatic detection of laughter occurrences in human speech can aid automatic speech recognition systems as well as some paralinguistic tasks such as emotion detection. In this study we apply Deep Neural Networks (DNN) for laughter detection, as this technology is nowadays considered state-of-the-art in similar tasks like phoneme identification. We carry out our experiments using two corpora containing spontaneous speech in two languages (Hungarian and English). Also, as we find it reasonable that not all frequency regions are required for efficient laughter detection, we will perform feature selection to find the sufficient feature subset.

19

Using gesture and voice commands for the Tribot robot control

Czekalski P., Golenia M., Lipka Ł., Tokarz K.

Studia Informatica

|

2015

|

Vol. 36, nr 3

11--25

EN

The presented project seamlessly integrates modern device control methods into one solid solution. The project applies a touchless control algorithm to robotics and is considered a technology sampler for future industrial usage. It implements a gesture- and voice-recognition-based solution to control the mobile Tribot robot driving over a flat, two-dimensional surface. It integrates a Microsoft Kinect sensor, a Lego Mindstorms NXT robot and a PC. It also provides voice-controlled calibration of the human-to-machine interface.

PL

W dokumencie opisano projekt, w którym zintegrowano nowoczesne metody sterowania bezdotykowego robotem mobilnym przy użyciu gestów oraz rozpoznawania głosu. Przedmiotem sterowania jest robot zbudowany na platformie Lego Mindstorms NXT, poruszający się po dwuwymiarowej przestrzeni. Rozwiązanie integruje sensor Microsoft Kinect do sterowania robotem oraz metodę kalibracji położenia użytkownika za pomocą rozpoznawania komend głosowych.

20

Szczegóły implementacyjne algorytmów do rozpoznawania mowy

Ośka J., Wojtuń J., Piotrowski Z., Bernat M.

Elektronika : konstrukcje, technologie, zastosowania

|

2015

|

Vol. 56, nr 2

40--44

PL

W artykule zaprezentowano i porównano algorytmy do rozpoznawania mowy w kontekście ich późniejszej implementacji na platformie sprzętowej DSK OMAP. Głównym zadaniem było dogłębne porównanie dwóch klasycznych metod wykorzystywanych w rozpoznawaniu mowy GMM vs HMM (ang. GMM Gaussian Mixtures Models, ang. HMM – Hidden Markov Models). W artykule jest również opisana i porównana metoda ulepszonych mikstur gaussowskich GMM-UBM (ang. GMM UBM – Gaussian Mixtures Model Universal Background Model). Parametryzacja sygnału w oparciu o współczynniki MFCC oraz LPCC (ang. Mel Frequency Cepstral Coefficients, ang. Linear Prediction Cepstral Coefficients) została opisana [1]. Analizowany model składał się ze zbioru 10-elementowego reprezentującego cyfry mowy polskiej 0-9. Badania zostały przeprowadzone na zbiorze 3000 nagrań, które zostały przygotowane przez nasz zespół. Porównanie wyników wykonano dla rozłącznych zbiorów uczących oraz trenujących. Każda z opisywanych metod klasyfikacji operuje na tych samych danych wejściowych. Daje to możliwość miarodajnego porównania jakości tych klasyfikatorów jako skutecznych narzędzi do rozpoznawania izolowanych fraz głosowych.

EN

This paper presents and compares speech recognition algorithms in the context of their subsequent implementation on the DSK OMAP hardware platform. The main task was to compare in depth two classical methods used in speech recognition systems, GMM vs HMM (GMM: Gaussian Mixture Models, HMM: Hidden Markov Models). The article also describes and compares an improved Gaussian mixture method called GMM-UBM (Gaussian Mixture Model Universal Background Model). Parameterization of the input signal using MFCC and LPCC coefficients (Mel Frequency Cepstral Coefficients, Linear Prediction Cepstral Coefficients) is described [1]. The analyzed model consists of a 10-element set representing the Polish digits 0-9. The research was carried out on a set of 3000 recordings prepared by our team, with results compared for disjoint training and test sets. Each of the described classification methods operates on the same input data, which allows a reliable comparison of these classifiers as effective tools for recognizing isolated voice phrases.