A new version of the platform, containing only full-text resources, is now available.
Go to https://bibliotekanauki.pl
Results found: 28

Search results
Searched in keywords: information extraction
The data contained within user-generated content websites prove valuable in many applications, for example in social media monitoring or in the acquisition of training sets for machine learning algorithms. Mining such data is especially difficult in the case of web forums because of the hundreds of different forum engines in use. We propose an algorithm capable of unsupervised extraction of posts from social websites, without the need to analyse more than one page in advance. Our method localizes potential data regions by repetition analysis within the document structure and by filtering potential results. Subsequently, the fields of data records are found using key characteristics and series-wide dependencies. We managed to achieve 85% precision and 79% recall in experiments on single pages taken from 258 websites. Our solution is characterized by high computing efficiency, thus enabling wide application.
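The repetition analysis described above can be sketched as follows. This is a minimal illustration of the idea (finding parents whose child subtrees share a repeated tag-structure signature), not the authors' algorithm, and the HTML snippet is invented:

```python
# Illustrative sketch: locate candidate data regions by finding parents
# whose child subtrees repeat the same tag-structure signature.
import xml.etree.ElementTree as ET

def signature(node):
    # Tag-structure signature: tag name plus the signatures of children.
    return (node.tag, tuple(signature(c) for c in node))

def candidate_regions(root, min_repeats=3):
    regions = []
    for parent in root.iter():
        children = list(parent)
        if len(children) >= min_repeats:
            sigs = [signature(c) for c in children]
            # A region qualifies if one signature dominates the children.
            if max(sigs.count(s) for s in set(sigs)) >= min_repeats:
                regions.append(parent)
    return regions

html = """<div>
  <ul id="posts">
    <li><span>author</span><p>text</p></li>
    <li><span>author</span><p>text</p></li>
    <li><span>author</span><p>text</p></li>
  </ul>
  <div id="sidebar"><a>link</a></div>
</div>"""
root = ET.fromstring(html)
print([r.get("id") for r in candidate_regions(root)])  # ['posts']
```

Filtering of the detected regions and recovery of record fields would follow as separate steps in a full system.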
Similarity-based web clip matching
The research areas of extraction and integration of web data aim at delivering tools and methods to extract pieces of information from third-party web sites and then integrate them into profiled, domain-specific, custom web pages. Existing solutions rely on specialized APIs or XPath querying tools and are therefore not easily accessible to non-technical end users. In this paper we describe our new comprehensive, non-XPath integration platform which allows end users to extract web page fragments using a simple query-by-example approach and then to combine these fragments into custom, integrated web pages. We focus on our two novel similarity-based web clip matching algorithms: Attribute Weights Tree Matching and Edit Distance Tree Matching.
Cerberus: A New Information Retrieval Tool for Marine Metagenomics
The number of papers published every year in scientific journals is growing tremendously, especially in the biological sciences. Keeping track of a given branch of science is therefore a difficult task. This was one of the reasons for developing the classification tool we call Cerberus. The classification categories may correspond to areas of research defined by the user. We have used the tool to classify papers as containing marine metagenomic, terrestrial metagenomic or non-metagenomic information. Cerberus is based on special filters using weighted domain vocabularies. Depending on the number of occurrences of the keywords from the vocabularies in a paper, the program assigns the paper to a predefined category. This classification can precede information extraction, since it can reduce the number of papers to be analyzed. Classification of papers using the proposed method results in an accurate and precise set of articles that are relevant to the scientist, which can reduce the resources needed to find the data required in one's field of study.
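The weighted-vocabulary filtering can be sketched as follows; the categories echo the abstract, but the vocabularies, weights and fallback label are invented for illustration:

```python
# Sketch of weighted-vocabulary classification in the spirit of Cerberus:
# score each category by weight * keyword occurrences, pick the best.
def classify(text, vocabularies):
    words = text.lower().split()
    scores = {}
    for category, vocab in vocabularies.items():
        scores[category] = sum(w * words.count(kw) for kw, w in vocab.items())
    best = max(scores, key=scores.get)
    # No keyword hit at all -> fall back to the non-metagenomic class.
    return best if scores[best] > 0 else "non-metagenomic"

vocabularies = {
    "marine metagenomic": {"seawater": 2.0, "plankton": 1.5, "metagenome": 1.0},
    "terrestrial metagenomic": {"soil": 2.0, "rhizosphere": 1.5, "metagenome": 1.0},
}
paper = "We sequenced the metagenome of seawater plankton communities"
print(classify(paper, vocabularies))  # marine metagenomic
```

A real deployment would of course use curated domain vocabularies and tuned weights rather than these toy values.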
In this paper, we present the DANTE system, a tagger for temporal expressions in English documents. DANTE performs both recognition and normalisation of the expressions in accordance with the TIMEX2 annotation standard. The system is built on modular principles, with a clear separation between the recognition and normalisation components. The interface between these components is based on our novel approach to representing the local semantics of temporal expressions. DANTE has been developed in two phases: first on the basis of the TIMEX2 guidelines alone, and then on the ACE 2005 development data. The system has been evaluated on the ACE 2005 and ACE 2007 data. Although this is still work in progress, we already achieve highly satisfactory results, both for the recognition of temporal expressions and their interpretation (normalisation).
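The two-stage design (recognition, then normalisation to a TIMEX2-style value) can be illustrated with a toy recognizer; this is not DANTE and covers only one "Month DD, YYYY" pattern:

```python
# Toy temporal tagger: recognition via a regex, then normalisation of
# each match to a TIMEX2-style VAL string (YYYY-MM-DD).
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}
PATTERN = re.compile(r"\b(%s) (\d{1,2}), (\d{4})\b" % "|".join(MONTHS))

def tag_timex(text):
    spans = []
    for m in PATTERN.finditer(text):
        month, day, year = m.group(1), int(m.group(2)), int(m.group(3))
        val = f"{year:04d}-{MONTHS[month]:02d}-{day:02d}"
        spans.append((m.group(0), val))
    return spans

print(tag_timex("The meeting on March 5, 2007 was rescheduled."))
# [('March 5, 2007', '2007-03-05')]
```

Underspecified expressions ("last Tuesday") are what make real normalisation hard; they need document context, which this sketch ignores.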
The paper presents a method of automatic construction of a semantically annotated corpus using the results of a rule-based information extraction (IE) application. Construction of the corpus is based on using existing programs for text tokenization and morphological analysis and combining their results with domain-related correction rules. We reuse the specialized IE system to obtain a corpus annotated on the semantic level. The texts included in the corpus are Polish free-text clinical data. We present the documents (diabetic patients' discharge records), the structure of the corpus annotation and the methods for obtaining the annotations. Initial evaluations based on the results of manual verification of a selected data subset are also presented. The corpus, once manually corrected, is designed to be used for developing supervised machine learning models for IE applications.
The paper focuses on resolving natural language issues which have been affecting the performance of our system for processing Polish medical data. In particular, we address phenomena such as ellipsis, anaphora, comparisons, coordination and negation occurring in mammogram reports. We propose practical data-driven solutions which allow us to improve the system's performance.
Towards Recognition of Spatial Relations between Entities for Polish
In this paper, the problem of spatial relation recognition in Polish is examined. We present the different ways of distributing spatial information throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects. We focus on the spatial usage of prepositions and their meaning, determined by the ‘conceptual’ schemes they constitute. We also discuss the feasibility of comprehensive recognition of spatial relations between objects expressed in different ways by reviewing the existing tools and resources for text processing in Polish. As a result, we propose a heuristic method for the recognition of spatial relations expressed in various phrase structures called spatial expressions. We propose a definition of spatial expressions that takes into account the limitations of the tools available for the Polish language. A set of rules is used to generate candidate spatial expressions, which are later tested against a set of semantic constraints. The results of our work on recognition of spatial expressions in Polish texts were partially presented in (Marcińczuk, Oleksy, & Wieczorek, 2016). In that paper we focused on a detailed analysis of the errors obtained using a set of basic morphosyntactic patterns for generating spatial expression candidates: we identified and described the most common sources of errors, i.e. incorrectly recognized or unrecognized expressions. In this paper we focus mainly on the preliminary stages of spatial expression recognition. We present an extensive review of how spatial information can be encoded in text, the types of spatial triggers in Polish, and a detailed evaluation of the morphosyntactic patterns which can be used to generate spatial expression candidates.
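As a rough illustration of candidate generation by morphosyntactic patterns: the trigger list, the tagging scheme and the single noun + preposition + noun pattern below are simplified assumptions, not the authors' rule set:

```python
# Sketch: generate spatial-expression candidates by matching one
# morphosyntactic pattern (noun, spatial preposition, noun).
SPATIAL_TRIGGERS = {"na", "pod", "w", "nad", "przy"}  # Polish spatial prepositions

def spatial_candidates(tagged):
    # tagged: list of (token, pos) pairs, e.g. from a Polish tagger.
    cands = []
    for i in range(1, len(tagged) - 1):
        tok, pos = tagged[i]
        if pos == "prep" and tok.lower() in SPATIAL_TRIGGERS:
            if tagged[i - 1][1] == "noun" and tagged[i + 1][1] == "noun":
                cands.append((tagged[i - 1][0], tok, tagged[i + 1][0]))
    return cands

# "Kot na dachu miauczy" (a cat on the roof is meowing):
sentence = [("Kot", "noun"), ("na", "prep"), ("dachu", "noun"), ("miauczy", "verb")]
print(spatial_candidates(sentence))  # [('Kot', 'na', 'dachu')]
```

In the described method, candidates generated this way would then be filtered against semantic constraints on the trajector and landmark.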
Relational Transformation-based Tagging for Activity Recognition
The ability to recognize human activities from sensory information is essential for developing the next generation of smart devices. Many human activity recognition tasks are, from a machine learning perspective, quite similar to tagging tasks in natural language processing. Motivated by this similarity, we develop a relational transformation-based tagging system based on inductive logic programming principles, which is able to cope with expressive relational representations as well as a background theory. The approach is experimentally evaluated on two activity recognition tasks and an information extraction task, and compared to Hidden Markov Models, one of the most popular and successful approaches for tagging.
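For readers unfamiliar with transformation-based tagging, a minimal propositional (Brill-style) sketch is shown below; the paper's system is the relational, ILP-based variant, which this example does not attempt to reproduce:

```python
# Brill-style tagging sketch: start from baseline tags, then apply
# context-triggered rewrite rules of the form
# (from_tag, to_tag, required_previous_tag).
def apply_rules(tokens, tags, rules):
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags  # real rules may also inspect the tokens themselves

lexicon = {"the": "det", "race": "noun", "to": "to"}
tokens = ["the", "race", "to", "race"]
baseline = [lexicon[t] for t in tokens]      # ['det', 'noun', 'to', 'noun']
rules = [("noun", "verb", "to")]             # retag a noun as verb after 'to'
print(apply_rules(tokens, baseline, rules))  # ['det', 'noun', 'to', 'verb']
```

In the relational setting of the paper, rules can additionally refer to structured sensor observations and background knowledge rather than flat tags.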
Towards an event annotated corpus of Polish
The paper presents a typology of events built on the basis of the TimeML specification adapted to the Polish language. Some changes were introduced to the definition of the event categories and a motivation for event categorization was formulated. The event annotation task is presented on two levels: the ontology level (language-independent) and text mentions (language-dependent). The various types of event mentions in Polish text are discussed. A procedure for annotation of event mentions in Polish texts is presented and evaluated. In the evaluation, a randomly selected set of documents from the Corpus of Wrocław University of Technology (called KPWr) was annotated by two linguists and the inter-annotator agreement was calculated. The evaluation was done in two iterations. After the first evaluation we revised and improved the annotation procedure. The second evaluation showed a significant improvement in the agreement between annotators. The current work focused on annotation and categorisation of event mentions in text. Future work will focus on the description of events with a set of attributes, arguments and relations.
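Inter-annotator agreement of the kind reported above is commonly measured with Cohen's kappa; here is a minimal sketch with invented event-class labels:

```python
# Cohen's kappa: observed agreement corrected for chance agreement.
from collections import Counter

def cohen_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label.
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["action", "state", "action", "action", "state", "action"]
ann2 = ["action", "state", "state", "action", "state", "action"]
print(round(cohen_kappa(ann1, ann2), 3))  # 0.667
```

Span-based annotation tasks usually also need a matching criterion (exact vs. overlapping spans) before kappa or F1 can be computed; that step is omitted here.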
In this article, attention is paid to improving the quality of text document classification. The common techniques for analysing text documents used in classification are described and the weaknesses of these methods are stressed. The integration of quantitative and qualitative methods, which increases the quality of classification, is discussed. In the proposed approach, expanded terms, obtained by using information patterns, are used in Latent Semantic Analysis. Finally, empirical research is presented; based upon quality measures of text document classification, the effectiveness of the proposed approach is demonstrated.
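A sketch of the term-expansion idea under simplifying assumptions: documents are vectorised together with their pattern-derived expanded terms and compared by cosine similarity (the full method would additionally apply the SVD step of Latent Semantic Analysis):

```python
# Bag-of-terms vectors with expanded multi-word terms, compared by
# cosine similarity; a stand-in for the LSA pipeline described above.
import math
from collections import Counter

def vectorise(terms):
    return Counter(terms)

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# The third term in doc1/doc2 stands for an expanded term obtained
# from an information pattern (an assumption for illustration).
doc1 = ["information", "extraction", "information_extraction_system"]
doc2 = ["information", "extraction", "information_extraction_system"]
doc3 = ["weather", "forecast"]
print(round(cosine(vectorise(doc1), vectorise(doc2)), 2))  # 1.0
print(cosine(vectorise(doc1), vectorise(doc3)))            # 0.0
```

Expanded terms let near-identical phrasings reinforce each other in the vector space, which is exactly the effect the article exploits before classification.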
This article presents a method for the rough estimation of the geographical coordinates of the villages and cities described in the 19th-century geographical encyclopedia entitled “The Geographical Dictionary of the Polish Kingdom and Other Slavic Countries” [18]. We describe the algorithm for estimating locations, the tools used to acquire and process the necessary information, and the context of this research.
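One plausible sketch of such rough estimation, assuming a dictionary entry gives a distance and compass direction from an anchor town with known coordinates; the constants, the phrasing and the whole scheme are assumptions for illustration, not the paper's actual algorithm:

```python
# Offset a known anchor location by a stated distance and compass
# direction, using the flat-earth approximation ~111 km per degree.
import math

DIRECTIONS = {"N": 0, "NE": 45, "E": 90, "SE": 135,
              "S": 180, "SW": 225, "W": 270, "NW": 315}

def estimate(anchor_lat, anchor_lon, distance_km, direction):
    bearing = math.radians(DIRECTIONS[direction])
    dlat = (distance_km * math.cos(bearing)) / 111.0
    # Longitude degrees shrink with latitude, hence the cos() factor.
    dlon = (distance_km * math.sin(bearing)) / (111.0 * math.cos(math.radians(anchor_lat)))
    return anchor_lat + dlat, anchor_lon + dlon

# Hypothetical entry: "10 km north of Warsaw" (52.23 N, 21.01 E).
lat, lon = estimate(52.23, 21.01, 10, "N")
print(round(lat, 3), round(lon, 3))  # 52.32 21.01
```

For 19th-century sources, units (versts, miles) and vague directions would add further error, which is why the paper speaks of rough estimation.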
Glossary of Terms extraction from textual requirements is an important step in ontology engineering methodologies. Although initially it was intended to be performed manually, recent years have shown that some degree of automation is possible. Building on these promising approaches, we introduce a novel, human-interpretable, rule-based method named ReqTagger, which can automatically extract candidates for ontology entities (classes or instances) and relations (data or object properties) from textual requirements. We compare ReqTagger to existing automatic methods on an evaluation benchmark consisting of over 550 requirements tagged with over 1700 entities and relations expected to be extracted. We discuss the quality of ReqTagger and provide details showing why it outperforms other methods. We also publish both the evaluation dataset and the implementation of ReqTagger.
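A minimal rule sketch in the spirit of ReqTagger; the lexicons and the noun/verb rules below are invented for illustration and are far simpler than the published method:

```python
# Toy rule-based tagger for requirements: nouns become entity
# candidates, verbs become relation candidates. The lexicons are
# hypothetical stand-ins for real POS tagging.
NOUNS = {"user", "account", "system", "password"}
VERBS = {"create", "delete", "store", "reset"}

def tag_requirement(text):
    entities, relations = [], []
    for tok in text.lower().strip(".").split():
        if tok in NOUNS:
            entities.append(tok)
        elif tok in VERBS:
            relations.append(tok)
    return entities, relations

ents, rels = tag_requirement("The system shall store the user password.")
print(ents, rels)  # ['system', 'user', 'password'] ['store']
```

The appeal of such rules is interpretability: a requirements engineer can read, audit and adjust them, which is harder with opaque learned extractors.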
In this paper, we discuss a software architecture which has been developed for the needs of the System for Intelligent Maritime Monitoring (SIMMO). The system is based on state-of-the-art information fusion and intelligence analysis techniques, which generate an enhanced Recognized Maritime Picture and thus support situation analysis and decision-making. The SIMMO system aims to automatically fuse up-to-date maritime data from the Automatic Identification System (AIS) and open Internet sources. The collected data are then analysed to detect suspicious vessels. The functionality of the system is realized in a number of modules (web crawlers, data fusion, anomaly detection, visualization) that share the AIS and external data stored in the system's database. The aim of this article is to demonstrate how external information can be leveraged in a maritime awareness system and what software solutions are necessary. A working system is presented as a proof of concept.
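As a toy stand-in for the anomaly-detection module, the sketch below flags vessels whose reported AIS speed exceeds a plausible limit for their declared ship type; the thresholds and record fields are assumptions, not SIMMO's actual rules:

```python
# Flag AIS tracks whose speed is implausible for the declared ship type.
MAX_SPEED_KN = {"cargo": 25.0, "tanker": 20.0, "passenger": 35.0}

def suspicious(tracks):
    # tracks: list of dicts with AIS-like fields (mmsi, type, speed_kn).
    flagged = []
    for t in tracks:
        limit = MAX_SPEED_KN.get(t["type"], 40.0)  # default for unknown types
        if t["speed_kn"] > limit:
            flagged.append(t["mmsi"])
    return flagged

tracks = [
    {"mmsi": "211000001", "type": "cargo", "speed_kn": 14.2},
    {"mmsi": "211000002", "type": "tanker", "speed_kn": 31.0},  # implausible
]
print(suspicious(tracks))  # ['211000002']
```

A real system cross-checks such signals against external sources (vessel registries, port calls), which is precisely the fusion the article describes.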
The paper presents the concept of a tool supporting the search for information gathered in Polish Web resources. It operates on top of a system that crawls and indexes data, together with dedicated search grammars, allowing valuable information to be found on the Web more effectively. The advantage of the proposed approach over the results obtained with the Google search engine is demonstrated on an example from the stamping industry. The possibilities of adapting the system to other branches of industry, as well as the evolution of its basic version, are also presented.
Effective analysis of structured documents may determine the performance of management information systems. In the paper, an adaptive method of information extraction from structured text documents is considered. We assume that documents belong to thematic groups and that the required set of information may be determined a priori. Knowledge of the document structure allows one to indicate blocks where certain information is more likely to appear. As a result, structured data are obtained which can be analysed further. The proposed solution uses dictionaries and inflection analysis, and may be applied to Polish texts. The presented approach can be used for information extraction from official letters, information sheets and product specifications.
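The block-oriented idea can be sketched as follows, assuming documents are already split into named blocks; the block names, field labels and regular expressions are illustrative, and the real method additionally uses dictionaries and inflection analysis:

```python
# Search for each field only inside the block where it is most likely
# to appear, instead of scanning the whole document.
import re

BLOCK_PATTERNS = {
    "header": {"case_number": re.compile(r"Case no\.\s*(\S+)")},
    "body": {"amount": re.compile(r"amount of\s*(\d+)\s*PLN")},
}

def extract(blocks):
    # blocks: mapping block name -> raw text of that document block.
    found = {}
    for name, patterns in BLOCK_PATTERNS.items():
        text = blocks.get(name, "")
        for field, pat in patterns.items():
            m = pat.search(text)
            if m:
                found[field] = m.group(1)
    return found

doc = {"header": "Case no. 123/2020", "body": "We request the amount of 4500 PLN."}
print(extract(doc))  # {'case_number': '123/2020', 'amount': '4500'}
```

Restricting each field to its expected block both speeds up extraction and reduces false matches elsewhere in the document.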
Objective: The objective of the paper is to analyse publicly available government policy documents of the United Arab Emirates (UAE) and the Kingdom of Saudi Arabia (KSA) in order to identify key topics and themes for these two countries in relation to the COVID-19 response. Research Design & Methods: In view of the availability of large volumes of documents as well as advances in computing systems, text mining has emerged as a significant tool for analysing large volumes of unstructured data. For this paper, we have applied latent semantic analysis and Singular Value Decomposition (SVD) for text clustering. Findings: The results of the term analysis indicate similarities in key themes around health and the pandemic for the UAE and the KSA. However, the results of text clustering indicate that the focus of the UAE's documents is on ‘Digital’-related terms, whereas for the KSA it is on ‘International Travel’-related terms. Further topic modelling demonstrates that topics such as ‘Vaccine Trial’, ‘Economic Recovery’, ‘Health Ministry’, and ‘Digital Platforms’ are common to both the UAE and the KSA. Contribution / Value Added: The study contributes to the text-mining literature by providing a framework for analysing public policy documents at the country level. This can help in understanding the key themes in government policies and can potentially aid the identification of the success or failure of various policy measures by comparing outcomes. Implications / Recommendations: The results of this study clearly show that text clustering of unstructured data such as policy documents can be very useful for understanding the themes and orientation of policies.
Recognizing textual entailment (RTE) has become a well-established and widely studied task. Partial textual entailment, and faceted textual entailment in particular, belongs to the tasks derived from RTE. Although there exist many annotated corpora for the original RTE problem, faceted textual entailment is highly neglected in terms of easily accessible corpora. In this paper, we present a semi-automatic approach to deriving corpora for the faceted entailment task from a general RTE corpus using open information extraction (open IE) tools.
To improve measurement precision and to solve the problem of extracting projectile information when a projectile passes through the detection screen of a photoelectric detection target, the wavelet analysis method was applied to process the signal and find the moment at which the projectile enters the screen. The detection principle of the photoelectric detection target was analysed, and the characteristics of the wavelet analysis method and of the LMS adaptive filtering algorithm were used to study the output signal of the photoelectric detection target. Based on the characteristics of this output signal, wavelet transform modulus maxima theory and singularity points were applied to find the start moment at which the projectile flies through the detection screen, and to calculate the time elapsed between detection screens. Based on the velocity-measurement principle and experiments, the wavelet analysis method was compared with the traditional nose-trigger extraction method; the velocity measurement error is less than 0.2%, which verifies that using wavelet analysis to extract the photoelectric detection target's information is feasible and correct.
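Onset detection via wavelet modulus maxima can be illustrated with a single-level Haar detail transform on a clean synthetic signal (real detector signals would be noisy and LMS-filtered first, and a full method would use several scales):

```python
# Single-level Haar detail coefficients highlight abrupt changes; the
# modulus maximum marks where the projectile crosses the screen.
import math

def haar_detail(x):
    return [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]

def onset_index(x):
    d = haar_detail(x)
    k = max(range(len(d)), key=lambda i: abs(d[i]))  # modulus maximum
    return 2 * k  # map the detail coefficient back to a sample index

# Flat baseline, then a sharp drop at sample 7 (screen crossing):
signal = [1.0] * 7 + [0.0] * 9
print(onset_index(signal))  # 6
```

With the onset found on two screens, the flight time between them (and hence the velocity) follows from the sampling rate and screen spacing.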
Information extraction from chemical patents
The development of new chemicals or pharmaceuticals is preceded by an in-depth analysis of published patents in the field. This information retrieval is a costly and time-inefficient step when done by a human reader, yet it is mandatory for the potential success of an investment. The goal of the research project UIMA-HPC is to automate, and hence speed up, the process of mining knowledge about patents. Multi-threaded analysis engines, developed according to UIMA (Unstructured Information Management Architecture) standards, process texts and images in thousands of documents in parallel. UNICORE (UNiform Interface to COmputing Resources) workflow control structures make it possible to dynamically allocate resources for every given task to achieve the best CPU-time/real-time ratios in an HPC environment.