

Article title

Hilberg’s Conjecture: A Challenge for Machine Learning

Authors
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
We review three mathematical developments linked with Hilberg’s conjecture, a hypothesis that the entropy of texts in natural language grows according to a power law, which poses a challenge for machine learning. First, considerations concerning maximal repetition indicate that universal codes such as the Lempel-Ziv code may fail to compress efficiently sources that satisfy Hilberg’s conjecture. Second, Hilberg’s conjecture implies the empirically observed power-law growth of vocabulary in texts. Third, Hilberg’s conjecture can be explained by the hypothesis that texts consistently describe an infinite random object.
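The power-law vocabulary growth mentioned in the abstract (Heaps’ law) can be measured directly. The sketch below is illustrative and not from the paper: it counts distinct tokens over growing prefixes of a token stream and estimates a power-law exponent beta, with V(n) proportional to n^beta, by least squares in log-log coordinates; the Zipf-like synthetic source in the demo is an assumption chosen only to produce sublinear growth.

```python
import math
import random

def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: number of distinct tokens in each prefix."""
    seen = set()
    growth = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((i, len(seen)))
    return growth

def fit_exponent(growth):
    """Least-squares slope of log V(n) against log n."""
    xs = [math.log(n) for n, _ in growth]
    ys = [math.log(v) for _, v in growth]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

if __name__ == "__main__":
    # Zipf-like synthetic source: token ranks drawn with p(r) ~ 1/r.
    random.seed(0)
    ranks = list(range(1, 10001))
    weights = [1.0 / r for r in ranks]
    tokens = random.choices(ranks, weights=weights, k=50000)
    beta = fit_exponent(vocabulary_growth(tokens))
    print(f"estimated Heaps exponent: {beta:.2f}")
```

On a real text one would tokenize the words of growing prefixes instead of sampling synthetically; an exponent strictly below 1 corresponds to the sublinear vocabulary growth that the paper derives from Hilberg’s conjecture.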
Year
Volume
Pages
33–44
Physical description
Bibliography: 42 items
Creators
  • Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
Bibliography
  • [1] Charikar M., Lehman E., Lehman A., Liu D., Panigrahy R., Prabhakaran M., Sahai A., Shelat A., The smallest grammar problem, IEEE Transactions on Information Theory 51, 2005, pp. 2554–2576.
  • [2] de Luca A., On the combinatorics of finite words, Theoretical Computer Science 218, 1999, pp. 13–39.
  • [3] de Marcken C.G., Unsupervised Language Acquisition, PhD thesis, Massachusetts Institute of Technology, 1996.
  • [4] Dębowski Ł., Variable-length coding of two-sided asymptotically mean stationary measures, Journal of Theoretical Probability 23, 2010, pp. 237–256.
  • [5] Dębowski Ł., On the vocabulary of grammar-based codes and the logical consistency of texts, IEEE Transactions on Information Theory 57, 2011, pp. 4589–4599.
  • [6] Dębowski Ł., Maximal lengths of repeat in English prose. S. Naumann, P. Grzybek, R. Vulanovic, G. Altmann (Eds.), Synergetic Linguistics. Text and Language as Dynamic System, Praesens Verlag, Wien 2012, pp. 23–30.
  • [7] Dębowski Ł., Mixing, ergodic, and nonergodic processes with rapidly growing information between blocks, IEEE Transactions on Information Theory 58, 2012, pp. 3392–3401.
  • [8] Dębowski Ł., Empirical evidence for Hilberg’s conjecture in single-author texts. In: I. Obradovic, E. Kelih, R. Köhler (Eds.), Methods and Applications of Quantitative Linguistics – Selected papers of the 8th International Conference on Quantitative Linguistics (QUALICO), Academic Mind, Belgrade, 2013, pp. 143–151.
  • [9] Dębowski Ł., A preadapted universal switch distribution for testing Hilberg’s conjecture, http://arxiv.org/abs/1310.8511, 2013.
  • [10] Dębowski Ł., Estimation of entropy from subword complexity, http://www.ipipan.waw.pl/˜ldebowsk/, 2014.
  • [11] Dębowski Ł., Maximal repetitions in written texts: Finite energy hypothesis vs. strong Hilberg conjecture, http://www.ipipan.waw.pl/˜ldebowsk/, 2014.
  • [12] Dębowski Ł., A new universal code helps to distinguish natural language from random texts, http://www.ipipan.waw.pl/˜ldebowsk/, 2014.
  • [13] Dębowski Ł., On hidden Markov processes with infinite excess entropy, Journal of Theoretical Probability 27, 2014, pp. 539–551.
  • [14] Ebeling W., Pöschel T., Entropy and long-range correlations in literary English, Europhysics Letters 26, 1994, pp. 241–246.
  • [15] Ebeling W., Nicolis G., Entropy of symbolic sequences: The role of correlations, Europhysics Letters 14, 1991, pp. 191–196.
  • [16] Ebeling W., Nicolis G., Word frequency and entropy of symbolic sequences: A dynamical perspective, Chaos, Solitons and Fractals 2, 1992, pp. 635–650.
  • [17] Graham R.L., Knuth D.E., Patashnik O., Concrete Mathematics. A Foundation for Computer Science, Addison-Wesley, 1994.
  • [18] Guiraud P., Les caractères statistiques du vocabulaire, Presses Universitaires de France, Paris, France, 1954.
  • [19] Heaps H.S., Information Retrieval – Computational and Theoretical Aspects, Academic Press, New York, USA, 1978.
  • [20] Herdan G., Quantitative Linguistics, Butterworths, London, England, 1964.
  • [21] Hilberg W., Der bekannte Grenzwert der redundanzfreien Information in Texten – eine Fehlinterpretation der Shannonschen Experimente? Frequenz 44, 1990, pp. 243–248.
  • [22] Jelinek F., Statistical Methods for Speech Recognition, The MIT Press, Cambridge, Massachusetts, USA, 1997.
  • [23] Kieffer J.C., Yang E., Grammar-based codes: A new class of universal lossless source codes, IEEE Transactions on Information Theory 46, 2000, pp. 737–754.
  • [24] Kit Ch., Wilks Y., Unsupervised learning of word boundary with description length gain, M. Osborne and E.T.K. Sang (Eds.), Proceedings of the Computational Natural Language Learning ACL Workshop, Bergen, 1999, pp. 1–6.
  • [25] Köhler R., Altmann G., Piotrowski R.G. (Eds.), Quantitative Linguistik. Ein internationales Handbuch / Quantitative Linguistics. An International Handbook, Walter de Gruyter, Berlin, Germany, 2005.
  • [26] Kolmogorov A.N., Three approaches to the quantitative definition of information, Problems of Information Transmission 1 (1), 1965, pp. 1–7.
  • [27] Kolpakov R., Kucherov G., On maximal repetitions in words, Journal of Discrete Algorithms 1, 1999, pp. 159–186.
  • [28] Kuraszkiewicz W., Łukaszewicz J., The number of different words as a function of text length, Pamiętnik Literacki (in Polish) 42 (1), 1951, pp. 168–182.
  • [29] Li M., Vitányi P.M.B., An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed. Springer, New York, USA, 1997.
  • [30] Mandelbrot B., An informational theory of the statistical structure of languages, W. Jackson (Ed.), Communication Theory, Butterworths, London, England, 1953, pp. 486–502.
  • [31] Mandelbrot B., Structure formelle des textes et communication, Word 10, 1954, pp. 1–27.
  • [32] Markov A.A., An example of statistical investigation of the text ‘Eugene Onegin’ concerning the connection of samples in chains, Science in Context 19, 2006, pp. 591–600.
  • [33] Menzerath P., Über einige phonetische Probleme, Actes du premier Congres international de linguistes, Sijthoff, Leiden, Holland, 1928.
  • [34] Nevill-Manning C.G., Inferring Sequential Structure, PhD thesis, University of Waikato, 1996.
  • [35] Shannon C., A mathematical theory of communication, Bell System Technical Journal 30, 1948, pp. 379–423, 623–656.
  • [36] Shannon C., Prediction and entropy of printed English, Bell System Technical Journal 30, 1951, pp. 50–64.
  • [37] Shields P.C., String matching: The ergodic case, The Annals of Probability 20, 1992, pp. 1199–1203.
  • [38] Shields P.C., Universal redundancy rates don’t exist, IEEE Transactions on Information Theory IT-39, 1993, pp. 520–524.
  • [39] Shields P.C., String matching bounds via coding, The Annals of Probability 25, 1997, pp. 329–336.
  • [40] Wolff J.G., Language acquisition and the discovery of phrase structure, Language and Speech 23, 1980, pp. 255–269.
  • [41] Zipf G.K., The Psycho-Biology of Language: An Introduction to Dynamic Philology, 2nd ed. The MIT Press, Cambridge, Massachusetts, USA, 1965.
  • [42] Ziv J., Lempel A., A universal algorithm for sequential data compression, IEEE Transactions on Information Theory 23, 1977, pp. 337–343.
Document type
YADDA identifier
bwmeta1.element.baztech-cdb6ea73-5f67-492e-b7a5-f43c6f3e656b