Detecting Irrelevant Subtrees to Improve Probabilistic Learning from Tree-structured Data

Habrard, A.; Bernard, M.; Sebban, M.

Artykuł - szczegóły

Tytuł artykułu

Detecting Irrelevant Subtrees to Improve Probabilistic Learning from Tree-structured Data

Autorzy

Habrard A. , Bernard M. , Sebban M.

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

Warianty tytułu

Języki publikacji

Abstrakty

In front of the large increase of the available amount of structured data (such as XML documents), many algorithms have emerged for dealing with tree-structured data. In this article, we present a probabilistic approach which aims at a priori pruning noisy or irrelevant subtrees in a set of trees. The originality of this approach, in comparison with classic data reduction techniques, comes from the fact that only a part of a tree (i.e. a subtree) can be deleted, rather than the whole tree itself. Our method is based on the use of confidence intervals, on a partition of subtrees, computed according to a given probability distribution. We propose an original approach to assess these intervals on tree-structured data and we experimentally show its interest in the presence of noise.

Słowa kluczowe

data reduction tree structured data noisy data stochastic tree automata

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2005

Tom

Vol. 66, nr 1,2

Strony

103--130

Opis fizyczny

Bibliogr. 38 poz., tab., wykr.

Twórcy

autor

Habrard A.

amuary.habrard@univ-st-etienne.fr

EURISE, Univ.Jean Monet, de Saint-Etienne 23, rue Michelon, 42023 St Etienne cedex 2, France

autor

Bernard M.

marc.bernard@univ-st-etienne.fr

EURISE, Univ.Jean Monet, de Saint-Etienne 23, rue Michelon, 42023 St Etienne cedex 2, France

autor

Sebban M.

marc.sebban@univ-st-etienne.fr

EURISE, Univ.Jean Monet, de Saint-Etienne 23, rue Michelon, 42023 St Etienne cedex 2, France

Bibliografia

[1] Abe, N., Mamitsuka, H.: Predicting protein secondary structure using stochastic tree grammars, Machine Learning, 29, 1997, 275–301.
[2] Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules, Proc. 20th Int. Conf. Very Large Data Bases, VLDB (J. B. Bocca, M. Jarke, C. Zaniolo, Eds.), Morgan Kaufmann, 12–15 1994, ISBN 1-55860-153-8.
[3] Amoth, T. R., Cull, P., Tadepalli, P.: On Exact Learning of Unordered Tree Patterns, Machine Learning, 44(3), 2001, 211–243.
[4] Bernard, M., de la Higuera, C.: GIFT: Grammatical Inference for Terms, Late Breaking paper at the 9th International Conference on Inductive Logic Programming, June 1999.
[5] Blake, C., Merz, C.: University of California Irvine Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/, 1998.
[6] Brown, P., Pietra, V. D., deSouza, P., Lai, J., Mercer, R.: Class-Based n-gram Models of Natural Language, Computational Linguistics, 18(4), 1992, 467–479.
[7] Calera-Rubio, J., Carrasco, R. C.: Computing the relative entropy between regular tree languages, Information Processing Letters, 68(6), 1998, 283–289.
[8] Carrasco, R., Oncina, J., Calera-Rubio, J.: Stochastic Inference of Regular Tree Languages, Machine Learning, 44(1/2), 2001, 185–197.
[9] Carrasco, R. C., Rico-Juan, J. R.: A similarity between probabilistic tree languages: application to XML document families, Pattern Recognition, in press, 2002.
[10] Chaudhuri, R., Rao, A. N. V.: Approximating Grammar Probabilities: Solution of a Conjecture, Journal of the Association for Computing Machinery, 33(4), 1986, 702–705.
[11] Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison, S., Tommasi, M.: Tree Automata Techniques and Applications , Available on: http://www.grappa.univ-lille3.fr/tata, 1997.
[12] Cover, T.M., Hart, P. E.: Nearest Neighbor Pattern Classification, IEEE Transactions on Information Theory, 13(1), 1967, 21–27.
[13] De Raedt, L.: Data mining in Multi-Relational Databases, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), 2000, Invited talk.
[14] Dupont, P., Miclet, L., Vidal, E.: What Is the Search Space of the Regular Inference?, Proceedings of the Second International Colloquium on Grammatical Inference and Applications, number 862 in LNAI, Springer-Verlag, 1994.
[15] Garcia, P., Oncina, J.: Inference of recognizable tree sets, Research Report DSIC - II/47/93, Departamento de Sistemas Inform´aticos y Computaci´on, Universidad Polit´ecnica de Valencia, 1993.
[16] Gécseg, F., Steinby, M.: Tree Automata, Akadémiai Kiadó, Budapest, 1984.
[17] Gold, E. M.: Language Identification in the limit, Information and Control, 10(n5), 1967, 447–474.
[18] Goldman, S. A., Kwek, S. S.: On Learning Unions of Pattern Languages and Tree Patterns, 10th Algorithmic Learning Theory conference, 1720, 1999.
[19] Habrard, A., Bernard, M., Jacquenet, F.: Generalized stochastic tree automata for multi-relational data mining, 6th International Colloquium on Grammatical Inference. ICGI 2002, 2484, Springer, Amsterdam, september 2002.
[20] Habrard, A., Bernard, M., Jacquenet, F.: Multi-Relational Data Mining in Medical Databases, Proceedings of the 9th Conference on Artificial Intelligence for Medicine Europe (AIME’03) (M. Dojat, E. Keravnou, P. Barahona, Eds.), 2780, Springer–Verlag, Protaras, Cyprus, October 2003.
[21] Hoeffding, W.: Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, 58(301), 1963, 13–30.
[22] John, G., Kohavi, R., Pfleger, K.: Irrelevant Features and the Subset Selection Problem, 11th International Conference on Machine Learning, 1994.
[23] Kilpeläinen, P.: Tree Matching Problems with Applications to Structured Text Databases, Report a-1992-6, Department of Computer Science, University of Helsinki, November 1992.
[24] King, R., Srinivasan, A., Sternberg, M.: Relating chemical activity to structure: An examination of ILP successes, New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4), 1995, 411–434.
[25] Knuutila, T., Steinby, M.: Inference of tree languages from a finite sample: an algebraic approach, Theoretical Computer Science, 129, 1994, 337–367.
[26] Kosala, R., Bruynooghe, M., den Bussche, J. V., Blockeel, H.: Information Extraction from web documents based on local unranked tree automaton inference, Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-2003), 2003.
[27] Kosala, R., Bussche, J., Bruynooghe, M., Blockeel, H.: Information Extraction in Structured Documents using Tree Automata Induction, 6th European Conference on Principles and Practise of Knowledge Discovery in Databases (PKDD’02), 2431, Springer, 2002.
[28] Lyngsø, R., Pedersen, C., Nielsen, H.: Metrics and Similarity Measures for Hidden Markov Models, 7th International Conference on Intelligent Systems for Molecular Biology, ISMB ’99 Proceedings, AAAI Press USA, Heidelberg, Germany, 1999.
[29] Miyahara, T., Suzuki, Y., Shoudai, T., Uchida, T., Takahashi, K., Ueda, H.: Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents, PAKDD 2002, Taipei, Taiwan, May 2002.
[30] Ney, H., Essen, U., Kneser, R.: On Structuring Probabilistic Dependences in Stochastic Language Modelling, Computer Speech and Language, 8, 1994, 1–38.
[31] Nijssen, S., Kok, J. N.: Efficient Discovery of Frequent Unordered Trees, Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences (MGTS-2003) (L. de Raedt, T. Washio, Eds.), september 2003, ECML/PKDD’03 workshop proceedings.
[32] Nock, R., Sebban, M.: Sharper Bounds for the Hardness of Prototype and Feature Selection, International Conference on Algorithmic Learning Theory (S. V. L. 1968, Ed.), 2000.
[33] Rico-Juan, J., Calera, J., Carrasco, R.: Probabilistic k-Testable Tree-Languages, ICGI 2000, Lisbon (Portugal) (A. L. Oliveira, Ed.), 1891, Springer, Berlin, September 2000.
[34] Rico-Juan, J., Calera-Rubio, J., Carrasco, R.: Stochastic k-testable Tree Languages and Applications, ICGI 2002, 2484, Springer-Verlag, Amsterdam (Nederland), september 2002.
[35] Sakakibara, Y.: Efficient Learning of Context-Free Grammars from Positive Structural Examples, Information and Computation, 97, 1992, 23–60.
[36] Termier, A., Rousset, M., Sebag, M.: TreeFinder: a First Step towards XML Data Mining, International Conference on Data Mining (ICDM’02), 2002.
[37] Wilson, D.,Martinez, T.: Reduction Techniques for Instance-Based Learning Algorithms, Machine Learning, 38(3), 2000, 257–286.
[38] Zaki, M.: Efficiently mining frequent trees in a forest, Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD), 2002.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-article-BUS2-0007-0023