Choosing a persistent storage for data mining task

Kasprowski, P.

Article details

Article title

Choosing a persistent storage for data mining task

Authors

Kasprowski P.

Title variants

Wybór sposobu przechowywania danych stosowanych w data miningu

Abstracts

The amount of data available for mining or machine learning keeps increasing. One of the main problems in present-day data mining is therefore deciding how to store that data persistently so that mining algorithms can load and save it easily and quickly. When the data is too big to fit in memory, there are two common ways to handle it: a text or binary file in a custom format, or a ready-to-use general-purpose database engine. Both have advantages and disadvantages. Among database engines, the most popular choice is a relational database server. Recently, non-relational databases such as document-oriented databases have become another promising option. The work presented in this paper analyses how different storage options behave for large amounts of data. The experiments compare the efficiency of these storage options for some classic mining tasks.

The amount of data available for analysis and for learning algorithms grows year by year. As a result, storing this data persistently, in a way that allows fast reads and at the same time safe writes of analysis results, is becoming an increasingly serious problem. The two most popular solutions to persistent data storage are files in a custom binary or text format and relational databases. Both approaches have their advantages and disadvantages. Recently, non-relational databases have started to gain popularity. The paper presents an experiment that compares the capabilities of these three ways of storing data.

Keywords

data mining; persistent storage; document-oriented database

data mining; data storage; non-relational databases

Publisher

Wydawnictwo Politechniki Śląskiej

Journal

Studia Informatica

Year

2012

Volume

Vol. 33, no. 2B

Pages

509-520

Physical description

Bibliography: 27 items.

Contributors

author

Kasprowski P.

Politechnika Śląska, Instytut Informatyki, Pawel.Kasprowski@polsl.pl

Bibliography

1. Anderson J.: CouchDB: The Definitive Guide. O’Reilly, 2010.
2. Apache Jackrabbit, http://jackrabbit.apache.org/.
3. Caraciolo M.: Map Reduce with MongoDB and Python. Artificial Intelligence in Motion blog, http://aimotion.blogspot.com/2010/08/mapreduce-with-mongodb-and-python.html.
4. Codd E. F.: A relational model of data for large shared data banks. Commun. ACM, 1970.
5. Deshpande A., Madden S.: MauveDB: Supporting Model-based User Views in Database Systems. SIGMOD, 2006.
6. Džeroski S., Lavrač N.: Relational data mining. Springer Verlag, Berlin, Heidelberg 2001.
7. Extensible Markup Language (XML), W3C domain, http://www.w3.org/XML.
8. Hand D. J., Mannila H., Smyth P.: Principles of Data Mining. MIT Press, 2002.
9. Holsheimer M., Kersten M., Mannila H., Toivonen H.: A Perspective on Databases and Data Mining. KDD, 1995.
10. Imieliński T., Virmani A.: MSQL: A Query Language for Database Mining. Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1999.
11. Introducing JSON (JavaScript Object Notation), http://www.json.org.
12. JSON in Java, http://www.json.org/java/index.html.
13. Kasprowski P., Ober J.: Eye movements in biometrics. Biometric Authentication Workshop, European Conference on Computer Vision ECCV’2004, Lecture Notes in Computer Science, Springer, Prague 2004.
14. Meo R., Psaila G., Ceri S.: A New SQL-like Operator for Mining Association Rules. Proceedings of the 22nd VLDB Conference, India, 1996.
15. MongoDB document-oriented storage, http://www.mongodb.org.
16. MySQL web page, http://www.mysql.com.
17. NOSQL Databases, http://nosql-database.org.
18. Oram A.: MongoDB experts model the move from a relational database to MongoDB. O’Reilly Community, http://broadcast.oreilly.com/2010/04/mongodb-experts-model-the-move.html.
19. Ordonez C., Pitchaimalai S. K.: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling. Data & Knowledge Engineering, Vol. 69, No. 5, 2010, pp. 383-398.
20. Ordonez C.: Integrating K-Means Clustering with a Relational DBMS Using SQL. IEEE Transactions On Knowledge And Data Engineering, Vol. 18, No. 2, 2006.
21. Ordonez C.: Programming the Kmeans Clustering Algorithm in SQL. KDD, 2004.
22. PostgreSQL database system web page, http://www.postgresql.org.
23. Raedt L. D.: A perspective on inductive databases. ACM SIGKDD Explorations Newsletter, 2002.
24. Sarawagi S., Thomas S., Agrawal R.: Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD, 1998.
25. Suresh L., Simha J. B.: Novel and Efficient Clustering Algorithm Using Structured Query Language. Proceedings of the 2008 International Conference on Computing, Communication and Networking, 2008.
26. Vicknair C.: A Comparison of a Graph Database and a Relational Database. ACM Proceedings of the 48th Annual Southeast Regional Conference, New York 2010.
27. Zhang T., Ramakrishnan R., Livny M.: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD, 1996.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BSL2-0026-0101