Sample Footprints für Data-Warehouse-Datenbanken

Rösch, Philipp; Lehner, Wolfgang

doi:10.1007/s00450-009-0100-x

Sample Footprints für Data-Warehouse-Datenbanken

Original Paper
Published: 01 August 2009

Volume 25, pages 217–233, (2010)
Cite this article

Computer Science - Research and Development

Philipp Rösch¹ &
Wolfgang Lehner¹

107 Accesses
Explore all metrics

Zusammenfassung

Durch stetig wachsende Datenmengen in aktuellen Data-Warehouse-Datenbanken erlangen Stichproben eine immer größer werdende Bedeutung. Insbesondere interaktive Analysen können von den signifikant kürzeren Antwortzeiten der approximativen Anfrageverarbeitung erheblich profitieren. Linked-Bernoulli-Synopsen bieten in diesem Szenario speichereffiziente, schemaweite Synopsen, d. h. Synopsen mit Stichproben jeder im Schema enthaltenen Tabelle bei minimalem Mehraufwand für die Erhaltung der referenziellen Integrität innerhalb der Synopse. Dies ermöglicht eine effiziente Unterstützung der näherungsweisen Beantwortung von Anfragen mit beliebigen Fremdschlüsselverbundoperationen.

In diesem Artikel wird der Einsatz von Linked-Bernoulli-Synopsen in Data-Warehouse-Umgebungen detaillierter analysiert. Dies beinhaltet zum einen die Konstruktion speicherplatzbeschränkter, schemaweiter Synopsen, wobei unter anderem folgende Fragen adressiert werden: Wie kann der verfügbare Speicherplatz auf die einzelnen Stichproben aufgeteilt werden? Was sind die Auswirkungen auf den Mehraufwand? Zum anderen wird untersucht, wie Linked-Bernoulli-Synopsen für die Verwendung in Data-Warehouse-Datenbanken angepasst werden können. Hierfür werden eine inkrementelle Wartungsstrategie sowie eine Erweiterung um eine Ausreißerbehandlung für die Reduzierung von Schätzfehlern approximativer Antworten von Aggregationsanfragen mit Fremdschlüsselverbundoperationen vorgestellt.

Eine Vielzahl von Experimenten zeigt, dass Linked-Bernoulli-Synopsen und die in diesem Artikel präsentierten Verfahren vielversprechend für den Einsatz in Data-Warehouse-Datenbanken sind.

Abstract

With the amount of data in current data warehouse databases growing steadily, random sampling is continuously gaining in importance. In particular, interactive analyses of large datasets can greatly benefit from the significantly shorter response times of approximate query processing. In this scenario, Linked Bernoulli Synopses provide memory-efficient schema-level synopses, i. e., synopses that consist of random samples of each table in the schema with minimal overhead for retaining foreign-key integrity within the synopsis. This provides efficient support to the approximate answering of queries with arbitrary foreign-key joins. In this article, we focus on the application of Linked Bernoulli Synopses in data warehouse environments. On the one hand, we analyze the instantiation of memory-bounded synopses. Among others, we address the following questions: How can the given space be partitioned among the individual samples? What is the impact on the overhead? On the other hand, we consider further adaptations of Linked Bernoulli Synopses for usage in data warehouse databases. We show how synopses can incrementally be kept up-to-date when the underlying data changes. Further, we suggest additional outlier handling methods to reduce the estimation error of approximate answers of aggregation queries with foreign-key joins. With a variety of experiments, we show that Linked Bernoulli Synopses and the proposed techniques have great potential in the context of data warehouse databases.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Literatur

Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) Join Synopses for Approximate Query Answering. In: SIGMOD, S 275–286
Acharya S, Gibbons PB, Poosala V (2000) Congressional Samples for Approximate Answering of Group-By Queries. In: SIGMOD, S 487–498
Babcock B, Chaudhuri S, Das G (2003) Dynamic Sample Selection for Approximate Query Processing. In: SIGMOD, S 539–550
Brown PG, Haas PJ (2003) BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data. In: VLDB, S 668–679
Chakrabarti K, Garofalakis MN, Rastogi R, Shim K (2000) Approximate Query Processing Using Wavelets. In: VLDB, S 111–122
Chaudhuri S, Das G, Datar M, Motwaniand R, Narasayya VR (2001) Overcoming Limitations of Sampling for Aggregation Queries. In: ICDE, S 534–544
Chaudhuri S, Das G, Srivastava U (2004) Effective Use of Block-Level Sampling in Statistics Estimation. In: SIGMOD, S 287–298
Chaudhuri S, Dayal U (1997) An Overview of Data Warehousing and OLAP Technology. SIGMOD Rec 26(1):65–74
Article Google Scholar
Chaudhuri S, Motwani R, Narasayya V (1999) On Random Sampling over Joins. In: SIGMOD, S 263–274
Deligiannakis A, Garofalakis MN, Roussopoulos N (2005) A Fast Approximation Scheme for Probabilistic Wavelet Synopses. In: SSDBM, S 243–252
Salley CT, Codd EF, Codd SB (1993) Providing OLAP to User-Analysts: An IT Mandate
Garofalakis M, Gibbons PB (2004) Probabilistic Wavelet Synopses. ACM Trans Database Syst 29(1):43–90
Article Google Scholar
Garofalakis M, Kumar A (2005) Wavelet Synopses for General Error Metrics. ACM Trans Database Syst 30(4):888–928
Article Google Scholar
Gemulla R, Rösch P, Lehner W (2008) Linked Bernoulli Synopses: Sampling Along Foreign-Keys. In: SSDBM, S 6–23
Getoor L, Taskar B, Koller D (2001) Selectivity Estimation using Probabilistic Models. In: SIGMOD, S 461–472
Ioannidis YE, Poosala V (1999) Histogram-Based Approximation of Set-Valued Query-Answers. In: VLDB, S 174–185
Johnson T, Muthukrishnan S, Rozenbaum I (2005) Sampling Algorithms in a Stream Operator. In: SIGMOD, S 1–12
Kollios G, Gunopulos D, Koudas N, Berchtold S (2001) An Efficient Approximation Scheme for Data Mining Tasks. In: ICDE, S 453–462
Poosala V, Ganti V, Ioannidis YE (1999) Approximate Query Answering Using Histograms. IEEE Data Eng Bull 22:5–14
Rösch P, Gemulla R, Lehner W (2008) Designing Random Sample Synopses with Outliers. In: ICDE, S 1400–1402
Spiegel J, Polyzotis N (2006) Graph-Based Synopses for Relational Selectivity Estimation. In: SIGMOD, S 205–216
Toivonen H (1996) Sampling Large Databases for Association Rules. In: VLDB, S 134–145
Tuy H (2000) Monotonic Optimization: Problems Solution Approaches. SIAM J Opt 11(2):464–494
Article MathSciNet MATH Google Scholar
Vitter JS, Wang M (1999) Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets. In: SIGMOD, S 193–204

Download references

Author information

Authors and Affiliations

Fakultät Informatik, Institut für Systemarchitektur, Lehrstuhl für Datenbanken, Technische Universität Dresden, 01062, Dresden, Deutschland
Philipp Rösch & Wolfgang Lehner

Authors

Philipp Rösch
View author publications
You can also search for this author in PubMed Google Scholar
Wolfgang Lehner
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Philipp Rösch.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Rösch, P., Lehner, W. Sample Footprints für Data-Warehouse-Datenbanken. Comput Sci Res Dev 25, 217–233 (2010). https://doi.org/10.1007/s00450-009-0100-x

Download citation

Received: 28 August 2008
Accepted: 26 March 2009
Published: 01 August 2009
Issue Date: September 2010
DOI: https://doi.org/10.1007/s00450-009-0100-x

Schlagworte

Keywords

CR subject classification

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Sample Footprints für Data-Warehouse-Datenbanken

Zusammenfassung

Abstract

Access this article

Literatur

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Schlagworte

Keywords

CR subject classification

Search

Navigation