Organizing and Accessing a Data Warehouse on Tertiary Storage

Theodore Johnson

AT&T Labs - Research
180 Park Ave, Bldg. 103
Florham Park, NJ 07932
johnsont@research.att.com
973 360 8779

Telecommunications monitoring applications (call-detail, traffic patterns, etc.) generate very large data sets, on the order of tens of gigabytes per day. Warehousing all of this data on on-line storage (e.g., magnetic disk) is usually infeasible because of cost. The typical method for handling such large data sets is to store summaries and samples of the raw data on-line, and to dump the raw data stream to tape. Access to the tape-resident data is usually slow and difficult, which discourages experimentation.

We are developing techniques that permit easy and fast access to tape-resident data. Recent advances in robotic storage library technology have made available compact and low-cost devices that store tens of terabytes and have a data throughput of tens of megabytes per second. However, the unusual performance characteristics of robotic storage libraries make it difficult to achieve efficient access to the data.
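To illustrate why positioning overhead matters, here is a minimal sketch of a cost model for a single read from a tape in a robotic library. It assumes that the time to serve a request is the sum of a mount (robot exchange and load) time, a seek time, and the transfer time. The function name and all parameter values are hypothetical, chosen only for illustration; they are not measurements of any device.

```python
def effective_throughput_mb_s(request_mb, transfer_mb_s, seek_s, mount_s=0.0):
    """Delivered MB/s for one request, including positioning overhead.

    Simplified model: total time = mount + seek + transfer.
    """
    total_s = mount_s + seek_s + request_mb / transfer_mb_s
    return request_mb / total_s


# Hypothetical parameters for illustration only (not measured values).
TRANSFER_MB_S = 10.0   # assumed streaming transfer rate
SEEK_S = 50.0          # assumed average seek time on the tape
MOUNT_S = 40.0         # assumed robot exchange plus load time

# A small random read is dominated by mount and seek time.
small = effective_throughput_mb_s(1.0, TRANSFER_MB_S, SEEK_S, MOUNT_S)

# A large sequential read amortizes the same overhead.
large = effective_throughput_mb_s(9000.0, TRANSFER_MB_S, SEEK_S, MOUNT_S)

print(f"1 MB request:    {small:.4f} MB/s")
print(f"9000 MB request: {large:.2f} MB/s")
```

Under these assumed values, the small request delivers only a tiny fraction of the streaming rate, while the large request comes close to it. This gap is the reason data layout and request scheduling matter for tape-resident data.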

In this talk, we present issues and techniques for high-performance access to tape-resident data. We start with a performance characterization of common robotic storage library devices, based on measurements that we have taken (including a DLT4000 tape drive and a StorageTek 9710 robotic storage library). We then discuss the impact of these characteristics on common data access operations, such as indexing, streaming access, parallel I/O, and join processing.

The problem of automating efficient access to tape-resident data has been addressed by several researchers, including ourselves. We present a summary of this work on data layout, indexing, striping, join processing, and sampling. We conclude with a discussion of research directions.