### Astrophysics and Algorithms: A DIMACS Workshop on Massive Astronomical Data Sets

#### Title:

The MACHO Project

#### Author:

David Bennett
University of Notre Dame

#### Abstract:

The MACHO Project has been taking data since late 1992, and has now accumulated more than 70,000 dual color images of 1/2 square degree fields of the Magellanic Clouds and the Galactic bulge for a total of 5.3 Tb of raw image data. The observed fields are rather crowded, with an average density of about 1 million detected stars per square degree. More than 80% of the data taken to date has been reduced with our SoDOPHOT point spread function fitting photometry program. The reduced photometry database contains 50 billion individual photometric measurements and occupies 500 Gb of storage space, which is split between rotating disk and a robotic tape library. The incoming data, up to 7 Gb of image data per night, is reduced within a few hours of data taking so that new gravitational microlensing events can be discovered and announced in progress. The photometry database is regularly accessed both by the alert system, which requires rapid access to the lightcurves of a few stars, and by complete analysis passes, which must sequentially access several hundred Gb of reduced data.

Some of the computational problems that the MACHO Project has faced (and solved) will be discussed.

#### Title:

The USNO PMM Program

#### Author:

Dave Monet
US Naval Observatory Flagstaff Station

#### Abstract:

At last count, the U.S. Naval Observatory's Precision Measuring Machine has digitized and processed 7,560,089,606,712 pixels from the Palomar, ESO, AAO, UKST, and Lick photographic sky survey plates, and 52-byte records have been computed for each of 6,648,074,159 detections. The pixel database occupies 1866 rolls of 8-mm tape (only 5,199,639,984,384 pixels were saved), and the detection database occupies 581 CD-ROMs housed in two jukeboxes. The PMM program's first catalog was USNO-A1.0 (see http://www.usno.navy.mil/pmm for details), and completion of its known tasks will take another two years. The presentation will include a brief description of the PMM, some of the lessons learned during the first 3.5 years of operation, and a discussion of the problems anticipated in going from G-rated products such as USNO-A to X-rated products such as public access to the pixel and detection databases.
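The storage figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers in the abstract and assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the variable names are illustrative.

```python
# Figures quoted in the abstract.
detections = 6_648_074_159
bytes_per_record = 52
cdroms = 581

# Total size of the detection database and the average load per disc.
total_bytes = detections * bytes_per_record
per_disc_mb = total_bytes / cdroms / 1e6

print(f"detection database: {total_bytes / 1e9:.1f} GB")
print(f"average per CD-ROM: {per_disc_mb:.0f} MB")
```

This gives roughly 346 GB in total, or about 595 MB per disc, which is consistent with the capacity of a CD-ROM.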

#### Title:

DSS-II and GSC-II: STScI All-Sky Image and Catalog Databases

#### Author:

Barry M. Lasker, Gretchen R. Greene, Mario J. Lattanzi,
Brian J. McLean, and Antonio Volpicelli
ST ScI and OATo

#### Abstract:

A program of digitizing photographic sky survey plates (DSS-II), now quite close to completion, is approaching its final size, a 5 Tbyte collection of 1.1 Gbyte plate scans that cover the entire sky in 42 square degree fields. A set of image processing and object recognition tools applied to these data then results in a list of 4E9 (estimated) objects constituting the second Guide Star Catalog (GSC-II), which consists of positions, proper motions, magnitudes, and colors for each object. In order to preserve generality in the exploitation of these data, we maintain the connection between the images (plate scans) and the GSC-II catalog objects by associating the plate-calibration data (astrometry, photometry, classification) in FITS-like header structures pertinent to each plate.

Internally, all the GSC-II data, i.e., both the raw plate measures and the calibrated astronomical results, are stored in a database called COMPASS (Catalog of Objects and Measured Parameters from All-Sky Surveys). COMPASS, an object-oriented system built on the Objectivity (tm) DBMS, has an expected final size of 4 Tbytes, is structured for identifying systematic calibration effects so as to optimize the calibrations, and is organized on the sky with the hierarchical triangulated mesh developed by the SDSS Archive team. COMPASS is also used to support consistent object naming between plates, as well as cross-matching with other optical surveys and with data from other wavebands. A much smaller "export" catalogue, in ESO SkyCat format (about 100 Gbyte), will also be produced.

#### Title:

The Two Micron All Sky Survey

#### Author:

Carol Lonsdale
IPAC, JPL/Caltech

#### Abstract:

The 2 Micron All Sky Survey (2MASS) project, a collaboration between the University of Massachusetts (Dr. Mike Skrutskie, PI) and the Infrared Processing and Analysis Center, JPL/Caltech funded primarily by NASA and the NSF, will scan the entire sky utilizing two new, highly automated 1.3m telescopes at Mt. Hopkins, AZ and at CTIO, Chile. Each telescope simultaneously scans the sky at J, H and Ks with a three channel camera using 256x256 arrays of HgCdTe detectors to detect point sources brighter than about 1 mJy (to SNR=10), with a pixel size of 2.0 arcseconds. The data rate is $\sim 19$ Gbyte per night, with a total processed data volume of 13 Tbytes of images and 0.5 Tbyte of tabular data. The 2MASS data is archived nightly into the Infrared Science Information System at IPAC, which is based on an Informix database engine, judged at the time of purchase to have the best commercially available indexing and parallelization flexibility, and a 5 Tbyte-capacity RAID multi-threaded disk system with multi-server shared disk architecture. I will discuss the challenges of processing and archiving the 2MASS data, and of supporting intelligent query access to them by the astronomical community across the net, including possibilities for cross-correlation with other remote data sets.

#### Author:

Richard L. White (STScI)
Robert H. Becker (UC-Davis & LLNL/IGPP)
David J. Helfand (Columbia)

#### Abstract:

The FIRST (Faint Images of the Radio Sky at Twenty-cm) survey began in 1993 and has to date covered 4800 square degrees of the north and south Galactic caps. The NRAO Very Large Array is used to create 1.4 GHz images with a resolution of 5.4 arcsec and a 5-sigma sensitivity of 1 mJy for point sources. Both the sensitivity and spatial resolution are major improvements over previous radio surveys.

The FIRST survey has some unusual characteristics compared with most other surveys discussed at this workshop. Our data volume is not so overwhelming (the total image data currently stands at 0.6 Tbytes), but the data processing involved in constructing the final images is computationally intensive. It requires about 17 hours of CPU time on a Sparc-20 processor to process a square degree of sky (only 20 minutes of VLA observing); the production of the current image database consumed 9 years of Sparc-20 processing time!
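The processing-time figure can be reproduced from the numbers quoted in the abstract. This is only a rough check: it assumes the full 4800 square degrees covered to date have been processed at 17 CPU hours each, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract.
cpu_hours_per_sq_deg = 17.0
observing_minutes_per_sq_deg = 20.0
sq_deg_covered = 4800  # assumption: all of the area covered to date was processed

total_cpu_hours = cpu_hours_per_sq_deg * sq_deg_covered
cpu_years = total_cpu_hours / (24 * 365.25)

# Ratio of processing time to telescope time for the same patch of sky.
cpu_to_observing = cpu_hours_per_sq_deg * 60 / observing_minutes_per_sq_deg

print(f"total CPU time: {total_cpu_hours:.0f} h = {cpu_years:.1f} years")
print(f"CPU time / observing time: {cpu_to_observing:.0f}")
```

The result is about 9.3 years of Sparc-20 time, in line with the quoted 9 years, and about 51 hours of processing for every hour of observing.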

The data reduction for the FIRST survey has been carried out on a shoestring. The imaging pipeline was developed by 2 to 3 people and has been operated by a single person (RHB) for practically the entire project. It consequently must be highly automated and robust, which is non-trivial for radio imaging.

Finally, the FIRST survey is being carried out using a national telescope facility. This makes some things easier (we did not have to build a telescope) and some harder (we must fight continually to maintain our observing time allocation.)

Both images and catalogs from the FIRST survey are released essentially immediately after their construction. They are available on the web at http://sundog.stsci.edu

#### Title:

THE NASA/IPAC EXTRAGALACTIC DATABASE

#### Author:

Co-Director
NASA/IPAC Extragalactic Database
Infrared Processing and Analysis Center
Jet Propulsion Laboratory
California Institute of Technology

#### Abstract:

NED has been operating in the public domain since 1990. Originally composed of a merger of a few well known catalogs of galaxies containing around 30,000 entries each, the object database has now grown to over 750,000, and will soon exceed 3,000,000 extragalactic objects.

The problems and challenges unique to a heterogeneous scientific database will be addressed. Design of a successful user interface will be discussed, and the existing shortcomings of NED will be highlighted.

The question of doing original and meaningful research with a literature-based database in extragalactic astronomy will be critically reviewed, and plans for upgrading NED in the near future will be explored.

#### Title:

Automated Galaxy Classification in Large Sky Surveys

#### Author:

S. C. Odewahn
Caltech

#### Abstract:

Current efforts to perform automatic galaxy classification using artificial neural network image classifiers are reviewed. For both DPOSS Schmidt plate and WFPC2 CCD imagery, a variety of two-dimensional photometric parameter spaces produce a segregation by Hubble type. Through the use of hidden node layers, an artificial neural network is capable of mapping complicated, highly nonlinear data spaces. This powerful technique is used to map a multivariate photometric parameter space to the revised Hubble system of galaxy classification. I discuss a new morphological classification approach using Fourier image models to identify barred and ringed spiral systems. Multi-color photometric and morphological type catalogs derived from large image data sets provided by new ground and space-based surveys will be used to compute wavelength-dependent galaxy number counts (see HST example below in Panel B) over a large range in apparent magnitude and provide an observational basis for studies of galaxy formation and evolution.

#### Title:

The Sloan Digital Sky Survey and its Science Database

#### Author:

Alex Szalay
Johns Hopkins University

#### Abstract:

Astronomy is about to undergo a major paradigm shift, with data sets becoming larger and more homogeneous, and for the first time designed in a top-down fashion. In a few years it may be much easier to "dial up" a part of the sky when we need a rapid observation than to wait several months for access to a (sometimes quite small) telescope. With several projects in multiple wavelengths under way, like the SDSS, 2MASS, GSC-2, POSS2, ROSAT, FIRST and DENIS projects, each surveying a large fraction of the sky, the concept of having a "digital sky," with multiple, TByte-size databases interoperating in a seamless fashion, is no longer an outlandish idea. More and more catalogs will be added and linked to the existing ones, query engines will become more sophisticated, and astronomers will have to be just as familiar with mining data as with observing on telescopes.

The Sloan Digital Sky Survey is a project to digitally map about $1/2$ of the Northern sky in five filter bands from UV to the near IR, and is expected to detect over 200 million objects in this area. Simultaneously, redshifts will be measured for the brightest 1 million galaxies. The SDSS will revolutionize the field of astronomy, increasing the amount of information available to researchers by several orders of magnitude. The resultant archive that will be used for scientific research will be large (exceeding several Terabytes) and complex: textual information, derived parameters, multi-band images, and spectra. The catalog will allow astronomers to study the evolution of the universe in greater detail and is intended to serve as the standard reference for the next several decades. As a result, we felt the need to provide an archival system that would simplify the process of "data mining" and shield researchers from any underlying complex architecture. In our efforts, we have invested a considerable amount of time and energy in understanding how large, complex data sets can be explored.

#### Title:

Mathematical Methods for Mining in Massive Data Sets

#### Author:

Helene E. Kulsrud
Center for Communications Research - Princeton / Institute for Defense Analyses

#### Abstract:

With the advent of higher bandwidth and faster computers, distributed data sets in the petabyte range are being collected. The problem of obtaining information quickly from such data bases requires new and improved mathematical methods. Parallel computation and scaling issues are important areas of research. Techniques such as decision trees, vector-space methods, Bayesian and neural nets have been utilized. A short description of some successful methods and the problems to which they have been applied will be presented.

#### Title:

Trends in High-End Computing and Storage Technologies: Implications for Astronomical Data Analysis

#### Author:

Tom Prince
Caltech

#### Abstract:

In certain areas of astronomical research, advances in computing and information technologies will determine the shape and scope of future research activities. I will review trends and projections for computing, storage, and networking technologies, and explore some of the possible implications for astronomical research. I will discuss several examples of technology-enabled data analysis projects including the Digital Sky project and the search for gravitational waves by LIGO.

#### Title:

Inverse Problems in Helioseismology

#### Author:

Sarbani Basu
Institute for Advanced Study, Princeton, NJ

#### Abstract:

Helioseismology is the study of the Sun using data obtained by monitoring solar oscillations. The data consist of frequencies of normal modes which are most commonly described by spherical harmonics and have three "quantum" numbers associated with them -- the radial order $n$, the degree $\ell$ and the azimuthal order $m$. In the absence of asphericities, all modes with the same $n$ and $\ell$ have the same frequency, and the frequency is determined by the spherically symmetric structure. Asymmetry is introduced mainly by rotation and causes the $(n,\ell)$ multiplet to "split" into $2\ell +1$ components.

To date, the frequencies of about $10^6$ modes have been measured. These therefore provide $10^6$ observational constraints in addition to the usual constraints of mass, radius and luminosity. However, no solar model constructed so far has been able to reproduce the observed frequencies to within errors. Hence, the interior of the Sun is studied by inverting the observed frequencies. There are essentially two types of inversion problems in helioseismology. The first is inverting for rotation, which is a linear inversion problem, and the second is inversion to obtain solar structure, which is not a linear problem and hence needs to be linearized before it can be solved. In this talk I shall describe some of the common methods used in helioseismic inversions and talk about some of the techniques used to reduce the problem to a manageable form -- both in terms of memory and time required.
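A linear inversion of this kind can be illustrated with regularized least squares, one widely used family of methods; the abstract does not say which methods the talk covers. The sketch below is a toy under stated assumptions: the discretized problem is $d = Kx + \text{noise}$, the kernels are synthetic Gaussians rather than real helioseismic kernels, and the smoothing weight `mu` is chosen by hand.

```python
import numpy as np

def rls_invert(K, d, sigma, mu):
    """Minimize ||(K x - d) / sigma||^2 + mu * ||L x||^2,
    where L is the second-difference operator (penalizes roughness)."""
    n = K.shape[1]
    L = np.diff(np.eye(n), n=2, axis=0)      # shape (n - 2, n)
    Kw = K / sigma[:, None]                  # error-weighted kernels
    dw = d / sigma                           # error-weighted data
    A = Kw.T @ Kw + mu * (L.T @ L)           # normal equations
    return np.linalg.solve(A, Kw.T @ dw)

# Synthetic test problem (all values hypothetical).
rng = np.random.default_rng(0)
r = np.linspace(0.0, 1.0, 60)                          # radial grid
centers = np.linspace(0.05, 0.95, 200)                 # one kernel per "mode"
K = np.exp(-0.5 * ((r[None, :] - centers[:, None]) / 0.08) ** 2)
K /= K.sum(axis=1, keepdims=True)                      # each kernel sums to 1
x_true = 1.0 + 0.3 * np.sin(2 * np.pi * r)             # profile to recover
sigma = np.full(len(centers), 1e-3)                    # measurement errors
d = K @ x_true + sigma * rng.standard_normal(len(centers))

x_est = rls_invert(K, d, sigma, mu=1e4)
rms_error = float(np.sqrt(np.mean((x_est - x_true) ** 2)))
print("rms error of recovered profile:", rms_error)
```

With $10^6$ modes the normal-equation matrix is still only as large as the radial grid, which is one reason formulations of this type stay manageable in memory.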

#### Title:

"Fast" Statistical Methods for Interpolation and Model Fitting in One-Dimensional Data

#### Author:

Bill Press
Harvard University

#### Abstract:

There exist several "fast" (in the sense of linear running time) methods for applying the full machinery of linear prediction and global linear fitting to large one-dimensional data sets such as time series or spectra. These methods make practical calculations which would otherwise have been rejected for their $N^3$ running times. This talk will review the status of these methods and give applications. The software for applying these methods is available, free, on the web.

#### Title:

Science With Digital POSS-II (DPOSS)

#### Author:

S. G. Djorgovski
Caltech

#### Abstract:

The ongoing processing of the digitized POSS-II (DPOSS) will result in a catalog containing over 50 million galaxies and over 2 billion stellar objects, complete down to the equivalent limiting magnitude of B ~ 22 mag, over the entire northern sky. The creation, maintenance, and effective scientific exploration of this huge dataset have posed non-trivial technical challenges. A great variety of scientific projects will be possible with this vast new data base, including studies of the large-scale structure in the universe and of the Galactic structure, automatic optical identifications of sources from radio through x-ray bands, generation of objectively defined catalogs of clusters and groups of galaxies, generation of statistically complete catalogs of galaxies to be used in redshift surveys, searches for high-redshift quasars and other active objects, searches for variable or extreme-color objects, etc.

#### Title:

Efficient Width Computation of High-Dimensional Point Sets

#### Author:

Andreas Brieden
Technische Universitaet Muenchen

#### Abstract:

In analyzing high-dimensional point sets, several geometric quantities may play an important role. For example, assume that a point set originally located in an $(n-1)$-dimensional hyperspace can only be measured, owing to the influence of noise, to be in an $n$-dimensional space. Then knowledge of the (Euclidean) width of the convex hull of this point set, and of a width-generating hyperplane, can be used to project the data back into a proper hyperspace. Iterating this process, it is also possible to project point sets to lower-dimensional subspaces.

In this talk, efficient approximation algorithms for the width-computation are presented that turn out to be asymptotically optimal (with respect to a standard computing model in computational complexity). The presented approach can be extended to other quantities like diameter, inradius, circumradius and the norm-maximum in $l_p$-spaces.
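To make the notion of width concrete: the width of a point set is the minimum, over all unit directions $u$, of the extent of the set when projected onto $u$. The sketch below is a naive random-direction search, not the approximation algorithm of the talk; it only yields an upper bound on the width, and the function name, direction count and test data are illustrative assumptions.

```python
import numpy as np

def width_upper_bound(points, n_dirs=2000, seed=0):
    """Upper bound on the Euclidean width of a point set, obtained by
    trying random unit directions and keeping the narrowest projection."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, points.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit vectors
    proj = points @ dirs.T                                # (n_points, n_dirs)
    widths = proj.max(axis=0) - proj.min(axis=0)          # extent per direction
    best = int(np.argmin(widths))
    return float(widths[best]), dirs[best]

# Points that lie in the plane z = 0, perturbed by small noise in z.
rng = np.random.default_rng(42)
pts = rng.uniform(-1.0, 1.0, size=(500, 3))
pts[:, 2] = rng.uniform(-0.01, 0.01, size=500)

w, u = width_upper_bound(pts)
print("width bound:", w)
print("direction:", u)

# Project the data back onto the hyperplane orthogonal to u.
flattened = pts - np.outer(pts @ u, u)
```

The direction found is close to the z axis, and removing the component along it recovers an approximately planar point set. Random search degrades quickly as the dimension grows, which is why dedicated approximation algorithms are of interest.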

Joint work with Peter Gritzmann, Technische Universitaet Muenchen and Victor Klee, University of Washington, Seattle.

#### Title:

Shapefinders: a New Shape Diagnostic for Large-Scale Structure

#### Author:

Sergei F. Shandarin
Department of Physics and Astronomy, University of Kansas,
Lawrence, KS 66045

#### Abstract:

We construct a set of shape-finders which determine shapes of compact surfaces (iso-density surfaces in galaxy surveys or N-body simulations) without fitting them to ellipsoidal configurations as done earlier. The new indicators arise from simple, geometrical considerations and are derived from fundamental properties of a surface such as its volume, surface area, integrated mean curvature and connectivity characterized by the Genus. These "Shapefinders" could be used to diagnose the presence of filaments, pancakes and ribbons in large scale structure. Their lower-dimensional generalization may be useful for the study of two-dimensional distributions such as temperature maps of the Cosmic Microwave Background.
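The abstract does not give formulas, so the following is a sketch under assumptions. It takes three length scales built from the volume $V$, surface area $S$ and integrated mean curvature $C$, normalized so that all three equal the radius for a sphere, and forms two dimensionless ratios from them. The normalization, the function name and the labels "planarity" and "filamentarity" are assumptions here, not taken from the abstract.

```python
import math

def shapefinders(volume, area, mean_curvature):
    """Three length scales and two dimensionless shape statistics.

    mean_curvature is the surface integral of (k1 + k2) / 2.
    Normalization (assumed): all three lengths equal R for a sphere.
    """
    thickness = 3.0 * volume / area
    breadth = area / mean_curvature
    length = mean_curvature / (4.0 * math.pi)
    planarity = (breadth - thickness) / (breadth + thickness)
    filamentarity = (length - breadth) / (length + breadth)
    return thickness, breadth, length, planarity, filamentarity

# Sphere of radius R: V = 4/3 pi R^3, S = 4 pi R^2, C = 4 pi R.
R = 2.0
sphere = shapefinders(4.0 / 3.0 * math.pi * R**3, 4.0 * math.pi * R**2, 4.0 * math.pi * R)

# Long cylinder of radius a and height h, end caps ignored:
# V = pi a^2 h, S = 2 pi a h, C = pi h.
a, h = 1.0, 1000.0
cylinder = shapefinders(math.pi * a**2 * h, 2.0 * math.pi * a * h, math.pi * h)

print("sphere:  ", sphere)
print("cylinder:", cylinder)
```

For the sphere both ratios vanish, while for the long cylinder the second ratio approaches 1, which is the behavior one would want from a filament diagnostic.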

#### Title:

Challenges in Analysing Future CMB Space Missions

#### Author:

Francois Bouchet
Institut d'Astrophysique, Paris

#### Abstract:

The planned CMB missions (MAP/NASA/circa 2000 and PLANCK/ESA/circa 2005) will produce full sky maps of the microwave sky in different frequencies with resolutions better than half a degree. The optimal extraction of information (in particular cosmological) from the quite large database of "timelines" poses a variety of problems which I will survey. I will also describe some of the partial answers obtained so far in the context of the PLANCK scientific preparation.

#### Title:

CMB and LSS Power Spectrum Analysis

#### Author:

Max Tegmark

#### Abstract:

I describe numerical challenges involved in analyzing cosmic microwave background (CMB) and large-scale structure (LSS) data sets. These include mapmaking (regularized linear inversions), power spectrum estimation, Karhunen-Loeve data compression and computation of the Fisher information matrix for cosmological parameters.
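For zero-mean Gaussian data whose covariance $C$ depends on the parameters $\theta_i$, the Fisher information matrix takes the standard form $F_{ij} = \frac{1}{2}\,\mathrm{Tr}\left[C^{-1}\,\partial_i C\; C^{-1}\,\partial_j C\right]$. The sketch below evaluates this directly with dense linear algebra; the function name and the toy covariance model (an exponential signal template plus white noise) are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def fisher_matrix(C, dC):
    """F_ij = 0.5 * Tr[C^-1 dC_i C^-1 dC_j] for zero-mean Gaussian data.

    C  : (N, N) data covariance matrix
    dC : list of (N, N) derivatives of C with respect to each parameter
    """
    Cinv_dC = [np.linalg.solve(C, D) for D in dC]   # C^-1 dC_i, each O(N^3)
    n = len(dC)
    F = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            F[i, j] = F[j, i] = 0.5 * np.trace(Cinv_dC[i] @ Cinv_dC[j])
    return F

# Toy model: C = A * S + s2 * I, with parameters (A, s2).
N = 50
idx = np.arange(N)
S = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 5.0)   # signal template
A, s2 = 1.0, 0.5
C = A * S + s2 * np.eye(N)
F = fisher_matrix(C, [S, np.eye(N)])

# The inverse Fisher matrix gives the Cramer-Rao lower bounds.
print("lower bounds on parameter errors:", np.sqrt(np.diag(np.linalg.inv(F))))
```

Every step here costs $O(N^3)$ in the number of data points, which is the kind of scaling that makes these computations challenging for large maps.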

#### Title:

An Efficient and Stable Fast Spherical Transform Algorithm

#### Author:

Dan Rockmore
Dartmouth College

#### Abstract:

In this talk we explain and present an implementation of a fast spherical harmonic expansion algorithm. Asymptotically, and in exact arithmetic, we compute exactly a full spherical transform of a function with harmonics of at most order $N$ in $O(N^2 (\log N)^2)$ operations vs. $O(N^3)$ required by direct computation. We require a similar number of operations to perform the inverse transform which goes from Fourier coefficients to sample values.

The key component of the fast spherical transform algorithm is the fast Legendre transform which, assuming a precomputed data structure of size $O(N \log N)$, can be performed in $O(N (\log N)^2)$ operations. This asymptotic result is achieved by novel application of the three-term recurrence relation which Legendre functions satisfy. These are general techniques applicable to any set of orthogonal polynomials.
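For reference, the three-term recurrence for Legendre polynomials is $(n+1)P_{n+1}(x) = (2n+1)\,x\,P_n(x) - n\,P_{n-1}(x)$. The sketch below uses it in the direct $O(N^2)$ Legendre transform, which is the baseline that the fast algorithm improves on, not the fast algorithm itself. The function names and the choice of Gauss-Legendre sampling points are assumptions made for this illustration.

```python
import numpy as np

def legendre_table(n_max, x):
    """P_0 .. P_n_max evaluated at x via the three-term recurrence."""
    x = np.asarray(x, dtype=float)
    P = np.empty((n_max + 1,) + x.shape)
    P[0] = 1.0
    if n_max >= 1:
        P[1] = x
    for n in range(1, n_max):
        P[n + 1] = ((2 * n + 1) * x * P[n] - n * P[n - 1]) / (n + 1)
    return P

def legendre_coefficients(f, n_max):
    """Direct O(N^2) Legendre transform using Gauss-Legendre quadrature:
    c_n = (2n + 1) / 2 * integral of f(x) P_n(x) over [-1, 1]."""
    nodes, weights = np.polynomial.legendre.leggauss(n_max + 1)
    P = legendre_table(n_max, nodes)
    norms = (2 * np.arange(n_max + 1) + 1) / 2.0
    return norms * (P @ (weights * f(nodes)))

# Round trip: build f = 3 P_2 + 0.5 P_5 and recover its coefficients.
coeffs_in = np.array([0.0, 0.0, 3.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0])
f = lambda x: np.polynomial.legendre.legval(x, coeffs_in)
coeffs_out = legendre_coefficients(f, n_max=8)
print(np.round(coeffs_out, 12))
```

The matrix-vector product `P @ (...)` is the $O(N^2)$ step per transform; a full spherical transform repeats it for each azimuthal order, giving the $O(N^3)$ direct cost quoted above.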

Experimental results from our implementation on an HP Exemplar X-Class computer model SPP2000, show a significant speed-up at large problem sizes, with little degradation in numerical stability. There is also evidence which suggests that similar performance should be possible on an SGI Origin.

This is joint work with D. M. Healy (Dartmouth College), P. Kostelec (Dartmouth College) and S. S. B. Moore (GTE/BBN)

#### Title:

A Fast Method to Bound the CMB Power Spectrum Likelihood Function

#### Author:

Julian Borrill
Center for Particle Astrophysics, Berkeley, CA

#### Abstract:

As the Cosmic Microwave Background (CMB) radiation is observed to higher and higher angular resolution, the size of the resulting datasets becomes a serious constraint on their analysis. In particular, current algorithms to determine the location of, and curvature at, the peak of the power spectrum likelihood function from a general $N_{p}$-pixel CMB sky map scale as $O(N_{p}^{3})$. Moreover, the current best algorithm --- the quadratic estimator --- is a Newton-Raphson iterative scheme and so requires a "sufficiently good" starting point to guarantee convergence to the true maximum. Here we present an algorithm to calculate bounds on the likelihood function at any point in parameter space using Gaussian quadrature and show that, judiciously applied, it scales as only $O(N_{p}^{7/3})$.
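The $O(N_{p}^{3})$ cost can be seen in a direct evaluation of the Gaussian log-likelihood, $\ln\mathcal{L} = -\frac{1}{2}\left(d^{T}C^{-1}d + \ln\det C + N_{p}\ln 2\pi\right)$, where the dense factorization of the pixel covariance $C$ dominates. The sketch below shows that baseline evaluation only; it is not the bounding algorithm of the talk, and the function name is illustrative.

```python
import numpy as np

def gaussian_log_likelihood(d, C):
    """Exact log-likelihood of a zero-mean Gaussian map d with covariance C.

    The Cholesky factorization costs O(N^3) and dominates for large maps.
    """
    L = np.linalg.cholesky(C)                   # C = L L^T
    z = np.linalg.solve(L, d)                   # z = L^-1 d, so z.z = d^T C^-1 d
    chi2 = z @ z
    logdet = 2.0 * np.sum(np.log(np.diag(L)))   # ln det C from the factor
    return -0.5 * (chi2 + logdet + len(d) * np.log(2.0 * np.pi))

# Check against the closed form for uncorrelated pixels of variance 4.
d = np.array([1.0, 2.0, 3.0])
C = 4.0 * np.eye(3)
value = gaussian_log_likelihood(d, C)
print(value)
```

A maximization scheme such as Newton-Raphson needs this quantity, or its derivatives, at every iteration, so the cubic cost is paid many times over.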

#### Title:

Approaches to Gamma-Ray Burst Classification

#### Author:

Jon Hakkila (Mankato State U.), David J. Haglin (Mankato State U.),
Richard J. Roiger (Mankato State U.), Robert S. Mallozzi (U. Alabama Huntsville),
Geoffrey N. Pendleton (U. Alabama Huntsville), and Charles A. Meegan (NASA/MSFC)

#### Abstract:

An understanding of gamma-ray burst (GRB) physics is dependent upon interpreting the large body of GRB spectral and temporal data. Although many GRB spectral and temporal attributes have been identified by various researchers, considerable disagreement exists as to the physical meaning and relative importance of each. We present preliminary but promising attempts to classify GRBs using data mining techniques and artificial intelligence classification algorithms.

#### Title:

Multiscale Methods in Astronomical Image Processing, Cluster Analysis, and Information Retrieval

#### Author:

Fionn Murtagh
University of Ulster

#### Abstract:

We will survey multiresolution methods - discrete wavelet transforms and other multiscale transforms - in astronomical image processing and data analysis. Objectives include: noise filtering, deconvolution, visualization, image registration, object detection and image compression. A range of examples will be discussed.

The extension of this work to cater for detection of point pattern clusters will be described. Finally some very recent applications of this approach to large hypertext dependence arrays will be discussed.

References

[1] J-L Starck, F Murtagh and A Bijaoui, Image and Data Analysis: The Multiscale Approach, Cambridge University Press, to appear about April 1998.

[2] F Murtagh, "A palette of multiresolution applications", http://hawk.infm.ulst.ac.uk:1998/multires

#### Title:

A Multiscale Vision Model and Applications to Astronomical Image and Data Analyses

#### Author:

A. Bijaoui, E. Slezak, and B. Vandame
Observatoire de la Cote d'Azur, B.P. 229, 06304 Nice Cedex 4 France

#### Abstract:

Much research has been carried out on the automated identification of astrophysical sources and their relevant measurements. Some vision models have been developed for this task, their use depending on the image content. We have developed a multiscale vision model (MVM) (BR95) well suited for analyzing complex structures such as interstellar clouds, galaxies, or clusters of galaxies. Our model is based on a redundant wavelet transform. For each scale we detect significant wavelet coefficients by application of a decision rule based on their probability density functions (PDF) under the hypothesis of a uniform distribution. In the case of Poisson noise, this PDF can be determined from the autoconvolution of the wavelet function histogram (SLB93). We may also apply Anscombe's transform, scale by scale, in order to take into account the integrated number of events at each scale (FSB98).
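Anscombe's transform maps Poisson counts $x$ to $2\sqrt{x + 3/8}$, which has approximately unit variance when the mean count is not too small. The sketch below demonstrates this on simulated counts; the sample sizes and mean values are arbitrary choices for illustration, and the scale-by-scale application described above is not reproduced.

```python
import numpy as np

def anscombe(counts):
    """Variance-stabilizing transform for Poisson-distributed counts."""
    return 2.0 * np.sqrt(np.asarray(counts, dtype=float) + 3.0 / 8.0)

rng = np.random.default_rng(1)
stabilized = {}
for mean in (5, 20, 100):
    sample = rng.poisson(mean, size=200_000)
    stabilized[mean] = float(anscombe(sample).var())
    # Raw variance grows with the mean; transformed variance stays near 1.
    print(mean, float(sample.var()), stabilized[mean])
```

After the transform, a single Gaussian significance threshold can be applied regardless of the local count level, which is what makes it convenient for detecting significant coefficients.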

Our aim is to compute an image of all detected structural features. MVM allows us to build oriented trees from neighbouring significant wavelet coefficients. Each tree is also divided into subtrees, taking into account the maxima along the scale axis. This allows us to identify objects in scale space and then to restore their images by classical inverse methods. This model works only if the sampling is correct at each scale. This is generally not the case for orthogonal wavelets, so we apply the so-called a trous algorithm (BSM94) or a specific pyramidal one (RBV98). The model extracts superimposed objects of different sizes and gives a separate image for each of them, from which we can obtain position, flux and pattern parameters.
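The a trous ("with holes") algorithm can be sketched in one dimension as follows. At each scale the signal is smoothed with a kernel whose taps are spaced $2^j$ samples apart, and the wavelet plane is the difference between successive smoothings, so every plane keeps the full sampling of the input. The B3-spline kernel and the periodic boundary handling are assumptions made for this sketch; it does not implement the vision model itself.

```python
import numpy as np

# B3-spline smoothing kernel (a common choice; assumed here).
B3 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def a_trous(signal, n_scales):
    """Redundant wavelet transform of a 1-D signal.

    Returns (planes, residual): one wavelet plane per scale plus the final
    smoothed signal, all with the same length as the input.
    """
    c = np.asarray(signal, dtype=float)
    planes = []
    for j in range(n_scales):
        step = 2 ** j                              # spacing of the kernel taps
        smooth = np.zeros_like(c)
        for k, h in zip(range(-2, 3), B3):
            smooth += h * np.roll(c, k * step)     # periodic boundary
        planes.append(c - smooth)                  # detail lost at this scale
        c = smooth
    return planes, c

def reconstruct(planes, residual):
    """Inverse transform: the planes and the residual simply add up."""
    return residual + np.sum(planes, axis=0)

# A narrow and a broad feature superimposed on noise.
rng = np.random.default_rng(3)
x = np.arange(256)
signal = (5.0 * np.exp(-0.5 * ((x - 80) / 2.0) ** 2)
          + 2.0 * np.exp(-0.5 * ((x - 170) / 20.0) ** 2)
          + 0.1 * rng.standard_normal(x.size))

planes, residual = a_trous(signal, n_scales=5)
print([float(np.abs(p).max()) for p in planes])
```

Because reconstruction is a plain sum, an object identified from a subset of coefficients can be restored by keeping only those coefficients before summing.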

We have applied these methods to different kinds of images: photographic plates, CCD frames and X-ray images. We have only to change the statistical rule for extracting significant coefficients to adapt the model from one image class to another. We have also applied this model to extract hierarchically distributed clusters and to identify regions devoid of objects from galaxy counts.

References

A. Bijaoui and F. Rue. A multiscale vision model adapted to the astronomical images. Signal Processing, 46:345--362, 1995. [BR95]

A. Bijaoui, J.L. Starck, and F. Murtagh. Restauration des images multi-echelles par l'algorithme a trous. Traitement du Signal, 11:229--243, 1994. [BSM94]

D. Fadda, E. Slezak, A. Bijaoui. Density estimation with non-parametric methods. Astron. and Astrophys. Sup. Ser 127 pp. 335-352 1998. [FSB98]

F. Rue, A. Bijaoui, B. Vandame. A Pyramidal Vision Model for astronomical images. I.E.E.E. Image Processing, submitted 1997. [RBV98]

E. Slezak, V. de Lapparent, A. Bijaoui. Objective Detection of voids and High density structures in the first CfA redshift survey slice. Ap. J. 409 pp.517-529 1993. [SLB93]

#### Title:

Analysing Very Large Data Sets From Cosmological Simulations

#### Author:

Renyue Cen
Princeton University Observatory

#### Abstract:

Current large scale cosmological simulations generate data of order 100 GB per simulation. Post-simulation analyses of such large data sets pose a severe challenge to simulators. We will present some methods that we use to circumvent the problem of limited RAM size (assuming CPU time permits).
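One generic way to work within limited RAM is to stream a large file through memory in fixed-size blocks and accumulate only the statistics of interest. The sketch below shows this with a memory-mapped array; it is an illustration of the general idea, not necessarily one of the methods of the talk, and the file layout (a flat binary array of 32-bit floats), the file name and the function names are hypothetical.

```python
import os
import tempfile
import numpy as np

def chunked_mean_and_max(path, dtype, n_values, chunk=1_000_000):
    """Mean and maximum of a flat binary array, reading one block at a time."""
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_values,))
    total = 0.0
    peak = -np.inf
    for start in range(0, n_values, chunk):
        # Only this block is brought into memory.
        block = np.asarray(data[start:start + chunk], dtype=np.float64)
        total += block.sum()
        peak = max(peak, block.max())
    return total / n_values, peak

def demo():
    """Write a small stand-in for a simulation output and analyze it in blocks."""
    values = np.random.default_rng(0).random(250_000).astype(np.float32)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "density.bin")   # hypothetical file name
        values.tofile(path)
        stats = chunked_mean_and_max(path, np.float32, values.size, chunk=60_000)
    return stats, values

(mean, peak), values = demo()
print(mean, peak)
```

The peak memory use is set by the block size rather than the file size, at the price of reading the data sequentially, which matches the caveat that CPU time must permit.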
