DIMACS Workshop on Network Data Streaming and Compressive Sensing

October 14 - 15, 2010
DIMACS Center, CoRE Building, Rutgers University

Organizers:
Jin Cao, Alcatel-Lucent Bell Labs, cao at research.bell-labs.com
Cristian Estan, NetLogic, estan at netlogicmicro.com
Jun (Jim) Xu, Georgia Tech, jx at cc.gatech.edu
Presented under the auspices of the DIMACS Special Focus on Algorithmic Foundations of the Internet.

Abstracts:

Aditya Akella, Wisconsin-Madison

Title: Fast Streaming Indexes for Data-intensive Networked Systems

In recent years, a number of systems have been developed that need to maintain large streaming indexes (>100GB), with entries being updated and looked up at a rapid rate (>10K ops per second). Examples of such systems include WAN optimization devices, data de-duplication systems, large-scale network monitoring systems, and directory services.

In this talk, I will describe the opportunities that emerging NAND-flash-based storage offers in designing streaming indexes for these applications. I will first describe an approach that employs a small amount of DRAM together with a much larger amount of flash memory to significantly lower the amortized cost of random index updates at massive scales, and show how this approach can also support a variety of index eviction policies. I will then describe the challenges and opportunities in extending this approach to support other data-intensive systems, such as those requiring approximate index lookups.
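To make the buffering idea concrete, here is a minimal sketch assuming a simple key-value index: updates accumulate in a small in-memory dictionary (the DRAM buffer) and are flushed to flash-resident runs in large sequential batches, amortizing the cost of random writes. The class and its parameters are illustrative, not the system described in the talk.

    # A hypothetical DRAM-buffer-plus-flash index, for illustration only.
    class BufferedFlashIndex:
        def __init__(self, buffer_capacity=4):
            self.buffer = {}             # small DRAM write buffer
            self.flash = []              # immutable runs on "flash", newest first
            self.buffer_capacity = buffer_capacity

        def update(self, key, value):
            self.buffer[key] = value
            if len(self.buffer) >= self.buffer_capacity:
                self._flush()

        def _flush(self):
            # One large sequential write replaces many random ones.
            self.flash.insert(0, dict(self.buffer))
            self.buffer.clear()

        def lookup(self, key):
            if key in self.buffer:
                return self.buffer[key]
            for run in self.flash:       # newest run wins
                if key in run:
                    return run[key]
            return None

    idx = BufferedFlashIndex()
    for i in range(10):
        idx.update("flow%d" % i, i)
    print(idx.lookup("flow3"))           # -> 3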


Jin Cao, Bell Labs

Title: A fast and compact method for unveiling significant patterns in high-speed networks

The identification of significant patterns in network traffic, such as IPs or flows that contribute large volume (heavy hitters) or introduce large changes (heavy changers), has many applications, including accounting and network anomaly detection. As network speeds and the number of flows grow rapidly, tracking per-IP or per-flow statistics becomes infeasible due to both the computation and memory requirements. In this talk, we propose a sequential hashing scheme that requires only $O(H\log N)$ memory and computation, which is close to optimal, where $N$ is the number of all possible keys (e.g., flows, IPs) and $H$ is the maximum number of heavy keys. Moreover, the generalized sequential hashing scheme makes it possible to trade off among memory, update cost, and detection cost over a large range, which different computer architectures can exploit to optimize overall performance. In addition, we also propose statistically efficient algorithms for estimating the values of heavy hitters and heavy changers. Using both theoretical analysis and experimental studies of Internet traces, we demonstrate that our approach achieves the same accuracy as existing methods while using much less memory and computation. This is joint work with Tian Bu, Aiyou Chen and Patrick Lee.
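As a rough illustration of the sequential idea (not the paper's algorithm): detection can proceed prefix by prefix, expanding only prefixes that are still heavy, so detection work scales with the number of heavy keys rather than with $N$. In this toy version, exact per-prefix counters stand in for the paper's compact hash tables, and all parameters are made up.

    import random
    from collections import defaultdict

    B = 8                 # keys are B-bit integers, so N = 2**B
    THRESHOLD = 50
    tables = [defaultdict(int) for _ in range(B + 1)]   # counts per l-bit prefix

    def update(key, count=1):
        for l in range(B + 1):
            tables[l][key >> (B - l)] += count

    def detect():
        # Expand heavy prefixes one bit at a time; light subtrees are pruned.
        heavy = {0}                                     # the empty prefix
        for l in range(1, B + 1):
            heavy = {p for q in heavy for p in (2 * q, 2 * q + 1)
                     if tables[l][p] >= THRESHOLD}
        return heavy                                    # heavy full-length keys

    random.seed(1)
    for _ in range(1000):
        update(random.randrange(2 ** B))                # background traffic
    update(42, 200)                                     # plant one heavy key
    print(42 in detect())                               # -> True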


Yan Chen, Northwestern University

Title: NetShield: Massive Semantics-based Vulnerability Signature Matching for High-speed Networks

Accuracy and speed are the two most important metrics for Network Intrusion Detection/Prevention Systems (NIDSes/NIPSes). Due to emerging polymorphic attacks and the fact that in many cases regular expressions (regexes) cannot capture the vulnerability conditions accurately, the accuracy of existing regex-based NIDSes/NIPSes has become a serious problem. In contrast, the recently proposed vulnerability signatures can describe the vulnerability conditions exactly and achieve better accuracy. However, how to apply vulnerability signatures efficiently in high-speed NIDS/NIPS with a large ruleset remains an untouched but challenging issue.

We present NetShield, the first systematic design of a vulnerability-signature-based parsing and matching engine, which achieves multi-gigabit throughput while offering much better accuracy. In particular, we make the following contributions: (i) we propose a candidate selection algorithm that efficiently matches thousands of vulnerability signatures simultaneously while requiring a small amount of memory; (ii) we propose an automatic lightweight parsing state machine that achieves fast protocol parsing. Experimental results show that the core engine of NetShield achieves 1.9+ Gbps signature matching throughput on a 3.8GHz single-core PC and can scale up to 11+ Gbps on an 8-core machine for 794 HTTP vulnerability signatures. We release our prototype and sample signatures at www.nshield.org.
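A toy sketch of the candidate-selection idea, with made-up signatures: each signature is a conjunction of conditions on parsed protocol fields, and matching intersects the set of still-viable rule IDs field by field, so most of the thousands of rules are discarded after the first field.

    # Hypothetical two-field HTTP signatures, for illustration only.
    signatures = {
        1: {"method": "GET",  "uri_len_gt": 1024},
        2: {"method": "POST", "uri_len_gt": 4096},
        3: {"method": "GET",  "uri_len_gt": 8192},
    }

    def match(msg):
        # Field 1: an exact-match field narrows thousands of rules at once.
        candidates = {rid for rid, sig in signatures.items()
                      if sig["method"] == msg["method"]}
        # Field 2: the numeric condition is checked only for survivors.
        return {rid for rid in candidates
                if len(msg["uri"]) > signatures[rid]["uri_len_gt"]}

    print(match({"method": "GET", "uri": "A" * 2000}))  # -> {1}

(The real engine precomputes per-field match tables and handles far richer condition types; this only shows why intersecting candidate sets keeps memory and per-message work small.)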


Xiuzhen Cheng, The George Washington University

Title: Sparse Target Counting and Localization in Sensor Networks Based on Compressive Sensing

In this talk, we present a novel compressive sensing (CS) based approach for sparse target counting and positioning in wireless sensor networks. While this is not the first work to apply CS to count and localize targets, it is the first to rigorously justify the validity of the problem formulation. Moreover, we propose a novel greedy matching pursuit algorithm (GMP) that complements the well-known signal recovery algorithms in CS theory, and we prove that GMP can accurately recover a sparse signal with high probability. We also propose a framework for counting and positioning targets from multiple categories, a problem that has not been addressed before. Finally, we report our simulation results, which demonstrate the superiority of our approach over existing CS-based and non-CS-based techniques.
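For readers unfamiliar with greedy pursuit, the following is a generic sketch in its spirit (not necessarily the paper's exact GMP): repeatedly pick the column of the measurement matrix most correlated with the residual, then re-fit on the selected support by least squares. The sizes and planted targets below are arbitrary.

    import numpy as np

    def greedy_pursuit(A, y, k):
        n = A.shape[1]
        support, residual = [], y.copy()
        for _ in range(k):
            j = int(np.argmax(np.abs(A.T @ residual)))  # best-matching column
            if j not in support:
                support.append(j)
            x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
            residual = y - A[:, support] @ x_s
        x = np.zeros(n)
        x[support] = x_s
        return x

    rng = np.random.default_rng(0)
    n, m, k = 50, 20, 2                  # 2 "targets" among 50 grid points
    A = rng.normal(size=(m, n)) / np.sqrt(m)
    x_true = np.zeros(n); x_true[[7, 31]] = [1.0, 0.8]
    x_hat = greedy_pursuit(A, A @ x_true, k)
    print(np.nonzero(x_hat)[0])          # -> [ 7 31], with high probability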


Anna Gilbert, University of Michigan

Title: Approximate Sparse Recovery: Optimizing Time and Measurements

A Euclidean approximate sparse recovery system consists of a measurement matrix and a decoding algorithm. Given a vector x, the system approximates x by decoding the received measurements. This approximation must be close to the best k-term approximation to the original signal, although the approximation may have more than k terms and the system need only succeed with probability greater than 3/4. Among the goals in designing such systems are minimizing the number m of measurements and the runtime of the decoding algorithm.

In this talk, we give a system with $m=O(k \log(N/k))$ measurements---matching a lower bound, up to a constant factor---and decoding time $k\log^{O(1)} N$, matching a lower bound up to $\log(N)$ factors. The approximation can be made arbitrarily close to the best k-term representation for the original signal (in the l2 sense) and we give a precise relationship between the measurements, encoding and decoding times in terms of this factor.
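For concreteness, the l2/l2 guarantee sketched above can be written as follows (the notation is mine, not necessarily the talk's):

    \[ \|x - \hat{x}\|_2 \;\le\; (1+\epsilon)\, \min_{|S| \le k} \|x - x_S\|_2, \]

where $\hat{x}$ is the decoded approximation and $x_S$ keeps only the entries of $x$ indexed by $S$; "arbitrarily close" refers to the factor $(1+\epsilon)$, with the measurement count and encoding/decoding times depending on $\epsilon$.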


Ashwin Lall, Denison University

Title: Global Iceberg Detection over Distributed Data Streams

In today's Internet applications and sensor networks we often encounter large amounts of data spread over many physically distributed nodes. The sheer volume of the data and bandwidth constraints make it impractical to send all the data to one central node for query processing. Finding distributed icebergs---elements that may have low frequency at individual nodes but high aggregate frequency---is a problem that arises commonly in practice. We present a novel algorithm with two notable properties. First, its accuracy guarantee and communication cost are independent of how the element counts (for both icebergs and non-icebergs) are split among the nodes. Second, it works even when each distributed data set is a stream (i.e., one-pass data access only). Our algorithm builds upon sketches constructed for estimating the second frequency moment (F2) of data streams. The intuition is that when there are global icebergs in the union of these data streams, the F2 of the union becomes very large; this quantity can be estimated thanks to the summable nature of F2 sketches. Our key innovation is to establish tight theoretical guarantees for our algorithm, under certain reasonable assumptions, using an interesting combination of convex ordering theory and large deviation techniques.
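A minimal sketch of the summability being exploited (an AMS-style F2 sketch; the hash and all sizes are illustrative): each node sketches its local stream, the sketches are added entrywise at the coordinator, and a large estimated F2 of the union flags a possible global iceberg.

    import hashlib

    R = 64                                        # number of sketch rows

    def sign(row, item):                          # deterministic +/-1 hash
        return 1 if hashlib.md5(f"{row}/{item}".encode()).digest()[0] & 1 else -1

    def sketch(stream):
        return [sum(sign(r, x) for x in stream) for r in range(R)]

    def add(s1, s2):                              # linearity: sketch of the union
        return [u + v for u, v in zip(s1, s2)]

    def estimate_f2(s):
        return sum(v * v for v in s) / R          # mean of squared row sums

    node1 = ["a", "b", "c"] * 10 + ["iceberg"] * 30
    node2 = ["d", "e"] * 10 + ["iceberg"] * 40    # low at each node, high globally
    print(estimate_f2(add(sketch(node1), sketch(node2))))  # ~5400, dominated by 70**2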


Ping Li, Cornell University

Title: Compressed Counting for Data Stream Computation and Entropy Estimation

Estimating the p-th frequency moment of a data stream is a heavily studied problem; it is actually trivial when p = 1, assuming the strict Turnstile model. The sample complexity of our proposed algorithm, Compressed Counting, is essentially O(1) near p = 1. This is a very large improvement over the previously believed $O(1/\epsilon^2)$ bound. The proposed algorithm makes the long-standing problem of entropy estimation an easy task.
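A one-line reason why moments near p = 1 give entropy (a standard identity; the notation is mine): with $F_p = \sum_i f_i^p$ denoting the p-th frequency moment, the empirical Rényi entropy is

    \[ H_p = \frac{1}{1-p} \log \frac{F_p}{F_1^{\,p}}, \qquad H = \lim_{p \to 1} H_p, \]

so the Shannon entropy $H$ can be approximated from moment estimates at p close to 1, exactly the regime where the proposed algorithm is sample-efficient.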


Yi Lu, UIUC

Title: Compressed Network Measurement via Counter Braids

Networks collect detailed statistics on the data (bytes, packets, etc.) sent by each flow. This requires large and fast memories to house the counters; however, such memories are prohibitively expensive. We introduce a novel architecture, called Counter Braids, which compresses flow sizes while counting. A simple, local message passing algorithm subsequently recovers all flow sizes exactly. The dimensionality reduction rate of the message passing algorithm essentially matches that of sparse signal recovery via L1 minimization, while the complexity is linear in the number of flows n. We also discuss approximate flow label collection and an error-resilient decoding algorithm.
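As a toy illustration of counting-then-decoding: below is a one-layer "braid" with a peeling decoder, which is a simple special case of message passing, not the paper's full min-sum decoder over braided layers; all flows and sizes are made up.

    import hashlib

    def counters_for(flow, m, k=3):               # k counters per flow
        return sorted({int(hashlib.md5(f"{flow}/{i}".encode()).hexdigest(), 16) % m
                       for i in range(k)})

    def encode(flow_sizes, m):                    # count into m shared counters
        c = [0] * m
        for flow, size in flow_sizes.items():
            for j in counters_for(flow, m):
                c[j] += size
        return c

    def decode(flows, c, m):
        members = {j: set() for j in range(m)}
        for f in flows:
            for j in counters_for(f, m):
                members[j].add(f)
        est, unresolved, progress = {}, set(flows), True
        while unresolved and progress:
            progress = False
            for j in range(m):
                live = members[j] & unresolved
                if len(live) == 1:                # counter holds one unknown flow
                    f = live.pop()
                    est[f] = c[j]
                    for jj in counters_for(f, m): # peel f off its counters
                        c[jj] -= est[f]
                    unresolved.discard(f)
                    progress = True
        return est

    sizes = {"f1": 5, "f2": 1, "f3": 9, "f4": 2}
    m = 12                                        # shared counter array
    print(decode(list(sizes), encode(sizes, m), m))  # -> sizes, when peeling succeeds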


Srikanta Tirthapura, Iowa State University

Title: Processing Asynchronous Data Streams

The sliding window model in data stream processing considers computations over the sequence of the W most recently observed elements of the stream. This simple model is widely used to ensure that function estimates are "fresh" and reflect the current state of the stream. In many applications involving distributed data, the observed streams are asynchronous, i.e., the order in which data is observed is not the time order in which it was generated. We generalize the notion of a sliding window to such streams and present algorithms for basic aggregate functions. We also present connections to correlated aggregate computation over streams.
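A small illustration of the generalized window, assuming the aggregate is a sum: membership is decided by an element's generation timestamp rather than its (possibly reordered) arrival position. This exact version stores the whole window; the talk's algorithms use small-space summaries instead.

    import heapq

    class AsyncWindowSum:
        def __init__(self, W):
            self.W, self.t_max, self.heap, self.total = W, 0, [], 0

        def observe(self, ts, value):
            self.t_max = max(self.t_max, ts)
            if ts > self.t_max - self.W:          # late, but still in window
                heapq.heappush(self.heap, (ts, value))
                self.total += value
            self._expire()

        def _expire(self):                        # drop elements that aged out
            while self.heap and self.heap[0][0] <= self.t_max - self.W:
                _, v = heapq.heappop(self.heap)
                self.total -= v

    w = AsyncWindowSum(W=10)
    for ts, v in [(1, 5), (7, 2), (3, 4), (14, 1), (2, 9)]:   # out of order
        w.observe(ts, v)
    print(w.total)    # window is (4, 14]; timestamps 7 and 14 survive -> 3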


Shobha Venkataraman, AT&T Research

Title: Tracking Dynamic Sources of Malicious Activity at Internet-Scale

We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem by putting together a number of different "experts" algorithms and online paging algorithms. We prove guarantees on our algorithm's performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.
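To convey the "experts" flavor only (a drastic simplification: fixed /8 prefixes instead of an adaptively pruned tree over all 2^32 addresses, and no paging): keep two multiplicatively-weighted experts per prefix and predict with the heavier one.

    BETA = 0.5                                    # penalty for a wrong expert
    weights = {}                                  # prefix -> [w_malicious, w_benign]

    def prefix(ip):
        return ip.split(".")[0]                   # /8 granularity, for simplicity

    def predict(ip):
        w = weights.setdefault(prefix(ip), [1.0, 1.0])
        return "malicious" if w[0] >= w[1] else "benign"

    def update(ip, truth):                        # multiplicative-weights update
        w = weights.setdefault(prefix(ip), [1.0, 1.0])
        w[1 if truth == "malicious" else 0] *= BETA   # punish the wrong expert

    for ip, label in [("10.1.2.3", "malicious"), ("10.9.9.9", "malicious"),
                      ("8.8.8.8", "benign")]:
        update(ip, label)
    print(predict("10.0.0.1"), predict("8.1.1.1"))    # -> malicious benign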


Walter Willinger, AT&T Research

Title: Internet Traffic Matrices and Compressive Sensing

Internet traffic matrices (TMs) specify the traffic volumes between origins and destinations in a network over some time period. For example, origins and destinations can be individual IP addresses, prefixes, routers, points-of-presence (PoPs), or entire networks or Autonomous Systems (ASes). Past work on TMs has almost exclusively focused on large ASes such as AS7018 (AT&T) and their router- or PoP-level TMs, mainly because the latter are critical inputs to many basic network engineering tasks. In this talk, I will describe why ideas from the area of compressive sensing (CS) are relevant for TM research and report on novel applications of CS-inspired techniques to TM interpolation and TM inference. I will also discuss some challenging open problems that arise in the context of Internet TMs, where little or no progress has been made to date, but where applying CS-like ideas has the potential of significantly advancing the state of the art. (This is joint work with Y. Zhang and L. Qiu (Univ. of Texas) and M. Roughan (Univ. of Adelaide).)
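One concrete instance of a CS-like idea in this setting: treat the TM as approximately low rank and fill in missing entries by matrix completion. The alternating-least-squares sketch below is generic, not the specific algorithms of this line of work; the rank-1 "TM" is synthetic.

    import numpy as np

    def als_complete(M, mask, rank=1, iters=50, lam=1e-6):
        n, m = M.shape
        rng = np.random.default_rng(0)
        L, R = rng.normal(size=(n, rank)), rng.normal(size=(m, rank))
        for _ in range(iters):
            for i in range(n):                    # re-fit row factors
                cols = np.where(mask[i])[0]
                A = R[cols]
                L[i] = np.linalg.solve(A.T @ A + lam * np.eye(rank),
                                       A.T @ M[i, cols])
            for j in range(m):                    # re-fit column factors
                rows = np.where(mask[:, j])[0]
                A = L[rows]
                R[j] = np.linalg.solve(A.T @ A + lam * np.eye(rank),
                                       A.T @ M[rows, j])
        return L @ R.T

    rng = np.random.default_rng(1)
    true_tm = rng.random((6, 1)) @ rng.random((1, 8)) * 100  # rank-1 "TM"
    mask = rng.random(true_tm.shape) > 0.3                   # ~30% entries missing
    filled = als_complete(true_tm * mask, mask)
    print(np.abs(filled - true_tm)[~mask].max())             # near zero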


David Woodruff, IBM Almaden

Title: An Optimal Algorithm for the Distinct Elements Problem

We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in FOCS 1983. For a stream of indices in {1, ..., n}, our algorithm computes a (1+eps)-approximation using an optimal number of bits of space with 2/3 success probability, where 0 < eps < 1.

Joint work with Daniel Kane and Jelani Nelson.
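For intuition only, here is the classic k-minimum-values estimator, a far simpler (and non-optimal) relative of the algorithms in this line of work: hash each element to (0,1); duplicates hash identically, and the k-th smallest hash value h over the distinct elements yields the estimate (k-1)/h.

    import hashlib, heapq

    def kmv_estimate(stream, k=64):
        mins, seen = [], set()                    # max-heap of k smallest hashes
        for x in stream:
            h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) / 2.0**128
            if h in seen:
                continue
            if len(mins) < k:
                heapq.heappush(mins, -h); seen.add(h)
            elif h < -mins[0]:                    # displace the current k-th min
                seen.discard(-heapq.heappushpop(mins, -h)); seen.add(h)
        if len(mins) < k:
            return len(mins)                      # fewer than k distinct values
        return (k - 1) / -mins[0]

    stream = [i % 1000 for i in range(20000)]     # 1000 distinct elements
    print(round(kmv_estimate(stream)))            # roughly 1000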


Jun Xu, Georgia Tech

Title: Packet Doppler: Network Monitoring using Packet Shift Detection

Due to recent large-scale deployments of delay- and loss-sensitive applications, there are increasingly stringent demands on the monitoring of service level agreement metrics. Although many end-to-end monitoring methods have been proposed, they are mainly based on active probing and thus inject measurement traffic into the network. In this paper, we propose a new scheme for monitoring service level agreement metrics, in particular the delay distribution. Our scheme is passive and therefore does not perturb real traffic. Using realistic delays and traffic demands, we show that our scheme achieves high accuracy and can detect burst events that probing-based methods would miss.
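The title's "packet shift" intuition can be illustrated with a toy: observe the same traffic at two points as per-interval packet counts, and look for the time shift at which the two series line up. This is my simplification with a single fixed delay; the actual scheme estimates a full delay distribution, and it does so passively.

    import numpy as np

    rng = np.random.default_rng(0)
    ingress = rng.poisson(lam=20, size=500).astype(float)  # pkts per 1 ms bin
    egress = np.roll(ingress, 7)                  # every packet delayed by 7 ms

    a = ingress - ingress.mean()
    b = egress - egress.mean()

    def xcorr(lag):                               # sum_t a[t] * b[t + lag]
        return float(np.dot(a, b)) if lag == 0 else float(np.dot(a[:-lag], b[lag:]))

    print(max(range(20), key=xcorr))              # -> 7, the planted delay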

