DIMACS Workshop on Internet and WWW Measurement, Mapping and Modeling

February 13-15, 2002
DIMACS Center, CoRE Bldg., Room 431, Rutgers University, Piscataway, NJ

John Byers, Boston University, byers@cs.bu.edu
Danny Raz, Technion, raz@cs.technion.ac.il
Yuval Shavitt, Tel Aviv University, shavitt@eng.tau.ac.il
Presented under the auspices of the Special Year on Networking.



Emergence of scaling in the Internet and the www
Albert Laszlo Barabasi

2. The Internet is like a Jellyfish Michalis Faloutsos, U.C. Riverside (joint work with L. Tauro, G. Siganos, C. Palmer) Abstract - We propose a conceptual visual model for the Internet inter-domain topology. Recently, power-laws were used to describe the topology concisely. Despite their success, the power-laws do not help us visualize the topology, i.e., draw the topology on paper by hand. Here, we deal with the following questions: a. Can we identify a hierarchy in the Internet? b. How can I represent the network in an abstract graphical way? The focus of this talk is threefold. First, we characterize nodes using three metrics of topological "importance"', which we later use to identify a sense of hierarchy. Second, we identify some new topological properties. We then observe that the Internet has a highly connected core and identify layers of nodes in decreasing importance surrounding the core. Finally, we show that our observations suggest an intuitive model. The topology can be seen as a jellyfish, where the core is in the middle of the cap, and one-degree nodes form its legs. In addition, we argue for the need of simple intuitive models to deal with the overwhelming complexity of the real world.
3. Finite metric spaces and clustering large data sets Nati Linial In the last several years there has been very intense interest in the study of finite metric spaces. These spaces and their low-disotrion embeddings have played a key role in the design of new approximation algorithms for hard problems. In this talk, I will discuss their role in the analysis of large data sets. If time permits, I will also discuss several open questions pertaining to the theoretical foundations of clutering.
4. Protocols for building low diameter peer-to-peer networks Gopal Panduragan In this talk, I will describe a distributed protocol for building connected, low-diameter Peer-to-Peer (P2P) networks. These properties are important for performing search (the most common P2P application to date) in P2P networks, especially Gnutella-like networks. Current P2P protocols are typically ad hoc and often lead to network fragmentation. An important feature of our protocol is that it operates with no global knowledge of all the nodes in the network. It is also simple to implement and is easily scalable. We analyze the protocol under a stochastic model (which can be of independent interest) and prove that it results in connected networks of constant degree and logarithmic diameter. This is joint work with Prabhakar Raghavan and Eli Upfal.
5. Graph Theoretic Enhancements of Internet Topology Generators Milena Mihail, Georgia Tech New families of internet topology generators, typified by Brite and Inet, are primarily trying to match the skewed degree sequence statistics of the internet topology at the AS level. (a) We have developed algorithms that, given a target degree sequence, produce topologies meeting the degree sequence but differing substantially from each other and from the real data. We can also produce ``random'' instances matching the real data. Our techniques use graph theoretic primitives as well as the Markov chain simulation method, and substantially enrich the output of existing generators. (They also suggest that degree sequence alone is not sufficient to capture all topology characteristics). (b) For modeling and simulation purposes, it is particularly important to identify global characteristics of AS topologies, such as hierarchy and clustering. Hierarchy has been addressed in several prior works. In this work we initiate a study of clustering using spectral filtering (Singular Value Decomposition - Principal Vector Analysis) methods. We have identified many clusters on the real data (expressing geography or business and cutting across different hierarchical levels). Synthetic data appear to posses significantly weaker clustering properties. Joint work with Chitsos Gkantsidis, Amin Saberi, and Ellen Zegura.
6. On the Marginal Utility of Deploying Measurement Infrastructure Mark Crovella The cost and complexity of deploying measurement infrastructure in the Internet for the purpose of analyzing its structure and behavior is considerable. Basic questions about the utility of increasing the number of measurements and measurement sites have not yet been addressed which has led to a ``more is better'' approach to wide-area measurement studies. In this paper, we step toward a more quantifiable understanding of the marginal utility of performing wide-area measurements in the context of Internet topology discovery. We characterize the observable topology in terms of nodes, links, node degree distribution, and distribution of end-to-end flows. We classify nodes discovered on the routes between a set of 8 sources and 1277 destinations to differentiate nodes which make up the so called ``backbone'' from those which border the backbone and those on links between the border nodes and destination nodes. This process includes reducing nodes that advertise multiple interfaces to single IP addresses. We show that the utility of adding sources beyond the second source quickly diminishes from the perspective of interface, node, link and node degree discovery. We also show that the utility of adding destinations is constant for interfaces, nodes, links and node degree indicating that it is more important to add destinations than sources.
7. On the Influence of Phased Growth of the Internet on Power Laws in the Topology Adam O'Donnell Empirical analysis has revealed several basic large-scale power-law relationships in the Internet topology. Recent work has focused on discerning the underlying phenomena that give rise to the observed scale-free topological properties. Incremental growth with preferential connectivity is one of the phenomena that has been said to have contributed to the power-law relationships. In this work, we present a conjecture that the Internet has undergone not one but several distinct growth phases, each with a different growth process. We present simulation results to corroborate our conjecture and show that a 2-phase growth with preferential connectivity only in the second phase leads to stronger power-law relationships closer to those observed in the real Internet.
8. Explaining Power Laws on the Internet Christos H. Papadimitriou Authors: Alex Fabrikant, Elias Koutsoupias, Milena Mihail, Christos H. Papadimitriou We present a simple model of network creation that predicts power law-distributed degrees as observed by Faloutsos et al. In particular, we assume that nodes arrive at random in a unit square, and each attaches itself to the already existing node that minimizes a linear combination of the Euclidean distance and the hop distance from the center. We show that, if the coefficient is in a particular range, power laws in the degree distribution arise with high probability. Outside this range, degrees are highly concentrated. We also point out that the power law in the distributions of eigenvalues observed in the same paper are a corollary of the degree distributions.
9. Network Topologies: Large-Scale Structure and Hierarchy Ramesh Govindan It has long been thought that the Internet, and its constituent networks, are hierarchical in nature. Consequently, the network topology generators most widely used by the Internet research community, GT-ITM and Tiers, create networks with a deliberately hierarchical structure. However, recent work by Faloutsos et al. revealed that the Internet's degree distribution --- the distribution of the number of connections routers or Autonomous Systems (ASs) have --- is a power-law. The degree distributions produced by the GT-ITM and Tiers generators are not power-laws. To rectify this problem, several new network generators have recently been proposed that produce more realistic degree distributions; these new generators do not attempt to create a hierarchical structure but instead focus solely on the degree distribution. There are thus two families of network generators, structural generators that treat hierarchy as fundamental and degree-based generators that treat the degree distribution as fundamental. In this talk we use several topology metrics to compare the networks produced by these two families of generators to current measurements of the Internet graph. We find that the degree-based generators produce better models, at least according to our topology metrics, of both the AS-level and router-level Internet graphs. We then seek to resolve the seeming paradox that while the Internet certainly has hierarchy, it appears that the Internet graphs are better modeled by generators that do not explicitly construct hierarchies.
10. On the Completeness of Observed AS-level Internet Topology Sugih Jamin Researchers have charactered AS-level Internet topology by exploiting connectivity information contained in BGP routing tables, mainly from those collected by the University of Oregon. The AS-level topology constructed from these routing tables are believed to reflect existing AS-level Internet connectivity reasonably well. However, the completeness of the Oregon topology hasn't been validated by any kind of formal proof or actual measurements. Our measurement result shows that a nonnegligible number of existing AS-level connections are not advertised in many BGP routing tables. We found that the observability of AS-level connectivities in BGP routing tables depends to a large extent on existing AS relationships. Our overall result suggests that the Oregon route server may not capture a significant number of existing AS-level connections.
11. Analysis of Real and Synthetic Internet AS Topology Graphs Danica Vukadinovic, Polly Huang, Thomas Erlebach We study the Internet topology on the autonomous system (AS) level from a graph-theoretic point of view. Our goal is to determine characteristic properties of the real AS graphs and to develop a methodology for comparing real and synthetic topologies. On the one hand, we study many graph parameters that are well-known in graph theory but have rarely been considered in previous investigations of Internet graphs. On the other hand, we find that the normalized Laplacian spectrum provides a concise fingerprint of a network topology. Our results indicate that current state-of-the-art Internet topology generators fail to capture certain distinctive properties of the real AS graphs. This is important for topology-sensitive Internet applications and suggests new directions towards more realistic generators.
12. Observations from Router-level Internet Traces Lisa Amini
13. Characterizing the Internet Hierarchy from Multiple Vantage Points Jennifer Rexford The delivery of IP traffic through the Internet depends on the complex interactions between thousands of Autonomous Systems (ASes) that exchange routing information using the Border Gateway Protocol (BGP). This paper investigates the topological structure of the Internet in terms of customer-provider and peer-peer relationships between ASes, as manifested in BGP routing policies. We describe a technique for inferring AS relationships by exploiting partial views of the AS graph available from different vantage points. Next we apply the technique to a collection of ten publicly-available BGP routing tables to infer the relationships between neighboring ASes. Based on these results, we analyze the hierarchical structure of the Internet and propose a five-level classification of ASes. Our analysis differs from previous characterization studies by focusing on the commercial relationships between ASes rather than simply the connectivity between the nodes. This is joint work with Lakshminarayanan Subramanian, Sharad Agarwal, and Randy Katz at the University of California at Berkeley. A paper describing the work is available at http://www.research.att.com/~jrex. The BGP routing data and our inference results are available at http://www.cs.berkeley.edu/~sagarwal/research/BGP-hierarchy/
14. On Characterizing BGP Routing Table Growth Tian Bu The Internet has experienced explosive growth since its commercialization. The sizes of the routing tables have increased by an order of magnitude over the past six years. This dramatic growth of the routing table can decrease the packet forwarding speed and demand more router memory space. In this work, we explore the extent that various factors contribute to the routing table growth. We first perform measurement study to determine the extent that factors such as multi-homing, failure to aggregate, load balancing, and address fragmentation contribute to routing table size, and find that only 20-30% of prefixes are due to multi-homing, 15 - 20% of prefixes are due to failure to aggregate, 20 - 25% of prefixes are due to load balancing, and more than 75% of prefixes are due to address fragmentation. This leads us to group all prefixes that are not aggregated due to either failure to aggregate or address fragmentation. We find that the number of prefix clusters is no more than 20% of the number of prefixes. We explore the extent that load balancing contributes to the number of prefix clusters. To the best of our knowledge, this is the first study on the explosive growth of routing tables by systematically comparing factors that contribute to the growth and by observing routing table growth patterns.
15. A web graph bestiary Andrei Broder, IBM Research Division Experiments conducted a couple of years ago on a series crawls from AltaVista revealed the intricate structure of the enormous link graph induced by the corpus of a major search engine. But clearly no search engine covers all the pages on the web, and even defining precisely what is "the entire web graph" is challenging. Furthermore, any crawling process is driven by some rules that influence the final result. Last but not least, the web is intentionally created: there are incentives (some economic and some not so) to perturb the "natural" process of birth and death of nodes and links. These perturbation might explain the discrepancies between some reasonable models of web expansion and the observed reality. The main aim of this talk is to present a bestiary of these perturbations; among the monsters to be presented are soft 404's, infinite paths, seldom connected servers, link spamming, robot submitters, cookie crumbs, content generators, and others.
16. Web community identification from hyperlink topology: clustering and ranking algorithms Chris Ding We study methods for effectively identifying web communities and ranking algorithms for placing most informative webpages on the top. Web communities can be discovered with an unsupervised learning method by clustering web pages into clusters utilizing information contained in hyperlink topology, text, and co-citation. The subgraph components are generally sparsely-connected with their sizes following Zipf distribution. We provide a comprehensive analysis on PageRank and HITS ranking algorithms, with improvements and generalizations. Closed-form solutions are obtained which proves the fundamental notion that top authorities always have high in-degrees and top hubs always have high out-degrees. Joint work with Xiaofeng He, Hongyuan Zha, Parry Husbands, Horst Simon.
17. Measurement and modelling of the WWW link graph Andrew Tomkins
18. A general model of web graphs Alan Frieze We describe a general model of a random graph whose degree sequence obeys a power law. Such laws have recently been observed in graphs associated with the world wide web. Joint work with Colin Cooper.
19. Fine Grain Observation of Traffic Dynamic in the Sprint IP Backbone Christophe Diot Network traffic measurements provide essential data for networking research and network management. We describe a passive monitoring system designed to capture the dynamic aspects of the traffic at the finest possible level of granularity. Our system is deployed in 3 POPs in the Sprint E|Solutions IP backbone. It taps fibers at different backbone links and timestamps each IP packet with a GPS-based global clock signal, allowing packet correlation at different locations on the network. We present a set of results to both demonstrate the strength of the system, and identify new characteristics of the Internet traffic. The results include traffic statistics collected on specific OC-12 links in terms of bytes, packets, and flows, as well as single-hop and multiple-hop packet delay. Recent measurement issued from IS-IS and BGP speakers deployed in the same backbone are also discussed.
20. Linking Internet traffic dynamics with the BGP topology: the view of two stub ISPs Steve Uhlig In this presentation, we study two one-week long traces of all the interdomain traffic received by 2 Belgian stub-ISPs. We correlate the traffic received by these two very different ISPs with their BGP routing table. First, we present the topological distribution of their traffic throughout the Internet and show that most of their traffic is received from sources located at a small AS hop distance. Then, we study the ``activity'' of the interdomain traffic sources and show that this ``activity'' does not vary much with the considered timescale. Finally, we show that the distribution of the most important traffic sources in terms of the traffic volume is stable.
21. Using Simulation of Particle Mechanics for Calculating Coordinates of Large Distance Maps Tomer Tankel IDMaps is architecture for estimating Internet distances using Instrumentation boxes, placed in the Internet. Host-to-host distances are calculated as sum of several measured segments because there is no means to 'triangulate' these segments. Ng and Zhang suggested coordinate-based approach to provide this triangulation, however their technique suffers from significant inaccuracy when the number of instrumentation boxes is larger than about 15. We suggest a novel graph embedding technique that can handle hundreds of such boxes at high accuracy. We model network nodes as particles traveling under force field derived from embedding error. Particles escape local minima due to their kinetic energy and achieve global minimum of embedding error. By the use several error 'forces' in different stages of the calculation we converge very fast to very low embedding error. We will present results both power-law networks as well as for traditional Waxman networks. We will also show our embedding is superior in practice to the one based on the theoretical results of Linial, London, and Rabinovich. Joint work with Yuval Shavitt.
22. Inference and Labelling of Metric-Induced Network Topologies Azer Bestavros The deployment of distributed network-aware applications over the Internet requires an accurate representation of the conditions of underlying network resources. To be effective, this representation must be possible at multiple resolutions relative to a metric of interest. In this talk, I will describe an approach for the construction of such representations using end-to-end measurement of internal network loss, delay, and bandwidth characteristics. This work is done in collaboration with John Byers and Khaled Harfoush.
23. Maximum Likelihood Estimation of Network Topology from End-to-End Measurements Robert Nowak Authors: Rui Castro, Mark Coates, Rob Nowak One of the predominant schools of thought in networking today is that monitoring and control of large scale networks is most practical at the edge. Identifying or estimating the routing topology is crucial to edge-based approaches. The focus of this work is a new Maximum Likelihood criterion for topology identification that makes use only of unicast measurements performed between host computers and requires no special support (e.g., ICMP responses) from internal routers. The measurements are based on a novel unicast probing strategy, and theoretical analysis of the likelihood criterion leads to a new, fast algorithm for topology identification.
24. Network Tomography Using Passive End-to-End Measurements Venkata N. Padmanabhan and Lili Qiu We discuss Network Tomography, which refers to the inference of characteristics of internal links in a network using end-to-end measurements. A novel aspect of our work is that we do the inference based on passive observation of existing end-to-end traffic, rather than doing active measurements. We focus on inferring link loss rate and describe two techniques to this end. We evaluate the accuracy of these techniques using packet traces gathered at a busy Web site and also simulation. The initial results are encouraging: our techniques can identify 70 - 80% of the lossy links in a network with a manageable false positive rate.
25. Measurement and Classification of Out-of-Sequence Packets in a Tier-1 Backbone Sharad Jaiswal In this study we present a comprehensive study and classification of out-of-sequence packets based on TCP connections observed over the Sprint IP backbone, using a monitoring infrastructure present in different backbone POPs. We describe the measurement and classification methodology. We determine and present the relative frequency of events causing of out-of-sequence packets, including packet losses, reordering and duplicates within the network. We also consider the distribution of out-of-sequence packets among the measured flows, based on flow size (e.g., mice versus elephants), and POP location. Analyzing out-of-sequence TCP packets is important since they have a direct bearing on the performance of applications. This study is of considerable interest because (i) the data we analyze is unprecedented in size and scope, (ii) it provides a starting point to address several interesting additional questions about congestion in the Internet.
26. Performance Monitoring for Wide-Area Applications Louiqa Raschid Recent technology advances have enabled wide area applications with Internet accessible sources. A challenge to such applications is the unpredictable behavior of sources in the dynamic WAN. There can be wide variability in the latency (end-to-end delay) of accessing these sources, and this could depend on network and server workloads. The success of these applications will depend on their ability to monitor and predict end-to-end client-side performance. Our objective is to explore open technology for passive performance monitoring to develop performance profiles for clusters of clients and servers. Such technology must be scalable to large numbers of clients and servers.
27. Comparative Analysis of Traffic Matrix Estimation Methods Alberto Medina Knowledge about the traffic matrix of a network is very valuable to perform traffic engineering tasks such as load balancing, routing protocol configuration, load balancing, provisioning, pricing, etc. Traffic matrices must be estimated based on limited available data and to date, very few techniques exist for doing so. We conduct a comparative study of all of these techniques. We discuss the strengths and weaknesses of these techniques in the context of real commercial networks. The analysis of a real environment is done in the context of the Sprint Tier-1 backbone network, and make use of our measurement infrastructure to validate the techniques.
28. Connection-Level Modeling of Network Traffic Rudolf Riedi Aggregate network traffic exhibits strong burstiness and non-Gaussian marginals, which popular models like fractional Gaussian noise (fGn) fail to capture. To better understand the cause of traffic burstiness, we look into connection-level information of traffic traces. A careful study reveals that traffic burstiness is directly related to the heterogeneity in connection bandwidths (and round-trip times), and that a small number of high-bandwidth connections are solely responsible for burstiness. This decomposition has far-reaching implications on network control and leads to a new model for network traffic: the mixture of fGn and stable L\'evy noise, which captures both long-range dependence and burstiness.
29. A New Methodological Approach to Cross-Traffic Estimation Kave Salamatian This paper applies to cross traffic estimation, a new methodology for analyzing and interpreting measurement collected over Internet. This new approach is based on inferring cross traffic characteristics that lead to the observed losses by using an associated a priori constructive model. The constructive model used in this paper is an MMPP/M/1/N single server bottleneck. The originality of this solution is that we start with observed loss process and infer inputs that have lead to these observations. Methods presented in the paper provide a powerful solution to address the complexity of interpreting IP active measurement and empirical network modelling.
30. Internet Traffic: Statistical Multiplexing Gains Jin Cao, William S. Cleveland, Dong Lin, and Don X. Sun Bell Labs, Murray Hill, NJ {cao, wsc, dong, dxsun}@research.bell-labs.com As the number of active connections on an Internet link increases, the long-range dependence of packet traffic changes due to increased statistical multiplexing of packets from different active connections. Four packet traffic variables are studied -- inter-arrival times, sizes, packet counts, and byte counts. Results are based on the following: (1) the mathematical theory of marked point processes; (2) empirical study of 2526 packet traces, 5 min or 90 sec in duration, from 6 Internet monitors measuring 15 interfaces ranging from 100 mbps to 622 mbps; (3) network simulation with NS; and (4) simple statistical models for the traffic variables.
31. Investigating the Structure of the Internet via Game Theory Scott Shenker Auhtors: Elitza Maneva, Christos H. Papadimitriou, Scott Shenker, Kunal Talwar We study network creation as a game between nodes. Each node can create an edge to any other at a fixed cost per edge, and, once the graph corresponding to the choices of all nodes is formed, it incurs an additional cost equal to the sum of the distances to the other nodes. In another variant, the costs correspond to VCG charges, minus VCG income. We are interested in questions such as: How bad are the Nash equilibria? And can bad instances of VCG routing arise as Nash equilibria?
32. The diameter of a scale-free random graph Bela Bollobas We consider a random graph process in which vertices are added to the graph one at a time and joined to each earlier vertex $v$ with probability proportional to the degree of $v$. This process was introduced by Barabasi and Albert, as a simple model of the growth of real-world graphs such as the world wide web. Computer experiments presented by Barabasi, Albert, and Jeong and heuristic arguments given by Newman, Strogatz and Watts suggested that after $n$ steps the resulting graph should have diameter approximately $\log n$. Recently, with Oliver Riordan, we determined the asymptotic value of the diameter; the main aim of the talk is to sketch a proof of this result. We shall also present some related results obtained with Riordan, Spencer and Tusnady about the degree sequence of this random graph.
33. On the power of off line data in approximating Internet distances Prasun Sinha Estimate of Internet distances can be used for finding good solutions to problems such as cache placement and dynamic selection of the nearest mirror site. Proposed approaches for Internet distance measurement are based on either using geographic distance as a metric, or extrapolation using background distance measurements between a fixed set of sites. Only using geographic distances have been shown to have limitations and the latter technique of active distance measurement has several associated problems such as the requirement of dedicated machines for measurements and introduction of additional network traffic. In this work, we explore the use of multi-variable linear regression techniques for distance estimation using multiple metrics such as distance to the nearest backbone network and geographic distance. For our studies, we have used the set of public libraries and the set of universities in U.S. as clients and a fixed set of 8 servers. In particular we study the accuracy of picking up the nearest server from a set of servers using our distance estimation technique. Our preliminary results indicate that off line information allows choosing the nearest server in about 80% of the cases. Joint work with Danny Raz.
34. Comparative Analysis of Traffic Matrix Estimation Methods Alberto Medina Authors: Alberto Medina, Nina Taft, Kave Salamatian, Supratik Bhattacharyya and Christophe Diot. Knowledge about the traffic matrix of a network is very valuable to perform traffic engineering tasks such as load balancing, routing protocol configuration, load balancing, provisioning, pricing, etc. Traffic matrices must be estimated based on limited available data and to date, very few techniques exist for doing so. We conduct a comparative study of all of these techniques. We discuss the strengths and weaknesses of these techniques in the context of real commercial networks. The analysis of a real environment is done in the context of the Sprint Tier-1 backbone network, and make use of our measurement infrastructure to validate the techniques.

Previous: Participation
Next: Registration
Workshop Index
DIMACS Homepage
Contacting the Center

Document last modified on February 7, 2002.