Massive Graphs, Power Laws, and the World Wide Web

L. Charles Biehl, The Charter School of Wilmington, Delaware, 2003

 

            Imagine selecting two Web pages at random on the World Wide Web. On average, how many links do you think you would need to follow to get from one page to the other? Surprisingly, the answer is about 19 (based on current statistics). Perhaps even more surprising than the answer itself is that it can be estimated at all, given the size of the Web. There are currently nearly 2.5 billion pages on the Web, with more than 170 million hosts. The growth of the Web is summarized in the table below.

 

Date              Host Count
Jan. 1993          1,313,000
Jul. 1993          1,776,000
Jan. 1994          2,217,000
Jul. 1994          3,212,000
Jan. 1995          4,852,000
Jul. 1995          6,642,000
Jan. 1996          9,472,000
Jul. 1996         12,881,000
Jan. 1997         16,146,000
Jul. 1997         19,540,000
Jan. 1998         29,670,000
Jul. 1998         36,739,000
Jan. 1999         43,230,000
Jul. 1999         56,218,000
Jan. 2000         72,398,000
Jul. 2000         93,047,785
Jan. 2001        109,574,429
Jul. 2001        125,888,197
Jan. 2002        147,344,723
Jul. 2002        162,128,493
Jan. 2003        171,638,297

 

            Now imagine the Web as a graph, where each Web page is a vertex and each link between pages is an edge. Better yet, imagine a directed graph (digraph) in which each arc links pages in the appropriate direction. This is an example of a massive graph: a class of graphs that are simply too large to be analyzed or understood by conventional means.
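
To make the picture concrete, here is a minimal Python sketch of a few Web pages stored as a digraph using an adjacency list. The page names are invented purely for illustration.

    # A tiny piece of the "Web as a digraph" idea: each page is a vertex,
    # and each hyperlink is an arc from the linking page to the linked page.
    # The page names are hypothetical.
    web = {
        "home.html":   {"news.html", "about.html"},
        "news.html":   {"home.html", "story1.html"},
        "about.html":  {"home.html"},
        "story1.html": set(),                  # a page with no outgoing links
    }

    for page, links in web.items():
        print(page, "has out-degree", len(links))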

 

            One example of a massive graph is a “call graph,” which comes from telephone billing records. The vertices are telephone numbers and the edges (or arcs) are calls made from one number to another. One such call graph analyzed by researchers at AT&T Shannon Laboratories has 53,767,087 vertices and about 170 million edges; over a 20-day period, a call graph can grow to 290 million vertices and 4 billion edges. On a slightly smaller scale is an “acquaintance graph,” in which people are the vertices and edges connect those who know each other. One project set out to construct a global acquaintance graph online at http://sixdegrees.com (a site which is no longer up). Members were invited to fill out a form with the email addresses of everyone they knew, who in turn were invited to do the same; before it closed, sixdegrees.com had nearly 3 million members. Incidentally, the site’s name comes from the phrase “six degrees of separation,” from a play (later a film) written by John Guare. The basic idea is that any two members of the human race are connected by a chain of no more than six links (or people) of acquaintance, a claim that has never actually been proven or disproven.

 

            Massive graphs in practice have a relatively recent history. Graph theory itself, although originating with Leonhard Euler in the 1700s, has only come to the forefront of mathematics since the latter part of the 20th century. In recent years, graph theory, along with a variety of statistical and algorithmic methods, has been used to study these massive graphs. Even more recently, the properties of massive graphs have been related to random graphs, first studied by the Hungarian mathematicians Paul Erdős and Alfréd Rényi around 1960. Random graphs start with a set of isolated vertices, to which edges are added one at a time. Even though graphs such as those for the World Wide Web, telephone calls, and acquaintances are certainly not random, many of the properties they exhibit can be modeled successfully with algorithms that are used to generate random graphs.
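
As a concrete illustration of that construction, here is a short Python sketch that starts with n isolated vertices and adds m edges one at a time, each chosen uniformly at random from the pairs not yet used. The parameter values in the example are arbitrary.

    import random

    def random_graph(n, m, seed=None):
        """Start with n isolated vertices and add m random edges one at a time
        (no loops, no repeated edges)."""
        rng = random.Random(seed)
        edges = set()
        while len(edges) < m:
            u, v = rng.sample(range(n), 2)      # two distinct vertices
            edges.add((min(u, v), max(u, v)))   # store each pair only once
        return edges

    print(sorted(random_graph(10, 15, seed=1)))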

 

            Massive graphs such as those mentioned here have more in common than just their size. Key properties include:

sparseness: they have relatively few edges compared to a complete graph (one in which every pair of vertices is joined by an edge);

clustering: edges are not distributed uniformly, but rather in clumps;

small diameter: the largest number of edges needed to get from any one vertex to any other (always traveling along shortest paths) tends to be relatively small, assuming the graph is connected (no unattached sections); a sketch of how this can be computed follows the list.
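
For a small connected graph, the diameter can be found directly by running a breadth-first search from every vertex and keeping the largest distance seen. The Python sketch below (with a made-up example graph) illustrates the idea; real massive graphs are far too large for this brute-force approach, which is part of what makes them challenging.

    from collections import deque

    def diameter(adj):
        """Breadth-first search from every vertex of a connected, undirected
        graph (an adjacency dict) and return the largest shortest-path
        distance found, i.e. the diameter."""
        best = 0
        for source in adj:
            dist = {source: 0}
            queue = deque([source])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
            best = max(best, max(dist.values()))
        return best

    # A tiny illustrative graph: a 4-cycle 0-1-2-3-0.
    example = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
    print(diameter(example))   # prints 2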

 

Recent research has focused on the properties of the degrees of the vertices. Most notably, there appears to be a power law relationship between the degree of a vertex and the number of vertices having that degree. A power law is a function of the form y = ax^b, in which (in this case) x is the degree of a vertex and y is the corresponding number of vertices of degree x. In the case of our graphs, a is a constant of proportionality and b is negative. The negative value of b indicates that a small number of vertices will have very large degree, while a vast number will have small degree (this is inverse variation as opposed to direct variation).
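
A quick way to see what such a power law says is to evaluate it for a few degrees and look at the values on a logarithmic scale, since log y = log a + b log x is a straight line with slope b. The constants in the Python sketch below are purely illustrative, not fitted to real Web data.

    import math

    a, b = 10000, -2.1        # illustrative constants only; b is negative

    for x in (1, 2, 5, 10, 50, 100):
        y = a * x ** b
        print(f"degree {x:>3}: about {y:9.1f} vertices "
              f"(log-log point: {math.log10(x):.2f}, {math.log10(y):.2f})")

Plotted on log-log axes, these points fall on a straight line with slope b, which is exactly the signature researchers look for in real degree data.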

 

Jon Kleinberg (Cornell University), along with several colleagues from the IBM Almaden Research Center, found evidence of a power law in the Web, as did Lada Adamic and Bernardo Huberman of the Xerox Palo Alto Research Center (1999). Christos, Michalis, and Petros Faloutsos noted similar structure in the general topology of the Internet (1999). While calculating the diameter of the Web, Réka Albert, Hawoong Jeong, and Albert-László Barabási (1999) made two additional important observations: new Web pages are constantly being introduced, and the more links a page already has, the more likely it is to gain even more. These observations led to research into algorithms for generating random graphs that could capture these additional properties.
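
Those two observations, constant growth plus "the more links a page has, the more it tends to attract," are the heart of what is often called preferential attachment. The Python sketch below is a rough illustration of that growth rule, not the exact model any of these researchers published.

    import random

    def grow_by_preferential_attachment(n, seed=None):
        """Add vertices one at a time; each newcomer links to an existing
        vertex chosen with probability proportional to its current degree."""
        rng = random.Random(seed)
        edges = [(0, 1)]            # start from a single edge
        endpoints = [0, 1]          # each vertex appears once per incident edge
        for new in range(2, n):
            target = rng.choice(endpoints)   # degree-proportional choice
            edges.append((new, target))
            endpoints.extend([new, target])
        return edges

    print(grow_by_preferential_attachment(10, seed=1))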

 

Since 1999, researchers such as William Aiello (AT&T), Fan Chung, and Lincoln Lu (both of UCSD) have produced algorithms whose models closely imitate the properties of actual networks for which statistical data has been collected. Their model can be summarized as follows: if y is the number of vertices with degree d, then y is proportional to 1/d^b; that is, y = a/d^b, where a is a constant of proportionality and b is a constant greater than zero that reflects how the number of edges grows relative to the number of vertices. (Don’t worry if you don’t understand all of this; it won’t make the activity any less meaningful.)
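
To see what a degree sequence of this kind looks like, the Python sketch below tabulates y = a/d^b for a handful of degrees and totals the resulting vertices and edges. The values of a and b are invented for illustration, not taken from the researchers’ data.

    def power_law_degree_counts(a, b, max_degree):
        """Number of vertices of each degree d, rounded, under y = a / d**b."""
        return {d: round(a / d ** b) for d in range(1, max_degree + 1)}

    counts = power_law_degree_counts(a=100000, b=2.2, max_degree=10)
    for d, y in counts.items():
        print(f"{y:>7} vertices of degree {d}")

    vertices = sum(counts.values())
    edges = sum(d * y for d, y in counts.items()) // 2   # each edge has two ends
    print("total vertices:", vertices, " approximate edges:", edges)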

 

It will be up to future research results, from students as well as university and industry mathematicians, to reveal the full power and importance of these observations, algorithms, and models. Already, more intelligent Web search engines are in development that exploit these structural properties of the Web graph. One project called “Clever,” under development by researchers at Cornell University and the IBM Almaden Research Center, uses graph-theoretic analysis of the Web to search for content via Web pages with large numbers of links in and out. One search engine already in wide use is Google (it should come as no surprise that Google’s success is based on mathematics!), developed by Sergey Brin and Lawrence Page of Stanford University, which combines conventional text search with a ranking of Web pages derived from the link structure, including the numbers of links in and out of each page. Another engine called Teoma, developed at Rutgers University, takes the mathematics used in Google even further, although Teoma’s databases are still (as of this writing) under development.
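
As a toy illustration of link-based ranking (emphatically not Google’s or Teoma’s actual algorithms), the Python sketch below scores each page in a tiny, invented Web by counting its links in and out.

    # Hypothetical pages; scoring by links in and out is only a caricature
    # of the link-analysis ideas mentioned above.
    web = {
        "a.html": {"b.html", "c.html"},
        "b.html": {"c.html"},
        "c.html": {"a.html"},
        "d.html": {"c.html", "a.html"},
    }

    in_degree = {page: 0 for page in web}
    for links in web.values():
        for target in links:
            in_degree[target] += 1

    for page in sorted(web, key=lambda p: in_degree[p] + len(web[p]), reverse=True):
        print(page, " links in:", in_degree[page], " links out:", len(web[page]))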

 

Developing tools that make it possible to model and study massive graphs provides the foundation for understanding them. Such notions as patterns in the degrees of vertices, or the ability to compute the diameter, contribute to the development of mathematical models that mimic the structure of these graphs. Models of massive random graphs based on power laws may provide the means to approximate graphs built from real data, make it possible to discover and analyze currently hidden structures in existing massive graphs, and perhaps serve as blueprints for designing future generations of network topologies. By most accounts, the World Wide Web is still in its infancy, and graph theory appears to be a field well suited to exploring its structure, evolution, and maturation.