Whales of the Web

Posted on 27 April 2015 by Brian Hayes

The average web site has links connecting it with 29 other sites. I came up with this number in the following way. A data set I’ve been playing with for a few weeks lists 43 million web sites and 623 million links between sites; the quotient of those numbers is about 14.5. Since each link has two ends, the per-site total of inbound plus outbound links is double the quotient.
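The arithmetic is simple enough to spell out:

```python
# Back-of-the-envelope: links per site in the Web Data Commons graph.
sites = 43_000_000
links = 623_000_000
quotient = links / sites        # each link counted once: about 14.5
average_degree = 2 * quotient   # each link has two ends: about 29
```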

Twenty-nine links is the average, but by no means is it a typical number of links. Almost half of the sites have four or fewer links. At the other end of the spectrum, the most-connected web site (blogspot.com) has almost five million links, and there are six more sites with at least a million each. The distribution of link numbers—the degree sequence—looks like this (both scales are logarithmic, base 2):

[Figure: the degree sequence of the 43 million sites, both scales logarithmic, base 2]


I want to emphasize that these are figures for web sites, not web pages. The unit of aggregation is the “pay-level domain”—the domain name you have to pay to register. Examples are google.com or bbc.co.uk. Subdomains, such as maps.google.com, are all consolidated under the main google.com entry. Any number of links from pages on site A to pages on site B are recorded as a single link from A to B.
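The consolidation step can be sketched as follows. This is a naive illustration only: the helper names are my own, and a real implementation consults the full Public Suffix List (as tools like tldextract do), since multi-part suffixes like co.uk number in the thousands; here a tiny hand-picked set stands in for it.

```python
# Naive pay-level-domain consolidation, for illustration only.
TWO_PART_SUFFIXES = {"co.uk", "ac.uk", "com.au"}   # tiny stand-in sample

def pay_level_domain(hostname):
    labels = hostname.lower().split(".")
    if ".".join(labels[-2:]) in TWO_PART_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

def consolidate(page_links):
    """Collapse page-to-page links to unique site-to-site links."""
    return {(pay_level_domain(a), pay_level_domain(b)) for a, b in page_links}
```

Any number of links between pages of two sites collapses to a single (A, B) pair, as the data set requires.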

The source of these numbers is the Web Data Commons, a project overseen by a group at the University of Mannheim. They extracted the lists of domains and the links between them from a 2012 data set compiled and published by the Common Crawl Foundation (which happens to be the subject of my latest American Scientist column). The Common Crawl does essentially the same thing as the big search engines—download the whole Web, or some substantial fraction of it—but the Common Crawl makes the raw data publicly available.

There are interesting questions about both ends of the degree sequence plotted above. At the far left, why are there so many millions of lonely, disconnected web sites, with just one or two links, or none at all? I don’t yet feel I know enough to tell the story of those orphans of the World Wide Web. I’ve been focused instead on the far right of the graph, on the whales of the Web, the handful of sites with links to or from many thousands of other sites.

From the set of 43 million sites, I extracted all those with at least 100,000 inbound or outbound links; in other words, the criterion for inclusion in my sample was \(\max(indegree, outdegree) \ge 100,000\). It turns out that just 112 sites qualify. In the diagram below, they are grouped according to their top-level domain (com, org, de, and so on). The size of the colored dot associated with each site encodes the total number of links; the color indicates the percent of those links that are incoming. Hover over a site name to see the inbound, outbound and bidirectional links between that site and the other members of this elite 112. (The diagram was built with Mike Bostock’s d3.js framework, drawing heavily on this example.)
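The selection step amounts to a filter over a degree table: a site qualifies if it has at least 100,000 links in at least one direction. The sample entries below are stand-ins; only networkadvertising.org's inbound count appears in this post, and the other numbers are invented for illustration.

```python
# Whale filter over a (site -> (indegree, outdegree)) table.
# The real table has 43 million rows; these few are illustrative.
degrees = {
    "blogspot.com":           (4_870_000, 25_000),  # invented split
    "networkadvertising.org": (143_863, 40),        # outdegree invented
    "tiny-example.net":       (3, 7),
}

whales = {site for site, (indeg, outdeg) in degrees.items()
          if max(indeg, outdeg) >= 100_000}
```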

[Interactive diagram: the 112 most highly connected sites, grouped by top-level domain]

The bright red dots signify a preponderance of outgoing links, with relatively few incoming ones. Many of these sites are directories or catalogs, with lists of links classified by subject matter. Such “portal sites” were popular in the early years of the Web, starting with the World Wide Web Home at CERN, circa 1994; another early example was Jerry and David’s Guide to the World Wide Web, which evolved into Yahoo. Search engines have swept aside many of those hand-curated catalogs, but there are still almost two dozen of them in this data set. Curiously, the Netherlands and Germany (nl and de) seem to be especially partial to hierarchical directories.

Bright blue dots are rarer than red ones; it’s easier to build a site with 100,000 outbound links than it is to persuade 100,000 other sites to link to yours. The biggest blue dot is for wordpress.org, and I know the secret of that site’s popularity. If you have a self-hosted WordPress blog (like this one), the software comes with a built-in link back to home base.

Another conspicuous blue dot is gmpg.org, which mystified me when I first noticed that it ranks fourth among all sites in number of incoming links. Having poked around at the site, I can now explain. GMPG is the Global Multimedia Protocols Group, a name borrowed from the Neal Stephenson novel Snow Crash. In 2003, three friends created a real-world version of GMPG as a vehicle for the XHTML Friends Network, which was conceived as a nonproprietary social network. One of the founders was Matt Mullenweg, who was also the principal developer of WordPress. Hence every copy of WordPress includes a link to gmpg.org. (The link is in the <head> section of the HTML file, so you won’t see it on the screen.) At this point GMPG looks to be a moribund organization, but nonetheless more than a million web sites have links to it.

Networkadvertising.org is the web site of a trade group for online advertisers. Presumably, its 143,863 inbound links are embedded in ads, probably in connection with the association’s opt-out program for behavioral tracking. (To opt out, you have to accept a third-party cookie, which most people concerned about privacy would refuse to do.)

Still another blue-dot site, miibeian.gov.cn, gets its inward links in another way. If I understand correctly, all web sites hosted in China are required to register at miibeian.gov.cn, and they must place a link back to that site on the front page. (If this account is correct, the number of inbound links to miibeian.gov.cn tells us the number of authorized web sites in China. The number in the 2012 data is 289,605, which seems low.)

One final observation I find mildly surprising: Measured by connectivity, these 112 sites are the largest on the entire Web, and you might think they would be reasonably stable over time. But in the three years since the data were collected, 10 percent of the sites have disappeared altogether: Attempts to reach them either time out or return a connection error. At least a few more sites have radically changed their character. For example, serebella.com was a directory site that had almost 700,000 outbound links in 2012; it is now a domain name for sale. Among web sites, it seems, nobody is too big to fail.

The table below lays out the numbers for the 112 sites. It’s sortable: Click on any of the column headers to sort on that field; click again to reverse the ordering. If you’d like to play with the data yourself, download the JSON file.

[Sortable table of the 112 sites: site, inlinks, outlinks, total links, % inbound]



Posted in computing | 7 Comments

A taxing algorithm

Posted on 16 April 2015 by Brian Hayes

My first encounter with the term algorithm did not come from Don Knuth. I learned it from the Internal Revenue Service. Sometime in the late 1960s or early 70s the IRS introduced a redesigned Form 1040 based on a principle borrowed from the world of computing: the algorithm. What this meant, I soon discovered, was that the wording of the instructions would be procedural rather than declarative. Instead of “Your tax is 15 percent of the amount on line 42,” we had “Multiply the amount on line 42 by 0.15 and write the result on line 43.” I had expected something more revolutionary, but at least I expanded my vocabulary.

I’ve filled out a lot of 1040s since then, but until yesterday I had never become acquainted with Schedule D (Capital Gains and Losses). What a treat I had waiting for me! Tucked inside the Schedule D instruction book, I found a marvel of labyrinthine arithmetic and logic. The tax worksheet on pages D-15 and D-16 might well be the most complex algorithm that ordinary people are ever asked to perform.

Below is my attempt to represent the worksheet as a data-flow diagram. Inputs (mostly dollar amounts copied from elsewhere in the tax return) are in blue; the eventual output (tax owed) is in red; the flow is mostly from bottom to top. The numbers in the boxes correspond to line numbers in the worksheet.

[Figure: data-flow diagram of the Schedule D tax worksheet]

The directed graph has 82 nodes and 96 edges. Twelve subtractions, seven additions, four multiplications, ten comparisons, and two table lookups. Now that’s an algorithm! It’s gnarlier than calculating the date of Easter.
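For flavor, here is a hypothetical miniature in the worksheet's procedural style. It is emphatically not the actual Schedule D computation (the brackets, rates, and line numbers are invented), but it uses the same vocabulary of operations counted above: subtraction, addition, multiplication, comparison, and a table lookup.

```python
# A made-up, miniature worksheet-style algorithm; NOT Schedule D.
RATE_TABLE = [(0, 0.00), (9_225, 0.15), (37_450, 0.25)]  # invented brackets

def rate_lookup(amount):
    """Table lookup: rate of the highest bracket not exceeding amount."""
    rate = 0.0
    for threshold, r in RATE_TABLE:
        if amount >= threshold:          # comparison
            rate = r
    return rate

def mini_worksheet(line1, line2):
    line3 = max(line1 - line2, 0)        # subtraction, floored at zero
    line4 = min(line3, 37_450)           # comparison against a threshold
    line5 = line4 * rate_lookup(line4)   # multiplication plus table lookup
    line6 = line3 - line4                # subtraction
    line7 = line6 * 0.25                 # multiplication
    return line5 + line7                 # addition: the "tax owed"
```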

What are the chances that I correctly followed all the steps of this algorithm? What are the chances that the published algorithm correctly implements the intent of the tax code?

Posted in computing, modern life | 12 Comments

Mrs. Nguyen’s prestidigitation

Posted on 22 February 2015 by Brian Hayes

From a set of 1 through 9 playing cards, I draw five cards and get cards showing 8, 4, 2, 7, and 5. I ask my 6th graders to make a 3-digit number and a 2-digit number that would yield the greatest product. I add, “But do not complete the multiplication — meaning do not figure out the answer. I just want you to think about place value and multiplication.”

The problem above comes from Fawn Nguyen, who teaches math at a public school in southern California and writes a blog called Finding Ways. To her students she is “Mrs. Win,” an English approximation to the Vietnamese pronunciation.

It’s clear that much care and craftsmanship went into formulating this problem. Why did Nguyen choose a two-digit-by-three-digit product, rather than two-by-two or three-by-three? The asymmetry ensures a unique solution. Why did she use playing cards to select the digits, rather than simply asking the kids to call out numbers? The cards produce a set of distinct digits, without duplicates, and they also rule out the possibility of choosing zero. Repeated digits or zeros would not ruin the problem, but they would needlessly complicate it. Nguyen’s classroom procedure eliminates these distractions without even mentioning them. This is the kind of subtle indirection you find in the performance of a first-rate stage magician.

I’ve had some serious fun with Nguyen’s problem. Finding the answer is not difficult—especially if you cheat and set aside her boldfaced injunction forbidding arithmetic. But of course the answer itself isn’t the point, as Nguyen makes clear in her blog post describing the students’ search for a solution. What we really care about is why one particular arrangement of those digits yields a larger product than all other arrangements. We want an explanatory pattern or principle, a rule that works not just for this one set of digits but for any instance of the max-product problem. I had many stumbles on the path to finding such a rule. Here are some notes I made along the way. But before reading on you might want to write down your own best guess at the answer.


  1. First, some notation. Let’s label the digits like so:

    \[\begin{align}
    x_{2}\; x_{1}\; x_{0}&{}\\
    \underline{\times \quad y_{1} \; y_{0}}&{},
    \end{align}\]

    where the subscripts encode the power of 10 associated with each digit. The object is to find a one-to-one mapping between the sets \(\{x_{2}, x_{1}, x_{0}, y_{1}, y_{0}\}\) and \(\{8, 4, 2, 7, 5\}\) that maximizes the product.

  2. Nguyen’s sixth graders were quick to perceive one important principle: Larger digits should generally go to the left, and smaller digits to the right, in order to get the most benefit from place values in decimal notation. No one proposed 258 × 47 as the maximum product; that combination is obviously beaten by 852 × 74.
  3. If we wanted to solve this problem by exhaustive search, how exhausting would the search be? There are \(5! = 120\) permutations of the five digits, and each permutation corresponds to a different multiplication problem. But in view of the observation made in item 2, the number of plausible candidates is much smaller: just \(\binom{5}{2} = \binom{5}{3} = 10\). You could check them all in a matter of minutes. (But Mrs. Nguyen would not approve.)
  4. Again following the logic of item 2, we can stipulate that the 8—the largest of the five digits—must wind up either in position \(x_2\) or in position \(y_1\). The question is which. My first thought was that surely the 8 goes in the three-digit number, because position \(x_2\) gives the 8 a value of 800, whereas it’s only worth 80 as \(y_1\).
  5. My second thought undermined my first thought. Suppose the problem had been formulated without Mrs. Nguyen’s clever card trick, and we had to work with the digits 8, 7, 0, 0, 0. Then the arrangement yielding the maximum product is surely either 800 × 70 or 700 × 80. But of course those two products are both equal to 56,000. In other words, if we consider just the leading digits, it doesn’t matter which goes in the multiplier and which in the multiplicand.
  6. Let’s write out the full six-term polynomial defining the product:\[1000 x_{2}y_{1} + 100 x_{1}y_{1} + 10 x_{0}y_{1} + 100 x_{2}y_{0} + 10 x_{1}y_{0} + x_{0}y_{0}\]The leading term provides another way of understanding why the largest digit can be either \(x_2\) or \(y_1\). It’s just the commutative law. We get \(1000 x_{2}y_{1}\) with either ordering. Thus we need to consider the smaller terms to decide which placement is better. Hint: \(y_1\) appears in three terms but \(x_2\) turns up in only two.
  7. Maybe it would help to look at a smaller version of the problem? For example, maximize the product \(x_{1}\, x_{0} \times y_{0}\) with digits drawn from the set \(\{1, 2, 3\}\). Where does the largest digit belong?
  8. Here’s a vague, daydreamy notion: The task of maximizing the product of these numbers looks something like an isoperimetric problem. If you have a rectangle of perimeter \(2(h + w)\), you maximize the area \(hw\) by making \(h=w\), so that the rectangle becomes a square. Maybe, by analogy, we should make the three-digit and the two-digit numbers as close to equal as possible, while at the same time making both of them as large as possible.
  9. At this point I was ready to commit. I was persuaded by the arguments in items 6, 7, and 8 that the largest digit should be bound to variable \(y_1\), and the next largest to \(x_2\). Then the same rule could be applied recursively to the remaining digits and variables, yielding this assignment:

    \[\begin{align}
    7\, 4\, 2&{}\\
    \underline{\times \quad 8 \, 5}&{}\\
    6\, 3\, 0\, 7\, 0&{}
    \end{align}\]

  10. Of course you already know I was wrong. Even if you haven’t yet solved the problem for yourself or looked up the answer elsewhere, it’s a well-established principle of proper storytelling that no one ever stumbles on the right answer on the first try.

    My error was revealed by a program running an exhaustive-search algorithm. It showed that exchanging the positions of the 7 and the 8 yields a larger product:

    \[\begin{align}
    8\, 4\, 2&{}\\
    \underline{\times \quad 7 \, 5}&{}\\
    6\, 3\, 1\, 5\, 0&{}
    \end{align}\]

    But that isn’t the right answer either. Instead of switching the 7 and 8, you can exchange the 5 and the 4 to get the following result, which does turn out to be optimal:

    \[\begin{align}
    7\, 5\, 2&{}\\
    \underline{\times \quad 8 \, 4}&{}\\
    6\, 3\, 1\, 6\, 8&{}
    \end{align}\]


So that’s it—the answer we’ve been seeking, the maximum product, the solution to Mrs. Nguyen’s problem. There’s no arguing with arithmetic.
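The exhaustive search that settled the matter is easy to reproduce; a minimal version might look like this:

```python
from itertools import permutations

def max_product(digits, split=3):
    """Brute force over all orderings of the digits, taking the first
    `split` digits as one factor and the rest as the other."""
    def value(ds):
        return int("".join(map(str, ds)))
    best = max(permutations(digits),
               key=lambda p: value(p[:split]) * value(p[split:]))
    x, y = value(best[:split]), value(best[split:])
    return x, y, x * y
```

For the cards 8, 4, 2, 7, and 5 it confirms the optimum 752 × 84 = 63,168.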

But it’s hardly the end of the trail. Why does that peculiar permutation of the digit set give the biggest product? Does the same pattern work for other two-digit-by-three-digit products? What about problems of other shapes and sizes? And what is the pattern, anyway? How would you succinctly state the rule for placing digits in Mrs. Nguyen’s boxes?

In trying to make sense of what’s going on here, I’ve found a certain graphic device to be helpful. I call it a lacing diagram, because it reminds me of the various schemes for lacing up a pair of sneakers. The patterns are easier to perceive with larger numbers of digits (i.e., high-top sneakers), so the examples given below show sets of 10 digits arranged to form a five-by-five multiplication problem.

In a lacing diagram the red arrows trace a path that threads its way through all the digits, ordered from largest to smallest. Each segment of this path can either cross from one factor to the other (a trans step) or stay within the same factor (a cis step). The particular lacing shown here is made up of alternating trans and cis segments. The sequence of t’s and c’s below the diagram serves as a linearized encoding of the pattern; the number below the encoding is the product of the two factors.

[Lacing diagram: alternating trans and cis steps]

Here is the lacing diagram corresponding to the pattern that I initially believed would prevail in all Nguyen products:

[Lacing diagram: the all-trans zigzag]

In this case, all the steps are trans, as the arrows zigzag from one side to the other. The linear encoding consists of nine t’s.

And finally here is the lacing diagram for the winning permutation, the arrangement that gives the largest product. The pattern is the same as the second lacing diagram above—the zigzag—except for a single twist: The leftmost digits of the two factors have been switched, and so there is one cis step in the path.

[Lacing diagram: the zigzag with a single twist]

After a bit of puzzling, I was able to work out why that single twist yields a better score than the plain zigzag. It’s because \((9 \times 7) + (8 \times 6) = 111\), whereas \((9 \times 6) + (8 \times 7) = 110\). Likewise \((9 \times 5) + (8 \times 4) = 77\), whereas \((9 \times 4) + (8 \times 5) = 76\), and so on. Note the difference between the twisted and the zigzag products: \(8439739020 - 8428629020 = 11110000\). Each of the four pairings of the leftmost digits with those to their right contributes a 1 in that difference.

If one twist is a good thing, maybe more twists would be even better? For example, if we invert the 5 and the 4, we get \((5 \times 3) + (4 \times 2) = 23\) instead of \((5 \times 2) + (4 \times 3) = 22\), again for a net gain. But of course there’s a flaw in this strategy. Flipping the 5 and the 4 increases their products with neighboring digits to the right, but lowers those with digits to the left. Flipping the leftmost digits doesn’t run afoul of this rule because there are no neighbors to the left.
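The trans/cis encoding can be computed mechanically. Here is a small sketch (the function name and representation are my own):

```python
def lacing(x, y):
    """Return the trans/cis encoding of an arrangement: walk the digits
    from largest to smallest; 't' if a step crosses between the two
    factors, 'c' if it stays within one factor."""
    side = {d: 0 for d in x}        # which factor each digit sits in
    side.update({d: 1 for d in y})
    order = sorted(x + y, reverse=True)
    return "".join("c" if side[a] == side[b] else "t"
                   for a, b in zip(order, order[1:]))
```

On the ten digits 0 through 9, the plain zigzag arrangement encodes as nine t's, and the zigzag-with-a-twist as "tcttttttt".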

For a more explicit definition of the zigzag-with-a-twist arrangement, here is a Python implementation. The basic idea is to deal out the digits alternately to the \(x\) and \(y\) factors, starting with \(x\) and working through the digits in descending order. When \(y\) (the smaller factor) runs out of room, any remaining digits are tacked onto the end of \(x\). Finally—this is the twist—the leftmost \(x\) and \(y\) digits are swapped. This procedure works in any base.

def twisted_zigzag(digit_set, s):
    s = min(s, len(digit_set) - s)    # len of smaller factor
    digits = sorted(list(digit_set))  # smallest to largest
    x = []
    y = []
    while digits:                     # pop from end of list
        x.append(digits.pop())        # start with x
        if len(y) < s:                # zigzag until y is full
            y.append(digits.pop())
    x[0], y[0] = y[0], x[0]           # swap first digits
    return x, y
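As a check on the procedure: if I have traced it correctly, twisted_zigzag applied to the digits 0 through 9 with a five-digit split returns the factors 87531 and 96420, the arrangement in the twisted lacing diagram, while the plain zigzag (no swap) gives 97531 × 86420. The products can be verified directly:

```python
# Factors written out explicitly so this check stands alone.
twisted = 87531 * 96420   # zigzag with the leftmost digits swapped
zigzag = 97531 * 86420    # plain all-trans zigzag
```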

Does the zigzag-with-a-twist arrangement give the maximum product for all possible Nguyen-type problems? I can offer substantial evidence supporting that proposition. For base-10 numbers formed without repeated digits there are 4,097 two-factor multiplication problems with digits sorted in descending order. The zigzag-twist pattern gives the correct result for all of them. (For digit sets that include 0, the solution is not always unique.) Indeed, the algorithm works for all problems in bases between 2 and 15. It’s hardly a daring conjecture to suggest that it works in all bases, but I have not produced a proof. Maybe Mrs. Nguyen’s sixth graders will do so.
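A scaled-down version of that verification, checking the pattern against brute force for every no-repeat problem drawn from the digits 1 through 7, runs in a few seconds:

```python
from itertools import combinations, permutations

def twisted_zigzag(digit_set, s):
    # repeats the procedure above, so this check runs standalone
    s = min(s, len(digit_set) - s)
    digits = sorted(digit_set)
    x, y = [], []
    while digits:
        x.append(digits.pop())
        if len(y) < s:
            y.append(digits.pop())
    x[0], y[0] = y[0], x[0]
    return x, y

def value(ds):
    n = 0
    for d in ds:
        n = 10 * n + d
    return n

def brute_best(digit_set, s):
    return max(value(p[:s]) * value(p[s:])
               for p in permutations(digit_set))

# Every subset of {1..7} with at least two digits, every factor shape.
ok = all(value(x) * value(y) == brute_best(subset, s)
         for n in range(2, 8)
         for subset in combinations(range(1, 8), n)
         for s in range(1, n // 2 + 1)
         for x, y in [twisted_zigzag(list(subset), s)])
```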

Posted in mathematics | 7 Comments

A quasirandom talk

Posted on 2 February 2015 by Brian Hayes

I’ll be giving a talk at Harvard this Friday, February 6: “Orderly Randomness: Quasirandom Numbers and Quasi–Monte Carlo.” If you’re in the neighborhood, please stop by. Lunch at 12:30; quasirandomness at 1:00 pm.

Update 2015-02-08: Slides online. And video.

Posted in computing, mathematics | 1 Comment

600613

Posted on 16 December 2014 by Brian Hayes

Pick a number, N, then try searching for it on the web via Bing or Google (or maybe the leet version of Google). What can you expect to learn? I wasn’t quite sure of the answer, so I ran some experiments.

When N is a small positive integer—less than 100, say—the leading results tend to be mass-audience web pages that happen to display the numeral N in some prominent way, such as in a headline or a title. There are news stories (Packers 43, Falcons 37), TV stations (WXMI Fox 17), a few brand names (Motel 6), references to iconic events (9/11, Apollo 13), listings of Bible verses (Romans 3:23).

With somewhat larger integers—three or four digits—I see a lot of street addresses, area codes, tax forms, statutes and ordinances. With five-digit numbers, Zip codes become prominent. At six digits we enter the land of hex colors, accompanied by a baffling variety of part numbers, account numbers, serial numbers, patent numbers, error numbers, lottery numbers. With a search string of 8 to 10 digits, telephone directories dominate the results. Still further out on the number line, you eventually come to a numerical desert where Google and Bing usually come up empty.


To get a more quantitative sense of how numbers are distributed across the web, I decided to do some sampling. I randomly selected 2,000 positive integers of 1 to 12 decimal digits, and submitted them to Google as web search queries. To construct the query integers I started with 2,000 floating-point numbers selected uniformly at random (with replacement) from the range \(0 \le m \lt 12\). For each \(m\) I calculated \(N = \lfloor 10^{m}\rfloor\), then ran a Google search for N. The work was done by a Python script with a politeness pause of one second between queries. From the results of each search I extracted \(H(N)\), the number of hits, which Google reports near the top of the page. Here’s what I found, plotted on log-log scales:
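The sampling step, minus the actual queries, amounts to this (the seed is arbitrary, chosen only for reproducibility):

```python
import random

# 2,000 integers, log-uniform over 1 to 12 decimal digits:
# N = floor(10**m) with m uniform in [0, 12).
random.seed(1)
samples = [int(10 ** random.uniform(0, 12)) for _ in range(2000)]
```

Each N then went out as a search query, with a pause between requests.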

What an intriguing graph! Over most of the range in the log-log plot, the broad trend looks very nearly linear. What does that mean? If the Google data accurately reflect the state of the web, and if my sampling of the data can be trusted, it means the number of web pages mentioning numbers of magnitude \(10^k\) is roughly constant for all k in the range from \(k = 2\) to \(k = 10\). I don’t mean to suggest that specific large numbers appear just as frequently as specific small numbers. That’s obviously untrue: A typical two- or three-digit number might be found on a billion web pages, whereas a specific nine- or ten-digit number is likely to appear on only one or two pages. But there are only 90 two-digit numbers, compared with nine billion 10-digit numbers, so the overall number of pages in those two classes is approximately the same.

Here’s another way of saying the same thing: The product of \(N\) and \(H(N)\) is nearly constant, with a geometric mean of roughly \(7 \times 10^{10}\). An equivalent statement is that:

\[\log_{10}{N} + \log_{10}{H(N)} \approx 10.86.\]

You can visualize this fact without doing any arithmetic at all. Just print a series of \(N, H(N)\) tuples in a column and observe that the total number of digits in a tuple is seldom less than 11 or greater than 13.

    N,        H(N)
    96964835, 2120
    2048, 164000000
    476899618, 214
    96416, 374000
    75555964, 3020
    171494, 182000
    154045436, 2160
    1206, 112000000
    761088, 50200
    7500301034, 24
    13211445, 10900
    1289, 77000000
    1507549, 18100
    3488, 3330000
    7507624475, 10
    17592745, 2830
    1430187656, 30
    691, 265000000
    41670244642, 2
    326, 52900000

Although the vast majority of the 2,000 data points lie near the 10.86 “main sequence” line, there are some outliers. One notable example is 25898913. Most numbers of this magnitude garner a few thousand hits on Google, but 25898913 gets 29,500,000. What could possibly make that particular sequence of digits 10,000 times more popular than most of its neighbors? Apparently it’s not just an isolated freak. About half the integers between 25898900 and 25898999 score well below 10,000 hits, and the other half score above 20 million. I can’t discern any trait that distinguishes the two classes of numbers. Sampling from other nearby ranges suggests that such anomalies are rare.


A straight line on a log-log plot often signals a power-law distribution. The classic example is the Zipfian distribution of word frequencies in natural-language text, where the kth most common word can be expected to appear with frequency proportional to \(k^{-\alpha}\), with \(\alpha \approx 1\). Does a similar rule hold for integers on the web? Maybe. I tried fitting a power law to the data with the powerlaw Python package from Jeff Alstott et al. The calculated value of \(\alpha\) was about 1.17, which seems plausible enough, but other diagnostic indicators were not so clear. Identifying power laws in empirical data is notoriously tricky, and I don’t have much confidence in my ability to get it right, even with the help of a slick code library.
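As a sanity check on the general approach (though not a substitute for the powerlaw package's maximum-likelihood fit), ordinary least squares on log-log coordinates does recover a known exponent from clean synthetic data; the trouble with real hit counts is that they are nothing like this clean.

```python
import math
import random

# Synthetic Zipf-like data with a known exponent, plus mild noise.
random.seed(7)
alpha = 1.17
ranks = random.sample(range(1, 100_000), 500)
freqs = [k ** -alpha * random.lognormvariate(0, 0.2) for k in ranks]

# Ordinary least squares on the log-log points.
xs = [math.log(k) for k in ranks]
ys = [math.log(f) for f in freqs]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
# slope should come out near -alpha
```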

I’m actually surprised that the pattern in the graph above looks so Zipfian, because the data being plotted don’t really represent the frequencies of the numbers. Google’s hit count \(H(N)\) is an approximation to the number of web pages on which \(N\) appears, not the number of times that \(N\) appears on the web. Those two figures can be expected to differ because a page that mentions \(N\) once may well mention it more than once. For example, a page about the movie 42 has eight occurrences of 42, and a page about the movie 23 has 13 occurrences of 23. (By the way, what’s up with all these numeric movie titles?)

Another distorting factor is that Google apparently implements some sort of substring matching algorithm for digit strings. If you search for 5551212, the results will include pages that mention 8005551212 and 2125551212, and so on. I’m not sure how far they carry this practice. Does a web page that includes the number 1234 turn up in search results for all nonempty substrings: 1234, 123, 234, 12, 23, 34, 1, 2, 3, 4? That kind of multiple counting would greatly inflate the frequencies of numbers in the Googleverse.

It’s also worth noting that Google does some preprocessing of numeric data both in web pages and in search queries. Commas, hyphens, and parentheses are stripped out (but not periods/decimal points). Thus searches for 5551212, 555-1212, and 5,551,212 all seem to elicit identical results. (Enclosing the search string in quotation marks suppresses this behavior, but I didn’t realize that until late in the writing of this article, so all the results reported here are for unquoted search queries.)


In the graph above, the linear trend seems to extend all the way to the lower righthand corner, but not to the upper lefthand corner. If we take seriously the inferred equation \(N \times H(N) = 7 \times 10^{10}\), then the number of hits for \(N = 1\) should obviously be \(7 \times 10^{10}\). In fact, searches for integers in the range \(1 \le N \le 25\) generally return far fewer hits. Many of the results are clustered around \(10^{7}\), four or five orders of magnitude smaller than would be expected from the trend line.

To investigate this discrepancy, I ran another series of Google searches, recording the number of hits for each integer from 0 through 100. Note that in this graph the y axis is logarithmic but the x axis is linear.

There’s no question that something is depressing the abundance of most numbers less than 25. The abruptness of the dip suggests that this is an artifact of an algorithm or policy imposed by the search engine, rather than a property of the underlying distribution. I have a guess about what’s going on. Small numbers may be so common that they are treated as “stop words,” like “a,” “the,” “and,” etc., and ignored in most searches. Perhaps the highest-frequency numbers are counted only when they appear in an <h1> or <h2> heading, not when they’re in ordinary text.

But much remains unexplained. Why do 2, 3, 4, and 5 escape the too-many-to-count filter? Same question for 23. What’s up with 25 and 43, which stand more than 10 times taller than their nearest neighbors? Finally, in this run of 101 Google searches, the hit counts for small integers are mostly clustered around \(10^6\), whereas the ea