Link Popularity Made Simple—Sort of

Introduction

This article explains link popularity (specifically Google's use of link popularity) as it pertains to search engine rankings. The explanation requires some mathematics, but you don't need a strong mathematical background to understand it—high school algebra will get you through. You won't find specific methods for improving your search engine rankings, but reading this may help you select the more productive options offered to you for improving rankings.

Ever since Yahoo! replaced Inktomi as their search engine with Google, with their link-oriented ranking algorithm, the Web has been abuzz with the importance of "link popularity". It is now "crucial" that we attend to our links. We read, over and over, sentences like, "To push your site towards the top of search results, it's important to have many web sites point or link to your site." Further, we're told that some links are better than others, that (for instance) links from .gov or .edu sites matter more than those from cousin Sally, that free-for-all link sites don't matter at all (or that they help immensely)...and on and on.

This buzz, and the information attending it comes from sources of varying credibility, but almost all offer a specific product or service taking advantage of the information. Oddly missing has been a popular explanation of how ranking algorithms employ link analysis in their calculations—explanations that would provide the background information allowing us to judge the credibility of the offers .

That lack is odd because of all the features of search engine ranking algorithms, link popularity may be the easiest to understand. Not because it is the least complicated, but because it has been so thoroughly described by its developers. Descriptions of specific implementations of the technique are widely published in academic circles, and are available on the Web. This article stems from one of those publications.

Links Measure Importance

Before diving into the details, let's gloss the basics. Link "topology" is really a better term than "popularity" because the technique treats interconnectedness as well as count. Whatever you call it, the analysis measures a page's importance. That's different from relevance. While relevance means how well the content of a Web page pertains to a specific query string, "importance" refers to a page's worth, without regard to the content. A link to a page makes a statement about the page's perceived worth and so confers some value on it. The more value conferred through links, the more important the page is.

But we don't all have the same value to confer. Some of our pages are more important than others, and a link from an important page confers more value than a link from an unimportant one.

So: An important page is one that has important pages pointing to it. Circular reasoning? Absolutely—but easy to grasp intuitively. For instance, a link to your site from the National Institute of Standards and Technology ought to confer more value than a link from your cousin Sally. Not because Sally is partial, but because NIST is more important, in some vague way that we nevertheless seem to understand.

How Importance Is Measured

Although easy enough to grasp with two or three pages, measuring the relative importance of billions of inter-connected pages seems hopelessly complicated. And complicated it is, but not hopelessly—it's really pretty straightforward. It does require a lot of calculations, but luckily we don't have to invent them. We can just get them from the academic literature.

While graduate students at Stanford, Larry Page and Sergey Brin (the inventors of Google the search engine and the founders of Google the company ) published "The Anatomy of a Large-Scale Hypertextual Search Engine", which you can download in PDF format from http://www-db.stanford.edu/pub/papers/google.pdf

Their paper describes PageRank, Google's technique for defining a page's importance on the basis of the pages that link to it. It is this method that I elaborate in this article.

Here's the PageRank formula. It looks difficult, but don't let it intimidate you. We'll take it a little at a time; patience and high school algebra will get you through it.

Assume a web page, called A, that has a number of pages that link to it. Call those pages that link to it T1, T2, T3, and so on to the last one, called Tn

No math so far—this just names things so we can talk about them later. Think of A as your home page, and T1 to Tn as other pages on the Web that contain hypertext links to yours. T2 can even be your cousin Sally's, if that helps.

The PageRank of A is calculated by this equation:

PR(A) = (1-d)+d [PR(T1)/C(T1)+PR(T2)/C(T2)+PR(T3)/C(T3)+...+PR(Tn)/C(Tn)]

This looks really nasty. But ripped into its three components, it's much easier.

PR(A) means the PageRank of A; that's what we are trying to find out. This just states the problem—the real work happens on the other side of the equals sign.

(1-d)+d is a damping factor. Never mind what it does. Page and Brin recommend a setting of 0.85, so we'll set it to that and then forget about it. Although it probably matters a lot if you run a search engine, it doesn't matter much for our purposes here. We're just going to calculate that big mess between the brackets (these things, [ and ]), multiply it by 0.85 and add that result to 0.15 to stay true to the formula.

Now, to the big mess between the brackets. If we reformat this section, like this:

PR(T1)/C(T1)+
PR(T2)/C(T2)+
PR(T3)/C(T3)+...+
PR(Tn)/C(Tn)]

it's easier to recognize T1, T2, and T3 as those pages that link to A, and (I hope) easy to see that an exceedingly simple calculation is being done to them. The apparent complication comes from the quantity of the calculations, not the difficulty.

PR still means PageRank, just like on the left of the equals sign, T1, T2, etc., just name the pages. The only new thing is the C—as in C(T2). This is defined as the number of hypertext links on page T2. The number of outgoing links. This is an outgoing link:

http://www.audettemedia.com.

On the receiving end (at Audette Media's home page), this is an incoming link (or an in-link), but on this page, it is an out-link. Both matter!

Putting back together the three components we earlier ripped, we can now list a set of steps for applying the formula to any page:

  1. make a list of all the pages which point to the target page (we'll call it a link list);

  2. for each page on the link list:
    • determine the PageRank;
    • count the number of outgoing links;
    • divide each page's PageRank by the number of its outgoing links;
  3. sum the results of step 2 for the entire link list;
  4. apply the damping factor to the resulting sum.

Making The Calculations

Four steps. Simple enough, but where do we start?. In order to determine PageRank, this says we have to already know the PageRank of the link list—have to know, that is, exactly what we're trying to find out.

But the PageRank formula achieves its results by repeating the calculation until the results converge on a stable result. Meaning that we can just start someplace arbitrary, and everything will work itself out.

This is a remarkably clever idea, which I'm going to belabor. To demonstrate how it works, I've created a little Web made up of 10 pages, and we'll employ PageRank calculations to rank these pages according to their importance.

The universe looks like this—the circles are Web pages, the lines between them are hyperlinks. The arrows show the direction of the links:

Diagram showing 10 Web sites with interconnections

I suggest you open a new window to keep a copy of this diagram handy.)

What a mess, eh? But things clear up fast once we start the calculations. Stepping through the sequence, first we make a link list. Here's A's:

A has 6 in-links, from B, E, G, H, I, and J

You can do the rest if you want to follow along closely.

Next we find the PageRank for each of the pages on this link list. Of course, at this point we don't know any PageRanks, so, we'll arbitrarily assign each page a PageRank of 1 for the first iteration of the algorithm.

Next we count the number of outgoing links for each page on the link list and divide PageRank by the result of the count. Using A's link list, we generate this table:

Page "A" Link List

Page

PageRank

# out-links

PR/out-links

B

1

6

0.1667

E

1

4

0.2500

G

1

3

0.3333

H

1

2

0.5000

I

1

4

0.2500

J

1

3

0.3333

 

 

Total

1.8333

As the last step, we apply the damping factor by multiplying the sum (1.8333) by 0.85, getting 1.5583, then adding (1-0.85), or 0.15 to get 1.7083. After the first iteration, PR(A)=1.7083

Repeating the steps for each of the ten pages in our Web delivers the following results, listed in order of rank (you can check my work by constructing a table for each of the other 9 pages just like the table I did for A, if you've a mind to):

PR(A)= 1.7083
PR(J)= 1.4250
PR(G)= 1.2833
PR(H)= 1.0708
PR(C)= 0.8583
PR(D)= 0.8583
PR(F)= 0.7875
PR(I)= 0.7167
PR(E)= 0.5042
PR(B)= 0.3625

OK, looking it over, the list makes a little sense. A has the most in-links and is listed as the most important, and B the fewest and is the least important. But J in second place seems a little odd, since G has more incoming links (4 versus 3). So, let's run through the calculations again.

Second Iteration

We use the same sequence of steps, but this time, instead of using an arbitrary 1 for the PageRank value in calculating each page's links-list, we'll use the values in the above table; that is, the results of the first iteration. So, in calculating A's link list for the second iteration, we generate this table:

Link List for Page "A"

Page

PageRank

# out-links

PR/out-links

B

0.3625

6

0.0604

E

0.5042

4

0.1261

G

1.2833

3

0.4278

H

1.0708

2

0.5354

I

0.7167

4

0.1792

J

1.4250

3

0.4750

 

 

Total

1.8039

Adjust that total by the damping factor, and PR(A)=1.6833 after the second iteration.

Look what happened this second time through. Taking B as a example, notice that instead of contributing a value of 0.1667 to A's accumulating PageRank, this time around B added only 0.0604. In other words, after the first go-round, B's importance diminished from the (arbitrary) starting value of 1, and so its referrals diminished in value. Once we get past the initial, arbitrary PageRank of 1, every page starts to contribute according to its own "importance".

I won't go through any calculations in detail (you'll have to trust that I did them right); here are the PageRank values (again in order) after the second iteration of the algorithm:

PR(A)= 1.6833
PR(G)= 1.5442
PR(J)= 1.4870
PR(H)= 1.3335
PR(F)= 1.0502
PR(C)= 0.7731
PR(D)= 0.7173
PR(I)= 0.5361
PR(E)= 0.3537
PR(B)= 0.2572

There's a little more action this time around. G and J swapped places and that seems intuitively better, as we mentioned earlier. F went from 7th to 5th displacing C and D. Let's see if we can figure that one out.

Look again at the link structure diagram. C, D, and F all have 3 in-links, but notice that one of F's comes from A, while A does not link to either C or D. A being the most important (highest ranked) page in this universe, a referral from it carries more weight than referral from any other page, so F gets a big boost that C and D lack. Let's see if anything changes the third time through.

Third Iteration

I'll neglect all the detail, and just go to the results. PageRank results after 3 iterations are:

PR(A)= 1.8020
PR(G)= 1.6515
PR(H)= 1.4019
PR(F)= 0.9920
PR(J)= 0.9496
PR(C)= 0.7774
PR(D)= 0.7389
PR(I)= 0.6328
PR(E)= 0.3004
PR(B)= 0.2260

Only one change: J drops from 3rd to 5th boosting G and H. Why? Notice that G and H have links from A, while J does not. Again, having a link from an important page boosts importance. Compare F and H, both with 3 in-links. F's links come from A, B, and C, while H's are from A, C, and F. That single difference—the difference between the weight of the contribution of B (at the bottom of the list) and F (near the top)—ranks H higher than F.

Although this universe offers no example of this, you can see how a single in-link from A would count more than links from all three of the bottom of the list. With a larger universe, that spread would be even wider, and link quality would play an even larger part than here, where it serves mainly to decide the relative rankings of pages with the same number of in-links.

Fourth Iteration: We're Done!

After 4 iterations, the values differ, but the rankings remain the same:

PR(A)= 1.7132
PR(G)= 1.5575
PR(H)= 1.4126
PR(F)= 1.0230
PR(J)= 0.9764
PR(C)= 0.8162
PR(D)= 0.7844
PR(I)= 0.6036
PR(E)= 0.3165
PR(B)= 0.2138

The rankings stay this way through several more iterations, and seem to be stable, so we'll stop here. This last list contains what we'll call the "official" PageRank values of our little 10 page universe.

Lessons Learned

Simple, right? Not particularly easy, but once you understand a small example you can expand that understanding to encompass the entire Web. You'll never really get your mind around it because the complexity of the interrelationships between a billion or so pages in ungraspable—we can, though, understand the calculations, and appreciate how they determine what pages are "important".

More importantly, for most of us anyway, we can use our new grasp of the algorithm to make sense of some of the buzz about link popularity.

Lets start with the quote that opened this article: "To push your site towards the top of search results, it's important to have many web sites point or link to your site." A true statement, as far as it goes. In general, the more in-links you have the higher you'll be ranked in the search results. But as we've seen, links confer different amounts of importance, and a high quality link can easily outweigh several low quality links.

Now this one: "The quality of the link holds more "weight" than the quantity of links. You will get better results in the search engines if you have link popularity from sites that have considerable traffic." The statement starts off dead on, then misses the point completely. Traffic has nothing to do with link popularity. The "quality" of a site is nothing more than its PageRank. All links (of whatever "quality") contribute to the calculation; "quality" links just contribute more. Traffic contributes nothing.

Or, consider this: "Free for all sites don't boost your rankings." Mostly true. Although all incoming links improve your importance, the FFAs, whose lots of out-links dilute whatever importance they might have, will contribute very little. Also, to the extent they are considered spam, Search Engines may keep them out of their index—a link from a non-indexed site contributes nothing.

And lastly, how about this one: "Links from .gov and .edu sites are better than links from your cousin." Maybe. The .gov or .edu by themselves confer no additional authority, but these sites might be more likely than your cousin's to have lots of in-links themselves, and so have a high PageRank. That alone makes a link "better".

The main lesson here seems to be the same old one. If you want the high search rankings that result from a high PageRank (or whatever other link popularity algorithm a SE uses) pay attention to your content. Easy to navigate pages with high quality content will attract links from Webmasters augmenting their sites by linking to yours, and it is those incoming links, hopefully from sites that are themselves important, that will boost your page's importance.

 

For a quote, a question or just to shoot the breeze:
stevec@topdogstrategy.com
303.818.8590