Benford’s Law

I was going through Statistics Hacks and came across Benford’s Law, which states that in naturally occurring numerical data, the distribution of the first (non-zero) significant digit follows a logarithmic probability distribution:

P(D1 = d) = log10 (1 + 1/d)

In other words, the first digit is much more likely to be a 1 than a 9. The pretty graph to the right shows the expected frequency of each first digit. It’s counter-intuitive, as one would assume the digits would be uniformly distributed. However, the law has been observed in a variety of areas, like multiples of numbers[2], blackbody radiation, physical constants, areas of rivers, populations, and New York Times front pages[9].
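
If you’d rather see the expected frequencies as numbers than squint at a graph, they fall straight out of the formula; here’s a quick Python sketch:

```python
import math

# Expected probability of each leading digit d under Benford's law:
# P(D1 = d) = log10(1 + 1/d)
for d in range(1, 10):
    print(f"{d}: {math.log10(1 + 1 / d):6.1%}")
```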

Bassam Hasan[3] notes data must meet the following criteria:

  • The data must be numeric.
  • There must be an underlying cause for the numbers to occur. For example, business invoice numbers would not work because the numbers are merely labels. (See the lottery question below.)
  • The numbers are not restricted by maximum or minimum values. For example, human heights fall within a limited range that would skew the leading digits.
  • The numbers must occur naturally, not be invented or assigned[7], as with telephone numbers (based on a phone switch), postal codes (a postal facility), or social security numbers (the state where the card was issued).
  • There must be a large sample size.

And no, lottery numbers would not be a good candidate. As Dr. Nigrini[5] explains:

“[Lottery] balls are not really numbers; they are labeled with numbers, but they could just as easily be labeled with the names of animals. The numbers they represent are uniformly distributed, every number has an equal chance”[4]

As an experiment, I did an analysis of the size of my multimedia files (mp3, wmv, mov) on another hard drive. I expected this to fail one or two of the criteria above. Here’s the distribution:

File size distribution

There are three major categories of multimedia files (songs, podcasts, and lectures) whose file sizes are clustered. Songs tend to be 2 – 6 minutes (“Freebird” notwithstanding) since I delete stuff that’s shorter. At an average encoding rate of 128 kb/s, this results in a lot of files in the 4 MB – 6 MB range. An informal sample of my favorite podcasts (“Wait Wait… Don’t Tell Me!”, “This American Life”, and “Talking Robots”) suggests these commonly run from 10 minutes up to an hour. At the top end of the size graph are lectures. (I admire Kiri’s stamina in going the full four(!) hours.)
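
As a rough sanity check on those sizes (assuming a constant 128 kb/s stream, which is a simplification):

```python
# Rough size estimate: duration (minutes) x bitrate (kb/s) -> megabytes.
# 128 kb/s is an assumed constant bitrate, purely for illustration.
bitrate_kbps = 128
for minutes in (2, 4, 6):
    size_mb = minutes * 60 * bitrate_kbps / 8 / 1000  # kilobits -> kilobytes -> MB
    print(f"{minutes} min at {bitrate_kbps} kb/s ≈ {size_mb:.1f} MB")
```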

Next, I looked at the set of files on my computer’s primary (OS) hard disk. With a sample size of 119,708 files, this is how the first digit was distributed:
First digit of file size
The curve is “pretty close!”
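
Here’s a sketch of one way to gather these counts (the root path is just a placeholder; point it at whatever drive you want to survey):

```python
import math
import os
from collections import Counter

# Tally the first significant digit of every non-empty file size under a
# root directory, then compare against the Benford expectation.
counts = Counter()
for root, _dirs, files in os.walk("C:/"):
    for name in files:
        try:
            size = os.path.getsize(os.path.join(root, name))
        except OSError:
            continue  # unreadable or vanished files are skipped
        if size > 0:  # ignore zero-length files
            counts[str(size)[0]] += 1

total = sum(counts.values())
for d in "123456789":
    observed = counts[d] / total
    expected = math.log10(1 + 1 / int(d))
    print(f"{d}: observed {observed:5.1%}   expected {expected:5.1%}")
```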

Because I had already collected the data, here’s the distribution of file sizes:

File size distribution
At the far right, there are two 2 GB files used by the system (pagefile.sys, hiberfil.sys). Zero-length files were ignored, obviously.

Since Benford’s law also predicts the distribution of digits beyond the first position, we can use it to look further at the file sizes. Here’s a table of the expected percentages for each digit in each position[10]:

[I will send ten dollars to the first person who can tell me why the hell WordPress arbitrarily deletes rows and columns in tables — and what I can do to prevent this, short of “don’t use wordpress”.]
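
In the meantime, the numbers in that table can be regenerated from the general form of the law: the probability that the digit in the nth significant position equals d is the sum of log10(1 + 1/(10k + d)) over k from 10^(n−2) to 10^(n−1) − 1 (for n ≥ 2; the first position uses the formula at the top). A small sketch:

```python
import math

def benford_digit_prob(position: int, digit: int) -> float:
    """Expected probability that the digit in the given significant
    position (1 = first, 2 = second, ...) equals `digit`."""
    if position == 1:
        return math.log10(1 + 1 / digit) if digit != 0 else 0.0
    lo, hi = 10 ** (position - 2), 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in range(lo, hi))

# Expected percentages for digits 0-9 in the first three positions.
for pos in (1, 2, 3):
    row = "  ".join(f"{benford_digit_prob(pos, d):5.1%}" for d in range(10))
    print(f"position {pos}: {row}")
```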

Let’s apply this to the first two digits of the file sizes on my computer. The red lines are the observed distribution of file sizes. The blue line is the Benford expectation I got by multiplying the per-digit factors together, and the green line is a logarithmic trend line fitted to the Benford numbers, because I may have done something wrong just multiplying things out.

Still, we see an interesting result:

There are a lot of files whose size is 8192 bytes!  [Edit: over 1,800 of these are regxxxxx files related to Internet Explorer 7 or Patch installs.]
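
For reference, the expected share of a leading two-digit pair m (10 through 99) is log10(1 + 1/m), so a pile of 8,192-byte files shows up as a spike at “81”. Here’s a quick sketch, with made-up sizes standing in for the real file list:

```python
import math
from collections import Counter

# Expected share of each leading two-digit pair m (10..99): log10(1 + 1/m).
expected = {m: math.log10(1 + 1 / m) for m in range(10, 100)}
print(f"expected share of '81': {expected[81]:.2%}")   # ~0.53%

# Observed share of '81', given a list of file sizes (toy values, not my real data).
sizes = [8192, 8192, 8192, 4_325_376, 96_511, 2_147_483_648]
observed = Counter(int(str(s)[:2]) for s in sizes if s >= 10)
print(f"observed share of '81': {observed[81] / sum(observed.values()):.0%}")
```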

Benford’s law has had a lot of traction in accounting[1,3,4,7,8,10], where it’s used as a technique to detect fraud. For example, the Department of Justice has been using this in counter-terrorism operations[7], identifying shell corporations used to funnel money. From their “Fiscal Forensics I” article, they note:

A classic example is the organization that has a disproportionate number of transactions in the eight and nine thousand dollar range since they may be structuring transactions (designed to fall below SAR and CTR levels). This fact would likely be revealed during a Benford’s Law analysis as the amount of numbers beginning with the digits 8 or 9 would exceed their expected probability of occurrence.
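
The article doesn’t spell out the statistics used, but a bare-bones version of that kind of screen might score each leading digit against its Benford expectation. The z-score approach below is my own sketch (with a toy ledger), not the Department’s actual method:

```python
import math
from collections import Counter

def first_digit(x: float) -> int:
    """First significant digit of a nonzero amount."""
    x = abs(x)
    while x < 1:
        x *= 10
    return int(str(int(x))[0])

def leading_digit_zscores(amounts):
    """Per-digit z-scores of observed vs. Benford-expected first-digit
    frequencies.  Large positive scores on 8 and 9 would be the sort of
    red flag described above (transactions structured under a threshold)."""
    digits = [first_digit(a) for a in amounts if a]
    n = len(digits)
    counts = Counter(digits)
    scores = {}
    for d in range(1, 10):
        p_exp = math.log10(1 + 1 / d)
        p_obs = counts[d] / n
        se = math.sqrt(p_exp * (1 - p_exp) / n)
        scores[d] = (p_obs - p_exp) / se
    return scores

# Toy example: a ledger padded with amounts just under $10,000.
ledger = [1_234.50, 2_871.00, 9_400.00, 8_950.00, 9_750.00, 8_600.00, 450.25]
print(leading_digit_zscores(ledger))
```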

Georgia Tech Professor Ted Hill proved Benford’s law applies to numbers in other bases[11]. An amusing trick he plays on his classes is asking his students to either (1) flip a coin 200 times, recording the pattern of heads or tails, or (2) make the data up. The next day, he points out most of the made-up data.[4] He notes that a genuine sequence of 200 flips is very likely to contain a run of six of the same side in a row; people faking their data rarely include one.[11]
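
You can check the runs claim with a quick simulation (my own sketch, not anything from Hill’s paper):

```python
import random

def longest_run(flips: str) -> int:
    """Length of the longest run of identical consecutive flips."""
    best = run = 1
    for prev, cur in zip(flips, flips[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# Estimate how often 200 fair flips contain a run of six or more.
trials = 10_000
hits = sum(
    longest_run("".join(random.choice("HT") for _ in range(200))) >= 6
    for _ in range(trials)
)
print(f"P(run of 6+ in 200 flips) ≈ {hits / trials:.0%}")
```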

Benford’s law does have some limitations. For example, numbers on tax forms are often rounded. Also, it’s a very common, annoying, and effective marketing strategy to price things ending in “95” or “99.” Finally, salespeople have a tendency to shave expenses. As Dr. Mark J. Nigrini[5] explains:

“People who travel on business often have to submit receipts for any meal costing $25 or more, so they put in lots of claims for $24.90, just under the limit. That’s why we see so many 24’s.”[4]


Sources

5 thoughts on “Benford’s Law”

  1. Hmm… were you looking at the “File Size” or the “File Size on Disk”? If the latter, that might explain the prevalence of 8192-byte files – they’re all those small files that occupy less than one file allocation block (assuming your FABs are 8192 bytes).

  2. I came across this law in a class in grad school and immediately went home and tallied the files in my Unix account. Eerily, the distribution matched Benford’s law quite well. I’m still not sure I have an intuitive “yeah, that makes sense” grasp of it, but Laws are descriptive rather than explanatory. 🙂 My best grasp of it is that if you have numbers that really are evenly distributed, then there would be an order of magnitude more 1’s than everything else (e.g., since 1,2,3,4,5,6,7,8,9,10,11,12,13… are all equally likely, more than half of the numbers from 1-20 would start with 1), and so on. If you used leading 0’s to make all numbers the same “length”, then this effect should disappear (all prefixes would be equally likely).

  3. Steve – I was looking at the file size (versus the size on disk, which would allocate whole sectors). Examining the file list further, I discovered a huge cluster of 8,192-byte files were associated with Windows patches and updates and had names of the pattern reg[0-9]{5}. (Over 850 of these were in the “ie7” directory.) As far as I can tell, they’re cruft.

    Kiri – I’m still trying to wrap my head around it. (It was probably the most exhausting post because I kept getting sucked deeper into trying to understand it!) But yeah, I could absolutely picture you running this on Astra 🙂

  4. It looks that way. (The article in question mentions Eliot Spitzer being investigated for financial transactions. The initial concern was he was accepting bribes.)

    Quote:

    “We had no interest at all in the prostitution ring until the thing with Spitzer led us to learn about it,” said one Justice Department official.
