Category Archives: stats

“So You Think You Have a Power Law”

This article about the power law isn’t new, but it’s been making the rounds at Worio. I really wish I could write like this: it makes an interesting and important technical point, and it does so with wit and enthusiasm.

Whenever I try to start writing something interesting about my research, I invariably decide it seems too much like work and I’d rather write about my favourite movies. Then I usually decide I’d rather just watch a movie instead of writing about one I’ve already seen. Except sometimes watching a movie seems too challenging, so I watch sitcoms.

I guess that’s why I’m not a science writer. Or a film critic. I’m just a man rewatching last week’s 30 Rock on his laptop.

(Though speaking of Worio, we got Lifehacker-ed on the weekend. Guess all that time we spent optimizing the tropical fish recommender wasn’t wasted.)

unreasonable, effective

I don’t want to bore you with the technical, academic stuff I’ve been wrestling with lately, but there is one paper that is probably worth checking out even if you’re not a Machine Learning person. On the Unreasonable Effectiveness of Data is noteworthy because (i) it’s geared to a (science-literate but) general audience; (ii) it’s provocative; and (iii) one of the authors is Peter Norvig, Google Director of Research and one of the most prominent people in AI today.

The most interesting insight to me is that the authors come down against the kind of elegant, engineer-driven (parametric) models that are widely associated with AI, and embrace simple, data-driven (nonparametric) models. In machine translation, say, it’s the difference between designing a system that “understands” the grammar and semantics of the two languages and translates one into the other while trying to preserve them, and one that looks up words and phrases in an enormous table (which kind of reminds me of the Chinese Room thought experiment, though the point is somewhat different). It’s not exactly a new argument, but it’s great to see it so strongly and clearly expressed, and to hear how it arose from Google’s experience.

scatterplot of paranormal activity


The points are individual states. Unfortunately, I don’t know which are which, except that the two rightmost points above the trendline (which have lots of UFO and Bigfoot sightings) are Oregon and Washington.

on anonymous Russian hacker blogspam


Why sure, anonymous Russian commenter! Here’s my password right here!

Assuming the idea is to harvest valuable passwords from the hopelessly naive, and not something more subtle, it’s an interesting economics problem. Spamming (or in this case, phishing so crude that it looks like spam) works on the assumption that even if only one person in ten thousand ever buys your product, you can make a thousand sales if you send to ten million people. So why not ask a million bloggers for their passwords? Maybe a few will slip up and give you good information.

Whoever wrote the script to send this spam might also consider writing letters to every billionaire on Earth asking for a thousand dollars on the assumption one of them is sure to say yes. After all, there are plenty of billionaires, none of them will miss a thousand dollars, and maybe one will be in an indulgent mood, or senile.

The appeal of such an approach comes from an intuitive or actual appreciation of the fact that [tex]P(\mathtt{totalrejection}) = P(\mathtt{individualrejection})^N[/tex], where N is the number of requests. That is, if there’s a 90% probability of being rejected on one request, then if you ask twice, there’s only an 81% probability that both requests will be rejected. Ask ten times and there’s only a 35% chance all ten will reject. By the time you ask 50 times, there’s only a 0.5% chance you will get 50 rejections. If the cost of making a request is very low and the benefit of even a single acceptance is high, [tex]P(\mathtt{individualrejection})[/tex] can be very high indeed and the scheme still pays off. This is the probabilistic mechanism which makes spam profitable.
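For anyone who wants to check those numbers, here is a minimal Python sketch of the formula. The function name `p_all_rejected` is my own invention for illustration, and the whole thing assumes every request is rejected independently with the same probability, which is what the formula takes for granted.

```python
def p_all_rejected(p_individual_rejection: float, n_requests: int) -> float:
    """Probability that all n_requests are rejected.

    Assumes each request is rejected independently with the same
    probability (a simplification, not a claim about real spam).
    """
    return p_individual_rejection ** n_requests


if __name__ == "__main__":
    # The 90% individual rejection rate is the example used in the text.
    for n in (1, 2, 10, 50):
        p = p_all_rejected(0.9, n)
        print(f"{n:>2} requests: {p:.1%} chance that every one is rejected")
```

Swapping in a rejection rate much closer to 1 and a much larger number of requests gives a feel for how the same arithmetic plays out at spam scale.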

So can that work for spamming a million bloggers to ask for passwords? Or a thousand wealthy people asking for money? Maybe, but it’s different from simple spam advertising. If you’re selling, say, Viagra through an online pharmacy, there’s no cost to you until you also receive the benefit: the customer goes to your automated web site, pays via credit card, and only then do you step in to package and deliver the goods. Here, Anonymous Russian Hacker has to incur a cost without any guarantee of benefit. If I send him a password, he has to visit the site, log in, and look for something worthwhile. And since I know this, luring the hacker into a honeypot becomes much more attractive, which makes it more likely that he will have to incur that cost without any increase in the (marginal) probability of benefit.