For the past several weeks, I’ve been setting up machine learning algorithms that allow machines to learn things about web pages. This is all very well and good: the algorithms work. Now it’s training time. Training means that you have to not only give the computers the data (the web pages), but for each datum, tell them what you want them to learn: “this is a good page”, “this is a bad page”, “this is a page about Buffy the Vampire Slayer”, whatever. And machines are so damned literal. Give the machine a set of Buffy Wikipedia articles and science blogs, and it will most likely learn how to distinguish blogs from Wikipedia articles (because it’s simpler), and not physics from Buffy.
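To make the Buffy-versus-blogs trap concrete, here is a minimal sketch of a sanity check you could run before training. The data and the function name `label_purity_by_source` are made up for illustration; the idea is just to see whether the label you care about (topic) is perfectly predictable from something you don't care about (where the page came from).

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: (source, topic_label) pairs.
training_set = [
    ("wikipedia", "buffy"),
    ("wikipedia", "buffy"),
    ("blog", "physics"),
    ("blog", "physics"),
]

def label_purity_by_source(examples):
    """For each source, return the fraction of examples carrying its most common label."""
    by_source = defaultdict(Counter)
    for source, label in examples:
        by_source[source][label] += 1
    return {
        source: counts.most_common(1)[0][1] / sum(counts.values())
        for source, counts in by_source.items()
    }

purity = label_purity_by_source(training_set)

# If every source maps to exactly one label, a classifier can "cheat"
# by learning the source instead of the topic.
confounded = all(p == 1.0 for p in purity.values())
```

If `confounded` comes out true, the fix is usually to add examples that break the pattern, such as Buffy blogs and physics Wikipedia articles, so that the source no longer gives the answer away.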
My work with active learning has been helpful, since it lets the computer pick out the pages whose labels would be most useful to have. Ultimately, though, it comes down to looking at hundreds and hundreds of random web pages, telling a computer what to learn about each one, and not making too many mistakes.
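One common way to do the "find useful pages" part is uncertainty sampling: ask a human to label the pages the current classifier is least sure about. The sketch below assumes that strategy, which may or may not match my actual setup, and the page ids, scores, and the helper `most_uncertain` are all invented for illustration.

```python
def most_uncertain(scores, k):
    """Return the k page ids whose predicted probability is closest to 0.5."""
    return sorted(scores, key=lambda page: abs(scores[page] - 0.5))[:k]

# Hypothetical classifier outputs: P(page is "good") for unlabeled pages.
scores = {
    "page_a": 0.97,
    "page_b": 0.52,
    "page_c": 0.08,
    "page_d": 0.41,
}

# The classifier is already confident about page_a and page_c,
# so a human label on page_b or page_d teaches it more.
to_label = most_uncertain(scores, k=2)
```

This only decides which pages to look at next. Someone still has to look at each one and supply the label.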
This is my job right now, 8 or 9 or 10 hours a day staring at random web pages, and it’s making me a little batty. After the first few hours of boredom, patterns emerge, the programmer hyper-focus kicks in and it starts to become kind of fascinating. Predictably, there are tons of blogs bashing Bush and/or Microsoft (and virtually none supporting either, though Hillary and Apple both get to feel the hate every once in a while). The expected number of teenagers who think they are vampires. Unexpectedly many message boards about Marxism and people’s problems with their cars and/or significant others. Surprisingly little porn — it’s out there, but it all seems to be confined to its own little ghetto.
Yes, my job involves looking for porn on the web, and no, it’s not nearly as awesome as you think it is.