Month: July 2010

Minding data’s pedigree

Does it seem to you like data analysis is busting out all over the place? It might become another fun game like chess or Chutes and Ladders, so this might be a good time to recall an old admonition: Don’t just consume data, mind its pedigree.

Repeating the warning, though, makes you look like a party-pooper. In 2007 at the TDWI conference in Las Vegas, a keynote speaker raised it one morning. Jonathan Koomey — author of Turning Numbers into Knowledge and one of those voices the BI world needs more of — did his best. But I could see the unfolding disaster from my banquet table, as attendees glanced at each other in scorn. When the lights went up, not one person raised a hand with any question or comment.

Now Sex, Drugs, and Body Counts: The Politics of Numbers in Global Crime and Conflict, edited by Peter Andreas and Kelly M. Greenhill, tries it again.

You may wonder what sex, body counts, and politics have to do with data analysis, but try to keep an open mind here. The book promises to let us spit out the usual cud of business intelligence and data quality and get to the real spice: the politics of data. I can’t wait to read it. For now, see Jack Shafer’s review on Slate.

I won’t be surprised if the book points out how each organization’s core group subtly chooses the stories its data tells. I’ve just finished Art Kleiner’s Who Really Matters, which goes into detail on these groups’ formation and influence, including how they define who’s in, who’s out, and why. It’s the essence of politics.

Though core-group members may not ever lay their smooth palms on any data, data is nonetheless coiffed to suit these people. Through layers of managerial interpretation and re-interpretation, their influence cascades all the way down to tiny decisions about how data’s summarized, what’s measured, how it’s measured, and who measures it.

Like other forms of expression within an organization (speech, email, jargon, attire, hair style, suit or T-shirt), data is part of the politics. Though this has a big effect on decision making, I rarely find it on a BI-event agenda. BI’s scope needs to widen.

Look, Ma. No ETL

One of the first things you learn about in business intelligence is ETL. Raw data gets harvested, washed and served. But Sandy Steier hadn’t heard.

Sandy had been busy analyzing data. For years on Wall Street, he pored over mortgage-backed securities with a tool he and peers developed for themselves.

He only learned of ETL recently. He’d become acquainted with a data architect with whom he shared a bus ride every day to and from their offices in downtown Manhattan. “I had never really spoken to him before,” Sandy recalls. “He was in a different world even though we both dealt with data.”

Sandy described to him his rapidly maturing tool. As I imagine the scene, the calm data architect suddenly twisted himself on the cramped bus seat to face Sandy. “You don’t do ETL? You work with raw data??”

No, he didn’t do any ETL, Sandy explained. “We didn’t realize how important that was,” he recalled. “We had always just stuck the raw data into the database and then realized, ‘Hey, this data’s a mess.’” He instructed users to clean it themselves. “You get the data from the horse’s mouth. You’re the expert. We didn’t realize how powerful this was.”

In Sandy’s system, you don’t worry about database design. He and his partners not only didn’t worry about ETL, they wondered how data analysis could not be done their way — import first, clean later. “It makes good sense if you can get away with it.”

A crucial factor that lets the tool work as it does is speed. Parallel processing with a columnar database runs fast enough that the 1010Data engine can calculate and recalculate repeatedly, so the summaries that cubes harbor for anticipated queries are no longer necessary. In place of ETL, he uses what he now calls “ELTAR,” for extract, load, and transform as required.
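To make “transform as required” concrete, here is a minimal Python sketch of the idea as I understand it. It is not 1010Data’s code or design. The sample feed, the column names, and every function name are made up for illustration. The raw rows are loaded untouched, and cleaning happens only inside the query that needs it.

```python
import csv
import io

# Hypothetical raw feed: inconsistent casing, stray whitespace, one bad value.
RAW_CSV = """ticker,price
 ibm ,120.50
MSFT,25.10
msft, 26.00
GOOG,n/a
"""

def extract_and_load(text):
    """Extract and load: keep every row exactly as it arrived, no cleaning."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_ticker(row):
    """Transform applied only when a query needs tickers."""
    return row["ticker"].strip().upper()

def clean_price(row):
    """Transform applied only when a query needs prices; None if unparseable."""
    try:
        return float(row["price"])
    except ValueError:
        return None

def average_price_by_ticker(table):
    """A query that cleans just the columns it touches, at query time."""
    totals = {}
    for row in table:
        price = clean_price(row)
        if price is None:
            continue  # the analyst, not a load job, decides how to treat bad values
        ticker = clean_ticker(row)
        total, count = totals.get(ticker, (0.0, 0))
        totals[ticker] = (total + price, count + 1)
    return {t: total / count for t, (total, count) in totals.items()}

table = extract_and_load(RAW_CSV)          # raw rows, mess included
result = average_price_by_ticker(table)    # cleaned view, computed on demand
```

The stored table still holds the messy original, so a different analyst can apply a different cleaning rule to the same rows. The catch is that every query repeats the cleaning work, which is why speed matters so much to this approach.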

A hurdle, he says, is conventional beliefs held by his sales prospects. In one phone call recently, he explained to a prospect that ETL was unnecessary. The man replied, “That’s not credible.” In fine sales form, Sandy said, “Then you’ll be impressed when I prove it to you.” The prospect replied more firmly, “You don’t understand. That’s not credible.”

Actually, the technology’s credibility doesn’t matter much. The company, 1010Data, offers reporting and analytics in the cloud, invisible to customers except for the results. Sandy says, “We could have monkeys writing on scratchpads.” To prospects willing to try, he offers to prove it with their own data.

Their technology’s speed allows them to do the work of dozens with a team of a few people, he says, and to finish large data warehouse projects in weeks that would otherwise take months or years. If multiple customers use the same data, such as stock market data, the time required is even less.

All without ETL.