I've always claimed that I have more complexity issues, tying together large data sets, than actual Big Data problems. I run a smallish 8-node, ~50GB EMR cluster on Amazon, processing a few hours daily. My issues are generally around throughput: getting data in and out of the cluster in a reasonable time (say, under 2 hours).
Then I watched the BigQuery guys from Google do a full table scan on a 500TB table in a minute. Ouch.
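For a sense of what that demo looks like from the client side, here's a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical stand-ins, not whatever Google actually scanned on stage.

```python
# Minimal sketch: a full-table-scan aggregate in BigQuery from Python.
# Assumes the google-cloud-bigquery library is installed and credentials
# are configured; all names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

# No WHERE clause, so BigQuery scans the entire table.
# It bills by bytes scanned, so a 500TB table is not a cheap query.
sql = """
    SELECT COUNT(*) AS total_rows,
           COUNT(DISTINCT user_id) AS distinct_users
    FROM `my-project.my_dataset.events`
"""

job = client.query(sql)        # kicks off the query job
for row in job.result():       # blocks until the scan completes
    print(row["total_rows"], row["distinct_users"])
```

The point of the demo was that a query like this comes back in about a minute, with no cluster for you to size, provision, or tune.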
In session after session, numbers in the hundreds of terabytes or even petabytes were being geek-dropped, dwarfing anything I've seen in our local Chicago Hadoop/Big Data/Machine Learning groups. There were interesting discussions just about moving that volume of information around.
What I liked most was the interest in extracting value (information) from data, not necessarily in creating the biggest pile. I think the R folks (from Revolution Analytics) were spot on here. I need to spend some time there.
It was also nice to see the lack of BI tool vendors pitching how their wares could "integrate" with a Hadoop cluster in some lame-ass way (JDBC, anyone???). That sort of thing sometimes infects Big Data gatherings and is insufferable. So happy to avoid it.
A friend in sales recently asked ...