Wednesday, July 11, 2012

How big is your Hadoop cluster?

For an embarrassing amount of time I've been looking for a nice way to find out how much data is in my Hadoop cluster.  You can go table by table in the Hive shell, but with a dozen or so tables that gets taxing fast for what should be an easy question.  Pretty lame.
hive> SHOW TABLE EXTENDED IN default LIKE word_frequency;
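If you really wanted the per-table totals, a little shell loop would probably do it too.  This is just a sketch: it assumes the default warehouse location of /user/hive/warehouse and managed tables (and on newer Hadoop releases -dus is spelled -du -s):
$ for t in $(hive -S -e 'SHOW TABLES;'); do hadoop fs -dus /user/hive/warehouse/$t; done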
Another option is a reporting tool like Ganglia (my tool of choice).  It shows me daily max/min/average, but there's no report of what I'm actually using "at rest".  I eventually found a drop-down that gives actual memory usage (second image), but that number was split evenly across my nodes (7.6GB x 8 = ~30GB).  Way too regular to be real.



So it finally dawned on me to just ask HDFS, and after screwing around a bit it turns out that asking about "/" is both recursive and cumulative.  Nice!  The ~22GB matched the number I had roughed out by looking at the individual tables.  Now why didn't I think of that from the start?
$ hadoop fs -du /
Found 1 items
23670012101  hdfs://10.80.225.26:9000/mnt
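Depending on your Hadoop version there's also a summary form that skips the per-entry listing entirely; something like this should work (newer releases also take an -h flag for human-readable sizes):
$ hadoop fs -dus /         # older releases
$ hadoop fs -du -s -h /    # newer releases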
As always, the command line rules the roost.