Thursday, February 7, 2013

amassing data - beg, borrow, steal

In an ideal world, all the data you need would be readily accessible.

Unfortunately, we don't live in that world. Data is often hidden from view, buried, and hoarded.  Sometimes on purpose, and sometimes for perfectly valid reasons.

The role of a Big Data technologist is more American Picker than archeologist.  Be honest with yourself: you probably don't know exactly what you're looking for, nor do you have time to figure it out.  You'll know something cool when you see it.  Make some reasonable assumptions about what you'd like to have and get after it.

You can store 1 TB for a year for around $600 on Amazon's S3, or about 10% of that if you put it in their long-term storage (Glacier).  Treat storage costs as damn near free, and don't be afraid to start piling stuff up.
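To put numbers on that, here is a quick back-of-the-envelope sketch.  The rates are just the rough figures quoted above (not current AWS pricing), and the function name is made up for illustration.

```python
# Rough figures from this post, not a current price list.
S3_USD_PER_TB_YEAR = 600.0
GLACIER_FRACTION = 0.10  # long-term storage at roughly 10% of the S3 figure


def yearly_storage_cost(terabytes, long_term=False):
    """Estimated yearly cost in USD for keeping `terabytes` of data around."""
    rate = S3_USD_PER_TB_YEAR
    if long_term:
        rate *= GLACIER_FRACTION
    return terabytes * rate


if __name__ == "__main__":
    for tb in (1, 5, 20):
        print(tb, "TB:", yearly_storage_cost(tb), "USD/yr on S3,",
              yearly_storage_cost(tb, long_term=True), "USD/yr long term")
```

Even at 20 TB the long-term figure stays small next to the cost of the people waiting on the data, which is the whole point.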

The following is my guide to getting data en masse.


Beg (good)
  • Ask for it.  
  • Make this as painless as possible on all parties (go direct). 
  • Be thankful and courteous, give props where they are due.

Borrow (better)
  • Repurpose others' data.
  • Don't wait for any changes, additions, formatting, etc.  DIY.
  • Pulling data now is much better than waiting for someone to deliver it.  Get going.

Steal (best)
  • Find out where data lives and go fishing.  Poke around.  
  • Better to plead forgiveness than ask permission.
  • If you have access, you probably aren't breaking any rules.

Wash, rinse, repeat.

At any given time you'll probably be doing all of these simultaneously.  My only caveat is "don't store junk".  I remove non-ASCII characters and validate that files are complete, but I don't do much more than that.
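A minimal sketch of that cleanup, under a couple of assumptions: "complete" here means the file on disk matches a byte count reported by wherever it came from (a checksum would work just as well), and the helper names are hypothetical.  The cleaned text goes to a separate copy so the raw file is never touched.

```python
import os


def strip_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def is_complete(path, expected_bytes):
    """Assumed check: the file exists and has the size the source reported."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


def clean_copy(src_path, dst_path):
    """Write an ASCII-only copy of src_path and leave the original alone.

    Returns the number of bytes that were dropped.
    """
    with open(src_path, "rb") as src:
        raw = src.read()
    cleaned = bytes(b for b in raw if b < 128)  # keep ASCII bytes only
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return len(raw) - len(cleaned)
```

Run `is_complete` on the raw download first, then `clean_copy` it into your working area.  Anything fancier (parsing, deduping, reshaping) can wait until you know what the data is for.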

Reserve the right to rethink any decisions on what to keep by not deleting anything.

Once you get used to working like this, you will be amazed at how much you can acquire relatively painlessly.  Some things will take hours and some, unfortunately, years.  Some things you'll need ASAP, and some you'll have no idea what to do with (I started collecting some activity data 7 months ago that I never used until yesterday).  Who knows?

Take what you can get now, and iterate.  Good things will follow.