I recommend watching this talk over at InfoQ for a solid overview of the space: http://www.infoq.com/presentations/Introducing-Apache-Hadoop . I like the framing here; this is a data ‘operating system’. There needs to be a macro-level rationalisation of data in the world, and Hadoop is the right ‘base layer’ IMO (based on today’s technology offerings out there). Even though this technology is somewhat ‘old’ at this point, it is much newer than, say, a classical relational SQL database, and it is now at a reasonable point of maturity for general adoption.
Google’s Spanner is worth reading about for what counts as ‘newer’ tech, but it is not ready for mainstream consumption unless you want to build your own implementation.
- Set up Hadoop for yourself – Ubuntu “quick” guide here: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- Set up Hive here: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration
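Once the single-node cluster from that tutorial is running, a quick smoke test is to push a few files into HDFS and run the bundled wordcount job. This is a sketch: the `hduser` account and the `gutenberg` text files come from the tutorial above, and the examples jar name varies by Hadoop version.

```shell
# Assumes the single-node cluster from the Ubuntu tutorial is up,
# with the 'hduser' account and some sample text files to hand.
hadoop fs -mkdir /user/hduser/input                    # create an input dir in HDFS
hadoop fs -put ./gutenberg/*.txt /user/hduser/input    # copy local files into HDFS
hadoop jar hadoop-examples-*.jar wordcount \
    /user/hduser/input /user/hduser/output             # run the bundled MapReduce example
hadoop fs -cat /user/hduser/output/part-r-00000 | head # peek at the word counts
```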
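And to check the Hive install, you can create a table over delimited data and run a simple aggregate. The table name, schema and input file here are hypothetical, just to exercise the install.

```shell
# Hypothetical table and data file, purely to verify Hive works end to end.
hive -e "
CREATE TABLE IF NOT EXISTS page_views (ts STRING, url STRING, user_id INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH './page_views.tsv' INTO TABLE page_views;
SELECT url, COUNT(*) FROM page_views GROUP BY url;
"
```

The nice part is that the `SELECT` compiles down to MapReduce jobs over HDFS; you get SQL-ish querying without leaving the Hadoop stack.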
I think the whole ecosystem here, consisting of Hadoop and Hive, is great. Add to that list these useful related technologies that are available right now to get and use:
- Flume – get data in! (collects and moves log/event streams into HDFS)
- Sqoop – interop with traditional relational DBs (bulk import and export)
- Pig – a more ETL-like scripting tool; not sure yet if it is redundant with Hive/ other techs…
- OpenTSDB – a time series database built on HBase, useful for capturing data that is… well, a time series (think app metric streams)
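For the Sqoop case, a typical use is pulling a whole relational table into HDFS in parallel. This is a sketch only: the JDBC URL, database, user and table names are all hypothetical.

```shell
# Hypothetical MySQL source; Sqoop splits the table across parallel map tasks.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username reporter -P \
  --table orders \
  --target-dir /user/hduser/orders \
  --num-mappers 4
```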
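On the Pig question, the overlap with Hive is easiest to see with an example: a group-and-count of the kind you might otherwise write as SQL in Hive comes out as a short Pig Latin dataflow. Paths and field names here are illustrative.

```shell
# Write a small Pig Latin script and run it; the input path is illustrative.
cat > url_counts.pig <<'EOF'
views  = LOAD '/user/hduser/page_views' USING PigStorage('\t')
         AS (ts:chararray, url:chararray, user_id:int);
by_url = GROUP views BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(views) AS n;
STORE counts INTO '/user/hduser/url_counts';
EOF
pig url_counts.pig
```

Both compile down to MapReduce, so the practical difference is declarative SQL vs. a step-by-step dataflow script, which is why the redundancy question is a fair one.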
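And OpenTSDB is about as easy to feed as it gets: data points go in over a simple telnet-style `put` protocol, one line per point. The host, metric name and tag below are hypothetical; port 4242 is the OpenTSDB default.

```shell
# One data point: metric name, Unix timestamp, value, then tag=value pairs.
# Host, metric and tag are made up for illustration; 4242 is the default port.
echo "put app.requests.count 1356998400 42 host=web01" | nc tsdb.example.com 4242
```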
All of this makes the case for a large-scale data management environment – built on open source tools – that can handle massive amounts of data in many different forms.