The Strata Conference started yesterday (February 28) with a day of Tutorials, Jumpstarts, and Deep Data sessions. I attended two half-day tutorials: Hadoop Data Warehousing with Hive by Dean Wampler and Jason Rutherglen (both of Think Big Analytics), and The Two Most Important Algorithms in Predictive Modeling Today by Jeremy Howard (Kaggle) and Mike Bowles.
The morning session focused on Hive, the SQL-like language used to analyze and report on data stored in a Hadoop data warehouse. The session was very informative, although perhaps a little too ambitious. Since it was a tutorial, the speakers had prepared a virtual machine that audience members could install on their own machines to run the same statements as the speakers. Getting it set up proved difficult, and like much of the audience I found it easier and more valuable to simply follow along and watch the presentation. The session also had a few demo glitches, and there was too much material for the time allotted. Even so, I got a lot out of it. It made clear to me the business cases where Hive is appropriate, and it highlighted Hive's strengths and limitations.
I really liked the afternoon session. Jeremy Howard and Mike Bowles each presented their favorite predictive modeling algorithm. Jeremy talked about how he’s used Random Forests to solve numerous types of problems and did a demo showing how to predict who was likely to live or die on the Titanic. Mike discussed how to use the glmnet algorithm and showed how to apply it to sonar data to find oil under the ocean.
The next two days of the conference are filled with shorter, 40-minute sessions. I’m looking forward to getting a lot more info about analytics and big data.