My first Strata Conference came to an end yesterday evening but I’ve already decided it won’t be my last. I found the conference a good place to learn about a large variety of technologies and methodologies, and it was also an excellent networking opportunity. As a whole it was a great collection of people on the leading edge of data technology.
The conference kept me busy from early morning until the evening and I’m happy to say that I didn’t skip a single session. I enjoyed hearing Jeremy Howard from Kaggle talk about predictive modeling twice, as well as chatting with him afterwards in the bar. I also heard details of the next release of Hadoop from Arun Murthy of Hortonworks, saw a good session on Automated Understanding by Tim Estes of Digital Reasoning, and attended an interesting session called Exploring Social Data by Chris Moody of Gnip, a local company in Boulder, CO.
The best day for me was the first day, since that was the tutorial/deep-dive day and the one with the most code and demos. I wanted to see more live demos over the next two days, but most of the speakers didn’t show any, and that’s one thing I’d like to see Strata have more of. The regular sessions were 40 minutes long, which makes doing a demo difficult, but it’s often helpful to see things in action to go along with the speaker’s words and slide deck.
I also had the opportunity to have good conversations over lunch or coffee with people from Microsoft, Red Gate, the Census Bureau, Kaggle, Shell, and many more. What I found out was that there are so many organizations out there setting the bar higher and doing great things in the realm of data, and it’s going to change our world.
The Strata Conference started yesterday (February 28) with a day of Tutorials, Jumpstarts, and Deep Data sessions. I attended two half-day tutorials: Hadoop Data Warehousing with Hive by Dean Wampler and Jason Rutherglen (both of Think Big Analytics), and The Two Most Important Algorithms in Predictive Modeling Today by Jeremy Howard (Kaggle) and Mike Bowles.
The morning session focused on the use of Hive, the SQL-like language that can be used to perform analysis and create reports on data stored in a Hadoop data warehouse. The session was very informative, although perhaps a little too ambitious. Since it was a tutorial, the speakers prepared a virtual machine that the audience could install on their own machines in order to execute the same statements as the speakers. This ended up being a difficult process, and I found it easier and more valuable, as did much of the audience, to simply follow along and watch the presentation. The session also had a few demo glitches, and the quantity of material was too much for the time allotted. Even so, I got a lot out of it. It made clear to me the business cases where the use of Hive is appropriate, and it also highlighted for me the strengths and limitations of Hive.
I really liked the afternoon session. Jeremy Howard and Mike Bowles each presented their favorite predictive modeling algorithm. Jeremy talked about how he’s used Random Forests to solve numerous types of problems and did a demo showing how to predict who was likely to live or die on the Titanic. Mike discussed how to use the glmnet algorithm and showed how to apply it to sonar data to find oil under the ocean.
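To give a flavor of those two demos, here is a minimal sketch in Python using scikit-learn. It is not the speakers’ actual code or data: the tiny hand-made passenger table stands in for the real Titanic dataset, the synthetic features stand in for Mike’s sonar readings, and scikit-learn’s elastic-net-penalized logistic regression is used as a rough stand-in for the R glmnet package.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# --- Random Forest on a toy Titanic-like table (illustrative data only) ---
# Columns: passenger class (1-3), sex (0 = male, 1 = female), age
X = [
    [1, 1, 29], [1, 0, 45], [2, 1, 30], [2, 0, 35],
    [3, 1, 22], [3, 0, 28], [3, 0, 40], [1, 0, 60],
    [2, 0, 25], [3, 1, 19], [1, 1, 50], [3, 0, 33],
]
y = [1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0]  # 1 = survived (made-up labels)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
# Predict survival for a hypothetical first-class female passenger, age 30
print(forest.predict([[1, 1, 30]]))

# --- Elastic net (glmnet-style) on synthetic sonar-like data ---
# glmnet fits a linear model with a mix of L1 and L2 penalties;
# l1_ratio controls that mix, as alpha does in glmnet.
Xs, ys = make_classification(n_samples=200, n_features=60, random_state=0)
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, max_iter=5000)
enet.fit(Xs, ys)
print(enet.score(Xs, ys))  # training accuracy on the synthetic data
```

The appeal of both algorithms, as presented, is that they need relatively little tuning: the forest averages many decorrelated trees, and the elastic-net penalty handles the many-correlated-features situation that pure L1 or L2 regularization struggles with.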
The next two days of the conference are filled with shorter, 40-minute sessions. I’m looking forward to learning a lot more about analytics and big data.