My Big Data Failure

As I was thinking about posts to write, I became disappointed about a lost opportunity for a topic due to a personal failure. But then I realized that this failure is as much a part of my journey as my successes.

A few months ago, I received a notification that there would be an analytics case competition with Humana and Texas A&M University Mays Business School. These types of competitions are very common in the data analytics field. I had received notifications and invites to previous competitions, but none of them really interested me or I just didn’t have enough time. However, this competition really caught my attention. It was healthcare related and I had some time to dedicate to it (or so I thought). Each team needed at least two people, and all team members had to be enrolled in a master’s program. I enlisted the help of another clinical informatics fellow to form a team.

There were a few days between registering our team and receiving the data. I knew the file would be large, something like hundreds of columns and a million records. Although it was the first time I was dealing with so much data in one place, I was confident I would be able to handle it. As I patiently waited for the data to be released, I tried to formulate a plan for how I would start the project. I was trying to go through all the steps that I had learned in my master’s classes. There were so many decisions to be made, even ones as simple as which program I would use for the analysis and to build the predictive model.

Once we received the very large file, I ran into my first problem: loading it on my computer. This was not a surprise, since I was already running into issues with smaller projects for my classes. But it was a major limitation, because I could only work on the project on certain computers that were not always accessible.

Next was cleaning the data and exploring what we had, because trash in = trash out. My partner and I painstakingly combed through all the variables, throwing out those with too many missing values or too few unique values. My partner also went through and rated all the variables according to how important he thought they would be for the model. Eliminating variables that won’t contribute significantly to your predictive model helps it generalize to future data while also making it more efficient to build.
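For anyone curious what that kind of screening looks like in code, here is a minimal sketch in Python. To be clear, this is an illustration and not the exact process we followed: the function name, the thresholds (more than 50% missing, fewer than 2 unique values), the list of missing-value markers, and the toy column names are all my assumptions. It reads one row at a time, so the whole file never has to sit in memory at once.

```python
import csv
import io

# Assumed markers for a missing value; real data may use others.
MISSING_TOKENS = {"", "NA", "NaN", "null"}


def screen_columns(rows, max_missing_frac=0.5, min_unique=2):
    """Return the names of columns that pass both screening rules.

    rows: an iterable of dicts (for example csv.DictReader), consumed
    one row at a time.
    """
    missing = {}  # column -> number of missing values seen
    uniques = {}  # column -> distinct values seen (capped at min_unique)
    total = 0

    for row in rows:
        total += 1
        for col, value in row.items():
            if col is None:  # extra unnamed fields in a malformed row
                continue
            missing.setdefault(col, 0)
            uniques.setdefault(col, set())
            if value is None or value.strip() in MISSING_TOKENS:
                missing[col] += 1
            elif len(uniques[col]) < min_unique:
                # Stop collecting once the threshold is met to save memory.
                uniques[col].add(value.strip())

    keep = []
    for col in missing:
        too_sparse = missing[col] / total > max_missing_frac
        too_constant = len(uniques[col]) < min_unique
        if not (too_sparse or too_constant):
            keep.append(col)
    return keep


# Toy data with hypothetical column names, standing in for the real file.
sample = io.StringIO(
    "member_id,age,plan,notes\n"
    "1,34,A,\n"
    "2,,A,\n"
    "3,51,A,\n"
    "4,47,A,follow up\n"
)
kept = screen_columns(csv.DictReader(sample))
print(kept)  # ['member_id', 'age']
```

In this toy example, plan is dropped because it only ever takes one value, and notes is dropped because three of its four values are missing. The right thresholds depend on the data and the model, so treat these numbers as placeholders.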

After several hours/days/weeks of going through the data, we reached a point where we could begin to build our models. But even after cutting down the variables, it was still too much for my computer and the computation time was lengthy. Building and testing the models took much longer than I had anticipated.

By week three, we were definitely in a time crunch. I spent lots of time running analyses on multiple computers and leaving things to run overnight. I planned out the last steps and worked on writing a very long report of what we had found. However, I made a devastating error. I wrote the wrong deadline date in my calendar, and when I came in on Monday morning, thinking I had one extra day to complete everything, I realized that I had just missed the deadline.

Admittedly, I was very sad, upset, and frankly embarrassed that I missed the deadline. I had been so excited for this opportunity, not only because there was a chance at winning some serious cash but also for the experience of completing such a project. But I did learn a lot from the experience, and it was the final straw in the decision to buy a new, more powerful computer (which is amazing, by the way). I hope that I can get another chance to complete a project of that magnitude, and I’m sure I will succeed someday.
