Every day I hear another company tell me they are implementing a “Data Lake”. This term is the latest hot “big data” buzzword for dumping every imaginable piece of data you can find into Hadoop. It’s done under the pretense of making this data accessible and available to lots of business stakeholders. The reality for most companies is far different.
The major Hadoop vendors (Cloudera, Hortonworks, and MapR), loaded with cash and big enterprise sales teams, have browbeaten CIOs into believing that without a Data Lake initiative they weren’t part of the “cool kids”. What we are really seeing now is the beginning of the disillusionment phase for these projects. They are big, they are expensive, and they rarely work out as planned. Why? There are several reasons, but the main one is that all the focus is on getting data IN to the Data Lake, not getting it out, which is what users really need.
The Hadoop Data Lake has been positioned as the one-size-fits-all answer to a company’s data silo problems, and it typically takes 6-12 months to implement. In reality, Hadoop is often not the most efficient answer, depending on what users actually need, the amount and type of data, and several other factors. As other posts have clearly outlined, this myopic approach is vendor driven, not requirements or solution driven.
[Image caption: Engineering, speaking at the 2015 IPS conference in Montreal.]
I Have A Data Lake, Now What?
So you spent big money and successfully moved data from dozens, maybe hundreds, of silos into Hadoop. Now what? In most cases this data is highly heterogeneous in structure and type. To access it with traditional analytics or BI tools, all this messy data must be cleaned, fully homogenized, and fitted with a fixed schema. If the starting data is JSON, XML, or any other variable- or flexible-schema data, this is a long, complicated, and expensive task, and the results will most likely be less than you had hoped for. Modern data models like JSON take full advantage of flexible (schemaless) structure, which means applying a fixed schema to this data is a bad idea fraught with errors: it forces you to make arbitrary decisions about what is data and what is schema. Unlike traditional relational data, in JSON a value can be both data AND schema. This creates a host of issues that are well documented in this white paper.
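To make the data-versus-schema point concrete, here is a minimal Python sketch with two hypothetical documents describing the same fact. In one, the metric name lives in the schema (as a field name); in the other, it lives in the data (as a value). Any fixed schema has to pick one representation arbitrarily, and code that reads both must special-case every shape. The document shapes and the `temperature` helper are invented for illustration, not taken from any real system.

```python
import json

# Two hypothetical sensor readings from different sources recording the same fact.
# In doc_a, "temperature" is schema (a field name); in doc_b, it is data (a value).
doc_a = json.loads('{"device": "d1", "temperature": 21.5}')
doc_b = json.loads('{"device": "d2", "reading": {"metric": "temperature", "value": 21.5}}')

def temperature(doc):
    """Extract a temperature from either shape; every new shape needs more code."""
    if "temperature" in doc:
        return doc["temperature"]
    reading = doc.get("reading", {})
    if reading.get("metric") == "temperature":
        return reading["value"]
    return None

# Both documents carry the same information, but no single fixed schema
# captures them without rewriting one into the other.
assert temperature(doc_a) == temperature(doc_b) == 21.5
```

Multiply this by hundreds of sources and shapes, and the cost of forcing a fixed schema onto flexible-schema data becomes clear.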
So what are my options for getting value from my Data Lake? Basically you have two, and depending on how complex (messy) your data is, neither is great. First, use a data prep tool plus a traditional BI tool: extract the data you need, prepare it, clean it, and generally dumb it down so the analytics tool can understand it. The drawbacks of this approach range from slow performance on large data sets to being unable to answer entire classes of questions because of the data fidelity lost during the prep phase. The second option is one of the many Hadoop-specific analytics solutions on the market. Depending on which one you choose, it will suffer from the same issues as the first option, with the added disadvantage of being Hadoop-only, meaning it’s useless for your other data stores such as RDBMSs, NoSQL operational DBs, and APIs. Imagine if, during the original rise of BI, you had needed a different solution for every RDBMS. Nobody would have accepted that answer, yet datasource-specific tools are exactly what we have in today’s polyglot enterprise.
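The fidelity loss in the prep phase is easy to see in miniature. The sketch below, using made-up order records and a made-up column list, flattens heterogeneous JSON into a fixed tabular schema: every field outside the chosen columns is silently dropped, so questions about those fields can never be answered downstream.

```python
# Hypothetical "prep" step: flatten variable-schema JSON into a fixed schema.
FIXED_COLUMNS = ["id", "amount"]

orders = [
    {"id": 1, "amount": 9.99, "coupon": {"code": "SPRING", "pct": 10}},
    {"id": 2, "amount": 24.00, "items": [{"sku": "A-1"}, {"sku": "B-2"}]},
]

# Keep only the agreed columns; everything else is discarded.
rows = [{col: order.get(col) for col in FIXED_COLUMNS} for order in orders]

# The nested coupon and line-item detail are gone, so "which coupons drove
# revenue?" is now unanswerable from the prepared data.
assert rows == [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 24.00}]
```

Widening the schema to keep everything just recreates the original mess in table form, which is why prep is always a trade-off between fidelity and usability.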
Are There Better Options?
Rather than a one-size-fits-all Data Lake, how about a “Data Pond”, or to use the more common term, a Data Hub? For many of the use cases people apply a Data Lake to, there are probably better answers. Unless your datasets are huge (10TB or more), you can likely use one of the many NoSQL DBs that are faster to implement, less costly in most cases, and can store the same complex, messy data just as easily as Hadoop. These stores are often optimized for specific kinds of data, so in many cases you can tailor the Data Hub to the data. You can then add a powerful tool like SlamData that lets you quickly and easily explore, analyze, and report on the data natively: no prep, no lengthy setup. The “Data Pond” can store tons of messy non-relational data like JSON or XML. It’s fast, it scales nicely thanks to the SlamData “pushdown” architecture, and in many cases the entire solution can be implemented in a week or two, not 6-12 months. That includes full access to the data for business users (non-engineers). Imagine extracting a terabyte of data from various internal and external sources, all in different formats, ingesting it into a “Data Pond” overnight, and building an entire white-label reporting solution in one day! This is totally attainable using this approach of building an agile, modern Data Hub, not a one-size-fits-all Data Lake.
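The “no prep” idea is that queries tolerate mixed document shapes instead of requiring them to be normalized first. Below is a toy in-memory stand-in for that style of querying; it is not SlamData’s actual API or any document store’s query language, just a Python sketch over invented event documents showing an aggregation that works even when some documents lack the nested field entirely.

```python
# Hypothetical mixed-shape event documents, queried in place with no prep pass.
docs = [
    {"user": "ann", "event": "login"},
    {"user": "bob", "event": "purchase", "cart": {"total": 42.0}},
    {"user": "ann", "event": "purchase", "cart": {"total": 13.5}},
]

# Total purchase revenue, tolerating documents that have no cart at all.
revenue = sum(
    doc.get("cart", {}).get("total", 0.0)
    for doc in docs
    if doc.get("event") == "purchase"
)

assert revenue == 55.5
```

A document store with a native analytics layer does the same thing at scale, pushing the filter and aggregation down to where the data lives instead of extracting and reshaping it first.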
Before jumping into a Data Lake that may cost 6-7 figures and take months to fully implement, start asking some tough questions. Why? How much data do I really have? How will I access the data and get value from it easily? Is the solution agile to implement and change? Does it work across all my data stores and models, both now and in the future?
If the answers to most of these are not immediately clear, then I’d keep considering options, and make sure you fully understand your needs and requirements.
“Big Data” and terms like “Data Lake” and “Hadoop” continue to be very buzzworthy in today’s enterprise market. While the overall goal of reducing data silos within a company and increasing data access and insight is important, the one-size-fits-all approach being pushed by big software vendors is counterproductive. Avoid the buzzwords, avoid the vendor hype, focus on your needs, and determine the best way to get there, weighing both the usefulness of the solution and its cost. Then look before you leap into the Data Lake!