Are You Drowning In The Data Lake?

Jeff Carr

CEO & Co-Founder

The Hadoop Data Lake has been positioned as the one-size-fits-all answer to a company's data silo problems. In reality, Hadoop is often NOT the most efficient answer.

Every day I hear another company tell me they are implementing a “Data Lake”. This term is the latest hot “big data” buzzword for dumping every imaginable piece of data you can find into Hadoop. It’s done under the pretense of making this data accessible and available to lots of business stakeholders. The reality for most companies is far different.

The major Hadoop vendors (Cloudera, Hortonworks, and MapR), loaded with lots of cash and big enterprise sales teams, have browbeaten CIOs into believing that if they didn't have a Data Lake initiative they weren't part of the "cool kids". What we are really seeing now is the beginning of the disillusionment phase for these projects. They are big, they are expensive, and they rarely work out as planned. Why? Well, there are several reasons, but the main one is that all the focus is on getting the data IN the Data Lake, not getting it out, which is what users really need.

The Hadoop Data Lake has been positioned as the one-size-fits-all answer to a company's data silo problems, and it can typically take 6-12 months to implement. In reality, Hadoop is often not the most efficient answer to the problem; it depends on what users really need, the amount and type of data, and several other factors. As is clearly outlined in other posts, this myopic approach is vendor driven, not requirements or solution driven.


I Have A Data Lake, Now What?

So you spent big money and successfully moved data from dozens, maybe hundreds, of silos into Hadoop. Now what? In most cases this data is highly heterogeneous in structure and type. To access it using traditional analytics or BI tools, all this messy data will need to be cleaned, fully homogenized, and given a fixed schema. If the starting data is JSON, XML, or any other kind of variable- or flexible-schema data, this will be a long, complicated, and expensive task, and the results will most likely be less than you had hoped for. Modern data models like JSON take full advantage of flexible-schema (schemaless) modeling, which means applying a fixed schema to this data is a bad idea fraught with errors. It forces you to make arbitrary decisions regarding what is data and what is schema. Unlike traditional relational data, JSON can carry information as both data AND schema. This creates a host of issues that are well documented in this white paper.
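
To make the data-versus-schema problem concrete, consider a document where the keys themselves carry information. This is a minimal sketch with made-up field names, but the pattern shows why any fixed relational schema forces an arbitrary pivot:

```python
import json

# Hypothetical document: the date keys are data, yet a fixed relational
# schema would have to treat them as columns (schema). Every new date
# would then mean a new column.
doc = json.loads("""
{
  "customer": "acme",
  "logins": { "2017-01-03": 12, "2017-01-04": 7, "2017-01-09": 31 }
}
""")

# To fit a fixed schema you must pivot the keys back into values,
# a modeling decision the original JSON never forced on you.
rows = [
    {"customer": doc["customer"], "date": day, "logins": count}
    for day, count in doc["logins"].items()
]
print(rows)
```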

So what are your options for getting value from your Data Lake? Basically you have two, and depending on how complex (messy) your data is, neither is great. First, use a data prep tool plus a traditional BI tool: start by extracting the data you need, then prepare it, clean it, and generally dumb it down so the analytics tool can understand it. The drawbacks of this approach range from slow performance when the data sets are large to not being able to answer entire classes of questions, due to the loss of data fidelity during the prep phase. The second option is to use one of the many Hadoop-specific analytics solutions on the market. Depending on which one you choose, it will suffer from the same issues as the first option, but it also has the added disadvantage of being Hadoop-only, which means it's not useful for any of the other data stores you might have, like RDBMSs, NoSQL operational DBs, and APIs. Imagine if, during the original rise of BI, you had needed a different solution for every RDBMS. Nobody would have accepted that answer, yet datasource-specific tools are exactly what we have in today's polyglot enterprise.
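
As a sketch of why the prep phase loses fidelity, here is what a typical flattening step does to heterogeneous JSON. pandas stands in for whatever prep tool you use, and the documents are made up:

```python
import pandas as pd

# Two heterogeneous documents: one has nested orders, the other a flat field.
docs = [
    {"customer": "acme", "orders": [{"sku": "A1", "qty": 2}, {"sku": "B4", "qty": 1}]},
    {"customer": "globex", "plan": "trial"},
]

# A generic flatten leaves the nested array as an opaque blob most BI
# tools can't query...
print(pd.json_normalize(docs))

# ...while flattening on the array discards every document that doesn't
# match the chosen shape, along with fields like "plan".
with_orders = [d for d in docs if "orders" in d]
print(pd.json_normalize(with_orders, record_path="orders", meta=["customer"]))
```

Either way, questions the raw documents could have answered are no longer askable after prep.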

Are There Better Options?

Rather than a one-size-fits-all Data Lake, how about a "Data Pond", or the more common term, Data Hub? For many of the use cases where people are applying a Data Lake, there are probably better answers. Unless your datasets are huge (10TB or more), you can likely use one of many NoSQL DBs that are faster to implement, less costly in most cases, and can store all the same complex, messy data just as easily as Hadoop. These stores are often optimized for specific kinds of data, so in many cases you can tailor the Data Hub to the data. You can then add a powerful tool like SlamData that allows you to quickly and easily explore, analyze, and report on the data natively: no prep, no lengthy setup. The "Data Pond" can store tons of messy non-relational data like JSON or XML. It's fast, it scales nicely thanks to the SlamData "pushdown" architecture, and in many cases the entire solution can be implemented in a week or two, not 6-12 months. This includes full access to the data for business users (non-engineers). Imagine extracting a terabyte of data from various internal and external sources, all in different formats, ingesting it into a "Data Pond" overnight, and building an entire white-label reporting solution in one day! This is totally attainable using this approach to building an agile, modern Data Hub, not a one-size-fits-all Data Lake.
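
The "pushdown" idea is that queries run inside the database rather than after a bulk extract, so only results move across the wire. SlamData compiles its queries to native database operations; as a rough illustration of the same principle using plain pymongo (the connection string, collection, and field names are assumptions for the sketch):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
events = client["pond"]["events"]                  # hypothetical collection

# The aggregation executes inside MongoDB, so only the small grouped
# result set leaves the database: no extract, no prep, no stale copy.
pipeline = [
    {"$match": {"type": "login"}},
    {"$group": {"_id": "$customer", "logins": {"$sum": 1}}},
    {"$sort": {"logins": -1}},
]
for row in events.aggregate(pipeline):
    print(row["_id"], row["logins"])
```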


Conclusion

Before jumping into a Data Lake that may cost six to seven figures and take months to fully implement, you should start asking some tough questions. Why are we doing this? How much data do I really have? How will I access the data and get value from it easily? Is it agile to implement and change? Does the solution work across all my data stores and models, both now and in the future?

If the answer to most of these is not immediately clear, then I'd keep considering options and make sure you fully understand your needs and requirements.

"Big Data" and terms like "Data Lake" and "Hadoop" continue to be very buzzworthy in today's enterprise market. While the overall goal of reducing data silos within a company and increasing data access and insight is important, the one-size-fits-all approach being pushed by big software vendors is counterproductive. Avoid the buzzwords, avoid the vendor hype, focus on your needs, and determine the best way to get there, weighing both the usefulness of the solution and its cost. Then look before you leap into the Data Lake!

