The Definitive Guide to MongoDB Analytics
Analytics on MongoDB is a different beast than what you’re familiar with. Don’t expect to fire up your existing analytics tool, point it at MongoDB, and go. This article explains why a fundamentally different approach is necessary and what your options are.
A Quick History Lesson
The Same Mental Model of Data
Regardless of which vendor's solution you choose, the data model will be practically identical when the architecture is finished, because they all operate with the same mental model of data: multiple tables referencing each other.
When there are standards, an ecosystem can be built with confidence, and thousands of tools and solutions show up on the scene that help everyone – including the creators of those solutions (ah, capitalism!). For over twenty years this is what has kept the relational database ecosystem thriving: standards like ANSI SQL, ODBC, and JDBC.
It wasn't always like that. In fact, I recommend you read this article by our CEO, Jeff Carr; it provides insight and context you probably didn't have before. Vendors typically try to create proprietary technologies to lock customers in. If you're already using any of the popular NoSQL databases, you know what I'm talking about: MongoDB's query language, Cassandra's CQL, etc.
New Databases, Great Improvements, But
MongoDB and the new generation of databases bring with them great improvements:
- Improved developer productivity
- Massive scalability
- Unheard-of reliability
- Great performance by any standard
Not to mention they’re typically open source and, when compared to big-name vendors, can be had for pennies on the dollar for production support.
But with these advancements comes a new way of thinking about your data.
Data Models? What Data Models?
…No actual schema does not imply that the developer and the DBA don't need to coordinate on the best approach…
Why? Consider this: it takes much less time and effort to load a truck full of parts from a single store and deliver it to a customer than to have a truck (or multiple trucks) pick up parts from multiple stores before delivering to the same customer. The same goes for anything in our physical reality: every action costs real energy and real clock cycles. If everything is kept in a single location there is only one action to perform, versus one action for every relational table the object is spread across.
Many NoSQL databases, especially MongoDB, prefer to store data as complete objects in one location rather than discrete, normalized, small bits of information in a dozen or even a hundred different locations (tables). This is often the most difficult aspect to accept and understand when moving from relational databases to MongoDB.
Remember: not only is the data stored in a single location, it is stored in a completely different format.
Making the transition to MongoDB often finds developers and DBAs asking questions such as:
- How do I join this data?
- How do I sub-select?
- What key do I use?
- How do I create one-to-one, one-to-many, or many-to-many relationships here?
- How can one document (row) be completely different from the next?
- Why can’t I get a standard schema description?
Those questions aren't as important, and sometimes don't apply anymore, because remember: all the data for a single object can be stored together. Check out the example below.
Say there is a database with two tables that store information about books and authors. In a relational database it would be easy to set up:
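The original example isn't reproduced here, but as a hedged sketch (table and column names are hypothetical), the two-table layout might look like this, modeled as plain JavaScript arrays standing in for rows:

```javascript
// Hypothetical relational layout: two flat tables linked by a foreign key.
// Table and column names are illustrative only.
const authors = [
  { author_id: 1, first_name: "Isaac", last_name: "Asimov" }
];

const books = [
  { book_id: 10, author_id: 1, title: "Foundation", isbn: "0553293354" },
  { book_id: 11, author_id: 1, title: "I, Robot",   isbn: "0553294385" }
];

// Reassembling one author's books requires a JOIN (modeled here as a filter):
const asimovBooks = books.filter(b => b.author_id === 1);
```

The point is that the author's data lives in two places, and every read that wants the complete picture must pay for the join.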
In JSON (MongoDB’s document format of choice), the following data structure could be used:
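The document from the original article isn't reproduced here; a hypothetical nested version (field names are illustrative, not taken from the original) could look like:

```javascript
// The same information as a single nested document:
const author = {
  _id: 1,
  first_name: "Isaac",
  last_name: "Asimov",
  books: [
    { title: "Foundation", isbn: "0553293354" },
    { title: "I, Robot",   isbn: "0553294385" }
  ]
};

// Everything about the author travels together -- no JOIN needed:
const titles = author.books.map(b => b.title);
```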
This is an important point, and something MongoDB teaches its employees: the data model should be created based on the query patterns expected from applications, not on what the developer finds most convenient. This is the only way to ensure performance at scale.
As an example: assume a user searches for all books with an ISBN starting with “1234”. While the data model above lends itself well to finding an author and his or her books, and makes sense in a developer’s object-oriented brain, it won’t perform well when searching for books by ISBN. Why? Because MongoDB has to look through each author document, then inspect each element of the books array inside that document, to know whether it matches the user’s query.
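A rough in-memory simulation (plain JavaScript, not MongoDB itself; author names and ISBNs are made up) shows the cost: with no index, every document and every array entry must be examined.

```javascript
// Simulation of an unindexed ISBN-prefix search over the nested model.
const authors = [
  { name: "Author A", books: [{ isbn: "12345678" }, { isbn: "99887766" }] },
  { name: "Author B", books: [{ isbn: "55555555" }] },
  { name: "Author C", books: [{ isbn: "12340001" }] }
];

function findAuthorsByIsbnPrefix(docs, prefix) {
  const matches = [];
  for (const doc of docs) {           // full collection scan
    for (const book of doc.books) {   // full array scan per document
      if (book.isbn.startsWith(prefix)) {
        matches.push(doc.name);
        break;
      }
    }
  }
  return matches;
}

// findAuthorsByIsbnPrefix(authors, "1234") → ["Author A", "Author C"]
```

The work grows with both the number of documents and the size of each embedded array, which is exactly why the access pattern has to inform the model up front.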
An index could be added to the books array, which might help, but that adds another index MongoDB needs to write to when adding, updating or deleting data. Also be careful about indexing arrays in MongoDB: a multikey index stores one entry per array element, so large arrays can inflate the index and slow writes down the road.
Another option is to store book information in a separate collection and then fetch it with a second query or a $lookup MongoDB aggregation command. Unfortunately, that approach negates the performance benefit of avoiding JOINs in the first place. This is why the data model must be designed well in advance: MongoDB requires no actual schema, but that does not imply the developer and the DBA don't need to coordinate on the best approach to a long-term solution.
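As a sketch of what that alternative looks like (collection and field names are hypothetical), here is roughly what a $lookup stage does, simulated over in-memory arrays:

```javascript
// Two separate collections, referenced by author_id:
const authorsCol = [{ _id: 1, name: "Isaac Asimov" }];
const booksCol = [
  { _id: 10, author_id: 1, title: "Foundation" },
  { _id: 11, author_id: 1, title: "I, Robot" }
];

// Roughly what MongoDB's $lookup stage does. In the shell this would be:
//   db.authors.aggregate([{ $lookup: { from: "books",
//     localField: "_id", foreignField: "author_id", as: "books" } }])
function lookup(local, foreign, localField, foreignField, as) {
  return local.map(doc => ({
    ...doc,
    [as]: foreign.filter(f => f[foreignField] === doc[localField])
  }));
}

const joined = lookup(authorsCol, booksCol, "_id", "author_id", "books");
```

Note the shape: for every local document, the foreign collection is scanned for matches, which is exactly the join cost the embedded model was meant to avoid.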
The Big Question: What Do I Do For Analytics on NoSQL?
Everything Has Failed
How do you use existing solutions to analyze live, deeply nested, semi-structured, schema-free data in MongoDB?
You don’t. It’s not possible. At all.
What is needed is a way to analyze this new data format in a way that is both obvious to use but also natively understands the nested, schema-less data. Even when it changes. On the fly.
No tools that exist for relational databases can do this. Not one.
So then, what are the options?
Option #1: Custom Coding
Getting simple data out of MongoDB isn’t terribly hard. Usually.
Developers can write apps quickly and utilize the JSON model to rapidly prototype ideas. It’s easy to shove data into MongoDB. Getting simple data out isn’t terribly hard, usually. Getting meaningful data for making business decisions can be a different story. Many readers will relate to this – it’s a major reason this article was written.
You owe it to yourself to read this article by our CTO, John De Goes. It discusses the difficulties around creating dashboards geared toward customers and decision makers, all while avoiding the money pit that comes with custom-coded reporting.
Pros:
- Maximum flexibility

Cons:
- Everything is custom
- Significant MongoDB knowledge required
- Long-term support of the custom solution
- Significant investment of time
- One-off solution for MongoDB
Consider this approach when you:
- Have employees “on the bench”
- Already have deep knowledge of MongoDB aggregation and mapreduce functions
- Understand third party visualization tool integration
Option #2: ETL
Old Tools Don’t Adapt.
The Extract-Transform-Load approach has been used with relational databases for a very long time. I won't go into the details, since there are better (and longer) articles elsewhere and you've likely investigated this route already. I will, however, give an example of why this approach is very difficult to implement from a technical perspective.
Many people use MongoDB's ability to treat the schema itself as data. What does this mean? It means the field name (or column name, in relational terms) can itself carry part of the data.
Take the following JSON document as an example:
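The original example document isn't reproduced here; a hypothetical document of this kind (names and figures are invented) might use dates as field names, so the keys themselves are data:

```javascript
// Hypothetical document where the field names are data:
// each key of "dailySales" is a date, not a predeclared column.
const doc = {
  _id: "store-42",
  dailySales: {
    "2016-01-01": 1200,
    "2016-01-02": 980,
    "2016-01-03": 1430
  }
};

// Reading it back means iterating over the keys -- there is no fixed
// column list an ETL mapping could be written against ahead of time:
const days = Object.keys(doc.dailySales);
const total = days.reduce((sum, d) => sum + doc.dailySales[d], 0);
// total → 3610
```

Every new day adds a new "column", so a static relational mapping is stale the moment it is written.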
An ETL process takes rich, nested and self-described data and forces it into small, rigidly-typed containers so it can report in a confined, rigidly-typed way. Data fidelity will be lost during conversion with this kind of data.
When considering an ETL approach readers must consider these points:
- How will deeply nested arrays be mapped to two-dimensional tables?
- How will documents in the same collection, but with different schemas, be mapped?
- How will MongoDB schema changes be handled?
- Can the ETL solution handle the volume, variety and velocity of data?
- Will the solution scale to include new MongoDB applications and data?
- How fresh is the data from the ETL process?
Bring your attention back to the first bullet point. Someone will need to manually map the MongoDB data model to the relational ETL model. This will need to happen again whenever the schema changes, and with MongoDB, the schema can change frequently.
While scripts can be written, and some very basic tools exist that can handle the most rudimentary parts of this, the fact is that the vast majority of the data model can only be mapped by a human. Again, see the example document above.
Pros:
- Most two-dimensional reporting tools can read the transformed data model

Cons:
- Weeks or months of work to set up
- Cannot adjust to documents with different schemas
- Part-time or full-time DBA to maintain
- Significant investment in hardware, employee time and process
- Loss of data fidelity
- Analysis is not live
Consider this approach when you:
- Have already heavily invested in legacy reporting tools
- Are expected to use existing tools
- Can accept loss of data fidelity
Option #3: Native NoSQL Analytics with SlamData
A Modern Tool for Modern Data
Any solution designed for multidimensional, NoSQL analytics must be designed and built for it from the ground up. Once a solution for multidimensional data is developed, it's possible to go back and apply it to two-dimensional data. It doesn't work the other way around.
Unfortunately for existing BI vendors, the ability to natively work with MongoDB is not something that can be bolted on or included in a new version of an existing product. The best they can hope for is to mimic the ETL option. Make no mistake: reporting tools designed for relational databases will not analyze live, nested MongoDB data. Even MongoDB’s official BI Connector performs an ETL process and stores data in PostgreSQL for two-dimensional analytics.
SlamData comprises two primary pieces: the SlamData web application and the Quasar analytics engine. Both were designed from the beginning, in tandem, to understand and interact with NoSQL data sources like MongoDB.
The SlamData product is a single analytics solution for business analysts, data scientists, developers, data architects and DBAs working with MongoDB. It natively understands dynamic, nested data and provides an interface built for it.
All actions performed by SlamData occur on live data. Commands are sent to MongoDB in the most performant order based on the user's requested search. MongoDB performs 100% of the computation, and only the results are returned to SlamData. This is a key difference to understand: with ETL and existing BI tools, entire tables (typically many of them) are returned, and the tool must then perform analytics on the whole data set itself.
- Developers can dynamically pass values into workspaces to control content and flow.
- DBAs can easily view schema and data.
- Business analysts can use standard SQL queries against MongoDB nested data.
- Users can interact with forms that allow self-service.
- All workflows, or any part of a workflow, can be securely embedded into other applications.
Users can install SlamData and create dashboards on live MongoDB data in less than 60 minutes, regardless of schema.
SlamData connects to any MongoDB database, including remote instances and SSL-encrypted deployments.
Pros:
- Immediate ROI
- Create embeddable reports in minutes after install
- Natively view, analyze and display deeply nested, semi-structured data
- Use an enhanced SQL dialect that works on both relational and NoSQL data, instead of learning MongoDB's multiple proprietary approaches
- Graphically lay out interactive forms, reports and charts
- Provide Google-like search functionality to MongoDB for end users
- Export data in multiple formats for custom processing
- Restrict data visibility and actions based on user authorization model
- Enterprise-grade multi-tenant security

Cons:
- Learning a new BI tool and a new approach to MongoDB analytics
- Explaining the importance of this approach to non-technical management
- Not as flexible as custom coding
What The Market Wants, What The Market Needs
All data sources, one analytics solution.
Picture this for a moment: All data sources, one analytics solution.
NoSQL databases? Check.
XML, JSON and other nested flat file formats? Check.
Cross-datasource (federated) queries and joins? Check.
Query and display log data, relational data and NoSQL data at the same time? Check.
Pivot tables and multidimensional data structure viewers? Check.
One platform to query and analyze all data sources in your company? Check.
As I write this, we have a team of engineers building connectors for several other databases, to be included in SlamData version 3.1. With our QScript connector technology we can create a connector for any data source (database, file, API) in a matter of weeks. You can expect several new data sources to be supported in each major release of SlamData, all with the same functionality we currently provide for MongoDB.
With its ability to use standard SQL² across various data sources simultaneously, out-of-the-box visualizations, one-click embedding, customizable analytics workflows, enterprise-grade security, multi-tenant hosting capabilities, virtual views, interactive forms, 100% in-database query execution, and more, SlamData is the only sensible approach to database analytics today.