The Easiest Way To Do Cross-Collection JOINs With MongoDB
Damon, each week we kind of get together. We discuss a feature of SlamData that we’re seeing a lot of clients use, inquire about, or one that we feel like is worth exploring because it’s that unique. This is kind of all of those things, right? Because JOINs is changing over time, and we did it … We supported JOINs before MongoDB offered it, so can you talk a little bit about the history of SlamData’s JOIN functionality?
Yes, you’re right. MongoDB officially brought support for the $lookup command, which is part of their aggregation pipeline, in version 3.2, and along with a number other new features as well, which it was highly requested. It’s also the feature that our CTO, John De Goes, kind of actually wrote a blog post about and basically said, “MongoDB, you should really release the $lookup into the open source version as well, because you’re essentially limiting core functionality of the product by trying to restrict it to just commercial versions.” What they did was they reversed their decision and went back and released their $lookup aggregation command into the open source version, which is great, but that’s in 3.2. We’ve been able to do JOINs, like you said, since version 2.6 of MongoDB, so it’s been a couple of years. The reason we’re able to do that is because we take the SQL Squared that you write, including a JOIN, and we compile it down into MongoDB’s API.
Oftentimes, if you’re doing a complex query or doing some kind of aggregate commands of counts, JOINs, all sorts of different combinations, then what we’re going to do is change that query to use the aggregation pipeline or mapReduce, or the Find API as well. What we did is essentially perform a mapReduce in versions 3 dot … I guess you could say 2.8 and 2.6. we provide JOIN functionality through the mapReduce function of MongoDB, and now if a customer is using MongoDB 3.2, along with SlamData, it’s smart enough to say, “You’re using MongoDB 3.2, and it makes sense. I’m going to use the $lookup aggregation command now, because that actually provides some performance bonus improvements as well.” It’s been a bit of a history there, but yeah, we definitely supported JOINs before $lookup came out.
Okay. Can you do a little compare and contrast in maybe how they’re executed, what the benefit of doing them in SlamData is right now?
Right. What we do right now is, depending on the version of SlamData you’re talking … Excuse me, depending on the version of MongoDB that SlamData’s talking to, it will decide whether it needs to do a mapReduce or use the $lookup. What I’m showing here on the screen right now is a collection called patients, and it has a bunch of data. It has a bunch of nested data, as well, as you can see here, and you can see that there’s a codes array inside of each document, inside of this collection, so each patient has a first name, has a last name. Also has diagnosis codes, and that array has both a code and a description. As an example, you can see this person here has at least 4 or 5 different diagnoses, right? What I’m doing is just displaying this so that you can kind of see the structure of the data, and we’re going to actually do a JOIN from this nested data against 1 or 2 other tables. Now, let me show you what the actual information looks like.
Here, you can see in JSON format what one of the tables looks like, or one of the collections. This collection is called bad_codes, and I’ll show you what we’re going to use that for momentarily. This collection is called patient_zips, and so it has 4 different zip codes in there. What the patients record looks like, document looks like, is something like this. If I scroll up, you can see that the codes array has, just like I said, a code and a description in each of the documents. We have codes, which is a nested document, and then we have bad_codes and the patient_zips. What we’re going to do is, we’re going to take the data from this patients collection. We’re going to basically JOIN it against this zip code collection to say, “Only patients in these zip codes display it.” That’s what I’ve done here in this window. Again, this is all the data. This is the other table, zip_code, and rejoining these 2, and I’ll show you how we do that. This is the actual query we’re using.
From the patients table, or patients collection, we’re taking the zip code, we’re drilling down into each code array, and we’re getting the code and the description, and we’re getting that from the patient’s collection. Then we’re joining that data against the patient_zips collection, which is really a subset of patient zip codes. You could think of this as maybe an important collection for some reason. I’m just using it just so you can map these for zip codes against this one. We do the JOIN, and then we give it the condition, which is an equality. That’s something to note, is that we do joined on a number of different things. We can actually do a JOIN differently, where we could say less than or equal to, or greater than or equal to, but to use $lookup, that’s one of the requirements, actually. It’s also a MongoDB requirement. It’s called an equi-join, meaning that when you do this equality match, or when you do this match, it has to be an equal, or is equal to.
It can’t be anything other than that, so you couldn’t say, “Where the patient’s zip code is less than 78666,” or something like that. It has to be an actual equality thing. That’s what we’re doing, and then we’re just grouping it and showing the results, so we’ve grouped 2 tables here. We’ve taken the codes and the descriptions and the zip codes, and printed them out with a JOIN. Now, you’ll see over here on the zip code side, that is only showing zip codes that are actually showing in the zip code window. Right? We’ve done our JOIN. Now, we take this data, so now I want to filter it even further. We also have this collection called code, or bad code, I should say, and these are all the codes that I’m going to say are bad, because I want to know who inside of my patients database, or patients collection, lives in these zip codes and has these diagnoses codes.
That would be fine. I could do that manually without a JOIN, but this is to show you that the zip code data and the bad code data are actually in 2 separate tables from patients, so we’re actually doing a JOIN across 3 different collections here. That’s what this result is, and let me show you the query I use to get that. You can see all the data that we’re grabbing is from patients collection, prefixed with p, and doing that from the patients collection, and then I’m joining all that data against the bad codes. As long as the patient has a code inside of their array that matches something in the bad_codes listing, it’s going to grab it. Then it JOINs all of those against people that, or all those against the zip codes that are residing inside of those patient_zips collection. You can see we’re doing one JOIN here, and then we’re doing another JOIN here, so there’s no limit to the number of collections you can actually run a JOIN against.
There’s only a couple requirements, or a couple restrictions, and that is the one that I talked about earlier. It has to be an equi-join, and also, you have to do this against an indexed field on at least 1 side. If there’s nothing indexed, then it won’t use $lookup, because $lookup definitely relies on indexed fields. At least for now, what we do is, it has to be in the same database. These collections can be completely different collections. You could have 100 collections to JOIN against… As long as they’re in the same database, you’ll be fine. In future versions, that will change, will allow JOINs across different databases and different data sources, but in terms of SlamData 3.0.11 I think we’re on right now …
SlamData Community Edition
…you have the restriction of at least the collections have to be in the same database. When you execute that, we’ll slide over here and look at the results. You can see that the I10 code, listed here, is the S40.929S, and this person is actually listed as well, and you can see that this code matches a zip code from up here. Things need 3 different tables, or 3 different collections were used, to create this result set, and that is JOIN.
Okay, and is this occurring in the real world at an analyst’s desk, or is an operational, a developer needs some operational analytics doing JOINs like this? What, and I’ll probably edit this part out, but where are you seeing, in the client base that we have now, who’s doing a lot of cross-collection JOINs?
That’s a good question, because it’s both developers and business analysts. Business analysts, I’ll describe why, because usually they’re familiar with SQL, and they understand the concept of JOINs. They’re usually the ones that deal with these reporting tools and generate these insights from business data buried inside of these different siloed collections, or even a different database, from different vendors. Definitely business analysts use JOIN frequently, and developers also use it, but usually they’ll use it because they’ve actually created their MongoDB application with a framework of some sort. That framework oftentimes will take what used to be considered kind of an object relational map, which takes an object-oriented object from memory, and basically splits it up into many, many different tables and documents, so that when you save the data, it saves it, let’s say, to 2 or 10 or 50 different collections. Then when you read it, it pulls it all back again. The framework, often, will do the JOIN in the background for that, but they don’t have any real control over that.
If they’re using a framework, they can’t say that, “Okay, this patient has a bunch of codes, and the codes are going to be in a separate collection.” If a developer wrote it by hand, he or she would probably want to keep those codes with the patient data, but maybe they’re not given that option when they create this application. When they go to write these queries, they need something that will do the JOIN, if they’re not writing these applications, like one off. If somebody wanted a new report, that developer would have to go back in, and she might write a brand new report, a brand new query, but it’s limited, of course, to the framework, but with us, we actually can see all the data, and we can choose what we want to pull back, whether it’s a JOIN or not a JOIN. Both, really, are using this.
Okay. You could actually set this up with Markdown to allow people to kind of programmatically do the choosing and parts of the query, allow users to select aspects of each of those queries that you showed to make it kind of interactive and live as a little application somewhere in the company’s intranet or something, or even not a consumer side. It’s, you wanted to expose it to users.
Exactly right. That’s one of the, probably one of the largest things we bring to the table, is the fact that this can be out of the box, immediately usable, but it’s also flexible enough and versatile enough where a developer can create, or even a business analyst, could create a workspace like this, and create drop downs. Let’s say, in this window, you could have checkboxes, radio boxes drop downs, even text fields that you can type stuff in, and that controls the flow of the workflow, so if they typed, or if they selected, let’s say, 15376 from a dropdown, then it would limit the information you see. It might also limit the codes that are associated with that zip code. There’s a number of ways to do it. We have something called the Setup Variables Card, which allows you to actually pass in values into this workflow, so that every time somebody brings it up, it might look different, because the values being passed in are different. Like you just said, or instead of passing variables in, they could use drop downs or selectors.
We also have a Search Card, which means you could have just a plain Google-like search text field and type something in there, and that might change the flow of this whole thing. There’s a number of ways you can input data which would control the workflow for this and make it look completely different based on who’s using it and how.
News, Analysis and Blogs
The Hadoop Data Lake has been positioned as the one size fits all answer to a company’s data silo problems. In reality, Hadoop is often NOT the most efficient answer to the problem.read more
SlamData Inc., the leading open source analytics company for modern unstructured data today announced that it has raised a $6.7M Series A funding round, led by Shasta Ventures.read more
SlamData just released its first update of 2017, SlamData 4.1.1. It delivers a number of new UI enhancements, performance improvements, new charts, as well as commercial releases for the Couchbase, MarkLogic and Spark/Hadoop connectors.read more
Damon provides a quick “getting started” video for SlamData.read more
We’re excited to announce that we’ve been included in the 2017 list!read more
The following is an interview I conducted with Jeff Carr, CEO and Founder of SlamData regarding the trends in enterprise business intelligence.read more