The New JSON Workflow For Data Scientists and JSON Wranglers
Data Scientist this, Data Scientist that. It may be the most over-used, hype-infused title of 2016, but its rise to prominence is both important and logical. SlamData’s business IS business intelligence. Our customers ARE data scientists. Most importantly, our business is about dramatically changing the way business intelligence is done and how (and how fast) business insights are discovered. Describe the job or mandate that data scientists face on a day-to-day basis.
Data scientists are basically the catch-all analytics people you call in when the problem definition is unclear or the solution falls so far outside the boundaries of existing tools that no one but a data scientist can take it on. If you need someone to give you an Excel file with a pie chart, you probably don’t call the data scientist. If you’re rolling up sales and marketing numbers to stuff into a presentation, a data scientist is not the person you call.
On the other hand, if you have, let’s say, seven terabytes of data piling up in Hadoop–let’s say clickstream data–you’ve got a lot of visitor information, and some of these are identified users and some are not. You really have no idea what this data is saying about your business, and you want somebody to poke around and play with that data and see if they can find anything interesting or useful–useful patterns of visitor behavior around purchases or something like that. That’s when a data scientist is the person you call. For example, you want to take clickstream data and join it to Salesforce data and lots of other stuff, and you want to build a predictive model that tells you the chances of a lead converting somewhere down the pipeline based on what you know about them. That’s a job you hand to a data scientist.
As a result, because data scientists get all the slop that no one else can easily handle with a high-level, special-purpose tool, a data scientist has to be a jack of all trades. They need to know a little SQL, a little Python, a little R, maybe a little Scala for doing Spark stuff. They need to know what CSV files are, and JSON, and XML, and they need to know a little bit of machine learning, a little bit of historical batch analytics, and so forth.
They also need a certain ability to work on problems with a high degree of uncertainty about the best way to solve them, or even “what am I doing here?” When the problem is not well specified, you need someone who is comfortable in those conditions and can just dig into the data and figure out a way to use it, even if that isn’t clear from the outset. They also need the analytical mindset that allows them to pull out answers and tell stories that are helpful for the business, based on the data they have access to.
Ok, from a hands-on perspective, what is the workflow for the data scientist?
I would say what that workflow looks like for a typical data scientist is they’ve got data in a variety of formats… It’s not about any given type of data. It’s about the end product. What does that data tell us? Can we use the output of an analysis to build some model? Can we use that to learn something that has the potential to change the business? That’s what data science is about. So all the tools for data scientists are means to an end.
Their workflow consists primarily of exploring data. There’s an exploratory, ad hoc component: “Let me just try to understand what’s inside here so that I can begin to formulate questions.” There’s also cataloging and related work, because most companies don’t have good data management practices. You’ll have lots of data all around the company that hasn’t been mapped out yet, and you’ll have no idea how one part of it relates to another or what its quality is.
A data scientist wrestles with some of those issues trying to figure that stuff out. Then they spend a lot of their time reshaping data, tying one source of data to another. For example, in clickstream data, some users will be unidentified, but you’ll have another data source over there that has IP information and usernames, and so you’ll be able to relate them with at least some probabilistic confidence. You’ll have to go through the process of figuring out how to connect those two datasets using a method that might be a little fuzzy and imprecise.
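A minimal sketch of the kind of fuzzy join described above, using hypothetical sample data: anonymous clickstream events are matched to known users through a shared IP field, with a naive confidence score of one over the number of users seen behind that IP. The field names and the scoring rule are illustrative assumptions, not SlamData’s method.

```python
from collections import defaultdict

# Hypothetical data: anonymous clickstream events, plus a separate
# identity source that maps IP addresses to known usernames.
clicks = [
    {"ip": "10.0.0.1", "page": "/pricing"},
    {"ip": "10.0.0.2", "page": "/docs"},
    {"ip": "10.0.0.1", "page": "/signup"},
]
identities = [
    {"ip": "10.0.0.1", "user": "alice"},
    {"ip": "10.0.0.1", "user": "bob"},    # shared IP -> ambiguous match
    {"ip": "10.0.0.2", "user": "carol"},
]

def probabilistic_join(clicks, identities):
    """Attach candidate users to each click, scoring each match
    naively as 1 / (number of users seen behind that IP)."""
    users_by_ip = defaultdict(list)
    for row in identities:
        users_by_ip[row["ip"]].append(row["user"])
    joined = []
    for click in clicks:
        candidates = users_by_ip.get(click["ip"], [])
        confidence = 1 / len(candidates) if candidates else 0.0
        joined.append({**click, "candidates": candidates,
                       "confidence": confidence})
    return joined

result = probabilistic_join(clicks, identities)
```

A real pipeline would use stronger signals (session cookies, timing, geolocation), but the shape is the same: build a lookup from the identity source, then annotate each event with candidates and a confidence rating.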
It’s not just joining but reshaping, resizing, and aggregating, because many of the things you do in data science don’t operate on the raw datasets; instead you do something called feature extraction. You aggregate information. You summarize it. You call out interesting features of the original dataset and make them more exaggerated, more manifest, so you can run them through model builders and build some sort of predictive model that you can then use somewhere in the business. There’s a huge amount of exploratory analysis, ad hoc analytics, and data preparation and wrangling in a data scientist’s workflow.
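Feature extraction as described here can be sketched as a simple aggregation: collapsing raw per-event records into one summary row per user that a model builder can consume. The event schema and the chosen features below are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical raw events; in practice these would come from the
# clickstream or transaction store.
events = [
    {"user": "alice", "action": "view",     "amount": 0},
    {"user": "alice", "action": "purchase", "amount": 40},
    {"user": "alice", "action": "purchase", "amount": 60},
    {"user": "bob",   "action": "view",     "amount": 0},
]

def extract_features(events):
    """Collapse raw events into one feature vector per user:
    total event count, purchase count, and total spend."""
    feats = defaultdict(lambda: {"events": 0, "purchases": 0, "spend": 0})
    for e in events:
        f = feats[e["user"]]
        f["events"] += 1
        if e["action"] == "purchase":
            f["purchases"] += 1
            f["spend"] += e["amount"]
    return dict(feats)

features = extract_features(events)
```

The point is the shape of the transformation: many narrow raw rows in, few wide summarized rows out, with the interesting behavior (here, purchasing) made explicit as its own columns.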
Once they’ve figured out what data is interesting for the problem, understood its structure, reshaped and resized it, and extracted features, a relatively small amount of time after that is spent further aggregating and visualizing it. That typically means ad hoc visualizations rather than reports: telling the story around that data, understanding what it says about the business, and then feeding it into model builders or tools that can build classifiers from training datasets, so that knowledge can be applied to some operational use case the business has.
That’s the big data slog. It takes a while to uncover gems, gems you don’t even know are there. How does SlamData impact this process? How does it materially change it?
First, it’s important to understand that SlamData will not be the only tool a data scientist uses. We’re going to be another tool that a data scientist uses to accomplish their mission, because they don’t care about tools, they only care about results.
You can create lots of ad hoc data visualizations really easily. You can run SQL queries; in many cases, data scientists prefer SQL to writing code. You can also scale those SQL queries up: they can work on really big datasets, which is not true of the Python or R code you might write. Python and R use lots of memory, they’re interpreted languages that run much more slowly, and they aren’t running inside a super fast, high-performance distributed database, so they’re limited in the size of data they can effectively handle.
So if you’re using SlamData, you get all these very compelling benefits of using a database, like using SQL and executing that SQL at a really high performance and being able to visually explore your data and easily do visualization on top of that data. You get all those benefits, but you get none of the drawbacks.
For a data scientist, the drawback of using a database is that it’s something I have to set up and maintain. If I just want a database for my data science work so I can load a bunch of data into it, then I have to maintain that database. Unless IT has a database there for me that they’re willing to keep running at all times, I’ve got to set it up, maintain it, optimize it, create indexes, and so forth.
Worse than all of that, no matter who maintains the database, whether it’s IT or me on my laptop, I fundamentally can’t just plop a bunch of data into it. That doesn’t even make sense, because I’m going to have to flatten that data, normalize it, go in and create a bunch of tables with all the right shapes, and only then load the data in. All of that adds tremendous overhead to the process. As a result, SQL-based technologies are underutilized in the data science profession. They’re utilized more in Hadoop, where you can slap Hive on top of a JSON file. Why? Because it’s far less overhead. I didn’t have to create anything beyond a table definition in Hive, and then suddenly I could begin running my queries. That’s relatively little overhead. Whereas in a SQL-based environment with a relational database, there’s so much overhead that data scientists just don’t use relational databases, SQL, and the visual analytics tools on top of them.
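The relational overhead being described can be made concrete with a tiny sketch using Python’s built-in SQLite: before any SQL can run over nested JSON records, someone has to define a flat schema, flatten each document into it, and load the rows. The records and column names below are hypothetical.

```python
import json
import sqlite3

# Hypothetical nested JSON records, as they might land in a data lake.
raw = [
    '{"user": {"id": 1, "name": "alice"}, "visits": 3}',
    '{"user": {"id": 2, "name": "bob"}, "visits": 7}',
]

conn = sqlite3.connect(":memory:")

# Step 1 (overhead): define a flat relational schema up front.
conn.execute("CREATE TABLE visits (user_id INTEGER, name TEXT, visits INTEGER)")

# Step 2 (overhead): flatten each nested document into that schema.
rows = [(d["user"]["id"], d["user"]["name"], d["visits"])
        for d in map(json.loads, raw)]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)

# Only now can SQL run over the data.
total = conn.execute("SELECT SUM(visits) FROM visits").fetchone()[0]
```

Every new field or change of shape in the source JSON means revisiting steps 1 and 2; the Hive-on-a-JSON-file approach, and SlamData’s, skip that flattening entirely.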
They don’t use them nearly as much as they could because there are massive, massive drawbacks to doing so. Some of those drawbacks are ameliorated by leveraging the Hadoop environment and leveraging Hive, and being flexible about the data formats that you use. Again, Hadoop is a low-level piece of the infrastructure. You don’t have any of the perks of SlamData in that environment unless you’re licensing tools.
SlamData gives data scientists an environment where they can move data in and out really easily, refine it, aggregate it, and take advantage of SQL, pushing the computation into the database where it’s going to run really fast. They can come to understand that data and get it down to more manageable portions, then export it from SlamData to a CSV or JSON file or whatever it is and use it further downstream.
Whether that’s as part of some presentation, or to feed into RapidMiner or Weka, or some classifier in R or MLab, or whatever you have to do with that data downstream, SlamData makes it really easy to get it out and continue the process of delivering value to the business based on the insights you learn from that data.
So this allows data scientists to have more time — or they can shift their time from the slog to the analysis — and ultimately the recommendation part, the business impact part.
Where it simplifies things tremendously is you don’t have to write code anymore to do a lot of the exploration stuff. With SlamData you have this visual environment for exploring any type of data — and you can add more data and different data to it. It has all the advantages of using Hadoop.
You can plop any type of data in there, but without the coding, and with a really nice front-end on top that allows you to visually explore and do ad hoc analytics and data viz on top of that data. It’s hard to estimate, but using SlamData significantly reduces the exploration phase of the workflow. Exploration is faster — and it’s more agile.
This was an interview I conducted with Jeff Carr, CEO and Founder of SlamData, regarding the trends in enterprise business intelligence.