The Librarian, The Scientist, The Alchemist & The Engineer: Anatomy Of A DataOps Expert

Your DataOps team is the one making sure your data scientists do not spend 50-80% of their time preparing data for analysis. Here is what this involves.

The Librarian

First and foremost, a DataOps expert must help you find the data you need. Just like real-world libraries, if you already know exactly what you’re looking for, you can just get it yourself, but if you don’t, you need a librarian.

You should be able to ask generic questions like “find me data that is relevant to building a healthcare anti-fraud app”, and have the librarian come back with datasets about licensed physicians, suspended licenses, pharma payments to physicians, census data matched to physicians’ addresses, grouping of physicians to peer groups by sub-specialties and more.

If you’re asking for data on drug prices, a librarian should be the expert guiding you through the nuances of prices: distributor vs. consumer pricing, Medicare limits vs. street pricing, insured vs. uninsured pricing, and billed vs. actual paid amounts. They may also offer other relevant datasets, like mappings of brand to generic medications, clinically equivalent therapies and lists of drug name synonyms.

A great librarian understands your intent and shows you content you wouldn’t have thought of yourself.

The Scientist

Data science fundamentals are another key part of the DataOps job description.

A couple of years ago, we came across a project in which an analytics team wanted to generate a population health cost index, and to do so decided to extrapolate metrics from a large set of Medicare clinical claims. It took someone deeply familiar with the data to point out that Medicare is largely used by senior citizens, meaning that the distribution of diseases, chronic conditions and procedures they bill for is heavily skewed towards that age group – and not representative of the overall population.

Similarly, it takes deep familiarity with the specific dataset, the problem being solved and machine learning fundamentals to know that a given dataset is a poor fit for a given supervised classification problem because the distribution of classes in the training dataset differs from what will be observed in production.

The way to address such “gotchas” is to have either data scientists with strong domain-specific knowledge or DataOps experts with a strong data science background. Or both.
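As a rough illustration of the class-distribution gotcha, here is a minimal sketch that compares the label mix of a training set against a sample of production labels. The label names, the counts and the choice of total variation distance as the summary number are all assumptions made for the example; they do not come from a real project.

```python
from collections import Counter


def class_proportions(labels):
    """Return a dict mapping each class label to its share of the sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two class distributions."""
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Hypothetical labels: the training set was balanced by construction,
# while production traffic is overwhelmingly legitimate.
train_labels = ["fraud"] * 500 + ["legit"] * 500
prod_labels = ["fraud"] * 20 + ["legit"] * 980

train_dist = class_proportions(train_labels)
prod_dist = class_proportions(prod_labels)
shift = total_variation_distance(train_dist, prod_dist)

print(f"training:   {train_dist}")
print(f"production: {prod_dist}")
print(f"total variation distance: {shift:.2f}")
```

A distance near 0 means the two label mixes look alike, and a value near 1 means they barely overlap. What counts as "too far apart" depends on the model and the problem, so any threshold is a judgment call.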

The Alchemist

What do you do if the necessary data does not exist? You generate it.

Simulated data is usually required either to “smooth” gaps in data coverage, or to reduce reliance on highly sensitive data. For example, in a past project we were asked to provide a dataset of patient stories (full patient histories and inpatient visit records) that would cover the full range of adverse events that can happen within hospitals. While a substantial number of real inpatient records were available, they were incomplete, were hard to get approved for a broad study due to privacy concerns, and still did not cover all adverse events for all relevant demographics, since some of them are relatively rare.

This proved to be a fun & complex data simulation project – each new patient story had to make clinical sense (age, gender, symptoms, medications, order of events, etc.); adverse events had to happen according to their real expected distribution; and patients with no adverse events had to be added, to maintain these distributions, while also keeping the overall distributions (of demographics, specialties, co-morbidities) realistic. That was one example where producing data was harder than the data science project it was used for – and which also required substantial data research to find the relevant adverse event tables, distributions, correlations with demographics and disease states, and others.
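To give a feel for the simplest part of such a simulation, the sketch below draws synthetic patients so that adverse events occur at preset rates and most patients have none. The event names and rates are made up for illustration, and the one-event-per-patient model is a deliberate simplification. It does nothing to make a story clinically plausible, which was the hard part of the project described above.

```python
import random

# Hypothetical per-admission probabilities; real values would have to come
# from researched adverse event tables.
ADVERSE_EVENT_RATES = {
    "fall": 0.02,
    "medication_error": 0.03,
    "surgical_site_infection": 0.01,
}


def simulate_patients(n, rates, seed=0):
    """Generate n synthetic patients with at most one adverse event each."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    # None stands for "no adverse event" and takes the remaining probability.
    events = list(rates) + [None]
    weights = list(rates.values()) + [1.0 - sum(rates.values())]
    patients = []
    for patient_id in range(n):
        event = rng.choices(events, weights=weights, k=1)[0]
        patients.append(
            {"id": patient_id, "age": rng.randint(18, 95), "adverse_event": event}
        )
    return patients


patients = simulate_patients(10_000, ADVERSE_EVENT_RATES)
observed = sum(p["adverse_event"] is not None for p in patients) / len(patients)
print(f"share of patients with an adverse event: {observed:.3f}")
```

With these assumed rates, about 6% of simulated patients should have an event. A realistic generator would also condition the rates on age, gender, specialty and co-morbidities rather than sampling them independently.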

The Engineer

You don’t only need the right dataset at the right quality – you also need it right in the platform you do your analysis in, in the optimal format for that platform and toolset.

Let’s assume you are running a natural language analysis on ten million clinical records, and your tool of choice is Apache Hadoop or Apache Spark. Your DataOps expert should know the data formatting and access choices for these platforms. For example, they should recommend Parquet as the read-optimized data serialization format for that data, transform the data into that format, load it into the cluster for you, generate Hive or SparkSQL tables, and only then call you in to do your job.

On the other hand, if you are running a geo-spatial analysis of people’s access to hospitals, and Elasticsearch is your platform of choice, then a very different recommendation is in order. Given several thousand hospitals at most, a viable choice would be to format the data as one index of hospitals, using GeoJSON for the geo-spatial coordinates or polygons, and load it all into memory for the analysis.
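To show what “hospitals as GeoJSON, held in memory” might look like, here is a small sketch in plain Python. It does not use Elasticsearch or its API. The hospital names and coordinates are invented, and the nearest-hospital lookup by great-circle distance is just one possible access measure chosen for the example.

```python
import json
import math

# Hypothetical hospitals. GeoJSON stores coordinates as [longitude, latitude].
hospitals = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Hospital A"},
            "geometry": {"type": "Point", "coordinates": [-73.98, 40.75]},
        },
        {
            "type": "Feature",
            "properties": {"name": "Hospital B"},
            "geometry": {"type": "Point", "coordinates": [-118.24, 34.05]},
        },
    ],
}


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_hospital(lon, lat, collection):
    """Return (name, distance_km) of the closest hospital in the collection."""
    def distance(feature):
        h_lon, h_lat = feature["geometry"]["coordinates"]
        return haversine_km(lon, lat, h_lon, h_lat)

    best = min(collection["features"], key=distance)
    return best["properties"]["name"], distance(best)


payload = json.dumps(hospitals)  # the serialized form you would hand to a loader
name, km = nearest_hospital(-73.99, 40.73, hospitals)
print(f"nearest: {name}, about {km:.1f} km away")
```

A linear scan like this is reasonable for a few thousand hospitals. A much larger set would call for a spatial index, which is what a platform like Elasticsearch provides.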

Formatting and moving data around isn’t fancy, but it’s a core part of preparing data for analysis, and hence of the DataOps job description.

Last but not least

An undertone of all the above roles is that your DataOps partner has to be a deep domain expert in the space you are working in, and also has to be part of the project team. Make sure people know what problem you’re trying to solve and why, and then raise your expectations of them.
