Replies: 6 comments 7 replies
-
Hi @GUS0K, we are working on addressing this better as far as both documentation (cc @jorgeorpinel) and features, so would be great to get your feedback here. Could you provide some details about how you want to deploy the model? Is it a batch scoring pipeline that will run at regular intervals?
What are the challenges you are facing or foresee?
If you don't need the pipeline in production, you can access any versioned model or other artifact using either the Python API or the command line.
If you want to reuse the pipeline in production, note that DVC currently only supports dependencies and outputs that are written to disk.
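For example, a minimal sketch using the DVC Python API (the repo URL, artifact path, and revision below are placeholders; `dvc get` and `dvc import` are the CLI equivalents):

```python
def load_artifact(repo_url: str, path: str, rev: str = "main") -> bytes:
    """Fetch the raw bytes of one versioned artifact from a DVC repo.

    `repo_url`, `path`, and `rev` are placeholders -- substitute your own
    repository, artifact path, and Git tag/branch/commit.
    """
    import dvc.api  # deferred so the sketch imports even without DVC installed

    return dvc.api.read(path, repo=repo_url, rev=rev, mode="rb")
```

In production you would typically call this once at startup and deserialize the result (e.g. with `pickle.loads`) rather than re-fetching per request.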
-
One of our main challenges is that the versioned model alone is not sufficient for inference, since it does not include the feature-extraction or data-cleaning steps (implemented in separate scripts). Ideally, we could download the model together with those steps.
-
Sorry, I didn't realize you were the same user I had talked with on Discord! You mentioned there that you wanted to potentially split your pipeline into train and test. Are you still pursuing that path? If you have a pipeline for test data, can you reuse that pipeline for production?
-
No worries :) Yes, I have two problems with this:
-
Update: we will have to run real-time inference (not batch), so the speed at which we can run the pipeline (and avoiding unnecessary stages) will be crucial.
-
If you are running real-time inference, it will be difficult to reuse your development pipeline as-is, since DVC is built around file inputs and outputs. Even if you can keep everything in memory, running it as a DVC pipeline is not ideal, since time will be wasted computing md5 hashes, caching data, etc. Is your development pipeline otherwise simple and fast enough that you would expect it to meet your latency requirements in production? Is your input data still coming in as a CSV in real time? Reusing a versioned model file and keeping it in memory is pretty simple with the DVC Python API. If you want to reuse your data-processing code, could you package the necessary parts for reuse, or even just import those modules from your DVC pipeline if packaging is too much? It seems like ideally you'd like to:
Does that sound right? cc @aguschin
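The "keep the model in memory" part could look like the sketch below. The service class, `DoubleModel` stub, and injected loader are all hypothetical; in production the loader could wrap a `dvc.api.read(...)` call instead of the stub:

```python
import pickle


class ModelService:
    """Load the model once at startup and keep it in memory, so
    per-request latency excludes any file or hashing overhead."""

    def __init__(self, load_model_bytes):
        # `load_model_bytes` is injected: in production it could wrap
        # dvc.api.read(...); here any zero-argument callable works.
        self._model = pickle.loads(load_model_bytes())

    def predict(self, features):
        return self._model.predict(features)


# Stub estimator standing in for the real trained model.
class DoubleModel:
    def predict(self, features):
        return [2 * f for f in features]


service = ModelService(lambda: pickle.dumps(DoubleModel()))
print(service.predict([1, 2, 3]))  # -> [2, 4, 6]
```

Injecting the loader keeps the serving code testable without DVC and makes the "fetch once, serve many" split explicit.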
-
Hi, we are trying to set up a project using DVC and we are running into some design questions regarding best practices for training locally and deploying a model in production for inference.
As of now:
For our local development this works fine: writing intermediate outputs removes the need to re-compute every step when we change a single script (for example, feature extraction).
Our goal is the following:
We want to deploy the model for inference in production. That is, we will have a new dataset Z that we will need to feed through the pipeline (the cleaning and feature-extraction steps, but not the training steps).
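That goal could be sketched as below, with hypothetical stand-ins for the cleaning and feature-extraction stages (in practice the real functions would be imported from the scripts the DVC stages call, so development and production share one code path):

```python
# Hypothetical stand-ins for the pipeline's cleaning and feature-extraction
# stages; in practice import the real functions from the stage scripts.
def clean(rows):
    # Drop records with missing input values.
    return [r for r in rows if r.get("x") is not None]


def extract_features(rows):
    # Derive model features from the cleaned records.
    return [{**r, "x_squared": r["x"] ** 2} for r in rows]


def prepare_for_inference(raw_rows):
    """Run new data Z through the non-training stages only."""
    return extract_features(clean(raw_rows))


z = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
features = prepare_for_inference(z)
print(features)  # two rows survive cleaning, each gaining "x_squared"
```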
Some of our concerns are as follows: