Replies: 6 comments 7 replies
-
Hi @GUS0K, we are working on addressing this better as far as both documentation (cc @jorgeorpinel) and features, so would be great to get your feedback here. Could you provide some details about how you want to deploy the model? Is it a batch scoring pipeline that will run at regular intervals?
What are the challenges you are facing or foresee?
If you don't need the pipeline in production, you can access any versioned model or other artifact using either the Python API or the command line.
If you want to reuse the pipeline in production, note that DVC currently only supports dependencies and outputs that are written to disk.
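For example, a minimal sketch using the DVC Python API (the repo URL, artifact path, and revision below are placeholders; `dvc get` and `dvc import` are the CLI equivalents):

```python
def load_artifact(repo_url: str, path: str, rev: str = "main") -> bytes:
    """Fetch the raw bytes of one versioned artifact from a DVC repo.

    `repo_url`, `path`, and `rev` are placeholders -- substitute your own
    repository, artifact path, and Git tag/branch/commit.
    """
    import dvc.api  # deferred so the sketch imports even without DVC installed

    return dvc.api.read(path, repo=repo_url, rev=rev, mode="rb")
```

In production you would typically call this once at startup and deserialize the result (e.g. with `pickle.loads`) rather than re-fetching per request.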
-
One of our main challenges is that the versioned model alone is not sufficient for inference, since it does not include the feature-extraction or data-cleaning steps (implemented in separate scripts). Ideally, we could download the model together with those steps.
-
Sorry, I didn't realize you were the same user I had talked with on Discord! You mentioned there that you wanted to potentially split your pipeline into train and test. Are you still pursuing that path? If you have a pipeline for test data, can you reuse that pipeline for production?
-
No worries :) Yes, I have two problems with this:
-
Update: we will have to run real-time inference (not batch), so the speed at which we can run the pipeline (and avoiding unnecessary stages) will be crucial.
-
If you are running real-time inference, it will be difficult to reuse your development pipeline as-is, since DVC is built around file inputs and outputs. Even if you can keep everything in memory, running it as a DVC pipeline is not ideal, since time will be wasted computing md5 hashes, caching data, etc. Is your development pipeline otherwise simple and fast enough that you would expect it to meet your latency requirements in production? Is your input data still coming in as a CSV in real time? Reusing a versioned model file and keeping it in memory is pretty simple with the DVC Python API. If you want to reuse your data-processing code, could you package the necessary parts for reuse, or even just import those modules from your DVC pipeline if packaging is too much? It seems like ideally you'd like to:
Does that sound right? cc @aguschin
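The "keep the model in memory" part could look like the sketch below. The service class, `DoubleModel` stub, and injected loader are all hypothetical; in production the loader could wrap a `dvc.api.read(...)` call instead of the stub:

```python
import pickle


class ModelService:
    """Load the model once at startup and keep it in memory, so
    per-request latency excludes any file or hashing overhead."""

    def __init__(self, load_model_bytes):
        # `load_model_bytes` is injected: in production it could wrap
        # dvc.api.read(...); here any zero-argument callable works.
        self._model = pickle.loads(load_model_bytes())

    def predict(self, features):
        return self._model.predict(features)


# Stub estimator standing in for the real trained model.
class DoubleModel:
    def predict(self, features):
        return [2 * f for f in features]


service = ModelService(lambda: pickle.dumps(DoubleModel()))
print(service.predict([1, 2, 3]))  # -> [2, 4, 6]
```

Injecting the loader keeps the serving code testable without DVC and makes the "fetch once, serve many" split explicit.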
-
Hi, we are trying to set up a project using DVC and we are running into some design questions regarding best practices for training locally and deploying a model in production for inference.
As of now:
For our local development this works fine: writing intermediate outputs removes the need to re-compute every step when we change a single script (for example, feature extraction).
Our goal is the following:
We want to deploy the model for inference in production. That is, we will have a new dataset Z that we will need to feed through the pipeline (the cleaning and feature-extraction steps, but not the training steps).
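That goal could be sketched as below, with hypothetical stand-ins for the cleaning and feature-extraction stages (in practice the real functions would be imported from the scripts the DVC stages call, so development and production share one code path):

```python
# Hypothetical stand-ins for the pipeline's cleaning and feature-extraction
# stages; in practice import the real functions from the stage scripts.
def clean(rows):
    # Drop records with missing input values.
    return [r for r in rows if r.get("x") is not None]


def extract_features(rows):
    # Derive model features from the cleaned records.
    return [{**r, "x_squared": r["x"] ** 2} for r in rows]


def prepare_for_inference(raw_rows):
    """Run new data Z through the non-training stages only."""
    return extract_features(clean(raw_rows))


z = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
features = prepare_for_inference(z)
print(features)  # two rows survive cleaning, each gaining "x_squared"
```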
Some of our concerns are as follows: