Replies: 1 comment
Hey @alhaal, let's connect and discuss the new tool, DataChain, that we are going to open source soon. I think it's potentially a better fit for managing and versioning data in your case; it was designed from the very beginning to cover such scenarios properly. DM me at ivan @ dvc.ai, or let me know how I can reach you.
Anyway, let's connect, discuss, and see what we can do here. We'll try to help, @alhaal.
We started using DVC a few days ago in our team. It is a very useful tool. We have a use case where we need advice on how to realize it in DVC. We have a large image dataset distributed across several folders and subfolders, each containing several images. We created a data registry and added the subfolders to DVC separately, so a .dvc file is created for each subfolder, i.e.:
```
images/
├── folder1/
│   ├── subfolder1/
│   └── subfolder1.dvc
└── folder2/
    └── subfolder2/
        ├── subsubfolder2/
        └── subsubfolder2.dvc
```
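For reference, the setup described above could look roughly like this (a sketch only; the remote name and bucket are hypothetical placeholders):

```shell
# In the data-registry repo: track each subfolder separately,
# which produces one .dvc file per tracked subfolder.
dvc add images/folder1/subfolder1
dvc add images/folder2/subfolder2/subsubfolder2

# Version-aware S3 remote: files keep their original names/layout
# in the bucket and S3 object versioning tracks history.
# ("storage" and "my-bucket" are hypothetical names.)
dvc remote add -d storage s3://my-bucket/image-registry
dvc remote modify storage version_aware true

dvc push
git add images/folder1/subfolder1.dvc \
        images/folder2/subfolder2/subsubfolder2.dvc .dvc/config
git commit -m "Track image subfolders in the data registry"
```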
We use S3 as the remote and utilize the --version-aware functionality. We also use FiftyOne to select a subset of the images for annotation (i.e., the most unique images). We do not change the images in the folders/subfolders above, we only sample from them based on some criteria. We want to create a new dataset in the data registry from the selected images, which other users can then import (dvc import) into their repos to work on, i.e. we want to have something like this:
```
images/
├── folder1/
│   ├── subfolder1/
│   └── subfolder1.dvc
├── folder2/
│   └── subfolder2/
│       ├── subsubfolder2/
│       └── subsubfolder2.dvc
└── dataset.dvc
```
The dataset.dvc file should point to the selected images from the other folders, without having to copy them into a new subfolder and without pushing the data to the S3 remote again.
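For illustration, the kind of pointer file we have in mind could be sketched like this (file names and paths are hypothetical; this is a plain manifest written by our selection step, not an actual .dvc file):

```python
# Sketch: record the sampled images as a manifest of registry-relative
# paths instead of copying the files. The "selected" list stands in for
# the output of the FiftyOne sampling step.
from pathlib import Path


def write_manifest(selected, out_path):
    """Write one registry-relative image path per line, sorted so the
    manifest diffs cleanly when committed to git."""
    out = Path(out_path)
    out.write_text("\n".join(sorted(selected)) + "\n")
    return out


selected = [
    "images/folder2/subfolder2/subsubfolder2/img_0042.jpg",
    "images/folder1/subfolder1/img_0001.jpg",
]
manifest = write_manifest(selected, "dataset_manifest.txt")
# Downstream repos could then fetch each listed file from the registry
# individually, without the registry re-pushing any data.
```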
How can we achieve this in DVC?