Replies: 1 comment
Hey @alhaal, let's connect and discuss the new tool, DataChain, that we are going to open source soon. I think it's potentially a better fit for managing and versioning data in your case; it was designed from the very beginning to cover such scenarios properly. DM me at ivan @ dvc.ai, or let me know how I can reach you.
Anyway, let's connect, discuss, and see what we can do here. We'll try to help, @alhaal.
We started using DVC a few days ago in our team. It is a very useful tool. We have a use case where we need advice on how to realize it in DVC. We have a large image dataset distributed across several folders and subfolders, each containing several images. We created a data registry and added the subfolders to DVC separately, so a .dvc file is created for each subfolder, i.e.:
```
images/
├── folder1/
│   ├── subfolder1/
│   └── subfolder1.dvc
└── folder2/
    └── subfolder2/
        ├── subsubfolder2/
        └── subsubfolder2.dvc
```
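For reference, the setup described above could look roughly like this (a sketch only; the remote name and bucket are hypothetical placeholders):

```shell
# In the data-registry repo: track each subfolder separately,
# which produces one .dvc file per tracked subfolder.
dvc add images/folder1/subfolder1
dvc add images/folder2/subfolder2/subsubfolder2

# Version-aware S3 remote: files keep their original names/layout
# in the bucket and S3 object versioning tracks history.
# ("storage" and "my-bucket" are hypothetical names.)
dvc remote add -d storage s3://my-bucket/image-registry
dvc remote modify storage version_aware true

dvc push
git add images/folder1/subfolder1.dvc \
        images/folder2/subfolder2/subsubfolder2.dvc .dvc/config
git commit -m "Track image subfolders in the data registry"
```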
We use S3 as the remote and utilize the --version-aware functionality. We also use FiftyOne to select a subset of the images for annotation (i.e., the most unique images). We do not change the images in the folders/subfolders above, we only sample from them based on some criteria. We want to create a new dataset in the data registry from the selected images, which other users can then import (dvc import) into their repos to work on, i.e. we want to have something like this:
```
images/
├── folder1/
│   ├── subfolder1/
│   └── subfolder1.dvc
├── folder2/
│   └── subfolder2/
│       ├── subsubfolder2/
│       └── subsubfolder2.dvc
└── dataset.dvc
```
The dataset.dvc file should point to the selected images from the other folders, without having to copy them into a new subfolder and without pushing the data to the S3 remote again.
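For illustration, the kind of pointer file we have in mind could be sketched like this (file names and paths are hypothetical; this is a plain manifest written by our selection step, not an actual .dvc file):

```python
# Sketch: record the sampled images as a manifest of registry-relative
# paths instead of copying the files. The "selected" list stands in for
# the output of the FiftyOne sampling step.
from pathlib import Path


def write_manifest(selected, out_path):
    """Write one registry-relative image path per line, sorted so the
    manifest diffs cleanly when committed to git."""
    out = Path(out_path)
    out.write_text("\n".join(sorted(selected)) + "\n")
    return out


selected = [
    "images/folder2/subfolder2/subsubfolder2/img_0042.jpg",
    "images/folder1/subfolder1/img_0001.jpg",
]
manifest = write_manifest(selected, "dataset_manifest.txt")
# Downstream repos could then fetch each listed file from the registry
# individually, without the registry re-pushing any data.
```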
How can we achieve this in DVC?