Reconsider gc implementation #5928
Replies: 75 comments 7 replies
-
This is more about UI defaults than implementation, which will surely be affected too. |
Beta Was this translation helpful? Give feedback.
-
yes - very much agree. I am new to dvc, but the gc implementation is already making me nervous. One additional note on gc. Should "metrics" files be treated differently than the rest? I agree that gc should easily collect old model files etc, but keeping the development of performance over time around would be nice. @Suor Why do you think this is more about UI than implementation? I am not familiar in detail with the implementation, but the existence of the "--projects" parameter points to an interesting problem. What should happen to files that aren't referenced anywhere in the current git repository? The conservative answer to this would be to not remove them. However this sounds like a major change in strategy, where only those items are considered for removal that Or am I misunderstanding this? Thanks |
Beta Was this translation helpful? Give feedback.
-
Hi @hhoeflin ! Sorry for the delay.
We usually recommend keeping metrics files in the git (e.g.
In the current state of things, those files are going to be removed, that is why
Currently dvc cache/remote doesn't have any notion of which project it came from, it would require some additional complications to make it aware of that.
It is clear that we need different strategies for gc. Here are a few related tickets about other strategies: #678 #377 #155 #855 Improving gc is around the top of our TODO list, so we will get to dedicating it more time soon. Thank you for the feedback, we really appreciate it! 🙂 |
Beta Was this translation helpful? Give feedback.
-
An excellent overview and answer @efiop, thank you :) A few interesting questions have come to mind reading this. For example, should it be analyzing all commits in all branches in the repo by default (vs only all commits in the current branch)? How do we ensure that we have all the branches we need? How do we ensure (and should we?) that when we run it, the branch is correct? Should it be for example, a global command (in a |
Beta Was this translation helpful? Give feedback.
-
One of the sources of issues is we are treating local gc and remote gc the same now, which is wrong for the very usual scenario of using both - in that case local cache is temporary, so we want to say what we want to keep. However for last/permanent storage such as remote cache we usually want to keep everything referenced by default. This is complicated by the existence of other scenarios like shared machine with shared local cache. We should somehow separate this cases. BTW, when we think like that #2037 becomes obvious. Another thing is, given distributed nature of git, it's impossible to implement |
Beta Was this translation helpful? Give feedback.
-
@Suor |
Beta Was this translation helpful? Give feedback.
-
Extracted from #2147: Since git has a distributed nature and we are collecting everything not referenced we may remove something recently pushed and only referenced in git commits not available locally yet - not pulled or not even pushed yet by an author. The sane method to circumvent this is providing a grace period: do not gc anything newer than N days. In the case someone has just pushed something one shouldn't and wants to remove that it should be still possible to do that: dvc gc -c --grace-period=0 This is "I know what I am doing even thoufgh I just messed up flag" :) |
Beta Was this translation helpful? Give feedback.
-
@Suor grace-period won't work, because a user might use some models that he has generated earlier and by removing those we would break the project for him. Next logical step is LRU, but, once again it is not possible to properly implement it in a distributed nature. I think gc should only be run with the upstream project, to include all important references, and for forks users should use separate storage until their PR is merged. |
Beta Was this translation helpful? Give feedback.
-
@efiop LRU won't work either, if that model is in local cache it won't be accessed anyway. But grace period will work for most cases, the example with using some old model, which is still in use, but never referenced by any tag/branch is artificial. |
Beta Was this translation helpful? Give feedback.
-
@Suor I'm not saying that is not referenced by any tag/commit/branch, I'm saying that you might've added a dataset 2 years ago and still using it on your |
Beta Was this translation helpful? Give feedback.
-
@efiop if it's referenced then it won't be collected, the grace period is needed to protect something only recently added and not referenced in a particular repo |
Beta Was this translation helpful? Give feedback.
-
@hhoeflin "deleting everything that I don't know I need" is the right thing for local repo backed by remote. So the issue is the same approach in both situations. |
Beta Was this translation helpful? Give feedback.
This comment has been hidden.
This comment has been hidden.
This comment has been hidden.
This comment has been hidden.
This comment has been hidden.
This comment has been hidden.
-
@shcheklein and @skshetry you are going to the rabbit hole again. GC/not-GC is not the most important part here and it is not very important how it corresponds to programming languages. What is the important part - how to clean space? If we just make GC safe - users won't be able to clean the space. So, we cannot take this step without introducing more aggressive strategies or at least without keeping the current one (probably as an optional). Also, the specific remove is needed Actions that I'd suggest (the commands' names TBD) sorted by priorities:
|
Beta Was this translation helpful? Give feedback.
-
Ok, had another round of discussion with @dmpetrov and @shcheklein . Here are the proposed first steps:
And the rest we can leave for the future discussion. What do you think guys, can we start with these steps? |
Beta Was this translation helpful? Give feedback.
-
@efiop thank you for the brief summary. Let's start with this. PS: you can keep only (1) and (2) in this issue and make (3) as a separate issue if you think the scope is too big. Ups to you... |
Beta Was this translation helpful? Give feedback.
-
This is what you proposed last week as well, right @efiop? I stalled (eh? 😄) on it a bit as I wanted to discuss more on the usecases (ref: #2325 (comment)) that I felt like were important from reading this issue. Though, we can start with this than getting stuck. Let's roll with these for now. 🙂 |
Beta Was this translation helpful? Give feedback.
-
How about |
Beta Was this translation helpful? Give feedback.
-
Coming late to this thread. I expected to have a I'd probably suggest requiring either I also find the wording of the
To me, that actually sounds like we will put the garbage into the remote repository, i.e. find garbage locally and upload it to the remote. I think if you just changed "in" to "from", that would help; otherwise I think the following would be more clear:
|
Beta Was this translation helpful? Give feedback.
-
|
Beta Was this translation helpful? Give feedback.
-
@casperdcl @kenahoo, it's not really possible to provide much information in
Thanks for pointing that out. It seems to be a help message. I'll try making a PR this week for that. |
Beta Was this translation helpful? Give feedback.
-
I just want to also vote for considering interactivity when running
|
Beta Was this translation helpful? Give feedback.
-
@skshetry I don't think |
Beta Was this translation helpful? Give feedback.
-
@kenahoo If you don't mind |
Beta Was this translation helpful? Give feedback.
-
Interactivity is a very good idea but not as a default behavior. It is a bit related to prompt\not-prompt problem (#2498) because it complicates scripting. |
Beta Was this translation helpful? Give feedback.
-
From #2451
|
Beta Was this translation helpful? Give feedback.
-
I don't know if this is still an issue, but: It's possible to limit Other than this, there can be something like a The command names may be modified. |
Beta Was this translation helpful? Give feedback.
-
I think so.
Blowup whole directory can be done on the cloud server side. |
Beta Was this translation helpful? Give feedback.
-
As pointed out in discussion in #1691, we should reconsider
gc
implementation.Currently, if called without any options, dvc will collect current branch dependencies and outputs checksums, and remove everything besides it. We can easily clear history of changes with this command.
gc
should be safer with default options. Straightforward implementation could get all outputs for all revisions in git repo and remove everything that is not on list.As pointed out by @Suor, this approach might be slow for repository with long history.
Beta Was this translation helpful? Give feedback.
All reactions