-
-
Notifications
You must be signed in to change notification settings - Fork 315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explore gix APIs, experiment with gix-blame API #1453
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great to see this PR and I am looking forward to witnessing it evolve over time :)!
git commit -q --allow-empty -m c5 | ||
git tag at-c5 | ||
git merge branch1 -m m1b1 | ||
echo "line 2" >> file.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd do the same - start extremely simple, maybe do the first rough implementation so it passes this test, and then think of some tougher cases to throw at it, validating that they still come out right.
Maybe it's worth investing into a baseline test which parses the output of git blame
to get the expected output, which then has to be matched by the algorithm. These are typically the second stage as they are more obscure, but make it easier to get correct expectations for a lot of possibly complex inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the input, sounds like a great plan!
@Byron I think I’ve now gotten to a point where I’ve learned enough about My goal now is to write a test that runs a complete blame for the sample repo that is part of this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, you already dug deep into the most complex APIs gitoxide
has to offer, I am pretty sure it can't get much worse than that. Having it, however, helps to know what data is needed which will help tremendously with defining the API later.
I think continuing with the test to try and complete the algorithm would be good, even though I'd probably start with refactoring the 'setup' portion of the test a little more to be able to focus on the blame algorithm.
But that's details, I think you are on the right track and this PR already does everything that's needed to do a blame.
Once the core algorithm is proven, as most naive and simple version of course, I'd probably create a baseline of some samples with git
itself, parse the result and use it as expectation for the algorithm created here.
I hope that helps.
gix-blame/tests/blame.rs
Outdated
|
||
use gix_diff::blob::intern::Token; | ||
use gix_hash::ObjectId; | ||
use gix_odb::pack::FindExt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd want to use the equally named gix_object::FindExt
, it's easier to use.
Thanks, that’s very helpful already! I think I wanted to mostly make sure I’m still on the right track and not missing anything. I’ll continue working on the test, then, and worry about the public API later. |
Yes, I think it's great to keep up the explorative approach using |
@Byron This is a quick status update! The algorithm is now able to generate a blame for the sample repo. I extended the test to compare this blame against |
@Byron Sooner than expected, there seems to have been quite a bit of progress! I extended the test repo so that it now also contains commits that don’t touch the file being blamed. I also added a file that contains multiline hunks. There’s now 2 test cases that cover one file each. Also, most of the logic is now inside the |
That's great news!
That's interesting - I just know that Git usually ignores merge-commits, but I don't know if it includes them in blames nonetheless. I never thought about how that would work, and would probably see what Git would do. It's something I do in general, that is, try to understand what Git can do and ask if this would (eventually) be possible with the Besides that, I definitely recommend to keep clippy happy and run it locally at least occasionally - ideally CI can stay green even during early development, and I found it worth it as it typically leads to cleaner code. Often I find myself doing a push&run, just to return one more time to quickly fix CI if it breaks so I can continue 'clean' in the morning :D. |
@Byron I think I’m now at a point where I’d appreciate a brief sync or a bit of feedback on how to best continue. Overall, I’m happy with the results so far. I’ve tested the implementation on linear commit histories with up to 90 commits, and for the files that I’ve blamed, the results match These are some observations and remaining issues:
This is a bit of a braindump and it probably is missing some important context. I hope that is’s useful anyway and am happy to elaborate on anything as needed. :-) |
Also, there’s errors in the CI pipeline, but they don’t seem to be related to any change in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I had a lot of fun working on this PR, so thanks for giving me the space! :-)
I am glad to hear that, and love the progress you made. This is the PoC already, proving that it generally works - awesome!
@Byron I think I’m now at a point where I’d appreciate a brief sync or a bit of feedback on how to best continue. Overall, I’m happy with the results so far. I’ve tested the implementation on linear commit histories with up to 90 commits, and for the files that I’ve blamed, the results match
git
’s results.
I could also imagine to work towards getting a bare-bones version of gix blame
quickly so that the API can already be evolved to make the non-interactive case usable, while also opening it up for more manual and performance tests.
- Merges still need to be taken into account, in the sense that while
git
indeed seems to ignore them, my implementation currently only follows one parent instead of all parents.
As gix-blame
is on a plumbing level, it wouldn't have to care about this at all. However, upper layers will make decisions about that, and these typically align with Git when feasible.
- There seem to be differences between
imara-diff
andgit diff
. I want to do more research on this topic as it directly affects the result of a blame.
Once you found one and can reproduce it with a simple diff, we would make it reproducible and work on fixing imara-diff
. Please note that higher-levels also configure the diff algorithm correctly, so git blame
might be different if it picks up a different value for diff.algorithm
. But it's very possible that the issue is legitimate and it will be an ultra-high-priority fix for me.
- Running the current implementation on even slightly larger repositories (I ran some tests inside
gitoxide
) is slow. I suspect this is due to the part where two trees are diffed to find out whether the file under blame was changed. I might also just have missed an API that does exactly what I want.
Tree-diffs need object-caches setup to be competitive, and also need the best possible zlib backend (zlib-ng
). Both are easy to miss.
- Some of the tests are quite verbose, there’s probably opportunities for some de-duplication using macros.
I did some of that, it looks less noisy now. To me it seems fine, particularly because there only appear to be two hard todo!()
s left which I'd think you can lure out with two more of these tests. So it's unlikely these more verbose tests keep growing forever.
- Most of the recent fixes didn’t add test cases to
make_blame_repo.sh
, but I instead added very specific test cases likeprocess_change_works_added_hunk
that only test a single iteration of the basic building block.
I love that.
- The implementation doesn’t follow renames.
It's rename tracking that is setup by the higher-level. But probably blame also needs to know about it, and eventually that should probably also be implemented.
However, it can wait until the API is a bit more fleshed out.
- I even found a difference between
git
andlibgit2
when it comes to coalescing blame entries before returning them to the caller. I’ve documented that fact in a comment.
That's a great catch and the level of diligence that gitoxide
requires. Of course, ideally it's not applied as post-process, but we might get there in stage three: make it work fast, now it's stage one: make it work.
Next Steps
Whenever you feel ready, I recommend to address the TODO's I left in Cargo.toml
and adjust the API to not do any of the initialization work itself.
From there, my next step would be to make it usable in gix blame
(even without wiring it up in gix
(the crate) for now), so be able to validate the API for non-interactive cases. What's most important, and something that's not tested at all right now, is that having a gix blame
will make clear how it shall be possible for callers to access actual lines of code.
To do that correctly, they'd need access to the original interner and the tokens behind the ranges that are used in the blame. It's really important to keep that in sync, and avoid making the impression that the caller should re-split blobs into lines.
I hope that helps to continue, please feel free to ask follow-up questions.
PS: I am very happy with the level of independence of your work - this review and refactor took me a little more than 2 hours, which is on the low-end for the stage the feature is in (i.e. it's seemingly nearly working correctly).
Thanks for the review and the feedback, that’s very helpful! I’ll have a look and continue working or ask follow-up questions. :-) |
It's going to be good, thanks again for all your work on this! My greatest hope is that you keep having fun :).
Maybe re-using the tokens isn't feasible after all as each diff creates a new set of them. The interner can keep all of the tokens, probably a good thing also for performance, but I wonder if it's feasible to keep all tokens around. In the end, each one is only a Lastly, my feeling here is that it would be best if tokens would be made available on a per-blob basis so that a single interner can keep all of these tokens for later access. But the above are just thoughts, |
@Byron I think I was able to identify a case where Before:
After:
The difference is with respect to the newlines surrounding the second |
That's great! Could this be reproduced with a baseline test, maybe also producing versions with Meyers and Histogram respectively? Before reeling in Pascal who can most certainly enlighten us, I'd love to have a failing test along with all the facts that we can collect. Thanks again for your help with this. |
I just added two failing test cases. Since |
I’ve also moved most of the setup out of |
Great, thanks, I will take a closer look later.
Yes, that's exactly it. I thought traversal can be reduced to an iterator yielding object-ids, but it's nothing I validated and I am probably missing an important detail. |
The issue might be that we need to take an The dependency on |
I took a closer look and have factored out the remaining bits that can be removed in a straightforward fashion. Maybe that's mostly it already. The test-case is wonderful by the way as it should be straightforward to narrow down the cause of the issue, or at least get to a point where Pascal can be invited. |
I think cry for help it is, @pascalkuthe 😅. After applying this patch: diff --git a/gix-blame/src/lib.rs b/gix-blame/src/lib.rs
index a737965a6..c4ec75cfa 100644
--- a/gix-blame/src/lib.rs
+++ b/gix-blame/src/lib.rs
@@ -815,6 +815,13 @@ pub fn blame_file<E>(
let outcome = resource_cache.prepare_diff().unwrap();
let input = outcome.interned_input();
+ for token in &input.before {
+ eprintln!("{:?}", input.interner[*token].as_bstr());
+ }
+ dbg!("after");
+ for token in &input.after {
+ eprintln!("{:?}", input.interner[*token].as_bstr());
+ }
let number_of_lines_in_destination = input.after.len();
let change_recorder = ChangeRecorder::new(number_of_lines_in_destination.try_into().unwrap());
let changes = gix_diff::blob::diff(gix_diff::blob::Algorithm::Histogram, &input, change_recorder);
diff --git a/gix-blame/tests/blame.rs b/gix-blame/tests/blame.rs
index 437e3de59..039005e9f 100644
--- a/gix-blame/tests/blame.rs
+++ b/gix-blame/tests/blame.rs
@@ -219,7 +219,7 @@ mktest!(same_line_changed_twice, "same-line-changed-twice", 2);
mktest!(coalesce_adjacent_hunks, "coalesce-adjacent-hunks", 1);
#[test]
-#[ignore = "TBD: figure out what the problem is"]
+// #[ignore = "TBD: figure out what the problem is"]
// As of 2024-09-24, these tests are expected to fail.
//
// Context: https://github.com/Byron/gitoxide/pull/1453#issuecomment-2371013904 it shows the following:
From this I see that the tokens are exactly what's expected as the basis for a The baseline for the expected result is the following. It's well balanced and also quite intuitive. However, our result seems off: Diff < left / right > :
[
BlameEntry {
range_in_blamed_file: 0..2,
range_in_original_file: 0..2,
commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
},
BlameEntry {
range_in_blamed_file: 2..4,
range_in_original_file: 2..4,
commit_id: Sha1(6d23453055b4a855092a0be7b3e4741414c1e547),
},
BlameEntry {
< range_in_blamed_file: 4..5,
< range_in_original_file: 2..3,
> range_in_blamed_file: 4..6,
> range_in_original_file: 2..4,
commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
},
BlameEntry {
< range_in_blamed_file: 5..7,
< range_in_original_file: 5..7,
> range_in_blamed_file: 6..8,
> range_in_original_file: 6..8,
commit_id: Sha1(6d23453055b4a855092a0be7b3e4741414c1e547),
},
BlameEntry {
< range_in_blamed_file: 7..9,
< range_in_original_file: 3..5,
> range_in_blamed_file: 8..9,
> range_in_original_file: 4..5,
commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
},
] Just to be sure that it's not the hunk-recorder that's the actual foundation for the blame, I also printed the hunks that it finds.
They seem to fit. I'd be happy to investigate this further, but wanted to ask you, @pascalkuthe , if you have any ideas to guide this. Maybe it's super-obvious to you what's going on. Something I can imagine is happening here is that blame does something special beyond the normal diffing, I myself didn't have any look at it yet. Thanks for sharing your thoughts. |
at first glance this looks like the slider problem (https://github.com/mhagger/diff-slider-tools/blob/master/README.md). Basically the algorithm can say that your diff is:
or it can say that the diff is:
Both are correct (minimal) difss so there is nothing wrong with the diff algorithm/results it's just that there are multiple different possible choices a diff can make. Git has some postprocessing heuristics to make a choice but from an objective point of view one choice isn't wrong just that some choices feel more intuitive to humans so it's not a bug in the diff algorithm just a missing feature. Git resolves those ambiguities by a heuristic. Basic summary:
I have an implementation of the first two coupled with some (necessary) API change that was aimed to be the first breaking imara-diff change. I started porting helix to the new API but got a bit stuck timewise and didn't get around to finishing that. The third heuristic is very involved implementation wise and also difficult to add from an API perspective since it only makes sense for (line-segmented) text but imara-diff operates on abstract tokens and allows diffing arbitrary data (segmented in arbitrary ways). I think we could aim to get the basic heuristic in, the more advanced indentation based code I am not sure I have time to work on at the moment. For this PR I would suggest to just update the fixtures to match whatever imara-diff produces and then update them as imara-diff improves in that regard. |
Quick side note to my last commit: I’ve had to add another parameter to |
Thanks so much for sharing! I also do remember now that you already shared this with me, probably in person, so it's clear that these heuristics are a tuneable and there isn't really a right answer. After reading at the link location really makes me think that
Thus, this is likely going to be a place where Until then, we go with your suggestion and just keep these examples in our baselines and mark them accordingly. Maybe one day they fit, or one day we mark ours as preferred. PS: I really can't wait for these heuristics to be implemented, can't wait for |
@Byron This is another quick status update: I’ve now refactored and simplified the implementation a bit, mostly to make the concepts used more easy to discern. I’m also quite far already in my approach to being able to blame commits with more than one parent. So, from my side, the point gets closer where I would call the first implementation done. For context, I just wanted to ask for feedback, in particular on how and when you think this should be merged. I’m mainly asking because there’s now more than one area I’d like to focus on, and I can imagine it might be easier to spread the work into PRs that are more specific to certain aspects, instead of keeping everything in this one. I can still follow it, but it’s already very layered in the sense that there’s quite a few discussions and threads overlapping. :-) What are your thoughts? I’m also fine with continuing in this PR, so whatever works best for you! |
Thanks for the update :). Generally I am happy to follow your preference when it comes to merging, so what follows is just my preference, which isn't the deciding factor here. What I'd do before merging is to 'quickly' wire it up into And since you probably have some more areas of work already in your mind, I think it would be good to collect them in a task list or tracking issue, so we both can keep an overview. It's probably best if you'd create a tracking issue for that so you can edit it primarily. In summary, a tracking issue we should have, and everything else is up to you then - you can ask me to review and merge, and that's what I will do, with or without I hope that helps. |
Thanks, that helps a lot! I think I’ll give wiring it up into |
62d3b11
to
b2bd2cb
Compare
Quick question: did I put the code for wiring up |
Sounds good! I recommend starting with 'oneline' mode without anything else. Regarding traversal performance, there isn't anything one can do except for setting an object cache. There are a couple of tricks as well, all of the ones I could find are used with |
P.S. if it is mostly I/O, then maybe io-uring will help Maybe I could try setting up tokio-uring? |
Unfortunately it is not, it's mostly zlib-inflate. If it was IO-bound, it would be so much faster, and I still dream of a different pack format that would allow for a different compression. |
This is the script I use to compare the output of |
.into(); | ||
let file_path: &BStr = gix::path::os_str_into_bstr(file)?; | ||
|
||
let blame_entries = gix::blame::blame_file( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Byron This makes compiliation on CI fail as gix::blame
is behind a feature flag. I feel like I’ve hit this issue already in the past, but I forgot how to resolve it. Would you be able to help me? :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting - looks like the blame
feature disappeared from the list somehow. In any case, it's back now and CI should get past this point (even though clippy has some legit complaints I locally at least).
let absolute_path = worktree_path.join(file_path.to_str().unwrap()); | ||
|
||
// TODO Verify that `imara-diff` tokenizes lines the same way `lines` does. | ||
let number_of_lines = std::fs::read_to_string(absolute_path).unwrap().lines().count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Byron What is the easiest way to get a file’s contents, given the commit id and the file path? In order to be able to more granularly debug gix blame
, I’d like to add an argument that specifies the commit to start the blame at.
git blame
has this as <rev>
in its documentation, enabling use cases like this one: git blame <commit_id> <file_path>
(possibly separated by --
in case of ambiguity as far as I can tell).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a way to start the iteration at a certain commit. That itself should already be possible based on what's there. Getting a blob at a path can be done with Tree::lookup_path()
if I recall correctly.
But probably you knew that already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found Tree::lookup_entry (which used to be called lookup_path
, according to CHANGELOG.md
), but since this is only available in gix
, not in any of the plumbing crates, I figured it was not supposed to be used here. Which is why I’m looking for a lower-level alternative if there is one. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely, in my mind I was in gix blame
even though this is gix-blame
the crate 🤦♂️. I am confused sometimes.
gix-blame/src/lib.rs
Outdated
let commit_id = odb.find_commit(&suspect, &mut buffer).unwrap().tree(); | ||
let tree = odb.find_tree(&commit_id, &mut buffer).unwrap(); | ||
|
||
let Some(entry) = tree.bisect_entry(file_path, false) else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Byron Given your comment, it seems like bisect_entry
might be the wrong choice here as well. gix_object
doesn’t seem to provide a method comparable to lookup_entry
, though. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the lower-level cousin doesn't support nested lookup, but there is no reason it shouldn't be able to do that.
It's probably possible to port lookup_path()
from gix
down to gix-object
, or gix-blame
implements its own version of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first glance, it seems like porting to gix-object
provides the best trade-off (code living in an obvious location, can more easily be re-used than if it were to live in gix-blame
) which is why I’ll try that first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great to hear, thanks a lot!
Let's have it in a separate PR so I can merge it much more quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New PR: #1686.
Note that it's still very early, and this is more of a proof-of-concept with a couple of bits still missing without which its usefulness and real-world applicability is limited.
For now, it doesn't come with a simplified `gix` API though.
That way it's possible to see the `blame` result of any file in the repository. Co-authored-by: Sebastian Thiel <sebastian.thiel@icloud.com>
- Merge commits where a file was added in one chain of ancestors - Merge commits where a conflict was resolved
`bisect_entry` is the wrong function to search for entries here as it does not take nesting in paths into account.
This is a draft PR that is supposed to help me get familiar with
gix
’s APIs as well as to experiment with possible APIs forgix-blame
. Any feedback that helps me not go down the wrong path, is much appreciated! :-)Review Tasks
gix blame
for funnot yet implemented: passing blame to more than one parent is not implemented yet
onREADME.md
:/gix merge-file
as well)Performance Opportunities