Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore gix APIs, experiment with gix-blame API #1453

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

cruessler
Copy link
Contributor

@cruessler cruessler commented Jul 13, 2024

This is a draft PR that is supposed to help me get familiar with gix’s APIs as well as to experiment with possible APIs for gix-blame. Any feedback that helps me not go down the wrong path, is much appreciated! :-)

Review Tasks

  • try gix blame for fun
    • not yet implemented: passing blame to more than one parent is not implemented yet on README.md :/
  • rewrite commits to have one per crate that is touched (simplify handling)
  • CLI improvements
    • make it easier to use a worktree-path through a pathspec (and use that in gix merge-file as well)
    • avoid re-splitting of content, otherwise it's too easy to get the visualization wrong, or using the wrong content
  • thorough review

Performance Opportunities

  • add pathspec (or even: single-path) support to tree-diffing to ideally allow to skip a lot of the work that is otherwise done, or stop the diff earlier.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great to see this PR and I am looking forward to witnessing it evolve over time :)!

gix-blame/tests/blame.rs Outdated Show resolved Hide resolved
git commit -q --allow-empty -m c5
git tag at-c5
git merge branch1 -m m1b1
echo "line 2" >> file.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do the same - start extremely simple, maybe do the first rough implementation so it passes this test, and then think of some tougher cases to throw at it, validating that they still come out right.

Maybe it's worth investing into a baseline test which parses the output of git blame to get the expected output, which then has to be matched by the algorithm. These are typically the second stage as they are more obscure, but make it easier to get correct expectations for a lot of possibly complex inputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input, sounds like a great plan!

@cruessler
Copy link
Contributor Author

@Byron I think I’ve now gotten to a point where I’ve learned enough about gixoxide’s APIs in order to hopefully be able write the first meaningful test. Would you mind looking at the PR’s current state? I’m not sure whether I it would be better to try and find a good API for struct Blame or to continue adding to the already quite large test it_works.

My goal now is to write a test that runs a complete blame for the sample repo that is part of this PR.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, you already dug deep into the most complex APIs gitoxide has to offer, I am pretty sure it can't get much worse than that. Having it, however, helps to know what data is needed which will help tremendously with defining the API later.

I think continuing with the test to try and complete the algorithm would be good, even though I'd probably start with refactoring the 'setup' portion of the test a little more to be able to focus on the blame algorithm.

But that's details, I think you are on the right track and this PR already does everything that's needed to do a blame.

Once the core algorithm is proven, as most naive and simple version of course, I'd probably create a baseline of some samples with git itself, parse the result and use it as expectation for the algorithm created here.

I hope that helps.


use gix_diff::blob::intern::Token;
use gix_hash::ObjectId;
use gix_odb::pack::FindExt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd want to use the equally named gix_object::FindExt, it's easier to use.

@cruessler
Copy link
Contributor Author

Thanks, that’s very helpful already! I think I wanted to mostly make sure I’m still on the right track and not missing anything. I’ll continue working on the test, then, and worry about the public API later.

@Byron
Copy link
Member

Byron commented Aug 11, 2024

Yes, I think it's great to keep up the explorative approach using cargo test and then put that behind an API for abstraction, along with the first simple 'actual' test. And from there, one can iterate to validate more complex cases, while slowly getting to the kind of observable and interruptible API that will eventually be needed for gitui.

@cruessler
Copy link
Contributor Author

@Byron This is a quick status update! The algorithm is now able to generate a blame for the sample repo. I extended the test to compare this blame against git blame’s output. Next, I will review the algorithm in more detail (the loop count, for example, is still hard-coded), then I would start adding more cases, trying to abstract/extract pieces of code on the way.

@cruessler
Copy link
Contributor Author

@Byron Sooner than expected, there seems to have been quite a bit of progress! I extended the test repo so that it now also contains commits that don’t touch the file being blamed. I also added a file that contains multiline hunks. There’s now 2 test cases that cover one file each. Also, most of the logic is now inside the loop, so I was able to delete a couple of lines. I was thinking of maybe trying to tackle merge commits next. In its current state, the algorithm only diffs against a single parent. What do you think? (Also, I’ll address your comment above.)

@Byron
Copy link
Member

Byron commented Aug 23, 2024

That's great news!

In its current state, the algorithm only diffs against a single parent.

That's interesting - I just know that Git usually ignores merge-commits, but I don't know if it includes them in blames nonetheless. I never thought about how that would work, and would probably see what Git would do. It's something I do in general, that is, try to understand what Git can do and ask if this would (eventually) be possible with the gitoxide implementation. When it comes to diffing, I definitely have a lot to learn.

Besides that, I definitely recommend to keep clippy happy and run it locally at least occasionally - ideally CI can stay green even during early development, and I found it worth it as it typically leads to cleaner code. Often I find myself doing a push&run, just to return one more time to quickly fix CI if it breaks so I can continue 'clean' in the morning :D.

@cruessler
Copy link
Contributor Author

@Byron I think I’m now at a point where I’d appreciate a brief sync or a bit of feedback on how to best continue. Overall, I’m happy with the results so far. I’ve tested the implementation on linear commit histories with up to 90 commits, and for the files that I’ve blamed, the results match git’s results. At this point, my next goals are centered around gaining more confidence in the implementation and to adding parts that are still missing compared to git’s implementation. Also, I had a lot of fun working on this PR, so thanks for giving me the space! :-)

These are some observations and remaining issues:

  • Merges still need to be taken into account, in the sense that while git indeed seems to ignore them, my implementation currently only follows one parent instead of all parents.
  • There seem to be differences between imara-diff and git diff. I want to do more research on this topic as it directly affects the result of a blame.
  • Running the current implementation on even slightly larger repositories (I ran some tests inside gitoxide) is slow. I suspect this is due to the part where two trees are diffed to find out whether the file under blame was changed. I might also just have missed an API that does exactly what I want.
  • Some of the tests are quite verbose, there’s probably opportunities for some de-duplication using macros.
  • Most of the recent fixes didn’t add test cases to make_blame_repo.sh, but I instead added very specific test cases like process_change_works_added_hunk that only test a single iteration of the basic building block.
  • The implementation doesn’t follow renames.
  • I even found a difference between git and libgit2 when it comes to coalescing blame entries before returning them to the caller. I’ve documented that fact in a comment.

This is a bit of a braindump and it probably is missing some important context. I hope that is’s useful anyway and am happy to elaborate on anything as needed. :-)

@cruessler
Copy link
Contributor Author

Also, there’s errors in the CI pipeline, but they don’t seem to be related to any change in this PR.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I had a lot of fun working on this PR, so thanks for giving me the space! :-)

I am glad to hear that, and love the progress you made. This is the PoC already, proving that it generally works - awesome!

@Byron I think I’m now at a point where I’d appreciate a brief sync or a bit of feedback on how to best continue. Overall, I’m happy with the results so far. I’ve tested the implementation on linear commit histories with up to 90 commits, and for the files that I’ve blamed, the results match git’s results.

I could also imagine to work towards getting a bare-bones version of gix blame quickly so that the API can already be evolved to make the non-interactive case usable, while also opening it up for more manual and performance tests.

  • Merges still need to be taken into account, in the sense that while git indeed seems to ignore them, my implementation currently only follows one parent instead of all parents.

As gix-blame is on a plumbing level, it wouldn't have to care about this at all. However, upper layers will make decisions about that, and these typically align with Git when feasible.

  • There seem to be differences between imara-diff and git diff. I want to do more research on this topic as it directly affects the result of a blame.

Once you found one and can reproduce it with a simple diff, we would make it reproducible and work on fixing imara-diff. Please note that higher-levels also configure the diff algorithm correctly, so git blame might be different if it picks up a different value for diff.algorithm. But it's very possible that the issue is legitimate and it will be an ultra-high-priority fix for me.

  • Running the current implementation on even slightly larger repositories (I ran some tests inside gitoxide) is slow. I suspect this is due to the part where two trees are diffed to find out whether the file under blame was changed. I might also just have missed an API that does exactly what I want.

Tree-diffs need object-caches setup to be competitive, and also need the best possible zlib backend (zlib-ng). Both are easy to miss.

  • Some of the tests are quite verbose, there’s probably opportunities for some de-duplication using macros.

I did some of that, it looks less noisy now. To me it seems fine, particularly because there only appear to be two hard todo!()s left which I'd think you can lure out with two more of these tests. So it's unlikely these more verbose tests keep growing forever.

  • Most of the recent fixes didn’t add test cases to make_blame_repo.sh, but I instead added very specific test cases like process_change_works_added_hunk that only test a single iteration of the basic building block.

I love that.

  • The implementation doesn’t follow renames.

It's rename tracking that is setup by the higher-level. But probably blame also needs to know about it, and eventually that should probably also be implemented.
However, it can wait until the API is a bit more fleshed out.

  • I even found a difference between git and libgit2 when it comes to coalescing blame entries before returning them to the caller. I’ve documented that fact in a comment.

That's a great catch and the level of diligence that gitoxide requires. Of course, ideally it's not applied as post-process, but we might get there in stage three: make it work fast, now it's stage one: make it work.

Next Steps

Whenever you feel ready, I recommend to address the TODO's I left in Cargo.toml and adjust the API to not do any of the initialization work itself.
From there, my next step would be to make it usable in gix blame (even without wiring it up in gix (the crate) for now), so be able to validate the API for non-interactive cases. What's most important, and something that's not tested at all right now, is that having a gix blame will make clear how it shall be possible for callers to access actual lines of code.

To do that correctly, they'd need access to the original interner and the tokens behind the ranges that are used in the blame. It's really important to keep that in sync, and avoid making the impression that the caller should re-split blobs into lines.

I hope that helps to continue, please feel free to ask follow-up questions.

PS: I am very happy with the level of independence of your work - this review and refactor took me a little more than 2 hours, which is on the low-end for the stage the feature is in (i.e. it's seemingly nearly working correctly).

@cruessler
Copy link
Contributor Author

Thanks for the review and the feedback, that’s very helpful! I’ll have a look and continue working or ask follow-up questions. :-)

@Byron
Copy link
Member

Byron commented Sep 23, 2024

It's going to be good, thanks again for all your work on this! My greatest hope is that you keep having fun :).

To do that correctly, they'd need access to the original interner and the tokens behind the ranges that are used in the blame. It's really important to keep that in sync, and avoid making the impression that the caller should re-split blobs into lines.

Maybe re-using the tokens isn't feasible after all as each diff creates a new set of them. The interner can keep all of the tokens, probably a good thing also for performance, but I wonder if it's feasible to keep all tokens around. In the end, each one is only a u32, so memory isn't an issue. It's more about ease of use… and… thinking about it more… reproducing the blame lines from just token ranges can be done if the caller would re-access all blobs that were used, and re-split the lines exactly like was done for the diff. So ideally, that's controllable from the outside, making the diff more abstract.

Lastly, my feeling here is that it would be best if tokens would be made available on a per-blob basis so that a single interner can keep all of these tokens for later access.

But the above are just thoughts, gix blame will probably make clear quickly what it 'wants' to be.

@cruessler
Copy link
Contributor Author

@Byron I think I was able to identify a case where imara-diff differs from git diff.

Before:

  line 1

  line 2

  line 3

After:

  line 1

  line in between

  line 2

  line in between

  line 3

imara-diff breaks this up as [Unchanged(0..2), Added(2..4, 0), Unchanged(4..5), Added(5..7, 0), Unchanged(7..9)] (ranges are exclusive at the end) while git diff comes to a different result: [Unchanged(0..2), Added(2..4, 0), Unchanged(4..6), Added(6..8, 0), Unchanged(8..9)]. This is with diff.algorithm = histogram and Algorithm::Histogram.

The difference is with respect to the newlines surrounding the second line in between. Interestingly enough, I was not able to observe this behaviour when there was only one block containing line in between added.

@Byron
Copy link
Member

Byron commented Sep 24, 2024

That's great! Could this be reproduced with a baseline test, maybe also producing versions with Meyers and Histogram respectively?
It's notable that the diff takes newlines into account, and it's something that Git might have special behaviour around, I didn't check.

Before reeling in Pascal who can most certainly enlighten us, I'd love to have a failing test along with all the facts that we can collect.

Thanks again for your help with this.

@cruessler
Copy link
Contributor Author

I just added two failing test cases. Since blame_file doesn’t take the algorithm as an argument yet, there’s possibly a mismatch for the Myers case (because the algorithm is hard-coded to Algorithm::Histogram currently), but I wanted to add the test case anyway.

@cruessler
Copy link
Contributor Author

I’ve also moved most of the setup out of blame_file. Is that what you had in mind? There’s still dependencies left on gix-ref and gix-traverse, but they seem to be a bit harder to remove.

@Byron
Copy link
Member

Byron commented Sep 24, 2024

Great, thanks, I will take a closer look later.

I’ve also moved most of the setup out of blame_file. Is that what you had in mind? There’s still dependencies left on gix-ref and gix-traverse, but they seem to be a bit harder to remove.

Yes, that's exactly it. I thought traversal can be reduced to an iterator yielding object-ids, but it's nothing I validated and I am probably missing an important detail.
Generally, I am optimistic that both gix-ref and gix-traverse can go away eventually.

@cruessler
Copy link
Contributor Author

Great, thanks, I will take a closer look later.

I’ve also moved most of the setup out of blame_file. Is that what you had in mind? There’s still dependencies left on gix-ref and gix-traverse, but they seem to be a bit harder to remove.

Yes, that's exactly it. I thought traversal can be reduced to an iterator yielding object-ids, but it's nothing I validated and I am probably missing an important detail. Generally, I am optimistic that both gix-ref and gix-traverse can go away eventually.

The issue might be that we need to take an Iterator that yields Items that themselves have commit_iter() which yields items that have parent_ids(), so there’s a bit of nesting. I imagine it’s certainly possible to abstract that into a trait or something, but it felt less like a low-hanging fruit than the previous changes. :-)

The dependency on gix_ref is only for BStr as far as I can see which can also be imported from gix_object I found out (gix-diff does that, but it also directly imports from bstr).

@Byron
Copy link
Member

Byron commented Sep 24, 2024

I took a closer look and have factored out the remaining bits that can be removed in a straightforward fashion. Maybe that's mostly it already.

The test-case is wonderful by the way as it should be straightforward to narrow down the cause of the issue, or at least get to a point where Pascal can be invited.

@Byron
Copy link
Member

Byron commented Sep 24, 2024

I think cry for help it is, @pascalkuthe 😅.

After applying this patch:

diff --git a/gix-blame/src/lib.rs b/gix-blame/src/lib.rs
index a737965a6..c4ec75cfa 100644
--- a/gix-blame/src/lib.rs
+++ b/gix-blame/src/lib.rs
@@ -815,6 +815,13 @@ pub fn blame_file<E>(
 
         let outcome = resource_cache.prepare_diff().unwrap();
         let input = outcome.interned_input();
+        for token in &input.before {
+            eprintln!("{:?}", input.interner[*token].as_bstr());
+        }
+        dbg!("after");
+        for token in &input.after {
+            eprintln!("{:?}", input.interner[*token].as_bstr());
+        }
         let number_of_lines_in_destination = input.after.len();
         let change_recorder = ChangeRecorder::new(number_of_lines_in_destination.try_into().unwrap());
         let changes = gix_diff::blob::diff(gix_diff::blob::Algorithm::Histogram, &input, change_recorder);
diff --git a/gix-blame/tests/blame.rs b/gix-blame/tests/blame.rs
index 437e3de59..039005e9f 100644
--- a/gix-blame/tests/blame.rs
+++ b/gix-blame/tests/blame.rs
@@ -219,7 +219,7 @@ mktest!(same_line_changed_twice, "same-line-changed-twice", 2);
 mktest!(coalesce_adjacent_hunks, "coalesce-adjacent-hunks", 1);
 
 #[test]
-#[ignore = "TBD: figure out what the problem is"]
+// #[ignore = "TBD: figure out what the problem is"]
 // As of 2024-09-24, these tests are expected to fail.
 //
 // Context: https://github.com/Byron/gitoxide/pull/1453#issuecomment-2371013904

it shows the following:

"  line 1\n"
"\n"
"  line 2\n"
"\n"
"  line 3\n"
[gix-blame/src/lib.rs:821:9] "after" = "after"
"  line 1\n"
"\n"
"  line in between\n"
"\n"
"  line 2\n"
"\n"
"  line in between\n"
"\n"
"  line 3\n"

From this I see that the tokens are exactly what's expected as the basis for a Histogram diff. It's notable that Myers and MyersMinimal both produce the same output.

The baseline for the expected result is the following.

Screenshot 2024-09-24 at 20 40 31

It's well balanced and also quite intuitive.

However, our result seems off:

Diff < left / right > :
 [
     BlameEntry {
         range_in_blamed_file: 0..2,
         range_in_original_file: 0..2,
         commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
     },
     BlameEntry {
         range_in_blamed_file: 2..4,
         range_in_original_file: 2..4,
         commit_id: Sha1(6d23453055b4a855092a0be7b3e4741414c1e547),
     },
     BlameEntry {
<        range_in_blamed_file: 4..5,
<        range_in_original_file: 2..3,
>        range_in_blamed_file: 4..6,
>        range_in_original_file: 2..4,
         commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
     },
     BlameEntry {
<        range_in_blamed_file: 5..7,
<        range_in_original_file: 5..7,
>        range_in_blamed_file: 6..8,
>        range_in_original_file: 6..8,
         commit_id: Sha1(6d23453055b4a855092a0be7b3e4741414c1e547),
     },
     BlameEntry {
<        range_in_blamed_file: 7..9,
<        range_in_original_file: 3..5,
>        range_in_blamed_file: 8..9,
>        range_in_original_file: 4..5,
         commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
     },
 ]

Just to be sure that it's not the hunk-recorder that's the actual foundation for the blame, I also printed the hunks that it finds.

[gix-blame/src/lib.rs:254:9] &before = 2..2
[gix-blame/src/lib.rs:254:9] &after = 2..4
[gix-blame/src/lib.rs:254:9] &before = 3..3
[gix-blame/src/lib.rs:254:9] &after = 5..7

They seem to fit.

I'd be happy to investigate this further, but wanted to ask you, @pascalkuthe , if you have any ideas to guide this. Maybe it's super-obvious to you what's going on.

Something I can imagine is happening here is that blame does something special beyond the normal diffing, I myself didn't have any look at it yet.

Thanks for sharing your thoughts.

@pascalkuthe
Copy link
Contributor

at first glance this looks like the slider problem (https://github.com/mhagger/diff-slider-tools/blob/master/README.md).

Basically the algorithm can say that your diff is:

"Line 1\n"
+"\n"
+"  line in between\n"
\n

or it can say that the diff is:

"Line 1\n"
"\n"
+"  line in between\n"
+ \n

Both are correct (minimal) difss so there is nothing wrong with the diff algorithm/results it's just that there are multiple different possible choices a diff can make. Git has some postprocessing heuristics to make a choice but from an objective point of view one choice isn't wrong just that some choices feel more intuitive to humans so it's not a bug in the diff algorithm just a missing feature.

Git resolves those ambiguities by a heuristic. Basic summary:

  • slide sliders as far down and up as possible to check if multiple hunks can be merged
  • finally slide the slider as far down as possible
  • a fairly complex indentation based heuristic developed and described in that repo I linked (semi-recent addition to git)

I have an implementation of the first two coupled with some (necessary) API change that was aimed to be the first breaking imara-diff change. I started porting helix to the new API but got a bit stuck timewise and didn't get around to finishing that.

The third heuristic is very involved implementation wise and also difficult to add from an API perspective since it only makes sense for (line-segmented) text but imara-diff operates on abstract tokens and allows diffing arbitrary data (segmented in arbitrary ways).

I think we could aim to get the basic heuristic in, the more advanced indentation based code I am not sure I have time to work on at the moment. For this PR I would suggest to just update the fixtures to match whatever imara-diff produces and then update them as imara-diff improves in that regard.

@cruessler
Copy link
Contributor Author

Quick side note to my last commit: I’ve had to add another parameter to blame_file, needed to associate the first UnblamedHunk with an ObjectId. This feels like duplication, though, as that information should, in theory, also be accessible through traverse. Since I did not quickly find a way to get start_id from traverse, I decided to add it as a parameter to blame_file.

@Byron
Copy link
Member

Byron commented Sep 25, 2024

at first glance this looks like the slider problem (https://github.com/mhagger/diff-slider-tools/blob/master/README.md).

Thanks so much for sharing!

I also do remember now that you already shared this with me, probably in person, so it's clear that these heuristics are a tuneable and there isn't really a right answer.

After reading at the link location really makes me think that imara-diff will one day outperform Git (if it offers new heurstics):

An implementation of an alternative slider positioning heuristic that significantly outperforms both git diff and git diff --compaction-heuristic as of Git 2.9.0.

Thus, this is likely going to be a place where gitoxide will differ from Git and hopefully at some point it will differ to the better.

Until then, we go with your suggestion and just keep these examples in our baselines and mark them accordingly. Maybe one day they fit, or one day we mark ours as preferred.

PS: I really can't wait for these heuristics to be implemented, can't wait for ident-new :D.

@cruessler
Copy link
Contributor Author

@Byron This is another quick status update: I’ve now refactored and simplified the implementation a bit, mostly to make the concepts used more easy to discern. I’m also quite far already in my approach to being able to blame commits with more than one parent. So, from my side, the point gets closer where I would call the first implementation done. For context, I just wanted to ask for feedback, in particular on how and when you think this should be merged. I’m mainly asking because there’s now more than one area I’d like to focus on, and I can imagine it might be easier to spread the work into PRs that are more specific to certain aspects, instead of keeping everything in this one. I can still follow it, but it’s already very layered in the sense that there’s quite a few discussions and threads overlapping. :-) What are your thoughts? I’m also fine with continuing in this PR, so whatever works best for you!

@Byron
Copy link
Member

Byron commented Sep 28, 2024

Thanks for the update :).

Generally I am happy to follow your preference when it comes to merging, so what follows is just my preference, which isn't the deciding factor here.

What I'd do before merging is to 'quickly' wire it up into gix blame to be forced to reproduce the lines with the blame information. The last time I checked, this would have been difficult or at least easy to get wrong.

And since you probably have some more areas of work already in your mind, I think it would be good to collect them in a task list or tracking issue, so we both can keep an overview. It's probably best if you'd create a tracking issue for that so you can edit it primarily.

In summary, a tracking issue we should have, and everything else is up to you then - you can ask me to review and merge, and that's what I will do, with or without gix blame.

I hope that helps.

@cruessler
Copy link
Contributor Author

Thanks, that helps a lot! I think I’ll give wiring it up into gix blame a try, and I’ll create a tracking issue.

@cruessler
Copy link
Contributor Author

Quick question: did I put the code for wiring up gix blame in the right locations? (Feel free to change as necessary!)

@Byron
Copy link
Member

Byron commented Oct 21, 2024

I was indeed thinking of something simple to start with, mostly to get familiar with the traversal order and later to investigate what ways there are to speed up traversal.

Sounds good! I recommend starting with 'oneline' mode without anything else.

Regarding traversal performance, there isn't anything one can do except for setting an object cache. There are a couple of tricks as well, all of the ones I could find are used with ein t hours if you want to have a look. For gix log I would be more interested in simple code though.

@NobodyXu
Copy link
Contributor

NobodyXu commented Oct 22, 2024

Regarding traversal performance, there isn't anything one can do except for setting an object cache.

P.S. if it is mostly I/O, then maybe io-uring will help

Maybe I could try setting up tokio-uring?

@Byron
Copy link
Member

Byron commented Oct 23, 2024

Unfortunately it is not, it's mostly zlib-inflate. If it was IO-bound, it would be so much faster, and I still dream of a different pack format that would allow for a different compression.

@cruessler
Copy link
Contributor Author

This is the script I use to compare the output of gix blame and git blame for batches of files in a given repository: https://gist.github.com/cruessler/54fdf1379e03e86be2b7b748750c181e.

.into();
let file_path: &BStr = gix::path::os_str_into_bstr(file)?;

let blame_entries = gix::blame::blame_file(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Byron This makes compiliation on CI fail as gix::blame is behind a feature flag. I feel like I’ve hit this issue already in the past, but I forgot how to resolve it. Would you be able to help me? :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting - looks like the blame feature disappeared from the list somehow. In any case, it's back now and CI should get past this point (even though clippy has some legit complaints I locally at least).

let absolute_path = worktree_path.join(file_path.to_str().unwrap());

// TODO Verify that `imara-diff` tokenizes lines the same way `lines` does.
let number_of_lines = std::fs::read_to_string(absolute_path).unwrap().lines().count();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Byron What is the easiest way to get a file’s contents, given the commit id and the file path? In order to be able to more granularly debug gix blame, I’d like to add an argument that specifies the commit to start the blame at.

git blame has this as <rev> in its documentation, enabling use cases like this one: git blame <commit_id> <file_path> (possibly separated by -- in case of ambiguity as far as I can tell).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a way to start the iteration at a certain commit. That itself should already be possible based on what's there. Getting a blob at a path can be done with Tree::lookup_path() if I recall correctly.

But probably you knew that already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found Tree::lookup_entry (which used to be called lookup_path, according to CHANGELOG.md), but since this is only available in gix, not in any of the plumbing crates, I figured it was not supposed to be used here. Which is why I’m looking for a lower-level alternative if there is one. :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, in my mind I was in gix blame even though this is gix-blame the crate 🤦‍♂️. I am confused sometimes.

let commit_id = odb.find_commit(&suspect, &mut buffer).unwrap().tree();
let tree = odb.find_tree(&commit_id, &mut buffer).unwrap();

let Some(entry) = tree.bisect_entry(file_path, false) else {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Byron Given your comment, it seems like bisect_entry might be the wrong choice here as well. gix_object doesn’t seem to provide a method comparable to lookup_entry, though. Am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the lower-level cousin doesn't support nested lookup, but there is no reason it shouldn't be able to do that.
It's probably possible to port lookup_path() from gix down to gix-object, or gix-blame implements its own version of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance, it seems like porting to gix-object provides the best trade-off (code living in an obvious location, can more easily be re-used than if it were to live in gix-blame) which is why I’ll try that first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great to hear, thanks a lot!

Let's have it in a separate PR so I can merge it much more quickly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New PR: #1686.

cruessler and others added 17 commits November 25, 2024 17:08
Note that it's still very early, and this is more of a proof-of-concept
with a couple of bits still missing without which its usefulness
and real-world applicability is limited.
For now, it doesn't come with a simplified `gix` API though.
That way it's possible to see the `blame` result of any file in the
repository.

Co-authored-by: Sebastian Thiel <sebastian.thiel@icloud.com>
- Merge commits where a file was added in one chain of ancestors
- Merge commits where a conflict was resolved
`bisect_entry` is the wrong function to search for entries here as it
does not take nesting in paths into account.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

4 participants