Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore AST based diff tools #3

Open
0xdevalias opened this issue Jan 30, 2024 · 7 comments
Open

Explore AST based diff tools #3

0xdevalias opened this issue Jan 30, 2024 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@0xdevalias
Copy link
Owner

0xdevalias commented Jan 30, 2024

There can be a lot of 'noise' when diffing minimised bundled code, as the bundler will often change the minified variable names it uses at times between builds (even if the rest of the code hasn't changed)

We can attempt to reduce this by using non-default git diff modes such as patience / histogram / minimal:

⇒ git diff --diff-algorithm=default -- unpacked/_next/static/chunks/pages/_app.js | wc -l
  116000

⇒ git diff --diff-algorithm=patience -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35826

⇒ git diff --diff-algorithm=histogram -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35835

⇒ git diff --diff-algorithm=minimal -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35844

Musings

⭐ Suggestion

It would be cool if ast-grep was able to show a diff between 2 files, but do it using the AST rather than just a raw text compare. Ideally we would be able to provide options to this, such as ignoring chunks where the only change is to a variable/function name (eg. for diffing minimised JavaScript webpack builds)

Ideally the output would be text still (not the AST tree), but the actually diffing could be done at the AST level.

💻 Use Cases

This would be really useful for minimising the noise when diffing minimised source builds looking for the 'real changes' between the builds (not just minimised variable names churning, etc)

Looking through current diff output formats shows all of the variable name changes as well, which equates to a lot of noise while looking for the relevant changes.

Some alternative potential workarounds I've considered are either pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names, and then suppressing that chunk. Currently I'm just relying on git diff --diff-algorithm=minimal -- thefile.js

Originally posted by @0xdevalias in ast-grep/ast-grep#901

See Also

Additional links to review

@HerringtonDarkholme
Copy link

I would recommend difftastic for this! Actually rspack has already used it for checking diff between its output with that of webpack.

@0xdevalias
Copy link
Owner Author

0xdevalias commented Jan 30, 2024

I would recommend difftastic for this

rspack has already used it for checking diff between its output with that of webpack

@HerringtonDarkholme Interesting.. do you know if they did so while suppressing the 'noise' of changed variables? Or was it more just generally to ensure they were doing compatible things. I had a quick google, but didn't seem to turn up anything specific beyond the repo/etc:

Edit: Opened the following issue on difftastic:

Edit 2: And this one on diffsitter:


I was also just re-reading through the diffsitter README and noticed this section that I somehow missed in the past; which sounds like it might be exactly like what I want:

  • https://github.com/afnanenayet/diffsitter
    • A tree-sitter based AST difftool to get meaningful semantic diffs

    • You can also filter which tree sitter nodes are considered in the diff through the config file.

    • https://github.com/afnanenayet/diffsitter#node-filtering
      • You can filter the nodes that are considered in the diff by setting include_nodes or exclude_nodes in the config file. exclude_nodes always takes precedence over include_nodes, and the type of a node is the kind of a tree-sitter node.

        This feature currently only applies to leaf nodes, but we could exclude nodes recursively if there's demand for it.

@0xdevalias
Copy link
Owner Author

0xdevalias commented Jan 30, 2024

Though playing with diffsitter just now, it's output format seems to leave a lot to be desired compared to typical git diff unified output; and it doesn't show any context/etc currently:

eg. On a very minimal example, diffsitter:

image

⇒ git difftool --tool diffsitter HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-AOnHKy/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-ahyGIo/_buildManifest.js
===================================================================================

80:
---
-     "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],

80:
---
+     "/search": ["static/chunks/pages/search-d835393483b5432a.js"],

138:
----
+   "static/chunks/5054-e2060ddbea2abdb7.js"

138:
----
-   "static/chunks/5054-8ad3d13d663a6185.js"

Vs git diff (with delta):

image

⇒ git diff HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
diff --git a/unpacked/_next/static/[buildHash]/_buildManifest.js b/unpacked/_next/static/[buildHash]/_buildManifest.js
index 851a8f0..5004cc7 100644
--- a/unpacked/_next/static/[buildHash]/_buildManifest.js
+++ b/unpacked/_next/static/[buildHash]/_buildManifest.js
@@ -78,7 +78,7 @@
     "/payments/success-trial": [
       "static/chunks/pages/payments/success-trial-84597e34390c1506.js",
     ],
-    "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
+    "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
     "/share/e/[[...shareParams]]": [
       "static/chunks/pages/share/e/[[...shareParams]]-899e50f90dac9ff5.js",
     ],
@@ -136,6 +136,6 @@
   "static/chunks/5017-f7c5e142fc7f0516.js",
   "static/chunks/3975-37a9301353b29c5d.js",
   "static/chunks/3754-ae5dc2fb759ecfc1.js",
-  "static/chunks/5054-8ad3d13d663a6185.js"
+  "static/chunks/5054-e2060ddbea2abdb7.js"
 )),
   self.__BUILD_MANIFEST_CB && self.__BUILD_MANIFEST_CB();

This is the diffsitter config I was using:

  • ~/.config/diffsitter/config.json5
// Default: `diffsitter dump-default-config`
// See also: https://github.com/afnanenayet/diffsitter/blob/v0.8.1/assets/sample_config.json5
// Colours: `color256`, `black`, `red`, `green`, `yellow`, `blue`, `magenta`, `cyan`, `white`
{
  "formatting": {
    "default": "unified",
    "unified": {
      "addition": {
        "highlight": "green",
        "regular-foreground": "green",
        "emphasized-foreground": "white",
        "bold": true,
        "underline": false,
        "prefix": "+ "
      },
      "deletion": {
        "highlight": "red",
        "regular-foreground": "red",
        "emphasized-foreground": "white",
        "bold": true,
        "underline": false,
        "prefix": "- "
      }
    },
    "json": {
      "pretty_print": false
    },
    "custom": {}
  },
  "grammar": {
    "dylib-overrides": null,
    "file-associations": {
      "js": "typescript",
      "jsx": "tsx"
    },
  },
  "input-processing": {
    "split-graphemes": true,
    // You can exclude different tree sitter node types - this rule takes precedence over `include_kinds`.
    "exclude-kinds": null,
    // "exclude-kinds": ["string"],
    // You can specifically allow only certain tree sitter node types
    "include-kinds": null
    // "include-kinds": ["method_definition"],
  },
  // Specify a fallback command if diffsitter can't parse the given input
  // files. This is invoked by diffsitter as:
  //
  // ```sh
  // ${fallback_cmd} ${old} ${new}
  // ```
  "fallback-cmd": null,
  // "fallback-cmd": "diff",
}

And this is the .gitconfig I was using to run it as a git difftool:

# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter"]
  cmd = diffsitter "$LOCAL" "$REMOTE"

# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter-debug"]
  cmd = diffsitter --debug "$LOCAL" "$REMOTE"

Running it with --debug allows us to see the steps it takes during processing, and how long they take. On our small example file above, the output looks like this:

diffsitter debug output from minimal example
⇒ git difftool --tool diffsitter-debug --color=always HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
 2024-01-30T07:09:30.627Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js can be parsed
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG diffsitter           > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js can be parsed
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG diffsitter           > Extensions for both input files are supported
 2024-01-30T07:09:30.627Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js to string
 2024-01-30T07:09:30.627Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:09:30.628Z INFO  TimerFinished        > parse::parse_file(), Elapsed=968.896µs
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js to string
 2024-01-30T07:09:30.628Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:09:30.629Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:09:30.629Z INFO  TimerFinished        > parse::parse_file(), Elapsed=590.376µs
 2024-01-30T07:09:30.629Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=388.582µs
 2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::process(), Elapsed=1.083248ms
 2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=309.698µs
 2024-01-30T07:09:30.631Z INFO  TimerFinished        > ast::process(), Elapsed=997.049µs
 2024-01-30T07:09:30.631Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=127.357µs
 2024-01-30T07:09:30.631Z INFO  libdiffsitter::render::unified > Using stack style vertical for title
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js
===================================================================================

80:
---
-     "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],

80:
---
+     "/search": ["static/chunks/pages/search-d835393483b5432a.js"],

138:
----
+   "static/chunks/5054-e2060ddbea2abdb7.js"

138:
----
-   "static/chunks/5054-8ad3d13d663a6185.js"

Reducing that output to just show the relevant timing events:

// Start
2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:09:30.628Z INFO  TimerFinished        > parse::parse_file(), Elapsed=968.896µs
// ..snip..
2024-01-30T07:09:30.629Z INFO  TimerFinished        > parse::parse_file(), Elapsed=590.376µs
2024-01-30T07:09:30.629Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=388.582µs
2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::process(), Elapsed=1.083248ms
2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=309.698µs
2024-01-30T07:09:30.631Z INFO  TimerFinished        > ast::process(), Elapsed=997.049µs
2024-01-30T07:09:30.631Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=127.357µs
// ..snip..
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
// Finish

We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)

diffsitter debug output from large example
⇒ du -sh unpacked/_next/static/chunks/pages/_app.js
8.4M	unpacked/_next/static/chunks/pages/_app.js

⇒ cat unpacked/_next/static/chunks/pages/_app.js | wc -l
  249128

⇒ git difftool --tool diffsitter-debug HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js
 2024-01-30T07:07:58.282Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js can be parsed
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.282Z DEBUG diffsitter           > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js can be parsed
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.282Z DEBUG diffsitter           > Extensions for both input files are supported
 2024-01-30T07:07:58.287Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js to string
 2024-01-30T07:07:58.287Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.291Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:07:59.790Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
 2024-01-30T07:07:59.795Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js to string
 2024-01-30T07:07:59.795Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:59.800Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:08:01.160Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
 2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
 2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
 2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
 2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
 2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
 2024-01-30T07:20:22.502Z INFO  libdiffsitter::render::unified > Using stack style horizontal for title
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 52698 - 52698)
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Title string has length of 7
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing line 52698
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > End line 52698

// ..snip 17,524 lines..

 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96300
 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > End line 96300
 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301

// I'm not sure if this actually finished successfully, or just silently crashed at this point... I think it might have crashed, as I would have expected to see a log line like:
//   DEBUG libdiffsitter::render::unified > End hunk
// As well as the actual diff output

Reducing that output to just show the relevant timing events:

// Start
2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish

We can see that ~12.29 minutes was spent in diff::compute_edit_script(), and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk line that I would have expected to see.

Note: I was piping this command to subl (Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.

As a point of comparison, difftastic was seemingly able to diff this file in ~2.46sec total:

time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   2.63s user 0.46s system 47% cpu 6.494 total
subl  0.01s user 0.02s system 0% cpu 6.746 total

Though with a lot of this printed among the diff output:

_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)

It seems that when DFT_BYTE_LIMIT is exceeded, difftastic falls back to just doing a basic text diff, so those above timings aren't really a fair comparison. We can override this with --byte-limit and DFT_BYTE_LIMIT:

⇒ difftastic -h

Difftastic 0.52.0

..snip..

OPTIONS:

..snip..

        --byte-limit <LIMIT>           Use a text diff if either input file exceeds this size. [env:
                                       DFT_BYTE_LIMIT=] [default: 1000000]

..snip..

        --graph-limit <LIMIT>          Use a text diff if the structural graph exceed this number of
                                       nodes in memory. [env: DFT_GRAPH_LIMIT=] [default: 3000000]

Let's set this to something much higher than we need for now:

20*1024*1024 = 20971520

Updating our .gitconfig:

# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
  cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"

Running that command again:

time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   12.42s user 1.10s system 79% cpu 17.043 total
subl  0.01s user 0.02s system 0% cpu 17.248 total

It takes a bit longer, but then we hit a different error/limit:

_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)

From the help:

⇒ difft --help | subl

..snip..

        --parse-error-limit <LIMIT>
            Use a text diff if the number of parse errors exceeds this value.
            
            [env: DFT_PARSE_ERROR_LIMIT=]
            [default: 0]

There seem to be 674 matches of DFT_PARSE_ERROR_LIMIT in the output, which corresponds to the 674 chunks of output.

I didn't see any specific errors output related to the parsing though.. I wonder how we can see what the actual issue was?

It seems we can dump the AST either from tree-sitter directly, or the difftastic filtered version of it as a debug:

⇒ difft --help | subl

..snip..

DEBUG OPTIONS:
        --dump-syntax <PATH>
            Parse a single file with tree-sitter and display the difftastic syntax tree.

        --dump-ts <PATH>
            Parse a single file with tree-sitter and display the tree-sitter parse tree.

Full tree-sitter AST:

time difft --dump-ts unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic-tree-sitter.ast
difft --dump-ts unpacked/_next/static/chunks/pages/_app.js >   5.08s user 9.49s system 95% cpu 15.312 total

⇒ du -h _app.js-difftastic-tree-sitter.ast
224M	_app.js-difftastic-tree-sitter.ast

⇒ cat _app.js-difftastic-tree-sitter.ast| wc -l
 2533687

difftastic 'syntax' AST:

time difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic.ast
difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js >   434.80s user 43.33s system 93% cpu 8:29.31 total

⇒ du -h _app.js-difftastic.ast
2.2G	_app.js-difftastic.ast

⇒ cat _app.js-difftastic.ast | wc -l
 10571815

@0xdevalias
Copy link
Owner Author

0xdevalias commented Feb 4, 2024

After the above explorations, I ended up taking a different tact and started exploring the 'post processing git diff' approach.

Initial PoC tests seemed to show some merit, so I hacked them together into a script that can filter an existing git diff passed in from a file, or STDIN:

  • https://twitter.com/_devalias/status/1752260576066301956
    • Some alternative potential workarounds I've considered are: pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names; then suppressing that chunk.

  • https://twitter.com/_devalias/status/1752646128372535530
    • After exploring/playing around with a bunch of tools yesterday, and a hyperfocus of hacking out some PoC code this afternoon/evening, I think I have the elements of a workable solution! 🎉

      Just need to pull it all together and refine it once the 🧠 is refreshed/rested..

  • https://twitter.com/_devalias/status/1753322272293896385
    • The code is still pretty hacky.. but pulled the PoC's together into a usable script for filtering the noise in the diff output!

      Still not perfect.. but I am really liking these stats so far!

      For an 8.4mb _app.js file (250,022 lines):

      Original diff: 33,399
      Filtered: 7,516

  • https://twitter.com/_devalias/status/1753322275984867513
    • There are still a bunch of areas where my partial AST parsing is getting errors, which means more noise will slip through till I can fix those.. but this is already super useful!

      Hopefully will get a version of it cleaned up/committed/pushed sooner rather than later.

  • https://twitter.com/_devalias/status/1753711267137892684
    • A few more updates that I posted elsewhere as I was going; from 7100, to 6455, to 4268, and now down to 3913 lines.

      • Added a few more ‘pre AST parsing’ fixups that get the diff down even further: from the previous 7100 lines to 6455 lines

      • Trialled parsing the AST with a more lenient parser as a pre-step; filtering out unknown bits, then generating that back into code; that I then apply some of my previous ‘preparation hacks’ to, before parsing it with the more strict parser and doing the identifier normalisation.

        That’s currently reduced the 6455 lines down to 4268 lines; and there are still leftover parsing errors that could potentially get that number lower still that I’m chipping away at.

        Another alternative could be to completely replace the stricter parser with the more lenient one; but then I would have to rewrite a bunch of other code as well.

        I could also potentially try a different lenient parser; or pre-process the code being passed to this one; as I believe it’s failing to parse some statements even; which is potentially leading to more of that noise filtering through..

      • I added a pre-processing step to the lenient parser which improved things a bit; then I added an actual fix to the AST from the lenient parser to help fix more complex ternaries that were challenging to fix in my simpler string append pre/post fix methods.

        With those, we’re now down to 3913 lines.

        Being able to properly detect and fix object properties that are missing their wrapping object would probably fix a bunch more of the issues.

        Though maybe when I look at this next; I’ll just see whether using the lenient parser on its own gives a quicker/better result.

@0xdevalias 0xdevalias added the enhancement New feature or request label Feb 24, 2024
@0xdevalias
Copy link
Owner Author

0xdevalias commented Feb 24, 2024

With the current PoC implementations in the diff minimiser, we're grouping by the diff chunks, and then by added/removed lines within those chunks.

Sometimes a section of code will churn and move a large amount of lines from one chunk, to another chunk, which won't get noticed by the minimiser currently.

It would be good if we could process these in a similar way to git diff's color-moved, with our 'ignore identifier changes', so that we can suppress both the 'added' and 'removed' chunks if they're otherwise unchanged (or minimise the diff between them to only show what has changed, rather than showing them as large 'disconnected blocks')


Edit: See also:

--

VSCode recently added a new feature called move code detection. The algorithm is surprisingly simple: after normal diff, calculate similarity(deletion, insertion) for every deletion & insertion pair. The code can be found here https://github.com/microsoft/vscode/blob/166097a20cbd06d10d255ef561837c439f372de3/src/vs/editor/common/diff/defaultLinesDiffComputer/computeMovedLines.ts#L44-L85.

But intuitively demonstrating the moving relationship on TUI might be a challenge.

Originally posted by @QuarticCat in Wilfred/difftastic#508 (comment)

--

Have you looked into git diff's --color-moved at all?

Personally I tend to use the zebra mode of it. From my git aliases:

  # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-movedltmodegt
  # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-moved-wsltmodesgt
  # https://git-scm.com/docs/git-config#Documentation/git-config.txt-color
  # https://git-scm.com/docs/git-config#Documentation/git-config.txt-colordiffltslotgt
  diff-refactor = \
    -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal

Originally posted by @0xdevalias in Wilfred/difftastic#539 (comment)

@0xdevalias
Copy link
Owner Author

A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:

What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above:

When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone.

Originally posted by @0xdevalias in #10 (comment)

0xdevalias referenced this issue in 0xdevalias/poc-ast-tools Mar 28, 2024
@0xdevalias
Copy link
Owner Author

0xdevalias commented Apr 24, 2024

Here's a fun little hack I just thought up (while the diff minimiser script still breaks --color-moved):

  • Diff 2 files, run them through the minimiser, save that diff to a file
  • Apply the output of the minimised diff to the original file, save that to a file
  • Re-run diff with --color-moved between the original file, and the original+minimisedDiffPatch file

Currently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory):

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \
sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \
sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \
> _app.minimised.diff

⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js

⇒ git apply _app.minimised.diff

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js

Edit: I wrote a new scripts/fix-diff-headers.js (f7deec7) script that seems to get it closer to working..

⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff
Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff
Diff successfully corrected and written to _app.fixed.diff

⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js
patching file _app.original.js
6 out of 90 hunks failed--saving rejects to _app.patched.js.rej

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js

That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved; we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them:

image

An even better version of this would be a diff --color-moved implementation that was able to ignore identifier names changing entirely (in which case I suspect it would mark this entire block as 'moved but not changed'); but as far as I am currently aware, the only way to do that would be to basically re-implement the whole diff --color-moved logic ourselves; which would basically end up with us writing our own AST differ I believe.

Though the way I was originally intending on fixing/improving the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end.

Originally posted by @0xdevalias in #10 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants