Repack uncompressed & diff visualizer for ZIP based files stored in git repos

hoijui/ReZipDoc

ReZipDoc

ReZipDoc repacks ZIP based files without compression and visualizes their diffs, for files stored in git repos.

Most git repos hosting Open Source Hardware should use ReZipDoc.

What is this?

git does not like binary files. They make the repo grow quickly in size (see delta compression), and when you try to see what changed in a commit, you only get this:

Binary files A and B differ!

... not very useful!

ReZipDoc solves both of these issues, though only for ZIP based files, which includes for example FreeCAD and LibreOffice files.

NOTE It does not work for all binary files!

HINT If you are unsure whether a file format is ZIP based, just try to open a file of that format with a tool that can peek into ZIP files.
On Linux or OSX: unzip -l someFile.xyz
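
If unzip is not at hand, the first bytes of the file can serve as a rough check. The following is a minimal sketch, not part of ReZipDoc: the function name is_zip_based is made up for illustration, and it assumes the file starts with the usual ZIP local-file-header signature (the bytes "PK" followed by 0x03 0x04). Empty archives and archives with prepended data (such as self-extracting ones) start differently, so a negative result is not conclusive.

```shell
# Hypothetical helper: succeeds if the file starts with the common
# ZIP local file header signature "PK\003\004" (hex 50 4b 03 04).
is_zip_based() {
    # Read the first 4 bytes as hex and strip all whitespace.
    magic="$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d '[:space:]')"
    [ "$magic" = "504b0304" ]
}

# Usage:
#   is_zip_based someFile.xyz && echo "looks ZIP based"
```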

So if you are storing ZIP based files in your git repo, you probably want to use ReZipDoc.

Project state

This repo contains a heavily revised, refined version of ReZip (and ZipDoc), plus unit tests and helper scripts, which were not available in the original.

How to use

If your git repo makes heavy use of ZIP based files, then you probably want to use ReZipDoc in one of these three ways:

  • install ZipDoc diff viewer - This allows you to see changes within your ZIP based files in a human-readable way when looking at git history. It changes neither your past nor your future git history.

    To use this, install with --diff only.

  • install ReZip filter - This will change the future history of your git repo, storing ZIP based files without compression.

    To use this, install with --commit --diff --renormalize.

  • install ReZip filter & filter repo - This changes both the past (<- Caution!) and future history of your repo.

    To use this, create a copy of the repo with filtered history.

Installation

The filter and diff tool require Java 8 or newer.

The helper scripts - which are mostly used for installing the filter - require a POSIX (~= Unix) environment. This is the case on OSX, Linux, BSD, Unix and even Windows, if git is installed.
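
A quick way to see whether the required tools are available is sketched below. The function name missing_tools is made up for illustration and is not part of ReZipDoc. It only checks that the commands can be found; it does not verify that the installed Java is version 8 or newer.

```shell
# Hypothetical helper: prints the name of each given tool
# that cannot be found in PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools used in this README; no output means all of them were found.
missing_tools git java curl
```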

The recommended procedure is to install the helper scripts once, and then use them to comfortably install the filter into local git repos.

NOTE
This downloads a script from the internet and executes it on your machine, which is a potential security risk. You may want to inspect the script before running it.

Install helper scripts

NOTE
This has to be done once per developer machine.

The scripts are installed into ~/bin/; if that directory did not exist before, it is added to PATH.

To install:

curl --silent --location \
  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
  | sh -s install --path
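
If you prefer to read the script before it runs, one option is to download it to a file first instead of piping it straight into sh. This is a sketch under the assumption that the script behaves the same when started from a file as when read from stdin; the function name download_script and the path under /tmp are made up for illustration.

```shell
# Hypothetical helper: download a script to a local file so it
# can be reviewed before it is executed.
download_script() {
    url="$1"
    dest="$2"
    curl --silent --show-error --fail --location --output "$dest" "$url"
}

# Usage:
#   download_script \
#     https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
#     /tmp/rezipdoc-scripts-tool.sh
#   less /tmp/rezipdoc-scripts-tool.sh                 # review it
#   sh /tmp/rezipdoc-scripts-tool.sh install --path    # then run it
```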

To update (to latest development version):

curl --silent --location \
  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
  | sh -s update --dev

To remove:

curl --silent --location \
  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
  | sh -s remove

Install diff viewer or filter

NOTE
This has to be done once per repo.

This installs the latest release of ReZipDoc into your local git repo.

Make sure you already have installed the helper scripts on your machine.

Switch to the local git repo you want to install this filter to, for example:

cd ~/src/myRepo/

As explained in How to use, you now want to use one of the following:

  1. Install the diff viewer

    rezipdoc-repo-tool.sh install --diff
  2. Install the filter

    rezipdoc-repo-tool.sh install --commit --renormalize
  3. Filter the history & install the filter

    If you filter the repo history, the freshly created, filtered repo will already have the filter installed as above.

To uninstall the diff viewer and/or filter, run:

rezipdoc-repo-tool.sh remove

Install filter manually

Only use this if, for some reason, you cannot use the above.

  1. Build the JAR

    Run this in bash:

    cd
    mkdir -p src
    cd src
    git clone git@github.com:hoijui/ReZipDoc.git
    cd ReZipDoc
    mvn package
    echo "Created ReZipDoc binary:"
    ls -1 "$PWD"/target/rezipdoc-*.jar
  2. Install the JAR

    Store rezipdoc-*.jar somewhere locally, either:

    • (global) in your home directory, for example under ~/bin/
    • (repo - tracked) in your repository, tracked, for example under /tools/
    • (repo - local) recommended in your repository, locally only, under /.git/
  3. Install the Filter(s)

    Execute these lines:

    # Install the add/commit filter
    git config --replace-all filter.reZip.clean "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ReZip --uncompressed"
    
    # (optionally) Install the checkout filter
    git config --replace-all filter.reZip.smudge "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ReZip --compressed"
    
    # (optionally) Install the diff filter
    git config --replace-all diff.zipDoc.textconv "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ZipDoc"
  4. Enable the filters

    In one of these files:

    • (global) ${HOME}/.gitattributes
    • (repo - tracked) /.gitattributes
    • (repo - local) recommended /.git/info/attributes

    Assign attributes to paths:

    # This forces git to treat files as if they were text-based (for example in diffs)
    [attr]textual     diff merge text
    # This makes git re-zip ZIP files uncompressed on commit
    # NOTE See the ReZipDoc README for how to install the required git filter
    [attr]reZip       textual filter=reZip
    # This makes git visualize ZIP files as uncompressed text with some meta info
    # NOTE See the ReZipDoc README for how to install the required git filter
    [attr]zipDoc      textual diff=zipDoc
    # This combines in-history decompression and uncompressed view of ZIP files
    [attr]reZipDoc    reZip zipDoc
    
    # MS Office
    *.docx   reZipDoc
    *.xlsx   reZipDoc
    *.pptx   reZipDoc
    # OpenOffice
    *.odt    reZipDoc
    *.ods    reZipDoc
    *.odp    reZipDoc
    # Misc
    *.mcdx   reZipDoc
    *.slx    reZipDoc
    # Archives
    *.zip    reZipDoc
    # Java archives
    *.jar    reZipDoc
    # FreeCAD files
    *.fcstd  reZipDoc
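
To check that the attributes above are picked up, git check-attr can report the effective filter and diff drivers for a path, and git config --get shows the commands registered in step 3. The sketch below wraps the former in a function whose name, show_rezipdoc_attrs, is made up for illustration. With the attribute definitions above, a *.docx path should report filter: reZip and diff: zipDoc.

```shell
# Hypothetical helper: print the effective filter and diff
# attributes of the given paths (run inside the repo).
show_rezipdoc_attrs() {
    git check-attr filter diff -- "$@"
}

# Usage, inside the repo:
#   show_rezipdoc_attrs report.docx model.fcstd
#   git config --get filter.reZip.clean
#   git config --get diff.zipDoc.textconv
```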

Filter repo history

This always creates a new copy of the repository.

NOTE
This only filters a single branch.

Make sure you have the helper scripts installed and in your PATH.

This filters the master branch of the repo at ~/src/myRepo into a new local repo ~/src/myRepo_filtered, using the original commit messages, authors and dates:

rezipdoc-history-filter.sh \
	--source ~/src/myRepo \
	--branch master \
	--orig \
	--target ~/src/myRepo_filtered

It also works with an online source:

rezipdoc-history-filter.sh \
	--source "https://github.com/case06/ZACplus.git" \
	--branch master \
	--orig \
	--target /tmp/ZACplus_filtered

After doing this, the new, filtered repo will already have the filter installed, so future commits will be filtered.

Filtering example

We are going to run a script that filters the repo of the Zinc-Oxide Open Hardware battery (ZAC+) project. The script has a header comment explaining in detail what it does.

In short, it downloads ReZipDoc helper scripts to ~/bin, adds that dir to PATH if it is not there yet, creates temporary git repos in /tmp/, and generates some command-line output.

Run it like this:

curl --silent --location \
  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-sample-filter-session.sh \
  | sh

Caveats

As described in gitattributes, you may see unnecessary merge conflicts when you add attributes to a file that cause the repository format for that file to change. To prevent this, Git can be told to run a virtual check-out and check-in of all three stages of a file when resolving a three-way merge:

git config --add --bool merge.renormalize true

Motivation

Many popular applications, such as Microsoft Office and Libre/Open Office, save their documents as XML in compressed ZIP containers. Small changes to these documents' contents may result in big changes to their compressed binary container file. When compressed files are stored in a Git repository, these big differences make delta compression inefficient or impossible, and the repository size is roughly the sum of its revisions.

This small program acts as a Git clean filter driver. It reads a ZIP file from stdin and outputs the same ZIP content to stdout, but without compression.

pros
  • human-readable, plain-text diffs of (ZIP based) archives, if they contain plain-text files
  • smaller overall repository size if the archive contents change frequently
cons
  • slower git add/git commit process
  • slower checkout process, if the smudge filter is used

How it works

When adding/committing a ZIP based file, ReZip unpacks it and repacks it without compression, before adding it to the index/commit. In an uncompressed ZIP file, the archived files appear as-is in its content (together with some binary meta-info before each file). If those archived files are plain-text files, this method will play nicely with git.
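
The effect of the repacking step can be reproduced by hand with the Info-ZIP command-line tools. This is only a sketch of the idea and not how ReZip is implemented (ReZip is a Java program that works on streams). The function name rezip_uncompressed is made up for illustration, and the sketch does not preserve the original order of the entries, which some formats rely on.

```shell
# Hypothetical helper: unpack a ZIP file and repack it without
# compression (zip -0 means "store only").
rezip_uncompressed() {
    rz_in="$1"
    rz_out="$2"
    # zip runs inside the work dir, so the output path must be absolute.
    case "$rz_out" in /*) ;; *) rz_out="$PWD/$rz_out" ;; esac
    rz_work="$(mktemp -d)" || return 1
    if ! unzip -q "$rz_in" -d "$rz_work"; then
        rm -rf "$rz_work"
        return 1
    fi
    # Start from scratch; zip would otherwise update an existing archive.
    rm -f "$rz_out"
    ( cd "$rz_work" && zip -q -0 -X -r "$rz_out" . )
    rz_status=$?
    rm -rf "$rz_work"
    return "$rz_status"
}

# Usage:
#   rezip_uncompressed doc.odt doc_uncompressed.odt
#   unzip -v doc_uncompressed.odt   # the Method column shows "Stored"
```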

Benefits

The main benefit of ReZip over Zippey is that the file stored in the repository is still a ZIP file. In many cases it therefore still works as-is with the respective application (for example Open Office), even when it is obtained without passing through the smudge filter that re-packs it with compression, for example when it is downloaded through a web interface instead of being checked out with git.

Observations

The following are based on my experience in real-world cases. Use at your own risk. Your mileage may vary.

Simulink

  • One packed repository with ReZip was 54% of the size of the packed repository storing compressed ZIPs.
  • Another repository with 280 *.slx files and over 3000 commits was originally 281 MB and was reduced to 156 MB using this technique (55% of baseline).

MS PowerPoint

I found that the loose objects stored without this filter were about 5% smaller than the original files (zlib on top of ZIP compression). With the ReZip filter, the loose objects were about 10% smaller than the original files, since zlib works more efficiently on uncompressed data. The packed repository with ReZip was only 10% smaller than the packed repository storing compressed ZIPs. I think this unremarkable improvement is due to the large number of *.png files in the presentation, which were already stored without compression in the original *.pptx.
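
To gather comparable numbers for your own repos, the packed size can be read from git count-objects after a garbage collection. The sketch below is one way to do that; the function name packed_size_kib is made up for illustration. Note that it runs git gc on the given repo, which repacks it.

```shell
# Hypothetical helper: repack the repo, then print the size of
# its pack files in KiB (the "size-pack" value).
packed_size_kib() {
    repo="$1"
    git -C "$repo" gc --quiet || return 1
    git -C "$repo" count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Usage, comparing an original repo with its filtered copy:
#   packed_size_kib ~/src/myRepo
#   packed_size_kib ~/src/myRepo_filtered
```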

Based on

  • ReZip For more efficient Git packing of ZIP based files
  • ZipDoc A Git textconv program to show text-based diffs of ZIP files

Similar Projects

  • png-inflate Does the same uncompressed repack for PNG image files
