v1.6.0
circulosmeos committed Apr 19, 2023
1 parent 1896382 commit 4ffcf21
Showing 4 changed files with 35 additions and 20 deletions.
25 changes: 16 additions & 9 deletions README.md
@@ -98,14 +98,14 @@ Copy gztool.c to the directory where you compiled zlib, and do:
Usage
=====

gztool (v1.5.2)
gztool (v1.6.0)
GZIP files indexer, compressor and data retriever.
Create small indexes for gzipped files and use them
for quick and random-positioned data extraction.
No more waiting when the end of a 10 GiB gzip is needed!
//github.com/circulosmeos/gztool (by Roberto S. Galende)

$ gztool [-[abLnsv] #] [-[1..9]AcCdDeEfFhilpPrRStTwWxXz|u[cCdD]] [-I <INDEX>] <FILE>...
$ gztool [-[abLnsv] #] [-[1..9]AcCdDeEfFhilpPrRStTwWxXzZ|u[cCdD]] [-I <INDEX>] <FILE>...

Note that actions `-bcStT` proceed to an index file creation (if
none exists) INTERLEAVED with data flow. As data flow and
@@ -173,6 +173,7 @@ Usage
This is implicit unless `-X` or `-z` are indicated.
-X: like `-x`, but newline character is '\r' (old mac).
-z: create index without line number information.
-Z: adjust index points to a byte boundary: no previous byte needed.

EXAMPLE: Extract data from 1 GiB byte (byte 2^30) on,
from `myfile.gz` to the file `myfile.txt`. Also gztool will
@@ -313,14 +314,14 @@ The same applies to `-S` though in this case there's no output, as only the inde
$ gztool -ell *.gzi

Checking index file 'accounting.gzi' ...
Size of index file: 184577 Bytes (0.37%/gzip)
Guessed gzip file name: 'accounting.gz' (66.05%) ( 50172261 Bytes )
Number of index points: 15
Size of index file (v0) : 184577 Bytes (0.37%/gzip)
Guessed gzip file name : 'accounting.gz' (66.05%) ( 50172261 Bytes )
Number of index points : 15
Size of uncompressed file: 147773440 Bytes
Compression factor : 66.05%
List of points:
@ compressed/uncompressed byte (index data size in Bytes @window's beginning at index file), ...
#1: @ 10 / 0 ( 0 @56 ), #2: @ 3059779 / 10495261 ( 13127 @80 ), #3: @ 6418423 / 21210594 ( 6818 @13231 ), #4: @ 9534259 / 31720206 ( 7238 @20073 )...
#: @ compressed/uncompressed byte (window data size in Bytes @window's beginning at index file) !bits needed from previous byte, ...
#1: @ 10 / 0 ( 0 @56 ) !0, #2: @ 3059779 / 10495261 ( 13127 @80 ) !2, #3: @ 6418423 / 21210594 ( 6818 @13231 ) !0, #4: @ 9534259 / 31720206 ( 7238 @20073 ) !7...
...

If `gztool` finds the gzip file companion of the index file, some statistics are shown, like the index/gzip size ratio, or the ratio of compression of the gzip file.
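The new per-point listing can be post-processed by a script. What follows is only a minimal sketch in Python, assuming the entry layout is exactly the one visible in the sample output above (`#N: @ compressed / uncompressed ( window_size @window_offset ) !bits`). The names `IndexPoint`, `POINT_RE` and `parse_points` are hypothetical helpers and not part of `gztool`.

```python
import re
from typing import List, NamedTuple


class IndexPoint(NamedTuple):
    """One entry of the 'List of points' section (field names are illustrative)."""
    number: int          # ordinal of the index point (#N)
    compressed: int      # byte position in the gzip file
    uncompressed: int    # byte position in the uncompressed data
    window_size: int     # size in bytes of the window data stored in the index
    window_offset: int   # position of that window inside the index file
    prev_bits: int       # bits needed from the previous byte (the '!N' field)


# Assumption: this pattern mirrors the v1.6.0 sample output shown above.
POINT_RE = re.compile(
    r"#(\d+):\s*@\s*(\d+)\s*/\s*(\d+)\s*\(\s*(\d+)\s*@(\d+)\s*\)\s*!(\d+)"
)


def parse_points(listing: str) -> List[IndexPoint]:
    """Extract every index point found in a chunk of `gztool -ell` output."""
    return [IndexPoint(*map(int, m.groups())) for m in POINT_RE.finditer(listing)]


sample = (
    "#1: @ 10 / 0 ( 0 @56 ) !0, #2: @ 3059779 / 10495261 ( 13127 @80 ) !2, "
    "#3: @ 6418423 / 21210594 ( 6818 @13231 ) !0, "
    "#4: @ 9534259 / 31720206 ( 7238 @20073 ) !7..."
)
points = parse_points(sample)
for p in points:
    print(p.number, p.compressed, p.uncompressed, p.prev_bits)
```

The legend line that starts with `#:` is skipped because the pattern requires a digit right after `#`.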
@@ -339,7 +340,13 @@ In this latter case only a pair of index+gzip filenames can be indicated with ea

Take into account that, as shown, the first byte of the truncated `gzip_filename.gz` file is numbered **100001**, that is, the bytes retain the order number in which they appear in the original file (that's the reason why it is not the *1* byte).

Please, note that index point positions at index file **may require also the previous byte** to be available in the truncated gzip file, as a gzip stream is not byte-rounded but a stream of pure bits. Thus **if you're thinking on truncating a gzip file, please do it always at least by one byte before the indicated index point in the gzip** - as said, it may not be needed, but in 7 of 8 cases it is needed.
Please note that index point positions in the index file **may also require the previous byte** to be available in the truncated gzip file, as a gzip stream is not byte-aligned but a stream of pure bits. Thus **if you are thinking of truncating a gzip file, always cut at least one byte before the indicated index point in the gzip**: as said, that byte may not be needed, but in 7 of 8 cases it is. **Another option is to use `-Z` when creating the index, as indicated below**.

* Create an index for a gzip file in which every index entry point is adjusted to a byte boundary, so no bits from the previous byte are needed. In general, the byte at which an index entry point begins is not a clean cut point, because the gzip window may need up to 7 bits from the previous byte: *gzip* is a bit-level stream compressor. With `-Z` the cut point is always clean and no bits from the previous byte are required. As a result, index points can be spaced by more than "span_between_points" bytes, so the index may contain fewer points. This is nonetheless completely safe and sound.

$ gztool -Z my_gzip_file.gz

`-Z` exists since gztool **v1.6.0**.
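The "7 of 8 cases" figure and the one-byte safety margin can be made concrete with a small sketch in Python. It assumes that the `!N` value printed by `-ll` is the number of bits (0 to 7) needed from the previous byte, and that `0` means the point is already byte-aligned. `first_byte_to_keep` is a hypothetical helper, not a `gztool` function, and it uses the same byte numbering as the index listing.

```python
def first_byte_to_keep(compressed_offset: int, prev_bits: int) -> int:
    """Return the earliest gzip byte that must survive a truncation.

    compressed_offset: index point position in the gzip file.
    prev_bits: bits needed from the previous byte (0..7); assumed meaning
               of the '!N' field in the index listing.
    """
    if not 0 <= prev_bits <= 7:
        raise ValueError("prev_bits must be between 0 and 7")
    # Any non-zero bit count means the previous byte is still required.
    return compressed_offset - 1 if prev_bits else compressed_offset


# Of the 8 possible bit offsets, only offset 0 is a clean byte boundary.
needing_previous_byte = sum(
    1 for bits in range(8) if first_byte_to_keep(100, bits) == 99
)
print(needing_previous_byte, "of 8 bit offsets need the previous byte")
```

In an index created with `-Z`, every point is expected to report `!0`, so the function would return the index point position unchanged.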

* Since v1.5.0, using `-[fW]` (`-f`: force index overwriting; `-W`: do not write index) with `-[ST]` (`-S`: create index on still-growing gzip file; `-T`: tail and continue decompressing to stdout) indicates `gztool` to continue operations even after the source file is overwritten. If using `-f`, the index file will be overwritten. For example:

@@ -438,7 +445,7 @@ Other interesting links
Version
=======

This version is **v1.5.2**.
This version is **v1.6.0**.

Please, read the *Disclaimer*. In case of any errors, please open an [issue](https://github.com/circulosmeos/gztool/issues).

2 changes: 1 addition & 1 deletion configure.ac
@@ -1,4 +1,4 @@
AC_INIT([gztool], [1.5.2], [roberto.s.galende@gmail.com])
AC_INIT([gztool], [1.6.0], [roberto.s.galende@gmail.com])
AM_INIT_AUTOMAKE([-Wall -Werror foreign])
AC_PROG_CC
AC_PROG_CC_C99
24 changes: 16 additions & 8 deletions gztool.1
@@ -4,7 +4,7 @@
.\" First parameter, NAME, should be all caps
.\" Second parameter, SECTION, should be 1-8, maybe w/ subsection
.\" other parameters are allowed: see man(7), man(1)
.TH gztool 1 "Mar 15 2023" "gztool v1.5.2"
.TH gztool 1 "Apr 20 2023" "gztool v1.6.0"
.\" Please adjust this date whenever revising the manpage.
.\"
.\" Some roff macros, for reference:
@@ -21,7 +21,7 @@
gztool \- extract random-positioned data from gzip files, even like `tail -f`
.SH SYNOPSIS
.B gztool
.RI \ [\ [-[abLnsv]\ #]\ [-[1..9]AcCdDeEfFhilpPrRStTwWxXz|u[cCdD]]\ [-I\ <INDEX>]\ ]\ "files"\ ...
.RI \ [\ [-[abLnsv]\ #]\ [-[1..9]AcCdDeEfFhilpPrRStTwWxXzZ|u[cCdD]]\ [-I\ <INDEX>]\ ]\ "files"\ ...
.br

Note that actions `-bcStT` proceed to an index file creation (if
@@ -182,6 +182,9 @@ like `-x`, but newline character is '\\r' (old mac).
.TP
.BR \-z
create index without line number information.
.TP
.BR \-Z
adjust index points to a byte boundary: no previous byte needed.
.br
.SH QUICK EXAMPLE
Extract data from 1 GiB byte (byte 2^30) on,
@@ -263,19 +266,19 @@ Creating and index for all "*gz" files in a directory:
.br


* Extract all data from a \fBrsyslog's veryRobustZip\fP (//www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html#veryrobustzip) that contains dirty data. This *corrupted-gzip-files* can arise when using \fBrsyslog's veryRobustZip omfile option\fP and the process that is logging is abruptly terminated and then restarted - this produces an incorrectly-terminated-gzip stream that is followed by another gzip stream **in the same file**. `gzip` (nor `zlib`) cannot read this files beyond the point of error. But `gztool` can correctly extract all data (and only good data) using `-p` (*patch*) parameter:
* Extract all data from a \fBrsyslog's veryRobustZip\fP (//www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html#veryrobustzip) file that contains dirty data. These *corrupted-gzip-files* can arise when using \fBrsyslog's veryRobustZip omfile option\fP and the logging process is abruptly terminated and then restarted: this produces an incorrectly terminated gzip stream that is followed by another gzip stream \fBin the same file\fP. Neither `gzip` nor `zlib` can read these files beyond the point of error. But `gztool` can correctly extract all data (and only good data) using the `-p` (*patch*) parameter:

.BR \ \ \ \ $\ gztool\ -p\ -b0\ compressed_text_file.gz
.br

This creates, as usual, the index file `compressed_text_file.gzi`. In order to not create it, `-W` (*do not Write index*) can be used:
This creates, as usual, the index file `compressed_text_file.gzi`. In order to not create it, `-W` (\fIdo not Write index\fP) can be used:

.BR \ \ \ \ $\ gztool\ -pWb0\ compressed_text_file.gz
.br

Note that `-p` can require up to twice the time for decompression, because it performs two decompression processes: the usual one, and another one that is performed **in advance** of the usual and which is the one that detects errors, marks them, and finds new entry points to end/begin the decompression circumventing the problems.
Note that `-p` can require up to twice the time for decompression, because it performs two decompression processes: the usual one, and another one that is performed \fBin advance\fP of the usual and which is the one that detects errors, marks them, and finds new entry points to end/begin the decompression circumventing the problems.
.br
Note also that these *corrupted-gzip-files* should be always decompressed with `-p` parameter, even if a `gztool` index file exists for them, because the index file stores entry points, but does not store where do errors occur in the `gzip` file.
Note also that these \fIcorrupted-gzip-files\fP should always be decompressed with the `-p` parameter, even if a `gztool` index file exists for them, because the index file stores entry points, but does not store where errors occur in the `gzip` file.
That said, if the `-[bL]` point of extraction is beyond the point(s) of error in the `gzip` file and an index file exists, then the decompression can proceed fine without `-p`, as the index points stored in the index file are always clean.
.br

@@ -351,11 +354,16 @@ In this latter case only a pair of index+gzip filenames can be indicated with ea
.BR \ \ \ \ $\ gztool\ -n\ 100001\ -b\ 20M\ gzip_filename.gz
.br

Take into account that, as shown, the first byte of the truncated `gzip_filename.gz` file is numbered **100001**, that is, the bytes retain the order number in which they appear in the original file (that's the reason why it is not the *1* byte).
Take into account that, as shown, the first byte of the truncated `gzip_filename.gz` file is numbered \fB100001\fP, that is, the bytes retain the order number in which they appear in the original file (that's the reason why it is not the \fB1\fP byte).
.br
Please, note that index point positions at index file \fBmay require also the previous byte\fP to be available in the truncated gzip file, as gzip stream is not byte-rounded but a stream of pure bits. Thus \fIif you're thinking on truncating a gzip file, please do it always at least by one byte before the indicated index point in the gzip\fP - as said, it may not be needed, but in 7 of 8 cases it is needed.
Please note that index point positions in the index file \fBmay also require the previous byte\fP to be available in the truncated gzip file, as a gzip stream is not byte-aligned but a stream of pure bits. Thus \fIif you are thinking of truncating a gzip file, always cut at least one byte before the indicated index point in the gzip\fP: as said, that byte may not be needed, but in 7 of 8 cases it is. \fBAnother option is to use `-Z` when creating the index, as indicated below.\fP
.br

* Create an index for a gzip file in which every index entry point is adjusted to a byte boundary, so no bits from the previous byte are needed. In general, the byte at which an index entry point begins is not a clean cut point, because the gzip window may need up to 7 bits from the previous byte: \fBgzip\fP is a bit-level stream compressor. With `-Z` the cut point is always clean and no bits from the previous byte are required. As a result, index points can be spaced by more than `-s` bytes, so the index may contain fewer points. This is nonetheless completely safe and sound.

$ gztool -Z my_gzip_file.gz

`-Z` exists since gztool \fBv1.6.0\fP.

* Since v1.5.0, using `-[fW]` (`-f`: force index overwriting; `-W`: do not write index) with `-[ST]` (`-S`: create index on still-growing gzip file; `-T`: tail and continue decompressing to stdout) indicates `gztool` to continue operations even after the source file is overwritten. If using `-f`, the index file will be overwritten. For example:

4 changes: 2 additions & 2 deletions gztool.c
@@ -13,7 +13,7 @@
//
// LICENSE:
//
// v0.1 to v1.5* by Roberto S. Galende, 2019, 2020, 2021, 2022, 2023
// v0.1 to v1.6* by Roberto S. Galende, 2019, 2020, 2021, 2022, 2023
// //github.com/circulosmeos/gztool
// A work by Roberto S. Galende
// distributed under the same License terms covering
@@ -123,7 +123,7 @@
#include <config.h>
#else
#define PACKAGE_NAME "gztool"
#define PACKAGE_VERSION "1.5.2"
#define PACKAGE_VERSION "1.6.0"
#endif

#define _XOPEN_SOURCE 500 // expose <unistd.h>'s pread()