
g2snapshot

If you are beginning your journey with Senzing, please start with Senzing Quick Start guides.

You are in the Senzing Garage where projects are "tinkered" on. Although this GitHub repository may help you understand an approach to using Senzing, it's not considered to be "production ready" and is not considered to be part of the Senzing product. Heck, it may not even be appropriate for your application of Senzing!

Overview

The following snapshot utilities analyze the data in a Senzing repository to calculate the following reports:

  • dataSourceSummary - calculates the duplicates, possible matches and relations by data source.
  • crossSourceSummary - calculates the duplicates, possible matches and relations across data sources.
  • entitySizeBreakdown - calculates how many entities have how many records, highlighting possible instances of overmatching.

These reports are placed in a JSON file that can be viewed with G2Explorer, located here: https://github.com/senzing-garage/g2explorer
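Since the reports land in a single JSON file, you can also inspect them programmatically before loading them into G2Explorer. Below is a minimal Python sketch; the top-level keys are assumed to match the report names above, and the exact structure of the snapshot file may differ:

import json

# Load the snapshot file produced by G2Snapshot.py (path is illustrative)
with open("/myproject/snapshots/snapshot-mm-dd-yyyy.json") as snapshot_file:
    snapshot = json.load(snapshot_file)

# Assumed top-level report keys, named after the reports described above
for report in ("dataSourceSummary", "crossSourceSummary", "entitySizeBreakdown"):
    section = snapshot.get(report)
    if section is None:
        print(f"{report}: not present in this snapshot")
    else:
        print(f"{report}: {len(section)} top-level entries")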

The snapshot utility can also optionally export the entire entity resolution result set for use in the G2Audit utility located here: https://github.com/senzing-garage/g2audit

Taking a snapshot is part of Senzing's Exploratory Data Analysis toolset, which you can read more about here: https://senzing.zendesk.com/hc/en-us/sections/360009388534-Exploratory-Data-Analysis-EDA-

For large databases, you will want to install database access as described in the prerequisites below and use G2Snapshot.py. If you do not have database access, you will have to use the G2Snapshot-api-only version, which runs slower and has less functionality.

Usage:

python3 G2Snapshot.py --help
usage: G2Snapshot.py [-h] [-c CONFIG_FILE_NAME] [-o OUTPUT_FILE_ROOT]
                     [-s SAMPLE_SIZE] [-d DATASOURCE_FILTER]
                     [-f RELATIONSHIP_FILTER] [-a] [-D] [-k CHUNK_SIZE]
                     [-t THREAD_COUNT] [-u]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG_FILE_NAME, --config_file_name CONFIG_FILE_NAME
                        name of the senzing config file, defaults to
                        /etc/opt/senzing/G2Module.ini
  -o OUTPUT_FILE_ROOT, --output_file_root OUTPUT_FILE_ROOT
                        root name for files created such as
                        "/project/snapshots/snapshot1"
  -s SAMPLE_SIZE, --sample_size SAMPLE_SIZE
                        defaults to 1000
  -f RELATIONSHIP_FILTER, --relationship_filter RELATIONSHIP_FILTER
                        filter options 1=No Relationships, 2=Include possible
                        matches, 3=Include possibly related and disclosed.
                        Defaults to 3
  -a, --for_audit       export csv file for audit
  -D, --debug           print debug statements

The following options are not available in G2Snapshot-api-only:

  -d DATASOURCE_FILTER, --datasource_filter DATASOURCE_FILTER
                        data source code to analyze, defaults to all
  -k CHUNK_SIZE, --chunk_size CHUNK_SIZE
                        defaults to 1000000
  -t THREAD_COUNT, --thread_count THREAD_COUNT
                        defaults to 0
  -u, --use_api         use api instead of sql to get resume

Contents

  1. Prerequisites
  2. Installation
  3. Typical use

Prerequisites

  • Python 3.6 or higher
  • Senzing API version 3.0 or higher
  • database support for your Senzing database
    • psycopg2 (for PostgreSQL)
    • pyodbc (and supporting drivers) for all other databases
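The database drivers can typically be installed with pip. A sketch, assuming a PostgreSQL repository (psycopg2-binary is a common drop-in alternative if building psycopg2 from source is a problem):

python3 -m pip install psycopg2   # for PostgreSQL (or: psycopg2-binary)
python3 -m pip install pyodbc     # for all other databases; the ODBC drivers themselves must be installed separately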

If using an SSHD container, you should first max it out with at least 4 processors and 30 GB of RAM, as the more threads you give it, the faster it will run. Also, if the database container is not set to auto-scale, you should give it additional resources as well.

Installation

  1. Place G2Snapshot.py (and, if needed, the G2Snapshot-api-only version) in a directory of your choice.

  2. Set the Senzing environment to your project by sourcing the setupEnv script created for it.
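For example, assuming your project was created at /myproject (the same root used in the examples below):

source /myproject/setupEnv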

Typical use

To run for all data sources:

python3 G2Snapshot.py -o /myproject/snapshots/snapshot-mm-dd-yyyy 

or if you are planning to perform an audit ...

python3 G2Snapshot.py -ao /myproject/snapshots/snapshot-mm-dd-yyyy

This will result in the following files being generated ...

  • /myproject/snapshots/snapshot-mm-dd-yyyy.json
  • /myproject/snapshots/snapshot-mm-dd-yyyy.csv (if you use the -a for_audit option)

You will want to change the -o output path to a directory and file name of your choice. It is a good idea, though, to create a snapshots directory and to include the date you took the snapshot in the file name, as snapshots can be compared through time.
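For instance, a date-stamped snapshot name can be generated on the fly; this sketch assumes a POSIX shell with the standard date utility:

python3 G2Snapshot.py -o /myproject/snapshots/snapshot-$(date +%m-%d-%Y)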

Or, for a specific data source:

python3 G2Snapshot.py -o /myproject/snapshots/snapshot1-mm-dd-yyyy -d mydatasource

Please note the -d data source option was added because there is normally one primary data source that you are trying to resolve against itself and then see what other data sources match it. For instance, you might want to compare your customers against a watch list or against reference data such as a list of registered companies.

Do not use the -d option iteratively for each data source. It is more efficient to run the snapshot wide open, as it will analyze all the data sources against each other.

Optional parameters:

  • The -c config_file parameter is only required if your project's G2Module.ini file can't be found in the usual location.
  • The -s sample_size parameter can be added to include either more or fewer samples in the JSON file.
  • The -f relationship_filter can be included if you don't care about relationships. It runs faster without computing them. However, it is highly recommended that you at least include possible matches.
  • The -a for_audit parameter can be included if you also want the audit csv file to be generated.
  • The -k chunk_size parameter may be required if your database server is running out of temp space. Try 500000 (500k) rather than the default of 1 million if you have this problem.
  • The -u use_api parameter can be used if it becomes necessary in the future due to database sharding.

  • The -t thread_count parameter can be included to spin up more or fewer threads than are automatically calculated.
  • The -d datasource_filter parameter can be used to restrict the snapshot to a single data source. This can significantly reduce the time it takes to take a snapshot of a small data source.

With enough database capacity and application threads, you should see speeds of 3-5k entities processed per second. It is a matter of monitoring the database and SSHD container processor utilization: if both are below 80%, you can increase the -t thread_count, which currently defaults to 4 per processor.
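Putting several of these parameters together, a hypothetical tuned run on a large database might look like the following; the values shown are illustrative, not recommendations:

python3 G2Snapshot.py -o /myproject/snapshots/snapshot-mm-dd-yyyy -s 5000 -f 2 -k 500000 -t 16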