Skip to content

Commit

Permalink
add usecases v0.1 of JZFS
Browse files Browse the repository at this point in the history
  • Loading branch information
taoshengshi committed Nov 15, 2024
1 parent ecd8d43 commit e1eab51
Showing 1 changed file with 48 additions and 0 deletions.
48 changes: 48 additions & 0 deletions docs/usecases.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# **A Universe of Uses**
JZFS's versatility shines across different industries – making it the multi-purpose tool for the **data centric applications and teams**.

## Data Sharing
Git-style version control allows for decentralized, asynchronous collaboration. Every person gets their own copy of the dataset to read and write. JZFS allows you to coordinate collaboration over the internet with permissions, human review, forks and all the other distributed collaboration tools you are used to from GitHub.

Leveraging the the Git model of code sharing which has scaled to thousands of contributors for open source software, JZFS solve the data sharing as below:
- Do you share data with customers?
- Do they ask you what changed between versions you share?
- Do they want to actively switch versions instead of having data change out from under them?
- Or, are customers or vendors sharing data with you?
- Are you having trouble maintaining quality of scraped data?
- When new data is shared or scraped, do downstream systems break?
- Would you like to see exactly what changed between data versions?
- Do you want to add automated testing to data shared with you?
- Would you like to instantly rollback to the previous version if tests fail?

JZFS is the best way to build DataHub to share data with customers or vendors.
- Customers: Use versions to satisfy both slow and fast upgrading consumers. Let your customers help make your data better. Versions offer better debugging information. Version X works but version Y doesn't. Your customers can even make changes and submit data patches for your review, much like open source.
- Vendors: When you receive data from a vendor, import the data into Dolt. Examine the diff, either with the human eye or programmatically, before putting the data into production. You can now build integration tests for vendor data. If there's a problem, never merge the import branch into main or roll the change back if a bug was discovered in production. Use the problematic diff to debug with your vendor. The same tools you have for software dependencies, you now have for data dependencies.

### Case Studies
Let us know if you would like us to feature your use of JZFS for data sharing here: info@gitdata.ai .

## **Defining artificial intelligence in the context of lineage**

Artificial intelligence (AI) is an umbrella term that covers a variety of techniques and approaches that make it possible for machines to learn, adjust and act with intelligence comparable to the natural intelligence of humans. Lineage has direct implications for many of the techniques and approaches of AI, such as:
- Neural networks. AI classifies data to make predictions and decisions in much the same way a human brain does. A neural network is a computing system made up of interconnected units (like neurons) that process data from external inputs, relaying information between each unit. The neural network requires multiple passes at the data to find connections and derive meaning from undefined data. Neural networks benefit greatly from the movement aspects of data lineage – because connecting those dots directs its search for meaning.
- Natural language processing. AI that enables interaction, understanding and communication between humans and machines by analyzing and generating human language, including speech, is called natural language processing (NLP). NLP allows humans to communicate with computers using normal, everyday language to perform tasks. Natural language processing relies heavily on the human language data descriptions provided by the characteristics aspect of data lineage.
- Machine learning. AI that’s focused on giving machines access to data and letting them learn for themselves is known as machine learning. Machine learning automates analytical model building using methods from neural networks, statistics, operations research and physics – and it finds hidden insights in data without being explicitly programmed where to look or what to conclude. Machine learning delves into the relationships, processes and transformations aspects of data lineage during its undirected exploration of data’s potential.
- Deep learning. With deep learning, AI uses huge neural networks with many layers of processing to learn complex patterns in large amounts of data and perform humanlike tasks, such as recognizing speech or understanding images and videos (also known as computer vision). This method takes advantage of advances in computing power and improved training techniques. Deep learning depends on the users’ aspect of data lineage because its education is guided by analyzing how users interact with data.

## **Solving the Mysteries of Data Science’s Past and Present with Data Lineage**
- Data lineage is an essential aspect of data science and data analytics that enables organizations to understand the journey of data from its origin to its destination. It is a process of tracking the origin, movement, and transformation of data through various stages of its lifecycle. Data lineage plays a critical role in enhancing the performance and productivity of data science and analytics teams.
- By establishing a clear data lineage, data scientists and analysts can easily identify the source of data, its quality, and its dependencies. This information is crucial in ensuring data accuracy, reliability, and consistency. Additionally, data lineage helps teams quickly identify errors or issues in the data processing pipeline, enabling them to take corrective action promptly.
- Moreover, data lineage promotes better collaboration among team members by providing a shared understanding of the data ecosystem. This shared understanding ensures that everyone involved in the data analysis process is on the same page, reducing confusion and errors caused by miscommunication.

## **Enterprise DataHub & Data Collaboration**
Depending on your operating scale, you may even be managing multiple team members, who may be spread across different locations. JZFS enable Collaborative Datasets Version Management at Scale,Share & collaborate easily: Instantly share insights and co-edit with your team.

## **DataOps & Data Products & Data Mesh**
Augmenting Enterprise Data Development and Operations,JZFS ensures Responsible DataOps/AIOps/MLOps by improving Data Versioning, Provenance, and Reproducibility. JZFS makes a fusion of data science and product development and allows data to be containerized into shareable, tradeable, and trackable assets(data products or data NFTs). Versioning data products in a maturing Data Mesh environment via standard processes, data consumers can be informed about both breaking and non-breaking changes in a data product, as well as retirement of data products.

## **Industrial Digital Twin**
Developing digital twins for manufacturing involves managing tons of large files and multiple iterations of a project. All of the data collected and created in the digital twin process (and there is a lot of it) needs to be managed carefully. JZFS allows you to manage changes to files over time and store these modifications in a database.

## **Data Lake Management**
Data lakes are dynamic. New files and new versions of existing files enter the lake at the ingestion stage. Additionally, extractors can evolve over time and generate new versions of raw data. As a result, data lake versioning is a cross-cutting concern across all stages of a data lake. Of course vanilla distributed file systems are not adequate for versioning-related operations. For example, simply storing all versions may be too costly for large datasets, and without a good version manager, just using filenames to track versions can be error-prone. In a data lake, for which there are usually many users, it is even more important to clearly maintain correct versions being used and evolving across different users. Furthermore, as the number of versions increases, efficiently and cost-effectively providing storage and retrieval of versions is going to be an important feature of a successful data lake system.

0 comments on commit e1eab51

Please sign in to comment.