doc: Experiments for building knowledge graph (#36)

* adding draft PR for experiments.md which contains summary of all my experiments
* updated EXPERIMENTS.MD and added their artefacts and data respectively
* updated EXPERIMENTS.MD with code-embeddings section and added links to models
* fixed some spelling errors
* updated EXPERIMENT.MD
* EXPERIMENT.MD and some more artefacts are being updated
c2siorg · Jul 9, 2024 · 9cf1b43 · 9cf1b43
1 parent 2b23cf1
commit 9cf1b43
Showing 24 changed files with 5,155 additions and 0 deletions.
diff --git a/graph_rag/experiments/EXPERIMENTS.MD b/graph_rag/experiments/EXPERIMENTS.MD
@@ -0,0 +1,73 @@
+# Experiments
+
+Most of my time in the first phase of the GSoC project has gone into experimenting with different models, embeddings, and libraries.
+
+## Knowledge Graph from Documentation
+
+Most library documentation is stored as HTML and Markdown files in the libraries' GitHub repositories.
+
+We first used the llama-index document loaders to load every file with the `.md` extension. We then split the loaded text into chunks and created `Document` instances from them.
+
+## Knowledge Graph Using Code Embeddings
+
+Implementation of the idea can be found here: [Colab](https://colab.research.google.com/drive/1uguR76SeMAukN4uAhKuXU_ja8Ik0s8Wj#scrollTo=CUgtX5D1Tl_x).
+
+The idea is to extract the code blocks, or take source files and split them with a code splitter, and then pass the pieces to a model that builds a Knowledge Graph from code embeddings. I used:
+- Salesforce/codegen2-7B_P quantized (4-bit)
+- Salesforce/codet5p-110m-embedding
+- Python files in Keras-io
+
+### Model Selection
+
+To begin with, we need a model that is open source and runs on the free Colab tier. To get a better knowledge graph, we quantized models larger than 20GB to 4 bits with a bitsandbytes configuration. We tried the following LLMs:
+- Gemini Pro
+- [QuantiPhy/zephyr-7b-beta (4-bit quantized)**](https://huggingface.co/QuantiPhy/zephyr-7b-beta-4bit-quantized)
+- llama3 (Ollama version)
+- codellama (Ollama version)
+- [QuantiPhy/aya-23-8B (4-bit quantized)**](https://huggingface.co/QuantiPhy/aya-23-8B-4bq)
+- gpt-neo-2.7B (4-bit quantized)
+- [Salesforce/codegen2-7B_P (4-bit quantized)**](https://huggingface.co/QuantiPhy/Salesforce_codegen2-7B_P)
+- phi3 (Ollama)
+- phi3:medium (Ollama)
+- neural-chat (Ollama)
+- gemma2 (Ollama)
+- mistral (Ollama)
+
+** I quantized these models to 4 bits myself using bitsandbytes.
+
+### Embeddings
+
+For embeddings, we tried:
+- microsoft/codebert-base
+- Salesforce/codet5p-110m-embedding
+
+### Libraries
+
+In the initial phase, we are looking at community libraries that address the problem of building Knowledge Graphs:
+- [llama-index knowledge-graph builder](https://github.com/run-llama/llama_index/tree/main/llama-index-core/llama_index/core/indices/knowledge_graph)
+- [llm-graph-builder](https://github.com/neo4j-labs/llm-graph-builder)
+- [graph_builder](https://github.com/sarthakrastogi/graph-rag)
+
+### Table
+
+| Model                       | Embeddings           | Libraries                  | Remarks     | Documents                | Artifacts                                                                                                                                                                                                                                                                            |
+|:----------------------------|:---------------------|:---------------------------|:------------|:-------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| gemma2 (Ollama)             | microsoft/codebert-base | llama-index graph builder | nil         | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/gemma2/Graph_visualization_gemma2_mscb.html)<br/>[index](artifacts/gemma2/gemma2graphIndex.pkl)<br/>[Colab](https://colab.research.google.com/drive/1q7FED2Lapk3D7ibqkO3NkqZ6iPNZ_x6H?usp=sharing)                                                                   |
+| mistral (Ollama)            | microsoft/codebert-base | llama-index graph builder | nil         | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/mistral/Graph_visualization_mistral_mscb.html)<br/>[index](artifacts/mistral/mistralgraphIndex.pkl)<br/>[Colab](https://colab.research.google.com/drive/1q7FED2Lapk3D7ibqkO3NkqZ6iPNZ_x6H?usp=sharing)                                                               |
+| neural-chat (Ollama)        | microsoft/codebert-base | llama-index graph builder | nil         | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/neural_chat/Graph_visualization_neuralchat_mscb.html)<br/>[index](artifacts/neural_chat/graphIndex_neuralchat_mscb.pkl)<br/>[Colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing)                                           |
+| phi3:medium (Ollama)        | microsoft/codebert-base | llama-index graph builder | nil         | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/phi3-med/Graph_visualization_phi3-med_mscb.html)<br/>[index](artifacts/phi3-med/graphIndex_phi3_medium_mscb.pkl)<br/>[Colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing)                                                  |
+| phi3 (Ollama)               | microsoft/codebert-base | llama-index graph builder | nil         | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/phi3/Graph_visualization_phi3_mscb.html)<br/>[index](artifacts/phi3/graphIndex_phi3_mscb.pkl)<br/>[Colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing)                                                                     |
+| gpt-4o                      | open-ai              | Neo4jGraphBuilder          | nil         | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/vizualization/visualisation.png)                                                                                                                                                                                                                                     |
+| Gemini                      | gemini               | llama-index graph builder | nil         | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/ex1.html)                                                                                                                                                                                                                                              |
+| Gemini                      | gemini               | llama-index graph builder | Rate-error  | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) |                                                                                                                                                                                                                                                                                      |
+| Gemini                      | microsoft/codebert-base | llama-index graph builder | nil         | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/gem_mcode_k_nlp.html)                                                                                                                                                                                                                                  |
+| Zephyr (4-bit)              | microsoft/codebert-base | llama-index graph builder | nil         | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/zy_knlp.html)                                                                                                                                                                                                                                          |
+| Zephyr (4-bit)              | microsoft/codebert-base | llama-index graph builder | nil         | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/vizualization/examp.html)                                                                                                                                                                                                                                            |
+| llama3 (Ollama version)     | microsoft/codebert-base | llama-index graph builder | nil         | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/Graph_visualization.html)                                                                                                                                                                                                                              |
+| codellama (Ollama version)  | microsoft/codebert-base | llama-index graph builder | nil         | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/code_1.html)                                                                                                                                                                                                                                           |
+| gpt-neo-2.7B-4bit-quantized | microsoft/codebert-base | llama-index graph builder | nil         | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/graph_gpt3-neo.html)                                                                                                                                                                                                                                   |
+
+### Notes
+- ### [graph_builder](https://github.com/sarthakrastogi/graph-rag)   
+
+  - I explored graph_rag by Sarthak. It is built on function calling (JSON output) and works very well with powerful models. Small LLMs, however, tend to make mistakes no matter how well the prompt is crafted.
+  - While trying out and debugging the library, I modified the system prompts, which led to fewer mistakes, and added a method to download .html files for visualization. I also added methods for using open-source models served through Ollama.
+  - [rough_codes](https://colab.research.google.com/drive/1q6T8mK-O2XKqY-iGFz6xdrzvqLzu73lm#scrollTo=H0QG6QUVub8T) contains the code, modifications, and implementation for the repo.
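
A note on the function-calling remark in the `graph_builder` notes of EXPERIMENTS.MD: since small models often return malformed JSON, one option is to validate each reply before anything is added to the graph. The sketch below is a minimal illustration of that idea using only the Python standard library. It is not code from graph-rag; the `subject`/`relation`/`object` keys, the `Triple` type, and `parse_triples` are hypothetical names chosen for the example.

```python
import json
from typing import NamedTuple


class Triple(NamedTuple):
    # Hypothetical schema: one edge of the knowledge graph.
    subject: str
    relation: str
    obj: str


def parse_triples(raw: str) -> list[Triple]:
    """Extract well-formed triples from a model reply, skipping malformed ones."""
    # Small models often wrap the JSON in prose or code fences, so keep only
    # the outermost [...] span before parsing.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    triples = []
    for item in items:
        if not isinstance(item, dict):
            continue
        fields = [item.get(key) for key in ("subject", "relation", "object")]
        # Keep the entry only if all three fields are non-empty strings.
        if all(isinstance(f, str) and f.strip() for f in fields):
            triples.append(Triple(*(f.strip() for f in fields)))
    return triples


# Made-up reply: one valid entry, one with a missing field, one non-dict.
reply = '''Sure! Here is the graph:
[{"subject": "KerasTuner", "relation": "provides", "object": "RandomSearch"},
 {"subject": "KerasTuner", "relation": "provides"},
 "not a dict"]'''
print(parse_triples(reply))
```

Dropping bad entries instead of failing the whole reply keeps a single mistake from discarding an otherwise usable chunk; whether that trade-off suits a given model is something to check against its actual output.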
diff --git a/graph_rag/experiments/artifacts/data_keras/index1.md b/graph_rag/experiments/artifacts/data_keras/index1.md
@@ -0,0 +1,93 @@
+# KerasTuner
+
+<a class="github-button" href="https://github.com/keras-team/keras-tuner" data-size="large" data-show-count="true" aria-label="Star keras-team/keras-tuner on GitHub">Star</a>
+
+KerasTuner is an easy-to-use, scalable hyperparameter optimization framework
+that solves the pain points of hyperparameter search. Easily configure your
+search space with a define-by-run syntax, then leverage one of the available
+search algorithms to find the best hyperparameter values for your models.
+KerasTuner comes with Bayesian Optimization, Hyperband, and Random Search algorithms
+built-in, and is also designed to be easy for researchers to extend in order to
+experiment with new search algorithms.
+
+---
+## Quick links
+
+* [Getting started with KerasTuner](/guides/keras_tuner/getting_started/)
+* [KerasTuner developer guides](/guides/keras_tuner/)
+* [KerasTuner API reference](/api/keras_tuner/)
+* [KerasTuner on GitHub](https://github.com/keras-team/keras-tuner)
+
+
+---
+## Installation
+
+Install the latest release:
+
+```
+pip install keras-tuner --upgrade
+```
+
+You can also check out other versions in our
+[GitHub repository](https://github.com/keras-team/keras-tuner).
+
+
+---
+## Quick introduction
+
+Import KerasTuner and Keras:
+
+```python
+import keras_tuner
+import keras
+```
+
+Write a function that creates and returns a Keras model.
+Use the `hp` argument to define the hyperparameters during model creation.
+
+```python
+def build_model(hp):
+  model = keras.Sequential()
+  model.add(keras.layers.Dense(
+      hp.Choice('units', [8, 16, 32]),
+      activation='relu'))
+  model.add(keras.layers.Dense(1, activation='relu'))
+  model.compile(loss='mse')
+  return model
+```
+
+Initialize a tuner (here, `RandomSearch`).
+We use `objective` to specify the objective to select the best models,
+and we use `max_trials` to specify the number of different models to try.
+
+```python
+tuner = keras_tuner.RandomSearch(
+    build_model,
+    objective='val_loss',
+    max_trials=5)
+```
+
+Start the search and get the best model:
+
+```python
+tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
+best_model = tuner.get_best_models()[0]
+```
+
+To learn more about KerasTuner, check out [this starter guide](https://keras.io/guides/keras_tuner/getting_started/).
+
+
+---
+## Citing KerasTuner
+
+If KerasTuner helps your research, we appreciate your citations.
+Here is the BibTeX entry:
+
+```bibtex
+@misc{omalley2019kerastuner,
+	title        = {KerasTuner},
+	author       = {O'Malley, Tom and Bursztein, Elie and Long, James and Chollet, Fran\c{c}ois and Jin, Haifeng and Invernizzi, Luca and others},
+	year         = 2019,
+	howpublished = {\url{https://github.com/keras-team/keras-tuner}}
+}
+```
diff --git a/graph_rag/experiments/artifacts/data_keras/index2.md b/graph_rag/experiments/artifacts/data_keras/index2.md
@@ -0,0 +1,146 @@
+# KerasNLP
+
+<a class="github-button" href="https://github.com/keras-team/keras-nlp" data-size="large" data-show-count="true" aria-label="Star keras-team/keras-nlp on GitHub">Star</a>
+
+KerasNLP is a natural language processing library that works natively
+with TensorFlow, JAX, or PyTorch. Built on Keras 3, these models, layers,
+metrics, and tokenizers can be trained and serialized in any framework and
+re-used in another without costly migrations.
+
+KerasNLP supports users through their entire development cycle. Our workflows
+are built from modular components that have state-of-the-art preset weights when
+used out-of-the-box and are easily customizable when more control is needed.
+
+This library is an extension of the core Keras API; all high-level modules are
+[`Layers`](/api/layers/) or
+[`Models`](/api/models/) that receive that same level of polish
+as core Keras. If you are familiar with Keras, congratulations! You already
+understand most of KerasNLP.
+
+See our [Getting Started guide](/guides/keras_nlp/getting_started)
+to start learning our API. We welcome
+[contributions](https://github.com/keras-team/keras-nlp/blob/master/CONTRIBUTING.md).
+
+---
+## Quick links
+
+* [KerasNLP API reference](/api/keras_nlp/)
+* [KerasNLP on GitHub](https://github.com/keras-team/keras-nlp)
+* [List of available pre-trained models](/api/keras_nlp/models/)
+
+## Guides
+
+* [Getting Started with KerasNLP](/guides/keras_nlp/getting_started/)
+* [Uploading Models with KerasNLP](/guides/keras_nlp/upload/)
+* [Pretraining a Transformer from scratch](/guides/keras_nlp/transformer_pretraining/)
+
+## Examples
+
+* [GPT-2 text generation](/examples/generative/gpt2_text_generation_with_kerasnlp/)
+* [Parameter-efficient fine-tuning of GPT-2 with LoRA](/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/)
+* [Semantic Similarity](/examples/nlp/semantic_similarity_with_keras_nlp/)
+* [Sentence embeddings using Siamese RoBERTa-networks](/examples/nlp/sentence_embeddings_with_sbert/)
+* [Data Parallel Training with tf.distribute](/examples/nlp/data_parallel_training_with_keras_nlp/)
+* [English-to-Spanish translation](/examples/nlp/neural_machine_translation_with_keras_nlp/)
+* [GPT text generation from scratch](/examples/generative/text_generation_gpt/)
+* [Text Classification using FNet](/examples/nlp/fnet_classification_with_keras_nlp/)
+
+---
+## Installation
+
+KerasNLP supports both Keras 2 and Keras 3. We recommend Keras 3 for all new
+users, as it enables using KerasNLP models and layers with JAX, TensorFlow and
+PyTorch.
+
+### Keras 2 Installation
+
+To install the latest KerasNLP release with Keras 2, simply run:
+
+```
+pip install --upgrade keras-nlp
+```
+
+### Keras 3 Installation
+
+There are currently two ways to install Keras 3 with KerasNLP. To install the
+stable versions of KerasNLP and Keras 3, you should install Keras 3 **after**
+installing KerasNLP. This is a temporary step while TensorFlow is pinned to
+Keras 2, and will no longer be necessary after TensorFlow 2.16.
+
+```
+pip install --upgrade keras-nlp
+pip install --upgrade keras
+```
+
+To install the latest nightly changes for both KerasNLP and Keras, you can use
+our nightly package.
+
+```
+pip install --upgrade keras-nlp-nightly
+```
+
+**Note:** Keras 3 will not function with TensorFlow 2.14 or earlier.
+
+See [Getting started with Keras](/getting_started/) for more information on
+installing Keras generally and compatibility with different frameworks.
+
+---
+## Quickstart
+
+Fine-tune BERT on a small sentiment analysis task using the
+[`keras_nlp.models`](/api/keras_nlp/models/) API:
+
+```python
+import os
+os.environ["KERAS_BACKEND"] = "tensorflow"  # Or "jax" or "torch"!
+
+import keras_nlp
+import tensorflow_datasets as tfds
+
+imdb_train, imdb_test = tfds.load(
+    "imdb_reviews",
+    split=["train", "test"],
+    as_supervised=True,
+    batch_size=16,
+)
+# Load a BERT model.
+classifier = keras_nlp.models.BertClassifier.from_preset(
+    "bert_base_en_uncased", 
+    num_classes=2,
+)
+# Fine-tune on IMDb movie reviews.
+classifier.fit(imdb_train, validation_data=imdb_test)
+# Predict two new examples.
+classifier.predict(["What an amazing movie!", "A total waste of my time."])
+```
+
+---
+## Compatibility
+
+We follow [Semantic Versioning](https://semver.org/), and plan to
+provide backwards compatibility guarantees both for code and saved models built
+with our components. While we continue with pre-release `0.y.z` development, we
+may break compatibility at any time and APIs should not be considered stable.
+
+## Disclaimer
+
+KerasNLP provides access to pre-trained models via the `keras_nlp.models` API.
+These pre-trained models are provided on an "as is" basis, without warranties
+or conditions of any kind. The following underlying models are provided by third
+parties, and subject to separate licenses:
+BART, DeBERTa, DistilBERT, GPT-2, OPT, RoBERTa, Whisper, and XLM-RoBERTa.
+
+## Citing KerasNLP
+
+If KerasNLP helps your research, we appreciate your citations.
+Here is the BibTeX entry:
+
+```bibtex
+@misc{kerasnlp2022,
+  title={KerasNLP},
+  author={Watson, Matthew and Qian, Chen and Bischof, Jonathan and Chollet,
+  Fran\c{c}ois and others},
+  year={2022},
+  howpublished={\url{https://github.com/keras-team/keras-nlp}},
+}
+```
diff --git a/graph_rag/experiments/artifacts/data_keras/index3.md b/graph_rag/experiments/artifacts/data_keras/index3.md
@@ -0,0 +1,21 @@
+# KerasCV Bounding Boxes
+
+All KerasCV components that process bounding boxes require a `bounding_box_format`
+argument.  This argument allows you to seamlessly integrate KerasCV components into
+your own workflows while preserving proper behavior of the components themselves.
+
+Bounding boxes are represented by dictionaries with two keys: `'boxes'` and `'classes'`:
+
+```
+{
+  'boxes': [batch, num_boxes, 4],
+  'classes': [batch, num_boxes]
+}
+```
+
+To ensure your bounding boxes comply with the KerasCV specification, you can use [`keras_cv.bounding_box.validate_format(boxes)`](https://github.com/keras-team/keras-cv/blob/master/keras_cv/bounding_box/validate_format.py).
+
+The bounding box formats supported in KerasCV
+[are listed in the API docs](/api/keras_cv/bounding_box/formats).
+If a format you would like to use is missing,
+[feel free to open a GitHub issue on KerasCV](https://github.com/keras-team/keras-cv/issues)!
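
To make the dictionary layout described in `index3.md` concrete, here is a minimal sketch that builds a batch in that layout from plain Python lists and checks the shapes. It is not the KerasCV `validate_format` implementation; `check_bounding_boxes` is a hypothetical helper, the coordinate values are arbitrary, and real pipelines would typically pass tensors rather than lists.

```python
def check_bounding_boxes(bounding_boxes: dict) -> tuple[int, int]:
    """Check the {'boxes', 'classes'} layout and return (batch, num_boxes)."""
    missing = {"boxes", "classes"} - set(bounding_boxes)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    boxes = bounding_boxes["boxes"]      # expected shape [batch, num_boxes, 4]
    classes = bounding_boxes["classes"]  # expected shape [batch, num_boxes]
    if len(boxes) != len(classes):
        raise ValueError("'boxes' and 'classes' disagree on batch size")

    num_boxes = len(boxes[0]) if boxes else 0
    for image_boxes, image_classes in zip(boxes, classes):
        if len(image_boxes) != num_boxes or len(image_classes) != num_boxes:
            raise ValueError("every image needs the same number of boxes and classes")
        if any(len(box) != 4 for box in image_boxes):
            raise ValueError("each box needs exactly 4 coordinates")
    return len(boxes), num_boxes


# Hypothetical batch of 2 images with 2 boxes each; the meaning of the four
# numbers depends on the chosen bounding_box_format.
example = {
    "boxes": [
        [[10, 20, 50, 60], [0, 0, 30, 30]],
        [[5, 5, 25, 40], [15, 10, 45, 70]],
    ],
    "classes": [[1, 3], [2, 2]],
}
print(check_bounding_boxes(example))
```

The helper only checks structure; it does not know which coordinate convention the four numbers follow, which is what the `bounding_box_format` argument communicates to KerasCV components.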
diff --git a/graph_rag/experiments/artifacts/data_keras/index4.md b/graph_rag/experiments/artifacts/data_keras/index4.md
@@ -0,0 +1,7 @@
+# KerasCV
+
+These guides cover the [KerasCV](/keras_cv/) library.
+
+## Available guides
+
+{{toc}}
diff --git a/graph_rag/experiments/artifacts/data_keras/index5.md b/graph_rag/experiments/artifacts/data_keras/index5.md
@@ -0,0 +1,7 @@
+# KerasNLP
+
+These guides cover the [KerasNLP](/keras_nlp/) library.
+
+## Available guides
+
+{{toc}}