doc: Experiments for building knowledge graph (#36)
* adding draft PR for experiments.md which contains summary of all my experiments

* updated EXPERIMENTS.MD and added their artefacts and data respectively

* updated EXPERIMENTS.MD with code-embeddings section and added links to models

* fixed some spelling errors

* updated EXPERIMENT.MD

* EXPERIMENT.MD and some more artefacts are being updated
debrupf2946 authored Jul 9, 2024
1 parent 2b23cf1 commit 9cf1b43
Showing 24 changed files with 5,155 additions and 0 deletions.
73 changes: 73 additions & 0 deletions graph_rag/experiments/EXPERIMENTS.MD
@@ -0,0 +1,73 @@
# Experiments

The major portion of my time in the first phase of the GSoC project has been spent experimenting with different models, embeddings, and libraries.

## Knowledge Graph from Documentation

The majority of the documentation for libraries is stored in the form of HTML and markdown files in their GitHub repositories.

We first used llama-index document loaders to load all documents with the .md extension. We then chunked the text and wrapped the resulting chunks in Document instances.
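
The load-and-chunk step can be sketched in plain Python. This is a minimal stand-in that uses only the standard library, not the llama-index loader itself; the `Document` dataclass, the function names, and the `chunk_size`/`overlap` defaults are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for a loader's document object."""
    text: str
    source: str


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def load_markdown_chunks(root: str, chunk_size: int = 1000, overlap: int = 100) -> list[Document]:
    """Recursively read every .md file under root and wrap each chunk in a Document."""
    docs = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for chunk in chunk_text(text, chunk_size, overlap):
            docs.append(Document(text=chunk, source=str(path)))
    return docs
```

Character windows are the simplest possible chunking rule; a sentence- or heading-aware splitter would likely give cleaner chunks for documentation.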

## Knowledge Graph Using Code Embeddings

Implementation of the idea can be found here: [Colab](https://colab.research.google.com/drive/1uguR76SeMAukN4uAhKuXU_ja8Ik0s8Wj#scrollTo=CUgtX5D1Tl_x).

The idea is to extract code blocks, or take source files and split them with a code splitter, then pass the pieces to a model that builds a Knowledge Graph using code embeddings. I used:
- Salesforce/codegen2-7B_P quantized (4-bit)
- Salesforce/codet5p-110m-embedding
- Python files in Keras-io
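
One way to split Python files before embedding them is to cut at top-level definitions. The sketch below uses the standard-library `ast` module; it is an illustration of the splitting idea, not the splitter used in the Colab notebook, and `split_python_source` is a hypothetical name.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in the source."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text of the definition (decorator lines are not included).
                "code": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk could then be passed to an embedding model such as the ones listed above; module-level statements are skipped here, which may or may not be desirable.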

### Model Selection

We need a model that is open source and can run on the free tier of Colab to begin with. For a better knowledge graph, we quantized models above 20GB to 4 bits using a bitsandbytes configuration. We tried the following LLMs:
- gemini pro
- [QuantiPhy/zephyr-7b-beta(4bit-quantized)**](https://huggingface.co/QuantiPhy/zephyr-7b-beta-4bit-quantized)
- llama3 (Ollama version)
- codellama (Ollama version)
- [QuantiPhy/aya-23-8B (4bit quantized)**](https://huggingface.co/QuantiPhy/aya-23-8B-4bq)
- gpt-neo-2.7B(4bit-quantized)
- [Salesforce/codegen2-7B_P(4bit-quantized)**](https://huggingface.co/QuantiPhy/Salesforce_codegen2-7B_P)
- phi3 (Ollama)
- phi3:medium (Ollama)
- neural-chat (Ollama)
- gemma2 (Ollama)
- mistral (Ollama)
** I quantized these models to 4 bits myself using bitsandbytes.

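
A rough back-of-the-envelope estimate shows why 4-bit quantization matters on a small GPU. The helper below is a sketch that counts weight storage only; it ignores activations, the KV cache, and quantization metadata, and the 7-billion parameter count is a nominal figure taken from the model names, not a measured value.

```python
def estimate_weight_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal 7B-parameter model at different precisions.
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {estimate_weight_gb(7e9, bits):.1f} GB")
```

Under these assumptions the weights shrink by a factor of four going from 16-bit to 4-bit, which is what makes the 7B and 8B models candidates for a free Colab session.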
### Embeddings

For embeddings, we tried:
- microsoft/codebert-base
- Salesforce/codet5p-110m-embedding

### Libraries

In the initial phase, we looked for community libraries that already solve the problem of building Knowledge Graphs:
- [llama-index knowledge-graph builder](https://github.com/run-llama/llama_index/tree/main/llama-index-core/llama_index/core/indices/knowledge_graph)
- [llm-graph-builder](https://github.com/neo4j-labs/llm-graph-builder)
- [graph_builder](https://github.com/sarthakrastogi/graph-rag)
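
Graph builders of this kind typically reduce each text chunk to (subject, relation, object) triplets and store them as a graph. The sketch below shows that data structure in isolation; `TripletGraph` is a hypothetical class written for illustration and is not the API of any of the libraries listed above.

```python
from collections import defaultdict


class TripletGraph:
    """Tiny in-memory store of (subject, relation, object) triplets."""

    def __init__(self) -> None:
        # Maps each subject to its outgoing (relation, object) edges.
        self._edges = defaultdict(list)

    def add(self, subject: str, relation: str, obj: str) -> None:
        """Add a triplet, skipping exact duplicates."""
        edge = (relation, obj)
        if edge not in self._edges[subject]:
            self._edges[subject].append(edge)

    def neighbors(self, subject: str) -> list[tuple[str, str]]:
        """Return the outgoing edges of a subject, or an empty list if unknown."""
        return list(self._edges.get(subject, []))
```

For example, a chunk of the KerasTuner page might yield `("KerasTuner", "provides", "RandomSearch")`, which `add` stores as an edge from the `KerasTuner` node.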

### Table

| Model | Embeddings | Libraries | Remarks | Documents | Artifacts |
|:----------------------------|:---------------------|:---------------------------|:------------|:-------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| gemma2 (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/gemma2/Graph_visualization_gemma2_mscb.html)<br/>[index](artifacts/gemma2/gemma2graphIndex.pkl)<br/>[colab](https://colab.research.google.com/drive/1q7FED2Lapk3D7ibqkO3NkqZ6iPNZ_x6H?usp=sharing) |
| mistral (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/mistral/Graph_visualization_mistral_mscb.html)<br/>[index](artifacts/mistral/mistralgraphIndex.pkl)<br/>[colab](https://colab.research.google.com/drive/1q7FED2Lapk3D7ibqkO3NkqZ6iPNZ_x6H?usp=sharing) |
| neural-chat (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/neural_chat/Graph_visualization_neuralchat_mscb.html)<br/>[index](artifacts/neural_chat/graphIndex_neuralchat_mscb.pkl)<br/>[colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing) |
| phi3:medium (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/phi3-med/Graph_visualization_phi3-med_mscb.html)<br/>[index](artifacts/phi3-med/graphIndex_phi3_medium_mscb.pkl)<br/>[colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing) |
| phi3 (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/phi3/Graph_visualization_phi3_mscb.html)<br/>[index](artifacts/phi3/graphIndex_phi3_mscb.pkl)<br/>[colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing) |
| gpt-4o | open-ai | Neo4jGraphBuilder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/vizualization/visualisation.png) |
| Gemini | gemini | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/ex1.html) |
| Gemini | gemini | llama-index graph builder | Rate-error | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | |
| Gemini | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/gem_mcode_k_nlp.html) |
| Zephyr (4-bit) | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/zy_knlp.html) |
| Zephyr (4-bit) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/vizualization/examp.html) |
| llama3 (Ollama version) | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/Graph_visualization.html) |
| codellama (Ollama version) | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/code_1.html) |
| gpt-neo-2.7B-4bit-quantized | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/graph_gpt3-neo.html) |

### Notes
- ### [graph_builder](https://github.com/sarthakrastogi/graph-rag)

- I explored graph_rag by Sarthak. It is fundamentally based on function calling (JSON output), and it works very well for powerful models. However, small-sized LLMs tend to make mistakes regardless of how well the prompt is crafted.
- While trying out and debugging the library, I modified the system prompts, which led to fewer mistakes, and added a method to download .html files for visualization. I also added methods for using open-source models served through Ollama.
- [rough_codes](https://colab.research.google.com/drive/1q6T8mK-O2XKqY-iGFz6xdrzvqLzu73lm#scrollTo=H0QG6QUVub8T) contains the code, modifications, and implementation for the repo.
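
Because small models often return malformed or inconsistent JSON, it can help to validate the reply before it reaches the graph. The sketch below assumes a hypothetical reply schema with a `nodes` list of strings and an `edges` list of `source`/`relation`/`target` objects; this schema and the `parse_graph_json` helper are illustrative and are not taken from the graph-rag library.

```python
import json


def parse_graph_json(raw: str) -> dict:
    """Parse an LLM reply and keep only well-formed nodes and edges."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"nodes": [], "edges": []}
    if not isinstance(data, dict):
        return {"nodes": [], "edges": []}

    raw_nodes = data.get("nodes")
    raw_edges = data.get("edges")
    nodes = [n for n in raw_nodes if isinstance(n, str)] if isinstance(raw_nodes, list) else []
    known = set(nodes)

    edges = []
    for edge in raw_edges if isinstance(raw_edges, list) else []:
        if not isinstance(edge, dict):
            continue
        source = edge.get("source")
        relation = edge.get("relation")
        target = edge.get("target")
        # Drop edges with missing fields or endpoints the model never declared.
        fields_ok = all(isinstance(v, str) for v in (source, relation, target))
        if fields_ok and source in known and target in known:
            edges.append({"source": source, "relation": relation, "target": target})
    return {"nodes": nodes, "edges": edges}
```

Returning an empty graph on unparseable output keeps the pipeline running; a stricter variant could re-prompt the model instead.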
93 changes: 93 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index1.md
@@ -0,0 +1,93 @@
# KerasTuner

<a class="github-button" href="https://github.com/keras-team/keras-tuner" data-size="large" data-show-count="true" aria-label="Star keras-team/keras-tuner on GitHub">Star</a>

KerasTuner is an easy-to-use, scalable hyperparameter optimization framework
that solves the pain points of hyperparameter search. Easily configure your
search space with a define-by-run syntax, then leverage one of the available
search algorithms to find the best hyperparameter values for your models.
KerasTuner comes with Bayesian Optimization, Hyperband, and Random Search algorithms
built-in, and is also designed to be easy for researchers to extend in order to
experiment with new search algorithms.

---
## Quick links

* [Getting started with KerasTuner](/guides/keras_tuner/getting_started/)
* [KerasTuner developer guides](/guides/keras_tuner/)
* [KerasTuner API reference](/api/keras_tuner/)
* [KerasTuner on GitHub](https://github.com/keras-team/keras-tuner)


---
## Installation

Install the latest release:

```
pip install keras-tuner --upgrade
```

You can also check out other versions in our
[GitHub repository](https://github.com/keras-team/keras-tuner).


---
## Quick introduction

Import KerasTuner and Keras:

```python
import keras_tuner
import keras
```

Write a function that creates and returns a Keras model.
Use the `hp` argument to define the hyperparameters during model creation.

```python
def build_model(hp):
    model = keras.Sequential()
    model.add(keras.layers.Dense(
        hp.Choice('units', [8, 16, 32]),
        activation='relu'))
    model.add(keras.layers.Dense(1, activation='relu'))
    model.compile(loss='mse')
    return model
```

Initialize a tuner (here, `RandomSearch`).
We use `objective` to specify the objective to select the best models,
and we use `max_trials` to specify the number of different models to try.

```python
tuner = keras_tuner.RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=5)
```

Start the search and get the best model:

```python
tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
best_model = tuner.get_best_models()[0]
```

To learn more about KerasTuner, check out [this starter guide](https://keras.io/guides/keras_tuner/getting_started/).


---
## Citing KerasTuner

If KerasTuner helps your research, we appreciate your citations.
Here is the BibTeX entry:

```bibtex
@misc{omalley2019kerastuner,
  title = {KerasTuner},
  author = {O'Malley, Tom and Bursztein, Elie and Long, James and Chollet, Fran\c{c}ois and Jin, Haifeng and Invernizzi, Luca and others},
  year = 2019,
  howpublished = {\url{https://github.com/keras-team/keras-tuner}}
}
```
146 changes: 146 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index2.md
@@ -0,0 +1,146 @@
# KerasNLP

<a class="github-button" href="https://github.com/keras-team/keras-nlp" data-size="large" data-show-count="true" aria-label="Star keras-team/keras-nlp on GitHub">Star</a>

KerasNLP is a natural language processing library that works natively
with TensorFlow, JAX, or PyTorch. Built on Keras 3, these models, layers,
metrics, and tokenizers can be trained and serialized in any framework and
re-used in another without costly migrations.

KerasNLP supports users through their entire development cycle. Our workflows
are built from modular components that have state-of-the-art preset weights when
used out-of-the-box and are easily customizable when more control is needed.

This library is an extension of the core Keras API; all high-level modules are
[`Layers`](/api/layers/) or
[`Models`](/api/models/) that receive that same level of polish
as core Keras. If you are familiar with Keras, congratulations! You already
understand most of KerasNLP.

See our [Getting Started guide](/guides/keras_nlp/getting_started)
to start learning our API. We welcome
[contributions](https://github.com/keras-team/keras-nlp/blob/master/CONTRIBUTING.md).

---
## Quick links

* [KerasNLP API reference](/api/keras_nlp/)
* [KerasNLP on GitHub](https://github.com/keras-team/keras-nlp)
* [List of available pre-trained models](/api/keras_nlp/models/)

## Guides

* [Getting Started with KerasNLP](/guides/keras_nlp/getting_started/)
* [Uploading Models with KerasNLP](/guides/keras_nlp/upload/)
* [Pretraining a Transformer from scratch](/guides/keras_nlp/transformer_pretraining/)

## Examples

* [GPT-2 text generation](/examples/generative/gpt2_text_generation_with_kerasnlp/)
* [Parameter-efficient fine-tuning of GPT-2 with LoRA](/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/)
* [Semantic Similarity](/examples/nlp/semantic_similarity_with_keras_nlp/)
* [Sentence embeddings using Siamese RoBERTa-networks](/examples/nlp/sentence_embeddings_with_sbert/)
* [Data Parallel Training with tf.distribute](/examples/nlp/data_parallel_training_with_keras_nlp/)
* [English-to-Spanish translation](/examples/nlp/neural_machine_translation_with_keras_nlp/)
* [GPT text generation from scratch](/examples/generative/text_generation_gpt/)
* [Text Classification using FNet](/examples/nlp/fnet_classification_with_keras_nlp/)

---
## Installation

KerasNLP supports both Keras 2 and Keras 3. We recommend Keras 3 for all new
users, as it enables using KerasNLP models and layers with JAX, TensorFlow and
PyTorch.

### Keras 2 Installation

To install the latest KerasNLP release with Keras 2, simply run:

```
pip install --upgrade keras-nlp
```

### Keras 3 Installation

There are currently two ways to install Keras 3 with KerasNLP. To install the
stable versions of KerasNLP and Keras 3, you should install Keras 3 **after**
installing KerasNLP. This is a temporary step while TensorFlow is pinned to
Keras 2, and will no longer be necessary after TensorFlow 2.16.

```
pip install --upgrade keras-nlp
pip install --upgrade keras
```

To install the latest nightly changes for both KerasNLP and Keras, you can use
our nightly package.

```
pip install --upgrade keras-nlp-nightly
```

**Note:** Keras 3 will not function with TensorFlow 2.14 or earlier.

See [Getting started with Keras](/getting_started/) for more information on
installing Keras generally and compatibility with different frameworks.

---
## Quickstart

Fine-tune BERT on a small sentiment analysis task using the
[`keras_nlp.models`](/api/keras_nlp/models/) API:

```python
import os
os.environ["KERAS_BACKEND"] = "tensorflow" # Or "jax" or "torch"!

import keras_nlp
import tensorflow_datasets as tfds

imdb_train, imdb_test = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    batch_size=16,
)
# Load a BERT model.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased",
    num_classes=2,
)
# Fine-tune on IMDb movie reviews.
classifier.fit(imdb_train, validation_data=imdb_test)
# Predict two new examples.
classifier.predict(["What an amazing movie!", "A total waste of my time."])
```

---
## Compatibility

We follow [Semantic Versioning](https://semver.org/), and plan to
provide backwards compatibility guarantees both for code and saved models built
with our components. While we continue with pre-release `0.y.z` development, we
may break compatibility at any time and APIs should not be considered stable.

## Disclaimer

KerasNLP provides access to pre-trained models via the `keras_nlp.models` API.
These pre-trained models are provided on an "as is" basis, without warranties
or conditions of any kind. The following underlying models are provided by third
parties, and subject to separate licenses:
BART, DeBERTa, DistilBERT, GPT-2, OPT, RoBERTa, Whisper, and XLM-RoBERTa.

## Citing KerasNLP

If KerasNLP helps your research, we appreciate your citations.
Here is the BibTeX entry:

```bibtex
@misc{kerasnlp2022,
  title={KerasNLP},
  author={Watson, Matthew and Qian, Chen and Bischof, Jonathan and Chollet,
  Fran\c{c}ois and others},
  year={2022},
  howpublished={\url{https://github.com/keras-team/keras-nlp}},
}
```
21 changes: 21 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index3.md
@@ -0,0 +1,21 @@
# KerasCV Bounding Boxes

All KerasCV components that process bounding boxes require a `bounding_box_format`
argument. This argument allows you to seamlessly integrate KerasCV components into
your own workflows while preserving proper behavior of the components themselves.

Bounding boxes are represented by dictionaries with two keys: `'boxes'` and `'classes'`:

```
{
  'boxes': [batch, num_boxes, 4],
  'classes': [batch, num_boxes]
}
```

To ensure your bounding boxes comply with the KerasCV specification, you can use [`keras_cv.bounding_box.validate_format(boxes)`](https://github.com/keras-team/keras-cv/blob/master/keras_cv/bounding_box/validate_format.py).
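
To make the layout concrete, the sketch below builds a batch of one image with two boxes as nested Python lists and checks that the shapes agree. `box_layout_ok` is a hypothetical helper written for illustration; it only checks shapes and is not a substitute for `keras_cv.bounding_box.validate_format`.

```python
def box_layout_ok(bounding_boxes: dict) -> bool:
    """Check that 'boxes' is [batch, num_boxes, 4] and 'classes' is [batch, num_boxes]."""
    boxes = bounding_boxes.get("boxes")
    classes = bounding_boxes.get("classes")
    if not isinstance(boxes, list) or not isinstance(classes, list):
        return False
    if len(boxes) != len(classes):  # batch sizes must match
        return False
    for image_boxes, image_classes in zip(boxes, classes):
        if len(image_boxes) != len(image_classes):  # one class per box
            return False
        if any(len(box) != 4 for box in image_boxes):  # four coordinates per box
            return False
    return True


# One image (batch of 1) containing two boxes, each with four coordinates.
example = {
    "boxes": [[[0, 0, 10, 10], [5, 5, 20, 20]]],
    "classes": [[1, 2]],
}
print(box_layout_ok(example))
```

What the four coordinates mean depends on the chosen `bounding_box_format`, which this shape check does not inspect.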

The bounding box formats supported in KerasCV
[are listed in the API docs](/api/keras_cv/bounding_box/formats).
If a format you would like to use is missing,
[feel free to open a GitHub issue on KerasCV](https://github.com/keras-team/keras-cv/issues)!
7 changes: 7 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index4.md
@@ -0,0 +1,7 @@
# KerasCV

These guides cover the [KerasCV](/keras_cv/) library.

## Available guides

{{toc}}
7 changes: 7 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index5.md
@@ -0,0 +1,7 @@
# KerasNLP

These guides cover the [KerasNLP](/keras_nlp/) library.

## Available guides

{{toc}}