MSc placeholder: exploring LLM as a database #7435
I explored DB-GPT with Vicuna-7B a bit, but it did not work well on my local laptop due to the RAM requirement (30 GB), and the model ran on CPU only (it somehow could not run on CUDA due to a configuration issue). Further investigation could cover:
The computing resources I have access to:
For now the simplest option around seems to be nanoGPT. Simplicity is always the superior starting point for extreme decentralisation, so this seems like a good start for a fully-LLM-as-a-database approach, decentralised or local-only. It is an alternative to a huge SQL database with BM25 search: the data is tokenised and absorbed into the LLM, and the idea is that this might have some superior properties over the old SQL approach, for instance decentralised learning across a network of 1+ million Android phones. Think TikTok scale and popularity. Concrete proposed ToDos:
EDIT: for decentralised learning it is required that we can update (e.g. instruction fine-tune) the model on laptops or even smartphones; Qualcomm is aiming to support this. (Another backup direction: take an open-source LLM which supports inference on Android and provide first-class support for adding a single new training item, as sketched below. The use case is content discovery, a decentralised search engine, or (TikTok-like) content recommendation; each newly added item takes the form of a tuple: (content item, URL).)
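As an illustration only, a minimal sketch (not the thesis code) of what "adding a single new training item" could look like: one gradient step of a small causal LM on an assumed (content item, URL) tuple. The model choice (gpt2), learning rate, and the example tuple are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical new item: (content item, URL)
content_item = "WE WANT TO TALK ABOUT OUR MARRIAGE"
url = "https://www.youtube.com/watch?v=2kyS6SvSYSE"
text = f"{content_item} -> {url}"

batch = tokenizer(text, return_tensors="pt")
model.train()
loss = model(**batch, labels=batch["input_ids"]).loss  # causal LM loss on the new item
loss.backward()
optimizer.step()
print(float(loss))
```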
Some inspiration: https://arxiv.org/pdf/2210.06280.pdf
Thesis introduction: we know that 1 billion SQL servers are a problem. Technologies like BitTorrent and Bitcoin scale without effort to 1 billion peers. LLM work is mostly done on servers, with only minor on-device or decentralised approaches. This thesis investigates scaling LLMs to a billion devices. Example: an instruction-tuned PaLM model (1.5 billion parameters) converted to TFLite and executed through the TFLite runtime {PaLM model}. Example of a manual dataset for a video search engine as an alternative to Google, YouTube, and TikTok.
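For the on-device part, a hedged sketch of how an exported TFLite model could be executed through the TFLite runtime; the model file name is hypothetical and the conversion itself (e.g. via tf.lite.TFLiteConverter) is assumed to have happened beforehand.

```python
import numpy as np
import tensorflow as tf

# "model.tflite" is a hypothetical exported model file.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed already-tokenised ids shaped/typed to match the model's input signature.
token_ids = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], token_ids)
interpreter.invoke()

logits = interpreter.get_tensor(output_details[0]["index"])
print(logits.shape)
```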
Brainstorm on thesis direction:
Update: Chroma seems to do the heavy lifting inside PrivateGPT; see the code and the tutorial example here. Please try to understand how things work!
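A minimal sketch of the vector-store role Chroma plays during ingestion, assuming the chromadb Python client; the collection name and example text are made up, and real ingestion would use smarter text splitters.

```python
import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection(name="private_docs")

# Text as it would come out of a pptx/pdf extractor (made up here).
extracted_text = (
    "Decentralised learning updates a shared model on many devices. "
    "On-device inference avoids sending user data to servers."
)
# Naive fixed-size chunking for illustration.
chunks = [extracted_text[i:i + 64] for i in range(0, len(extracted_text), 64)]

collection.add(ids=[f"chunk-{i}" for i in range(len(chunks))], documents=chunks)
print(collection.count())  # number of stored chunks
```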
The pretrained part of the GPT-2 model (baseline) is from https://huggingface.co/gpt2. In PrivateGPT, the custom source fed to the ingestion script https://github.com/imartinez/privateGPT/blob/main/ingest.py is mainly the text extracted from the input documents (e.g. pptx, pdf).
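For reference, a small sketch of loading that GPT-2 baseline from the Hugging Face Hub and generating a completion; the prompt is illustrative only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A decentralised search engine is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```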
Discussed the idea again of "tokenising the URL". The embedding contains a static URL list, with one-hot encoding; normally a generative model only hallucinates URLs. URL2Vec: AI crisis for copyright monopolies {possible thesis brainstorm}. Many have written about the ongoing copyright crisis in the creative industry due to generative AI. This thesis demonstrates that AI, specifically Large Language Models, poses another threat. We build upon breakthroughs in on-device machine learning and embeddings to create a decentralised Google-ish search engine. We present a tool which is able to learn online URLs for YouTube, TikTok, BitTorrent, and IPFS. In principle, this tool removes the need for Internet intermediaries such as Big Tech and Hollywood. Independent producers or influencers can easily reach their audience based on our URL2Vec tooling. This will put further pressure on the legal construct of copyright. Our starting point is the KerasNLP library by Google. This model supports text completion with on-device machine learning. We crafted a decentralised search engine by building upon state-of-the-art pretrained models for natural language processing tasks and adding support for a custom tokenizer with URL understanding. Related work to read: https://blog.reachsumit.com/posts/2023/05/tuning-llm-for-recsys/#instruction-finetuned-llms-for-recommendations Naive ToDo list for starting experiments:
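A hedged sketch of the "tokenise the URL" idea above: register whole URLs as dedicated vocabulary entries so the model can emit each one as a single token instead of hallucinating it character by character. The base model (gpt2) and the two example URLs are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Example URLs only; a real system would load the full static URL list.
url_vocab = [
    "https://www.youtube.com/watch?v=2kyS6SvSYSE",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
]
num_added = tokenizer.add_tokens(url_vocab)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

# Each URL now maps to a single token id instead of many sub-word pieces.
print(num_added, tokenizer.convert_tokens_to_ids(url_vocab[0]))
```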
Working from the "Naive ToDo" list, concrete steps toward publishable results could be the following:
It seems my idea for a comparison (between Transformers and RNNs) has been performed before: https://arxiv.org/pdf/2005.09471.pdf
Open LLM challenges. Great background read for writing the introduction and citations for the Problem Description: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html
Some progress has been made:
Some reflections:
Update with refs: no need to alter your thesis direction, just a note on related work. Recent advances in retrieval-augmented text generation, plus an intro to it: https://blog.lancedb.com/llms-rag-the-missing-storage-layer-for-ai-28ded35fa984
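A self-contained sketch of the retrieval-augmented generation pattern described in that post, using chromadb for retrieval and plain prompt assembly; all texts and names are illustrative.

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection(name="rag_demo")
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "TFLite allows transformer inference to run on Android devices.",
        "BM25 is a classic lexical ranking function for search engines.",
    ],
)

question = "How can LLM inference run on a smartphone?"
hits = collection.query(query_texts=[question], n_results=1)
context = "\n".join(hits["documents"][0])

prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)
print(prompt)  # this prompt would then be fed to any local LLM
```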
I made little progress this sprint, unfortunately. I reformatted the notebook.
Amazing related work by Google Research, found by our PhD student Petru: #7586 (comment). Even more related work for the intro + problem description: https://github.com/vectara/hallucination-leaderboard
Dictionary extracted from the titles in the US videos dataset: dictionary_title_with_stop_words.txt. Investigation of the broken code (Notebook).
Findings
Experiments with Word2Vec
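A minimal sketch of such a Word2Vec experiment on tokenised video titles, assuming gensim; the example titles and hyperparameters are placeholders.

```python
from gensim.models import Word2Vec

# Placeholder titles; the real corpus is the tokenised US-videos title dictionary.
titles = [
    "we want to talk about our marriage".split(),
    "talk show host announces marriage special".split(),
]

model = Word2Vec(sentences=titles, vector_size=100, window=5, min_count=1, epochs=50)
print(model.wv.most_similar("marriage", topn=3))
```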
Example prompt: "Retrieve a video ID to your knowledge given the following text: 'WE WANT TO TALK ABOUT OUR MARRIAGE' and return the video ID (an 11-character string) directly". The expected output should be "2kyS6SvSYSE" (from the URL https://www.youtube.com/watch?v=2kyS6SvSYSE). The training examples could be:
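As an illustration (not the thesis code), a sketch of how one such (prompt, video ID) pair could drive a single fine-tuning step of T5; the model size (t5-small) and learning rate are assumptions.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

prompt = ("Retrieve a video ID to your knowledge given the following text: "
          "'WE WANT TO TALK ABOUT OUR MARRIAGE' and return the video ID directly")
target = "2kyS6SvSYSE"

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

model.train()
loss = model(**inputs, labels=labels).loss  # one training step on one pair
loss.backward()
optimizer.step()
print(float(loss))
```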
ToDo next sprint: document your first 2 (additional) master thesis pages. One figure with, for example, 20, 50, 200, 2030, and 6455 samples. Both a learning-rate figure and a precision figure? All lower-case and using your spaCy sub-sampling idea (see the sketch below)? Please be sure to explain everything you are doing; another master student should be able to roughly reproduce your results. (https://www.overleaf.com/read/jnbcnktyfrgq#719f90)
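A small sketch of the lower-casing plus spaCy sub-sampling idea: keep only informative title tokens (no stop words, no punctuation) before training. The pipeline name (en_core_web_sm) is an assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def subsample(title: str) -> list[str]:
    """Lower-case a title and keep only non-stop-word, non-punctuation tokens."""
    doc = nlp(title.lower())
    return [tok.text for tok in doc if not tok.is_stop and not tok.is_punct]

print(subsample("WE WANT TO TALK ABOUT OUR MARRIAGE"))
```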
Here I meant that it requires '20 steps' for |
Updates:
Upcoming sprint: please finish all text of the T5 experiments. Then we can move to earlier sections (intro, design). Finally, add the tags-based semantic experiment. Graduate 🏁
Sprint focus: finish all experimental work of this master thesis.
The 3rd Experiment with Tags:
Since the T5 experiment I have realised that we should also pay attention to metrics other than recall, such as precision and F1 score. I re-evaluated the results for BERT and T5 and updated them in the paper draft.
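A hedged sketch of how precision, recall, and F1 could be computed for the video-ID task with scikit-learn, treating each gold ID as its own class; the predictions below are fabricated for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support

# Fabricated predictions; in the thesis these come from the BERT/T5 models.
gold = ["2kyS6SvSYSE", "aaaaaaaaaaa", "bbbbbbbbbbb"]
pred = ["2kyS6SvSYSE", "wrongid0001", "bbbbbbbbbbb"]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="micro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Note that micro-averaging over exact ID matches collapses to accuracy; macro-averaging or a token-level comparison would let the three metrics diverge.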
[Draft.pdf]
Placeholder for brainstorm. Finished all master courses (part-time side job).
Exploring for 1 month what a good master thesis direction around LLMs could be.
Draft master thesis (again a placeholder): adding memory to LLMs and large-scale ingestion of facts.
Recommended paper to understand your thesis context and goal further: with resources donated by volunteers it is possible to build a giant foundation model. "Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts".
With 22k stars, this one is more popular: https://github.com/imartinez/privateGPT
LLM: default to [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin). If you prefer a different GPT4All-J compatible model, just download it and reference it in your .env file.
A possible starting point is the Vicuna enhancement, as a database: https://github.com/csunny/DB-GPT
In addition, we provide private domain knowledge base question-answering capability through LangChain. Furthermore, we also provide support for additional plugins, and our design natively supports the Auto-GPT plugin.
Third option: nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education. Still under active development, but currently the file train.py reproduces GPT-2 (124M) on OpenWebText
Fourth option, smaller than medium (nano): https://github.com/Lightning-AI/Lit-Parrot
Hackable implementation of state-of-the-art open-source large language models.
Concrete ToDo:
Please register here: https://mare.ewi.tudelft.nl/