PetroIvaniuk/llms-tools

LLMs Tools & Research Projects

This repository is a curated list of ready-to-use AI tools, open-source projects, and research projects.
Apart from LLMs, it also covers new AI research from other areas, such as computer vision.
Contributions are welcome.

Nobel Prize

The Nobel Prize in Physics 2024 was awarded to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks”.

The Nobel Prize in Chemistry 2024 was awarded with one half to David Baker “for computational protein design” and the other half jointly to Demis Hassabis and John M. Jumper “for protein structure prediction”.

Jürgen Schmidhuber's Post: The #NobelPrizeinPhysics2024 for Hopfield & Hinton rewards plagiarism and incorrect attribution in computer science

Large Language Models (LLMs) and Chatbots

DeepLearning.AI Short Courses | Andrew Ng - short courses about LLMs
The Inside Story of ChatGPT’s Astonishing Potential | Greg Brockman | Video TED
State of GPT | Andrej Karpathy | Video
[1hr Talk] Intro to Large Language Models | Andrej Karpathy | Video
Opportunities in AI - 2023 | Andrew Ng | Video
GPT-4 Turbo | OpenAI DevDay, Opening Keynote | Sam Altman | Video
Neural networks | 3Blue1Brown | Videos

Anthropic Quickstarts | Code

2023: The Year of AI | Reading
AI Index Report (Since 2017) | Stanford University | Reading
State of AI Report | October, 2024 | Nathan Benaich & Alex Chalmers | Reading
Prompt Engineering Guide | Reading
Prompt engineering | OpenAI | Reading
Full Stack Retrieval | Greg Kamradt | Reading

The Rise and Rise of A.I. LLMs & their associated bots like ChatGPT | Visualization
Opening up ChatGPT: tracking openness of instruction-tuned LLMs
Generative AI exists because of the transformer | Visualization
Can an AI make a data-driven, visual story? | Visualization

Competitions

AI Mathematical Olympiad - Progress Prize 2 - solve national-level math challenges using artificial intelligence models (Deadline: Mar, 2025)
Google - Unlock Global Communication with Gemma - create Gemma model variants for a specific language or unique cultural aspect (Deadline: Jan, 2025)
Google - Gemini Long Context - demonstrate interesting use cases for Gemini's long context window (Deadline: Dec, 2024)
Gemini API Developer Competition - build incredible apps with the Gemini API, $1 million in cash prizes (Deadline: Sep, 2024)

Models

| Company | 2021-22 | 2023 | 2024 |
|---|---|---|---|
| Google | LaMDA, GLaM, PaLM, Chinchilla | Bard, PaLM-2, Gemini | Gemini 1.5, Gemma, Gemini 1.5 Flash, Gemma 2 |
| OpenAI | ChatGPT | GPT-4, GPT-4 Turbo | GPT-4o, GPT-4o mini, CriticGPT, o1-preview, o1-mini |
| MetaAI | Galactica | LLaMA, LLaMA2: HF, Purple Llama | LLaMA3, Llama 3.1, Llama 3.2, quantized Llama |
| Mistral AI | | Mistral 7B, Mixtral of experts | Mistral Large, Mistral Large 2, Mistral NeMo, Pixtral 12B, Ministral 3B and Ministral 8B |
| Stability AI | | Stable Vicuna, StableLM, Stable LM 3B, Stable Beluga, Stable Chat, Stable LM Zephyr 3B | Stable LM 2 1.6B, Stable LM 2 12B |
| Anthropic | RL-CAI | Claude, Claude2, Claude2.1 | Claude 3: Haiku, Sonnet & Opus, Claude 3.5 Sonnet |
| EleutherAI | GPT-J, GPT-NeoX, GPT Neo | Pythia | |
| BigScience | Bloom | | |
| Microsoft | | phi-1, phi-1.5, phi-2 | phi-3, phi-3.5 |
| Inflection AI | | Inflection-2 | Inflection-2.5 |
| Stanford | | Alpaca | |
| Berkeley-BAIR | | Koala | |
| Vicuna Team | | Vicuna | |
| TII | | Falcon | Falcon Mamba 7B |
| Cohere | | | Command R+, Rerank 3 |
| xAI | | | Grok-1, Grok-1.5, Grok-2 |
| NVIDIA | | | Nemotron-4 340B, Minitron-4B-Base, NVLM 1.0, Llama-3.1-Nemotron-70B-Instruct |
| AI21Lab | | | Jamba, Jamba 1.5 |
| Abacus.AI | | Giraffe | Smaug-72B-v0.1 |
| Alibaba Cloud | | | Qwen, Qwen2, Qwen2.5 |

Open Source Models

| Model | Company | Date | Notes |
|---|---|---|---|
| Qwen2.5 Family | Alibaba Cloud | 2024-09-19 | some versions |
| phi-3 | Microsoft | 2024-05-21 | |
| Qwen2 Family | Alibaba Cloud | 2024-06-07 | some versions |
| Llama Family | MetaAI | | |
| DBRX | Databricks | 2024-03-27 | a general purpose LLM |
| Gemma | Google | 2024-02-21 | |
| phi-2 | Microsoft | 2023-12-12 | |
| Mistral 7B | Mistral | 2023-09-27 | Apache 2.0 |

Chats & Assistants

| Chat | Company | Notes |
|---|---|---|
| Stable Assistant | Stability AI | latest text and image generation technology featuring Stable Diffusion 3, Stable Video, Stable Image Services and Stable LM 2 12B |
| Moshi | Kyutai | engaging conversations limited to five minutes, thinks and speaks at the same time |
| MetaAI | MetaAI | |
| character.ai | Character.AI | talk with fictional AI characters |
| POE | Quora | talk to ChatGPT, GPT-4, Claude 3 Opus, DALLE 3, and millions of others |
| Hume | Hume | empathic AI voice chat |
| Pi | Inflection AI | |
| Gemini | Google | |
| ChatRTX | Nvidia | runs locally on your PC |
| Le Chat | Mistral AI | |
| Copilot | Microsoft | |
| ChatGPT | OpenAI | |

  • Granite, Granite 3.0 - a family of open, performant, and trusted AI models, tailored for business and optimized to scale your AI applications, by IBM
  • Molmo - Multimodal Open Language Model, Molmo is small but punching well above its weight, by Ai2
  • Paperguide - AI Research Assistant, Reference Manager and Writing Assistant that help you understand papers, manage references, annotate/take notes, and supercharge your writing
  • Gemma Scope Demo - a beginner-friendly introduction to interpretability that explores an AI model called Gemma 2 2B. It also contains interesting and relevant content even for those already familiar with the topic
  • Hermes 3 - the latest version in our Hermes series, available in 3 sizes: 8B, 70B, and 405B parameters
  • SmolLM - a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset, by Hugging Face
  • SearchGPT - a temporary prototype of new AI search features that give you fast and timely answers with clear and relevant sources
  • InternLM 2.5 - outstanding reasoning capability, 1M context window, stronger tool use
  • FILM - a repo that helps you reproduce the results of FILM-7B, a 32K-context LLM that overcomes the lost-in-the-middle problem. FILM-7B is trained from Mistral-7B-Instruct-v0.2 by applying Information-Intensive (In2) Training, by Microsoft
  • gpt2-chatbots (aka GPT-4o)
  • Snowflake Arctic - an enterprise-focused LLM designed to provide cost-effective training and openness
  • Reka Core - Multimodal LLM
  • ChatFlow - a no-code platform that lets you set up an OpenAI-powered chatbot for your website
  • Perplexity - the AI-chatbot-powered search engine
  • Ferret - an end-to-end MLLM that accepts any-form referring and grounds anything in response, by Apple
  • NotebookLM - a powerful new interface that lets you shift effortlessly from reading to asking questions to writing, with an AI thought partner helping you at every turn
  • LLM360 - enables community-owned AGI through open-source large model research and development (K2-65B, CrystalCoder-7B, Amber-7B)
  • Amazon Titan - a breadth of high-performing image, multimodal, and text model choices, via a fully managed API, by AWS
  • Mirasol - a multimodal model for learning across audio, video, & text that decouples the modeling into separate autoregressive models to process the inputs according to the characteristics of their modalities, for state-of-the-art performance, by Google
  • UniIR - Universal Multimodal Information Retrievers, framework to learn a single retriever to accomplish (possibly) any retrieval task
  • Tulu-2-DPO model - RLHF method DPO scales to 70B parameters, clearly compare PEFT fine-tuning to full-parameter fine-tuning
  • Phind, Phind-70B - model that matches and exceeds GPT-4's coding abilities while running 5x faster
  • FacTool - a tool augmented framework for detecting factual errors of texts generated by LLMs. Factool now supports 4 tasks: knowledge-based QA, code generation, mathematical reasoning, scientific literature review
  • Nougat - Neural Optical Understanding for Academic Documents, a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language; its effectiveness is demonstrated on a new dataset of scientific documents, by MetaAI
  • TextFX - AI-powered tools for rappers, writers and wordsmiths
  • Prompt2Model - a system that takes a natural language task description (like the prompts used for LLMs such as ChatGPT) to train a small special-purpose model that is conducive for deployment
  • ToolBench - open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability
  • Platypus - a family of fine-tuned and merged LLMs that achieves the strongest performance and currently stands at first place in HuggingFace's Open LLM Leaderboard as of the release date of this work
  • OpenFlamingo V2 - an open-source effort to replicate DeepMind's Flamingo models
  • MetaGPT - a framework involving LLM-based multi-agents that encodes human standardized operating procedures (SOPs) to extend complex problem-solving capabilities that mimic efficient human workflows
  • Universal and Transferable Adversarial Attacks on Aligned Language Models
  • FlashAttention - an algorithm to speed up attention and reduce its memory footprint—without any approximation
  • Quivr - utilizes the power of Generative AI to store and retrieve unstructured information
  • LongLLaMA - an LLM capable of handling long contexts of 256k tokens or even more
  • OpenLLaMA - open source reproduction of MetaAI’s LLaMA
  • BuboGPT - an advanced LLM that incorporates multi-modal inputs including text, image and audio, with a unique ability to ground its responses to visual objects
  • LAION - Large-scale Artificial Intelligence Open Network
  • Dalai, Code - run LLaMA and Alpaca on your computer
  • LLaMAChat - allows you to chat with LLaMa, Alpaca and GPT4All models all running locally on your CPU
  • GPT4All, Code - an open-source assistant-style LLM that runs locally on your CPU
  • SdkVercelAI - you can input a prompt, pick different LLMs, and compare two side by side
  • ChatwithData.ai - AI tool that lets you extract valuable insights and information from data files effortlessly
  • Open Assistant - a completely open-source ChatGPT alternative
  • HuggingChat - first open-source alternative to ChatGPT Powered by Open Assistant's latest model
  • ChatPDF - chat with any PDF
  • PdfGPT - a tool where you can upload a PDF and get summaries and answers to your questions by OpenAI
  • Baize - an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself
  • Chameleon - a compositional reasoning framework designed to enhance LLMs and overcome their inherent limitations, such as outdated information and lack of precise reasoning

Watermarks

  • SynthID, SynthID Text - watermarks and identifies AI-generated content by embedding digital watermarks directly into AI-generated images, audio, text or video, by Google DeepMind and Hugging Face
  • Stable Signature - a new method for watermarking images, by MetaAI

Offline-Mode

  • msty - the easiest way to use local and online AI models
  • aider - AI pair programming in your terminal
  • Open Interpreter - an open-source, locally running implementation of OpenAI's Code Interpreter
  • ollama - get up and running with Llama 3, Mistral, Gemma, and other LLMs
  • OpenLLM - an open-source platform designed to facilitate the deployment and operation of LLMs in real-world applications
  • LM Studio - an easy way to run open-source LLMs locally
  • Jan - open-source ChatGPT alternative that runs 100% offline on your computer
  • Pinokio - a browser that lets you install, run, and programmatically control ANY application, automatically

Large Visual Language Models (LVLMs)

  • Qwen2-VL - latest version of the VLM based on Qwen2 in the Qwen model families: SoTA understanding of images of various resolutions & ratios; understanding videos of 20min+; agent that can operate your mobiles, robots, etc; multilingual support
  • Qwen-VL - multimodal version of the large model series. Accepts image, text, and bounding box as inputs, outputs text and bounding box
  • PaliGemma - a powerful open VLM inspired by PaLI-3, optimized for image captioning, visual Q&A and other image labeling tasks, by Google
  • Idefics2 - it can answer questions about images, describe visual content, create stories grounded in multiple images, extract information from documents, and perform basic arithmetic operations
  • Grok-1.5 Vision - can process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs, by xAI
  • CogVLM & CogAgent - an 18 billion parameter visual language model specializing in GUI understanding and navigation; supports high-resolution inputs (1120x1120) and shows abilities in tasks such as visual Q&A, visual grounding, and GUI Agent
  • AnyText - Multilingual Visual Text Generation And Editing
  • AnomalyGPT - the LVLM based Industrial Anomaly Detection (IAD) method that can detect anomalies in industrial images without the need for manually specified thresholds
  • IDEFICS - an open-access VLM based on Flamingo. The model accepts arbitrary sequences of image and text inputs and produces text outputs, aiming to bring transparency to AI systems and serve as a foundation for open research in multimodal AI systems
  • Prismer - a data- and parameter-efficient VLM that leverages an ensemble of diverse, pre-trained domain experts
  • MiniGPT-4 - upload an image, and then use chat to identify what's in the picture and learn more about it
  • MultiModal-GPT - a vision and language model for multi-round dialogue with humans; the model is fine-tuned from OpenFlamingo, with LoRA added in the cross-attention and self-attention parts of the language model
  • LLaVA - a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding
  • TaskMatrix - connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting

Evaluation

  • JailbreakBench - Repository of jailbreak artifacts, Standardized evaluation framework, Leaderboard, Dataset
  • SWE-bench Verified - a benchmark for evaluating LLMs’ abilities to solve real-world software issues sourced from GitHub, by OpenAI
  • SWE-bench - Can Language Models Resolve Real-world Github Issues?
  • promptbench - a Unified Library for Evaluating and Understanding LLM
  • Vibe-Eval - evaluation suite for measuring progress of multimodal language models, by Reka
  • FACET - FAirness in Computer Vision EvaluaTion - a new comprehensive benchmark for evaluating the fairness of computer vision models across classification, detection, instance segmentation, and visual grounding tasks
  • Arthur Bench - an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models
  • AgentBench - the first benchmark designed to evaluate LLM-as-Agent across a diverse spectrum of different environments
  • L-Eval - a comprehensive long-context language models evaluation suite with 18 long document tasks across multiple domains that require reasoning over long texts, including summarization, question answering, in-context learning with long CoT examples, topic retrieval, and paper writing assistance
  • OpenICL - an open-source toolkit for in-context learning and LLM evaluation; supports various state-of-the-art retrieval and inference methods, tasks, and zero-/few-shot evaluation of LLMs
  • OpenAGI - an open-source AGI research platform, specifically designed to offer complex, multi-step tasks and accompanied by task-specific datasets, evaluation metrics, and a diverse range of extensible models

Leaderboards

  • AgentBoard - a benchmark designed for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates
  • LLM Hallucination Index - A Ranking & Evaluation Framework For LLM Hallucinations
  • Artificial Analysis - Text to Image AI Model & Provider Leaderboard across quality, generation time, and price
  • SEAL Leaderboards - Safety, Evaluations and Alignment Lab: (i) generate code, (ii) work on Spanish-language inputs and outputs, (iii) follow detailed instructions, and (iv) solve fifth-grade math problems, by Scale AI
  • HELM - Holistic Evaluation of Language Models project - leaderboards with many scenarios, metrics, and models with support for multimodality and model-graded evaluation, by Stanford
  • vals.ai - an independent model testing service, developed benchmarks that rank LLM performance of tasks associated with income taxes, corporate finance, and contract law; it also maintains a pre-existing legal benchmark, by Vals AI
  • TrustLLM - a comprehensive study of Trustworthiness in LLMs
  • LMSYS Chatbot Arena - an open platform to evaluate LLMs by human preference in the real-world
  • Open LLM Leaderboard - evaluate models on 6 key benchmarks using the Eleuther AI Language Model Evaluation Harness, a unified framework to test generative language models on a large number of different evaluation tasks
  • LLM-Perf Leaderboard - benchmarks the performance (latency, throughput, memory & energy) of LLMs across different hardware, backends and optimizations using Optimum-Benchmark
  • Hallucinations Leaderboard - evaluates the propensity for hallucination in LLMs across a diverse array of tasks, including Closed-book Open-domain QA, Summarization, Reading Comprehension, Instruction Following, Fact-Checking, and Hallucination Detection
  • NPHardEval leaderboard - a benchmark for assessing the reasoning abilities of LLMs through the lens of computational complexity classes
  • LLM Safety Leaderboard - evaluates LLM safety and helps researchers and practitioners better understand the capabilities, limitations, and potential risks of LLMs
  • The Open Medical-LLM Leaderboard - aims to track, rank and evaluate the performance of LLMs on medical question answering tasks
  • TheFastest.AI - site that provides reliable measurements for the performance of popular models
  • GAIA Leaderboard - evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc)

Datasets

  • InfiMM-WebMath-40B Dataset - large-scale, open-source multimodal dataset specifically designed for mathematical reasoning tasks
  • MMMLU - Multilingual Massive Multitask Language Understanding
  • Natural Questions - contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question

Libraries

  • LangChain, docs - a framework for developing applications powered by language models
  • LlamaIndex, docs - a “data framework” to help you build LLM apps
  • LLaMA2-Accessory - an open-source toolkit for pre-training, fine-tuning and deployment of LLMs and multimodal LLMs
  • LLaMA-Adapter - a lightweight adaption method for fine-tuning Instruction-following and Multi-modal LLaMA models
  • streaming-llm - Efficient Streaming Language Models with Attention Sinks
  • llamafile - run LLMs with a single file
  • outlines, docs - a library to write reliable programs for interactions with generative models: language models, diffusers, multimodal models, classifiers, etc
  • OneLLM - One Framework to Align All Modalities with Language
  • guidance - interleave generation, prompting, and logical control into a single continuous flow matching how the language model actually processes the text
  • nanoGPT - the simplest, fastest repository for training/finetuning medium-sized GPTs
  • TorchScale - a PyTorch library that allows researchers and developers to scale up Transformers efficiently and effectively
  • InvokeAI - an implementation of Stable Diffusion, the open source text-to-image and image-to-image generator
  • ComfyUI - a powerful and modular Stable Diffusion GUI and backend. This UI will let you design and execute advanced stable diffusion pipelines using a graph/nodes/flowchart based interface
  • StableSwarmUI - Modular Stable Diffusion Web-User-Interface, with an emphasis on making powertools easily accessible, high performance, and extensibility
  • Wanda - Pruning LLMs by Weights and Activation: removes weights on a per-output basis, by the product of weight magnitudes and input activation norms
  • LOMO: LOw-Memory Optimization - a new optimizer, which fuses the gradient computation and the parameter update in one step to reduce memory usage
  • LMFlow - an extensible, convenient, and efficient toolbox for finetuning large machine learning models, designed to be user-friendly, speedy and reliable, and accessible to the entire community
  • Heron - a library that seamlessly integrates multiple Vision and Language models, as well as Video and Language models. Additionally, we provide pretrained weights trained on various datasets
  • Curated Transformers - a transformer library for PyTorch. It provides state-of-the-art models that are composed from a set of reusable components, by Explosion
  • spacy-llm - integrates LLMs into spaCy, featuring a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks, no training data required, by Explosion
  • Medusa - a simple framework that democratizes the acceleration techniques for LLM generation with multiple decoding heads
  • Self-RAG - a new framework to train an arbitrary LM to learn to retrieve, generate, and critique to enhance the factuality and quality of generations, without hurting the versatility of LLMs
  • Mirascope, docs - a toolkit for developing production-ready LLM-powered tools using Python and Pydantic
  • gateway - route to 100+ open & closed source models with a unified API. It is also production-ready with support for caching, fallbacks, retries, timeouts, load balancing, and can be edge-deployed for minimum latency
  • corenet - a library for training deep neural networks for a variety of tasks, including foundation models (e.g., CLIP and LLM), object classification, object detection, and semantic segmentation
  • MONSTER API - a platform for no code LLM fine tuning and deployments
  • Lamini Platform - an LLM platform that seamlessly integrates every step of the model refinement and deployment process – making model selection, model tuning and inference usage incredibly straightforward for your dev team
  • PowerInfer - a CPU/GPU LLM inference engine leveraging activation locality for your device
  • mixtral-offloading - efficient inference of Mixtral-8x7B models
  • bitnet.cpp - the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next)
  • LayerSkip - end-to-end solution promises to accelerate LLM generation times without the need for specialized hardware, by MetaAI
  • Lingua - a lean, efficient, and easy-to-hack codebase to research LLMs, by MetaAI
  • fairchem - the FAIR Chemistry's centralized repository of all its data, models, demos, and application efforts for materials science and quantum chemistry, by MetaAI

Agents

  • swarm - educational framework exploring ergonomic, lightweight multi-agent orchestration, by OpenAI
  • Agent-S - an open agentic framework that uses computers like a human
  • TEN-Agent - a real-time multimodal agent integrated with the OpenAI Realtime API, RTC, and features weather checks, web search, vision, and RAG capabilities
  • bee-agent-framework - open-source framework for building, deploying, and serving powerful agentic workflows at scale
  • agent.exe - the easiest way to let Claude's new computer use capabilities take over your computer
  • Pearl - a production-ready RL AI Agent Library, by MetaAI
  • OpenAgents - an open platform for using and hosting language agents in the wild of everyday life
  • agents - an open-source library/framework for building autonomous language agents
  • ChatDev - highly customizable and extendable framework, which is based on LLMs and serves as an ideal scenario for studying collective intelligence
  • JARVIS-1 - Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models, generate sophisticated plans, and perform embodied control, within the open-world Minecraft universe
  • AppAgent - Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps

Devices

  • NotePin - wearable AI memory capsule, by Plaud
  • biped.ai - an AI wearable vest that helps blind and visually impaired people avoid obstacles, follow GPS instructions, and find crosswalks or doors
  • LPU Inference Engine - Language Processing Units, by Groq
  • FigureAI - AI robotics company bringing a general purpose humanoid to life
  • SanctuaryAI - company on a mission to create the world’s first human-like intelligence in general-purpose robots
  • Mytra - warehouse robotics
  • friend - AI-Powered Necklace companion designed not to help you get things done but to be there for you—anytime, anywhere
  • Limitless - personalized AI powered by what you’ve seen, said, and heard
  • rabbit r1 - a personalized operating system through a natural language interface
  • 01 Project - the open-source language model computer, by Open Interpreter

Glasses

  • G1, by evenrealities
  • AirGo Vision - Audio Smartglasses powered by ChatGPT, by Solosglasses
  • Ray-Ban Meta Smart Glasses - a 12 MP camera and five-mic system, updates, by Ray-Ban & MetaAI
  • Frame - AI glasses designed to be worn as a pair of glasses with a suite of AI capabilities out of the box, by Brilliant Labs
  • air2, by xreal
  • TCL RayNeo X2 - AR Glasses, by RayNeo

Income

  • Poe - price-per-message revenue model for AI bot creators
  • GPTs Store - create custom versions of ChatGPT that combine instructions, extra knowledge, and any combination of skills
  • Voice Library - share your voice in the Voice Library today and earn cash rewards when it's used
  • HuggingChat - making the community's best AI chat models available to everyone

Tools

Text-to-Image Text-to-Music Text-to-Video Games Brand Prompt Generator
Midjourney Mubert fal.ai Leonardo.Ai - Assets Flair G-prompter
Adobe Firefly Waveformer PIKA LABS Dreamlab - Animated Sprites Logolivery Prompt Builder
Catbird Morph Studio Kaiber Didimo Midjourney PromptHelper1
BlueWillow Invidio Scenario - Assets Midjourney PromptHelper2
Lexica Moonvalley Skybox - World-building FlowGPT
Imgcreator ilumine AI Bezi - 3D Assets Anthropic
Craiyon LTX Studio Charmed - 3D Assets

Text-to-image

| Company | Models |
|---|---|
| Google | Muse, Imagen, Parti, HyperDreamBooth, DreamBooth, StyleDrop, Imagen 2, ImageFX, Imagen 3 |
| OpenAI | CLIP, DALL·E, DALL·E 2, DALL·E 3 |
| MetaAI | CM3leon, Emu Video, Emu Edit, Imagine |
| Stability.ai | Stable Diffusion XL, DreamStudio, Clipdrop, DeepFloyd IF: (Code, Demo: HF), SDXL Turbo, Stable Cascade, Stable Diffusion 3, Stable Diffusion 3 Medium, Adversarial Diffusion Distillation, Stable Diffusion 3.5 |
| Black Forest Labs | FLUX.1, FLUX1.1 [pro], FLUX1.1 [pro] Ultra |
| Playground | Playground v2, Playground v3 |

  • Ideogram - AI tools that will make creative expression more accessible, fun, and efficient
  • Kolors - a large-scale text-to-image generation model based on latent diffusion, by the Kuaishou Kolors team
  • StoryDiffusion - Consistent Self-Attention for Long-Range Image and Video Generation
  • Ilus AI - AI illustration generator
  • Improving Diffusion Models for Authentic Virtual Try-on in the Wild - image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively
  • Distribution Matching Distillation - one-step generator achieves comparable image quality with StableDiffusion v1.5 while being 30x faster
  • Generative Powers of Ten - a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches
  • Delta Denoising Score - a novel scoring function for text-based image editing that guides minimal modifications of an input image towards the content described in a target prompt
  • Prompt-to-Prompt - editing framework, where the edits are controlled by text only
  • OpenCLIP - an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training)
  • LEDITS - combined lightweight approach for real-image editing, incorporating the Edit Friendly DDPM inversion technique with Semantic Guidance, thus extending Semantic Guidance to real image editing, while harnessing the editing capabilities of DDPM inversion
  • Würstchen - Fast Diffusion for Image Generation
  • ExactlyAI - create images in seconds with an AI that understands your style
  • ConceptLab - generative models have enabled us to transform our words into vibrant, captivating imagery
  • IP-Adapter - Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
  • MatchAI - a powerful web app that can copy the color grading from images so you can apply it to your own, by color.io
  • Picogen - unofficial API to Midjourney AI, Stability AI and DALLE-2 AI
  • FABRIC - Feedback via Attention-Based Reference Image Conditioning - a technique to incorporate iterative feedback into the generative process of diffusion models based on StableDiffusion
  • Controlling Text-to-Image Diffusion by Orthogonal Finetuning (OFT) - for adapting text-to-image diffusion models to downstream tasks
  • InstructPix2Pix Learning to Follow Image Editing Instructions - a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image
  • Composer - a large (5 billion parameters) controllable diffusion model trained on billions of (text, image) pairs. It can exponentially expand the control space through composition, leading to an enormous number of ways to generate and manipulate images, i.e., making the infinite use of finite means
  • GigaGAN: Large-scale GAN for Text-to-Image Synthesis - changing texture with prompting, changing style with prompting, by Adobe Research

Images

  • Sakana AI - drops image models to generate Japan’s traditional ukiyo-e artwork
  • PaintsUndo - A Base Model of Drawing Behaviors in Digital Paintings
  • SkyReels - generate comics from stories or files you upload
  • PhotoMaker - Customizing Realistic Human Photos via Stacked ID Embedding
  • DeWatermark - Remove Watermark from photos online free with AI; Upscales - Upscale Images with AI up to 4K
  • NSF - Neural Spline Fields for Burst Image Fusion and Layer Separation
  • Material Palette - a method to extract Physically-Based-Rendering (PBR) materials from a single real-world image
  • DiffusionLight - a simple yet effective technique to estimate lighting in a single input image
  • Magnific - the image Upscaler & Enhancer
  • wasitai - check if an image was generated by a machine
  • Textify - a tool for replacing the gibberish in AI-generated images with your desired text
  • Interpolating between Images with Diffusion Models - a method for zero-shot controllable interpolation using latent diffusion models
  • AnyDoor: Zero-shot Object-level Image Customization - a diffusion-based image generator with the power to move target objects to new scenes at user-specified locations in a harmonious way
  • Matting Anything, Code, Demo: HF - an efficient and versatile framework for estimating the alpha matte of any instance in an image with user-prompt guidance
  • Plug-and-Play, Code - large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, allowing us to synthesize diverse images that convey highly complex visual concepts
  • Real-Time Neural Appearance Models - a complete system for real-time rendering of scenes with complex appearance previously reserved for offline use, by NVIDIA
  • Designer - generate stunning designs and original images just by typing what you want. Get writing assistance and automatic layout suggestions for anything you add. Designer expands preview with new AI design features, by Microsoft.
  • Scribble Diffusion - turn your sketch into a refined image using AI
  • StudioGPT - a tool for reimagining an existing image

Computer Vision

  • Depth-Anything - a depth estimation solution that can deal with any images under any circumstance
  • TAO-Amodal - a benchmark dataset that includes amodal and modal bounding boxes for visible and occluded objects
  • OMG-Seg - One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation
  • PUG (Photorealistic Unreal Graphics) - 3 datasets for representation learning research
  • Tracking Anything in High Quality - a framework for high performance video object tracking and segmentation
  • DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data - a new benchmark of synthetic image triplets that span a wide range of mid-level variations, labeled with human similarity judgments
  • CoTracker, CoTracker3 - an architecture that jointly tracks multiple points throughout an entire video, by MetaAI
  • TAPIR - a model for Tracking Any Point (TAP) that effectively tracks a query point in a video sequence, by Google DeepMind
  • DreamTeacher - a self-supervised feature representation learning framework that utilizes generative networks for pre-training downstream image backbones, by NVIDIA
  • ImageBind, Demo, Code - Image->Audio, Audio->Image, Text->Image&Audio, Audio&Image->Image, Audio->Generated Image, by MetaAI
  • V-JEPA - Video Joint Embedding Predictive Architecture is an early example of a physical world model that excels at detecting and understanding highly detailed interactions between objects
  • I-JEPA, Code - Image Joint Embedding Predictive Architecture is a method for self-supervised learning. At a high level, I-JEPA predicts the representations of part of an image from the representations of other parts of the same image
  • Visual Prompting - an innovative approach that takes text prompting, used in applications such as ChatGPT, to computer vision
  • Tracking Everything Everywhere All at Once - a new test-time optimization method for estimating dense and long-range motion from a video sequence
  • Track-Anything - a flexible and interactive tool for video object tracking and segmentation. Built on Segment Anything, it lets you specify anything to track and segment via user clicks only
  • EdgeSAM - an accelerated variant of the SAM, optimized for efficient execution on edge devices with minimal compromise in performance
  • EfficientSAM - light-weight SAM models that exhibit decent performance with largely reduced complexity, by MetaAI
  • SAM2 - the next generation of Segment Anything Model for videos and images, by MetaAI
  • SAM, Blog: Introducing SAM, Code - Segment Anything Model is a new AI model that can "cut out" any object, in any image, with a single click. SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training, by MetaAI
  • DINOv2 - a new method for training high-performance, state-of-the-art CV models with self-supervised learning
  • Behind the Scenes: Density Fields for Single View Reconstruction - a neural network that predicts an implicit density field from a single image

Video & Animation

  • Meta Movie Gen - our latest research breakthroughs demonstrate how you can use simple text inputs to produce custom videos and sounds, edit existing videos or transform your personal image into a unique video
  • Mochi 1 - an open-source model for generating high-quality videos from text prompts, by genmo
  • Haiper - simplifies video creation with text-to-video, image-to-video, and video enhancement options
  • Hailuo AI - Image-to-Video
  • Krea - generate images and videos (Luma, Runway, Kling, Hailuo, Pika) with a delightful AI-powered design tool
  • Pyramid Flow - a training-efficient Autoregressive Video Generation model based on Flow Matching
  • Videolulu - create engaging content in popular formats for TikTok, Instagram, and YouTube
  • GoVidify - an AI-powered tool that turns your written content into short-form videos for TikTok, YouTube, and Instagram
  • hotshot - a large-scale diffusion transformer model that serves as the foundation for our upcoming consumer product
  • ClipAnything - the first-ever multimodal AI clipping that lets you clip any moment from any video using visual, audio, and sentiment cues, by Opus
  • Text2Infographic - converts your written content into eye-catching infographics without any need for design skills
  • Flow Studio - uses AI to transform your text prompts into visually captivating short films and videos
  • LivePortrait - Efficient Portrait Animation with Stitching and Retargeting Control
  • Odyssey - Hollywood-grade visual AI
  • VideoPoet - a large language model for zero-shot video generation, by Google Research
  • Character-1 - a model that lets you create lip-synced videos to any audio from a still image; imagine worlds, characters and stories with complete creative control, by Hedra
  • GEN-1 & Research, GEN-2 & Research, GEN-3-alpha & Research - a new frontier for high-fidelity, controllable video generation. It is a major improvement in fidelity, consistency, and motion over Gen-2, and a step towards building General World Models, by Runway
  • Showrunner - AI platform designed to let you create an animated TV episode with just a prompt
  • Luma Dream Machine - an AI model that makes high quality, realistic videos fast from text and images, by Luma
  • Kling - video generation with enhanced features and quality
  • ToonCrafter - interpolate two cartoon images by leveraging the pre-trained image-to-video diffusion priors
  • VideoFX - a new experimental tool powered by Veo. It’s designed to help support creatives through the storytelling journey, by Google
  • Veo - generates high-quality 1080p resolution videos in a wide range of cinematic and visual styles that can go beyond a minute, by Google
  • VideoGigaGAN: Towards Detail-rich Video Super-Resolution - a generative VSR model that can produce videos with high-frequency details and temporal consistency, by Adobe Research
  • VASA-1 - Lifelike Audio-Driven Talking Faces Generated in Real Time, by Microsoft
  • MagicTime - Time-lapse Video Generation Models as Metamorphic Simulators
  • Stable Video Diffusion - a foundation model for generative video based on the image model Stable Diffusion
  • EMO - Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
  • SORA - a model (a latent diffusion model that learned to transform noise into videos using an encoder-decoder and transformer) that can create realistic and imaginative scenes from text instructions, by OpenAI
  • LUMIERE - A Space-Time Diffusion Model for Video Generation: Text-to-Video, Image-to-Video, Stylized Generation, Video Stylization, Cinemagraphs, Video Inpainting
  • ActAnywhere - Subject-Aware Video Background Generation
  • MagicVideo-V2 - integrates the text-to-image model, video motion generator, reference image embedding module and frame interpolation module into an end-to-end video generation pipeline
  • I2VGen-XL - High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
  • StreamDiffusion - an innovative diffusion pipeline designed for real-time interactive generation
  • WALT - Window Attention Latent Transformer - a transformer-based method for latent video diffusion models (LVDMs)
  • Hotshot - GIF generator
  • Unscreen - remove video background
  • Motrica - technologies and tools for advanced character animation
  • CoDeF - Content Deformation Fields for Temporally Consistent Video Processing
  • MagicEdit - supports various editing applications, including video stylization, local editing, video-MagicMix and video outpainting
  • To Infinity and Beyond - an approach to generating high-quality episodic content for IPs (Intellectual Property) using LLMs, custom state-of-the-art diffusion models and our multi-agent simulation for contextualization, story progression and behavioral control
  • PlazmaPunk - create your own music video with the power of AI
  • Video-LLaMA, Code, Demo: HF - a multi-modal LLM that achieves video-grounded conversations between humans and computers by connecting a language decoder with off-the-shelf unimodal pre-trained models
  • AnimateDiff prompt travel - AnimateDiff with prompt travel + ControlNet + IP-Adapter
  • AnimateDiff, Code - Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
  • Animate-A-Story - a video storytelling approach which can synthesize high-quality, structure-controlled, and character-controlled videos
  • Zeroscope - a watermark-free Modelscope-based video model optimized for producing high-quality 16:9 compositions and a smooth video output
  • Klap - a tool that analyzes the video and finds short clips
  • Lalamu - low-quality video lip sync with preselected videos/video templates (take clips from videos, give the video new audio, and then the lips will sync up to that new audio within the video)
  • D-ID - uses generative AI to create customized videos featuring talking avatars at a touch of a button for businesses and creators.
  • Rooms.xyz - create & remix interactive rooms from your browser
  • Wonder Dynamics - an AI tool that automatically animates, lights, and composes CG characters into a live-action scene
  • REVELxyz - a tool for creating Animated Avatars from a single photo
  • ANIMATED DRAWINGS - a tool that brings children's drawings to life, by animating characters to move around, by MetaAI
  • RERENDER A VIDEO, Demo: HF - a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos
  • Roop, Code - take a video and replace the face in it with a face of your choice. You only need one image of the desired face
  • Text2Performer - Text-Driven Human Video Generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer
  • DragGAN, Code, Demo: HF - a way of controlling GANs, that is, to "drag" any points of the image to precisely reach target points in a user-interactive manner. Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc
  • DragDiffusion - Harnessing Diffusion Models for Interactive Point-based Image Editing
  • In-N-Out: Face Video Inversion and Editing with Volumetric Decomposition - our core idea is to represent the face in a video using two neural radiance fields, one for in-distribution and the other for out-of-distribution data, and compose them together for reconstruction
  • High-Resolution Video Synthesis with Latent Diffusion Models - Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space, by NVIDIA

3D

  • cadwithai - a tool that allows users to create and edit CAD models using an AI chatbot to enhance efficiency and creativity in design work
  • Meshy - create stunning 3D models with AI
  • Generative 3D API Toolkit - generate 3D models, materials, and HDRIs at the speed of your imagination. Supercharge your 3D workflow with our groundbreaking Gen3D toolkit from Shutterstock powered by NVIDIA
  • Stable Fast 3D - generates high-quality 3D assets from a single image in just 0.5 seconds
  • Stable Video 4D - transforms a single object video into multiple novel-view videos of eight different angles/views
  • VGGHeads - A Large-Scale Synthetic Dataset for 3D Human Heads
  • CharacterGen - Efficient 3D Character Generation from Single Images with Multi-View Pose Calibration
  • 3D Gen - fast pipeline for text-to-3D asset generation. 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute, by MetaAI
  • InstantMesh - Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models
  • Spline - Generate 3D objects from text prompts and images
  • SIMA - a Scalable Instructable Multiworld Agent (SIMA) that can follow natural-language instructions to carry out tasks in a variety of video game settings
  • Stable Video 3D - Quality Novel View Synthesis and 3D Generation from Single Images, by Stability AI
  • TripoSR - Fast 3D Object Generation from Single Images, by Stability AI
  • BlendNeRF - 3D-aware Blending with Generative NeRFs
  • 4DGen - Grounded 4D Content Generation with Spatial-temporal Consistency
  • MobileBrick - Building LEGO for 3D Reconstruction on Mobile Devices. A novel data capturing and 3D annotation pipeline to obtain precise 3D ground-truth shapes without relying on expensive 3D scanners
  • PoseGPT - Chatting about 3D Human Pose
  • ProlificDreamer - High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
  • Stable Zero123 - 3D Object Generation from Single Images
  • SMERF - Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration
  • DreamCraft3D - a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects
  • Genie - 3D foundational model, by Lumalabs
  • Masterpiece X - the generative text-to-3D app that allows users to create 3D objects and characters complete with mesh, texture, and animations
  • GAUSSIAN SPLAT - a rasterization technique for 3D reconstruction and rendering
  • SyncDreamer - generating multiview-consistent images from a single-view image
  • MAV3D (Make-A-Video3D) - a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model
  • HiFA - High-fidelity Text-to-3D with Advanced Diffusion Guidance
  • AutoRecon - a framework for the automated discovery and reconstruction of an object from multi-view images
  • BITE - enables 3D shape and pose estimation of dogs from a single input image. The model handles a wide range of shapes and breeds, as well as challenging postures far from the available training poses, like sitting or lying on the ground
  • CSM (Common Sense Machines) - generate your own textured 3D assets
  • MotionGPT: Human Motion as Foreign Language - a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks
  • PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360° - the first 3D-aware generative model that enables high-quality view-consistent image synthesis of full heads in 360° with diverse appearance and detailed geometry using only in-the-wild unstructured images for training
  • AvatarBooth - a text-to-3D model. It creates an animatable 3D model from your text description. It can also generate a customized model from 4~6 photos taken with your phone or from a character design generated by a diffusion model
  • Infinigen, Code - a procedural generator of 3D scenes, creating depth maps and labeling every aspect of the world it generates, by Princeton Vision & Learning Lab
  • USD - Universal Scene Description - an open and extensible framework and ecosystem for describing, composing, simulating and collaborating within 3D worlds, originally developed by Pixar Animation Studios
  • Shap-E: Demo, Code - a conditional generative model for 3D assets, by OpenAI
  • Neural Kernel Surface Reconstruction, Code - a novel method for reconstructing a 3D implicit surface from a large-scale, sparse, and noisy point cloud, by NVIDIA
  • Neuralangelo - a framework for high-fidelity 3D surface reconstruction from RGB video captures. Using ubiquitous mobile devices, we enable users to create digital twins of both object-centric and large-scale real-world scenes with highly detailed 3D geometry, by NVIDIA
  • Rodin Diffusion - a Generative Model for Sculpting 3D Digital Avatars, by Microsoft
  • 3D Gaussian Splatting for Real-Time Radiance Field Rendering - three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (≥ 100 fps) novel-view synthesis at 1080p resolution
  • ConsistentNeRF - a method that leverages depth information to regularize both multi-view and single-view 3D consistency among pixels
  • Text2NeRF - a text-driven 3D scene generation framework, combines the neural radiance field (NeRF) and a pre-trained text-to-image diffusion model to generate diverse view-consistent indoor and outdoor 3D scenes from natural language descriptions
  • Zip-NeRF - a technique that combines mip-NeRF 360 and grid-based models such as Instant NGP
  • S-NeRF - a new street-view NeRF (S-NeRF) that considers novel view synthesis of both the large-scale background scenes and the foreground moving vehicles jointly
  • Mip-NeRF 360 - Unbounded Anti-Aliased Neural Radiance Fields, an extension of mip-NeRF that uses a non-linear scene parameterization, online distillation, and a novel distortion-based regularizer to overcome the challenges presented by unbounded scenes
  • 3D-aware Conditional Image Synthesis - a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model synthesizes a photo from different viewpoints
  • Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior - can create high-fidelity 3D content from only a single image
  • Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models - generates textured 3D meshes from a given text prompt using 2D text-to-image models
  • Objaverse-XL - an open dataset of over 10 million 3D objects
  • OmniObject3D - a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects to facilitate the development of 3D perception, reconstruction, and generation in the real world

Audio & Speech & Music

  • Spirit LM - a foundation multimodal language model that freely mixes text and speech
  • Audiobox - generate voices and sound effects using a combination of voice inputs and natural language text prompts — making it easy to create custom audio for a wide range of use cases
  • Seamless - system that unlocks expressive cross-lingual communication in real time
  • SeamlessM4T - a foundational multilingual and multitask model that seamlessly translates and transcribes across speech and text: automatic speech recognition, speech-to-text and speech-to-speech translation, text-to-text and text-to-speech translation
  • AudioCraft - simple framework that generates high-quality, realistic audio and music from text-based user inputs after training on raw audio signals as opposed to MIDI or piano rolls
    • MusicGen, Demo: HF, Code - a simple and controllable model for music generation
    • AudioGen - an auto-regressive generative model that generates audio samples conditioned on text inputs
    • EnCodec - a neural network that is trained end to end to reconstruct the input signal
  • MuAViC - a Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
  • Voicebox - Text-Guided Multilingual Universal Speech Generation at Scale

Google

  • V2A - video-to-audio research uses video pixels and text prompts to generate rich soundtracks
  • MusicFX - a new experimental tool that enables users to generate their own music using AI
  • SingSong - a system which generates instrumental music to accompany input vocals
  • Translatotron 3 - unsupervised speech-to-speech translation from monolingual data
  • AudioPaLM - an LLM for speech understanding and generation
  • MusicLM, Demo - a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff"
  • Universal Speech Model (USM) - a state-of-the-art speech AI for 100+ languages

ElevenLabs
  • Sound Effects - create distinctive sound effects directly from text descriptions, streamlining your audio production process
  • Dubbing Studio - a tool enabling automatic, end-to-end video translation across 29 languages, with hands-on control over transcript, translation, timing, and more
  • Speech to Speech - a tool that lets you turn the recording of one voice to sound as if spoken by another
  • Eleven Multilingual v2 - a Foundational AI Speech Model for Nearly 30 Languages
  • Eleven Multilingual v1, Demo - generate top-quality spoken audio in any voice and style with the most advanced and multipurpose AI speech tool out there
  • AI Speech Classifier, Demo - detect whether an audio clip was created using ElevenLabs

Other

  • Qwen2-Audio - capable of accepting audio and text inputs and generating text outputs
  • Neutone Morpho - with pre-trained AI models, you can transform any incoming audio into the characteristics, or “style”, of the sounds that the model is based on
  • Lazybird - AI-powered voice over generator – perfect for videos, podcasts, audiobooks, and educational content
  • Stable Audio Open - an open source text-to-audio model for generating up to 47 seconds of samples and sound effects, by Stability AI
  • AI Jukebox - a free in-browser text-to-music generation tool
  • Chatter - an interactive podcast, by Hume
  • OpenVoice, OpenVoice2 - a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages
  • Voice Engine - a model for creating custom voices, by OpenAI
  • Udio - discover, create, and share music with the world
  • Image to SFX - compare sound effects generation models from image caption
  • DubbingAI - an AI tool that can convert your voice into high-quality cloned voices, from celebrities to your favorite gaming characters, in real time
  • Lyria - AI music generation model
  • StockMusic - a platform for AI-generated tunes that allows you to generate up to 10 minutes of copyright-free music
  • Stable Audio, Stable Audio 2.0 - a system that generates music and sound effects from text, by Stability AI
  • RIFFUSION - a model that generates images of spectrograms, which can then be converted to audio clips
  • CLAP - you can extract a latent representation of any given audio and text for your own model, or for different downstream tasks
  • Vscoped - effortlessly transcribe your video content to boost click-through rates and watch time
  • MERT, Code, Demo: HF - an Acoustic Music Understanding Model with Large-Scale Self-supervised Training
  • Ecoute - a live transcription tool that provides real-time transcripts for both the user's microphone input (You) and the user's speakers output (Speaker) in a textbox. It also generates a suggested response using OpenAI's GPT-3.5 for the user to say based on the live transcription of the conversation
  • SadTalker: Demo - Stylized Audio-Driven Single Image Talking Face Animation
  • Recast - turn your want-to-read articles into rich audio summaries
  • AudioGPT, Demo: HuggingFace, Code - Understanding and Generating Speech, Music, Sound, and Talking Head
  • Chirp - music model, generates realistic audio - including speech, music and sound effects
  • Bark - a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communication like laughing, sighing and crying
  • Whisper - an automatic speech recognition (ASR) system that approaches human level robustness and accuracy on English speech recognition
  • Musicfy - music like you've never heard. Create and discover AI covers of your favorite songs
  • Jukebox - a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles, by OpenAI
  • Koe Recast - transform your voice using AI

Code & Math

Organization | Code | Math
Mistral AI | Codestral, Codestral Mamba | MathΣtral
Stability AI | StableCode, Stable Code 3B, Stable Code Instruct 3B |
Google DeepMind | | FunSearch, AlphaGeometry
Salesforce | CodeT5 & CodeT5+, CodeGen2.5 |
Alibaba Cloud | CodeQwen1.5, Qwen2.5-Coder | Qwen2-Math, Qwen2.5-Math
  • Genie - AI software engineer - achieving a 30% eval score on the industry standard benchmark SWE-Bench. Genie is a fine-tuned version of GPT-4o with a larger context window of undisclosed size. Genie is able to solve bugs, build features, refactor code, and everything in between either fully autonomously or paired with the user, like working with a colleague, not just a copilot
  • Devin - first fully autonomous AI software engineer
  • The AI Scientist - Towards Fully Automated Open-Ended Scientific Discovery
  • Dracarys - a new family of open LLMs for coding, by Abacus.AI
  • MathPile - a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens
  • magicoder - a model family empowered by OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets for generating low-bias and high-quality instruction data for code
  • LearnLM - a family of models fine-tuned for learning, and grounded in educational research to make teaching and learning experiences more active, personal and engaging, by Google
  • Llemma - an open language model for mathematics (repository also contains submodules related to the overlap, fine-tuning, and theorem proving experiments described in the paper)
  • AlphaCodium - a test-based, multi-stage, code-oriented iterative flow, that improves the performances of LLMs on code problems
  • sketch-2-app - generate code based on sketch
  • GPT Pilot - a true AI developer that writes code, debugs it, talks to you when it needs help, etc
  • MAmmoTH - a series of open-source LLMs specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset
  • WrenAI - open-source Text-to-SQL solution for data teams to get results and insights faster by asking business questions without writing SQL
  • Defog - a state-of-the-art LLM for converting natural language questions to SQL queries, which outperforms major open-source models and slightly outperforms gpt-3
  • v0 - a generative user interface system. It generates copy-and-paste friendly React code based on Shadcn UI and Tailwind CSS that people can use in their projects, by Vercel Labs
  • SafeCoder - a code assistant solution built for the enterprise. In marketing speak: “your own on-prem GitHub copilot”, by Hugging Face
  • Code Llama - a state-of-the-art LLM capable of generating code, and natural language about code, from both code and natural language prompts, by MetaAI
  • Teaching Arithmetic to Small Transformers - small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective
  • InterCode - framework of interactive coding as a standard reinforcement learning (RL) environment, with code as actions and execution feedback as observations
  • LeanDojo - a set of open-source LLM-based theorem provers built without any proprietary datasets and released under a permissive MIT license to facilitate further research
  • GPT Engineer - made to be easy to adapt, extend, and make your agent learn how you want your code to look. It generates an entire codebase based on a prompt
  • CodeTF - a one-stop Python transformer-based library for code large language models (Code LLMs) and code intelligence, provides a seamless interface for training and inferencing on code intelligence tasks like code summarization, translation, code generation and so on. It aims to facilitate easy integration of SOTA CodeLLMs into real-world applications
  • Let’s Verify Step by Step - a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”), by OpenAI
  • 🦍 Gorilla: LLM Connected with Massive APIs - a finetuned LLaMA-based model that surpasses GPT-4 on writing API calls
  • Framer - a tool that constructs a completely unique website for you based on a text prompt
  • Pico - a tool that uses GPT-4 to instantly build simple, shareable web apps
  • dropbase - build and prototype web apps faster with AI

Games

  • ExistAI - games from text
  • Genie - a foundation world model trained from Internet videos that can generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches, by Google DeepMind
  • PokemonRedExperiments - train RL agents to play Pokemon Red
  • BitMagic - game creation
  • AI Town - a deployable starter kit for building and customizing your own version of AI town - a virtual town where AI characters live, chat and socialize
  • Generative Agents: Interactive Simulacra of Human Behavior - contains our core simulation module for generative agents—computational agents that simulate believable human behaviors—and their game environment
  • STEVE-1 - a Generative Model for Text-to-Behavior in Minecraft
  • Mastering Stratego - DeepNash, an AI agent that learned the game from scratch to a human expert level by playing against itself
  • Voyager: An Open-Ended Embodied Agent with LLMs - the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention

Robotics

  • Open-TeleVision - an open-sourced immersive teleoperation system with stereo visual feedback. Robots executing highly precise, extremely long-horizon tasks with high success rate, autonomously
  • LeRobot - aims to provide models, datasets, and tools for real-world robotics in PyTorch
  • DrEureka - Language Model Guided Sim-To-Real Transfer
  • UniSim - a real-world simulator with applications ranging from controllable content creation in games and movies to training embodied agents purely in simulation that can be directly deployed in the real world
  • JAT (Jack of All Trades) - a transformer-based agent capable of playing video games, controlling a robot to perform a wide variety of tasks, understanding and executing commands in a simple navigation environment
  • Dobb·E - an open-source, general framework for learning household robotic manipulation
  • OpenEQA - from word models to world models, by MetaAI
  • Mobile ALOHA - Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation, by Stanford
  • AutoRT, SARA-RT and RT-Trajectory - by Google DeepMind
  • Robot Parkour Learning - a system for learning a single end-to-end vision-based parkour policy of diverse parkour skills using a simple reward without any reference motion data
  • Open X-Embodiment - Robotic Learning Datasets and RT-X Models
  • Eureka - a human-level reward design algorithm powered by LLMs, by NVIDIA
  • Language to rewards for robotic skill synthesis - an approach to teaching robots novel actions through natural language input, using reward functions as an interface to bridge the gap between language and low-level robot actions
  • VIMA - General Robot Manipulation with Multimodal Prompts
  • RT-2 - a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control, by Google DeepMind
  • Robots That Ask For Help - a framework for measuring and aligning the uncertainty of LLM-based planners such that they know when they don't know and ask for help when needed
  • ViNT: A Foundation Model for Visual Navigation - a goal-conditioned navigation policy trained on diverse, cross-embodiment data that can control many different robots zero-shot
  • Navigating to Objects in the Real World
  • RVT: Robotic View Transformer - a multi-view transformer for 3D manipulation that is both scalable and accurate. RVT takes camera images and task language description as inputs and predicts the gripper pose action, by NVIDIA
  • TidyBot - personalized Robot Assistance with Large Language Models
  • Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning - by the OP3 Soccer Team at Google DeepMind
  • PaLM-E: An Embodied Multimodal Language Model - embodied language models that directly incorporate real-world continuous sensor modalities into language models, establishing the link between words and percepts
  • Scaling Robot Learning with Semantically Imagined Experience
  • Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware - a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface

Typography

  • GenType - make an alphabet out of anything, by Google
  • Fontjoy - uses deep learning algorithms to suggest font pairings that balance style and readability
  • ControlNet, Demo: HF, How to make a QR code with Stable Diffusion - QR Code Conditioned ControlNet Models for Stable Diffusion. They provide a solid foundation for generating QR code-based artwork that is aesthetically pleasing, while still maintaining the integral QR code shape
  • Word-As-Image for Semantic Typography - Word-As-Image illustrations in various fonts and for different textual concepts. The semantically adjusted letters are created completely automatically and can then be used for further creative design
  • DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion - a method that automatically generates artistic typography by stylizing one or more letter fonts to visually convey the semantics of an input word, while ensuring that the output remains readable

Bio & Med

  • AlphaFold 3, Code - an AI model that predicts the structure of proteins, DNA, RNA, ligands and more, and how they interact, by Google DeepMind and Isomorphic Labs
  • AMIE - a research AI system for diagnostic medical reasoning and conversations, by Google
  • MentalLLaMA - mental health analysis with LLMs
  • AlphaMissense - an AI model classifying missense variants to help pinpoint the cause of diseases
  • meditron - a suite of open-source medical LLMs adapted to the medical domain from Llama-2 through continued pretraining on a comprehensively curated medical corpus, including selected PubMed papers and abstracts, a new dataset of internationally recognized medical guidelines, and a general domain corpus
  • evodiff - combines evolutionary-scale data with diffusion models for controllable protein sequence generation
  • SAM-Med2D - applying the Segment Anything Model (SAM) to medical 2D images
  • Med-Flamingo - a medical vision-language model with multimodal in-context learning abilities
  • Brain2Music - Reconstructing Music from Human Brain Activity
  • Seeing the World through Your Eyes - reconstruct a 3D scene beyond the camera's line-of-sight using portrait images containing eye reflections
  • Mind-Video - High-quality Video Reconstruction from Brain Activity
  • Med-PaLM - a large language model (LLM) designed to provide high-quality answers to medical questions
  • PMC-LLaMA - the official code for "PMC-LLaMA: Continue Training LLaMA on Medical Papers"

Military

  • AIP Pillars - activate LLMs and other AI on your private network, subject to full control
  • GeoSpy - upload satellite or aerial images, and GeoSpy’s AI examines visual details like landmarks, terrain features, and vegetation patterns to provide precise location predictions

Climate

Other: Finance, Presentation

  • Bricks - an AI-powered tool that generates reports, visuals, and presentations from your data
  • Atlas - a school AI assistant that provides personalized help by studying your specific class materials
  • Food Mood - a fusion recipe generator powered by Google AI
  • GNoME - a deep learning tool that dramatically increases the speed and efficiency of materials discovery by predicting the stability of new materials
  • FinGPT
  • guidde - create documentation/presentation/FAQ from captured video
  • Gamma - create visually appealing presentations
  • Tome - create a compelling starting point for your presentation in minutes