During the process of writing AI Engineering, I went through many papers, case studies, blog posts, repos, tools, etc. The book itself has 1200+ reference links and I've been tracking 1000+ generative AI GitHub repos. This document contains the resources I found the most helpful to understand different areas.
If there are resources that you've found helpful but not yet included, feel free to open a PR.
- ML Theory Fundamentals
- Chapter 1. Planning Applications with Foundation Models
- Chapter 2. Understanding Foundation Models
- Chapters 3 + 4. Evaluation Methodology
- Chapter 5. Prompt Engineering
- Chapter 6. RAG and Agents
- Chapter 7. Finetuning
- Chapter 8. Dataset Engineering
- Chapter 9. Inference Optimization
- Chapter 10. AI Engineering Architecture and User Feedback
- Bonus: Organization engineering blogs
While you don't need an ML background to start building with foundation models, a rough understanding of how AI works under the hood is useful to prevent misuse. Familiarity with ML theory will make you much more effective.
-
[Lecture notes] Stanford CS231n: a longtime favorite introductory course on neural networks.
- [Videos] I'd recommend watching lectures 1 to 7 from the 2017 course video recordings. They cover the fundamentals that haven't changed.
- [Videos] Andrej Karpathy's Neural Networks: Zero to Hero is more hands-on where he shows how to implement several models from scratch.
-
[Book] Machine Learning: A Probabilistic Perspective (Kevin P Murphy, 2012)
Foundational, comprehensive, though a bit intense. This used to be many of my friends' go-to book when preparing for theory interviews for research positions.
-
A good note that covers basic differential calculus and probability concepts.
-
I also made a list of resources for MLOps, which includes a section for ML + engineering fundamentals.
-
I wrote a brief 1500-word note on how an ML model learns and concepts like objective function and learning procedure.
-
AI Engineering also covers the important concepts immediately relevant to the discussion:
- Transformer architecture (Chapter 2)
- Embedding (Chapter 3)
- Backpropagation and trainable parameters (Chapter 7)
-
GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models (OpenAI, 2023)
OpenAI (2023) has excellent research on how exposed different occupations are to AI. They defined a task as exposed if AI and AI-powered software can reduce the time needed to complete this task by at least 50%. An occupation with 80% exposure means that 80% of this occupation's tasks are considered exposed. According to the study, occupations with 100% or close to 100% exposure include interpreters and translators, tax preparers, web designers, and writers. Some of them are shown in Figure 1-5. Unsurprisingly, occupations with no exposure to AI include cooks, stonemasons, and athletes. This study gives a good idea of what use cases AI is good for.
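To make the exposure definition above concrete, here is a minimal sketch of the arithmetic. The `Task` class, the task names, and all timings are hypothetical and made up for illustration; this is not the paper's methodology, only the 50% threshold and the "share of exposed tasks" definition come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    baseline_minutes: float      # time to complete the task without AI (made-up value)
    ai_assisted_minutes: float   # time to complete the task with AI (made-up value)


def is_exposed(task: Task, min_reduction: float = 0.5) -> bool:
    """A task counts as exposed if AI cuts its completion time by at least 50%."""
    reduction = 1 - task.ai_assisted_minutes / task.baseline_minutes
    return reduction >= min_reduction


def occupation_exposure(tasks: list[Task]) -> float:
    """Share of an occupation's tasks that are exposed."""
    if not tasks:
        raise ValueError("an occupation needs at least one task")
    return sum(is_exposed(t) for t in tasks) / len(tasks)


# Hypothetical task list for a single occupation.
tasks = [
    Task("draft translation", 60, 20),   # ~67% reduction: exposed
    Task("terminology review", 30, 15),  # exactly 50% reduction: exposed
    Task("client meeting", 45, 40),      # ~11% reduction: not exposed
    Task("formatting", 20, 5),           # 75% reduction: exposed
    Task("proofreading", 40, 18),        # 55% reduction: exposed
]

exposure = occupation_exposure(tasks)
print(f"{exposure:.0%}")  # 80%
```

With four of the five hypothetical tasks crossing the threshold, this occupation would have 80% exposure under the definition above.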
-
Applied LLMs (Yan et al., 2024)
Eugene Yan and co. shared their learnings from one year of deploying LLM applications. Many helpful tips!
-
Musings on Building a Generative AI Product (Juan Pablo Bottaro and Karthik Ramgopal, LinkedIn, 2024)
One of the best reports I've read on deploying LLM applications: what worked and what didn't. They discussed structured outputs, latency vs. throughput tradeoffs, the challenges of evaluation (they spent most of their time on creating annotation guidelines), and the last-mile challenge of building gen AI applications.
-
Apple's human interface guideline for designing ML applications
Outlines how to think about the roles of AI and humans in your application, which influence the interface decisions.
-
LocalLlama subreddit: useful to check from time to time to see what people are up to.
-
State of AI Report (updated yearly): very comprehensive. It's useful to skim through to see what you've missed.
-
16 Changes to the Way Enterprises Are Building and Buying Generative AI (Andreessen Horowitz, 2024)
-
"Like Having a Really Bad PA": The Gulf between User Expectation and Experience of Conversational Agents (Luger and Sellen, 2016)
A solid, ahead-of-its-time paper on user experience with conversational agents. It makes a great case for the value of dialogue interfaces and what's needed to make them useful, featuring in-depth interviews with 14 users. "It has been argued that the true value of dialogue interface systems over direct manipulation (GUI) can be found where task complexity is greatest."
-
Stanford Webinar - How AI is Changing Coding and Education, Andrew Ng & Mehran Sahami (2024)
A great discussion that shows how Stanford's CS department thinks about what CS education will look like in the future. My favorite quote: "CS is about systematic thinking, not writing code."
-
Professional artists: how much has AI art affected your career? - 1 year later : r/ArtistLounge
Many people share their experience on how AI impacted their work. E.g.:
"From time to time, I am sitting in meetings where managers dream of replacing coders, writers and visual artists with AI. I hate those meetings and try to avoid them, but I still get involved from time to time. All my life, I loved coding & art. But nowadays, I often feel this weird sadness in my heart."
Papers detailing the training process of important models are gold mines. I'd recommend reading all of them. But if you can only pick 3, I'd recommend Gopher, InstructGPT, and Llama 3.
-
[GPT-2] Language Models are Unsupervised Multitask Learners (OpenAI, 2019)
-
[GPT-3] Language Models are Few-Shot Learners (OpenAI, 2020)
-
[Gopher] Scaling Language Models: Methods, Analysis & Insights from Training Gopher (DeepMind, 2021)
-
[InstructGPT] Training language models to follow instructions with human feedback (OpenAI, 2022)
-
[Chinchilla] Training Compute-Optimal Large Language Models (DeepMind, 2022)
-
Qwen Technical Report (Alibaba, 2023)
-
Qwen2 Technical Report (Alibaba, 2024)
-
Constitutional AI: Harmlessness from AI Feedback (Anthropic, 2022)
-
LLaMA: Open and Efficient Foundation Language Models (Meta, 2023)
-
Llama 2: Open Foundation and Fine-Tuned Chat Models (Meta, 2023)
-
The Llama 3 Herd of Models (Meta, 2024)
This paper is so good. The section on synthetic data generation and verification is especially important.
-
Yi: Open Foundation Models by 01.AI (01.AI, 2024)
Scaling laws
-
From bare metal to high performance training: Infrastructure scripts and best practices - imbue
Discusses how to scale compute to train large models. It uses 4,092 H100 GPUs spread across 511 computers, 8 GPUs/computer.
-
Scaling Laws for Neural Language Models (OpenAI, 2020)
Earlier scaling law. Only up to 1B non-embedding params and 1B tokens.
-
Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
Known as the Chinchilla scaling law, this might be the most well-known scaling law paper.
-
Scaling Data-Constrained Language Models (Muennighoff et al., 2023)
"We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero."
-
Scaling Instruction-Finetuned Language Models (Chung et al., 2022)
A very good paper that talks about the importance of diversity of instruction data.
-
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (Sardana et al., 2023)
-
AI models are devouring energy. Tools to reduce consumption are here, if data centers will adopt (MIT Lincoln Laboratory, 2023)
-
Will we run out of data? Limits of LLM scaling based on human-generated data (Villalobos et al., 2022)
Fun stuff
-
Evaluating feature steering: A case study in mitigating social biases (Anthropic, 2024)
This area of research is awesome. They focused on 29 features related to social biases and found that feature steering can influence specific social biases, but it may also produce unexpected "off-target effects".
-
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Anthropic, 2024)
-
GitHub - ianand/spreadsheets-are-all-you-need
"Implements the forward pass of GPT2 (an ancestor of ChatGPT) entirely in Excel using standard spreadsheet functions."
-
BertViz: Visualize Attention in NLP Models (BERT, GPT2, BART, etc.)
A helpful visualization of multi-head attention in action, developed to show how BERT works.
-
A Guide to Structured Generation Using Constrained Decoding (Aidan Cooper, 2024)
An in-depth, detailed tutorial on generating structured outputs.
-
Fast JSON Decoding for Local LLMs with Compressed Finite State Machine (LMSYS, 2024)
-
How fast can grammar-structured generation be? (Brandon T. Willard, 2024)
I also wrote a post on sampling for text generation (2024).
-
Everything About Long Context Fine-tuning (Wenbo Pan, 2024)
-
Data Engineering for Scaling Language Models to 128K Context (Fu et al., 2024)
-
The Secret Sauce behind 100K context window in LLMs: all tricks in one place (Galina Alperovich, 2023)
-
Extending Context is Hard…but not Impossible (kaioken, 2023)
-
RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
Introduces RoPE, a positional embedding technique that helps transformer-based models handle longer context lengths.
-
Challenges in evaluating AI systems (Anthropic, 2023)
Discusses the limitations of common AI benchmarks to show why evaluation is so hard.
-
Holistic Evaluation of Language Models (Liang et al., Stanford 2022)
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (Google, 2022)
-
Open-LLM performances are plateauing, let's make the leaderboard steep again (Hugging Face, 2024)
Helpful explanation on why Hugging Face chose certain benchmarks for their leaderboard, which is a useful reference for selecting benchmarks for your personal leaderboard.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
-
LLM Task-Specific Evals that Do & Don't Work (Eugene Yan, 2024)
-
Your AI Product Needs Evals (Hamel Husain, 2024)
-
Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks (Google & AI2, May 2023)
-
alopatenko/LLMEvaluation (Andrei Lopatenko)
A large collection of evaluation resources. The slide deck on eval has a lot of pointers too.
-
Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022)
A fun paper that uses AI to discover novel AI behaviors. They use methods with various degrees of automation to generate evaluation sets for 154 diverse behaviors.
-
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models (Zhang et al., 2023)
-
OpenRouter's LLM Rankings shows the top open source models on their platform, ranked by their usage (token volume). This can help you evaluate open source models by popularity. I wish more inference services would publish statistics like this.
-
Anthropic's Prompt Engineering Interactive Tutorial
Practical, comprehensive, and fun. The Google Sheets-based interactive exercises make it easy to experiment with different prompts and see immediately what works and what doesn't. I'm surprised other model providers don't have similar interactive guides.
-
Brex's prompt engineering guide
Contains a list of example prompts that Brex uses internally.
-
Collections of prompt examples from OpenAI, Anthropic, and Google.
-
Larger language models do in-context learning differently (Wei et al., 2023)
-
How I think about LLM prompt engineering (Francois Chollet, 2023)
-
Has many resources on adversarial ML and how to defend your ML systems against attacks, including both text and image attacks
-
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (OpenAI, 2024)
A good paper on how OpenAI trained a model to follow an instruction hierarchy, which helps protect the model from jailbreaking.
-
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (Greshake et al., 2023)
Has a great list of examples of indirect prompt injections in the appendix.
-
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks (Kang et al., 2023)
-
Scalable Extraction of Training Data from (Production) Language Models (Nasr et al., 2023)
-
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (Zeng et al., 2024)
-
LLM Security: A collection of LLM security papers.
-
Tools that help automate security probing include PyRIT, Garak, persuasive_jailbreaker, GPTFUZZER, and MasterKey.
-
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (Meta, 2023)
-
AI Security Overview (AI Exchange)
-
Reading Wikipedia to Answer Open-Domain Questions (Chen et al., 2017)
Introduces the RAG pattern to help with knowledge-intensive tasks such as question answering.
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
-
Retrieval-Augmented Generation for Large Language Models: A Survey (Gao et al., 2023)
-
Introducing Contextual Retrieval (Anthropic, 2024)
An important topic not discussed nearly enough is how to prepare data for a RAG system. This post discusses several techniques for preparing data for RAG and offers some very practical advice on when to use RAG and when to use long context.
-
The 5 Levels Of Text Splitting For Retrieval (Greg Kamradt, 2024)
-
GPT-4 + Streaming Data = Real-Time Generative AI (Confluent, 2023)
A great post detailing the pattern of retrieving real-time data in RAG applications.
-
Everything You Need to Know about Vector Index Basics (Zilliz, 2023)
An excellent series on vector search and vector databases.
-
A deep dive into the world's smartest email AI (Hiranya Jayathilaka, 2023)
If you can ignore the title, the post is a detailed case study on using the RAG pattern to build an email assistant.
-
[Book] Introduction to Information Retrieval (Manning, Raghavan, and Schütze, 2008)
Information retrieval is the backbone of RAG. This book is for those who want to dive really, really deep into different techniques for organizing and querying text data.
-
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models (Lu et al., 2023)
My favorite study on LLM planners, how they use tools, and their failure modes. An interesting finding is that different LLMs have different tool preferences.
-
Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023)
-
Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023)
-
Berkeley Function Calling Leaderboard and the paper Gorilla: Large Language Model Connected with Massive APIs (Patil et al., 2023)
The list of 4 common mistakes in function calling made by ChatGPT is interesting.
-
THUDM/AgentBench: A Benchmark to Evaluate LLMs as Agents (ICLR'24)
-
WebGPT: Browser-assisted question-answering with human feedback (Nakano et al., 2021)
-
ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)
-
Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023)
-
Voyager: An Open-Ended Embodied Agent with Large Language Models (Wang et al., 2023)
-
[Book] Artificial Intelligence: A Modern Approach (Russell and Norvig; 4th edition, 2020)
Planning is closely related to search, and this classic book has several in-depth chapters on search.
-
Best practices for fine-tuning GPT-3 to classify text (OpenAI)
A draft from OpenAI. While this guide focuses on GPT-3, many techniques are applicable to full finetuning in general. It explains how GPT-3 finetuning works, how to prepare training data, how to evaluate your model, and common mistakes.
-
Easily Train a Specialized LLM: PEFT, LoRA, QLoRA, LLaMA-Adapter, and More (Cameron R. Wolfe, 2023)
For more general parameter-efficient finetuning, this 7000-word, well-researched article covers the evolution of adapter-based finetuning, why LoRA is so popular, and why it works.
-
Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs (Ovadia et al., 2024)
Interesting results to help answer the question: finetune or RAG?
-
Parameter-Efficient Transfer Learning for NLP (Houlsby et al., 2019)
The paper introducing the concept of PEFT.
-
LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
A must-read.
-
QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
-
Direct Preference Optimization with Synthetic Data on Anyscale (2024)
-
Transformer Inference Arithmetic (kipply, 2022)
-
Transformer Math 101 (EleutherAI, 2023): Memory footprint calculation, focusing more on training.
-
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning (Lialin et al., 2023)
A comprehensive study of different finetuning methods. Not all techniques are relevant today, though.
-
My experience on starting with fine tuning LLMs with custom data : r/LocalLLaMA (2023)
-
Train With Mixed Precision (NVIDIA Docs)
-
Annotation Best Practices for Building High-Quality Datasets (Grammarly, 2022)
-
Scaling Instruction-Finetuned Language Models (Chung et al., 2022)
-
The Curse of Recursion: Training on Generated Data Makes Models Forget (Shumailov et al., 2023)
-
The Llama 3 Herd of Models (Meta, 2024)
The whole paper is good, but the section on synthetic data generation and verification is especially important.
-
Instruction Tuning with GPT-4 (Peng et al., 2023)
Use GPT-4 to generate instruction-following data for LLM finetuning.
-
Best Practices and Lessons Learned on Synthetic Data for Language Models (Liu et al., DeepMind 2024)
-
[UltraChat] Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Ding et al., 2023)
-
Deduplicating Training Data Makes Language Models Better (Lee et al., 2021)
-
Can LLMs learn from a single example? (Jeremy Howard and Jonathan Whitaker, 2023)
Fun experiment to show that it's possible to see model improvement with just one training example.
-
LIMA: Less Is More for Alignment (Zhou et al., 2023)
Here are a few resources where you can look for publicly available datasets. While you should take advantage of available data, you should never fully trust it. Data needs to be thoroughly inspected and validated.
Always check a dataset's license before using it. Try your best to understand where the data comes from. Even if a dataset has a license that allows commercial use, it's possible that part of it comes from a source that doesn't.
- Hugging Face and Kaggle each host hundreds of thousands of datasets.
- Google has a wonderful and underrated Dataset Search.
- Governments are often great providers of open data. Data.gov hosts hundreds of thousands of datasets, and data.gov.in hosts tens of thousands.
- University of Michigan's Institute for Social Research ICPSR has data from tens of thousands of social studies.
- UC Irvine's Machine Learning Repository and OpenML are two older dataset repositories, each hosting several thousand datasets.
- The Open Data Network lets you search among tens of thousands of datasets.
- Cloud service providers often host a small collection of open datasets; the most notable one is AWS's Open Data.
- ML frameworks often have small pre-built datasets that you can load while using the framework, such as TensorFlow datasets.
- Some evaluation harness tools host evaluation benchmark datasets that are sufficiently large for PEFT finetuning. For example, Eleuther AI's lm-evaluation-harness hosts 400+ benchmark datasets, averaging 2,000+ examples per dataset.
- The Stanford Large Network Dataset Collection is a great repository for graph datasets.
-
Mastering LLM Techniques: Inference Optimization (NVIDIA Technical Blog, 2023)
A very good overview of different optimization techniques.
-
Accelerating Generative AI with PyTorch II: GPT, Fast (Pytorch, 2023)
A good case study with the performance improvement achieved from different techniques.
-
Efficiently Scaling Transformer Inference (Pope et al., 2022)
A highly technical but really good paper on inference from Jeff Dean's team. My favorite is the section discussing what to focus on for different tradeoffs (e.g. latency vs. cost).
-
Optimizing AI Inference at Character.AI (Character.AI, 2024)
This is less of a technical paper and more of a "Look, I can do this" paper. It's pretty impressive what the Character.AI technical team was able to achieve. This post discusses attention design, cache optimization, and int8 training.
-
[Video] GPU optimization workshop with OpenAI, NVIDIA, PyTorch, and Voltron Data
-
[Video] Essence VC Q1 Virtual Conference: LLM Inference (with vLLM, TVM, and Modal Labs)
-
Techniques for KV Cache Optimization in Large Language Models (Omri Mallis, 2024)
An excellent post explaining how to optimize the KV cache, one of the most memory-heavy parts of transformer inference.
João Lages has an excellent visualization of KV cache.
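To get a feel for why the KV cache is so memory-heavy, here is a rough back-of-the-envelope sketch. It assumes standard attention that stores one key vector and one value vector per token, per layer, per KV head; the function name and the configuration values (32 layers, 32 KV heads, head dimension 128, 16-bit values) are assumptions chosen to resemble a 7B-parameter model, not numbers taken from the post.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Approximate KV cache size: keys and values for every token in every layer."""
    keys_and_values = 2  # one key tensor and one value tensor
    return (
        keys_and_values
        * n_layers
        * n_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


# Assumed configuration, for illustration only.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
cache_gib = cache / 2**30
print(f"{cache_gib:.1f} GiB")  # 2.0 GiB for a single 4,096-token sequence
```

Because the size grows linearly with both sequence length and batch size, serving many long sequences at once can make the cache rival the model weights in memory, which is why techniques that shrink or share it matter.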
-
Accelerating Large Language Model Decoding with Speculative Sampling (DeepMind, 2023)
-
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (Zhong et al., 2024)
-
The Best GPUs for Deep Learning in 2023 — An In-depth Analysis (Tim Dettmers, 2023)
Stas Bekman also has some great notes on evaluating accelerators.
-
Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Jeon et al., 2019)
A detailed study of GPU clusters used for training deep neural networks (DNNs) in a multi-tenant environment. The authors analyze a two-month-long trace from a GPU cluster at Microsoft, focusing on three key issues affecting cluster utilization: gang scheduling and locality constraints, GPU utilization, and job failures.
-
AI Datacenter Energy Dilemma - Race for AI Datacenter Space (SemiAnalysis, 2024)
Great analysis on the business of data centers and their bottlenecks.
I also have an older post, A friendly introduction to machine learning compilers and optimizers (Chip Huyen, 2021).
-
Chapter 4: Monitoring from Google SRE Book
-
Guidelines for Human-AI Interaction (Microsoft Research)
Microsoft proposed 18 design guidelines for human-AI interaction, covering decisions before development, during development, when something goes wrong, and over time.
-
Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models (Bansal et al., 2023)
A study on how the feedback protocol influences a model's training performance.
-
Feedback-Based Self-Learning in Large-Scale Conversational AI Agents (Ponnusamy et al., Amazon 2019)
-
A scalable framework for learning from implicit user feedback to improve natural language understanding in large-scale conversational AI systems (Park et al., Amazon 2020)
User feedback design for conversational AI is an under-researched area, so there aren't many resources yet, but I hope that will soon change.
I enjoy reading good technical blogs. Here are some of my frequent go-to engineering blogs.