What Are Large Language Models

1.1 Large Language Model

An LLM is a neural network trained on massive text datasets to predict the next token in a sequence. Through this simple objective, the model learns grammar, facts, reasoning patterns, and even code.

Large Language Models (LLMs) represent a fundamental shift in how machines process and generate human language.

1.1.1 Two Families of Language Models

1.1.1.1 Generative Models

Produce text outputs (GPT-4, Claude, Llama). Trained to predict the next token autoregressively.
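The next-token objective and the autoregressive loop can be illustrated with a deliberately tiny stand-in. The sketch below is an assumption-laden toy, not how GPT-4, Claude, or Llama work internally: it swaps the neural network for a bigram frequency table, and the function names (train_bigram, predict_next, generate) are hypothetical. The loop has the same shape, though: predict one token, append it, and feed the result back in.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_new_tokens=5):
    """Autoregressive loop: each predicted token becomes the next input."""
    output = [start]
    for _ in range(max_new_tokens):
        nxt = predict_next(counts, output[-1])
        if nxt is None:
            break
        output.append(nxt)
    return output


# Toy corpus (made up for illustration), split on whitespace as "tokens".
corpus = "the cat sat on the mat . the cat sat on the sofa .".split()
model = train_bigram(corpus)
print(generate(model, "the", max_new_tokens=3))  # ['the', 'cat', 'sat', 'on']
```

A real LLM replaces the frequency table with a network that conditions on the whole preceding context and usually samples from a probability distribution instead of always taking the top token.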

1.1.1.2 Representation Models

Create vector embeddings of text (BERT, sentence-transformers). Used for classification, search, clustering.
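To make "vector embeddings used for search" concrete, here is a minimal sketch under strong simplifying assumptions. The embed function is a hypothetical stand-in that uses raw word counts; a real representation model such as BERT returns dense learned vectors. The comparison step, cosine similarity between a query vector and each document vector, is the part that carries over.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Real models return dense learned vectors."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents):
    """Return the document whose vector is closest to the query vector."""
    q = embed(query)
    return max(documents, key=lambda d: cosine_similarity(q, embed(d)))


# Made-up documents for illustration.
docs = [
    "how to reset my password",
    "shipping times for international orders",
    "refund policy for damaged items",
]
print(search("password reset help", docs))  # how to reset my password
```

Classification and clustering follow the same idea: once text is a vector, standard numeric methods apply to it.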

1.1.2 Language and Intelligence

How does “predicting the next word” get us to solving arithmetic, reasoning, and even coding?

Language encodes relationships between concepts, such as cause and effect or question and answer. The statistical patterns in text therefore reflect human reasoning processes.

Then comes the compression argument. To predict the next word accurately across billions of diverse texts, the model can’t just memorize. It must build internal representations of the underlying concepts.

When text describes “2+2=4” in thousands of ways (word problems, code, math notation), the most efficient compression is to learn actual arithmetic.

The model discovers that reasoning, coding logic, and arithmetic are useful abstractions for predicting language, not because they’re explicitly taught, but because they’re the compact way to represent patterns in how humans write about these topics.

1.1.2.1 Practical application

This explains why prompt engineering works:

The right prompt can activate latent knowledge already compressed into the model’s weights. When you write “Let’s solve this step-by-step,” you’re triggering reasoning circuits the model learned because step-by-step text statistically follows different patterns than stream-of-consciousness text.

This is why, in production systems, you can often improve outputs substantially by structuring prompts to activate the right “mode”: XML tags, examples, or chain-of-thought formatting each give access to different compressed representations.
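As a rough illustration of structured prompting, the sketch below assembles a prompt from the three devices named above: XML tags, examples, and a chain-of-thought cue. The function name build_prompt and the tag names (task, example, input, output, document) are hypothetical choices for this example, not a required schema for any model or API.

```python
def build_prompt(task, document, examples=(), step_by_step=False):
    """Assemble a prompt with XML-style sections (tag names are illustrative)."""
    parts = [f"<task>\n{task}\n</task>"]

    # Few-shot examples show the model the expected input/output pattern.
    for question, answer in examples:
        parts.append(
            "<example>\n"
            f"<input>{question}</input>\n"
            f"<output>{answer}</output>\n"
            "</example>"
        )

    parts.append(f"<document>\n{document}\n</document>")

    # Optional chain-of-thought cue at the end of the prompt.
    if step_by_step:
        parts.append("Let's solve this step-by-step.")

    return "\n\n".join(parts)


prompt = build_prompt(
    task="Classify the sentiment of the document as positive or negative.",
    document="The battery died after two days.",
    examples=[("I love this phone.", "positive")],
    step_by_step=True,
)
print(prompt)
```

Keeping prompt assembly in one function makes it easier to compare variants, for example with and without the step-by-step cue, against the same inputs.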

1.2 The Scale Hypothesis

LLMs improve predictably with scale (more data, more parameters, more compute). This empirical relationship, known as a “scaling law”, is why models grew from millions to hundreds of billions of parameters. The emergent capabilities that appear beyond certain scale thresholds (reasoning, code generation, few-shot learning) were largely unexpected.
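"Predictably" is often modeled as a power law in which loss falls by a constant factor each time the parameter count grows tenfold. The sketch below assumes that form, L(N) = (N_c / N) ** alpha, for parameters only. The constants n_c and alpha are hypothetical placeholders chosen for illustration, not fitted values from any published study.

```python
def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Loss under an assumed power law L(N) = (N_c / N) ** alpha.

    n_c and alpha are illustrative placeholders, not measured constants.
    """
    return (n_c / n_params) ** alpha


# Each 10x increase in parameters multiplies the loss by 10 ** -alpha.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss {power_law_loss(n):.3f}")
```

A smooth curve like this describes average loss only. It does not by itself predict at which scale a specific capability will appear.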

1.3 Why This Matters for Building Applications

Understanding the distinction between generative and representation models is your first design decision when building LLM-powered systems.

Ask yourself: representation or generation?

  • Do I need to generate text? → Generative model (GPT, Claude, Llama)
  • Do I need to search/classify/cluster text? → Embedding model (sentence-transformers, Cohere embed)
  • Do I need both? → You’ll combine them (this is the basis of RAG)
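The "combine them" case can be sketched as retrieve-then-generate. Everything below is a simplified assumption: overlap_score stands in for embedding similarity, echo_model is a hypothetical placeholder for a call to a generative model, and the prompt layout is one arbitrary choice among many.

```python
def overlap_score(query, document):
    """Stand-in for embedding similarity: Jaccard overlap of word sets."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def retrieve(query, documents, k=1):
    """Representation step: rank documents by similarity to the query."""
    ranked = sorted(documents, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:k]


def answer(query, documents, generate, k=1):
    """Generation step: pass the retrieved context plus the question to a model."""
    context = "\n".join(retrieve(query, documents, k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)


def echo_model(prompt):
    """Hypothetical placeholder for a generative model; returns the prompt unchanged."""
    return prompt


# Made-up knowledge base for illustration.
docs = [
    "Refunds are issued within 14 days.",
    "Orders ship within 2 business days.",
]
print(answer("When are refunds issued?", docs, echo_model))
```

Passing the model in as a function keeps the retrieval logic independent of any particular provider, so the placeholder can later be replaced by a real API call without touching retrieve.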

The representation vs generation distinction shapes every architectural choice downstream, from choosing APIs to deciding whether to fine-tune.

1.4 The Modern LLM Workflow

Building production LLM applications follows a pattern:

Data Collection → Feature Engineering → Model Selection/Tuning → RAG or Fine-Tuning → Inference Pipeline → Deployment → Monitoring

The LLM Engineer’s Handbook frames this as the FTI pipeline:

  • Feature
  • Training
  • Inference

These are three decoupled pipelines that communicate only through data stores.
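The decoupling can be sketched with three functions that never call each other and share state only through stores. This is a toy under stated assumptions: the two dictionaries are hypothetical stand-ins for a feature store and a model registry, and the "model" is just a vocabulary count, chosen so the example stays runnable without any ML library.

```python
from collections import Counter

feature_store = {}   # hypothetical stand-in for a feature store
model_registry = {}  # hypothetical stand-in for a model registry


def feature_pipeline(raw_texts):
    """Raw data in, features out. Writes only to the feature store."""
    feature_store["tokens"] = [text.lower().split() for text in raw_texts]


def training_pipeline():
    """Features in, model out. Reads the feature store, writes the registry."""
    counts = Counter(tok for doc in feature_store["tokens"] for tok in doc)
    model_registry["latest"] = counts


def inference_pipeline(text):
    """New input in, prediction out. Reads only the model registry."""
    model = model_registry["latest"]
    tokens = text.lower().split()
    known = sum(1 for tok in tokens if tok in model)
    # Toy "prediction": fraction of input tokens seen during training.
    return known / len(tokens) if tokens else 0.0


feature_pipeline(["The cat sat", "The dog ran"])
training_pipeline()
print(inference_pipeline("the cat ran away"))  # 0.75
```

Because each function touches only the stores, any one of them can be rerun, rescheduled, or replaced without changing the other two.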