Most people using AI tools right now are operating on vibes. They type prompts, read outputs, and trust the whole thing like a vending machine: put in a question, get an answer, move on. That works fine until it doesn’t. Until the AI confidently tells you something false. Until your long conversation loses its thread. Until you wonder why the same prompt produces wildly different results on different days.

Seven terms fix most of that confusion. Not because jargon is the point (it isn’t) but because understanding what these concepts actually are changes how you use every AI tool you’ll touch from here forward.

This isn’t vocabulary. It’s mechanics.

1. Tokens

AI models don’t read words. They don’t even read letters. They read tokens- chunks of text that might be a full word, part of a word, a punctuation mark, or a space. The sentence “I love pizza” breaks into roughly three tokens: I, love, pizza. A longer word like “unbelievable” might be three tokens: un, believ, able.

Why does this matter in practice?

Because every AI product you use, whether ChatGPT, Claude, or Gemini, is counting tokens constantly. The number of tokens in your prompt determines how much the model has to process. The number in its reply determines how much it costs to generate. API pricing is measured in tokens per thousand, not words or characters.

Tokens are also the unit of memory. The model doesn’t hold your entire conversation in its head — it holds a fixed number of tokens. When that limit fills up, the oldest tokens get dropped. That’s not a bug. That’s just how the architecture works. Once you know this, AI’s “forgetting” in long conversations stops feeling random and starts making complete sense.
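A rough feel for token counts helps when budgeting prompts. The heuristic below (about four characters of English per token) is an assumption for illustration, not how real BPE tokenizers actually split text:

```python
def rough_token_count(text: str) -> int:
    # Rough heuristic: English averages ~4 characters per token in
    # BPE-style tokenizers. Real tokenizers split on learned subwords.
    return max(1, len(text) // 4)

print(rough_token_count("I love pizza"))  # ~3 tokens
print(rough_token_count("unbelievable"))  # ~3 tokens
```

For exact counts, use the tokenizer that ships with the model you’re calling; the real numbers vary by model.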

One-line version: Tokens are the atoms AI reads. Everything the model sees, processes, and generates is measured in them, including your bill.

2. Context window

Every AI model has a context window: the maximum number of tokens it can hold in memory at one time. This includes your instructions, the full conversation history, any documents you’ve pasted in, and the model’s own replies. When the window fills up, something has to go. Usually the oldest content.

Think of it as a whiteboard. You can write whatever you want on it. But it has a fixed size, and once it’s full, you have to erase old content to make room for new.

Older models had context windows of 4,000 tokens- roughly 3,000 words, or a few pages. Claude 3 launched with 200,000 tokens. Some models now approach 1 million. That’s not a marketing number — it fundamentally changes what you can do. A 200K context window means you can paste an entire novel and ask questions about it. A 4K window means you can paste about four pages before the model starts losing the beginning.

The practical consequence: if you’re asking an AI to analyze a long document or work through a complex multi-step task, be aware that it may genuinely not remember what it said 30 messages ago. It’s not being inconsistent on purpose. The whiteboard ran out of space.
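The whiteboard analogy fits in a few lines of code. This is a toy eviction policy (drop the oldest tokens), an assumption for illustration; real systems may also truncate or summarize differently:

```python
from collections import deque

class ContextWindow:
    """Toy context window: holds at most `limit` tokens, evicting the oldest."""
    def __init__(self, limit: int):
        self.tokens = deque(maxlen=limit)  # deque drops from the left when full

    def add(self, new_tokens):
        self.tokens.extend(new_tokens)

    def visible(self):
        return list(self.tokens)

window = ContextWindow(limit=5)
window.add(["You", "are", "a", "helpful", "assistant"])
window.add(["Hello"])    # window full: "You" falls off the whiteboard
print(window.visible())  # ['are', 'a', 'helpful', 'assistant', 'Hello']
```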

One-line version: The context window is how much an AI can see at once. Bigger window = longer memory = more complex tasks.

3. Temperature

Temperature is a setting that controls how random an AI’s output is. It’s a dial, typically between 0 and 1 (sometimes higher), and it shapes whether the model plays it safe or takes risks.

Low temperature (close to 0): the model picks the most statistically likely next token, every time. The output is predictable, consistent, and precise. Useful for factual tasks: summarizing, coding, extracting information from a document. You want the AI to be accurate, not surprising.

High temperature (closer to 1 or above): the model samples more randomly from its options. Outputs get more varied, more unexpected, occasionally more creative, and occasionally more wrong. Useful for brainstorming, writing fiction, generating ad copy, anything where you want interesting rather than correct.

Most consumer apps — ChatGPT’s standard interface, Claude.ai — don’t expose this setting. They’ve fixed it at a reasonable middle ground. But in any API or developer tool, temperature is one of the first parameters you’ll see. Now you know what to do with it: low for precision, high for creativity.
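Under the hood, temperature rescales the model’s raw token scores before sampling. A minimal sketch, assuming softmax sampling over made-up scores (real models work over vocabularies of tens of thousands of tokens):

```python
import math
import random

def sample_next_token(scores: dict, temperature: float, rng=None):
    """Pick a next token from raw scores, with temperature-scaled softmax."""
    if temperature == 0:
        # Greedy decoding: always the single most likely token.
        return max(scores, key=scores.get)
    rng = rng or random.Random(0)
    # Dividing scores by temperature before softmax flattens or sharpens
    # the distribution: high temperature boosts unlikely tokens.
    scaled = {t: math.exp(s / temperature) for t, s in scores.items()}
    total = sum(scaled.values())
    r, cumulative = rng.random(), 0.0
    for token, weight in scaled.items():
        cumulative += weight / total
        if r <= cumulative:
            return token
    return token  # fallback for floating-point edge cases

scores = {"the": 2.0, "a": 1.5, "pineapple": 0.1}
print(sample_next_token(scores, temperature=0))    # always "the"
print(sample_next_token(scores, temperature=1.5))  # occasionally surprises
```

At temperature 0 the output is deterministic; as the dial goes up, the gap between “the” and “pineapple” shrinks and unlikely tokens get sampled more often.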

One-line version: Temperature is the randomness dial. Turn it down for accuracy, turn it up for creativity.

4. Hallucination

Hallucination is when an AI states something false with complete confidence. Not hedged, not uncertain: just wrong, delivered as fact.

The frustrating part isn’t that it happens. All tools make mistakes. The frustrating part is that a hallucinating AI uses exactly the same tone and confidence as when it’s right. There’s no stutter, no hesitation, no tell. It just answers.

Why? Because a language model isn’t looking things up. It’s predicting tokens — completing patterns based on what it learned during training. When it doesn’t know the answer to your question, it doesn’t say “I don’t know.” It generates what sounds like a correct answer, because that’s structurally what it was trained to do.

The practical rule: never trust AI-generated outputs for facts, statistics, legal positions, medical information, or citations without verifying independently. Use AI to draft, explore, and accelerate — then check anything that matters. The people who understand hallucination don’t stop using AI. They use it at the right stage of the process and verify before anything goes out the door.

One-line version: AI hallucinates because it predicts plausible text, not because it looks up facts. Confidence means nothing; verify anything important.

5. RAG (Retrieval-Augmented Generation)

RAG is why “Chat with your PDF” works. It’s the mechanism behind every AI product that knows your documents, your company’s knowledge base, or recent events the model wasn’t trained on.

Here’s the problem it solves: a standard language model was trained on data up to a cutoff date. It knows nothing about your internal files. It knows nothing about the product release from last week. Ask it about either, and it’ll either hallucinate or admit it doesn’t know.

RAG fixes this. When you upload a document, the system doesn’t feed the whole thing into the model’s brain — it breaks it into chunks, stores those chunks in a vector database (a database that understands meaning, not just keywords), and indexes it. When you ask a question, the system searches that database for the most relevant chunks, retrieves them, and includes them in the prompt it sends to the model: “Here’s the relevant context. Now answer using this.”

Retrieve. Augment. Generate. That’s the three-step loop behind almost every practically useful AI product built in the last two years: support bots that know your company’s policies, legal tools that read your contracts, research assistants that summarize papers you’ve uploaded. When an AI product claims to “know your data,” it’s almost certainly running RAG.
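The loop can be sketched end to end. This toy version scores chunks by raw word overlap; real systems use vector embeddings and semantic similarity, so everything below is an illustration of the shape, not a production pattern:

```python
def retrieve(query: str, chunks: list, top_k: int = 1) -> list:
    # Toy retrieval: rank chunks by how many words they share with the
    # query. Real RAG uses embeddings in a vector database instead.
    q_words = set(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, chunks: list) -> str:
    # Augment: stuff the retrieved chunks into the prompt, then generate.
    context = "\n".join(retrieve(query, chunks))
    return f"Here is the relevant context:\n{context}\n\nNow answer using this: {query}"

docs = [
    "Refunds are processed within 14 days of purchase.",
    "Our office is closed on public holidays.",
]
print(build_prompt("How long do refunds take?", docs))
```

The model never “learned” the refund policy; the retrieval step found the chunk and handed it over inside the prompt, which is the entire trick.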

One-line version: RAG is how AI knows your documents. It searches first, then answers — the model didn’t learn your data, it just got handed the relevant pieces.

6. LLM (Large Language Model)

LLM is the term underneath everything. GPT-4o is an LLM. Claude is an LLM. Gemini is an LLM. When people say “the AI,” they almost always mean a large language model.

What makes it large: the number of parameters it contains. Parameters are the internal numerical weights the model adjusts during training to get better at predicting what comes next. GPT-3 had 175 billion parameters. Current frontier models’ parameter counts aren’t publicly disclosed, but estimates run into the trillions.

What makes it a language model: it was trained to predict the most likely next token given a sequence of preceding tokens. That’s it. It sounds simple. The emergent behavior from doing this at scale on hundreds of billions of words of text is what’s surprising — the model develops what look like reasoning, writing, and comprehension abilities, not because they were explicitly programmed, but because predicting text well enough requires representing the underlying structure of language and, to some degree, the world it describes.
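“Predict the next token” can be shown at microscopic scale. A bigram counter is a vastly simplified, hypothetical stand-in for what an LLM learns (no neural network, no context beyond one preceding token), but it is the same objective in miniature:

```python
from collections import Counter, defaultdict

def train_bigram(text: str):
    # Count which token most often follows each token: next-token
    # prediction in its smallest possible form.
    tokens = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(model, token: str) -> str:
    # Greedy prediction: the most frequent follower seen in training.
    return model[token].most_common(1)[0][0]

model = train_bigram("the cat sat on the mat and the cat slept")
print(predict_next(model, "the"))  # "cat" (seen twice after "the", vs "mat" once)
```

Scale this idea up by trillions of parameters and hundreds of billions of words, and the surprising emergent behavior described above falls out of the same objective.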

Important limitation: an LLM is not a database. It doesn’t store facts in retrievable slots the way a search engine does. It stores statistical patterns. That’s why it hallucinates — and why RAG was invented to augment it with actual retrieval.

One-line version: An LLM is the model behind every AI text tool. It predicts tokens at scale — that’s the whole magic trick.

7. Inference

Inference is what happens every time you send a message to an AI. It’s the act of running a trained model to generate an output: the compute that happens between you hitting Enter and the response appearing.

Training (building the model in the first place) is a one-time process that costs billions of dollars and takes months. Inference is the ongoing cost of actually using the model, and it happens millions of times per day across ChatGPT, Claude, and every other deployed AI.

Why does this term matter to you, practically?

Because inference speed and cost are why AI products work the way they do. Fast models (GPT-4o-mini, Claude Haiku) are optimized for cheap, low-latency inference — quick responses at scale. Frontier models (GPT-4o, Claude Opus) are more capable but slower and more expensive per inference. That’s why they’re gated behind paid tiers.
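Per-inference cost is simple arithmetic over token counts. The prices below are made up for illustration (check each provider’s pricing page for real numbers); only the shape of the calculation matters:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   in_price_per_k: float, out_price_per_k: float) -> float:
    # APIs bill per 1,000 tokens, with input and output priced separately
    # (output tokens typically cost more, since they require generation).
    return (input_tokens / 1000) * in_price_per_k \
         + (output_tokens / 1000) * out_price_per_k

# Hypothetical prices for illustration only.
fast = inference_cost(2000, 500, in_price_per_k=0.0005, out_price_per_k=0.0015)
frontier = inference_cost(2000, 500, in_price_per_k=0.01, out_price_per_k=0.03)
print(f"fast model: ${fast:.4f}  frontier model: ${frontier:.4f}")
```

Multiply a gap like that by millions of requests per day and the tiered pricing of every AI product stops looking arbitrary.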

It’s also why “running AI locally” on your own hardware is increasingly discussed — if you run inference on your own machine with a model like Llama or Gemma, you pay the hardware cost once and run as many inferences as you want for free. The tradeoff is capability: local models are getting better fast, but frontier-hosted models are still more capable for most complex tasks.

When someone says an AI is “slow to respond,” they mean inference is taking longer than expected — usually because the model is large, the output is long, or the server is under load. Knowing this, you stop being confused by the variability and start understanding the system.

One-line version: Inference is the compute that generates your AI response. It’s why fast models cost less, and why local models are appealing even when they’re less capable.

Why any of this matters

None of these terms are trivia. Each one maps to a real behavior you’ll observe in AI tools, and understanding the mechanism changes how you work with them.

  • Tokens: write tighter prompts. You now know that every word costs something and that the model’s memory is measured in units you can influence.
  • Context window: stop being surprised when AI forgets. Know when to start a fresh conversation and when to summarize earlier context into the prompt.
  • Temperature: match the setting to the task. Low for precision, high for creativity. Most tools don’t expose this, but now you know what they’ve decided for you and why.
  • Hallucination: verify anything that matters. AI confidence is not a signal of accuracy. Treat it as a fast first draft, not a finished fact.
  • RAG: understand what “AI that knows your documents” actually means. The model didn’t learn your data; it searched it and handed the relevant pieces to the LLM. Knowing this tells you when the system will fail (bad retrieval, vague queries, poorly chunked documents).
  • LLM: stop treating AI like magic. It’s a statistical pattern-predictor operating at enormous scale. That mental model makes it easier to understand both its strengths and its limits.
  • Inference: understand why speed and cost differ between models. Now you know why cheaper, faster models exist and what you’re trading when you choose one over another.

AI is not going away. The gap between people who use it on autopilot and people who understand what’s actually happening is going to widen. These seven terms are the shortest path across that gap.

Frequently asked questions

What is a token in AI?

A token is the basic unit of text that AI language models process. It’s not a word — it can be a whole word, part of a word, punctuation, or a space. The sentence “I love pizza” is roughly 3 tokens. AI models read, process, and generate tokens, not words. API pricing is measured in tokens per thousand, and a model’s context window is measured in total tokens it can hold in memory at once.

Why does AI forget things in long conversations?

Because the context window — the amount of text a model can hold in memory — has a limit. When a conversation fills that limit, the oldest tokens get dropped to make room for new ones. It’s not a bug. The model literally cannot see what’s no longer in the window. Newer models have much larger context windows (200K tokens or more), which reduces this problem significantly but doesn’t eliminate it.

What causes AI hallucination?

Language models are trained to predict the most statistically likely next token — not to look up facts. When a model doesn’t know something, it doesn’t say “I don’t know.” It generates what sounds like a correct answer based on learned patterns. The output can be completely wrong while sounding fully confident. This is why AI should be used for drafting and ideation, with independent verification for anything factually important.

What is RAG and why does it matter?

RAG (Retrieval-Augmented Generation) is how AI products work with your documents or recent data the model wasn’t trained on. The system stores document chunks in a vector database, searches for relevant chunks when you ask a question, and includes them in the prompt sent to the model. The model didn’t learn your documents — it’s being fed the relevant pieces in real time. Almost every “chat with your data” product uses this pattern.

What is the difference between LLM and AI?

AI (artificial intelligence) is the broad field covering all systems that simulate intelligent behavior. An LLM (large language model) is a specific type of AI trained to process and generate text by predicting tokens at scale. GPT-4o, Claude, and Gemini are all LLMs. Not all AI is an LLM — image recognition models, recommendation algorithms, and robotics systems are all AI but not language models.

What does inference mean in AI?

Inference is what happens when a trained AI model generates a response to your input — the actual computation that produces output. Training (building the model) happens once and costs enormous resources. Inference happens millions of times per day when people use the model. Inference speed and cost explain why some models are faster than others, why paid tiers exist, and why running AI locally is appealing even if local models are less capable.

What is temperature in AI and how does it affect output?

Temperature is a parameter that controls output randomness. Low temperature (near 0) makes the model pick the most statistically likely token each time — outputs are consistent, precise, and predictable. High temperature (near 1 or above) makes the model sample more randomly — outputs are more varied, creative, and occasionally surprising. Use low temperature for factual or technical tasks, high temperature for creative work. Most consumer apps set this to a fixed middle value and don’t expose the control.

Sources

  • Anthropic — Claude model documentation and context window specs: docs.anthropic.com
  • OpenAI — Tokenizer documentation and API pricing: platform.openai.com
  • Lewis et al. (2020) — “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (original RAG paper): arxiv.org/abs/2005.11401
  • Google DeepMind — Gemini 1.5 context window technical report: deepmind.google