How LLMs Work: A High-Level Overview

Have you wondered how LLMs like GPT, Claude, and Gemini work? You may have even tried understanding them but stopped because of too many unfamiliar terms.

You might think — I don't need to build one from scratch, so why understand it? But learning about these internal workings, even from a high-level perspective, will help you see these models differently and give you an edge when building AI applications.

Let's start from the basics. LLM stands for Large Language Model. Popular examples include GPT, Claude, Gemini, and DeepSeek.

OpenAI was the company that popularized LLMs by making them accessible to a mass audience through ChatGPT.

Now let's understand what happens behind the scenes when you send a message to these models.

The Pipeline

How LLMs process input through tokenization, vector embeddings, positional encoding, self-attention, and output generation

Highly simplified mechanism of how LLMs work: Input → Tokenization → Tokens → Vector Embedding → Positional Encoding → Self-Attention Mechanism → Output Tokens → Output in Natural Language

Tokenization

Computers are good at understanding numbers, but they cannot understand the natural language we use in daily life.

To make our natural language something computers can understand, a process called tokenization is performed.

It splits our input into small pieces called tokens and converts each one to a number based on a pre-defined vocabulary, which differs from model provider to model provider.

OpenAI, Anthropic, and Google each have different vocabularies, so the same input can be converted into different tokens.

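A common misconception is that every word is a token, but that's not how tokenization works. Short, common words are often a single token, while longer or rarer words may be split into several sub-word tokens.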

For example:

"The" → can be one token (it is used so often that it gets its own entry in the vocabulary)
"a" → can be one token
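
To make the idea concrete, here is a minimal sketch of a tokenizer. Everything in it is an assumption for illustration: the tiny vocabulary, the token IDs, and the greedy longest-match rule are made up, and this is not the byte-pair encoding that many production tokenizers use. It also lowercases the input, which real tokenizers generally do not.

```python
# Hypothetical toy vocabulary. Real vocabularies hold tens of thousands of entries.
VOCAB = {"the": 0, "a": 1, "token": 2, "ization": 3,
         "un": 4, "believ": 5, "able": 6, " ": 7}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text):
    """Greedy longest-match: at each position, take the longest vocabulary entry."""
    text = text.lower()  # simplification: this toy vocabulary is lowercase only
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def decode(ids):
    """Map token IDs back to text by joining their vocabulary entries."""
    return "".join(ID_TO_TEXT[token_id] for token_id in ids)


print(encode("The"))                 # [0]
print(encode("tokenization"))        # [2, 3]
print(decode(encode("the token")))   # the token
```

Notice that "tokenization" is one word but becomes two tokens here, while "The" stays a single token.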

Note: I built a small CLI tool that converts your text to tokens and back to text for learning purposes — tiktoken-cli.

Vector Embeddings

After tokenization, each token is mapped to a vector embedding: a list of numbers that places the token as a point in space. You can picture these points on a 3D graph (for visualization purposes; in reality, embeddings have hundreds to thousands of dimensions).

Embeddings capture the semantic meaning of tokens. Tokens that are used in similar contexts end up close to each other, so the AI can understand their relationships as well.

For example:

Harsh built his successful startup after failing 5 times.

Shreya built her successful startup after failing 8 times.

Vector embedding visualization showing semantic relationships between founders, startups, and failure counts

In these two sentences, "Harsh" and "Shreya" appear in the same context, as do "5" and "8", so their embeddings end up close to each other. This helps the LLM form relationships between words and understand their meaning.
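
Here is a small sketch of how "closeness" between embeddings can be measured. The 3-dimensional vectors are hand-picked assumptions, not values from any real model. Cosine similarity is one common way to compare embeddings, used here purely for illustration.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
EMBEDDINGS = {
    "Harsh":   [0.9, 0.1, 0.0],
    "Shreya":  [0.8, 0.2, 0.1],
    "startup": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


founders = cosine_similarity(EMBEDDINGS["Harsh"], EMBEDDINGS["Shreya"])
mixed = cosine_similarity(EMBEDDINGS["Harsh"], EMBEDDINGS["startup"])
print(founders > mixed)  # True
```

With these made-up numbers, the two founders score as more similar to each other than a founder does to "startup".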

Positional Encoding

But there is still a problem. When two sentences contain the same words in a different order, they can have different meanings, yet the same tokens will be generated for both sentences. So the AI might treat them as identical.

To solve this, a technique called positional encoding is used.

For example:

"Only I love her" vs "I only love her"

The two sentences have different meanings. Positional encoding adds information to each token's embedding about its position in the sentence.

So the same words in different positions will have different embeddings.
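
As a rough sketch, here is the sinusoidal positional encoding from the original Transformer paper, added to a token embedding. The 4-dimensional embedding for "only" is a made-up assumption, and many modern LLMs use other schemes (for example learned or rotary position embeddings), so treat this as one classic option and not as what any particular model does.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    vec = []
    for i in range(dim):
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_position(embedding, position):
    """Add the positional encoding to a token embedding, element by element."""
    encoding = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]


# Hypothetical embedding for the token "only" (values are illustrative).
only = [0.5, 0.1, -0.3, 0.8]

first = add_position(only, 0)    # "Only I love her": "only" is at position 0
second = add_position(only, 1)   # "I only love her": "only" is at position 1
print(first != second)  # True
```

The token is the same in both sentences, but the vector that moves on to the next step is different because the position is different.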

Transformer

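Embeddings, positional encoding, and self-attention are all part of an architecture called the Transformer. Tokenization happens just before it, as a preprocessing step. Self-attention, the one pipeline step we have not covered yet, lets every token look at the other tokens in the input and weigh how relevant each one is, so that each token's representation reflects its context. The Transformer ties these steps together and processes your input end to end. Each model has its own trained Transformer, with its own learned weights.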
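
The pipeline above lists a self-attention mechanism as one of the steps. Below is a minimal sketch of scaled dot-product self-attention in plain Python. It makes a big simplification: queries, keys, and values are the input vectors themselves, whereas a real Transformer first multiplies the input by learned weight matrices, uses many attention heads, and stacks many layers. The input vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified self-attention where queries = keys = values = the input."""
    dim = len(vectors[0])
    output = []
    for query in vectors:
        # Score how relevant every token is to the current one.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Blend all token vectors according to those weights.
        blended = [
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(dim)
        ]
        output.append(blended)
    return output


# Two hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention(tokens)
print(attended)
```

Each output vector is a weighted mix of all the input vectors, which is how a token's representation picks up information from the tokens around it.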

Next Token Prediction

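Lastly, these models predict the next token based on patterns learned from the data they were trained on. For every token in its vocabulary, the model computes how likely that token is to come next, and the next token is picked from the most likely candidates, not pulled out of thin air.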

They predict only one next token at a time.
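
Here is a small sketch of a single prediction step. The raw scores (often called logits) are invented for illustration. The sketch converts them to probabilities with softmax and then takes the most likely token, which is known as greedy decoding. Real systems often sample from the probabilities instead, so this is only one possible strategy.

```python
import math

# Hypothetical raw scores a model might assign to candidate next tokens.
logits = {"Hi": 3.2, "Hello": 2.1, "Goodbye": -0.5}


def to_probabilities(scores):
    """Softmax: convert raw scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {token: math.exp(s - highest) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


probs = to_probabilities(logits)
next_token = max(probs, key=probs.get)  # greedy choice: the most likely token
print(next_token)  # Hi
```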

For example:

Your input → "Hello"

Model response → "Hi Lakshay, how are you?"

How is it generated?

"Hello"              → Model → "Hi"
"Hello Hi"           → Model → "Lakshay"
"Hello Hi Lakshay"   → Model → ","
...until the model predicts the EOS (End of Sequence) token
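
The loop above can be sketched in a few lines. The "model" here is a hypothetical lookup table built from the example response, standing in for a real trained network, and the names `toy_model`, `generate`, and `max_new_tokens` are made up for this sketch.

```python
EOS = "<EOS>"  # hypothetical marker for the End of Sequence token

PROMPT = ["Hello"]
RESPONSE = ["Hi", "Lakshay", ",", "how", "are", "you", "?"]

# Hypothetical stand-in for a trained model: maps "tokens so far" to the next token.
LOOKUP = {
    tuple(PROMPT + RESPONSE[:i]): RESPONSE[i]
    for i in range(len(RESPONSE))
}


def toy_model(tokens):
    """Return the next token for a known sequence, or EOS for anything else."""
    return LOOKUP.get(tuple(tokens), EOS)


def generate(prompt_tokens, max_new_tokens=20):
    """Predict one token at a time, feeding each prediction back in as input."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        next_token = toy_model(tokens)
        if next_token == EOS:
            break
        tokens.append(next_token)
        generated.append(next_token)
    return generated


print(generate(["Hello"]))
# ['Hi', 'Lakshay', ',', 'how', 'are', 'you', '?']
```

The `max_new_tokens` limit is a safety net so the loop still stops if the model never predicts EOS.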

I hope this gives you a clear high-level picture of how LLMs work under the hood. This is a simplified explanation — modern LLMs contain many additional components, but understanding these concepts provides a solid foundation.