Introduction to LLMs and GPTs

A non-technical introduction to how Large Language Models (LLMs) and GPT models actually work — covering tokens, embeddings, the Transformer architecture, training, and where the field is heading.

By Sven von Känel · 49 min read
  • AI
  • LLM
  • Fundamentals

LLM introduction / TLDR

This article offers a generally accessible, non-technical introduction to how Large Language Models (LLMs) work, with a particular focus on the well-known GPT models. The aim is to present the fundamental concepts behind these impressive AI technologies in a way that readers without a technical background can follow the principles of language processing and generation. Key terms, central components such as the Transformer architecture and typical challenges of natural language processing are explained using accessible analogies.

Disclaimer: AI tools were used for research; photos are generated (apart from Nvidia / xAI AI). Diagrams, overviews and texts were created by us (apart from the "official" Transformer diagram).

Why does AI development seem to be progressing so fast since 2022?

Many of the foundations needed for Large Language Models (LLMs) and Generative Pretrained Transformers (GPTs) were developed over the past 30 years, but only when ChatGPT (based on GPT-3.5) appeared in 2022 did the field really arrive in the public consciousness. It has been advancing rapidly ever since. The following sections give a brief overview of the key developments and foundations that, together with today's powerful hardware, made this technological leap possible.

The theory — foundation of today's developments

Rule-based language processing

In the early years of computational linguistics, language programmes were driven by fixed rules and dictionaries. Linguists defined how sentences should be structured, and the computer followed those instructions strictly. It's roughly comparable to a worker who has a rigid rulebook for every step and can neither deviate from it nor improvise. Early chatbots such as ELIZA (1966) worked exactly along these lines, comparing user input with predefined patterns and producing matching answers. Systems like these could simulate simple conversations but quickly hit their limits as soon as the input no longer matched the stored rules.
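To make the rule-based idea concrete, here is a minimal sketch of pattern matching in the spirit of ELIZA. The rules and answers are invented for illustration and are not taken from the original ELIZA script; the point is only that anything outside the stored patterns falls through to a generic fallback.

```python
import re

# Hypothetical rules: (pattern, answer template). Not ELIZA's original script.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."


def respond(user_input: str) -> str:
    """Return the answer of the first matching rule, or a generic fallback."""
    text = user_input.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return FALLBACK


print(respond("I am tired."))
print(respond("The weather is nice."))
```

The second input matches none of the stored rules, so the program can only answer with the fallback. This is exactly the limit described above.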

Statistical language models (n-grams)

From the 1980s onwards, a shift in thinking set in — computers now learned from example data rather than just from hard-coded rules. Statistical language models analyse large bodies of text and count how often particular word sequences occur. From these counts, probabilities can be derived for the next word — similar to the autocomplete on a phone that suggests the most likely next word. A fitting analogy is a hobby cook who, after trying out many recipes, can devise their own workable procedures — without frustrating the eventual consumers of their craft too much. These n-gram models and related statistical methods were used, for example, in machine translation and speech recognition, and produced more fluent results than earlier approaches because they were based on real-world language data.
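The counting idea can be sketched in a few lines. The example below builds a bigram model (an n-gram model with n = 2) from a tiny, made-up corpus and derives next-word probabilities from the counts. Real systems work the same way in principle but count over vastly larger text collections and add smoothing for unseen word pairs, which is left out here.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real models count over millions of sentences.
corpus = [
    "the cook tastes the soup",
    "the cook salts the soup",
    "the cook tastes the sauce",
]

# Count how often each word follows each other word (bigrams, i.e. n = 2).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        follow_counts[current][nxt] += 1


def next_word_probabilities(word: str) -> dict[str, float]:
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


print(next_word_probabilities("cook"))
print(next_word_probabilities("the"))
```

After "cook", the model prefers "tastes" simply because that pair occurs more often in the corpus than "cook salts".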

Word embeddings

To capture the meaning of words more accurately, word embeddings were introduced in the 2010s. Each word is represented as a numerical vector that describes how it is classified across different contexts and dimensions. Words with similar meanings sit close together in this vector space — you can picture it like a workshop in which similar tools (for example, screwdrivers and wrenches) live in the same drawer, while a tool of a different kind — like a paintbrush — is stored further away. Numerically, this is expressed through similar values along the "functionality" dimension. A well-known example is Word2Vec (2013), where the computer learns, for instance, that "king" and "queen" or "car" and "vehicle" occur in similar contexts. Such embeddings allow NLP systems to recognise similarities in meaning and produce better results on tasks like search, translation or text classification.
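The "same drawer" idea can be illustrated with a small calculation. The vectors below are hand-made toy values with three invented axes; real embeddings such as those from Word2Vec have hundreds of learned dimensions that carry no readable labels. Cosine similarity is one common way to measure how close two vectors are.

```python
import math

# Hand-made 3-dimensional toy vectors. The axes (turns fasteners, applies
# paint, is hand-held) are invented for illustration only.
embeddings = {
    "screwdriver": [0.9, 0.0, 0.8],
    "wrench": [0.8, 0.1, 0.7],
    "paintbrush": [0.0, 0.9, 0.8],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


tool_pair = cosine_similarity(embeddings["screwdriver"], embeddings["wrench"])
mixed_pair = cosine_similarity(embeddings["screwdriver"], embeddings["paintbrush"])
print(round(tool_pair, 3), round(mixed_pair, 3))
```

With these toy values, the screwdriver and the wrench score much higher than the screwdriver and the paintbrush, which mirrors the drawer analogy.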

Neural networks for language (RNNs and LSTMs)

With more powerful computers, deep neural networks entered language processing from the 2010s onwards. Recurrent Neural Networks (RNNs) read a sentence word by word and maintain an internal state to remember what came before. This way they can take the context of previously read words into account — similar to a craftsman who follows a multi-step set of instructions step by step and remembers what they did earlier. One problem with simple RNNs, though, was that they tended to "forget" important information over longer sentences. The remedy came with the Long Short-Term Memory (LSTM) cell: this extended RNN, introduced in 1997, has a kind of long-term memory that filters out unimportant information and stores the important pieces across many words. In effect, the craftsman now takes notes during the work to refer back to later. Thanks to RNNs and LSTMs, translation and speech recognition systems can process whole sentences meaningfully without losing the thread after just a few words.
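The "forgetting" of simple RNNs can be shown with a deliberately tiny sketch. It uses a single number as the internal state and two hypothetical fixed weights; real RNNs use learned weight matrices and vector states, and an LSTM adds gates that are not shown here.

```python
import math


def rnn_step(hidden: float, x: float, w_hidden: float = 0.5, w_input: float = 1.0) -> float:
    """One recurrent step: mix the previous state with the new input."""
    return math.tanh(w_hidden * hidden + w_input * x)


def run_rnn(inputs: list[float]) -> list[float]:
    """Feed the inputs one by one and record the state after each step."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states


# A strong signal at the first position, followed by "empty" inputs.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

The trace of the first input shrinks with every step. This fading is the problem that the LSTM's long-term memory was designed to reduce.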

Attention mechanism

A major leap forward came in the mid-2010s, when models learned to dynamically focus on important word relationships. The attention mechanism allows an AI model to decide, for each word it generates, which parts of the input text are particularly relevant. You can picture it as a conductor who, during an orchestral piece, signals to individual sections of instruments when they should step forward and play more prominently. In every bar the conductor "listens" to the whole orchestra but decides which instruments deserve special attention at that moment. In this picture, the words that matter most for the current step play "louder" than the rest; which words those are is learned from data and changes with context rather than being fixed by word type. This targeted weighting of word relationships allowed translation systems, for example, to deliver significantly better results, because at every word the model could take the relevant connections in the source sentence into account. The attention mechanism effectively computes a "meaning map" of all words in context and thus enables a deeper understanding of linguistic structures. This paved the way for the Transformer architecture, which works particularly efficiently because it processes these relationships in parallel.
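As a rough sketch of the weighting step, the example below computes attention weights for one word using scaled dot products followed by a softmax, the formulation used in Transformers. All vectors are invented toy values; in a real model the query and key vectors are produced by learned projections.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Convert arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Scaled dot-product scores between one query and every key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Toy 2-dimensional key vectors for "the fish goes in the oven"
# (values invented for illustration).
words = ["the", "fish", "goes", "in", "the", "oven"]
keys = [[0.1, 0.0], [2.0, 0.5], [0.3, 0.4], [0.0, 0.1], [0.1, 0.0], [0.5, 2.0]]
query = [2.0, 0.0]  # hypothetical query that "looks for" the subject

weights = attention_weights(query, keys)
for word, weight in zip(words, weights):
    print(f"{word:>5}: {weight:.2f}")
```

The weights always add up to 1, so attention is a way of distributing a fixed budget of "volume" across the words.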

More on this later when we look at how an LLM with a Transformer architecture works.

Transformer architecture

In 2017, Google revolutionised the NLP world with the Transformer architecture. A Transformer model processes an entire sentence at once rather than word by word, and uses several attention mechanisms (self-attention) to evaluate the relationships between all words in parallel. Figuratively, it's like a full orchestra in which not just one conductor but every musician is simultaneously conductor and instrumentalist: each "instrument" (word) perceives all the others and decides for itself which to pay particular attention to. This is how it works out, for example, which adjectives belong to a noun and what meaning lies behind them.

While conventional models play a piece of music bar by bar, the Transformer takes in the whole score at once, with different "attention layers" (attention heads) highlighting different aspects of the composition — some focus on rhythm, others on melody or harmony. The sentence fragment "The loud sports car" is semantically almost equivalent to "The roaring jalopy" — but on the attention layer, or in the "style" dimension, clear differences become apparent.

This parallelisation and the multi-layered distribution of attention let Transformers process very long texts and complex relationships efficiently, without losing information from the start of a sentence. Transformer models have since become the heart of modern language systems — today's translators, voice assistants and Large Language Models all build on this architecture.

Pretrained language models and transfer learning

Instead of building a new model for every NLP task, since around 2018 huge language models have been trained first on enormous text collections. The model learns general language patterns and relationships from books, articles or the web (for example, the entirety of Wikipedia). It can then be specialised on a concrete task with relatively few additional examples; this is called transfer learning. It's roughly comparable to an apprentice who first studies every available textbook in vocational school to internalise the basics and can then, with specific guidance from the master, tackle a wide range of very different tasks. An early example is BERT from Google (2018): this model was pretrained on English Wikipedia and a large collection of books and could then, with light fine-tuning, handle many different language tasks (from answering questions to classifying texts). The GPT series from OpenAI emerged in the same period. GPT-2 (2019) impressed with its ability to write coherent texts from just a few keywords. GPT-3 (2020) went a step further: with 175 billion parameters (more on that later) it showed an astonishing versatility that allowed it to carry out tasks expressed in natural language with minimal additional instruction. These kinds of pretrained large language models form the basis for practical applications today: ChatGPT, for example, is built on a pretrained model and can answer questions or carry on conversations without having to be manually programmed for each individual task.

Historical milestones

Here is a summary of selected key milestones, with a particular focus on the development of natural language processing (NLP), since this is what matters most for today's GPTs:

Year Milestone (NLP/LLM development)
1950 Alan Turing proposes the Turing Test — a criterion for whether a computer can convincingly imitate human thought through a language-based dialogue.
1966 MIT researcher Joseph Weizenbaum presents ELIZA, the first chatbot. ELIZA reacts to input using simple pattern rules, showing how computers can carry on simple dialogues.
1980s A shift from purely rule-based systems to statistical methods in language processing. Computers now learn from large text corpora using probabilities, significantly improving natural language processing.
1997 Introduction of LSTM networks (Long Short-Term Memory) by Hochreiter and Schmidhuber. These extended neural networks can also remember information from far back in a text, solving a key problem of earlier RNNs.
2013 Google develops Word2Vec, a method for creating word embeddings. Words are represented as dense vectors that can be learned efficiently from large text collections, so similarities in meaning can be captured computationally.
2017 Google publishes the paper "Attention is All You Need". It introduces the Transformer architecture with self-attention — a paradigm shift that makes parallel processing of language possible.
2018 The pretrained language model BERT (Bidirectional Encoder Representations from Transformers) is introduced. BERT uses the Transformer to understand words in context from both sides and ushers in the era of transfer learning in language processing.
2019 OpenAI releases GPT-2, a generative language model with 1.5 billion parameters. GPT-2 can write a long, coherent text from a short input and sets new benchmarks for text generation.
2020 OpenAI's GPT-3 (175 billion parameters) becomes the largest language model to date. GPT-3 demonstrates that a single model can solve a wide variety of language tasks through pretraining alone, with little or no specific adaptation.
2022 ChatGPT is made publicly available. This dialogue system based on GPT-3.5 shows impressively how far LLM technology has come, by communicating with users in natural language and answering complex queries in an understandable way.

Powerful hardware as an accelerator

Companies such as Nvidia originally developed graphics processing units (GPUs) for rendering graphics; later generations were also optimised for training large language models. GPUs are particularly well suited to AI computations because they can process floating-point numbers massively in parallel and can therefore carry out large numbers of matrix operations efficiently, a key building block both in gaming and in training neural networks. Nvidia's data-centre architectures Hopper and Blackwell are specifically optimised for the demands of AI workloads and deliver substantially higher speed and efficiency than conventional GPUs (de.wikipedia.org). In particular, the introduction of the Blackwell architecture in 2024 led to significant performance gains, making the training and use of LLMs more efficient. The speed of these chips is immense: Nvidia's H100, for example, reaches on the order of 1 petaflop = 1,000 teraflops = 1,000,000 gigaflops (a million billion operations per second; FLOP = floating-point operation) in the reduced-precision 16-bit arithmetic typically used for AI training. Its throughput in standard FP32 arithmetic is considerably lower.

Even so, efficient training of LLMs requires thousands of these high-performance cards working together: for example, 100,000 of them in xAI's new "Colossus" data centre in Boxtown, a suburb of Memphis, Tennessee, which was used to train the LLM "Grok 3".

Not only the number of GPUs, but other figures too, are impressive:

  • Colossus was built in just 122 days and is considered one of the largest AI data centres in the world

  • The facility covers an area of about 69,700 square metres (750,000 square feet)

  • It originally launched in September 2024 with 100,000 NVIDIA H100 GPUs; today there are 200,000

  • The power capacity was raised from an initial 150 MW to about 250 MW by December 2024, with up to 1.2 gigawatts planned

Financially strong technology companies

Sums of the order needed to build data centres on the scale of Colossus can only be carried by extremely well-funded investors. That's why companies such as Microsoft, Google and Meta have invested huge amounts into AI research, into ever larger and more powerful models, and into the necessary infrastructure. These companies can fund such investments from their high ongoing cash flows, which stem from their anchoring in the platform economy (easy scaling of the business model across millions of users). Other players such as OpenAI or Anthropic regularly look for similarly well-funded investors. OpenAI, for example, secured a funding round of up to 40 billion dollars in 2025 to finance the development of new models.

Competition and innovation

The intense competition between these technology companies keeps accelerating progress. Companies like Elon Musk's xAI and Meta are building large clusters of Nvidia chips to develop more powerful AI systems. This race for AI supremacy, primarily in the US and China, is driving the rapid development of GPT models. On top of that, considerable investment continues to flow into AI research.

LLM basics

Definitions

Large Language Models (LLMs) are large, pretrained neural networks specialised in processing and generating natural language. They are usually based on the Transformer architecture and are trained on huge volumes of text data to handle a wide range of language tasks such as text understanding, translation or summarisation.

GPT (Generative Pre-trained Transformer) is a particular class of LLMs developed by OpenAI. While LLM is a general term for large-scale language models, GPT refers to a specific architecture within that category, characterised by an autoregressive prediction model — it generates text by predicting word by word based on the preceding words. GPT models are a subset of LLMs, but not all LLMs are GPT models.

Components of an LLM

The Transformer architecture

The Transformer architecture is the underlying model behind the well-known modern Large Language Models (LLMs) such as ChatGPT, Gemini, Claude or Grok. In its original form it consists of multiple encoder and decoder layers designed to process complex linguistic relationships efficiently; many current models use only part of this design, as described below for GPT.

The encoder

An encoder processes the input text and converts it into an abstract representation made up of vectors (embeddings). In doing so it recognises relationships between words, takes meanings into account and structures the information so it can be used by the decoder. Instead of producing a direct word-by-word translation, it builds a much richer representation of the input that enables the decoder to generate a coherent and — hopefully — meaningful response.

The decoder

A decoder in an AI language model turns an internal representation of information into understandable text. It analyses the words already generated to ensure consistency and coherence, and uses information from the encoder to reproduce relevant content correctly. Based on trained patterns and probabilities it decides step by step which word should come next, until a complete and meaningful response emerges.

Analogy: subject-matter analyst and editor

A fitting analogy is the collaboration between a subject-matter analyst and an editor. The encoder plays the role of the analyst, who extracts, structures and prepares all relevant information from a source for further processing. The decoder acts as the editor, formulating an understandable and coherent text based on this analysis. Through repeated processing steps across multiple layers, the quality and accuracy of the output is refined, so the model can produce linguistically high-quality and contextually appropriate answers.

The diagram shows the classic encoder–decoder architecture of a Transformer, in which the encoder processes an input sequence and produces an abstract representation, which the decoder then uses to generate an output. A GPT, however, is a special case of a Large Language Model (LLM) that consists of a decoder only and dispenses with the encoder. While in the classic Transformer architecture the decoder can access the complete encoder representation in order to produce an output word by word, a GPT has to handle the entire processing on its own.

A central mechanism that makes this possible is masked multi-head attention, shown in the decoder block of the diagram. This mechanism is used in GPTs to ensure that each token can only look at previous tokens in the sequence, never at future ones. The model processes text autoregressively, step by step, predicting the next token based on the tokens already generated. This is what makes a GPT substantially different from encoder-based models such as BERT, which consists of an encoder only and reads the entire input in both directions at once.

Because a GPT has no encoder, it has to derive the entire context from the tokens already generated. Every new token again passes through the multi-layered self-attention mechanisms, reusing already computed information to produce coherent, grammatically correct and meaningful sequences. This autoregressive structure makes GPT particularly well suited to applications such as text generation, dialogue systems and creative language processing, where a coherent and fluent output has to be produced step by step.
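The "only look back, never ahead" rule of masked attention can be written down as a simple table of allowed positions. The following sketch builds such a causal mask for a four-token example; the function name and the example sentence are chosen for illustration only.

```python
def causal_mask(length: int) -> list[list[bool]]:
    """mask[i][j] is True when position i is allowed to look at position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


tokens = ["The", "cat", "sat", "down"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token:>5} can see: {visible}")
```

In a real model, the positions marked as not allowed receive an attention weight of zero, so no information can flow from later tokens to earlier ones.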

The other terms used in the diagram will be explained in more detail shortly in the "Decoder" section.

Natural language understanding by LLMs

Challenges of natural language processing (NLP)

Human language consists of hundreds of thousands of words, which on top of that come with different shades of meaning, synonyms and dialects. For computers, dealing with this variety is a challenge, much like for an apprentice in a workshop. The apprentice not only has to know the names of the tools, but also that "spanner" and "wrench" can mean the same thing, while "drill" and "router" may look similar but refer to different tools.

To understand language properly, computers also need to recognise how terms are hierarchically linked. A dachshund, for example, isn't just a standalone concept but also a breed of dog, and dogs in turn are mammals. A supermarket works in a similar way: fruit, vegetables and drinks are organised into categories. A point-of-sale system has to recognise that an orange and a banana both fall under "fruit", even though they're clearly different products.

Another stumbling block is how specific or general words are. While "screwdriver" is very specific, "tool" covers a very broad area. Language models therefore have to learn to gauge the right level of generality — comparable to a carpenter who, for a piece of furniture, can speak generally of "wood", more specifically of "hardwood", or very specifically of "oak".

Beyond that, many terms are ambiguous and their exact meaning only becomes clear from context. In English, the word "bank", depending on context, means either a financial institution or the side of a river. For an electrician, in turn, cable isn't just cable: only the context — wiring diagrams or the surrounding environment — makes it clear whether it's meant for power or for data transmission. Mistakes here aren't recommended.

It's not just the words themselves but also emphasis and the relationships within a sentence that determine its meaning. The statement "I didn't take the cupboard apart" changes depending on whether "cupboard" is stressed (because perhaps the table was taken apart) or whether "didn't" is stressed (because the cupboard was left standing). In a recipe like "sprinkle the fish with herbs and put it in the oven for 20 minutes", it also needs to be clear that it's the fish that goes in the oven, not (only) the herbs.

To handle this complexity, computers need methods that allow them to store words not just as strings but actually together with their meaning. A painter groups similar shades of colour next to each other on the palette; in the same way, words like "car" and "vehicle", which carry similar meanings, should sit close together so their semantic relationship is easier to recognise.

Unlike databases, the human brain stores knowledge associatively — not in structured lists, but in webs of relationships and memories. A carpenter knows, for example, that oak is harder to work than beech, without necessarily remembering the exact density. In a similar way, language models store general knowledge and form connections without committing every detail precisely to memory.

Beyond pure understanding, computers also have to be able to produce sensible and grammatically correct texts themselves. This is comparable to an apprentice in their first year, who can identify tools and materials but is only able, after gaining practical experience, to build a stable and aesthetically pleasing piece of furniture.

It's equally hard to recognise irony or emphasis. People often say the opposite of what they really mean, or use subtle inflection. "Yes, exactly what I wanted..." can express either genuine enthusiasm or annoyed irony. In the same way, a facial expression like a smile, without further context, can easily be misread as a grimace.

People also don't speak perfectly: they make typos, use colloquialisms, or leave words out. A good language model has to understand what was meant anyway, and be able to correct errors. This is comparable to a mechanic who notices that a screw is sitting crookedly and straightens it, even if that wasn't part of the original plan.

Finally, language is often tied to visual or other sensory data. Modern AI models therefore have to learn to link language with images and similar information. An image caption has to correctly identify the depicted object while also being phrased meaningfully. In a similar way, an architect describes a building project not just with words, but also with plans and models.

Semantic search

How it works and how it differs from keyword search

Semantic search is a method of information retrieval that doesn't just look for exact keywords but understands the meaning (semantics) of a query and finds content that matches it based on substantive relationships. Unlike classic keyword search, which only looks for exact word matches or clear rules, semantic search can take synonyms, contextual relationships and topically related content into account. This enables a more intelligent and precise search, especially for complex or naturally phrased queries.

Role of the encoder in semantic search

The encoder of a Large Language Model (LLM) plays a decisive role in semantic search, because it converts texts into a numerical representation (embedding). In doing so, it not only analyses individual words but recognises their meaning in context.
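The difference between the two search styles can be sketched with toy data. In the example below, the word vectors, the two documents and the `embed` function (which simply averages word vectors) are all assumptions made for illustration; a real system would use a trained encoder model to produce the embeddings.

```python
import math

# Hypothetical word vectors on two invented axes: (vehicle-ness, food-ness).
WORD_VECTORS = {
    "car": [0.9, 0.0], "automobile": [0.9, 0.1], "vehicle": [0.8, 0.0],
    "repair": [0.5, 0.0], "recipe": [0.0, 0.9], "soup": [0.0, 0.8],
}

DOCUMENTS = ["automobile repair", "soup recipe"]


def embed(text: str) -> list[float]:
    """Average the vectors of all known words (a crude stand-in for an encoder)."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(axis) / len(vectors) for axis in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if one vector has length zero."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def keyword_search(query: str) -> list[str]:
    """Return documents sharing at least one exact word with the query."""
    terms = set(query.lower().split())
    return [d for d in DOCUMENTS if terms & set(d.split())]


def semantic_search(query: str) -> str:
    """Return the document whose embedding is closest to the query's."""
    q = embed(query)
    return max(DOCUMENTS, key=lambda d: cosine(q, embed(d)))


print(keyword_search("car"))
print(semantic_search("car"))
```

The keyword search comes back empty because no document contains the literal word "car", while the semantic search still finds the document about automobiles.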

Embeddings: a technical mapping of language

Explaining the concept of embeddings

An embedding represents words as vectors in a high-dimensional space and makes it possible to measure semantic relationships by the distances between these vectors. To picture this more easily, you can use three-dimensional space as an analogy, in which two aircraft are described by their position relative to each other. If two aircraft have similar coordinates on the axes of altitude, speed and geographic position, they are close together. Language works similarly: words with similar meaning or use also lie close together in the multi-dimensional vector space.

Example: semantic relationships in the trades

A concrete example of the semantic structure of such a high-dimensional space are the terms tool, hammer, nail, master craftsman and apprentice. While the first three terms describe physical objects, the last two stand for people. This distinction is reflected in the vector space: tools such as hammer and nail lie closer together because they often appear together in texts and are in a functional relationship with each other — a hammer is used to drive nails. The general word tool also lies nearby, but a little further away, because it represents a higher-level category that covers many different tools.

It's different with the terms master craftsman and apprentice, which also belong to the world of the trades but differ semantically from the objects. Because they refer to people, their embeddings carry more weight in dimensions standing for human roles and relationships. The master craftsman can have a high value on a dimension representing authority or experience, while the apprentice has a lower value on the same dimension, because they are in a learning position. On another dimension representing the trade as a topical area, however, both terms have similar values, indicating that they are related in content.

These high-dimensional relationships help language models understand not just word meanings but also their function and context. So a model recognises, for instance, that a sentence about a master craftsman showing an apprentice how to drive a nail with a hammer is a logical and coherent statement. The close relationship between hammer and nail is represented by their proximity in vector space, while the distinction between tool and person is reflected in different dimensions. In this way a language model can not only group similar terms but also correctly place their contextual relationship.

How does the encoder of an LLM produce embeddings?

Step-by-step generation of embeddings

The encoder of an LLM has the job of turning the input data into technically processable embeddings, by analysing them step by step and condensing them into a compact, meaningful representation. This process resembles the work of a professional archivist filing a new document into an ordered system. First the text has to be broken down (tokenisation) and analysed — comparable to archiving a new book, which is first classified by title, topic and content. In the first phase the encoder processes the tokens, the individual words or word pieces, and converts them into a numerical representation with the help of embedding vectors. This corresponds to library classification, where each book is given a specific category so it can be found again later.
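The first phase, breaking text into tokens and looking up a vector for each, can be sketched as follows. The mini-vocabulary, the greedy longest-match splitting and the 2-dimensional vectors are simplifying assumptions; production tokenisers use learned vocabularies with tens of thousands of entries, and the vectors are learned during training.

```python
import re

# Hypothetical mini-vocabulary mapping word pieces to token IDs.
VOCAB = {"the": 0, "craft": 1, "sman": 2, "drives": 3, "a": 4, "nail": 5, "<unk>": 6}

# One made-up 2-dimensional vector per token ID (real models learn these).
EMBEDDING_TABLE = [
    [0.1, 0.0], [0.7, 0.2], [0.3, 0.6], [0.5, 0.5], [0.0, 0.1], [0.8, 0.1], [0.0, 0.0],
]


def tokenize(text: str) -> list[str]:
    """Greedy longest-match split of each word into known pieces."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        start = 0
        while start < len(word):
            for end in range(len(word), start, -1):
                if word[start:end] in VOCAB:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                tokens.append("<unk>")  # no known piece fits here
                break
    return tokens


def embed(tokens: list[str]) -> list[list[float]]:
    """Look up the vector for each token via its ID."""
    return [EMBEDDING_TABLE[VOCAB[t]] for t in tokens]


tokens = tokenize("The craftsman drives a nail")
print(tokens)
print(embed(tokens))
```

Note how "craftsman" is not in the vocabulary as a whole word and is therefore split into two pieces, each with its own vector.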

The self-attention mechanism

In the next step the encoder uses the self-attention mechanism to recognise which words are particularly important in the given context. This works like an archivist who doesn't just look at a book's title but also checks which other books are topically related, whether there are cross-references to other works, or whether a term is used differently in different chapters. This way the encoder can recognise, for example, that the word "nail" in the sentence "The craftsman drives a nail into the wall" carries a different meaning than in "His nail has broken off", because they sit in different topical contexts.

Analogy: archivist in an archive system

These pieces of information are then processed through several layers of a neural network, comparable to the various departments of an archive, in which documents are further categorised by relevance, level of detail and substantive relationships. At the end of this process the encoder no longer outputs the original words, but a series of high-dimensional vectors that represent the meaning and context of the input. These embeddings are now ready for the decoder, which can generate a coherent and meaningful output from them — similar to an archive system that delivers all relevant documents on a given topic upon request.

Context window of an encoder

For Large Language Models (LLMs), a context window describes the maximum amount of text the model can process at the same time and take into account. You could say it's the model's memory for the current conversation or task. You can compare it to a craftsman's workbench: the larger the bench, the more tools and materials fit on it at once, and the more easily the craftsman can work without constantly searching or rearranging. The larger an LLM's context window, the better the model can handle long texts, whole documents or extensive conversations without forgetting important details. If the context window is too small, by contrast, the model "forgets" information from the beginning of the text — similar to a craftsman who keeps having to pick up tools that have fallen from a small bench and sort them again. A large context window therefore not only improves answer quality but also enables entirely new use cases such as working through whole books or comprehensively analysing large software projects.
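The workbench picture can be turned into a small sketch. It assumes the simplest possible strategy for an overfull window, namely dropping the oldest tokens; real applications may handle this differently, for example by summarising older parts of a conversation.

```python
def fit_to_context_window(tokens: list[str], window_size: int) -> list[str]:
    """Keep only the most recent `window_size` tokens; older ones are dropped."""
    if window_size <= 0:
        return []
    return tokens[-window_size:]


conversation = "my name is Anna . I need help with my router . what is my name ?".split()

small = fit_to_context_window(conversation, 6)
large = fit_to_context_window(conversation, 32)

print("Anna" in small)
print("Anna" in large)
```

With the small window, the name from the start of the conversation has already fallen off the bench by the time the question is asked.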

Prompt engineering

Besides the pure textual meaning, the formulation of the input text for the encoder also matters. Prompt engineering refers to deliberately shaping inputs (prompts) to obtain optimal answers from AI models. Good prompts contain clear instructions, specific context and, where appropriate, examples. Similar to a director who gives clear directions, the user can in this way ensure the model delivers precise results. Since GPT models generate answers solely on the basis of the input context, the design of the prompt has a major influence on the quality and accuracy of the generated text. Further information and many examples can be found, for example, in articles by Datacamp and 121watt.
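As a small illustration of these three ingredients, the sketch below assembles an instruction, context and examples into one prompt string. The layout and the labels ("Instruction:", "Context:" and so on) are one possible convention chosen for this example, not a format required by any particular model.

```python
def build_prompt(instruction: str, context: str,
                 examples: list[tuple[str, str]], question: str) -> str:
    """Assemble instruction, context and worked examples into one prompt string."""
    parts = [f"Instruction: {instruction}", f"Context: {context}"]
    for example_input, example_output in examples:
        parts.append(f"Example input: {example_input}\nExample output: {example_output}")
    parts.append(f"Input: {question}\nOutput:")
    return "\n\n".join(parts)


prompt = build_prompt(
    instruction="Classify the customer message as 'complaint' or 'praise'. Answer with one word.",
    context="Messages come from the support inbox of a carpentry workshop.",
    examples=[("The table wobbles.", "complaint"), ("Beautiful cupboard, thank you!", "praise")],
    question="The drawer is stuck again.",
)
print(prompt)
```

Ending the prompt with an open "Output:" line invites the model to continue with exactly the kind of short answer the examples demonstrate.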

Text generation by LLMs

This brings us to the question of how an LLM generates an answer from input in the form of embeddings, and why this typically happens word by word.

How the decoder works

Overview

The decoder architecture of a Transformer is based on deep neural networks (deep learning), which use trained weights across several processing steps to generate text. A neural network is a computational model structure made up of multiple layers of artificial neurons. These neurons process inputs, weight them and pass them on to recognise complex relationships. Deep learning refers to a particular form of machine learning in which such neural networks have especially many layers (hence "deep") and are trained on large amounts of data with optimised algorithms.

In the Transformer's decoder, several of these deep neural networks are at work: the multi-head attention network uses fully connected neural networks to compute which parts of a text are important for generating the current word. Multi-head attention is an extension of self-attention in which several independent self-attention computations (heads) are carried out in parallel. You can think of this as deploying several experts who analyse a text from different perspectives. Each head learns different relationships between the words, which improves the representation. The feed-forward network consists of several stacked neural layers with non-linear activation and helps capture complex relationships and patterns. The softmax function in the output layer finally converts the computed values into probabilities for the next word. Through this multi-layered neural structure, deep learning enables the decoder to generate context-aware, grammatically correct and meaningful texts by learning from large amounts of training data.
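The last step, turning computed values into probabilities, is easy to show in isolation. The sketch below applies the softmax function to three hypothetical scores for candidate next words; a real model does the same over its entire vocabulary of tens of thousands of tokens.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Map raw scores to probabilities that are positive and sum to 1."""
    highest = max(logits)  # subtracting the maximum avoids overflow
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the word after "The craftsman picks up the ..."
candidates = ["hammer", "nail", "banana"]
logits = [2.0, 1.0, -1.0]

probabilities = softmax(logits)
for word, p in zip(candidates, probabilities):
    print(f"{word}: {p:.3f}")

best = candidates[probabilities.index(max(probabilities))]
print("most likely next word:", best)
```

Always choosing the most likely word is only one option; text generators often sample from these probabilities instead, which makes the output less repetitive.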

Analogy: professional translator

A decoder can also be explained by analogy with a professional translator who translates a text piece by piece, drawing on several cognitive processes at once. Multi-head attention corresponds to the translator's ability to focus on different aspects of the source text simultaneously: they keep the entire passage translated so far in mind, double-check the original meaning, and pay attention to grammatical structures to deliver an accurate translation. The feed-forward network is comparable to the translator's inner sense of language, which helps refine the chosen formulations, recognise synonyms, and adapt sentence structure to the target language. Finally, the softmax function takes on the role of the final word choice: the translator may have several suitable terms in mind, weighs how likely each is to fit the given context, and then settles on the most fitting word. In this way the decoder ensures the generated output is both factually accurate and linguistically fluent.

How can the decoder draw on learned knowledge?

To generate a fitting response to the request handed over by the encoder, the decoder draws on already trained neural networks. Several training methods are used in the process. First, however, some background on how knowledge is stored in the decoder's neural networks.

Storing the knowledge learned through training

A key element that allows the decoder to draw on learned knowledge is the parameters of its neural networks (millions in small models, many billions in large ones), which are optimised during the training process. These parameters are the weights that determine within the neural network how strongly a given input word is linked to other concepts or patterns. You can think of these weights as a kind of memory of the model, allowing it to fall back on already-learned linguistic structures and meanings.

To make this idea easier to grasp, you can picture the neural network of a language model as a large web of artificial neurons arranged in several levels (layers), which — much like the human brain — forms connections between concepts through training. Each of these artificial neurons is connected to others, and the weights determine the strength of these connections. When a GPT model processes the word "doctor", for example, the weighting automatically activates related terms such as "patient", "hospital" or "diagnosis" with higher relevance. These semantic associations are the result of an extensive training process in which the model has learned, on the basis of huge amounts of text data, which words frequently appear in similar contexts.
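To illustrate how weights determine the strength of a connection, here is a minimal sketch of a single artificial neuron. The input values, weights and the choice of a sigmoid activation are assumptions made for illustration; real models combine a very large number of such units.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs, then a sigmoid activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))


inputs = [1.0, 0.5, 0.0]
# The same inputs produce a strong or a weak activation depending on the weights.
strong = neuron(inputs, weights=[2.0, 1.0, -1.0], bias=0.0)
weak = neuron(inputs, weights=[0.1, 0.1, -1.0], bias=0.0)
print(round(strong, 3), round(weak, 3))
```

Training changes nothing but these weights (and biases), which is why they can be seen as the model's memory.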

The number of parameters grows rapidly with the size of the model: it rises in proportion to the number of layers and roughly with the square of the width of each layer. Modern Large Language Models have billions of such parameters; GPT-3, for example, has 175 billion, while OpenAI has not published a figure for GPT-4. All of them are tuned during training through extensive computation. Every one of these parameters contributes to capturing linguistic nuance, grammatical structure and stylistic subtlety. You can picture the model as a gigantic pattern-recognition machine that generates new text based on its weighted connections by predicting the most likely next words.
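A rough back-of-the-envelope calculation shows where such numbers come from. The sketch below uses a common rule of thumb (about 12 times the number of layers times the squared model width) and the publicly documented GPT-3 configuration of 96 layers and a width of 12,288. It ignores embeddings and biases, so it is an approximation, not an exact count.

```python
def approx_transformer_params(num_layers, d_model):
    """Rough parameter count of the Transformer blocks.

    Per layer: about 4 * d_model^2 weights for attention (query, key, value
    and output projections) plus about 8 * d_model^2 for a feed-forward
    network whose hidden layer is 4 times wider. Embeddings and biases
    are ignored.
    """
    return 12 * num_layers * d_model ** 2


# Published GPT-3 configuration: 96 layers, model width 12,288.
gpt3 = approx_transformer_params(num_layers=96, d_model=12288)
print(f"{gpt3 / 1e9:.0f} billion parameters")

# Under this formula, doubling the width quadruples the count.
print(approx_transformer_params(96, 2 * 12288) / gpt3)
```

The estimate lands close to the 175 billion parameters reported for GPT-3.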

During training, these parameters are continuously adjusted to improve the quality of the predictions. Without this large number of optimised parameters, the model wouldn't be able to produce coherent, context-aware and stylistically appropriate answers. The enormous computational power required to adjust these parameters shows why training modern language models is such a resource-intensive process — and why pretraining plays such a decisive role in the decoder's later performance.

Pretraining

Pretraining plays a central role for the decoder's neural networks, because it lays the foundation for their ability to process natural language. During pretraining, the model is trained on large volumes of text data to learn statistical patterns, grammatical structures and semantic relationships. The weights of multi-head attention, the feed-forward network and the softmax layer are optimised so that the model learns to relate relevant words to each other, recognise meaningful sentence structures and compute realistic probabilities for the next word. Without this pretraining, the decoder wouldn't be able to generate coherent and meaningful texts, as it would have no language patterns from prior experience to fall back on. Pretraining is often carried out on large text corpora before the model is specialised, through fine-tuning, on specific tasks such as machine translation or question-and-answer systems.

Backpropagation — how an LLM is trained

A central mechanism by which an LLM learns during pretraining is backpropagation, a procedure for adjusting the weights within the neural network. A simple example illustrates the process: suppose the model is to learn how to continue the sentence "The sky is …". During training, the model sees a large number of example sentences, and in this case the correct next word is "blue". At first, however, the model generates a prediction based on its current weights and might rate "vast" or "beautiful" as the most likely words.

Since the training text continues with "blue", the error (the difference between the model's prediction and the correct answer) is computed. This error is then propagated backwards through the entire neural network, hence the name backpropagation, so that the weights in the preceding layers can be adjusted and "blue" receives a higher probability in future predictions. To do this, the gradients are computed (the direction and magnitude in which each weight needs to change), and an optimisation algorithm such as Stochastic Gradient Descent (SGD) or Adam then updates the weights accordingly.

After many iterations, the model learns that "blue" is one of the most likely continuations of "The sky is" and predicts it in future. This continuous adjustment of weights through backpropagation is the foundation of learning in deep neural networks and makes the model's answers more accurate and contextually appropriate as training progresses.
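The "The sky is …" example can be sketched as a tiny training loop. This is a strongly simplified illustration: it adjusts three scores directly with plain gradient descent instead of propagating the error through many layers, and the starting scores and learning rate are made-up values.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["vast", "beautiful", "blue"]
scores = [1.0, 0.8, 0.2]  # initial scores (made up): the model prefers "vast"
target = words.index("blue")
learning_rate = 0.5

before = softmax(scores)[target]
for step in range(50):
    probs = softmax(scores)
    # Gradient of the cross-entropy loss with respect to each score:
    # predicted probability minus 1 for the correct word, minus 0 otherwise.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Gradient descent: move each score against its gradient.
    scores = [s - learning_rate * g for s, g in zip(scores, grads)]
after = softmax(scores)[target]

print(f"P(blue) before: {before:.2f}, after: {after:.2f}")
```

With each step the probability of "blue" rises while the probabilities of the other words fall, which is the effect described above.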

Transfer learning

Transfer learning is a general term that describes how an already pretrained model is adapted for a new, often specialised task. The language structures and patterns learned during pretraining are used to transfer the model to a new domain or application with comparatively little additional data. An example would be fine-tuning a general language model on legal texts to make it more usable for legal documents. Transfer learning can be achieved, for example, through post-training. As a term, transfer learning is broader than post-training, since it also covers other special learning approaches such as fine-tuning, feature-based transfer learning, adapters, LoRA (Low-Rank Adaptation) and so on.
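Why approaches such as LoRA need comparatively little additional training can be shown with simple arithmetic. Instead of changing a full weight matrix, LoRA trains two small matrices whose product forms the update. The layer width and the rank below are illustrative assumptions.

```python
def full_update_params(d_in, d_out):
    """Number of weights changed when a whole d_out x d_in matrix is fine-tuned."""
    return d_in * d_out


def lora_update_params(d_in, d_out, rank):
    """LoRA trains two small matrices, A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


d = 4096   # illustrative layer width (assumption)
rank = 8   # illustrative LoRA rank (assumption)
full = full_update_params(d, d)
lora = lora_update_params(d, d, rank)
print(full, lora, f"{lora / full:.2%}")
```

For this layer, the low-rank update trains well under one percent of the weights that full fine-tuning would touch.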

Post-training

Post-training is a form of further development of an already pretrained language model, after it has first learned on general text data. While pretraining teaches the model general language patterns, grammar and contextual relationships from large amounts of text, post-training uses additional, specialised data to optimise the model for particular use cases. This adjustment is comparable to a cook who, after completing their training, can already cook but later specialises in a particular cuisine. Industry-specific texts, documentation or company-owned data are often used for post-training to tailor the model to specific fields such as medicine, law or engineering. The result is a language model that is much better suited to individual needs and therefore answers more precisely, more relevantly and more convincingly.

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF), a particular form of post-training, is an important complement to the pretraining of a Large Language Model (LLM), because it helps the model not only learn language patterns but also produce useful, safe and desirable answers. Pretraining trains the model on a large amount of text and enables it to understand and generate language, but it provides no explicit incentive for the quality and appropriateness of the answers. This is where RLHF comes in: through human feedback the model learns which answers should be preferred. The model generates several possible answers, which human annotators then rate or rank. These ratings are used to train a reward model that scores answers the way the annotators would. In a reinforcement learning process, the LLM is then optimised against this reward model so that preferred answers are produced more often. An example would be improving the politeness, relevance or comprehensibility of responses in ways that pure pretraining did not adequately cover.
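How human preferences become a training signal can be sketched with a pairwise loss that is commonly used for reward models. This is one widespread formulation, shown with made-up reward values; it is not confirmed here as the exact method of any particular provider.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss often used for reward models: -log(sigmoid(difference)).

    The loss is small when the answer preferred by the human annotator
    receives the higher reward, and large when the order is reversed.
    """
    difference = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-difference)))


# The reward model agrees with the annotator: preferred answer scores higher.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.5)
# The reward model disagrees: preferred answer scores lower.
disagrees = preference_loss(reward_chosen=0.5, reward_rejected=2.0)
print(round(agrees, 3), round(disagrees, 3))
```

Minimising this loss over many rated answer pairs pushes the reward model towards the annotators' preferences.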

The difference between post-training and RLHF lies in the method of adjustment. Post-training is a more classic fine-tuning step after pretraining, in which the model is trained with additional specific data — for example, medical specialist texts for a specialised model. This method adjusts the weights of the neural network through further supervised learning. RLHF, by contrast, uses human preference judgements in an iterative learning process, so the model is continuously improved through an additional reward system. Whereas post-training is more of a one-off extension of the knowledge base or an adaptation to a specific domain, RLHF represents a dynamic optimisation of output quality that makes the model more human-like and more useful.

Hallucinations

A central phenomenon when using LLMs is that they sometimes invent false facts, so-called "hallucinations". This is directly connected to how they work: LLMs don't generate text from a verified knowledge store but, as described, by computing probability distributions over the next word. At each step the model picks a statistically likely token given the previous tokens, without checking whether the content is factually correct. This missing anchor in an external factual base (a lack of grounding) means LLMs can produce statements that are linguistically convincing but factually wrong. A model may, for example, formulate a plausible answer to a question on history, medicine or technology that doesn't actually rest on a reliable source but has merely been reconstructed from patterns in the training data. Especially with open questions without a clear point of reference, or with contradictory language patterns, models tend to "invent" content to fill in the gaps. To address this issue, approaches such as Retrieval-Augmented Generation (RAG) are increasingly being used, in which the model can deliberately access external, verified knowledge sources during answer generation.
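A toy example makes the mechanism visible. The table of next-word probabilities below is invented for illustration: it assumes the wrong continuation appeared more often in the training text than the correct one. The generation loop only follows these probabilities and contains no step that checks facts.

```python
import random

# Toy next-word probabilities (made-up numbers, not from a real model).
next_word_probs = {
    "The": {"capital": 1.0},
    "capital": {"of": 1.0},
    "of": {"Australia": 1.0},
    "Australia": {"is": 1.0},
    "is": {"Sydney": 0.6, "Canberra": 0.4},
}


def generate(start, steps, rng):
    """Generate text word by word by sampling from the next-word probabilities."""
    words = [start]
    for _ in range(steps):
        options = next_word_probs.get(words[-1])
        if options is None:  # no known continuation: stop generating
            break
        candidates = list(options)
        weights = [options[word] for word in candidates]
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


rng = random.Random(0)
sentence = generate("The", steps=5, rng=rng)
print(sentence)
```

Either ending produces a fluent sentence, but only one of them is true; the sampler cannot tell the difference.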

Providers of GPTs

So far, the discussion has mostly referred to OpenAI and ChatGPT, but there are now many companies and projects that offer generative language models (GPTs) or Large Language Models (LLMs) and keep developing them. Leading commercial providers include, as mentioned, OpenAI with ChatGPT, Microsoft as part of the Azure ecosystem, Anthropic with Claude, Google with Gemini, xAI with Grok, and other companies such as Cohere and AI21 Labs. These providers generally pursue a commercial approach in which the models are made available through cloud platforms and offered for a usage fee. The following gives an overview of some of the best-known providers, followed by a closing section on open-source LLMs and their advantages.

Commercial providers

OpenAI

OpenAI is regarded as a pioneer in the field of generative AI models, and with GPT-3, GPT-3.5, GPT-4, GPT-4o and the more recent models GPT-4.5 and o3 (including o3-pro) has been a major driver of the spread of neural language models. ChatGPT, based on these models, is a conversational system optimised for interactive dialogue, which has gained huge visibility through its high text quality. With ChatGPT Plus and ChatGPT Enterprise, OpenAI also offers paid plans aimed at private users and businesses.

Microsoft

Microsoft works closely with OpenAI and integrates their models into its own Azure cloud platform. This gives businesses direct access — with clearly defined service level agreements — to the latest GPT technologies, such as GPT-4o, and lets them embed them seamlessly into their existing cloud infrastructure. On top of that, Microsoft has developed Copilot, an AI assistant integrated into products such as Microsoft 365, GitHub, Edge and into developer tools, currently based on OpenAI's models.

Anthropic

Anthropic, a relatively young AI research company, offers a capable chat and language model with Claude. The latest models, Claude Opus 4 and Claude Sonnet 4, are regarded as particularly advanced in terms of text understanding and conversational ability, and compete directly with the models from OpenAI and Microsoft.

Google

Google has consolidated its AI language model efforts into Gemini, a multimodal model that succeeds PaLM and the conversational system Bard. Gemini is integrated into a range of Google products and covers a wide spectrum of applications, from search to creative tasks.

xAI and Grok

The company xAI, founded by Elon Musk, has developed a powerful Large Language Model (LLM) with Grok, which also stands out for its ability to carry on natural and human-like conversations. Grok is designed to answer complex questions precisely and often delivers humorous and creative responses. A standout feature of Grok is its exclusive access to data from X (formerly Twitter). This unique data source allows the model to incorporate real-time information and current trends directly from one of the largest social platforms. This gives Grok a clear advantage in processing news of the day and opinions, which sets it apart from other models here.

DeepSeek and DeepSeek R1

DeepSeek is an up-and-coming company in the field of generative AI from China, and with DeepSeek R1 has developed a specialised Large Language Model focused on scientific research, data analysis and technical documentation. Like other models, DeepSeek R1 stands out for its ability to deliver highly precise and context-sensitive answers in demanding subject areas. What's particularly noteworthy, however, is that DeepSeek has managed to develop and train a state-of-the-art model with comparatively limited resources. This was achieved through innovative approaches in machine learning and model optimisation that enable more efficient use of compute and data. DeepSeek R1 is offered both as a cloud service and as an open-source variant, allowing flexible use and customisation.

Alibaba and Qwen

Alibaba's Qwen is a family of generative AI models with multimodal capabilities that can process text, images and audio. Models such as Qwen-7B, Qwen-VL and Qwen2.5-VL are open source and accessible to developers and businesses, though the latter come with restrictions for organisations with more than 100 million users. Qwen2.5-VL analyses text and images, understands visual content and can control devices. According to Alibaba, Qwen2.5-Max outperforms models such as DeepSeek-V3 and ChatGPT in mathematics and programming. The open-source approach encourages further development by the community and strengthens Alibaba's position in the Chinese and global AI competition.

Cohere

Cohere focuses on language models for enterprise use and offers specialised models for tasks such as text generation, classification and semantic search. Their latest model, Command-R, is designed to handle complex queries and be integrated into corporate workflows.

AI21 Labs

AI21 Labs has developed its own LLMs with Jurassic-2 and subsequent models, offering flexible pricing models and integration options for large-scale text processing projects. Their focus is on providing models that can be adapted to specific industry requirements.

Open-source LLMs

Alongside these commercial providers, a strong open-source ecosystem for large language models has emerged in recent years. Projects such as GPT-Neo, GPT-NeoX and BLOOM were launched by research institutions, non-profit organisations and developer communities, and offer freely available models with different sizes and capabilities. Meta has made a significant contribution with LLaMA 3, by making the model accessible to the research community. Other notable open-source projects are Mistral and Grok (2), which offer powerful models that can often compete with commercial alternatives. Chinese providers such as DeepSeek and Alibaba (Qwen models) are also active here.

Advantages of open-source LLMs

The fundamental advantage of open-source LLMs is their high degree of adaptability. Because the source code — and in many cases the model weights as well — are publicly available, developers and organisations can modify the models as needed, fine-tune them on specific datasets or install them locally. This makes complete control over data and privacy possible, without relying on cloud services. Organisations with sufficient compute infrastructure can also scale the models themselves, from local PC hardware up to dedicated AI servers. Open development also encourages scientific exchange and improves transparency around training, architecture and ethical aspects of the models.

Summary

| Criterion | Commercial models | Open-source models |
| --- | --- | --- |
| Access | Via cloud services, usually paid | Freely accessible, often also runnable locally |
| Adaptability | Limited (black box) | Fully adaptable (source code and weights available) |
| Data protection | Data processing usually external (cloud; watch out for GDPR) | Local processing possible, full data sovereignty |
| Support & SLA | Commercial support, service level agreements available | Community support, no guaranteed service |
| Compute requirements | No own server needed, use via the provider platform | Own infrastructure required (depending on the model) |
| Further development | Centrally driven by the company | Community- or consortium-driven development |
| Transparency | Limited insight into training data and methods | High transparency through open publications |
| Cost control | Recurring usage fees (e.g. token costs) | One-off costs (hardware, electricity), no vendor lock-in |
| Example providers | OpenAI, Microsoft, Google, Anthropic | Meta (LLaMA), Mistral, BLOOM, GPT-NeoX, DeepSeek, Qwen, etc. |

Current and future developments

The following sections look at a few current developments in the LLM space that are already playing a role, or soon will, and that are finding their way into new GPT versions.

Integration of "chain-of-thought" reasoning

The integration of "chain-of-thought" (CoT) reasoning is often seen as a decisive step towards making GPT models more intelligent and more comprehensible. CoT allows the model to lay out its line of reasoning in a structured way, which leads to more precise and better-founded answers. Everything points to GPT-5, whose release is still pending, building out this technique further. The o1 and o3 models from OpenAI, introduced from September 2024 onwards, already use CoT automatically.

GPT-4.5, released in February 2025, also brings improved reasoning capabilities, although no specific details are known about its CoT integration. This suggests that future models will continue to optimise the technology to handle even more complex tasks and offer a better user experience.

Larger context windows

The trend towards ever larger context windows continues unabated. GPT-4 Turbo, released in November 2023, already processes an impressive 128,000 tokens — roughly 96,000 words. This makes it possible to analyse entire code bases or process long documents in a single pass without having to laboriously split them into smaller chunks.
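The relationship between words and tokens used above (roughly 0.75 words per token) can be turned into a quick check of whether a document is likely to fit into a context window. The ratio is only a rule of thumb for English text; actual token counts depend on the tokenizer and the language, and the function names are illustrative.

```python
def estimate_tokens(text, words_per_token=0.75):
    """Estimate the token count from the word count (rule of thumb only)."""
    return round(len(text.split()) / words_per_token)


def fits_context(text, context_window=128_000):
    """Check whether a text is likely to fit into the context window."""
    return estimate_tokens(text) <= context_window


document = "word " * 96_000   # stand-in for a 96,000-word document
print(estimate_tokens(document), fits_context(document))
```

For an exact figure, the text would have to be run through the tokenizer of the specific model.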

Although OpenAI has not published an official figure for the context window size of GPT-4.5, much suggests that it has grown again. It's fair to assume GPT-5 will continue down this path consistently, enabling even more efficient processing of large data volumes — a real game-changer for developers, businesses, and anyone working with complex content.

The current leader in this discipline is Google's Gemini 2.5 model with 1 million input tokens (1,048,576 tokens). That corresponds to about 1,500 pages of text or 30,000 lines of code.

AI agents and Retrieval-Augmented Generation (RAG)

Another important step forward in the development of AI systems is the use of intelligent agents that can not only understand tasks but actively carry them out. While models such as GPT-4o already enable impressive interactions, the next generation goes one step further: future AI agents are intended to be able to research independently, analyse documents or automate complex workflows.

A key role here is played by the concept of Retrieval-Augmented Generation (RAG). Instead of relying solely on the static knowledge inside a model, RAG combines AI-generated answers with dynamically retrieved information from external data sources — for example, from a company's own knowledge bases, cloud services or internal documentation. This means businesses will, in future, be able to deploy AI agents that access their specific data without that data having to be fed directly into the model up front.

In practice, a RAG-based agent could, for example, answer a company-specific support request by searching internal manuals, extracting the relevant passages and producing a precise answer from them. Software development also offers interesting use cases — for example, the automatic analysis of code repositories or the targeted search for bugs in complex systems.
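The support scenario can be sketched as follows: retrieve the most relevant passage, then place it in front of the question. This is a minimal sketch under simplifying assumptions. Real RAG systems typically use embeddings and a vector search instead of the plain word overlap used here, and the manual entries are hypothetical.

```python
def tokenize(text):
    """Lower-case a text and split it into a set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def retrieve(question, documents, top_k=1):
    """Return the documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_words & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]


def build_rag_prompt(question, documents):
    """Put the retrieved passages in front of the question."""
    passages = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return f"Answer using only these passages:\n{passages}\n\nQuestion: {question}"


# Hypothetical internal manual entries.
manual = [
    "To reset the router hold the reset button for ten seconds.",
    "Invoices are sent by email on the first day of each month.",
    "The warranty period for all devices is two years.",
]
prompt = build_rag_prompt("How do I reset the router?", manual)
print(prompt)
```

The resulting prompt would then be sent to the language model, which answers on the basis of the retrieved passage rather than its training data alone.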

Extended operator functions

With the introduction of OpenAI's "Operator" in January 2025, a major step towards autonomous AI agents was taken. "Operator" is based on the new "Computer-Using Agent" (CUA), a variant of GPT-4o, and brings extended reasoning capabilities. In practice, this means it can book trips, order groceries or even create memes on its own — directly through a web browser.

At the moment, "Operator" is exclusive to subscribers of the $200-per-month ChatGPT Pro plan, but OpenAI plans to broaden access in future. In the long term, this technology could be integrated into GPT models to automate everyday tasks efficiently and take the interaction with computers to a new level.

MCP servers

In this context, another forward-looking approach to using LLMs in businesses is the use of so-called MCP servers (Model Context Protocol). This is an open, lightweight interface that lets you connect internal data sources, APIs or workflows to LLMs in a standardised way, regardless of whether the model is run locally or used via a cloud platform. The MCP server acts as an intermediary between the language model and company-specific resources. An LLM can, for example, accept a support request, fetch relevant data from the internal knowledge base via the MCP server, or even kick off an ERP workflow. Requests follow a uniform message format (JSON-RPC 2.0) and travel over a standard transport such as STDIO or HTTP, optionally with Server-Sent Events (SSE) for streaming, which enables a secure and modularly extensible architecture. For businesses, this means they keep control of their data, can keep using existing systems and at the same time benefit from modern language AI without having to change the model itself. Separating model and data access also improves transparency and reduces regulatory risk when using AI in a professional setting. See also our article on the testing options for local MCP servers.
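To give an impression of what travels between the model side and an MCP server, here is a sketch of a tool call. MCP messages are encoded as JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The tool name `search_knowledge_base` and its arguments are hypothetical and would be defined by the respective server.

```python
import json

# A JSON-RPC 2.0 request asking an MCP server to run a tool.
# "search_knowledge_base" and its arguments are hypothetical examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_knowledge_base",
        "arguments": {"query": "reset router"},
    },
}

# Serialise the request as it would be written to the transport.
message = json.dumps(request)
print(message)

# The receiving side parses the text back into a structure.
# A reply carries the same id, so it can be matched to its request.
decoded = json.loads(message)
```

Because every tool is addressed through the same message shape, new data sources can be added on the server without changing the model.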

Real-time processing

The OpenAI Realtime API enables real-time language processing via a WebSocket-based interface that supports both text and audio input and output. The extremely low latency makes for fluid and immediate interaction with AI-based applications, which is especially valuable for voice assistants, interactive dialogue systems and live translation. Unlike conventional API calls, which are processed sequentially, this technology enables continuous, bidirectional communication between user and AI, making conversation flows feel more natural and spontaneous.

The Realtime API plays a key role in the development of accessible applications, in particular. People with visual or writing impairments can interact with digital systems more easily thanks to seamless speech processing — by using voice commands or having text content read out in real time. It also opens up new possibilities in customer service by powering dynamic, AI-driven chatbots and phone assistants that can respond to customer queries instantly.

Technically, the Realtime API combines several advanced components of language AI. Multimodality allows a flexible combination of text and speech processing, and with it applications that can dynamically switch between spoken and written communication. On top of that, support for function calls improves integration with external services, so voice assistants can not only answer questions but also carry out actions directly — booking a restaurant table, say, or setting a calendar reminder.

Progress towards artificial general intelligence (AGI)

With o1, released in September 2024 and since succeeded by o3, OpenAI took another step towards artificial general intelligence (AGI). The model impresses with advanced logical reasoning and outstanding performance in mathematics, programming and the natural sciences. Initially available only as preview models ("o1-preview" and "o1-mini"), o1 was officially launched on 5 December 2024, together with a more powerful Pro version. Other providers now also offer powerful reasoning models, such as Anthropic with Claude Opus 4 and Google with Gemini 2.5 (Pro).

Particularly interesting: these models are already integrated into applications such as GitHub Copilot, Cursor, Windsurf and others, which underlines their practical viability.

Ethics, data protection and legal aspects

These are genuinely complex topics that deserve a thorough treatment; in this article, which focuses on how the technology works, they can only be touched on briefly. With the growing use of LLMs, a range of ethical and legal questions is being discussed alongside the technological and application-related aspects. A central topic is dealing with bias: since LLMs are trained on extensive text corpora from the internet, they inevitably also take on whatever societal prejudice, discrimination or stereotypical depictions those corpora contain. These can subtly seep into the answers, for instance in gender-specific role attributions or in the handling of sensitive topics such as origin, religion or political stance. Judging what counts as bias, and whether and how to correct it, is itself subjective and difficult. Providers also differ clearly in how they handle this: whereas a company like Anthropic (Claude) tries to be as "correct" as possible, the focus at xAI (Grok) is more on an "unfiltered" presentation of results without post-hoc correction.

Another central topic is data protection, since training data may also contain personal or sensitive information that can unintentionally surface again in the generated outputs. There is also increasing debate over whether copyright is being infringed when LLMs are trained on copyrighted texts and then generate content that closely resembles the original in style or substance. This is relevant not only for text generation but also for generating images and sounds ("write/draw/compose in the style of …").

These challenges make it clear that using LLMs brings not just a technical but also a societal and legal responsibility. In the European Union, the AI Act, which entered into force in August 2024 and whose obligations apply in stages, provides a comprehensive legal framework intended to regulate the development and use of AI systems. It provides for, among other things, transparency obligations, risk assessments and clear requirements for data quality, particularly for so-called high-risk applications.

There are clear differences in approach between the EU and the US, which were also evident at the AI summit in Paris in early 2025. In the US, the focus is on the opportunities of AI development, which should not be hampered by over-regulation (as set out in J.D. Vance's speech), while the EU puts more weight on regulation. Which approach is "better" cannot yet be conclusively judged.

For developers, providers and users of LLMs, especially in the EU, this means: handling these technologies in a legally compliant and responsible way will in future not just be expected, but legally required. Anyone wanting to deploy LLMs therefore has to understand not just how they work, but also reflect on their impact in a societal context and take corresponding precautions.

Outlook and summary

This text is an attempt to give a brief, non-technical overview of the development, functioning and possible applications of Large Language Models (LLMs). The aim above all is to show how language processing has evolved over the past decades from rule-based approaches and statistical language models to the Transformer architecture.

A central outcome of this development is the emergence of GPT models as a particular type of LLM that consists of a decoder only and generates text autoregressively. Through training with billions of parameters and the use of modern hardware platforms, GPTs have reached a remarkable level of language competence — whether in understanding texts, answering questions or generating creative content.

Many innovations such as chain-of-thought reasoning, AI agents, the integration of external knowledge sources (RAG), real-time speech processing and the path towards artificial general intelligence (AGI) make this clear: the development of LLMs is far from finished — it's in a dynamic state of progress.

That's precisely why it's important to stress that the field around LLMs and GPTs is subject to rapid change. New model variants, optimised training methods, regulatory frameworks and societal expectations continuously shift the state of the art. This document is therefore a snapshot — an entry point for understanding, but not a final account. Anyone working with LLMs or wanting to understand their effects has to be willing to stay informed about new developments regularly and to keep questioning and updating their existing knowledge.

Reading recommendations

For a deeper understanding of how LLMs work, I can warmly recommend "Hands-On Large Language Models: Language Understanding and Generation" by Jay Alammar and Maarten Grootendorst (published by O'Reilly).

The book uses insightful diagrams and many Python examples to introduce the topic.
