Late Chunking: Improving RAG Performance with Context-Aware Embeddings


Retrieval Augmented Generation (RAG) has become a cornerstone of modern AI applications, but its efficiency often hinges on how well we handle document chunking.

When implementing RAG, developers constantly have to balance chunk size: too small, and we lose essential context; too large, and we compromise the embedding model's ability to capture precise semantic meaning. This isn't just a theoretical problem; it directly affects how accurately your system can retrieve relevant information for user queries.

Late chunking offers a practical approach to address this limitation. By processing documents through long-context embedding models before splitting them into chunks, we can maintain broader context while preserving semantic accuracy. In this guide, we'll examine the technical implementation of late chunking, compare its performance with traditional approaches, and provide concrete examples of how it improves retrieval quality.

For engineers and technical teams working with RAG systems, understanding this technique can lead to measurable improvements in retrieval accuracy.

NOTE: Late chunking was first introduced by the amazing people at Jina AI. The code samples in this blog are also derived from their groundwork. And finally, we are going to use their outstanding jina-embeddings-v3 model for our examples. So shout-out to Jina!

A quick introduction on Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is one of the most promising applications of LLMs. If you want a thorough introduction, please read our detailed post here.

RAG has two main phases:

  1. The indexing phase

  2. The retrieval phase

During the indexing phase, the textual documents that serve as the knowledge base for finding relevant information are transformed into embeddings, which are high-dimensional vectors.

Embeddings are numeric representations of texts that roughly capture the semantic meaning of documents. The important thing is: texts with similar semantic meaning have similar embeddings. For example, the texts "My car" and "My automobile" have similar embeddings, whereas "My car" and "My cow" have rather different embeddings.
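To make this concrete, here is a minimal sketch that compares these three texts. It assumes the jina-embeddings-v3 model we'll use later in this post; any sentence embedding model would work similarly.

# Minimal sketch: comparing embedding similarity
# (assumes jina-embeddings-v3; any sentence embedding model works similarly)
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)

emb_car, emb_auto, emb_cow = model.encode(["My car", "My automobile", "My cow"])

cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim(emb_car, emb_auto))  # high similarity: same meaning
print(cos_sim(emb_car, emb_cow))   # noticeably lower similarity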

This property can be used for semantic search, where users (or an AI) use natural language to find semantically relevant documents. That's basically the main building block of RAG.

In the second phase, the retrieval phase, a user asks a question which should then be answered by an LLM. Using the embeddings created in phase 1, we can now search for documents which are semantically similar to the user's question (and therefore are most likely to answer it).

Finally, we send the user's question as well as the retrieved documents to an LLM, which then uses these docs to create a tailored answer.
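In code, the retrieval phase boils down to something like the following sketch. Note that embed, knowledge_base, and llm are hypothetical stand-ins for your embedding model, your vector store, and your LLM client - not a specific library API.

# Hedged sketch of the retrieval phase. `embed`, `knowledge_base` and `llm`
# are hypothetical placeholders, not a specific library API.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def answer(question, knowledge_base, embed, llm, top_k=3):
    # 1. Embed the user's question
    q_emb = embed(question)
    # 2. Rank the pre-computed document embeddings by similarity to the question
    ranked = sorted(
        knowledge_base,
        key=lambda doc: cosine_similarity(q_emb, doc["embedding"]),
        reverse=True,
    )
    context = "\n\n".join(doc["text"] for doc in ranked[:top_k])
    # 3. Let the LLM answer the question based on the retrieved documents
    prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(prompt)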

A typical, simplified RAG pipeline is shown below:

Typical RAG pipeline

The problem with chunking in RAG applications

Now a rather big issue arises when we look at how (and why) we need to chunk our source documents.

Let's start with the why. Why do we need to chunk our documents in the first place? Simple, three reasons:

  1. LLMs don't have unlimited context windows - meaning we need to limit the amount of data we send to the LLM. In an ideal world, we could simply send the user's question alongside our whole knowledge base to the LLM and wait for the answer. Due to context limitations, this is not possible.

  2. LLM context costs money. Even with LLM context windows getting bigger and bigger, each token we send to an LLM costs money. So we want to make sure that we limit the costs our operation incurs.

  3. LLMs have difficulty finding the correct information in large quantities of data. If we can pre-filter for relevant information, the LLM's answer will be better.

Now to the how: How does document chunking work? There are various methods, but they all boil down to the following: we divide the documents into chunks of roughly x words each. We can do this by simply counting words, or by splitting at semantically meaningful boundaries - like the end of a paragraph or chapter.
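As a rough illustration, here are two simple, hypothetical chunkers - a fixed-size word splitter and a paragraph splitter. Real pipelines often use more elaborate strategies (overlap, sentence boundaries, token counts).

# Two simple, hypothetical chunkers - illustrative only
def chunk_by_words(text: str, chunk_size: int = 200) -> list[str]:
    # Fixed-size chunks of `chunk_size` words each
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def chunk_by_paragraphs(text: str) -> list[str]:
    # Split at blank lines, i.e. at paragraph boundaries
    return [p.strip() for p in text.split("\n\n") if p.strip()]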

However, this introduces one significant issue: by dividing our document, we will almost certainly lose connected context. Let's look at an example. Say we have a document like:

"TensorFlow is one of the most popular open-source machine learning frameworks.

It was originally developed by researchers and engineers from Google Brain.

The framework provides a flexible ecosystem of tools, libraries, and community resources. These resources help developers build and deploy ML-powered applications efficiently.

The framework excels at numerical computation and large-scale machine learning. Its architecture allows for easy deployment of computation across various platforms."

Let's further assume we chunk the document after each paragraph (which in this case are mostly individual sentences, but bear with me; it will become clear that this problem persists regardless of where you chunk).

Let's say we ask a question like: "What framework was developed by Google Brain?" We'd most probably retrieve the second paragraph - however, we have completely lost the context of which framework was created by Google Brain. The word "It" refers to "TensorFlow", but this connection/context is lost when using paragraph-based chunking.

How does late chunking work?

Now that we know the problem, how can late chunking help to alleviate it?

As this is quite a complex topic, let's split it into two parts, a summary and a more detailed explanation:

Summary: Instead of first chunking our document into smaller parts and then creating embeddings for each of these parts, late chunking runs the full document through the model and first creates token-level embeddings (so, one embedding per token). These token-level embeddings are then combined into chunk-level embeddings (meaning, reduced to one embedding per chunk). As the initial embeddings were created on tokens of the whole text, each of them carries semantic meaning from the whole text, not just from the token or chunk it represents. When we later combine them into chunk embeddings, each chunk not only knows about itself, but also about the surrounding chunks.

(The chart below was kindly taken from Jina AI.)

Late chunking as per Jina AI

Detailed description: The avid reader might not be satisfied with the superficial explanation above, so let's dive into the details. How does late chunking work, exactly?

First, the text - the whole document, if feasible - is tokenized (meaning it is divided into individual tokens):

tokens = tokenizer(input_text, return_tensors='pt')

Second, these tokens are passed through the model:

model_output = model(**tokens)

Now, the important part is that the token embeddings are the model's last hidden state, i.e. the output of the last transformer layer. This means we can simply access the token embeddings via:

token_embeddings = model_output.last_hidden_state[0]

To put it differently: after passing the tokens through the model, we can access the individual token embeddings via the model's last_hidden_state. For the example text above, this gives us something like the following (note that this representation is simplified, as it shows one embedding array per word, whereas in reality there would be one embedding array per token):

[
 [0.1, 0.2, ...], # embedding for "TensorFlow"
 [0.3, 0.4, ...], # embedding for "is"
 [0.2, 0.1, ...], # embedding for "one"
 [0.4, 0.3, ...], # embedding for "of"
 [0.1, 0.1, ...], # embedding for "the"
 [0.2, 0.3, ...], # embedding for "most"
 [0.3, 0.2, ...], # embedding for "popular"
 ...
]

All we have to do now is divide (that is, chunk) this array of embeddings. For example, we can split the array based on the token text each embedding represents. If, say, we chunk by sentence, we take the embeddings of all tokens of the first sentence and compute their vector average. This gives us the final embedding for that chunk. Rinse and repeat for each and every sentence. The result is a list of chunk embeddings where each chunk embedding is derived from context-aware token embeddings, rather than from processing each chunk in isolation.
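As a minimal sketch of this pooling step (the token positions below are purely illustrative; the real sentence boundaries come from the tokenizer, as shown in the hands-on section later):

# Minimal sketch of the pooling step: average the token embeddings that belong
# to one chunk. The slice 0:12 is illustrative - real boundaries come from the tokenizer.
token_embeddings = model_output.last_hidden_state[0]        # shape: (num_tokens, hidden_dim)
first_chunk_embedding = token_embeddings[0:12].mean(dim=0)  # one vector for chunk 1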

The key difference from traditional chunking is that each token's embedding was created with awareness of the full context, so when "It" was embedded, the model already knew it referred to "TensorFlow" because it processed the whole text at once.

Why are these new chunks context-aware?

You might wonder why these newly created chunk embeddings are more context-aware than the ones from traditional chunking. The reason is the attention mechanism of the transformer architecture.

When the model processes "It", it doesn't just look at that token in isolation. Instead, through self-attention layers, each token can "attend to" (or "look at") all other tokens in the input text.

Here's a simplified explanation of how self-attention works:

  1. For each token, the model calculates attention scores with every other token in the sequence. These scores represent how much attention should be paid to each other token when creating the final representation.

  2. For example, when processing the word "It", the attention scores for the other words might look like this (illustrative values):

    TensorFlow: 0.7 (high attention score because it's the referent)
    is: 0.1
    one: 0.1
    ...
    framework: 0.4
    ...

  3. The model learns these attention patterns during training. It learns that pronouns like "It" should pay strong attention to the nouns they refer to (see the toy sketch below).
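Here's a toy, self-contained sketch of scaled dot-product attention - purely illustrative with random weights, not the actual internals of jina-embeddings-v3. It shows how each token's contextualized representation becomes a weighted mix of all tokens in the sequence:

# Toy sketch of scaled dot-product attention (illustrative, random weights)
import torch
import torch.nn.functional as F

hidden_dim = 4
tokens = torch.randn(7, hidden_dim)           # e.g. "TensorFlow is one ... It ..."
W_q, W_k, W_v = (torch.randn(hidden_dim, hidden_dim) for _ in range(3))

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
scores = Q @ K.T / hidden_dim ** 0.5          # how much each token attends to every other token
weights = F.softmax(scores, dim=-1)           # each row sums to 1
contextualized = weights @ V                  # e.g. the row for "It" mixes in information from "TensorFlow"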

In code, this happens inside the transformer model when processing the full input:

inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)

The key difference from traditional chunking:

Traditional:

chunk1 = "TensorFlow is the most popular framework..."
chunk2 = "It was originally developed..."
# When processing chunk2, "It" has no access to "TensorFlow"

Late Chunking:

full_text = "TensorFlow is the most popular framework [...]. It was originally developed [...]"
inputs = tokenizer(full_text, return_tensors='pt')
# When processing, "It" can attend to "TensorFlow" through self-attention
model_output = model(**inputs)
# Only after attention is applied do we average the token embeddings into chunks
embeddings = late_chunking(model_output, [span_annotations])[0]


Hands-on: How to implement late chunking

Let's create an end-to-end example for our sample text above. First, install the dependencies, then load the model and tokenizer:

pip install transformers torch einops

from transformers import AutoModel
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)

input_text = "TensorFlow is one of the most popular open-source machine learning frameworks. It was originally developed by researchers and engineers from Google Brain. The framework provides a flexible ecosystem of tools, libraries, and community resources. These resources help developers build and deploy ML-powered applications efficiently. The framework excels at numerical computation and large-scale machine learning. Its architecture allows for easy deployment of computation across various platforms."

Now let's split the text into sentence chunks. We need this for the traditional method, and we also need it to calculate the token positions at which we want to chunk in our new method. (Remember, we chunk our array of embeddings after we have created the embeddings for the whole text. We can do this by remembering the start and end token positions of our sentences.)

def chunk_by_sentences(input_text: str, tokenizer: callable):
    inputs = tokenizer(input_text, return_tensors='pt', return_offsets_mapping=True)
    punctuation_mark_id = tokenizer.convert_tokens_to_ids('.')
    sep_id = tokenizer.convert_tokens_to_ids('[SEP]')
    token_offsets = inputs['offset_mapping'][0]
    token_ids = inputs['input_ids'][0]
    chunk_positions = [
        (i, int(start + 1))
        for i, (token_id, (start, end)) in enumerate(zip(token_ids, token_offsets))
        if token_id == punctuation_mark_id
        and (
            token_offsets[i + 1][0] - token_offsets[i][1] > 0
            or token_ids[i + 1] == sep_id
        )
    ]
    chunks = [
        input_text[x[1] : y[1]]
        for x, y in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    # The span_annotations are the start and end positions of our
    # text chunks - in terms of tokens. So the first and last position
    # of the tokens of our individual chunks.
    span_annotations = [
        (x[0], y[0]) for (x, y) in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    return chunks, span_annotations

chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)

Last but not least, we can create our late chunking implementation:

def late_chunking(
    model_output, span_annotation: list, max_length=None
):
    token_embeddings = model_output.last_hidden_state
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if (
            max_length is not None
        ):  # remove annotations which go beyond the max length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)

    return outputs

That's it, now we can use our methods to create embeddings as follows:

# Traditional chunking method
embeddings_traditional_chunking = model.encode(chunks)

# Late chunking method
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = late_chunking(model_output, [span_annotations])[0]

To compare these two methods, let's create a small 'benchmark' where we output the similarity between the word "TensorFlow" with all the chunks:

import numpy as np

# Method to calculate the cosine similarity
cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Embed our search term
embedding = model.encode('TensorFlow')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(chunk)
    print(' - late chunking:', cos_sim(embedding, new_embedding))
    print(' - traditional:', cos_sim(embedding, trad_embeddings))

These are the results in our case:

TensorFlow is one of the most popular open-source machine learning frameworks.
 - late chunking: 0.8232424
 - traditional: 0.88007504
 It was originally developed by researchers and engineers from Google Brain.
 - late chunking: 0.80999833
 - traditional: 0.535794
 The framework provides a flexible ecosystem of tools, libraries, and community resources.
 - late chunking: 0.7612619
 - traditional: 0.44915122
 These resources help developers build and deploy ML-powered applications efficiently.
 - late chunking: 0.77815264
 - traditional: 0.48452184
 The framework excels at numerical computation and large-scale machine learning.
 - late chunking: 0.8006218
 - traditional: 0.6145765

If you look at the similarity scores (the higher, the better), the results are mind-blowing. While the similarity score drops a bit for the first chunk, it is enormously better for all the other chunks.

It's to be expected that the score drops a bit for the first chunk, as the 'traditional' method is already very good at finding chunks where the search term directly represents the chunk.

For all the other chunks, it's clear that with traditional chunking they are no longer 'semantically' connected to the first chunk - late chunking preserves this connection, which, depending on your use case, can and will boost your RAG search performance.
