Guillaume Laforge

Retrieval Augmented Generation

Advanced RAG — Understanding Reciprocal Rank Fusion in Hybrid Search

Today, let’s come back to one of my favorite generative AI topics: Retrieval Augmented Generation, or RAG for short.

In RAG, the quality of your generation (when an LLM crafts its answer based on search results) is only as good as your retrieval (the search results actually returned).

While vector search (semantic) and keyword search (BM25) each have their strengths, combining them often yields the best results. That’s what we often call Hybrid Search: combining two search techniques, or the results of several searches run with slight variations.
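Since the post is about Reciprocal Rank Fusion, here is the gist of how RRF merges ranked lists: each document earns a score of 1/(k + rank) in every list it appears in, and the scores add up. A minimal Java sketch (the commonly used constant k = 60 comes from the original RRF paper; the document IDs are made up for illustration):

```java
import java.util.*;

public class ReciprocalRankFusion {

    // Fuse several ranked lists of document IDs into one ranking.
    // Each document earns 1 / (k + rank) per list; scores are summed.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        List<String> vectorSearch  = List.of("doc3", "doc1", "doc7");
        List<String> keywordSearch = List.of("doc1", "doc5", "doc3");
        // doc1 and doc3 rank high in both lists, so they bubble up to the top.
        System.out.println(fuse(List.of(vectorSearch, keywordSearch), 60));
    }
}
```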

Read more...

In-browser semantic search with EmbeddingGemma

A few days ago, Google DeepMind released a new embedding model based on the Gemma open-weight model: EmbeddingGemma. With 308 million parameters, this model is small enough to run on edge devices like your phone, tablet, or computer.

Embedding models are the cornerstone of Retrieval Augmented Generation (RAG) systems, and are generally what powers semantic search solutions. Being able to run an embedding model locally means you don’t need to rely on a server (no need to send your data over the internet), which is great for privacy. And cost is reduced as well, because you don’t have to pay for a remote, hosted embedding model.
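At its core, the semantic search such a model powers is just nearest-neighbor ranking in vector space. A minimal sketch of that ranking step; the vectors would come from EmbeddingGemma or any other embedding model, and the tiny 3-dimensional vectors below are made up for illustration:

```java
import java.util.*;

public class SemanticSearch {

    // Cosine similarity: close to 1.0 means same direction, 0.0 means unrelated.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Return document indices, most similar to the query vector first.
    static List<Integer> rank(float[] query, List<float[]> docVectors) {
        Integer[] indices = new Integer[docVectors.size()];
        for (int i = 0; i < indices.length; i++) indices[i] = i;
        Arrays.sort(indices, Comparator.comparingDouble(i -> -cosine(query, docVectors.get(i))));
        return List.of(indices);
    }

    public static void main(String[] args) {
        // Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
        float[] query = {0.9f, 0.1f, 0.0f};
        List<float[]> docs = List.of(
                new float[]{0.8f, 0.2f, 0.1f},   // close to the query
                new float[]{0.0f, 0.1f, 0.9f});  // far from the query
        System.out.println(rank(query, docs));   // prints [0, 1]
    }
}
```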

Read more...

AI Agents, the New Frontier for LLMs

I recently gave a talk titled “AI Agents, the New Frontier for LLMs”. The session explored how we can move beyond simple request-response interactions with Large Language Models to build more sophisticated and autonomous systems.

If you’re already familiar with LLMs and Retrieval Augmented Generation (RAG), the next logical step is to understand and build AI agents.

What makes a system “agentic”?

An agent is more than just a clever prompt. It’s a system that uses an LLM as its core reasoning engine to operate autonomously. The key characteristics that make a system “agentic” include:

Read more...
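To make the “agentic” idea above concrete: at its heart, an agent is a loop in which the LLM repeatedly decides whether to call a tool or to deliver a final answer, feeding each tool result back into the conversation. A minimal sketch of such a loop; the Llm and Tool interfaces and the CALL/ANSWER protocol are hypothetical stand-ins, not any particular framework’s API:

```java
import java.util.*;

public class AgentLoop {

    // Hypothetical stand-ins: plug in your own model and tool implementations.
    interface Llm { String nextStep(List<String> conversation); }
    interface Tool { String run(String input); }

    static String runAgent(Llm llm, Map<String, Tool> tools, String task) {
        List<String> conversation = new ArrayList<>(List.of("Task: " + task));
        for (int step = 0; step < 10; step++) {  // cap the number of iterations
            // The LLM replies either "ANSWER: ..." or "CALL <tool>: <input>".
            String decision = llm.nextStep(conversation);
            if (decision.startsWith("ANSWER:")) {
                return decision.substring("ANSWER:".length()).trim();
            }
            String[] parts = decision.substring("CALL ".length()).split(":", 2);
            String observation = tools.get(parts[0].trim()).run(parts[1].trim());
            // Feed the tool's result back so the next decision can build on it.
            conversation.add(decision);
            conversation.add("Observation: " + observation);
        }
        return "Gave up after too many steps.";
    }
}
```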

Advanced RAG — Using Gemini and long context for indexing rich documents (PDF, HTML...)

A very common question I get when presenting advanced RAG (Retrieval Augmented Generation) techniques is how best to index and search rich documents like PDFs (or web pages) that contain both text and rich elements, like pictures or diagrams.

Another very frequent question people ask me is about RAG versus long context windows. Indeed, models with long context windows usually have a more global understanding of a document, and of each excerpt within its overall context. But of course, you can’t feed all of your users’ or customers’ documents into one single augmented prompt. RAG also has other advantages, like much lower latency, and it is generally cheaper.
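One way to reconcile the two, which the title hints at: use the long-context multimodal model at indexing time, turning each page (text plus images or diagrams) into a self-contained textual description that is then embedded and stored. A sketch of that pipeline, assuming hypothetical MultimodalModel and Embedder interfaces (not the actual Gemini or LangChain4j APIs):

```java
import java.util.*;

public class RichDocumentIndexer {

    // Hypothetical stand-ins: plug in your own multimodal and embedding models.
    interface MultimodalModel { String describePage(byte[] pageImage, String wholeDocumentText); }
    interface Embedder { float[] embed(String text); }

    record IndexedPage(float[] vector, String description, int pageNumber) {}

    static List<IndexedPage> index(MultimodalModel model, Embedder embedder,
                                   List<byte[]> pageImages, String wholeDocumentText) {
        List<IndexedPage> index = new ArrayList<>();
        for (int page = 0; page < pageImages.size(); page++) {
            // The long context lets the model see the whole document while
            // describing one page, so each description stays globally coherent.
            String description = model.describePage(pageImages.get(page), wholeDocumentText);
            index.add(new IndexedPage(embedder.embed(description), description, page + 1));
        }
        return index;
    }
}
```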

Read more...

Advanced RAG — Hypothetical Question Embedding

In the first article of this Advanced RAG series, I talked about an approach I called sentence window retrieval, where we calculate vector embeddings per sentence, but the chunk of text returned (and added to the LLM’s context) also contains the surrounding sentences, to give more context to that embedded sentence. A single sentence tends to yield better vector similarity than embedding the whole surrounding context. It is one of the techniques I cover in my talk on advanced RAG techniques.
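As for the titular technique: instead of (or in addition to) embedding the chunk itself, you ask an LLM to generate the hypothetical questions that the chunk answers, embed those questions, and at query time match the user’s question against them, since a question tends to be semantically closer to another question than to a prose passage. A minimal sketch, with hypothetical Llm and Embedder interfaces standing in for your model bindings:

```java
import java.util.*;

public class HypotheticalQuestionIndex {

    // Hypothetical stand-ins: plug in your own LLM and embedding model.
    interface Llm { List<String> generateQuestions(String chunk); }
    interface Embedder { float[] embed(String text); }

    record Entry(float[] questionVector, String chunk) {}

    private final List<Entry> index = new ArrayList<>();
    private final Embedder embedder;

    // Indexing: embed LLM-generated questions, each pointing back to its chunk.
    HypotheticalQuestionIndex(Llm llm, Embedder embedder, List<String> chunks) {
        this.embedder = embedder;
        for (String chunk : chunks) {
            for (String question : llm.generateQuestions(chunk)) {
                index.add(new Entry(embedder.embed(question), chunk));
            }
        }
    }

    // Retrieval: return the chunk behind the question most similar to the user's.
    String retrieve(String userQuestion) {
        float[] q = embedder.embed(userQuestion);
        return index.stream()
                .max(Comparator.comparingDouble(e -> cosine(q, e.questionVector())))
                .map(Entry::chunk)
                .orElse(null);
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```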

Read more...

Advanced RAG — Sentence Window Retrieval

Retrieval Augmented Generation (RAG) is a great way to expand the knowledge of Large Language Models, to let them know about your own data and documents. With RAG, LLMs ground their answers in the information you provide, which reduces the chances of hallucination.

Implementing RAG is fairly straightforward with a framework like LangChain4j. However, the results may not be on par with your quality expectations. Often, you’ll need to further tweak different aspects of the RAG pipeline, like the document preparation phase (in particular document chunking), or the retrieval phase, to find the best information in your vector database.
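To recap the technique this post details, in a nutshell: embed each sentence individually for precise matching, but return the matched sentence together with its neighbors so the LLM gets enough context. A sketch, with a hypothetical embed() standing in for your embedding model (in a real pipeline, you would of course precompute and store the sentence embeddings):

```java
import java.util.*;

public class SentenceWindowRetrieval {

    // Hypothetical stand-in: plug in your embedding model here.
    static float[] embed(String text) {
        throw new UnsupportedOperationException("plug in your embedding model");
    }

    // Find the sentence closest to the query, then return it together
    // with `window` sentences on each side for added context.
    static String retrieve(List<String> sentences, String query, int window) {
        float[] q = embed(query);
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < sentences.size(); i++) {
            double score = cosine(q, embed(sentences.get(i)));
            if (score > bestScore) { bestScore = score; best = i; }
        }
        int from = Math.max(0, best - window);
        int to = Math.min(sentences.size(), best + window + 1);
        return String.join(" ", sentences.subList(from, to));
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```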

Read more...

Advanced RAG Techniques

Retrieval Augmented Generation (RAG) is a pattern that lets you prompt a large language model (LLM) about your own data, via in-context learning: you provide extracts of documents found in a vector database (or potentially other sources too).
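In code, the basic pattern boils down to: retrieve the most relevant extracts, splice them into the prompt, and let the LLM answer in context. A minimal sketch with hypothetical Retriever and Llm interfaces; a framework like LangChain4j wires all of this up for you:

```java
import java.util.List;

public class NaiveRag {

    // Hypothetical stand-ins: your vector store lookup and your model call.
    interface Retriever { List<String> retrieve(String query, int topK); }
    interface Llm { String complete(String prompt); }

    static String answer(Retriever retriever, Llm llm, String question) {
        // 1. Retrieve the extracts most similar to the question.
        List<String> extracts = retriever.retrieve(question, 3);
        // 2. Splice them into the prompt (this is the in-context learning part).
        String prompt = """
                Answer the question using only the context below.

                Context:
                %s

                Question: %s
                """.formatted(String.join("\n---\n", extracts), question);
        // 3. Let the LLM generate the grounded answer.
        return llm.complete(prompt);
    }
}
```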

Implementing RAG isn’t very complicated, but the results you get may not live up to your expectations. In the presentations below, I explore various advanced techniques to improve the quality of the responses returned by your RAG system:

Read more...

Grounding Gemini with Web Search results in LangChain4j

The latest release of LangChain4j (version 0.31) added the ability to ground large language models with results from web searches. There are integrations with Google Custom Search Engine and with Tavily.

Grounding an LLM’s response with results from a search engine lets the model find relevant information about the query on the web, likely including up-to-date information that it never saw during training because it appeared after its cut-off date.
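Conceptually, grounding with web search is the same retrieve-then-prompt flow as RAG, with a search engine playing the role of the retriever; passing the URLs along also lets the model cite its sources. A sketch with hypothetical WebSearch and Llm interfaces, not the actual LangChain4j API, which provides ready-made integrations for this:

```java
import java.util.List;

public class WebGroundedAnswer {

    record SearchResult(String title, String url, String snippet) {}

    // Hypothetical stand-ins: your search engine client and your model call.
    interface WebSearch { List<SearchResult> search(String query, int maxResults); }
    interface Llm { String complete(String prompt); }

    static String answer(WebSearch webSearch, Llm llm, String question) {
        List<SearchResult> results = webSearch.search(question, 5);
        // Format each result with its URL so the model can cite its sources.
        StringBuilder context = new StringBuilder();
        for (SearchResult r : results) {
            context.append("- ").append(r.title()).append(" (").append(r.url()).append("): ")
                   .append(r.snippet()).append('\n');
        }
        String prompt = """
                Answer the question using the fresh web results below,
                citing the URLs of the results you rely on.

                Web results:
                %s
                Question: %s
                """.formatted(context, question);
        return llm.complete(prompt);
    }
}
```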

Read more...