Guillaume Laforge

Retrieval Augmented Generation

Advanced RAG — Understanding Reciprocal Rank Fusion in Hybrid Search

Today, let’s come back to one of my favorite generative AI topics: Retrieval Augmented Generation, or RAG for short.

In RAG, the quality of your generation (when an LLM crafts its answer based on search results) is only as good as your retrieval (the search results actually returned).

While vector search (semantic) and keyword search (BM25) each have their strengths, combining them often yields the best results. That’s what we often call Hybrid Search: combining two search techniques, or the results of several searches run with slight variations.
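Since the post is about Reciprocal Rank Fusion, here is the gist of how RRF merges ranked lists: each document earns a score of 1/(k + rank) in every list it appears in, and the scores add up. A minimal Java sketch (the commonly used constant k = 60 comes from the original RRF paper; the document IDs are made up for illustration):

```java
import java.util.*;

public class ReciprocalRankFusion {

    // Fuse several ranked lists of document IDs into one ranking.
    // Each document earns 1 / (k + rank) per list; scores are summed.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        List<String> vectorSearch  = List.of("doc3", "doc1", "doc7");
        List<String> keywordSearch = List.of("doc1", "doc5", "doc3");
        // doc1 and doc3 rank high in both lists, so they bubble up to the top.
        System.out.println(fuse(List.of(vectorSearch, keywordSearch), 60));
    }
}
```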

Read more...

In-browser semantic search with EmbeddingGemma

A few days ago, Google DeepMind released a new embedding model based on the Gemma open-weight model: EmbeddingGemma. With 308 million parameters, this model is small enough to run on edge devices like your phone, tablet, or computer.

Embedding models are the cornerstone of Retrieval Augmented Generation (RAG) systems, and are generally what powers semantic search solutions. Being able to run an embedding model locally means you don’t need to rely on a server (no need to send your data over the internet), which is great for privacy. And cost is reduced as well, because you don’t have to pay for a remote, hosted embedding model.
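At its core, the semantic search such a model powers is just nearest-neighbor ranking in vector space. A minimal sketch of that ranking step; the vectors would come from EmbeddingGemma or any other embedding model, and the tiny 3-dimensional vectors below are made up for illustration:

```java
import java.util.*;

public class SemanticSearch {

    // Cosine similarity: close to 1.0 means same direction, 0.0 means unrelated.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Return document indices, most similar to the query vector first.
    static List<Integer> rank(float[] query, List<float[]> docVectors) {
        Integer[] indices = new Integer[docVectors.size()];
        for (int i = 0; i < indices.length; i++) indices[i] = i;
        Arrays.sort(indices, Comparator.comparingDouble(i -> -cosine(query, docVectors.get(i))));
        return List.of(indices);
    }

    public static void main(String[] args) {
        // Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
        float[] query = {0.9f, 0.1f, 0.0f};
        List<float[]> docs = List.of(
                new float[]{0.8f, 0.2f, 0.1f},   // close to the query
                new float[]{0.0f, 0.1f, 0.9f});  // far from the query
        System.out.println(rank(query, docs));   // prints [0, 1]
    }
}
```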

Read more...

AI Agents, the New Frontier for LLMs

I recently gave a talk titled “AI Agents, the New Frontier for LLMs”. The session explored how we can move beyond simple request-response interactions with Large Language Models to build more sophisticated and autonomous systems.

If you’re already familiar with LLMs and Retrieval Augmented Generation (RAG), the next logical step is to understand and build AI agents.

What makes a system “agentic”?

An agent is more than just a clever prompt. It’s a system that uses an LLM as its core reasoning engine to operate autonomously. The key characteristics that make a system “agentic” include:

Read more...
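To make the “agentic” idea above concrete: at its heart, an agent is a loop in which the LLM repeatedly decides whether to call a tool or to deliver a final answer, feeding each tool result back into the conversation. A minimal sketch of such a loop; the Llm and Tool interfaces and the CALL/ANSWER protocol are hypothetical stand-ins, not any particular framework’s API:

```java
import java.util.*;

public class AgentLoop {

    // Hypothetical stand-ins: plug in your own model and tool implementations.
    interface Llm { String nextStep(List<String> conversation); }
    interface Tool { String run(String input); }

    static String runAgent(Llm llm, Map<String, Tool> tools, String task) {
        List<String> conversation = new ArrayList<>(List.of("Task: " + task));
        for (int step = 0; step < 10; step++) {  // cap the number of iterations
            // The LLM replies either "ANSWER: ..." or "CALL <tool>: <input>".
            String decision = llm.nextStep(conversation);
            if (decision.startsWith("ANSWER:")) {
                return decision.substring("ANSWER:".length()).trim();
            }
            String[] parts = decision.substring("CALL ".length()).split(":", 2);
            String observation = tools.get(parts[0].trim()).run(parts[1].trim());
            // Feed the tool's result back so the next decision can build on it.
            conversation.add(decision);
            conversation.add("Observation: " + observation);
        }
        return "Gave up after too many steps.";
    }
}
```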

Advanced RAG — Using Gemini and long context for indexing rich documents (PDF, HTML...)

A very common question I get when presenting advanced RAG (Retrieval Augmented Generation) techniques is how best to index and search rich documents like PDFs (or web pages) that contain both text and rich elements, like pictures or diagrams.

Another very frequent question people ask me is about RAG versus long context windows. Indeed, models with long context windows usually have a more global understanding of a document, and of each excerpt within its overall context. But of course, you can’t feed all of your users’ or customers’ documents into one single augmented prompt. RAG also has other advantages, like much lower latency, and it is generally cheaper.
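One way to reconcile the two, which the title hints at: use the long-context multimodal model at indexing time, turning each page (text plus images or diagrams) into a self-contained textual description that is then embedded and stored. A sketch of that pipeline, assuming hypothetical MultimodalModel and Embedder interfaces (not the actual Gemini or LangChain4j APIs):

```java
import java.util.*;

public class RichDocumentIndexer {

    // Hypothetical stand-ins: plug in your own multimodal and embedding models.
    interface MultimodalModel { String describePage(byte[] pageImage, String wholeDocumentText); }
    interface Embedder { float[] embed(String text); }

    record IndexedPage(float[] vector, String description, int pageNumber) {}

    static List<IndexedPage> index(MultimodalModel model, Embedder embedder,
                                   List<byte[]> pageImages, String wholeDocumentText) {
        List<IndexedPage> index = new ArrayList<>();
        for (int page = 0; page < pageImages.size(); page++) {
            // The long context lets the model see the whole document while
            // describing one page, so each description stays globally coherent.
            String description = model.describePage(pageImages.get(page), wholeDocumentText);
            index.add(new IndexedPage(embedder.embed(description), description, page + 1));
        }
        return index;
    }
}
```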

Read more...

Advanced RAG — Hypothetical Question Embedding

In the first article of this Advanced RAG series, I talked about an approach I called sentence window retrieval, where we calculate vector embeddings per sentence, but the chunk of text returned (and added to the LLM’s context) also contains the surrounding sentences, to give more context to that embedded sentence. A single sentence tends to yield better vector similarity than embedding the whole surrounding context. It is one of the techniques I cover in my talk on advanced RAG techniques.
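As for the titular technique: instead of (or in addition to) embedding the chunk itself, you ask an LLM to generate the hypothetical questions that the chunk answers, embed those questions, and at query time match the user’s question against them, since a question tends to be semantically closer to another question than to a prose passage. A minimal sketch, with hypothetical Llm and Embedder interfaces standing in for your model bindings:

```java
import java.util.*;

public class HypotheticalQuestionIndex {

    // Hypothetical stand-ins: plug in your own LLM and embedding model.
    interface Llm { List<String> generateQuestions(String chunk); }
    interface Embedder { float[] embed(String text); }

    record Entry(float[] questionVector, String chunk) {}

    private final List<Entry> index = new ArrayList<>();
    private final Embedder embedder;

    // Indexing: embed LLM-generated questions, each pointing back to its chunk.
    HypotheticalQuestionIndex(Llm llm, Embedder embedder, List<String> chunks) {
        this.embedder = embedder;
        for (String chunk : chunks) {
            for (String question : llm.generateQuestions(chunk)) {
                index.add(new Entry(embedder.embed(question), chunk));
            }
        }
    }

    // Retrieval: return the chunk behind the question most similar to the user's.
    String retrieve(String userQuestion) {
        float[] q = embedder.embed(userQuestion);
        return index.stream()
                .max(Comparator.comparingDouble(e -> cosine(q, e.questionVector())))
                .map(Entry::chunk)
                .orElse(null);
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```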

Read more...

Advanced RAG — Sentence Window Retrieval

Retrieval Augmented Generation (RAG) is a great way to expand the knowledge of Large Language Models, to let them know about your own data and documents. With RAG, LLMs ground their answers in the information you provide, which reduces the chances of hallucination.

Implementing RAG is fairly straightforward with a framework like LangChain4j. However, the results may not be on par with your quality expectations. Often, you’ll need to further tweak different aspects of the RAG pipeline, like the document preparation phase (in particular document chunking), or the retrieval phase, to find the best information in your vector database.
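To recap the technique this post details, in a nutshell: embed each sentence individually for precise matching, but return the matched sentence together with its neighbors so the LLM gets enough context. A sketch, with a hypothetical embed() standing in for your embedding model (in a real pipeline, you would of course precompute and store the sentence embeddings):

```java
import java.util.*;

public class SentenceWindowRetrieval {

    // Hypothetical stand-in: plug in your embedding model here.
    static float[] embed(String text) {
        throw new UnsupportedOperationException("plug in your embedding model");
    }

    // Find the sentence closest to the query, then return it together
    // with `window` sentences on each side for added context.
    static String retrieve(List<String> sentences, String query, int window) {
        float[] q = embed(query);
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < sentences.size(); i++) {
            double score = cosine(q, embed(sentences.get(i)));
            if (score > bestScore) { bestScore = score; best = i; }
        }
        int from = Math.max(0, best - window);
        int to = Math.min(sentences.size(), best + window + 1);
        return String.join(" ", sentences.subList(from, to));
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```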

Read more...

Advanced RAG Techniques

Retrieval Augmented Generation (RAG) is a pattern that lets you prompt a large language model (LLM) about your own data, via in-context learning: you provide extracts of documents found in a vector database (or potentially other sources too).
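In code, the basic pattern boils down to: retrieve the most relevant extracts, splice them into the prompt, and let the LLM answer in context. A minimal sketch with hypothetical Retriever and Llm interfaces; a framework like LangChain4j wires all of this up for you:

```java
import java.util.List;

public class NaiveRag {

    // Hypothetical stand-ins: your vector store lookup and your model call.
    interface Retriever { List<String> retrieve(String query, int topK); }
    interface Llm { String complete(String prompt); }

    static String answer(Retriever retriever, Llm llm, String question) {
        // 1. Retrieve the extracts most similar to the question.
        List<String> extracts = retriever.retrieve(question, 3);
        // 2. Splice them into the prompt (this is the in-context learning part).
        String prompt = """
                Answer the question using only the context below.

                Context:
                %s

                Question: %s
                """.formatted(String.join("\n---\n", extracts), question);
        // 3. Let the LLM generate the grounded answer.
        return llm.complete(prompt);
    }
}
```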

Implementing RAG isn’t very complicated, but the results you get may not live up to your expectations. In the presentations below, I explore various advanced techniques to improve the quality of the responses returned by your RAG system:

Read more...

Grounding Gemini with Web Search results in LangChain4j

The latest release of LangChain4j (version 0.31) added the ability to ground large language models with results from web searches. There are integrations with Google Custom Search Engine and with Tavily.

Grounding an LLM’s response with results from a search engine lets the model find relevant information about the query on the web, likely including up-to-date information that it never saw during training because it appeared after its cut-off date.
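Conceptually, grounding with web search is the same retrieve-then-prompt flow as RAG, with a search engine playing the role of the retriever; passing the URLs along also lets the model cite its sources. A sketch with hypothetical WebSearch and Llm interfaces, not the actual LangChain4j API, which provides ready-made integrations for this:

```java
import java.util.List;

public class WebGroundedAnswer {

    record SearchResult(String title, String url, String snippet) {}

    // Hypothetical stand-ins: your search engine client and your model call.
    interface WebSearch { List<SearchResult> search(String query, int maxResults); }
    interface Llm { String complete(String prompt); }

    static String answer(WebSearch webSearch, Llm llm, String question) {
        List<SearchResult> results = webSearch.search(question, 5);
        // Format each result with its URL so the model can cite its sources.
        StringBuilder context = new StringBuilder();
        for (SearchResult r : results) {
            context.append("- ").append(r.title()).append(" (").append(r.url()).append("): ")
                   .append(r.snippet()).append('\n');
        }
        String prompt = """
                Answer the question using the fresh web results below,
                citing the URLs of the results you rely on.

                Web results:
                %s
                Question: %s
                """.formatted(context, question);
        return llm.complete(prompt);
    }
}
```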

Read more...