Building a Merchant Support Agent with FastAPI, RAG, and LangGraph 🛍️
A practical AI backend for grounded support answers, product recommendations, and modern agentic orchestration
I built an end-to-end merchant support AI backend for a real commerce scenario. The project combines FastAPI, OpenAI embeddings, ChromaDB, LangGraph, and Pydantic to answer operational questions, retrieve FAQ and product context, recommend products, and escalate when the retrieved evidence is weak. This focused proof of concept was created to validate a support-agent architecture for conversational commerce use cases. It showcases a modern applied AI stack, clear service boundaries, and a pragmatic approach to retrieval-augmented and agent-oriented backend design.
I have always liked problems that sit at the border between language, mathematics, and software.
Some years ago that curiosity took me through word embeddings, sequence models, neural machine translation, dependency parsing, and transformers. Back then, much of the intellectual excitement lived inside the model: how to represent words, how to encode sequences, how attention works, how pretraining changes everything.
That layer is still fascinating. But the center of gravity of practical AI has shifted.
Today, many of the most interesting engineering questions are not only about the model itself, but about the system around it, such as how to connect a model to private knowledge, constrain its behavior, make it retrieve evidence before answering, and turn generation into one step of a larger workflow rather than the whole workflow.
This project was a way to explore exactly that.
From Models to Systems - Source: Own
I built a merchant support agent: an AI backend that can answer operational questions from a FAQ knowledge base, retrieve product context from a catalog, recommend products, and escalate when the available context is too weak.
At first glance, that sounds like a fairly ordinary support use case. In practice, it is a very good vehicle to explore several ideas that are central to applied AI engineering today:
Retrieval-Augmented Generation
text embeddings and semantic retrieval
vector search
structured outputs
bounded orchestration
explicit failure paths
In this article I want to do two things.
First, I want to explain the technical ideas behind this kind of system in a way that is formal enough to be useful, but not so dense that it loses contact with the implementation.
Second, I want to show how those ideas materialize in a concrete backend built with FastAPI, OpenAI embeddings, ChromaDB, LangGraph, and Pydantic.
A standard language model can be seen as approximating a conditional distribution of an output sequence y given an input x:
P(y∣x)
Where:
x is the user query or prompt
y is the generated answer
This is already powerful, but it has an obvious limitation: all the evidence needed to produce y must either be present in x or somehow encoded in the model's parameters.
That is not ideal for many business scenarios.
Suppose a merchant changes its delivery rules, adds new products, removes others, or updates payment methods. We do not want to retrain a model every time a policy or catalog changes. We want the answer to depend on external knowledge that can be updated independently.
This is where Retrieval-Augmented Generation enters.
Instead of generating from the query alone, we retrieve a set of relevant documents first, then condition the answer on both the query and the retrieved evidence:
P(y∣x,R(x))
Where R(x) is a retrieval function such that:
R(x)={d1,d2,…,dk}
and each di is a document selected from a knowledge base because it is relevant to the query x.
This changes the problem in a fundamental way.
The quality of the final answer no longer depends only on the model. It depends on at least three components:
The representation of the documents
The retrieval function R(⋅)
The generation model conditioned on the retrieved context
That is one of the reasons I find RAG so interesting from a software engineering perspective. It forces us to think beyond the model and treat the full pipeline as the real system.
A useful way to visualize the idea is the following:
flowchart LR
Q[User Question]
E[Embed Query]
R[Retriever]
D[Relevant Documents]
G[Generator]
A[Answer]
Q --> E
E --> R
R --> D
Q --> G
D --> G
G --> A
The model is still important, of course. But now it is only one component in a retrieval-grounded architecture.
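The conditioning on R(x) can be made concrete with a toy sketch. Everything below is illustrative: the real project uses embeddings and an LLM, while this stand-in uses naive word overlap as the retriever and a string template as the "generator". The point is only the shape of P(y∣x,R(x)): retrieve first, then condition the answer on both the query and the evidence.

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """R(x): score documents by naive word overlap and keep the top k."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(query: str, evidence: list[str]) -> str:
    """Stand-in for the LLM: conditions the answer on query AND retrieved context."""
    context = " | ".join(evidence)
    return f"Answer to '{query}' based on: {context}"

docs = [
    "We accept credit cards and bank transfer.",
    "Delivery takes 2 to 5 business days.",
    "Returns are accepted within 30 days.",
]
question = "Do you accept bank transfer?"
print(generate(question, retrieve(question, docs)))
```

Swapping the toy retriever for embedding-based nearest-neighbor search, and the template for a chat model, gives the actual RAG pipeline without changing this overall structure.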
Embeddings and semantic retrieval
To retrieve relevant documents, we first need a way to represent text numerically.
If we define an embedding function as:
f:Text→Rn
then a query x can be mapped to a vector q and each document di can be mapped to a vector vi:
q=f(x),vi=f(di)
Once both the query and the documents live in the same vector space, retrieval becomes a nearest-neighbor problem. We can search for the documents whose vectors are "closest" to the query vector.
A common similarity measure is cosine similarity:
cosine(q,vi) = (q⋅vi) / (∥q∥∥vi∥)
Alternatively, we can work with distance-based formulations. In practice, vector databases often expose nearest-neighbor search over these embedded representations and return the top-k most similar items.
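The cosine formula and the top-k search it enables fit in a few lines of plain Python. The three-dimensional "embeddings" below are made up for illustration; real embedding vectors have hundreds or thousands of dimensions.

```python
import math

def cosine(q: list[float], v: list[float]) -> float:
    """cosine(q, v) = (q . v) / (||q|| ||v||)"""
    dot = sum(a * b for a, b in zip(q, v))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_q * norm_v)

def top_k(q: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    """Nearest-neighbor retrieval: indices of the k most similar vectors."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(q, vectors[i]),
                    reverse=True)
    return ranked[:k]

query = [1.0, 0.2, 0.0]
docs = [[0.9, 0.1, 0.0],   # close to the query
        [0.0, 1.0, 0.8],   # unrelated direction
        [0.8, 0.3, 0.1]]   # also close
print(top_k(query, docs, k=2))  # indices of the two nearest documents
```

A vector database performs essentially this ranking, with approximate-nearest-neighbor indexes so it stays fast at scale.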
Why is this useful?
Because semantic retrieval is not restricted to exact word overlap. Two texts can be close in embedding space even if they do not share the same literal tokens. That matters a lot in support and recommendation scenarios.
A user may ask:
"How do I pay?"
"Can I use bank transfer?"
"What options do you have for payment?"
"Do you accept Yape?"
These are different strings, but semantically related. A retrieval layer based only on exact keyword matching would be much more brittle here.
This is precisely the kind of problem embeddings help with. They give us a representation where meaning, or at least semantic proximity, becomes easier to operationalize.
Of course, embeddings are not magic. Their quality depends on the model, the data, and the granularity of the indexed documents. But they provide a practical bridge between raw text and search.
A support assistant is not only a language model
Suppose a user asks one of the following:
"How can I pay?"
"Do you ship outside the city?"
"What do you recommend for dry hair?"
"What shampoo do you recommend for dry hair and how can I pay for it?"
These are all natural-language questions, but they are not the same type of question.
Some are operational: payment methods, delivery zones, returns policies, business hours.
Some are product-oriented: product discovery, product suitability, basic recommendations.
And some combine both.
This distinction matters.
If we collapse all these requests into a single prompt and ask the model to "figure it out", we lose control over the reasoning path. The system may still produce fluent answers, but fluency is not the same as structure.
A better design is to stage the request:
Determine what kind of question we are dealing with
Retrieve the appropriate context
Decide whether the evidence is strong enough
Generate a grounded answer
This is still a simple flow, but it already has an important property: it turns the system into a sequence of inspectable transformations rather than one opaque jump from prompt to response.
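The staged design can be sketched as four small functions and a driver. All names and rules here are hypothetical simplifications, not the project's real code; the point is that each step is a plain, inspectable transformation rather than one opaque prompt-to-answer jump.

```python
def classify(question: str) -> str:
    """Stage 1: determine the question type (illustrative keyword rules)."""
    q = question.lower()
    is_faq = any(w in q for w in ("pay", "ship", "deliver", "return"))
    is_product = any(w in q for w in ("recommend", "shampoo", "product"))
    if is_faq and is_product:
        return "mixed"
    return "faq" if is_faq else "product" if is_product else "unknown"

def retrieve(question: str, qtype: str) -> list[str]:
    """Stage 2: fetch context. A real system queries vector collections."""
    fake_index = {"faq": ["We accept cards and bank transfer."],
                  "product": ["Hydra Shampoo: for dry hair."]}
    if qtype == "mixed":
        return fake_index["faq"] + fake_index["product"]
    return fake_index.get(qtype, [])

def decide(context: list[str]) -> str:
    """Stage 3: is the evidence strong enough to answer?"""
    return "answer" if context else "escalate"

def generate(question: str, context: list[str]) -> str:
    """Stage 4: produce the grounded answer (LLM stand-in)."""
    return f"[grounded answer to '{question}' using {len(context)} documents]"

q = "What shampoo do you recommend and how can I pay?"
qtype = classify(q)        # mixed: both operational and product-oriented
ctx = retrieve(q, qtype)   # pulls from both knowledge sources
print(qtype, decide(ctx), generate(q, ctx))
```

Each intermediate value (the type, the context, the action) can be logged, tested, and reasoned about independently, which is exactly what a single mega-prompt makes difficult.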
By reasoning path I mean the sequence of intermediate steps a system follows between the user's input and the final answer: classification, retrieval, action selection, tool use, filtering, and response generation. In AI-backed systems, making that path explicit matters because it lets us decide which parts should remain deterministic and which parts can benefit from probabilistic inference. If we delegate the entire path to an LLM, we gain flexibility but lose control, inspectability, and often reliability. If we over-constrain everything with rigid rules, we gain predictability but lose adaptability and semantic coverage. The engineering challenge is to find the right boundary between both worlds: enough probabilistic capacity to deal with ambiguity and language variation, and enough deterministic structure to keep the system understandable, testable, and safe for its task.
From generation to orchestration
Once we decompose the problem into steps, the backend starts looking less like a chatbot and more like a state-transition system.
We can define an internal state as:
s=(q,t,F,P,a,r)
Where:
q is the original question
t is the inferred question type
F is the set of retrieved FAQ items
P is the set of retrieved product items
a is the selected action
r is the final response
The workflow can then be understood as a sequence of transitions:
s0→s1→s2→s3→s4
For this project, the sequence is:
flowchart TD
Q[Question]
C[Classify Question]
R[Retrieve Context]
D[Decide Action]
G[Generate Answer]
E[Escalate]
Q --> C
C --> R
R --> D
D --> G
D --> E
This is one of the reasons I wanted to use LangGraph.
Because even a compact workflow benefits from having explicit stages, explicit state, and explicit transitions.
That is where orchestration starts becoming interesting. It is not about making the system look more agentic for the sake of fashion. It is about expressing control flow cleanly when generation is only one part of the job.
The architecture
The project is organized in two stages: an indexing stage and an inference stage.
This separation is important because embeddings and vector storage should be prepared ahead of time, while request-time logic should only read from the existing index.
Indexing stage
The indexing stage consumes two local data sources: a FAQ markdown file and a product catalog JSON file.
Each FAQ section is transformed into one searchable document. Each product is also transformed into one searchable document built from fields such as name, description, category, price, and tags.
Those documents are embedded with OpenAI embeddings and stored in two ChromaDB collections, faq_chunks and product_chunks.
I like this separation because it reflects the semantics of the problem:
FAQ content is operational knowledge
product content is recommendation-oriented knowledge
Mixing both into a single collection would make retrieval less interpretable and the later control flow less clear.
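The document-construction part of the indexing stage might look like the sketch below. File contents, field names, and the section delimiter are assumptions for illustration; the embedding call and the ChromaDB insertion into faq_chunks and product_chunks are deliberately left out and noted in the final comment.

```python
import json

def faq_documents(markdown: str) -> list[str]:
    """Turn each '## ' section of a FAQ markdown file into one document."""
    sections = [s.strip() for s in markdown.split("## ") if s.strip()]
    return ["## " + s for s in sections]

def product_documents(catalog: list[dict]) -> list[str]:
    """Flatten each product record into one searchable text document."""
    return [
        f"{p['name']}. {p['description']} "
        f"Category: {p['category']}. Price: {p['price']}. "
        f"Tags: {', '.join(p['tags'])}."
        for p in catalog
    ]

faq_md = "## Payments\nWe accept cards.\n## Delivery\n2 to 5 business days."
catalog = json.loads(
    '[{"name": "Hydra Shampoo", "description": "For dry hair.", '
    '"category": "hair", "price": 9.9, "tags": ["dry", "repair"]}]'
)

print(faq_documents(faq_md))
print(product_documents(catalog))
# Next step (not shown): embed each document and add it to the
# faq_chunks or product_chunks collection in the vector store.
```

Keeping document construction as pure functions like these also makes the indexing stage easy to unit-test before any embeddings are paid for.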
Inference stage
The online request path is compact:
A client sends POST /ask with a natural-language question
FastAPI validates the payload with Pydantic
A service layer delegates to a support agent
The agent classifies the question as faq, product, mixed, or unknown
Retrieval runs against one or both ChromaDB collections
A decision node chooses answer, recommend, or escalate
The chat model generates the final grounded answer using only the retrieved context
The overall flow looks like this:
flowchart LR
A[POST /ask]
B[Validate Request]
C[Classify Question]
D[Search FAQ Collection]
E[Search Product Collection]
F[Decide Action]
G[Generate Grounded Answer]
H[Return Response]
A --> B
B --> C
C --> D
C --> E
D --> F
E --> F
F --> G
G --> H
The important point here is that generation happens late in the pipeline, not at the beginning. The answer is produced after classification, after retrieval, and after action selection.
That order matters more than it might seem.
Why LangGraph was a good fit
A lot of current AI demos follow roughly the same recipe: send a prompt, let the model infer intent, decide whether to use tools, reason, and answer, all with the unspoken hope that the entire process remains coherent enough to be useful.
Sometimes that is acceptable. But for this project I wanted a flow that was more explicit.
LangGraph felt like a good fit because the workflow was naturally graph-shaped, but still bounded:
classify_question
retrieve_context
decide_action
generate_answer
That was enough.
I was not trying to build an open-ended autonomous agent. I was trying to build a support backend with a disciplined control flow. In many practical scenarios, that is the more interesting problem.
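LangGraph provides typed state, named nodes, and conditional edges out of the box. To show the shape of the idea without pulling in the dependency, here is a minimal hand-rolled equivalent of the same four-stage graph; the node logic is illustrative only.

```python
# Nodes: each takes the state dict and returns it, updated.
def classify_question(state):
    state["type"] = "faq" if "pay" in state["q"].lower() else "unknown"
    return state

def retrieve_context(state):
    state["ctx"] = ["We accept cards."] if state["type"] == "faq" else []
    return state

def decide_action(state):
    state["action"] = "answer" if state["ctx"] else "escalate"
    return state

def generate_answer(state):
    state["r"] = f"Grounded answer using {len(state['ctx'])} documents."
    return state

def escalate(state):
    state["r"] = "Escalated to a human agent."
    return state

NODES = {f.__name__: f for f in
         (classify_question, retrieve_context, decide_action,
          generate_answer, escalate)}
EDGES = {"classify_question": "retrieve_context",
         "retrieve_context": "decide_action"}

def run(question: str) -> dict:
    """Bounded execution: explicit nodes, one conditional edge, no loops."""
    state, node = {"q": question}, "classify_question"
    while node:
        state = NODES[node](state)
        if node == "decide_action":  # the single conditional edge
            node = "generate_answer" if state["action"] == "answer" else "escalate"
        else:
            node = EDGES.get(node)   # terminal nodes have no outgoing edge
    return state

print(run("How can I pay?")["r"])
print(run("Tell me a joke")["r"])
```

What LangGraph adds on top of a sketch like this is a typed state schema, graph compilation, checkpointing, and tooling, but the mental model stays the same: explicit stages, explicit state, explicit transitions.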
There is a broader lesson here.
In practice, the most useful agentic systems aren't always the most autonomous. Often, they're simply the ones with the clearest boundaries: explicit state, visible transitions, constrained tool usage, and defined fallback behavior. This project aimed to follow that approach.
Structured outputs as software contracts
Another aspect of the project that I particularly enjoyed was the use of structured outputs.
The backend uses Pydantic for more than just the API layer; it also powers internal models like retrieved FAQ items, product matches, question classification, the final answer schema, and agent state.
At first glance this may look like a small implementation detail. In reality, it is part of the architecture.
Whenever LLMs are involved, there is a strong temptation to let everything degrade into free-form text. That may be enough for quick experimentation, but it becomes fragile very quickly once the system grows beyond one prompt.
Structured outputs help in several ways:
they make intermediate steps unambiguous
they make the control flow easier to inspect
they give the rest of the backend stable contracts
they reduce the amount of string parsing and prompt guesswork
This is one of the most practical patterns in current AI engineering, and it makes systems much easier to maintain.
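A hedged sketch of what such contracts can look like with Pydantic; the field names here are hypothetical versions of the internal models, and the real project's schemas may differ. The key property is that every intermediate step produces a validated, typed object instead of free-form text, and invalid values are rejected at the boundary.

```python
from typing import List, Literal
from pydantic import BaseModel

class QuestionClassification(BaseModel):
    """Output of the classification step; anything else is rejected."""
    question_type: Literal["faq", "product", "mixed", "unknown"]

class FAQMatch(BaseModel):
    """One retrieved FAQ item, carrying its similarity score."""
    text: str
    score: float

class FinalAnswer(BaseModel):
    """The answer contract the rest of the backend can rely on."""
    answer: str
    action: Literal["answer", "recommend", "escalate"]
    sources: List[str] = []

c = QuestionClassification(question_type="faq")
a = FinalAnswer(answer="We accept cards and bank transfer.",
                action="answer",
                sources=["faq_chunks"])
print(c.question_type, a.action)
```

Because `question_type` and `action` are Literal fields, a malformed model output like `question_type="banana"` raises a validation error immediately, instead of silently flowing downstream as a string nobody checks.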
A note on retrieval quality
One subtle point about systems like this is that generation quality is downstream of retrieval quality.
If the retriever sends poor context, the generator has little chance to recover. It may still produce a fluent answer, but fluency is a dangerous metric here. A support answer can be perfectly well written and still be wrong.
To keep things simple and legible, the retrieval layer relies on two separate collections, well-defined documents, explicit top-k retrieval, and a direct mapping from search results into typed models.
In other words, the retrieval layer is intentionally boring in the best possible way.
If I were extending the project, one of the most interesting directions would be to evaluate retrieval more systematically:
which questions retrieve the right FAQ sections?
which questions retrieve the most relevant products?
how sensitive is the system to wording variation?
when should reranking be introduced?
Those are the kinds of questions that move a project from "working demo" into a stronger applied system.
On escalation
One of the easiest ways to make an AI system seem impressive is to let it answer everything.
One of the easiest ways to make it unreliable is exactly the same.
This POC includes a decision step that checks retrieval scores and chooses one of three actions: answer, recommend, or escalate.
The scoring function is heuristic. It is not calibrated confidence, and I would not want to overstate it. Still, even a lightweight thresholding step changes the character of the system in an important way.
Instead of assuming that every question deserves a polished answer, the backend has an explicit path for saying: the available context is not strong enough, this should be escalated.
That may not be the most glamorous feature in the project, but from a support perspective it is probably one of the most valuable.
A system like this should not only know how to speak. It should also know when evidence is insufficient.
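The decision step can be sketched as a small pure function over retrieval scores. The thresholds below are made up for illustration, and, as noted above, these scores are heuristic rather than calibrated confidence; the point is the existence of an explicit escalation path, not the particular numbers.

```python
# Hypothetical thresholds; real values would be tuned against data.
FAQ_THRESHOLD = 0.75
PRODUCT_THRESHOLD = 0.70

def decide(question_type: str, faq_score: float, product_score: float) -> str:
    """Choose one of three actions based on the strongest retrieval score."""
    if question_type in ("faq", "mixed") and faq_score >= FAQ_THRESHOLD:
        return "answer"
    if question_type in ("product", "mixed") and product_score >= PRODUCT_THRESHOLD:
        return "recommend"
    # Evidence too weak in every applicable collection: hand off
    # to a human instead of generating a confident-sounding guess.
    return "escalate"

print(decide("faq", 0.82, 0.0))      # strong FAQ evidence
print(decide("product", 0.0, 0.40))  # weak evidence, so escalate
```

Because the function is deterministic and side-effect free, the escalation behavior can be unit-tested exhaustively, which is hard to say about a prompt that merely asks the model to "escalate when unsure".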
Sync by design
Another implementation detail I liked was keeping the request path synchronous.
The FastAPI route is synchronous. The service layer is synchronous. The graph invocation path is synchronous.
This was deliberate.
It is easy to wrap an AI backend in asynchronous syntax at the HTTP boundary and give the impression of full async design while the expensive work underneath remains effectively blocking. I prefer not to decorate a path with async unless the underlying libraries and execution path benefit from it in a meaningful way.
So in this project I kept the design simple and honest.
That may sound like a minor point, but I think it reflects a broader engineering principle: architecture should mirror reality, not aspiration.
Some implementation details I particularly enjoyed
A few design choices made this project especially satisfying to work on.
Separate collections for separate semantics
Keeping FAQ retrieval apart from product retrieval was a good decision. It gives the system a clean structure:
one knowledge source for support and policy content
one knowledge source for product discovery and recommendation
That separation later simplifies classification, retrieval, and response shaping.
The answer is generated last
The final answer is not the beginning of the process. It is the end of the process.
The system first narrows the type of question, then retrieves evidence, then decides the appropriate action, and only then generates the user-facing answer.
This is a small but meaningful design principle. Generation should happen after the problem has been constrained.
The answer is explicitly grounded
The answer prompt instructs the model to use only the retrieved context. Of course, no prompt completely solves hallucination risk, but this still matters a lot. The difference between:
"answer this question"
and
"answer this question using only these retrieved facts"
is substantial.
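In code, the grounded variant is just a prompt builder that injects the retrieved facts and gives the model an explicit way out. The wording below is illustrative, not the project's actual prompt.

```python
def grounded_prompt(question: str, context: list[str]) -> str:
    """Build a prompt that constrains the model to the retrieved evidence."""
    facts = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the customer's question using ONLY the facts below.\n"
        "If the facts are not sufficient, say so instead of guessing.\n\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = grounded_prompt(
    "Do you accept bank transfer?",
    ["We accept credit cards and bank transfer.",
     "Payments are confirmed within 24 hours."],
)
print(prompt)
```

No prompt eliminates hallucination, but constraining the evidence and providing an explicit "insufficient context" escape hatch measurably changes how the model behaves when retrieval comes back weak.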
Why this kind of project matters
A few years ago, a project like this would have been almost entirely about the model. Which architecture? Which benchmark? Which fine-tuning strategy? Which dataset? Those questions still matter; they always will.
But what I've found is that in practical systems, the more interesting engineering work now lives elsewhere. How is knowledge represented? How does retrieval actually work? How is state managed, control flow staged, answers constrained, uncertainty handled?
The Shift From Models to Systems - Source: Own
That shift is honestly one of the reasons I wanted to build this project. It feels like part of a broader movement in AI engineering, where value comes less from the model alone and more from the quality of the system around it.
Where this could go next
If I extended this project further, there are two directions I would find especially interesting.
The first is architectural hardening. For a more production-oriented version, the system would need a more robust persistence and retrieval layer, stronger observability, authentication, and a better evaluation loop. A natural next step would be to move from a local vector store to a PostgreSQL-backed setup with pgvector, especially if retrieval needs to live closer to the rest of the operational data. That would make it easier to integrate support knowledge, catalog data, metadata filters, and application state inside one broader backend architecture.
The second is execution model. The current request path is synchronous by design, but a real async version would require more than changing the route signature. It would require checking the full path end-to-end: async-compatible model clients, async retrieval boundaries, connection management, timeout handling, and backpressure under concurrency. In other words, true async is not a surface-level syntax choice. It is a property of the whole request pipeline.
Closing thoughts
What I find most interesting about projects like this is that they force two worlds to meet.
On one side, there is the probabilistic world of language models, embeddings, and approximate semantic representations.
On the other side, there is the deterministic world of backend engineering: contracts, state transitions, retrieval pipelines, and explicit decisions.
Neither side is sufficient by itself.
A model without structure is difficult to trust. A structure without language capability cannot solve the problem. The useful system emerges in the middle, where statistical representations and software design begin to reinforce each other.
That, at least for me, is where a lot of the excitement in practical AI lies now. Not only in bigger models, but in better systems around them.
There is also a more speculative question that I find increasingly interesting.
In this project, retrieval is text-first: FAQ entries and product records are turned into text, embedded, and searched semantically. But what happens when the underlying knowledge is not naturally textual?
Could we build RAG-like systems over voice, images, or video without first reducing everything to text as an intermediate representation? Could retrieval operate directly over multimodal embeddings, where a spoken question retrieves relevant audio fragments, visual evidence, or short video segments in their native representational space?
Gemini Embedding 2 multimodal retrieval demo - Source: Google
Recent developments suggest that this direction is becoming much more realistic. Google's Gemini Embedding 2 is explicitly presented as a natively multimodal embedding model that maps text, images, video, audio, and documents into a single shared embedding space. In principle, that means retrieval could be formulated over heterogeneous media objects that are comparable in one semantic vector space, rather than only over text produced after transcription, captioning, or OCR. That does not make multimodal RAG a solved problem, but it does make the path toward retrieval-native multimodal systems much easier to imagine.
Text is still an extremely useful bridge. But perhaps, increasingly, it will not need to be the only one.
The implementation repository for this project is currently private. If this system is relevant to your team or hiring process, feel free to contact me and I can walk you through the architecture, design decisions, and selected implementation details.