
LLM Optimization: RAG, Vector Search, and the Edge

Running a 70B parameter model on a storefront is suicide. How to engineer AI features that are fast, cheap, and actually useful.

Alex B.

In 2024, every e-commerce CEO asked: “How do we add AI?” In 2025, every CTO is answering: “Ideally, without going bankrupt.” Large Language Models (LLMs) are heavy, slow, and prone to hallucination. Customers do not want to chat with a bot that thinks a toaster is a microwave. They want Semantic Search and Hyper-Personalization. This requires a specific architecture: RAG (Retrieval-Augmented Generation) at the Edge.

Why Maison Code Discusses This

At Maison Code Paris, we act as the architectural conscience for our clients. We often inherit “modern” stacks that were built without a foundational understanding of scale. We see simple APIs that take 4 seconds to respond because of N+1 query problems, and “Microservices” that cost $5,000/month in idle cloud fees.

We discuss this topic because it represents a critical pivot point in engineering maturity. Implementing this correctly differentiates a fragile MVP from a resilient, enterprise-grade platform that can handle Black Friday traffic without breaking a sweat.

The Problem: Latency & Cost

Example: A user asks, “Do you have any summer dresses that are good for a wedding in Italy?”

  • Naive Approach: Send the entire product catalog (CSV) into the GPT-4 context window.

    • Cost: $2.00 per query (Input tokens).
    • Latency: 15 seconds.
    • Result: User leaves before the answer loads.
  • Engineered Approach: Vector Search + RAG.

    • Cost: $0.002 per query.
    • Latency: 400ms.
    • Result: Conversion.

The Architecture: The RAG Pipeline

We don’t ask the LLM to “know” our products. We ask it to “summarize” our search results.

sequenceDiagram
    participant User
    participant Edge as Edge Function (Vercel)
    participant Vector as Vector DB (Pinecone)
    participant LLM as OpenAI (GPT-4o-mini)

    User->>Edge: "Red dress for Italy wedding"
    
    Note over Edge: 1. Generate Embeddings
    Edge->>LLM: Embedding Request (text-embedding-3-small)
    LLM-->>Edge: [0.12, 0.98, -0.4...]
    
    Note over Edge: 2. Semantic Search
    Edge->>Vector: Query closest vectors (Top 5)
    Vector-->>Edge: Returns 5 Product JSONs
    
    Note over Edge: 3. Synthesis
    Edge->>LLM: "Here are 5 dresses. Recommend one for Italy."
    LLM-->>Edge: "The Amalfi Silk Dress is perfect because..."
    
    Edge-->>User: JSON Response (Product + Text)

Step 1: Vectorizing the Catalog

You cannot search text. You must search “Meaning.” We convert every product description into a Vector Embedding (an array of 1536 floating point numbers). “Red Dress” and “Crimson Gown” have different text but similar vectors (distance < 0.2).

The Ingestion Script (Node.js)

We run this on a cron job every night.

import { OpenAI } from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI();
const pinecone = new Pinecone();

async function vectorizeProduct(product) {
  // 1. Create a "Semantic String"
  // We combine Title, Description, Tags, and Metafields
  const content = `
    Title: ${product.title}
    Description: ${product.description}
    Fabric: ${product.tags.join(', ')}
    Vibe: ${product.metafields.custom.vibe}
  `.trim();

  // 2. Generate Embedding
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: content,
  });

  // 3. Upsert to Vector DB
  await pinecone.index("maison-products").upsert([{
    id: product.id,
    values: embedding.data[0].embedding,
    metadata: {
      title: product.title,
      price: product.price,
      handle: product.handle,
      image: product.image
    }
  }]);
}
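
The nightly job itself is just a loop over the catalog. fetchAllProducts() below is a hypothetical stand-in for whatever source of truth you pull from (Shopify Admin API, a database export, etc.); vectorizeProduct() is the function defined above.

// Cron entry point: re-embed and upsert every product once a night.
// fetchAllProducts() is hypothetical; vectorizeProduct() is defined above.
async function reindexCatalog() {
  const products = await fetchAllProducts();
  for (const product of products) {
    await vectorizeProduct(product);
  }
  console.log(`Re-indexed ${products.length} products`);
}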

Step 2: The Edge Query

Speed is critical. We use Vercel Edge Functions or Cloudflare Workers. We do NOT use a Python backend: its cold starts are too slow. We use strictly typed TypeScript on the Edge.

The query happens in two stages (a code sketch follows below):

  1. Retrieval: Find the relevant products.
    • “Wedding in Italy” -> Semantic map -> Linen, Breathable, Floral, Elegant.
    • Vector DB returns: Amalfi Dress, Tuscany Skirt, Roma Sandals.
  2. Generation: Explain WHY.
    • Prompt: “You are a fashion stylist. Explain why these 3 items fit the user’s request. Be brief.”
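
Wired together, the handler looks roughly like this. It is a minimal sketch of a Next.js Edge route: the index name and metadata shape are carried over from the ingestion script, while the prompts, topK, and route path are illustrative rather than a drop-in implementation.

// app/api/assistant/route.ts — minimal sketch of the two-stage query
// on the Edge runtime. Index name and metadata match the ingestion
// script above; prompts and topK are illustrative.
import { OpenAI } from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

export const runtime = "edge";

const openai = new OpenAI();
const pinecone = new Pinecone();

export async function POST(req: Request) {
  const { query } = await req.json();

  // 1. Retrieval: embed the question, find the closest products
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  const results = await pinecone.index("maison-products").query({
    vector: embedding.data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });
  const products = results.matches.map((m) => m.metadata);

  // 2. Generation: a small model explains WHY these items fit
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "You are a fashion stylist. Explain briefly why these items fit the request.",
      },
      {
        role: "user",
        content: `Request: ${query}\nProducts: ${JSON.stringify(products)}`,
      },
    ],
  });

  return Response.json({
    products,
    answer: completion.choices[0].message.content,
  });
}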

Optimization Techniques

Processing tokens costs money. Here is how we cut costs by 90%.

1. Metadata Filtering

If a user filters by “Size: S”, do not search the whole vector space. Apply a metadata filter in Pinecone FIRST: vector_search(query_vector, filter={ size: "S", in_stock: true }). This reduces the search space and improves accuracy.
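
Assuming the same Pinecone client as in the ingestion script, the filtered query looks roughly like this; the field names (size, in_stock) are illustrative and must match whatever metadata you upserted.

// Filtered semantic search: Pinecone restricts candidates by metadata
// before the vector comparison. Field names are illustrative.
import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone();

async function searchInStockSmall(queryVector: number[]) {
  return pinecone.index("maison-products").query({
    vector: queryVector,
    topK: 5,
    filter: { size: { $eq: "S" }, in_stock: { $eq: true } },
    includeMetadata: true,
  });
}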

2. Caching Answers

80% of users ask the same questions. “What is your return policy?” “Do you ship to Canada?” We cache the LLM response in Redis, keyed by the query vector. If a new question is semantically similar (vector distance < 0.1) to a cached one, we return the cached answer with effectively zero latency. The lookup logic is sketched in the Semantic Caching section below.

3. Small Models (SLMs)

Do you need GPT-4 for this? No. GPT-4o-mini or Claude Haiku is 20x cheaper and faster. For e-commerce recommendations, “intelligence” is less important than “context.” If you provide the right products in the context window, even a small model gives a great answer.

UI: The “Generative UI”

Don’t just stream text. Stream Components. When the LLM suggests a dress, render the <ProductCard /> component right in the chat. We use the Vercel AI SDK to stream UI states.

// The Chat Interface
import { useChat } from 'ai/react';
import { ProductCarousel } from './ProductCarousel'; // local component

export function ShopAssistant() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();

  return (
    <div className="chat-window">
      {messages.map(m => (
        <div key={m.id} className={m.role}>
          {m.content}
          {/* If a tool call returned products, render them as cards */}
          {m.toolInvocations?.map(tool =>
            tool.state === 'result' ? (
              <ProductCarousel key={tool.toolCallId} products={tool.result} />
            ) : null
          )}
        </div>
      ))}

      {/* The input that drives useChat */}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask the stylist..."
        />
      </form>
    </div>
  );
}
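
The client above expects a route handler that exposes product search as a tool. Below is a rough sketch using the Vercel AI SDK's streamText and tool helpers; exact method names differ slightly between SDK versions, and searchCatalog is a hypothetical stand-in for the vector retrieval shown earlier.

// app/api/chat/route.ts — sketch of the route handler behind useChat.
// Helper names vary slightly between AI SDK versions.
import { openai } from "@ai-sdk/openai";
import { streamText, tool } from "ai";
import { z } from "zod";

// Illustrative stand-in: in production this calls the Pinecone retrieval above.
async function searchCatalog(query: string) {
  return [{ title: "Amalfi Silk Dress", price: 290, handle: "amalfi-silk-dress" }];
}

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = await streamText({
    model: openai("gpt-4o-mini"),
    system: "You are a fashion stylist for our store. Be brief.",
    messages,
    tools: {
      // The model decides when to call this; the result streams back to the
      // client as a toolInvocation, which the UI renders as a <ProductCarousel />.
      searchProducts: tool({
        description: "Semantic search over the product catalog",
        parameters: z.object({ query: z.string() }),
        execute: async ({ query }) => searchCatalog(query),
      }),
    },
  });

  return result.toDataStreamResponse();
}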

Semantic Caching (Redis/Momento)

If 100 people ask “Is the shirt cotton?”, do not pay OpenAI 100 times. Pay them once.

  1. User Query -> Vectorize -> [0.1, 0.2, ...].
  2. Check Redis: GET vectors:nearest([0.1, 0.2]).
  3. If the distance < 0.05, return the cached answer.

This reduces LLM costs by 60% in high-traffic deployments and lowers latency from 2s to 50ms.
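
A minimal in-process sketch of that lookup is below. A production deployment would keep the entries in Redis (vector search) or Momento rather than an in-memory array, and generate() stands in for the full RAG call.

// Semantic cache sketch. `cache` stands in for Redis/Momento; `generate`
// stands in for the full RAG pipeline that would otherwise hit the LLM.
type CacheEntry = { vector: number[]; answer: string };
const cache: CacheEntry[] = [];

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function answerWithCache(
  queryVector: number[],
  generate: () => Promise<string>
): Promise<string> {
  // 1. Look for a semantically similar cached question
  const hit = cache.find((e) => cosineDistance(e.vector, queryVector) < 0.05);
  if (hit) return hit.answer; // cache hit: no LLM call, near-zero latency

  // 2. Miss: pay for one LLM call, then cache the result
  const answer = await generate();
  cache.push({ vector: queryVector, answer });
  return answer;
}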

Prompt Caching (Anthropic)

Prompt Caching is one of the newest API-level optimizations. If you send a 50-page system prompt (“You are a sales agent… here is our catalog…”), you normally pay for those tokens on every call. With prompt caching, the provider caches that prefix after the first request, and subsequent calls reuse it instead of re-processing it. This cuts the cost of the cached input tokens by roughly 90% and sharply improves time-to-first-token, because the prefill work is already done.
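
A sketch with the Anthropic TypeScript SDK: the large system prompt is marked with cache_control so its prefix is cached after the first call. Model choice and prompt text are illustrative.

// Prompt caching sketch (Anthropic SDK). The big catalog context is marked
// cacheable; later calls re-read it at a fraction of the input price.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function askSalesAgent(question: string, catalogContext: string) {
  return anthropic.messages.create({
    model: "claude-3-5-haiku-latest",
    max_tokens: 512,
    system: [
      {
        type: "text",
        text: `You are a sales agent for our store.\n\n${catalogContext}`,
        cache_control: { type: "ephemeral" }, // cache this large prefix
      },
    ],
    messages: [{ role: "user", content: question }],
  });
}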

Quantization (GGUF / AWQ)

Models are usually FP16 (16-bit floating point). They are huge (14GB for 7B parameters). Quantization squashes them to INT4 (4-bit integers). The size drops to 4GB. Accuracy loss is negligible (< 1%). Speed increases 3x. We run 4-bit Quantized models on consumer hardware (MacBook Pros) for local development and edge inference.

Speculative Decoding

LLMs generate one token at a time. This is serial and slow. Speculative Decoding uses a small “Draft Model” (fast) to guess the next 5 words. The Big Model (slow) just verifies them in parallel. If the draft is correct (it usually is for simple grammar), you get 5 tokens for the cost of 1 forward pass. This doubles generation speed without changing the model weights.
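
The loop below is a deliberately simplified, hypothetical sketch of greedy speculative decoding. Real implementations live inside inference engines (vLLM, llama.cpp, TensorRT-LLM); the Model interface here is invented purely to show the accept/reject logic.

// Hypothetical sketch of greedy speculative decoding. `greedyNextTokens`
// returns, for each position, the greedy next token — so one call is one
// forward pass over the whole sequence.
interface Model {
  greedyNextTokens(tokens: number[]): Promise<number[]>;
}

async function speculativeStep(
  prompt: number[],
  draftModel: Model,
  targetModel: Model,
  k = 5
): Promise<number[]> {
  // 1. Draft model guesses k tokens serially (cheap and fast)
  const draft: number[] = [];
  const context = [...prompt];
  for (let i = 0; i < k; i++) {
    const step = await draftModel.greedyNextTokens(context);
    const next = step[step.length - 1];
    draft.push(next);
    context.push(next);
  }

  // 2. Target model verifies all k guesses in ONE forward pass
  const verified = await targetModel.greedyNextTokens([...prompt, ...draft]);

  // 3. Accept drafts until the first disagreement, then take the target's token
  const accepted: number[] = [];
  for (let i = 0; i < k; i++) {
    const targetToken = verified[prompt.length + i - 1];
    if (draft[i] === targetToken) {
      accepted.push(draft[i]);
    } else {
      accepted.push(targetToken);
      break;
    }
  }
  // If every draft was accepted, the target's own next token comes for free
  if (accepted.length === k) {
    accepted.push(verified[prompt.length + k - 1]);
  }
  return accepted; // best case: k + 1 tokens for one big-model pass
}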

Conclusion

AI is not magic. It is engineering. It requires data pipelines, vector databases, and edge caching. If you simply wrap ChatGPT, you will burn cash. If you build a RAG pipeline, you build a competitive moat.


Hire our Architects.