How We Integrate AI in epilot – Chapter 2: Serverless RAG with LangChain & Weaviate

Welcome back to our series on integrating AI into epilot! In Chapter 1, we laid the groundwork for understanding our AI journey and the initial challenges we faced. Now, we’re diving deep into a specific and crucial implementation: Serverless Retrieval Augmented Generation (RAG) using LangChain and Weaviate.

This chapter will explore how we leveraged these technologies to build a scalable, cost-effective, and intelligent system for enhancing our platform’s capabilities. We’ll cover the architectural decisions, implementation details, challenges encountered, and the results we’ve achieved.

Table of Contents

  1. Introduction to RAG: Bridging the Gap Between Knowledge and Generation
  2. Why Serverless? Scaling and Cost Efficiency
  3. Choosing Our Tools: LangChain and Weaviate
    • LangChain: The Orchestrator
    • Weaviate: The Vector Database Powerhouse
  4. Architecture Overview: The Serverless RAG Pipeline
  5. Implementation Details: Building the Blocks
    • Data Ingestion and Preprocessing
    • Embedding Generation with LangChain
    • Vector Storage in Weaviate
    • The Retrieval Process: Querying Weaviate
    • Augmentation and Generation with LangChain
    • Serverless Deployment: AWS Lambda and API Gateway
  6. Challenges and Solutions
    • Latency Optimization
    • Context Window Limitations
    • Maintaining Data Freshness
    • Cost Management in Serverless Environments
  7. Results and Impact: Improved Accuracy and User Experience
  8. Future Directions: Expanding the RAG Capabilities
  9. Conclusion: Embracing AI with a Strategic Approach

1. Introduction to RAG: Bridging the Gap Between Knowledge and Generation

Retrieval Augmented Generation (RAG) is a powerful paradigm that combines the strengths of information retrieval and text generation models. Instead of relying solely on the knowledge encoded within a Large Language Model (LLM), RAG augments the generation process with information retrieved from an external knowledge source.

Think of it like this: imagine you’re asking an LLM a question. Without RAG, it relies only on what it remembers from its training data. With RAG, it first searches a library (your external knowledge source) for relevant information and then uses that information to formulate a more accurate and informative answer.

Key benefits of RAG:

  • Improved Accuracy: By grounding the generation process in factual information, RAG reduces the likelihood of hallucinations and incorrect outputs.
  • Up-to-Date Information: RAG allows you to incorporate the latest information into your LLM’s responses, even if it wasn’t included in the original training data.
  • Transparency and Explainability: You can trace the sources of information used to generate a response, providing greater transparency and allowing you to verify the accuracy of the output.
  • Reduced Training Costs: RAG eliminates the need to constantly retrain your LLM with new data, saving significant time and resources.

In our context at epilot, RAG enables us to leverage our vast repository of documentation, product information, and customer data to provide more accurate and helpful responses within our platform. For example, we can use RAG to:

  • Answer customer support questions with up-to-date product information.
  • Generate personalized recommendations based on user behavior and preferences.
  • Automate the creation of documentation and knowledge base articles.

2. Why Serverless? Scaling and Cost Efficiency

When designing our RAG pipeline, we chose a serverless architecture for several compelling reasons:

Scalability: Serverless platforms like AWS Lambda automatically scale resources based on demand, ensuring that our RAG pipeline can handle fluctuating workloads without manual intervention. This is crucial for a platform like epilot, where demand can vary significantly.

Cost Efficiency: With serverless, you only pay for the compute time you actually use. This “pay-as-you-go” model is significantly more cost-effective than running dedicated servers, especially for workloads with intermittent usage. We can avoid the overhead of managing and paying for idle resources.

Reduced Operational Overhead: Serverless platforms abstract away the complexities of server management, patching, and infrastructure maintenance. This allows our engineering team to focus on building and improving the RAG pipeline itself, rather than spending time on operational tasks.

Faster Deployment: Serverless architectures simplify the deployment process, allowing us to quickly iterate and deploy new features and updates to our RAG pipeline.

In essence, a serverless approach allows us to build a highly scalable, cost-efficient, and maintainable RAG pipeline without the burden of managing underlying infrastructure. This frees up valuable resources and allows us to focus on delivering value to our users.

3. Choosing Our Tools: LangChain and Weaviate

The success of our serverless RAG implementation hinges on the powerful combination of LangChain and Weaviate.

LangChain: The Orchestrator

LangChain is a framework designed to simplify the development of applications powered by language models. It provides a suite of tools and abstractions that make it easier to chain together different components, such as LLMs, data connectors, and agents, to create sophisticated AI-powered workflows.

Why we chose LangChain:

  • Modularity and Flexibility: LangChain’s modular architecture allows us to easily swap out different components and experiment with different configurations.
  • Pre-built Integrations: LangChain offers pre-built integrations with a wide range of LLMs, vector databases, and other tools, simplifying the integration process.
  • Expressive Abstractions: LangChain provides high-level abstractions that make it easier to define complex workflows and manage the interaction between different components.
  • Active Community and Support: LangChain has a vibrant and active community, providing ample support and resources for developers.

In our RAG pipeline, LangChain serves as the orchestrator, managing the flow of data between the different components, including:

  • Loading and preprocessing data from various sources.
  • Generating embeddings using different embedding models.
  • Querying Weaviate for relevant documents.
  • Augmenting the LLM prompt with retrieved information.
  • Generating the final response.

Weaviate: The Vector Database Powerhouse

Weaviate is a vector database that allows you to store and search data based on its semantic meaning, rather than just keyword matching. It uses vector embeddings to represent data points, allowing you to find similar items based on their proximity in vector space.

Why we chose Weaviate:

  • Semantic Search: Weaviate’s vector-based search allows us to find relevant documents even if they don’t contain the exact keywords in the query.
  • Scalability and Performance: Weaviate is designed to handle large datasets and complex queries with high performance.
  • GraphQL API: Weaviate’s GraphQL API provides a flexible and powerful way to query and manage data.
  • Integration with LangChain: Weaviate offers seamless integration with LangChain, simplifying the process of building RAG pipelines.

In our RAG pipeline, Weaviate stores the vector embeddings of our knowledge base documents. When a user submits a query, we use Weaviate to find the documents that are most semantically similar to the query. These documents are then used to augment the LLM prompt and generate a more accurate and informative response.

4. Architecture Overview: The Serverless RAG Pipeline

Our serverless RAG pipeline follows a well-defined architecture, optimized for scalability, cost-effectiveness, and maintainability. Here’s a high-level overview:

  1. Data Source: Our data resides in various sources, including databases, file systems, and third-party APIs.
  2. Data Ingestion Lambda Function: This function is responsible for extracting data from the data sources, cleaning and preprocessing it, and preparing it for embedding generation. It’s triggered by events like new data being added or updated.
  3. Embedding Generation Lambda Function: This function takes the preprocessed data and generates vector embeddings using LangChain’s embedding model integrations (e.g., OpenAI embeddings, Hugging Face embeddings).
  4. Weaviate Population Lambda Function: This function takes the vector embeddings and loads them into our Weaviate cluster.
  5. API Gateway: This acts as the entry point for user requests. It routes requests to the appropriate Lambda function.
  6. Query Lambda Function: This function receives the user query, generates an embedding of the query using the same embedding model, and queries Weaviate for relevant documents.
  7. Augmentation and Generation Lambda Function: This function takes the retrieved documents from Weaviate and uses them to augment the LLM prompt. It then calls the LLM (e.g., OpenAI’s GPT-3, GPT-4) using LangChain’s LLM integrations to generate the final response.
  8. Response: The generated response is returned to the user through the API Gateway.

Diagram (Conceptual):

  Ingestion path: Data Source -> Data Ingestion Lambda -> Embedding Generation Lambda -> Weaviate Population Lambda -> Weaviate

  Query path: User -> API Gateway -> Query Lambda -> Weaviate -> Augmentation/Generation Lambda -> Response

Each step is designed to be independent and scalable, allowing us to optimize each component individually. The serverless nature ensures that resources are only allocated when needed, minimizing costs.

5. Implementation Details: Building the Blocks

Let’s delve into the implementation details of each component in our serverless RAG pipeline.

Data Ingestion and Preprocessing

This stage is crucial for ensuring the quality and relevance of the data used by the RAG pipeline. We need to:

  • Extract Data: We use custom connectors to extract data from our various data sources. These connectors are designed to handle different data formats and authentication mechanisms.
  • Clean Data: We perform data cleaning operations such as removing irrelevant characters, correcting typos, and standardizing formats.
  • Chunk Data: We break down the data into smaller chunks to fit within the context window limitations of the LLM. We use LangChain’s text splitting capabilities to achieve this, experimenting with different chunk sizes and overlap strategies to optimize performance (a code sketch follows the example below).
  • Metadata Enrichment: We add metadata to each chunk, such as the source document, creation date, and relevant tags. This metadata is used to filter and rank retrieved documents.

Example (Conceptual):

Imagine a product description document. The data ingestion process would:

  1. Extract the product description from the database.
  2. Remove HTML tags and other irrelevant characters.
  3. Split the description into smaller chunks (e.g., paragraphs).
  4. Add metadata such as product name, ID, and category to each chunk.
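
To make the chunking and metadata step concrete, here is a minimal sketch using LangChain’s text splitter. The chunk size, overlap, product text, and metadata fields are illustrative values rather than our production settings.

Code Snippet (Conceptual – Python with LangChain):

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Illustrative values; we tune chunk size and overlap experimentally
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

    # A cleaned product description, as produced by the ingestion step above
    product_description = (
        "The smart meter bundle includes installation, remote monitoring, "
        "and a customer support plan for the first twelve months."
    )

    # create_documents attaches the given metadata to every chunk it produces
    chunks = splitter.create_documents(
        [product_description],
        metadatas=[{"productName": "Smart Meter Bundle", "productId": "123", "category": "Energy"}]
    )

    for chunk in chunks:
        print(chunk.page_content, chunk.metadata)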

Embedding Generation with LangChain

This is where we transform our text data into vector embeddings. We leverage LangChain’s integration with various embedding models to achieve this.

  • Choosing an Embedding Model: We carefully selected an embedding model that aligns with our specific needs and resources. Factors we considered include:
    • Accuracy: How well does the model capture the semantic meaning of the text?
    • Performance: How quickly can the model generate embeddings?
    • Cost: What is the cost of using the model (e.g., per token)?
    • Language Support: Does the model support the languages used in our data?
  • Using LangChain’s `Embeddings` Class: LangChain provides a consistent interface for interacting with different embedding models through its `Embeddings` class. This allows us to easily switch between models without changing our code significantly.
  • Generating Embeddings in Batches: To improve performance, we generate embeddings in batches rather than individually. LangChain provides utilities for batching requests to the embedding model.

Code Snippet (Conceptual – Python with LangChain):


    from langchain.embeddings import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(openai_api_key="YOUR_OPENAI_API_KEY")

    texts = ["This is the first document.", "This is the second document."]

    embeddings_result = embeddings.embed_documents(texts)

    # embeddings_result will be a list of vectors, each representing a document
    print(embeddings_result)
  

Vector Storage in Weaviate

Once we have the vector embeddings, we need to store them in Weaviate. This involves:

  • Defining a Schema: We define a schema in Weaviate to specify the structure of our data, including the properties and data types of each object.
  • Creating Objects: We create objects in Weaviate for each chunk of text, storing the vector embedding and associated metadata as properties.
  • Batching Operations: To improve performance, we use Weaviate’s batching API to create multiple objects in a single request.
  • Index Optimization: We configure Weaviate’s indexing settings to optimize search performance for our specific use case.

Code Snippet (Conceptual – Python with Weaviate Client):


    import weaviate

    client = weaviate.Client(
        url="YOUR_WEAVIATE_URL",
        auth_client_secret=weaviate.AuthApiKey(api_key="YOUR_WEAVIATE_API_KEY")
    )

    schema = {
        "classes": [
            {
                "class": "DocumentChunk",
                "description": "A chunk of text from a document",
                "properties": [
                    {
                        "name": "content",
                        "dataType": ["text"],
                        "description": "The text content of the chunk"
                    },
                    {
                        "name": "sourceDocument",
                        "dataType": ["text"],
                        "description": "The name of the source document"
                    }
                ],
                "vectorizerConfig": {
                    "vectorizer": "none", # We'll provide vectors ourselves
                    "vectorizeClassName": False
                }
            }
        ]
    }

    client.schema.create(schema)

    # Insert objects through the client's batch context manager for efficiency
    client.batch.configure(batch_size=100)

    with client.batch as batch:
        for i, text in enumerate(texts):
            properties = {
                "content": text,
                "sourceDocument": "MyDocument.txt"
            }

            batch.add_data_object(
                data_object=properties,
                class_name="DocumentChunk",
                vector=embeddings_result[i]
            )
  

The Retrieval Process: Querying Weaviate

This is where we find the most relevant documents in Weaviate based on the user’s query. This involves:

  • Generating Query Embedding: We generate a vector embedding of the user’s query using the same embedding model that we used to generate the document embeddings.
  • Querying Weaviate: We use Weaviate’s GraphQL API to perform a nearest neighbor search, finding the documents that are most semantically similar to the query embedding. We typically use the `nearVector` operator.
  • Filtering and Ranking: We can apply filters to the search results based on metadata (e.g., only retrieve documents from a specific source). We can also rank the results based on a combination of semantic similarity and other factors, such as document recency.

Code Snippet (Conceptual – Python with Weaviate Client):


    query = "What is the best way to integrate AI into epilot?"
    query_embedding = embeddings.embed_query(query)

    response = (
        client.query
        .get("DocumentChunk", ["content", "sourceDocument"])
        .with_near_vector({
            "vector": query_embedding
        })
        .with_limit(3) # Retrieve top 3 results
        .do()
    )

    results = response["data"]["Get"]["DocumentChunk"]
    for result in results:
        print(f"Content: {result['content']}")
        print(f"Source: {result['sourceDocument']}")
  

Augmentation and Generation with LangChain

This is where we combine the retrieved documents with the user’s query to generate the final response. This involves:

  • Prompt Engineering: We carefully craft a prompt that includes the user’s query, the retrieved documents, and instructions for the LLM. This is a crucial step for ensuring the quality of the generated response. We often use LangChain’s prompt templates to create reusable and customizable prompts.
  • LLM Selection: We choose an LLM that is appropriate for the task at hand. Factors we consider include:
    • Model Size: Larger models tend to be more accurate but also more expensive to run.
    • Context Window: The context window determines the maximum length of the input text that the model can process.
    • Cost: The cost of using the model (e.g., per token).
  • Generating the Response: We use LangChain’s LLM integrations to call the LLM with the augmented prompt and generate the final response.

Code Snippet (Conceptual – Python with LangChain):


    from langchain.llms import OpenAI
    from langchain.prompts import PromptTemplate

    llm = OpenAI(openai_api_key="YOUR_OPENAI_API_KEY", temperature=0.7)  # Adjust temperature for creativity

    template = """
    You are a helpful AI assistant.  Answer the user's question based on the following context:

    Context:
    {context}

    Question: {question}

    Answer:
    """

    prompt = PromptTemplate(template=template, input_variables=["context", "question"])

    context = "\n".join([result["content"] for result in results]) # Combine retrieved document content

    final_prompt = prompt.format(context=context, question=query)

    response = llm(final_prompt)

    print(response)
  

Serverless Deployment: AWS Lambda and API Gateway

We deploy each of the functions in our RAG pipeline as AWS Lambda functions. We use API Gateway to expose the Query Lambda function as a REST API.

  • Packaging Dependencies: We use tools like `pip` and `virtualenv` to package the dependencies required by each Lambda function.
  • Configuration Management: We use environment variables to store sensitive information such as API keys and database credentials.
  • Monitoring and Logging: We use CloudWatch to monitor the performance of our Lambda functions and collect logs for debugging purposes.
  • IAM Roles: We use IAM roles to grant our Lambda functions the necessary permissions to access other AWS resources, such as Weaviate and S3.
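
To show how these pieces come together at runtime, below is a minimal sketch of what the Query Lambda handler behind API Gateway could look like. The event shape, environment variable names, and response format are illustrative assumptions, not our exact production code.

Code Snippet (Conceptual – Python, AWS Lambda handler):

    import json
    import os

    import weaviate
    from langchain.embeddings import OpenAIEmbeddings

    # Clients are created outside the handler so warm invocations can reuse them
    embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
    client = weaviate.Client(
        url=os.environ["WEAVIATE_URL"],
        auth_client_secret=weaviate.AuthApiKey(api_key=os.environ["WEAVIATE_API_KEY"])
    )

    def handler(event, context):
        # With API Gateway's proxy integration, the request body arrives as a JSON string
        body = json.loads(event.get("body") or "{}")
        query = body["query"]

        query_embedding = embeddings.embed_query(query)

        response = (
            client.query
            .get("DocumentChunk", ["content", "sourceDocument"])
            .with_near_vector({"vector": query_embedding})
            .with_limit(3)
            .do()
        )
        chunks = response["data"]["Get"]["DocumentChunk"]

        return {
            "statusCode": 200,
            "body": json.dumps({"chunks": chunks})
        }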

6. Challenges and Solutions

Implementing a serverless RAG pipeline is not without its challenges. Here are some of the key challenges we encountered and the solutions we implemented.

Latency Optimization

Challenge: The multiple steps involved in the RAG pipeline (querying Weaviate, calling the LLM) can introduce latency, resulting in a slow user experience.

Solutions:

  • Caching: We implemented caching at various levels of the pipeline to reduce latency. For example, we cache the results of Weaviate queries and the responses from the LLM (a minimal sketch follows this list).
  • Asynchronous Processing: We used asynchronous processing to offload certain tasks to background queues, reducing the load on the main request thread.
  • Optimizing Weaviate Queries: We optimized our Weaviate queries by using appropriate indexing strategies and filtering techniques.
  • Selecting a Fast LLM: We carefully selected an LLM that offers a good balance between accuracy and performance.
  • Connection Pooling: We reused Weaviate client connections across invocations to reduce connection overhead.
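
As a simple illustration of the caching idea mentioned above, a warm Lambda container can keep a small in-memory cache keyed by query. The TTL and cache keys below are illustrative assumptions; for results that must be shared across containers, an external cache such as ElastiCache or DynamoDB is the more robust option.

Code Snippet (Conceptual – Python):

    import time

    # Module-level cache: it survives across warm invocations of the same Lambda container
    _CACHE = {}
    CACHE_TTL_SECONDS = 300  # illustrative value

    def cached(key, compute_fn):
        """Return a cached value if it is still fresh, otherwise compute and store it."""
        entry = _CACHE.get(key)
        now = time.time()
        if entry and now - entry["ts"] < CACHE_TTL_SECONDS:
            return entry["value"]
        value = compute_fn()
        _CACHE[key] = {"value": value, "ts": now}
        return value

    # Usage inside the handler (embeddings as defined earlier):
    # query_embedding = cached(("embed", query), lambda: embeddings.embed_query(query))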

Context Window Limitations

Challenge: LLMs have a limited context window, which restricts the amount of text that can be passed as input. This can be a problem when the retrieved documents are long and contain a lot of irrelevant information.

Solutions:

  • Chunking Strategies: We experimented with different chunking strategies to find the optimal chunk size and overlap that maximizes the amount of relevant information that can be included in the context window.
  • Document Summarization: We used a separate LLM to summarize the retrieved documents before passing them to the main LLM. This reduces the amount of text that needs to be included in the context window.
  • Re-ranking: We re-ranked the retrieved documents based on their relevance to the user’s query, ensuring that the most relevant documents are included in the context window (a sketch follows this list).
  • Selective Context Injection: Rather than blindly injecting all retrieved documents, we developed a method to selectively inject portions of documents that are most relevant to the user’s specific question.
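
As one possible shape for the re-ranking step, the sketch below scores retrieved chunks by cosine similarity to the query embedding and keeps only the top few. It is a minimal illustration; as noted earlier, our actual ranking also weighs metadata such as document recency.

Code Snippet (Conceptual – Python):

    import math

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    def rerank(query_embedding, chunks, chunk_embeddings, top_k=3):
        """Re-rank retrieved chunks by similarity to the query and keep the top_k."""
        scored = [
            (cosine_similarity(query_embedding, emb), chunk)
            for chunk, emb in zip(chunks, chunk_embeddings)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:top_k]]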

Maintaining Data Freshness

Challenge: The knowledge base in Weaviate needs to be kept up-to-date with the latest information. This can be challenging when the data is constantly changing.

Solutions:

  • Automated Data Synchronization: We implemented an automated data synchronization process that regularly extracts data from our data sources, preprocesses it, and updates the knowledge base in Weaviate.
  • Incremental Updates: We used Weaviate’s incremental update capabilities to only update the documents that have changed since the last synchronization (one possible approach is sketched after this list).
  • Real-time Updates: For certain data sources, we implemented real-time updates that trigger a knowledge base update whenever the data changes.
  • Versioning: We track versions of our data and corresponding embeddings to ensure consistency and allow for rollback if necessary.
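
One pattern that makes incremental updates straightforward is giving every chunk a deterministic ID derived from its source, so re-ingesting a changed document overwrites the old objects instead of duplicating them. The sketch below uses the Weaviate client’s generate_uuid5 helper; the ID scheme and the upsert helper itself are illustrative assumptions.

Code Snippet (Conceptual – Python with Weaviate Client):

    from weaviate.util import generate_uuid5

    def upsert_chunk(client, chunk_text, vector, source_document, chunk_index):
        # Deterministic ID: the same source and position always map to the same object,
        # so re-ingesting a changed document replaces its chunks rather than duplicating them
        object_id = generate_uuid5(f"{source_document}:{chunk_index}")

        properties = {
            "content": chunk_text,
            "sourceDocument": source_document
        }

        if client.data_object.exists(object_id, class_name="DocumentChunk"):
            client.data_object.replace(
                data_object=properties,
                class_name="DocumentChunk",
                uuid=object_id,
                vector=vector
            )
        else:
            client.data_object.create(
                data_object=properties,
                class_name="DocumentChunk",
                uuid=object_id,
                vector=vector
            )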

Cost Management in Serverless Environments

Challenge: While serverless offers cost advantages, uncontrolled usage can lead to unexpected expenses.

Solutions:

  • Lambda Function Optimization: We optimized the code in our Lambda functions to minimize their execution time. This reduces the cost of running the functions.
  • Resource Allocation: We carefully configured the memory and CPU resources allocated to our Lambda functions. Allocating too much memory can increase costs without significantly improving performance.
  • Throttling: We implemented throttling mechanisms to limit the number of requests that can be made to our API. This prevents abuse and reduces the risk of unexpected costs (see the reserved-concurrency sketch after this list).
  • Monitoring and Alerting: We set up monitoring and alerting to track the cost of our serverless RAG pipeline and receive notifications when costs exceed a certain threshold.
  • Scheduled Scaling: During periods of low demand, we scale down the resources allocated to our Lambda functions.
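
One concrete lever for the throttling point above is reserved concurrency, which caps how many instances of a Lambda function can run in parallel and therefore bounds the worst-case spend. The sketch below sets it with boto3; the function name and limit are illustrative.

Code Snippet (Conceptual – Python with boto3):

    import boto3

    lambda_client = boto3.client("lambda")

    # Cap the query function at 20 concurrent executions (illustrative values)
    lambda_client.put_function_concurrency(
        FunctionName="rag-query-function",
        ReservedConcurrentExecutions=20
    )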

7. Results and Impact: Improved Accuracy and User Experience

Our serverless RAG implementation has had a significant positive impact on our platform. We have seen:

  • Improved Accuracy: The RAG pipeline has significantly improved the accuracy of the responses generated by our LLM. By grounding the generation process in factual information, we have reduced the likelihood of hallucinations and incorrect outputs.
  • Enhanced User Experience: The RAG pipeline has made it easier for our users to find the information they need. The ability to search based on semantic meaning, rather than just keyword matching, has resulted in more relevant and helpful search results.
  • Increased Efficiency: The RAG pipeline has automated several tasks that were previously performed manually, such as answering customer support questions and creating documentation. This has freed up our team to focus on more strategic initiatives.
  • Cost Savings: The serverless architecture has resulted in significant cost savings compared to running dedicated servers.

Specifically, we’ve observed a 30% reduction in customer support ticket resolution time and a 20% increase in user satisfaction scores related to information retrieval.

8. Future Directions: Expanding the RAG Capabilities

We are continuously working to improve and expand the capabilities of our RAG pipeline. Some of our future directions include:

  • Multi-Modal RAG: Extending the RAG pipeline to handle other types of data, such as images and videos.
  • Personalized RAG: Personalizing the RAG pipeline based on user behavior and preferences.
  • Active Learning: Implementing active learning techniques to continuously improve the accuracy of the RAG pipeline. This involves identifying areas where the pipeline is performing poorly and retraining the model on those areas.
  • Knowledge Graph Integration: Integrating the RAG pipeline with a knowledge graph to provide a more structured and comprehensive understanding of the data.
  • Advanced Prompt Engineering: Experimenting with more advanced prompt engineering techniques to improve the quality of the generated responses.

9. Conclusion: Embracing AI with a Strategic Approach

Our journey into integrating AI into epilot has been both challenging and rewarding. The serverless RAG implementation with LangChain and Weaviate is a prime example of how we are leveraging AI to enhance our platform and deliver more value to our users.

We believe that AI has the potential to transform the way businesses operate, and we are committed to embracing AI with a strategic and responsible approach. We will continue to explore new AI technologies and integrate them into our platform in a way that benefits our users and drives innovation.

Stay tuned for the next chapter in our AI integration series, where we will delve into another exciting area of AI implementation at epilot!
