Introduction to LangChain: Retrieval Augmented Generation

Today, we dive into Retrieval Augmented Generation (RAG), a way to augment LLMs with additional data retrieved from a database. The data is first encoded into vectors, which are stored in a vector database for fast retrieval. We are going to cover the following points:

  • Indexing the data in a local index and augmenting an LLM with it

  • Indexing the data in a Pinecone database and augmenting an LLM with it

  • Providing the data source when answering questions with LLMs

  • Indexing a website and using its data to augment an LLM

  • Indexing a GitHub repo to ask questions about the code base


Below is the code used in the video!

Indexing data

Let’s load and split the PDF file of The Elements of Statistical Learning:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

file_path = '...'

loader = PyPDFLoader(file_path=file_path)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)

data = loader.load_and_split(text_splitter=text_splitter)
data
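
To get a feel for what chunk_size and chunk_overlap control, here is a minimal sketch of a fixed-size character chunker. It is not how RecursiveCharacterTextSplitter works internally (that splitter tries to cut on separators such as paragraph and line breaks first), and the function name chunk_text is hypothetical:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
no_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=5)
print(len(no_overlap), len(with_overlap))
```

With no overlap, each character lands in exactly one chunk. A positive overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a boundary.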

Let’s install the FAISS package:

%pip install faiss-cpu

And let’s embed text with the OpenAI embeddings

from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(show_progress_bar=True)
vector1 = embeddings.embed_query('How are you?')
len(vector1)

> 1536

Let’s embed the book data

from langchain.vectorstores import FAISS

index = FAISS.from_documents(data, embeddings)

We can search in that index:

index.similarity_search_with_relevance_scores(
    "What is machine learning?"
)
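
Conceptually, a similarity search embeds the query and returns the stored vectors that are closest to it. Here is a brute-force sketch of that idea with cosine similarity on toy 3-dimensional vectors. FAISS relies on optimized index structures and its own distance settings, so treat this as an illustration only; the names cosine_similarity and top_k and the toy vectors are made up:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, avoids dividing by zero
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; the OpenAI vectors above have 1536 dimensions.
toy_store = [
    ("supervised learning", [0.9, 0.1, 0.0]),
    ("cooking recipes", [0.0, 0.2, 0.9]),
    ("neural networks", [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print([text for text, _ in results])
```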

We can use that index in a chain

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.callbacks import StdOutCallbackHandler

# Use maximal marginal relevance (MMR): fetch 20 candidates,
# then keep the 10 that are both relevant and diverse
retriever = index.as_retriever(
    search_type='mmr',
    search_kwargs={'k': 10, 'fetch_k': 20}
)

llm = ChatOpenAI()

chain = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=retriever,
    verbose=True
)

handler = StdOutCallbackHandler()

chain.run(
    'What is machine learning?',
    callbacks=[handler]
) 

Machine learning is a field of study that involves the development of algorithms and models that can learn from data and make predictions or decisions without being explicitly programmed. It focuses on creating computer systems that can automatically learn and improve from experience, rather than being explicitly programmed for specific tasks. Machine learning algorithms analyze large amounts of data to identify patterns, make predictions, or learn from examples and feedback. It is widely used in various fields such as science, finance, and industry for tasks like predicting stock prices, medical diagnoses, and customer behavior analysis.
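
The retriever settings above refer to maximal marginal relevance, with fetch_k candidates fetched and k documents kept. The idea is to avoid passing near-duplicate chunks to the LLM: first take the fetch_k chunks most similar to the query, then greedily pick k of them, each time trading relevance against redundancy with what was already picked. Below is a minimal pure-Python sketch of that procedure. It is not LangChain's implementation, and the function names, the lambda_mult trade-off parameter as used here, and the toy vectors are assumptions for illustration:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, k=2, fetch_k=4, lambda_mult=0.5):
    """Greedy maximal marginal relevance over (text, vector) candidates."""
    # Step 1: keep the fetch_k candidates most similar to the query.
    pool = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:fetch_k]
    selected = []
    # Step 2: repeatedly pick the candidate with the best trade-off between
    # relevance to the query and redundancy with the already selected items.
    while pool and len(selected) < k:
        def score(c):
            relevance = cosine(query_vec, c[1])
            redundancy = max((cosine(c[1], s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [text for text, _ in selected]


toy_docs = [
    ("ML definition A", [0.98, 0.17]),
    ("ML definition A (near duplicate)", [0.97, 0.21]),
    ("ML applications", [0.77, -0.64]),
    ("unrelated", [0.0, 1.0]),
]
picked = mmr([1.0, 0.0], toy_docs, k=2, fetch_k=3)
print(picked)
```

In this toy example, a plain top-2 search would return the definition and its near duplicate, while the MMR selection replaces the duplicate with the less similar but more informative "ML applications" chunk.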

Loading data into a Vector Database

We are going to load the data into Pinecone. Let’s install the Python package:

%pip install pinecone-client

And let’s load the data into the database

import pinecone 
from langchain.vectorstores import Pinecone

pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_ENV  # next to api key in console
)

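# The index must already exist in Pinecone, with a dimension
# matching the embedding size (1536 for the embeddings above)
index_name = "langchain-demo"
db = Pinecone.from_documents(
    data, 
    embeddings, 
    index_name=index_name
)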

And we can now augment an LLM with the database

chain = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=db.as_retriever(),
    verbose=True
)

chain.run(
    'What is machine learning?',
    callbacks=[handler]
)

Providing sources

I will show how to provide the sources as we answer questions. Let’s install the NewsAPI Python package:

%pip install newsapi-python

Let’s get the news about “Artificial Intelligence” from the past week:

from datetime import date, timedelta
from newsapi import NewsApiClient

newsapi = NewsApiClient(api_key=NEWS_API_KEY)

today = date.today()
last_week = today - timedelta(days=7)

latest_news = newsapi.get_everything(
    q='artificial intelligence',
    from_param=last_week.strftime("%Y-%m-%d"),
    to=today.strftime("%Y-%m-%d"),
    sort_by='relevancy',
    language='en'
)

And let’s create documents:

from langchain.docstore.document import Document

# NewsAPI can return null titles or descriptions, so fall back to empty strings
docs = [
    Document(
        page_content=(article['title'] or '') + '\n\n' + (article['description'] or ''),
        metadata={
            'source': article['url'],
        }
    ) for article in latest_news['articles']
]

Let’s create a chain that provides the sources with the answers

from langchain.chains import create_qa_with_sources_chain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate

qa_chain = create_qa_with_sources_chain(llm)

doc_prompt = PromptTemplate(
    template="Content: {page_content}\nSource: {source}",
    input_variables=["page_content", "source"],
)

final_qa_chain = StuffDocumentsChain(
    llm_chain=qa_chain,
    document_variable_name="context",
    document_prompt=doc_prompt,
)

index = FAISS.from_documents(docs, embedding=embeddings)


chain = RetrievalQA(
    retriever=index.as_retriever(), 
    combine_documents_chain=final_qa_chain
)
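
The "stuff" strategy used here formats each retrieved document with the document prompt and concatenates the results into a single context string for the LLM. Here is a rough pure-Python sketch of that formatting step. The helper stuff_documents, the separator, and the documents are made up for illustration and are not LangChain's code:

```python
DOC_TEMPLATE = "Content: {page_content}\nSource: {source}"


def stuff_documents(docs, template=DOC_TEMPLATE, separator="\n\n"):
    """Format each document with the template and join them into one context."""
    return separator.join(
        template.format(page_content=d["page_content"], source=d["metadata"]["source"])
        for d in docs
    )


# Made-up documents shaped like the ones built from the news articles.
toy_docs = [
    {"page_content": "AI on stage", "metadata": {"source": "https://example.com/a"}},
    {"page_content": "AI at work", "metadata": {"source": "https://example.com/b"}},
]
context = stuff_documents(toy_docs)
print(context)
```

Because every chunk carries its source line into the prompt, the model can cite the URLs it actually relied on.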

Let’s ask a question:

question = """
What is the most important news about artificial intelligence from last week?
"""

answer = chain.run(question)
answer
{
  "answer": "The most important news about artificial intelligence from last week is the use of AI to train on the works of authors Stephen King and Margaret Atwood. These authors responded to the revelation that their work is being used to train AI. Additionally, AI took the stage at the Edinburgh Fringe festival, raising the question of whether AI can deliver a satisfying punchline. Furthermore, a tech expert from the University of Oxford highlighted the potential workplace threats of AI, including the possibility of AI becoming a monitoring boss. Finally, AI is being seen as a tool that can help companies connect with customers in a more personalized and efficient way.",
  "sources": [
    "https://www.theatlantic.com/newsletters/archive/2023/09/books-briefing-ai-stephen-king-margaret-atwood/675213/?utm_source=feed",
    "https://www.cnet.com/tech/ai-took-the-stage-at-the-worlds-largest-arts-festival-heres-what-happened/",
    "https://www.foxnews.com/tech/tech-expert-existential-fears-ai-are-overblown-sees-very-disturbing-workplace-threats",
    "https://www.techradar.com/pro/ai-could-help-companies-connect-with-customers-like-never-before"
  ]
}
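
If the chain returns its result as a JSON string shaped like the output above (an assumption worth checking in your own environment), it can be split into the answer and its sources with the standard library. The helper parse_answer and the sample string are illustrative:

```python
import json


def parse_answer(raw: str) -> tuple[str, list[str]]:
    """Split a JSON answer string into its text and its list of sources."""
    payload = json.loads(raw)
    return payload["answer"], list(payload.get("sources", []))


# Shortened, made-up stand-in for the chain output.
raw_answer = '{"answer": "AI took the stage.", "sources": ["https://example.com/a"]}'
text, sources = parse_answer(raw_answer)
for url in sources:
    print("-", url)
```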

Indexing a website

We are going to use Apify to crawl a website. Let’s install the Python packages:

%pip install apify-client chromadb

and let’s create a loader that will crawl the AiEdge Newsletter website:

from langchain.utilities import ApifyWrapper
from langchain.document_loaders.base import Document

apify = ApifyWrapper()

loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://newsletter.theaiedge.io/"}],
        "aggressivePrune": True,
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)

Let’s index the website data:

from langchain.indexes import VectorstoreIndexCreator

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)

index = VectorstoreIndexCreator(
    text_splitter=text_splitter
).from_loaders([loader])

Let’s run a search on that index:

query = "What is the main subject of the aiedge newsletter?"

index.query_with_sources(query)
{
    'question': 'What is the main subject of the aiedge newsletter?',
    'answer': ' The main subject of the AiEdge newsletter is Machine Learning applications, Machine Learning System Design, MLOps, and the latest techniques and news about the field.\n',
    'sources': ''
}

Let’s now pass that index to a chain

retriever = index.vectorstore.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=retriever,
)

query = "What is the most recent article of the aiedge newsletter?"

qa.run(
    query, 
    callbacks=[handler]
)

"I'm sorry, but I don't have access to the specific articles or the most recent content of the AiEdge Newsletter. As an AI language model, I don't have real-time access to current articles or newsletters. It would be best to subscribe to the newsletter and check the latest edition for the most recent article."

Indexing a GitHub repo

Let’s install the Python package:

%pip install GitPython

Let’s load a repo

from langchain.document_loaders import GitLoader

loader = GitLoader(
    clone_url="https://github.com/langchain-ai/langchain",
    repo_path="./data/repo/",
    file_filter=lambda file_path: file_path.endswith(".py"),
    branch="master",
)

documents = loader.load()

Let’s resplit the documents for the Python language:

from langchain.text_splitter import Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, 
    chunk_size=1000, 
    chunk_overlap=200
)

documents = python_splitter.split_documents(documents)
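
The point of a language-aware splitter is to cut code at natural boundaries, such as the start of a function or a class, rather than in the middle of a statement. Here is a minimal sketch of that boundary idea using only the standard library. As far as I understand, RecursiveCharacterTextSplitter.from_language works from a list of language-specific separators and also enforces chunk_size and chunk_overlap, which this sketch ignores; the helper split_python_source is hypothetical:

```python
import re


def split_python_source(source: str) -> list[str]:
    """Split source code right before each top-level def or class."""
    # The lookahead matches an empty position at the start of a line
    # that begins with "def " or "class " (indented methods do not match).
    pieces = re.split(r"(?m)^(?=def |class )", source)
    return [piece for piece in pieces if piece.strip()]


sample_source = (
    "import os\n"
    "\n"
    "def load():\n"
    "    return os.getcwd()\n"
    "\n"
    "class Repo:\n"
    "    def name(self):\n"
    "        return 'demo'\n"
)
blocks = split_python_source(sample_source)
print(len(blocks))
```

Each block keeps a whole function or class together, which tends to give the retriever more self-contained chunks to match against a question about the code.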

Let’s index the data and create a chain

index = FAISS.from_documents(documents, embeddings)
retriever = index.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=retriever,
)

query = "What is a stuff chain?"

qa.run(query, callbacks=[handler])

'A stuff chain is a sequence of operations performed on a language model (LLM) to generate or process text. It typically consists of a language model chain (LLMChain) and a document chain (StuffDocumentsChain). The LLMChain is responsible for generating text based on a prompt, while the StuffDocumentsChain is used to process and manipulate documents or summaries. The specific details and functionality of a stuff chain can vary depending on the context and configuration.'

Authors
Damien Benveniste