Today, we dive into Retrieval Augmented Generation (RAG), a way to augment LLMs with additional data retrieved from a database. The data is first encoded into vectors, which are then stored in a vector database for fast retrieval. We are going to cover the following points:
Indexing the data in a local index and augmenting an LLM with it
Indexing the data in a Pinecone database and augmenting an LLM with it
Providing the data source when answering questions with LLMs
Indexing a website and using its data to augment an LLM
Indexing a GitHub repo to ask questions about the code base
Below is the code used in the video!
Indexing data
Let’s load and split the PDF file of The Elements of Statistical Learning:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
file_path = '...'
loader = PyPDFLoader(file_path=file_path)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)
data = loader.load_and_split(text_splitter=text_splitter)
data
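To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-width splitter. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to cut on separators such as paragraph breaks). The function name `split_text` and the dummy text are assumptions made for illustration.

```python
from typing import List


def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> List[str]:
    """Naive splitter: slide a window of chunk_size characters over the text."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 120  # 1200 characters of dummy text
no_overlap = split_text(sample, chunk_size=500, chunk_overlap=0)
with_overlap = split_text(sample, chunk_size=500, chunk_overlap=100)

print([len(chunk) for chunk in no_overlap])    # [500, 500, 200]
print([len(chunk) for chunk in with_overlap])  # [500, 500, 400]
```

With an overlap of 0, as used above, the chunks partition the text exactly. A positive overlap repeats the tail of one chunk at the head of the next, which can help when a sentence straddles a chunk boundary.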
Let’s install the FAISS package:
%pip install faiss-cpu
And let’s embed text with the OpenAI embeddings:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(show_progress_bar=True)
vector1 = embeddings.embed_query('How are you?')
len(vector1)
> 1536
Let’s embed the book data and store the vectors in a FAISS index:
from langchain.vectorstores import FAISS
index = FAISS.from_documents(data, embeddings)
We can search in that index:
index.similarity_search_with_relevance_scores(
    "What is machine learning?"
)
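Conceptually, a similarity search embeds the query and returns the stored vectors closest to it. The sketch below shows that idea with cosine similarity on toy 2-dimensional vectors. It is only an illustration: the real index relies on FAISS's optimized search and may use a different distance measure, and the names `cosine_similarity`, `top_k`, and the toy vectors are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 4) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": real ones have 1536 dimensions, as measured above
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], toy_vectors, k=2)
print([index for index, _ in hits])  # [0, 2]
```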
We can use that index in a chain:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.callbacks import StdOutCallbackHandler
# Use maximal marginal relevance (MMR): fetch 20 candidates, keep 10 diverse ones
retriever = index.as_retriever(
    search_type='mmr',
    search_kwargs={'k': 10, 'fetch_k': 20}
)
llm = ChatOpenAI()
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    verbose=True
)
handler = StdOutCallbackHandler()
chain.run(
    'What is machine learning?',
    callbacks=[handler]
)
Machine learning is a field of study that involves the development of algorithms and models that can learn from data and make predictions or decisions without being explicitly programmed. It focuses on creating computer systems that can automatically learn and improve from experience, rather than being explicitly programmed for specific tasks. Machine learning algorithms analyze large amounts of data to identify patterns, make predictions, or learn from examples and feedback. It is widely used in various fields such as science, finance, and industry for tasks like predicting stock prices, medical diagnoses, and customer behavior analysis.
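The retriever settings above refer to maximal marginal relevance (MMR): first fetch the `fetch_k` candidates most similar to the query, then greedily keep `k` of them while penalizing candidates that are too similar to those already selected. Here is a minimal sketch of that selection on toy vectors. The function names, the `lambda_mult` trade-off parameter, and the vectors are assumptions for illustration, not LangChain's internal implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query: Sequence[float],
        doc_vecs: Sequence[Sequence[float]],
        k: int = 2,
        fetch_k: int = 3,
        lambda_mult: float = 0.5) -> List[int]:
    """Greedy MMR: balance similarity to the query against redundancy."""
    sims = [cosine(query, v) for v in doc_vecs]
    # Step 1: keep only the fetch_k candidates most similar to the query
    candidates = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)[:fetch_k]
    selected: List[int] = []
    # Step 2: repeatedly pick the candidate with the best similarity/redundancy trade-off
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * sims[i] - (1.0 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def unit(degrees: float) -> List[float]:
    """Unit vector at the given angle, a stand-in for an embedding."""
    radians = math.radians(degrees)
    return [math.cos(radians), math.sin(radians)]


docs = [unit(10), unit(12), unit(-40)]  # docs 0 and 1 are near-duplicates
query = unit(0)

print(mmr(query, docs, k=2, fetch_k=3))                   # [0, 2]
print(mmr(query, docs, k=2, fetch_k=3, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the selection reduces to plain similarity ranking, which returns the two near-duplicates. With the penalty active, the second slot goes to the more distinct document.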
Loading data into a Vector Database
We are going to load the data into Pinecone. Let’s install the Python package:
%pip install pinecone-client
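And let’s load the data into the database. The code below assumes that an index named langchain-demo already exists in your Pinecone project, with a dimension matching the embeddings (1536 here):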
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(
    api_key=PINECONE_API_KEY, # find at app.pinecone.io
    environment=PINECONE_ENV # next to api key in console
)
index_name = "langchain-demo"
db = Pinecone.from_documents(
    data,
    embeddings,
    index_name=index_name
)
And we can now augment an LLM with the database:
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db.as_retriever(),
    verbose=True
)
chain.run(
    'What is machine learning?',
    callbacks=[handler]
)
Providing sources
I will show how to provide the sources alongside the answers. Let’s install the NewsAPI Python package:
%pip install newsapi-python
Let’s get the news about “Artificial Intelligence” from the past week:
from datetime import date, timedelta
from newsapi import NewsApiClient
newsapi = NewsApiClient(api_key=NEWS_API_KEY)
today = date.today()
last_week = today - timedelta(days=7)
latest_news = newsapi.get_everything(
    q='artificial intelligence',
    from_param=last_week.strftime("%Y-%m-%d"),
    to=today.strftime("%Y-%m-%d"),
    sort_by='relevancy',
    language='en'
)
And let’s create documents, keeping each article’s URL as its source:
from langchain.docstore.document import Document
docs = [
    Document(
        # the description can be missing (None) for some articles
        page_content=article['title'] + '\n\n' + (article['description'] or ''),
        metadata={
            'source': article['url'],
        }
    ) for article in latest_news['articles']
]
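The mapping from a news article to a document can be sketched without any API call. The snippet below uses plain dictionaries as stand-ins for `Document` objects and a hand-written `sample_response` that only mimics the three fields used above (`title`, `description`, `url`). Both the helper name `article_to_record` and the sample data are assumptions for illustration.

```python
from typing import Any, Dict, List


def article_to_record(article: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one article into a page_content/metadata pair, tolerating missing fields."""
    title = article.get("title") or ""
    description = article.get("description") or ""
    # Join only the parts that are present, so a missing description leaves no blank tail
    text = "\n\n".join(part for part in (title, description) if part)
    return {"page_content": text, "metadata": {"source": article.get("url", "")}}


# Hypothetical payload shaped like the fields used above
sample_response = {
    "articles": [
        {"title": "AI on stage", "description": "A festival experiment.", "url": "https://example.com/a"},
        {"title": "No summary", "description": None, "url": "https://example.com/b"},
    ]
}

records: List[Dict[str, Any]] = [article_to_record(a) for a in sample_response["articles"]]
for record in records:
    print(repr(record["page_content"]), "from", record["metadata"]["source"])
```

Keeping the URL in the metadata under the `source` key is what later allows the chain to report where each piece of the answer came from.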
Let’s create a chain that provides the sources with the answers:
from langchain.chains import create_qa_with_sources_chain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate
qa_chain = create_qa_with_sources_chain(llm)
doc_prompt = PromptTemplate(
    template="Content: {page_content}\nSource: {source}",
    input_variables=["page_content", "source"],
)
final_qa_chain = StuffDocumentsChain(
    llm_chain=qa_chain,
    document_variable_name="context",
    document_prompt=doc_prompt,
)
index = FAISS.from_documents(docs, embedding=embeddings)
chain = RetrievalQA(
    retriever=index.as_retriever(),
    combine_documents_chain=final_qa_chain
)
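The "stuff" strategy used above simply formats every retrieved document with the document prompt and concatenates the results into one context string that is handed to the LLM. A minimal sketch of that step, using plain dictionaries instead of `Document` objects, could look like this. The helper name `stuff_documents`, the blank-line separator, and the sample documents are assumptions for illustration.

```python
from typing import Any, Dict, List

# Same template as the document prompt defined above
DOC_TEMPLATE = "Content: {page_content}\nSource: {source}"


def stuff_documents(docs: List[Dict[str, Any]], separator: str = "\n\n") -> str:
    """Format each document with the template and join them into one context."""
    return separator.join(
        DOC_TEMPLATE.format(
            page_content=doc["page_content"],
            source=doc["metadata"]["source"],
        )
        for doc in docs
    )


retrieved = [
    {"page_content": "First article text", "metadata": {"source": "https://example.com/a"}},
    {"page_content": "Second article text", "metadata": {"source": "https://example.com/b"}},
]

context = stuff_documents(retrieved)
print(context)
```

Because every chunk of context carries its own `Source:` line, the model can cite the URLs of the passages it actually relied on.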
Let’s ask a question:
question = """
What is the most important news about artificial intelligence from last week?
"""
answer = chain.run(question)
answer
{
"answer": "The most important news about artificial intelligence from last week is the use of AI to train on the works of authors Stephen King and Margaret Atwood. These authors responded to the revelation that their work is being used to train AI. Additionally, AI took the stage at the Edinburgh Fringe festival, raising the question of whether AI can deliver a satisfying punchline. Furthermore, a tech expert from the University of Oxford highlighted the potential workplace threats of AI, including the possibility of AI becoming a monitoring boss. Finally, AI is being seen as a tool that can help companies connect with customers in a more personalized and efficient way.",
"sources": [
"https://www.theatlantic.com/newsletters/archive/2023/09/books-briefing-ai-stephen-king-margaret-atwood/675213/?utm_source=feed",
"https://www.cnet.com/tech/ai-took-the-stage-at-the-worlds-largest-arts-festival-heres-what-happened/",
"https://www.foxnews.com/tech/tech-expert-existential-fears-ai-are-overblown-sees-very-disturbing-workplace-threats",
"https://www.techradar.com/pro/ai-could-help-companies-connect-with-customers-like-never-before"
]
}
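If the answer comes back as a JSON string shaped like the output above, it can be parsed to separate the text from the list of sources. This is a small sketch under that assumption; `raw_answer` is a shortened, made-up stand-in for the real output, and `parse_answer` is a hypothetical helper.

```python
import json
from typing import List, Tuple

# Shortened stand-in for the chain output shown above
raw_answer = """
{
  "answer": "Several stories stood out last week.",
  "sources": [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a"
  ]
}
"""


def parse_answer(raw: str) -> Tuple[str, List[str]]:
    """Return the answer text and the de-duplicated sources, in original order."""
    payload = json.loads(raw)
    # dict.fromkeys drops duplicates while preserving insertion order
    sources = list(dict.fromkeys(payload.get("sources", [])))
    return payload["answer"], sources


text, sources = parse_answer(raw_answer)
print(text)
for url in sources:
    print("-", url)
```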
Indexing a website
We are going to use Apify to crawl a website. Let’s install the Python packages:
%pip install apify-client chromadb
And let’s create a loader that will crawl the AiEdge Newsletter website:
from langchain.utilities import ApifyWrapper
from langchain.document_loaders.base import Document
apify = ApifyWrapper()  # expects the APIFY_API_TOKEN environment variable to be set
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://newsletter.theaiedge.io/"}],
        "aggressivePrune": True,
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
Let’s index the website data:
from langchain.indexes import VectorstoreIndexCreator
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)
index = VectorstoreIndexCreator(
    text_splitter=text_splitter
).from_loaders([loader])
Let’s query that index:
query = "What is the main subject of the aiedge newsletter?"
index.query_with_sources(query)
{
    'question': 'What is the main subject of the aiedge newsletter?',
    'answer': ' The main subject of the AiEdge newsletter is Machine Learning applications, Machine Learning System Design, MLOps, and the latest techniques and news about the field.\n',
    'sources': ''
}
Let’s now pass that index to a chain:
retriever = index.vectorstore.as_retriever()
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
)
query = "What is the most recent article of the aiedge newsletter?"
qa.run(
    query,
    callbacks=[handler]
)
"I'm sorry, but I don't have access to the specific articles or the most recent content of the AiEdge Newsletter. As an AI language model, I don't have real-time access to current articles or newsletters. It would be best to subscribe to the newsletter and check the latest edition for the most recent article."
Indexing a GitHub repo
Let’s install the Python package:
%pip install GitPython
Let’s load a repo, keeping only its Python files:
from langchain.document_loaders import GitLoader
loader = GitLoader(
    clone_url="https://github.com/langchain-ai/langchain",
    repo_path="./data/repo/",
    file_filter=lambda file_path: file_path.endswith(".py"),
    branch="master",
)
documents = loader.load()
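The `file_filter` argument is just a predicate on the file path, so it can be made as selective as needed. The sketch below applies two such predicates to a list of made-up paths. The helper `select_files` and the paths are assumptions for illustration and do not touch a real repository.

```python
from typing import Callable, List


def select_files(paths: List[str], file_filter: Callable[[str], bool]) -> List[str]:
    """Keep only the paths accepted by the filter, as a loader would."""
    return [path for path in paths if file_filter(path)]


# Hypothetical repository content
paths = [
    "libs/langchain/chains/base.py",
    "README.md",
    "docs/index.ipynb",
    "tests/test_chain.py",
]

# Same predicate as above: Python files only
python_only = select_files(paths, lambda p: p.endswith(".py"))

# A stricter variant: Python files outside any tests directory
no_tests = select_files(
    paths,
    lambda p: p.endswith(".py") and "/tests/" not in f"/{p}",
)

print(python_only)  # ['libs/langchain/chains/base.py', 'tests/test_chain.py']
print(no_tests)     # ['libs/langchain/chains/base.py']
```

Filtering early keeps the index smaller and avoids spending embedding calls on files that would not help answer questions about the code.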
Let’s re-split the documents with a splitter that is aware of Python syntax:
from langchain.text_splitter import Language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=200
)
documents = python_splitter.split_documents(documents)
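The point of a language-aware splitter is to cut source code at natural boundaries, such as the start of a function or a class, rather than in the middle of a statement. The sketch below illustrates the idea with a single regular expression on a tiny made-up module. It is a simplification, not the actual logic of `RecursiveCharacterTextSplitter.from_language`, and the helper name `split_python_source` is an assumption.

```python
import re
from typing import List


def split_python_source(source: str) -> List[str]:
    """Split before every top-level 'def ' or 'class ' statement."""
    # The lookahead keeps the keyword in the chunk that follows the cut.
    # Indented definitions (methods) do not match, so classes stay whole.
    parts = re.split(r"\n(?=(?:class|def) )", source)
    return [part.strip("\n") for part in parts if part.strip()]


source = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

chunks = split_python_source(source)
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---")
    print(chunk)
```

Here the module is cut into three chunks: the import, the `area` function, and the whole `Circle` class including its method.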
Let’s index the data and create a chain:
index = FAISS.from_documents(documents, embeddings)
retriever = index.as_retriever()
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
)
query = "What is a stuff chain?"
qa.run(query, callbacks=[handler])
'A stuff chain is a sequence of operations performed on a language model (LLM) to generate or process text. It typically consists of a language model chain (LLMChain) and a document chain (StuffDocumentsChain). The LLMChain is responsible for generating text based on a prompt, while the StuffDocumentsChain is used to process and manipulate documents or summaries. The specific details and functionality of a stuff chain can vary depending on the context and configuration.'
Introduction to LangChain: Retrieval Augmented Generation