Deep Dive: How to Build a Smart Chatbot in 10 mins with LangChain
Building Machine Learning Solutions
LangChain is an incredible tool for interacting with LLMs. In this Deep Dive, I’ll show how to use databases, tools, and memory to build a smart chatbot. At the end, I even show how to ask ChatGPT for investment advice. We cover:
What is LangChain?
Indexing and searching new data
Let’s get some data
Pinecone: A vector database
Storing the data
Retrieving data with ChatGPT
Giving ChatGPT access to tools
Providing a conversation memory
Putting everything together
Giving access to Google Search
Utilizing the database as a tool
Solving a difficult problem: Should I invest in Google today?
What is LangChain?
LangChain is a package for building applications using LLMs. It is composed of 6 modules:
Prompts: This module allows you to build dynamic prompts using templates. It can adapt to different LLM types depending on the context window size and the input variables used as context (conversation history, search results, previous answers, …).
Models: This module provides an abstraction layer to connect to most third-party LLM APIs available. It has API connections to roughly 40 public LLM, chat, and embedding models.
Memory: It gives LLMs access to the conversation history.
Indexes: Indexes refer to ways to structure documents so that LLMs can best interact with them. This module contains utility functions for working with documents, different types of indexes, and then examples for using those indexes in chains.
Agents: Some applications will require not just a predetermined chain of calls to LLMs/other tools, but potentially an unknown chain that depends on the user’s input. In these types of chains, there is an “agent” with access to a suite of tools. Depending on user input, the agent can decide which, if any, of these tools to call.
Chains: Using an LLM in isolation is fine for some simple applications, but many more complex ones require chaining LLMs, either with each other or with other experts. LangChain provides a standard interface for chains, as well as some common implementations for ease of use (a minimal sketch follows this list).
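To make the Prompts, Models, and Chains modules concrete, here is a minimal sketch of a prompt template chained with an LLM. The prompt text and parameters are my own illustration, not from the original post:

from langchain import PromptTemplate, LLMChain
from langchain.llms import OpenAI

# a dynamic prompt with one input variable
prompt = PromptTemplate(
    input_variables=['topic'],
    template='Explain {topic} in one sentence.'
)

# chain the prompt and the model together
llm = OpenAI(temperature=0)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run('vector databases'))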
Currently, the API is not well documented and is somewhat scattered, but if you are willing to dig into the source code, it is well worth the effort. I advise you to watch the following introductory video to get more familiar with what the tool is about:
In this letter, I am going to demo how to use LangChain. You can install all the necessary libraries by running the following:
pip install pinecone-client langchain openai wikipedia google-api-python-client unstructured tabulate pdf2image
Indexing and searching new data
One difficulty with Large Language Models is that they only know what they learned during training. So how do we get them to use private data? One way to do it is to make new text data discoverable by the LLM. The typical way to do that is to convert all private data into embeddings stored in a vector database. The process is as follows:
We chunk the data into small pieces
We pass each chunk through an LLM; the network’s final-layer activations can be used as a semantic vector representation (embedding) of that chunk
Each embedding is then stored in a database, together with a reference back to the piece of data it represents, so the original text can be recovered later
When we ask a question, we convert that question into an embedding (the query) and search for pieces of data close to it in the embedding space. We can then feed those relevant documents to the LLM for it to extract the answer from them.
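To give an idea of what that search step looks like, here is a toy sketch of cosine similarity between a query embedding and stored embeddings. The vectors are made up for illustration; in practice they come from an embedding model:

import numpy as np

# toy 4-dimensional embeddings for three stored chunks and one query
chunks = np.array([
    [0.1, 0.9, 0.0, 0.2],
    [0.8, 0.1, 0.3, 0.0],
    [0.2, 0.8, 0.1, 0.1],
])
query = np.array([0.15, 0.85, 0.05, 0.15])

# cosine similarity = dot product of L2-normalized vectors
chunks_n = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
scores = chunks_n @ query_n
print(scores.argsort()[::-1])  # chunk indices, most similar first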
Let’s get some data
I was looking for interesting data for a demo and I chose the earnings reports from the Alphabet company (Google): https://abc.xyz/investor/previous/
For simplicity, I downloaded them and stored them on my computer:
We can now load those documents into memory with LangChain in a few lines of code:
from langchain.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    './Langchain/data/',  # my local directory
    glob='**/*.pdf',      # we only get pdfs
    show_progress=True
)
docs = loader.load()
docs
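Each entry in docs is a LangChain Document carrying the extracted text and its metadata; a quick way to sanity-check what was loaded (this inspection snippet is my own addition):

print(len(docs))                   # number of documents loaded
print(docs[0].metadata)            # includes the source file path
print(docs[0].page_content[:500])  # first 500 characters of the first one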
And we split them into chunks; each chunk will correspond to one embedding vector:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,  # maximum number of characters per chunk
    chunk_overlap=0   # no overlap between consecutive chunks
)
docs_split = text_splitter.split_documents(docs)
docs_split
Next, we need to convert those chunks into embeddings and store them in a database.
Pinecone: A vector database
To store the data, I use Pinecone. You can create an account for free, and you are automatically given API keys to access the database.
In the “indexes” tab, click on “create index”. Give it a name and a dimension. I use 1536 dimensions, as that is the size of the embeddings produced by the OpenAI embedding model I will use, and the cosine similarity metric to search for similar documents.
This is going to create a vector index.
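If you prefer code over the console, the pinecone client used in this post can also create the index programmatically; a sketch, assuming the same settings as above and the API keys defined in the next section:

import pinecone

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

# same settings as in the console: 1536 dimensions, cosine similarity
if 'langchain-demo' not in pinecone.list_indexes():
    pinecone.create_index(
        name='langchain-demo',
        dimension=1536,
        metric='cosine'
    )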
Storing the data
Before continuing, make sure to get your OpenAI API key by signing up on the OpenAI platform. Let’s first write down our API keys:
import os
PINECONE_API_KEY = ... # find at app.pinecone.io
PINECONE_ENV = ... # next to api key in console
OPENAI_API_KEY = ... # found at platform.openai.com/account/api-keys
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
We upload the data to the vector database. The default OpenAI embedding model used in LangChain is 'text-embedding-ada-002' (see OpenAI’s embedding models); it converts the data into embedding vectors:
import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

# we use the OpenAI embedding model
embeddings = OpenAIEmbeddings()

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)

doc_db = Pinecone.from_documents(
    docs_split,
    embeddings,
    index_name='langchain-demo'
)
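Note that from_documents embeds and uploads everything on every run; once the index is populated, you can reconnect to it instead with the wrapper’s from_existing_index:

# reconnect to the already-populated index without re-uploading
doc_db = Pinecone.from_existing_index(
    index_name='langchain-demo',
    embedding=embeddings
)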
We can now search for relevant documents in that database using the cosine similarity metric:
query = "What were the most important events for Google in 2021?"
search_docs = doc_db.similarity_search(query)
search_docs
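The vector store can also return the similarity score of each match, which is useful for checking how relevant the retrieved chunks actually are:

for doc, score in doc_db.similarity_search_with_score(query):
    print(round(score, 3), doc.metadata.get('source'))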
Retrieving data with ChatGPT
We can now use an LLM to make use of the data in the database. Let’s get an LLM. We could get GPT-3 using LangChain’s OpenAI wrapper; a minimal sketch (the model names here reflect what was available at the time of writing, so treat them as assumptions):
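from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

# GPT-3 completion model (assumed model name)
llm = OpenAI(model_name='text-davinci-003', temperature=0)

# or the ChatGPT chat model (assumed model name)
chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

One common way to combine such a model with the vector store is LangChain’s RetrievalQA chain; a sketch, not necessarily the exact setup used here:

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=chat,
    chain_type='stuff',  # stuff the retrieved chunks into the prompt
    retriever=doc_db.as_retriever()
)
print(qa.run(query))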