Deep Dive: How to Build a Smart Chatbot in 10 mins with LangChain
Building Machine Learning Solutions
LangChain is an incredible tool for interacting with LLMs. In this Deep Dive, I’ll show how to use databases, tools and memory to build a smart Chatbot. At the end, I even show how to ask ChatGPT for investment advice. We cover:
What is LangChain?
Indexing and searching new Data
Let’s get some data
Pinecone: A vector database
Storing the data
Retrieving data with ChatGPT
Giving ChatGPT access to tools
Providing a conversation memory
Putting everything together
Giving access to Google Search
Utilizing the database as a tool
Solving a difficult problem: Should I invest in Google today?
What is LangChain?
LangChain is a package to build applications using LLMs. It is composed of 6 modules:
Prompts: This module allows you to build dynamic prompts using templates. It can adapt to different LLM types depending on the context window size and the input variables used as context (conversation history, search results, previous answers, …). A short example follows this list.
Models: This module provides an abstraction layer to connect to most 3rd party LLM APIs available. It has API connections to roughly 40 public LLM, chat and embedding models.
Memory: It gives LLMs access to the conversation history.
Indexes: Indexes refer to ways to structure documents so that LLMs can best interact with them. This module contains utility functions for working with documents, different types of indexes, and then examples for using those indexes in chains.
Agents: Some applications will require not just a predetermined chain of calls to LLMs/other tools, but potentially an unknown chain that depends on the user’s input. In these types of chains, there is an “agent” with access to a suite of tools. Depending on user input, the agent can decide which, if any, of these tools to call.
Chains: Using an LLM in isolation is fine for some simple applications, but many more complex ones require chaining LLMs - either with each other or with other experts. LangChain provides a standard interface for Chains, as well as some common implementations of chains for ease of use.
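As a quick illustration of the Prompts module, here is a minimal sketch of a dynamic prompt (the template text and variable name are my own, not from LangChain):

from langchain import PromptTemplate

# a template with one input variable (hypothetical example)
template = 'You are a financial analyst. Answer the question: {question}'
prompt = PromptTemplate(
    input_variables=['question'],
    template=template
)
prompt.format(question='What were the earnings in 2022?')

> 'You are a financial analyst. Answer the question: What were the earnings in 2022?'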
Currently the API is not well documented and somewhat scattered across the source code, but if you are willing to dig into it, it is well worth the effort. I advise you to watch the following introductory video to get more familiar with what the tool is about:
In the following letter, I am going to demo how to use LangChain. You can install all the necessary libraries by running the following:
pip install pinecone-client langchain openai wikipedia google-api-python-client unstructured tabulate pdf2image
Indexing and searching new Data
One difficulty with Large Language Models is that they only know what they learned during training. So how do we get them to use private data? One way to do it is to make new text data discoverable by the LLM. The typical way to do that is to convert all private data into embeddings stored in a vector database. The process is as follows:
We chunk the data into small pieces
We pass each chunk through an LLM, and the activations of the network’s final layer can be used as a semantic vector representation of that chunk
Each vector is then stored in a database, indexed by that embedding, so it can be used to recover the original piece of data.
When we ask a question we can then convert that question into an embedding (the query) and search for pieces of data close to it in the embedding space. We can then feed those relevant documents to the LLM for it to extract the answer from them:
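To make the idea concrete, here is a minimal sketch of embedding text and ranking it against a query with cosine similarity (the two example chunks are made up; it assumes an OpenAI API key is set):

import numpy as np
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# two hypothetical pieces of private data
chunks = [
    'Alphabet revenues grew 41% in 2021.',
    'The cafeteria serves pizza on Fridays.'
]
chunk_vectors = np.array(embeddings.embed_documents(chunks))
query_vector = np.array(embeddings.embed_query('How did Google perform in 2021?'))

# cosine similarity between the query and each chunk
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(chunks[int(np.argmax(scores))])  # the most relevant chunk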
Let’s get some data
I was looking for interesting data for a demo and I chose the earnings reports from the Alphabet company (Google): https://abc.xyz/investor/previous/
For simplicity, I downloaded them and stored them on my computer:
We can now load those documents into memory with LangChain in a few lines of code:
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader(
'./Langchain/data/', # my local directory
glob='**/*.pdf', # we only get pdfs
show_progress=True
)
docs = loader.load()
docs
Then we split them into chunks; each chunk will correspond to one embedding vector:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
chunk_size=1000,
chunk_overlap=0
)
docs_split = text_splitter.split_documents(docs)
docs_split
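Before moving on, it is worth a quick sanity check of the split (my own addition, not in the original post):

print(f'{len(docs)} documents split into {len(docs_split)} chunks')
print(docs_split[0].page_content[:200])  # peek at the first chunk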
Next, we need to convert those chunks into embeddings and store them in a database.
Pinecone: A vector database
To store the data, I use Pinecone. You can create an account for free, and you are automatically given API keys to access the database:
In the “Indexes” tab, click on “Create Index”. Give it a name and a dimension. I use 1536 for the dimension, as it is the size of the embeddings produced by the OpenAI embedding model I will use. I use the cosine similarity metric to search for similar documents:
This is going to create a vector table:
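If you prefer to create the index from code instead of the console, the Pinecone client can do it too (a sketch assuming the pinecone-client v2 API and the API keys defined in the next section):

import pinecone

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

# 1536 matches the output size of 'text-embedding-ada-002'
if 'langchain-demo' not in pinecone.list_indexes():
    pinecone.create_index(
        name='langchain-demo',
        dimension=1536,
        metric='cosine'
    )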
Storing the data
Before continuing, make sure to get your OpenAI API key by signing up in the OpenAI platform:
Let’s first write down our API keys:
import os
PINECONE_API_KEY = ... # find at app.pinecone.io
PINECONE_ENV = ... # next to api key in console
OPENAI_API_KEY = ... # found at platform.openai.com/account/api-keys
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
We upload the data to the vector database. The default OpenAI embedding model used in LangChain is 'text-embedding-ada-002' (see OpenAI’s embedding models); it converts the text chunks into embedding vectors:
import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
# we use the openAI embedding model
embeddings = OpenAIEmbeddings()
pinecone.init(
api_key=PINECONE_API_KEY,
environment=PINECONE_ENV
)
doc_db = Pinecone.from_documents(
docs_split,
embeddings,
index_name='langchain-demo'
)
We can now search for relevant documents in that database using the cosine similarity metric:
query = "What were the most important events for Google in 2021?"
search_docs = doc_db.similarity_search(query)
search_docs
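If you also want to see how close each match is, the vector store can return the similarity scores as well (a small addition of mine):

for doc, score in doc_db.similarity_search_with_score(query):
    print(round(score, 3), doc.metadata.get('source'))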
Retrieving data with ChatGPT
We can now use an LLM to exploit the database. Let’s get an LLM. We could get GPT-3 using
from langchain import OpenAI
llm = OpenAI()
or we could get ChatGPT using
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI()
Let’s use the RetrievalQA module to query that data:
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type='stuff',
retriever=doc_db.as_retriever(),
)
query = "What were the earnings in 2022?"
result = qa.run(query)
result
> 'The total revenues for the full year 2022 were $282,836 million, with operating income and operating margin information not provided in the given context.'
RetrievalQA is actually a wrapper around a specific prompt. The chain type “stuff“ uses a prompt that assumes all the retrieved text fits into the context window. It uses the following prompt template:
Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}
{question}
Here {context} is populated with the retrieved documents and {question} with the user’s question. If the text is longer than the context window, you can use the other chain types: “map_reduce”, “refine”, and “map_rerank”.
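For example, switching to “map_reduce” only requires changing the chain_type argument (a sketch; it answers over each retrieved document separately and then combines the results, at the cost of more LLM calls):

qa_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='map_reduce',
    retriever=doc_db.as_retriever(),
)
qa_map_reduce.run('What were the earnings in 2022?')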
Giving ChatGPT access to tools
Up to now, the LLM has been forced to use the database data. We can give it access to multiple tools. LangChain employs “agents” which, based on user input, decide which tools to utilize from a suite they have access to, with two main types: “Action Agents” that take actions one step at a time, and “Plan-and-Execute Agents” that decide on a plan of actions first and then execute them one at a time. For example, let’s give ChatGPT access to Wikipedia (make sure to install the wikipedia package with pip). Using the “ZERO_SHOT_REACT_DESCRIPTION“ flag, the LLM is able to understand how to use Wikipedia just based on the tool description:
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
llm = ChatOpenAI()
# we load wikipedia
tools = load_tools(['wikipedia'], llm=llm)
agent = initialize_agent(
tools,
llm,
agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
verbose=True
)
agent.run('When was google created?')
> Entering new AgentExecutor chain...
I need to find out the date of Google's creation
Action: Wikipedia
Action Input: "Google creation date"
Observation: Page: Google Scholar
Summary: Google Scholar ...
...
Thought:The creation date of Google is September 4, 1998, according to Wikipedia's page on Google.
Final Answer: September 4, 1998.
I am not showing the whole result here, but the agent called Wikipedia multiple times and generated multiple calls to the LLM. The default prompt template (which you can inspect with agent.agent.llm_chain.prompt.template) is as follows:
Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
Providing a conversation memory
Another problem is that the LLM doesn’t remember what was said:
agent.run('When was google created?')
> 'Google was created on September 4, 1998.'
agent.run('By whom?')
> 'None, as the original question was incomplete and required more context.'
We can provide it with memory with this simple modification:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key='chat_history')
agent = initialize_agent(
tools,
llm,
agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
verbose=True,
memory=memory
)
Let’s rerun the previous questions now
agent.run('When was google created?')
> 'Google was founded on September 4, 1998, by Larry Page and Sergey Brin while they were PhD students at Stanford University.'
agent.run('By whom?')
> 'Larry Page and Sergey Brin founded Google on September 4, 1998, while they were PhD students at Stanford University. Together they own about 14% of the company's publicly listed shares and control 56% of its stockholder voting power through super-voting stock. Google is now one of the largest technology companies in the world, offering a wide range of products and services.'
Effectively, we modified the prompt template under the hood and allowed the LLM to use the conversation history to form its answers. We can look at the new prompt template by running
agent.agent.llm_chain.prompt.template
'Assistant is a large language model trained by OpenAI.
Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.
Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.
Overall, Assistant is a powerful tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.
TOOLS:
------
Assistant has access to the following tools:
> Wikipedia: A wrapper around Wikipedia. Useful for when you need to answer general questions about people, places, companies, facts, historical events, or other subjects. Input should be a search query.
> Calculator: Useful for when you need to answer questions about math.
To use a tool, please use the following format:
```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [Wikipedia, Calculator]
Action Input: the input to the action
Observation: the result of the action
```
When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:
```
Thought: Do I need to use a tool? No
AI: [your response here]
```
Begin!
Previous conversation history:
{chat_history}
New input: {input}
{agent_scratchpad}'
That is a lot of context to help the LLM answer the question right!
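You can also inspect what the memory currently holds with ConversationBufferMemory’s load_memory_variables method (a quick check of my own):

memory.load_memory_variables({})
# returns a dict like {'chat_history': '...the conversation so far...'}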
Putting everything together
Giving access to Google Search
Let’s give ChatGPT access to Google Search as well. We need a Google API key and a Custom Search Engine ID. Follow these steps to get them:
Go to the Google Cloud Console.
If you don't already have an account, create one and log in
Create a new project by clicking on the Select a Project dropdown at the top of the page and clicking New Project
Give it a name and click Create
Set up the Custom Search API and get an API key:
Go to the APIs & Services Dashboard
Click Enable APIs and Services
Search for Custom Search API and click on it
Click Enable
Go to the Credentials page
Click Create Credentials
Choose API Key
Copy the API key
Enable the Custom Search API on your project (it might need a few minutes to propagate). Then set up a custom search engine and get its ID:
Go to the Custom Search Engine page
Click Add
Set up your search engine by following the prompts. You can choose to search the entire web or specific sites
Once you've created your search engine, click on Control Panel
Click Basics
Copy the Search engine ID
Now we just need to set the environment variables and import the tools:
import os
os.environ['GOOGLE_CSE_ID'] = ...
os.environ['GOOGLE_API_KEY'] = ...
llm = ChatOpenAI()
tools = load_tools([
'wikipedia',
'llm-math',
'google-search'
], llm=llm)
Utilizing the database as a tool
We want to make sure ChatGPT can use the database alongside the other tools. We need to cast our RetrievalQA chain as a tool:
from langchain.agents import Tool

# a newline-free name and a rich description help the agent pick the right tool
name = 'Alphabet quarterly earning reports database'
description = (
    'Useful for when you need to answer questions about the earnings of '
    'Google and Alphabet in 2021, 2022 and 2023. '
    'Input may be a partial or fully formed question.'
)
search_tool = Tool(
    name=name,
    func=qa.run,
    description=description,
)
tools.append(search_tool)
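With the database registered as a tool, we can rebuild the conversational agent so it can combine the earnings database with Wikipedia, the calculator and Google Search (a sketch reusing the memory defined earlier):

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    verbose=True,
    memory=memory
)
agent.run('How did the 2022 revenues compare to 2021?')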
Solving a difficult problem: Should I invest in Google today?
Now let's try to solve a difficult problem using the Plan-and-Execute toolbox. The LLM will create a set of steps to solve the problem, then execute each of the steps and assess whether the problem has been solved:
from langchain.experimental.plan_and_execute import (
PlanAndExecute,
load_agent_executor,
load_chat_planner
)
memory = ConversationBufferMemory(memory_key='chat_history')
planner = load_chat_planner(llm)
executor = load_agent_executor(llm, tools, verbose=True)
agent = PlanAndExecute(
planner=planner,
executor=executor,
verbose=True,
reduce_k_below_max_tokens=True
)
So should I invest in Google? Let’s ask the LLM:
agent.run('Should I invest in Google now?')
The LLM outputs its whole reasoning to arrive at a conclusion. Here is the plan that ChatGPT made to solve the problem:
Gather relevant information about Google's current financial situation.
Analyze Google's financial and stock market performance.
Consider any external factors that may impact Google's future performance.
Make an informed decision based on the analysis.
Communicate the decision to the user.
Here is the resulting set of operations that ChatGPT performed:
Apparently I should invest based on its assessment! What do you think?
That’s all Folks!