The AiEdge Newsletter

Deep Dive: How to Build a Smart Chatbot in 10 mins with LangChain

Building Machine Learning Solutions

Damien Benveniste
May 25, 2023

LangChain is an incredible tool for interacting with LLMs. In this Deep Dive, I’ll show how to use databases, tools, and memory to build a smart chatbot. At the end, I even show how to ask ChatGPT for investment advice. We cover:

  • What is LangChain?

  • Indexing and searching new Data

    • Let’s get some data

    • Pinecone: A vector database

    • Storing the data

    • Retrieving data with ChatGPT

  • Giving ChatGPT access to tools

  • Providing a conversation memory

  • Putting everything together

    • Giving access to Google Search

    • Utilizing the database as a tool

    • Solving a difficult problem: Should I invest in Google today?


What is LangChain?

LangChain is a package to build applications using LLMs. It is composed of 6 modules:

  • Prompts: This module allows you to build dynamic prompts using templates. It can adapt to different LLM types depending on the context window size and the input variables used as context (conversation history, search results, previous answers, …).

  • Models: This module provides an abstraction layer to connect to most 3rd party LLM APIs available. It has API connections to ~40 of the public LLMs, chat and embedding models.

  • Memory: It gives to the LLMs access to the conversation history.

  • Indexes: Indexes refer to ways to structure documents so that LLMs can best interact with them. This module contains utility functions for working with documents, different types of indexes, and then examples for using those indexes in chains.

  • Agents: Some applications will require not just a predetermined chain of calls to LLMs/other tools, but potentially an unknown chain that depends on the user’s input. In these types of chains, there is an “agent” with access to a suite of tools. Depending on user input, the agent can decide which, if any, of these tools to call.

  • Chains: Using an LLM in isolation is fine for some simple applications, but many more complex ones require chaining LLMs - either with each other or with other experts. LangChain provides a standard interface for Chains, as well as some common implementations of chains for ease of use.
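To make the Prompts module concrete: a dynamic prompt is just a template with named input variables that get filled in at query time with context such as search results or conversation history. Here is a plain-Python sketch of that idea (for illustration only; LangChain’s actual `PromptTemplate` class adds validation and LLM-aware formatting on top of this):

```python
# Conceptual sketch of a dynamic prompt: a template whose named slots
# (context, history, question) are filled right before calling the LLM.
TEMPLATE = (
    "Answer the question using the context below.\n"
    "Context: {context}\n"
    "History: {history}\n"
    "Question: {question}\n"
)

def build_prompt(context: str, history: str, question: str) -> str:
    """Fill the template with the dynamic pieces of context."""
    return TEMPLATE.format(context=context, history=history, question=question)

prompt = build_prompt(
    context="Alphabet is the parent company of Google.",
    history="(empty)",
    question="Who is Google's parent company?",
)
print(prompt)
```

In a real application, the context slot would be filled with the documents retrieved from the vector database, which is exactly what the rest of this post builds up to.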

Currently, the API is not well documented and somewhat scattered, but if you are willing to dig into the source code, it is well worth the effort. I advise you to watch the following introductory video to get more familiar with what the tool is about:

In this letter, I am going to demonstrate how to use LangChain. You can install all the necessary libraries by running the following:

pip install pinecone-client langchain openai wikipedia google-api-python-client unstructured tabulate pdf2image

Indexing and searching new Data

One difficulty with Large Language Models is that they only know what they learned during training. So how do we get them to use private data? One way to do it is to make new text data discoverable by the LLM. The typical way to do that is to convert all private data into embeddings stored in a vector database. The process is as follows:

  • We chunk the data into small pieces

  • We pass each chunk through an LLM, and the final layer of the network yields a semantic vector representation (an embedding) of that chunk

  • Each embedding is then stored in a vector database, where it serves as the key to recover the original piece of data.

When we ask a question, we convert that question into an embedding (the query) and search for pieces of data close to it in the embedding space. We can then feed those relevant documents to the LLM so it can extract the answer from them:
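The retrieval step above can be sketched in plain Python, using made-up 3-dimensional vectors to stand in for real embeddings (real embedding models produce much larger vectors, e.g. 1536 dimensions for OpenAI’s model):

```python
import math

# Toy illustration of embedding search: each document is mapped to a
# vector, and we return the document whose vector is closest to the
# query vector under cosine similarity.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Hypothetical 3-d "embeddings" for three chunks of text.
index = {
    "Q4 earnings summary": [0.9, 0.1, 0.0],
    "hiring announcement": [0.1, 0.8, 0.2],
    "data center update":  [0.2, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]  # embedding of a question about earnings
ranked = sorted(index, key=lambda doc: cosine(index[doc], query), reverse=True)
print(ranked[0])  # -> "Q4 earnings summary", the most relevant chunk
```

A vector database like Pinecone does exactly this, but with approximate nearest-neighbor indexes so the search stays fast over millions of vectors.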

Let’s get some data

I was looking for interesting data for a demo and I chose the earnings reports from the Alphabet company (Google): https://abc.xyz/investor/previous/

For simplicity, I downloaded them and stored them on my computer:

We can now load those documents into memory with LangChain in a few lines of code:

from langchain.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    './Langchain/data/', # my local directory
    glob='**/*.pdf',     # we only get pdfs
    show_progress=True
)
docs = loader.load()
docs

Next, we split them into chunks; each chunk will correspond to one embedding vector:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=0
)
docs_split = text_splitter.split_documents(docs)
docs_split
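Under the hood, fixed-size character chunking with overlap amounts to something like this rough sketch (the real `CharacterTextSplitter` is smarter: it prefers to split on separators such as newlines rather than cutting mid-word):

```python
# Rough sketch of fixed-size character chunking. Each chunk is
# chunk_size characters long, and consecutive chunks share
# chunk_overlap characters so context isn't lost at the boundaries.
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # -> ['abcd', 'defg', 'ghij', 'j']
```

The overlap is a trade-off: a little redundancy between chunks makes it less likely that an answer is split across a chunk boundary, at the cost of storing a few more vectors.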

Next, we need to convert those chunks into embeddings and store them in a vector database.

Pinecone: A vector database

To store the data, I use Pinecone. You can create an account for free, and you are automatically given API keys to access the database:

In the “Indexes” tab, click on “Create index”. Give it a name and a dimension. I use 1536 for the dimension, as it is the size of the embeddings produced by the OpenAI embedding model I will use. I use the cosine similarity metric to search for similar documents:
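The same index can also be created from code rather than the web console. This is a sketch assuming the pinecone-client v2 API used in this post; it requires the `PINECONE_API_KEY` and `PINECONE_ENV` credentials that we set up in the next section:

```python
import pinecone

# Create the index programmatically (pinecone-client v2 API).
# PINECONE_API_KEY and PINECONE_ENV come from app.pinecone.io.
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
pinecone.create_index(
    'langchain-demo',  # index name, referenced later when uploading data
    dimension=1536,    # size of OpenAI's text-embedding-ada-002 vectors
    metric='cosine'    # similarity metric used at query time
)
```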

This is going to create a vector table:

Storing the data

Before continuing, make sure to get your OpenAI API key by signing up in the OpenAI platform:

Let’s first write down our API keys:

import os

PINECONE_API_KEY = ... # find at app.pinecone.io
PINECONE_ENV = ...     # next to api key in console
OPENAI_API_KEY = ...   # found at platform.openai.com/account/api-keys

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

We upload the data to the vector database. The default OpenAI embedding model used in LangChain is 'text-embedding-ada-002' (see OpenAI’s embedding models); it is used to convert the data into embedding vectors:

import pinecone 
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

# we use the openAI embedding model
embeddings = OpenAIEmbeddings()
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)

doc_db = Pinecone.from_documents(
    docs_split, 
    embeddings, 
    index_name='langchain-demo'
)

We can now search for relevant documents in that database using the cosine similarity metric:

query = "What were the most important events for Google in 2021?"
search_docs = doc_db.similarity_search(query)
search_docs

Retrieving data with ChatGPT

We can now use an LLM to make use of the data in the database. Let’s get an LLM. We could get GPT-3 using
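The post is cut off at this point, but for context, loading a GPT-3 completion model through LangChain’s OpenAI wrapper looked roughly like this at the time. This is an assumption based on the LangChain API of that era, not the author’s exact code, and the model name is one plausible choice:

```python
from langchain.llms import OpenAI

# Hypothetical sketch (not the author's exact code): wrap a GPT-3
# completion model; temperature=0 makes answers more deterministic.
llm = OpenAI(model_name='text-davinci-003', temperature=0)
```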
