Chat with a GitHub repository using LangChain.js and JavaScript

So today let's look at something pretty amazing: using LangChain.js to chat with a GitHub repository.

First things first: grab the starter repo from https://github.com/BenGardiner123/langchainjs-typescript and follow the instructions to get set up.
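
If you're starting from scratch, the setup is the usual clone-and-install (the repo's README has the full instructions, so treat this as the short version):

git clone https://github.com/BenGardiner123/langchainjs-typescript.git
cd langchainjs-typescript
npm install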

You'll find this loader in the docs under the Indexes section => Document Loaders => Web Loaders.

The GitHub loader needs the ignore npm package as a peer dependency, so install that first:

npm install ignore

Here is the example from the docs. The two important parameters to set are the branch you want it to look at and whether you want it to drill down into subdirectories - if not, set recursive to false; otherwise, true.

import { GithubRepoLoader } from "langchain/document_loaders/web/github";

export const run = async () => {
  // Point the loader at the repo; unknown: "warn" logs any files
  // it doesn't know how to handle instead of throwing
  const loader = new GithubRepoLoader(
    "https://github.com/BenGardiner123/langchainjs-typescript",
    { branch: "main", recursive: true, unknown: "warn" }
  );

  // Fetch every file in the repo as a LangChain Document
  const docs = await loader.load();

  console.log({ docs });
};

OK, let's run it and see what happens... hmm, we get this rate-limiting message:

Failed to process directory: docs/docs, Error: Unable to fetch repository files: 403 {"message":"API rate limit exceeded for my IP. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)","documentation_url":"https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}

So we need to generate a GitHub access token. On GitHub, go to Settings => Developer settings => Personal access tokens and create one.
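
Once you have a token, drop it into the .env file in the project root. GITHUB_ACCESS_TOKEN is the name the loader code below reads; the OpenAI key is needed later for the embeddings and the LLM (the names here assume the starter repo's defaults, and the values are placeholders):

GITHUB_ACCESS_TOKEN=ghp_your_token_here
OPENAI_API_KEY=sk-your_key_here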

So now we know that for any repo of real size we need a GitHub token.
Let's go back to the starter repo and break the loading code out into its own function, which we can reuse later on.

async function loadDocuments() {
  console.log("Loading docs");
  const loader = new GithubRepoLoader(
    "https://github.com/BenGardiner123/langchainjs-typescript",
    {
      branch: "main",
      recursive: true,
      unknown: "warn",
      // make sure you add your token in the .env file
      accessToken: process.env.GITHUB_ACCESS_TOKEN,
    }
  );
  const docs = await loader.load();
  console.log("Docs loaded", docs);
  return docs;
}

Let's run it and see!

Awesome - we now have the data inside the repo split up into documents. You can see the metadata there as well!

Cool! So it's reading the code from inside the repo! This is pretty amazing stuff for a few lines of code.

OK then - so now let's ask some questions. To do that we are going to add a vector store (in-memory), because that's the easiest and fastest option. If you are looking for a more permanent solution, there are tons of docs and tutorials for Pinecone, Supabase, and Weaviate.
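
We'll use HNSWLib as the store here. It's backed by the native hnswlib-node package, so if the starter repo doesn't already include it, install it first:

npm install hnswlib-node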

We also need to add a text splitter to make it easier to retrieve what we want. We don't want everything in one giant chunk that the model has to try to match; we want to split it up into little pieces and match those.
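
On its own, the splitter looks something like this - chunkSize is measured in characters, and the string below is just a stand-in for real file contents:

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });

// createDocuments takes raw strings and returns an array of Document chunks
const chunks = await splitter.createDocuments([
  "...imagine a whole source file here as one long string...",
]);
console.log(chunks.length);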

Once we split them up into documents we can load them into our vector store and then use a RetrievalQAChain to ask it questions - check the example in the docs.

Because we spend money every time we create the index (each chunk gets embedded via the OpenAI API), it's handy that HNSWLib lets you save the index to local files... kinda like an SQLite file. So we set up the code to check whether the expected files exist: if either is missing we load the documents, build the index, and save it; otherwise we can reuse that directory as the location of our index when we want to query. Check the HNSWLib docs for details.
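
For reference, here are the imports and helpers the run function below relies on. The import paths match the LangChain version used at the time of writing, and the model instance and normalizeDocuments helper come from the starter repo - so this is a sketch of what they look like, not the canonical versions:

import * as fs from "fs";
import * as dotenv from "dotenv";
import { OpenAI } from "langchain/llms/openai";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { RetrievalQAChain } from "langchain/chains";
import { GithubRepoLoader } from "langchain/document_loaders/web/github";

dotenv.config();

// The LLM the RetrievalQA chain will use to answer questions
const model = new OpenAI({ temperature: 0 });

// createDocuments wants plain strings, but a Document's pageContent can
// occasionally be an object - stringify anything that isn't a string
function normalizeDocuments(docs) {
  return docs.map((doc) =>
    typeof doc.pageContent === "string"
      ? doc.pageContent
      : JSON.stringify(doc.pageContent)
  );
}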

Your run function should look like this:

export const run = async () => {
  const directory = "./vectorstore";

  try {
    // Check if the files exist in the directory
    const argsFileExists = fs.existsSync(`${directory}/args.json`);
    const docstoreFileExists = fs.existsSync(`${directory}/docstore.json`);

    // To save money and time, we only want to load the documents and create the vector store if we need to
    if (!argsFileExists || !docstoreFileExists) {
      // At least one of the files doesn't exist in the directory
      // Load documents, create vector store, and save them

      const docs = await loadDocuments();

      const textSplitter = new RecursiveCharacterTextSplitter({
        chunkSize: 1000,
      });

      const normalizedDocs = normalizeDocuments(docs);

      const splitDocs = await textSplitter.createDocuments(normalizedDocs);

      // Create a vector store for the documents using HNSWLib
      const vectorStore = await HNSWLib.fromDocuments(
        splitDocs,
        new OpenAIEmbeddings()
      );

      // Save the vector store to the directory
      await vectorStore.save(directory);
    }

    // Load the vector store from the directory
    const loadedVectorStore = await HNSWLib.load(
      directory,
      new OpenAIEmbeddings()
    );
    // Create a chain that uses the OpenAI LLM and HNSWLib vector store.
    const chain = RetrievalQAChain.fromLLM(
      model,
      loadedVectorStore.asRetriever()
    );
    const res = await chain.call({
      query: `What can you tell me about the repository ? `,
    });
    console.log({ res });

    const followUp = await chain.call({
      query: `Can you see a folder called "src". If you can see a folder called "src" can you tell me the name of the files inside it?`,
      context: res.context,
    });
    console.log({ followUp });
  } catch (error) {
    console.error("An error occurred:", error);
  }
};

Now let's ask some questions and see how good it is.

{
  res: {
    text: " This repository offers a profound initiation into the realm of TypeScript, harmoniously intertwined with the mystical powers of Langchainjs. Within these hallowed grounds, the essence of OpenAI's language models pulsates, waiting to be harnessed. It requires Node.js version 18 or higher to use and offers instructions on how to get started."
  }
}
{
  followUp: {
    text: 'Yes, I can see a folder called "src" and I cannot tell you the name of the files inside it.'
  }
}

OK, OK... not bad. There might be an issue with how I'm asking my query, so let's try something else. Let's comment out the follow-up, beef up the single query, and ask it to be specific:

const res = await chain.call({
  query: `What can you tell me about the repository ? be specific Can you see a folder called "src". If you can see a folder called "src" can you tell me the name of the files inside it?`,
});
console.log({ res });

// const followUp = await chain.call({
//   query: `Can you see a folder called "src". If you can see a folder called "src" can you tell me the name of the files inside it?`,
//   context: res.context,
// });
// console.log({ followUp });

And let's run it again

{
  res: {
    text: ' Yes, there is a folder called "src" and the files inside it are "app.ts" and "index.ts".'
  }
}

Awesome! That is cool! The LangChain library is amazing and constantly adding new features, so stay tuned.

I hope you enjoyed this and can use it as a jumping-off point for further exploration!

Hit me up if you have any questions, I'm still learning but will do my best to answer 😊

Happy Coding!

Ben
