Chat with a GitHub repository using LangChain.js and JavaScript
So today let's look at something pretty amazing: using LangChain.js to chat with a GitHub repository.
First, grab the starter repo from https://github.com/BenGardiner123/langchainjs-typescript and follow the instructions to get set up.
The loader we need lives in the docs under the Indexes section => Document Loaders => Web Loaders. The GithubRepoLoader also needs the ignore package, so install that first:
npm install ignore
Here is the example from the docs. The two important parameters to set are the branch you want it to look at and whether you want it to drill down into subdirectories: set recursive to true to include them, false to stay at the top level.
import { GithubRepoLoader } from "langchain/document_loaders/web/github";

export const run = async () => {
  const loader = new GithubRepoLoader(
    "https://github.com/BenGardiner123/langchainjs-typescript",
    { branch: "main", recursive: true, unknown: "warn" }
  );
  const docs = await loader.load();
  console.log({ docs });
};
OK, let's run it and see what happens... hmm, we get this rate-limiting message:
Failed to process directory: docs/docs, Error: Unable to fetch repository files: 403 {"message":"API rate limit exceeded for my IP. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)","documentation_url":"https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}
So we need to generate a GitHub access token; you can create one in your GitHub account under Settings => Developer settings => Personal access tokens.
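Once you have a token, drop it into the .env file in the starter repo. The variable name below is the one the loader code reads later on; the token value itself is just a placeholder:

```shell
# .env - keep this file out of version control
GITHUB_ACCESS_TOKEN=ghp_your_token_here
```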
So now we know that for large repos we need a GitHub token.
Let's try something smaller: we'll point the loader at the starter repo itself, and break the loading out into its own function that we can reuse later on.
async function loadDocuments() {
  console.log("Loading docs");
  const loader = new GithubRepoLoader(
    "https://github.com/BenGardiner123/langchainjs-typescript",
    {
      branch: "main",
      recursive: true,
      unknown: "warn",
      // make sure you add your token in the .env file
      accessToken: process.env.GITHUB_ACCESS_TOKEN,
    }
  );
  const docs = await loader.load();
  console.log("Docs loaded", docs);
  return docs;
}
Let's run it and see!
Awesome - we now have the data inside the repo split up into documents. You can see the metadata there as well!
Cool! So it's reading the code from inside the repo! This is pretty amazing stuff for a few lines of code.
OK then, so now let's ask some questions. To do that we are going to add a vector store (in-memory, because that's the easiest and fastest). If you are looking for a more permanent solution, there is plenty of documentation and tutorials for Pinecone, Supabase, and Weaviate.
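Whichever store you pick, the core job is the same: keep one embedding vector per chunk, and return the chunks whose vectors sit closest to the query's vector. Here's a toy sketch of that retrieval step with plain cosine similarity and a linear scan (no LangChain involved, just to show the idea):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the indices of the k stored vectors most similar to the query.
function topK(stored: number[][], query: number[], k: number): number[] {
  return stored
    .map((vec, i) => ({ i, score: cosineSimilarity(vec, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.i);
}
```

A real store like HNSWLib replaces the linear scan with an approximate-nearest-neighbour index so lookups stay fast as the number of chunks grows.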
We also need to add a text splitter to make retrieval easier. We don't want everything in one giant chunk that the model has to try and match; we want to split it up into little pieces and match those.
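We'll use LangChain's RecursiveCharacterTextSplitter for this, but the underlying idea is just fixed-size windows over the text. A stripped-down sketch (this ignores the separator-aware logic the real splitter adds, which tries to break chunks at natural boundaries like newlines):

```typescript
// Naive fixed-size chunking with optional overlap - a simplified
// stand-in for what a character text splitter does.
function chunkText(text: string, chunkSize: number, overlap = 0): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once this chunk reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```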
Once we split them up into documents we can load them into our vector store and then use a RetrievalQAChain to ask it questions; check the example in the docs.
Because creating the embeddings costs money every time we build the index, HNSWLib lets you save your documents to a local file, kind of like an SQLite file. So we set up the code to check whether the index files exist: if both are there we can assume the index has already been built and load it straight from that directory when we want to query; if either is missing, we rebuild it. Check the HNSWLib docs for details.
Your run function should look like this
import * as fs from "fs";
import { OpenAI } from "langchain/llms/openai";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { RetrievalQAChain } from "langchain/chains";

const model = new OpenAI({});

export const run = async () => {
  const directory = "./vectorstore";
  try {
    // Check if the index files exist in the directory
    const argsFileExists = fs.existsSync(`${directory}/args.json`);
    const docstoreFileExists = fs.existsSync(`${directory}/docstore.json`);

    // To save money and time, we only want to load the documents and
    // create the vector store if we need to
    if (!argsFileExists || !docstoreFileExists) {
      // At least one of the files doesn't exist in the directory:
      // load documents, create the vector store, and save it
      const docs = await loadDocuments();
      const textSplitter = new RecursiveCharacterTextSplitter({
        chunkSize: 1000,
      });
      const normalizedDocs = normalizeDocuments(docs);
      const splitDocs = await textSplitter.createDocuments(normalizedDocs);

      // Create a vector store for the documents using HNSWLib
      const vectorStore = await HNSWLib.fromDocuments(
        splitDocs,
        new OpenAIEmbeddings()
      );

      // Save the vector store to the directory
      await vectorStore.save(directory);
    }

    // Load the vector store from the directory
    const loadedVectorStore = await HNSWLib.load(
      directory,
      new OpenAIEmbeddings()
    );

    // Create a chain that uses the OpenAI LLM and HNSWLib vector store
    const chain = RetrievalQAChain.fromLLM(
      model,
      loadedVectorStore.asRetriever()
    );

    const res = await chain.call({
      query: `What can you tell me about the repository ? `,
    });
    console.log({ res });

    const followUp = await chain.call({
      query: `Can you see a folder called "src". If you can see a folder called "src" can you tell me the name of the files inside it?`,
      context: res.context,
    });
    console.log({ followUp });
  } catch (error) {
    console.error("An error occurred:", error);
  }
};
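One note: the run function above calls a normalizeDocuments helper that flattens each loaded document's pageContent into a plain string array for the splitter. I haven't shown it, so here is a minimal sketch of what such a helper can look like (the LoadedDoc shape is simplified for illustration; the loader actually returns full Document objects):

```typescript
// Simplified document shape for illustration: pageContent is usually
// a string, but we guard for array content as well.
interface LoadedDoc {
  pageContent: string | string[];
}

// Flatten each document's pageContent into a plain string so the
// text splitter can take a string[] as input.
function normalizeDocuments(docs: LoadedDoc[]): string[] {
  return docs.map((doc) =>
    typeof doc.pageContent === "string"
      ? doc.pageContent
      : doc.pageContent.join("\n")
  );
}
```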
Now let's ask some questions and see how good it is.
{
  res: {
    text: " This repository offers a profound initiation into the realm of TypeScript, harmoniously intertwined with the mystical powers of Langchainjs. Within these hallowed grounds, the essence of OpenAI's language models pulsates, waiting to be harnessed. It requires Node.js version 18 or higher to use and offers instructions on how to get started."
  }
}
{
  followUp: {
    text: "Yes, I can see a folder called "src" and I cannot tell you the name of the files inside it."
  }
}
OK, OK... not bad. There might be an issue with how I'm phrasing my query, so let's try something else: comment out the follow-up, beef up the single query, and ask it to be specific.
const res = await chain.call({
  query: `What can you tell me about the repository ? be specific Can you see a folder called "src". If you can see a folder called "src" can you tell me the name of the files inside it?`,
});
console.log({ res });

// const followUp = await chain.call({
//   query: `Can you see a folder called "src". If you can see a folder called "src" can you tell me the name of the files inside it?`,
//   context: res.context,
// });
// console.log({ followUp });
And let's run it again
{
  res: {
    text: ' Yes, there is a folder called "src" and the files inside it are "app.ts" and "index.ts".'
  }
}
Awesome! That is cool! The LangChain library is amazing and is constantly adding new features, so stay tuned.
I hope you enjoyed this and can use it as a jumping-off point for further exploration!
Hit me up if you have any questions, I'm still learning but will do my best to answer 😊
Happy Coding!
Ben