Batch loading resources in EmbedJS for RAG
I recently shared a post showing how to get started with RAG using EmbedJS.
In the post, I asked this question:
What's the intended EmbedJS pattern for batch loads?
While writing the post I answered my own question in concept, but hadn’t yet implemented it:
The right move is probably to simply skip adding loaders in the builder and instead add them at runtime.
Indeed, that turns out to work nicely.
Now I’d like to share a repo and walk you through it.
Script flow
Before we get into any specifics, let’s look at the flow of the script.
In index.js you can see the full flow play out in main():
- Initialize the RAG model (getRagApplication())
- Load the content of multiple URLs (loadResources())
- Take a prompt from the command line (getPrompt())
- Prompt the RAG model (promptRag())
- Print the response from the RAG model (printRagOutput())
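The steps above can be sketched as a short main() function. This is only an illustration of the order of calls: the helper names come from the list, but their bodies here are stubs, and the parameters of everything except loadResources() are assumptions rather than the repo's actual signatures.

```javascript
// Stubs standing in for the real helpers in index.js.
// Only the helper names and the order of the calls come from the post.
const getRagApplication = async () => ({ loaded: [] });

const loadResources = async (ragApplication, dataUrls) => {
  ragApplication.loaded.push(...dataUrls);
  return dataUrls.map((url) => ({ url }));
};

// Reads the prompt from the command line, falling back to a placeholder.
const getPrompt = () => process.argv.slice(2).join(" ") || "What is EmbedJS?";

const promptRag = async (ragApplication, prompt) =>
  `Stub answer to "${prompt}" from ${ragApplication.loaded.length} sources`;

const printRagOutput = (output) => console.log(output);

// The flow itself: initialize, load, prompt, print.
const main = async (dataUrls) => {
  const ragApplication = await getRagApplication();
  await loadResources(ragApplication, dataUrls);
  const prompt = getPrompt();
  const output = await promptRag(ragApplication, prompt);
  printRagOutput(output);
  return output;
};
```

Keeping each step in its own helper means the batch-loading step can change without touching the rest of the flow.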
The main difference between today’s script and the one I shared last time lives in loadResources(), so we'll focus on that helper function.
Batch loading in EmbedJS
First, from the EmbedJS docs:
You can add new loaders at any point dynamically (even after calling the build function on RAGApplicationBuilder). To do this, simply call the addLoader method -
await ragApplication.addLoader(new YoutubeLoader({ videoIdOrUrl: 'pQiT2U5E9tI' }));
This means we can skip adding our resources when setting up our RAG application using the EmbedJS builder. Instead, we can add them any time we like with the addLoader() method.
So if we, say, have an array of URLs that we want to ingest into the RAG app, we can simply loop over them and call addLoader() for each.
Let’s see it in action.
Loading a list of web URLs into your RAG app
In today’s example script I have hardcoded an array of URLs as dataUrls.
I have then created a loadResources() helper function that will do the following:
- Take the RAG app and URLs as parameters
- Map over each data URL
- Call the RAG app's addLoader() method for each URL, passing in a new WebLoader with said URL
Throughout, I’m doing some console logs just to give you an idea of what’s happening at runtime.
Here’s the full helper function:
const loadResources = async (ragApplication, dataUrls) => {
  // Start one addLoader() call per URL and wait for all of them to finish.
  const loaderSummaries = await Promise.all(
    dataUrls.map(async (url) => {
      console.log("Adding loader for:", url);
      // addLoader() resolves to a summary object for the loader it added.
      const loaderSummary = await ragApplication.addLoader(
        new WebLoader({ urlOrContent: url })
      );
      return loaderSummary;
    })
  );
  // Log one JSON line per summary.
  console.log(
    "\nLoader summaries:\n",
    loaderSummaries.map((summary) => JSON.stringify(summary)).join("\n")
  );
  return loaderSummaries;
};
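One thing to keep in mind with Promise.all(): if any single addLoader() call rejects, the whole batch rejects. Here is a minimal sketch of a more forgiving variant built on Promise.allSettled(). The function name loadResourcesSettled, the createLoader parameter, and the stub RAG app are all hypothetical and exist only so the example runs without EmbedJS installed. In the real script, createLoader could be something like (url) => new WebLoader({ urlOrContent: url }).

```javascript
// Hypothetical variant of loadResources() that tolerates individual failures.
// `createLoader` is an assumed parameter that builds a loader for a given URL.
const loadResourcesSettled = async (ragApplication, dataUrls, createLoader) => {
  // allSettled() waits for every call, whether it fulfills or rejects.
  const results = await Promise.allSettled(
    dataUrls.map(async (url) => ragApplication.addLoader(createLoader(url)))
  );

  const loaded = [];
  const failed = [];
  results.forEach((result, index) => {
    if (result.status === "fulfilled") {
      loaded.push(result.value);
    } else {
      // Keep the URL next to the error so it can be retried or reported.
      failed.push({ url: dataUrls[index], reason: result.reason });
    }
  });
  return { loaded, failed };
};

// Stub RAG app for demonstration only: rejects any URL containing "bad".
const stubRagApplication = {
  addLoader: async (loader) => {
    if (loader.url.includes("bad")) {
      throw new Error(`Could not load ${loader.url}`);
    }
    return {
      entriesAdded: 1,
      uniqueId: `StubLoader_${loader.url}`,
      loaderType: "StubLoader",
    };
  },
};
```

Calling loadResourcesSettled(stubRagApplication, urls, (url) => ({ url })) returns the summaries that succeeded alongside the URLs that failed, so the script can decide whether a partial load is good enough.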
Let’s look at that function’s return value.
What are loader summaries for?
My loadResources() helper function is returning an array of what I’m calling "loader summaries".
Each “loader summary” is an object returned by EmbedJS’s addLoader() method that looks like this:
{
	entriesAdded: 2,
	uniqueId: "WebLoader_c40b4d270ae136db1a61bd20cd2cbc4a",
	loaderType: "WebLoader"
}
I’m not yet sure how meaningful this data is in practice. The EmbedJS docs don’t currently cover it at all.
In the script’s main() function, I’m gating the happy path with this logic:
if (loaderSummaries.length && loaderSummaries.length === dataUrls.length)
But I’m not sure this is meaningful. I hope to learn more about it.
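One reason to doubt the length check: Promise.all() resolves to an array with exactly as many items as the input, so whenever loadResources() resolves with a non-empty dataUrls, the two lengths already match. A possibly stronger gate would look inside each summary. The sketch below assumes that entriesAdded greater than zero means content was actually ingested, which is a guess and not something the EmbedJS docs confirm. The helper name allResourcesLoaded is hypothetical.

```javascript
// Hypothetical helper: passes only if there is one summary per URL
// and every summary reports at least one entry.
// Assumption: `entriesAdded > 0` signals that the loader ingested content.
const allResourcesLoaded = (loaderSummaries, dataUrls) =>
  dataUrls.length > 0 &&
  loaderSummaries.length === dataUrls.length &&
  loaderSummaries.every((summary) => summary && summary.entriesAdded > 0);

// Sample data shaped like the summary object shown earlier.
const sampleUrls = ["https://example.com/a", "https://example.com/b"];
const goodSummaries = [
  { entriesAdded: 2, uniqueId: "WebLoader_a", loaderType: "WebLoader" },
  { entriesAdded: 5, uniqueId: "WebLoader_b", loaderType: "WebLoader" },
];
const emptySummaries = [
  { entriesAdded: 2, uniqueId: "WebLoader_a", loaderType: "WebLoader" },
  { entriesAdded: 0, uniqueId: "WebLoader_b", loaderType: "WebLoader" },
];

console.log(allResourcesLoaded(goodSummaries, sampleUrls)); // true
console.log(allResourcesLoaded(emptySummaries, sampleUrls)); // false
```

If a loader can legitimately report zero entries, this check would be too strict, so treat it as a starting point.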
Happy hacking
Go grab the repo and give it a spin!
I'd love to hear your reaction to it.