RAG in Node.js with EmbedJS

A look at EmbedJS, a Node.js library for building RAG applications, along with a repo to help you get started.

Now that RAG (Retrieval-Augmented Generation) is the current AI buzz for app developers, I've been curious how to go about implementing it in Node.js.

Let's take a look at a library called EmbedJS, which makes it super simple for mere mortal application devs to get started with RAG.

(If you're looking for a primer on RAG, "Applied AI Software Engineering: RAG" on The Pragmatic Engineer is a fine starting point.)

A basic Node.js RAG implementation

Last week, I put together this tiny repo which shows how to do a basic RAG implementation using EmbedJS.

The crux of it is this:

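// Import paths assume the @llm-tools/embedjs package layout; check them against your installed version.
import path from "node:path";
import { RAGApplicationBuilder, WebLoader, OpenAi3SmallEmbeddings } from "@llm-tools/embedjs";
import { LanceDb } from "@llm-tools/embedjs/vectorDb/lance";
import { LmdbCache } from "@llm-tools/embedjs/cache/lmdb";

const ragApplication = await new RAGApplicationBuilder()
  .setEmbeddingModel(new OpenAi3SmallEmbeddings())
  .setVectorDb(new LanceDb({ path: path.resolve("./db") }))
  .setCache(new LmdbCache({ path: path.resolve("./cache") }))
  .addLoader(new WebLoader({ urlOrContent: "https://www.ashryan.io" }))
  .build(); // build() turns the configured builder into a queryable application

console.log(await ragApplication.query('What does Ash write about?'));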

EmbedJS uses a builder pattern to configure the RAG application. Once you have a RAG application instance, you can prompt it with ragApplication.query().

As you can see, EmbedJS abstracts away a ton of setup and configuration. Developers who want fine control of the various aspects of RAG will no doubt want to take a more hands-on approach.

But at least for hobby and learning projects, EmbedJS seems like a fine way to start dipping your toes in the water.

My current questions

Getting up and running with EmbedJS was pretty quick and simple. Now I have questions about how to really start digging in.

What follows are some of my current questions.

At what point do I need persistent vector and cache databases?

My implementation above uses LanceDB and LmdbCache, both of which persist to disk. But EmbedJS also offers connectors for in-memory alternatives: HNSWLib as the vector store and its own built-in MemoryCache as the cache.

I've been hearing some opinions that vector databases are overkill unless your application is dealing with truly insane amounts of data. But I'm not yet sure where the edges are.
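To get a feel for where those edges might be, here's a minimal sketch of brute-force similarity search over embeddings held in a plain array. This is not how EmbedJS or HNSWLib work internally; the names cosineSimilarity and topK, and the toy three-dimensional vectors, are made up for illustration.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat zero vectors as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored chunk against the query and keep the k best.
function topK(queryVector, chunks, k) {
  return chunks
    .map((chunk) => ({ text: chunk.text, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: real embeddings have hundreds or thousands of dimensions.
const chunks = [
  { text: "Posts about Node.js", vector: [1, 0, 0] },
  { text: "Posts about music", vector: [0, 1, 0] },
  { text: "Posts about Node.js and AI", vector: [1, 0, 1] },
];

const results = topK([1, 0, 0.2], chunks, 2);
console.log(results.map((r) => r.text));
```

A linear scan like this does work proportional to the number of stored chunks on every query. For a small corpus that's fine; as the corpus grows, an indexed vector store starts to earn its keep.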

Of course, in-memory storage doesn't persist at all: everything is gone when the process exits. For RAG, I wonder when that's a good thing and when it isn't. I've already found myself deleting the persistent databases just to get the library to take on new or changed information (I would assume there's an API in EmbedJS to help with this while avoiding brute-force deletion).
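For the record, the brute-force reset is easy to script. This is a minimal sketch using only Node's built-in fs module, not an EmbedJS API; resetStores is a hypothetical helper name, and it assumes you pass it the same directories you gave to the vector store and the cache (./db and ./cache in the snippet above).

```javascript
// Delete the given directories so the next run re-ingests everything from scratch.
// Dynamic import keeps this snippet usable from both CommonJS and ES modules.
async function resetStores(dirs) {
  const { rm } = await import("node:fs/promises");
  const path = await import("node:path");
  for (const dir of dirs) {
    // recursive removes the contents; force ignores directories that don't exist yet
    await rm(path.resolve(dir), { recursive: true, force: true });
  }
}
```

Calling resetStores(["./db", "./cache"]) before building the application would start each run from a clean slate. The cost is that every source gets re-fetched and re-embedded, so it only makes sense while the data set is small.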

I need to dig in deeper here.

What's the intended EmbedJS pattern for batch loads?

The library's builder pattern feels a bit limiting if I need to batch load a lot of documents. Imagine the code snippet above, where instead of EmbedJS's WebLoader, I'm using its CsvLoader for 100 different CSVs.

As far as I can tell, I can't pass an array of loaders to addLoader(). I could roll my own builder wrapper to dynamically add an addLoader() call with a CsvLoader for each file, but it feels awkward to, uhh, build a builder like that.
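That said, the wrapper wouldn't have to be elaborate. Assuming addLoader() returns the builder the way the other chained methods do, a reduce over the file list can do the chaining. In this sketch, addLoaders and makeLoader are hypothetical names, and FakeBuilder stands in for RAGApplicationBuilder so the snippet runs without EmbedJS installed.

```javascript
// Chain one addLoader() call per item onto a builder.
// Assumes addLoader() returns the builder, as chained builder methods usually do.
function addLoaders(builder, items, makeLoader) {
  return items.reduce((b, item) => b.addLoader(makeLoader(item)), builder);
}

// Stand-in builder so the sketch runs on its own; not the EmbedJS class.
class FakeBuilder {
  constructor() {
    this.loaders = [];
  }
  addLoader(loader) {
    this.loaders.push(loader);
    return this;
  }
}

const files = ["a.csv", "b.csv", "c.csv"];
const builder = addLoaders(new FakeBuilder(), files, (file) => ({ type: "csv", file }));
console.log(builder.loaders.length); // 3
```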

Well, as I'm writing this, I'm answering my own question. The right move is probably to simply skip adding loaders in the builder and instead add them at runtime. The EmbedJS docs do say this is possible:

You can add new loaders at any point dynamically (even after calling the build function on RAGApplicationBuilder). To do this, simply call the addLoader method -
await ragApplication.addLoader(new YoutubeLoader({ videoIdOrUrl: 'pQiT2U5E9tI' }));
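
Applied to the batch case, that could look something like the sketch below. addLoadersAtRuntime and makeLoader are hypothetical names; the only EmbedJS call assumed is the addLoader() method from the quote above. Loader construction is left to a callback because CsvLoader's constructor options aren't shown here, and adding loaders one at a time is a cautious choice on my part, not something the docs prescribe.

```javascript
// Add one loader per item to an already-built app, sequentially.
// `app` is anything with an async addLoader() method.
async function addLoadersAtRuntime(app, items, makeLoader) {
  const failed = [];
  for (const item of items) {
    try {
      await app.addLoader(makeLoader(item));
    } catch (error) {
      // Keep going so one bad file doesn't abort the whole batch.
      failed.push({ item, error });
    }
  }
  return failed;
}
```

With a built ragApplication, you'd pass it in along with the list of CSV paths and a callback that constructs a CsvLoader for each path. The returned array lists whichever files failed to load, so they can be retried or logged.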

I've already asked this question on the repo's GitHub Discussions. So I'll go update my post there.

Give the repo a spin

I pushed a repo to GitHub. It's intended to help you get started with EmbedJS and not much more. Go check it out.

Hopefully it helps you spring past hello world and on to your own RAG implementation!

GitHub - ashryanbeats/embedjs-rag: Getting started with EmbedJS for RAG