Building Web Browser Experiences w/ Meta Llama 3.2
How to leverage lightweight OSS models to run LLM inference directly in the browser
Meta recently released Llama 3.2, a set of lightweight on-device models (1B, 3B) and multimodal models (11B, 90B).
I've been interested in running OSS LLMs on-device for a while now because of the privacy it unlocks for customers. If many of your LLM calls are lightweight, running something like Llama 3.2 in-browser and simply storing the results in a database can be a good alternative to server-side inference, depending on how comfortable your audience is with LLM usage.
I took some time to experiment with Llama 3.2 and built out a simple writing tool that makes auto-complete suggestions while writing. Once it detects that the user has stopped typing for half a second, it fetches a suggestion from Llama 3.2 3B, which is running in-browser.
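To give a sense of the flow, here is a minimal sketch of that debounce logic in TypeScript. The getSuggestion and showGhostText helpers (and the #editor element) are hypothetical stand-ins for the WebLLM call shown below and the UI code that renders the suggestion:

// Request a suggestion once the user has paused typing for half a second.
declare function getSuggestion(prompt: string): Promise<string>;
declare function showGhostText(text: string): void;

const editor = document.querySelector<HTMLTextAreaElement>("#editor")!;
let debounceTimer: number | undefined;

editor.addEventListener("input", () => {
  clearTimeout(debounceTimer);
  debounceTimer = window.setTimeout(async () => {
    showGhostText(await getSuggestion(editor.value));
  }, 500); // half a second of inactivity triggers the request
});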
To run Llama 3.2 in-browser, I leveraged WebLLM, an in-browser LLM inference engine. I then referred to the documentation to initialize Llama 3.2 in the browser like so:
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const selectedModel = "Llama-3.2-3B-Instruct-q4f16_1-MLC";
const engine = await CreateMLCEngine(selectedModel);
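Because the initial model download takes a while (more on that below), it's worth passing WebLLM's optional init-progress callback so you can show a loading indicator. A minimal sketch, using the same import:

// Same call as above, with an optional progress callback while the model
// downloads and compiles.
const engineWithProgress = await CreateMLCEngine(selectedModel, {
  initProgressCallback: (report) => {
    console.log(report.text); // human-readable progress string from WebLLM
  },
});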
At this point, I simply make an API call like so:
const result = await engine.chat.completions.create({
  messages: [{ role: "user", content: prompt }],
  temperature: 0.7,
  max_tokens: 30, // keep the auto-complete suggestions short
});
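The response follows the familiar OpenAI chat-completions shape, so pulling the suggestion text out looks something like this (showGhostText is the same hypothetical UI helper from the sketch above):

// Extract the generated suggestion from the OpenAI-style response.
const suggestion = result.choices[0]?.message?.content ?? "";
showGhostText(suggestion.trim());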
It takes a minute or two for the model to load, but after that, inference is very fast, generating the short, few-word suggestions right in the browser. I even tried connecting the app to a local Ollama server to stream longer messages, and that also worked well.
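For the Ollama experiment, the page talks to a locally running Ollama server over HTTP rather than doing the inference in-browser. A rough sketch of the streaming call, assuming Ollama is on localhost:11434, its OLLAMA_ORIGINS setting allows requests from your page, and appendToEditor is a hypothetical UI helper (buffering of partial lines is omitted for brevity):

// Stream a longer response from a local Ollama server.
declare function appendToEditor(text: string): void;

const response = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  body: JSON.stringify({
    model: "llama3.2",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  }),
});

const reader = response.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Ollama streams newline-delimited JSON objects, each carrying a message delta.
  for (const line of decoder.decode(value, { stream: true }).split("\n").filter(Boolean)) {
    const chunk = JSON.parse(line);
    appendToEditor(chunk.message?.content ?? "");
  }
}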
Question
How far do you think we can quantize these models?
The weights being released today are based on BFloat16 numerics. Our teams are actively exploring quantized variants that will run even faster, and we hope to share more on that soon. — Meta
It's exciting to see that Meta is already exploring this because it will get us one step closer to being able to do much of this computation locally on any device in the world, whether it's a $2,000 laptop or a $100 smartphone.