Running Inference In Web Extensions
Image generated by DALL·E
We’re shipping a new API in Firefox Nightly that will let you use our Firefox AI runtime to run offline machine learning tasks in your web extension.
Firefox AI Runtime
We’ve recently shipped a new component inside of Firefox that leverages Transformers.js (a JavaScript equivalent of Hugging Face’s Transformers Python library) and the underlying ONNX runtime engine. This component lets you run any machine learning model that is compatible with Transformers.js in the browser, with no server-side calls beyond the initial download of the models. This means Firefox can run everything on your device and avoid sending your data to third parties.
Web applications can already use Transformers.js in vanilla JavaScript, but running through our platform offers some key benefits:
- The inference runtime is executed in a dedicated, isolated process, for safety and robustness
- Model files are stored using IndexedDB and shared across origins
- Firefox-specific performance improvements accelerate the runtime
This platform shipped in Firefox 133 to provide alt text for images in PDF.js, and will be used in several other places in Firefox 134 and beyond to improve the user experience.
We also want to unblock the community’s ability to experiment with these capabilities. Starting later today, developers will be able to access a new trial “ml” API in Firefox Nightly. This API is basically a thin wrapper around Firefox’s internal API, but with a few additional restrictions for user privacy and security.
There are two major differences between this API and most other WebExtensions APIs: the API is highly experimental and permission to use it must be requested after installation.
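For example, requesting the permission at runtime might look like the sketch below. We assume here that the trial permission is named "trialML" and is declared under "optional_permissions" in the manifest; check the documentation for the exact name.

// A minimal sketch: ask the user for the trial ML permission at runtime.
// We assume the permission is named "trialML" and is listed under
// "optional_permissions" in the manifest; see the documentation for details.
async function ensureMLPermission() {
  const permissions = { permissions: ["trialML"] };
  if (await browser.permissions.contains(permissions)) {
    return true;
  }
  // permissions.request() must be called from a user action handler,
  // such as a button click.
  return browser.permissions.request(permissions);
}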
This new API is virtually guaranteed to change in the future. To help set developer expectations, the “ml” API is exposed under the “browser.trial” namespace rather than directly on the “browser” global object. Any API exposed on “browser.trial” may not be compatible across major versions of Firefox. Developers should guard against breaking changes using a combination of feature detection and strict_min_version declarations. You can see a more detailed description of how to write extensions with it in our documentation.
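For instance, a feature-detection guard can be as simple as the sketch below, paired with a "strict_min_version" entry under "browser_specific_settings.gecko" in your manifest:

// A minimal feature-detection sketch: only enable AI features when the
// trial "ml" namespace actually exists in this version of Firefox.
const mlAvailable = typeof browser.trial?.ml?.createEngine === "function";
if (!mlAvailable) {
  console.warn("browser.trial.ml is unavailable; AI features disabled.");
}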
Running an inference task
Performing inference directly in the browser is quite exciting. We expect people will be able to build compelling features using the browser’s data locally.
Like the original Transformers library that inspired it, Transformers.js uses “tasks” to abstract away the implementation details of specific kinds of ML workloads. You can find a description of all tasks that Transformers.js supports in the project’s official documentation.
For our first iteration, Firefox exposes the following tasks:
- text-classification – Assigning a label or class to a given text.
- token-classification – Assigning a label to each token in a text.
- question-answering – Retrieving the answer to a question from a given text.
- fill-mask – Masking some of the words in a sentence and predicting which words should replace those masks.
- summarization – Producing a shorter version of a document while preserving its important information.
- translation – Converting text from one language to another.
- text2text-generation – Converting one text sequence into another text sequence.
- text-generation – Producing new text by predicting the next word in a sequence.
- zero-shot-classification – Classifying text into classes that were unseen during training.
- image-to-text – Producing text from a given image.
- image-classification – Assigning a label or class to an entire image.
- image-segmentation – Dividing an image into segments where each pixel is mapped to an object.
- zero-shot-image-classification – Classifying images into classes that were unseen during training.
- object-detection – Identifying objects of certain defined classes within an image.
- zero-shot-object-detection – Identifying objects of classes that were unseen during training.
- document-question-answering – Answering questions on document images.
- image-to-image – Transforming a source image to match the characteristics of a target image or a target image domain.
- depth-estimation – Predicting the depth of objects present in an image.
- feature-extraction – Transforming raw data into numerical features that can be processed while preserving the information in the original dataset.
- image-feature-extraction – Transforming raw data into numerical features while preserving the information in the original image.
For each task, we’ve selected a default model; you can see the list in EngineProcess.sys.mjs on Searchfox. These curated models are all stored in our Model Hub at https://model-hub.mozilla.org/. A “Model Hub” is Hugging Face’s term for an online repository of models; see The Model Hub. Whether used by Firefox itself or an extension, models are automatically downloaded on first use and cached.
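Since the first call for a given model triggers that download, you may want to surface progress to the user. Below is a minimal sketch, assuming the API reports download events through an onProgress listener as Mozilla’s example extension (linked later in this post) does; the payload shape may vary across Nightly builds.

// A sketch of monitoring model downloads. We assume browser.trial.ml
// exposes an onProgress event, as used in the example extension; the
// exact event payload is an assumption here.
browser.trial.ml.onProgress.addListener((progressData) => {
  console.log("Model download progress:", progressData);
});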
Below is an example showing how to run a summarizer in your extension with the default model:
async function summarize(text) {
  // Create an inference engine for the default summarization model.
  await browser.trial.ml.createEngine({ taskName: "summarization" });

  // Run the inference; the result follows the Transformers.js output format.
  const result = await browser.trial.ml.runEngine({ args: [text] });
  return result[0]["summary_text"];
}
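Usage is then a single call; “longArticleText” below is a hypothetical placeholder for whatever text your extension collects:

// Example usage; longArticleText is a placeholder input string.
const summary = await summarize(longArticleText);
console.log(summary);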
If you want to use another model, you can pick any model published on Hugging Face by Xenova or the Mozilla organization. For now, model downloads are restricted to those two organizations, but we might relax this limitation in the future.
To use an allow-listed model from Hugging Face, set the “modelHub” option to “huggingface”, the “taskName” option to the appropriate task, and the “modelId” option to the model’s identifier when creating an engine.
Let’s modify the previous example to use a model that can summarize larger texts:
async function summarize(text) {
  // Create an engine backed by an allow-listed Hugging Face model that
  // can handle longer input sequences.
  await browser.trial.ml.createEngine({
    taskName: "summarization",
    modelHub: "huggingface",
    modelId: "Xenova/long-t5-tglobal-base-16384-book-summary",
  });

  const result = await browser.trial.ml.runEngine({ args: [text] });
  return result[0]["summary_text"];
}
Our PDF.js alt text feature follows the same pattern (sketched after this list):
- Get the image to describe
- Use the “image-to-text” task with the “mozilla/distilvit” model
- Run the inference and return the generated text
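A sketch of that flow with the extension API follows. We assume the engine accepts an image URL as its argument, as the corresponding Transformers.js task does, and that “mozilla/distilvit” is fetched from Hugging Face; see the example extension linked below for the exact calls.

// A sketch of the alt text flow using the extension API. We assume the
// "image-to-text" engine accepts an image URL, as the equivalent
// Transformers.js task does.
async function describeImage(imageUrl) {
  // "mozilla/distilvit" may already be the default model for this task,
  // in which case modelHub/modelId could be omitted (an assumption; check
  // the example extension).
  await browser.trial.ml.createEngine({
    taskName: "image-to-text",
    modelHub: "huggingface",
    modelId: "mozilla/distilvit",
  });
  // Transformers.js image-to-text results carry a "generated_text" field.
  const result = await browser.trial.ml.runEngine({ args: [imageUrl] });
  return result[0]["generated_text"];
}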
This feature is built directly into Firefox, but we’ve also turned it into an example web extension that you can find in our source code and use as a basis to build your own: https://searchfox.org/mozilla-central/source/toolkit/components/ml/docs/extensions-api-example. For instance, it includes code to request the relevant permission and a model download progress bar.
We’d love to hear from you
This API is our first attempt at enabling the community to build on top of our Firefox AI Runtime. We want to make this API as simple and powerful as possible.
We believe that offering this feature to web extension developers will help us understand if and how such an API could be developed as a web standard in the future.
We’d love to hear from you and see what you are building with this.
Come say hi in the #firefox-ai channel on our dedicated Mozilla AI Discord. Invitation link: https://discord.gg/Jmmq9mGwy7
Last but not least, we’re giving a deep-dive talk at FOSDEM in Brussels, in the Mozilla room on Sunday, February 2nd. There will be many interesting talks in that room; see https://fosdem.org/2025/schedule/track/mozilla/