{"id":81570,"date":"2025-08-28T10:51:24","date_gmt":"2025-08-28T17:51:24","guid":{"rendered":"https:\/\/blog.mozilla.org\/?p=81570"},"modified":"2025-10-01T09:19:01","modified_gmt":"2025-10-01T16:19:01","slug":"speeding-up-firefox-local-ai-runtime","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/","title":{"rendered":"Speeding up Firefox Local AI Runtime"},"content":{"rendered":"\n<p>Last year we rolled out the <strong><a href=\"https:\/\/firefox-source-docs.mozilla.org\/toolkit\/components\/ml\/\">Firefox AI Runtime<\/a><\/strong>, the engine that quietly powers features such as <a href=\"https:\/\/hacks.mozilla.org\/2024\/05\/experimenting-with-local-alt-text-generation-in-firefox-nightly\/\">PDF.js generated alt text<\/a> and, more recently, <a href=\"https:\/\/support.mozilla.org\/en-US\/kb\/how-use-ai-enhanced-tab-groups\">our <em>smart tab grouping<\/em><\/a>. The system worked, but not quite at the speed we wanted.<\/p>\n\n\n\n<p>This post explains how we accelerated inference by replacing the default <strong><a href=\"https:\/\/github.com\/microsoft\/onnxruntime\">onnxruntime\u2011web<\/a><\/strong> that powers <a href=\"https:\/\/huggingface.co\/docs\/transformers.js\/en\/index\">Transformers.js<\/a> with its native C++ counterpart that now lives inside Firefox.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where we started<\/h2>\n\n\n\n<p>Transformers.js is the JavaScript counterpart to Hugging Face\u2019s Python library. 
Under the hood it relies on <strong>onnxruntime\u2011web<\/strong>, a WebAssembly (WASM) build of ONNX Runtime.<\/p>\n\n\n\n<p>A typical inference cycle:<\/p>\n\n\n\n<ol>\n<li><strong>Pre\u2011processing<\/strong> in JavaScript (tokenization, tensor shaping)<\/li>\n\n\n\n<li><strong>Model execution<\/strong> in WASM<\/li>\n\n\n\n<li><strong>Post\u2011processing<\/strong> back in JavaScript<\/li>\n<\/ol>\n\n\n\n<p>Even with warm caches, that dance crosses multiple layers. The real hotspot is the matrix multiplications, implemented with generic SIMD when running on CPU.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why plain WASM wasn\u2019t enough<\/h2>\n\n\n\n<p>WASM SIMD is great, but it can\u2019t beat hardware\u2011specific instructions such as NEON on Apple Silicon or AVX\u2011512 on modern Intel chips.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.firefox.com\/en-US\/features\/translate\/\">Firefox Translations<\/a> (which uses <em>Bergamot<\/em>) already proved that dropping down to native code speeds things up: it uses <strong>WASM built\u2011ins<\/strong>, small hooks that let WASM call into C++ compiled with those intrinsics. The project, nicknamed <em><a href=\"https:\/\/github.com\/mozilla\/gemmology\">gemmology<\/a><\/em>, works brilliantly.<\/p>\n\n\n\n<p>We tried porting that trick to ONNX, but the huge number of operators made a one\u2011by\u2011one rewrite unmaintainable. And each cold start still paid the JS\/WASM warm\u2011up tax.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Switching to ONNX C++<\/h2>\n\n\n\n<p>Transformers.js talks to ONNX Runtime through a <em>tiny<\/em> surface: it creates a session, pushes a tensor, and pulls a result. 
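<\/p>\n\n\n\n<p>In Transformers.js terms, that entire surface hides behind a single <code>pipeline()<\/code> call. A minimal sketch of how a feature consumes it (the model name is illustrative):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import { pipeline } from \"@huggingface\/transformers\";\n\n\/\/ Build the pipeline once; the ONNX backend underneath is an\n\/\/ implementation detail the feature never sees.\nconst captioner = await pipeline(\"image-to-text\", \"mozilla\/distilvit\");\n\n\/\/ Pre-processing, model execution and post-processing all\n\/\/ happen behind this one call.\nconst [output] = await captioner(\"photo.png\");\nconsole.log(output.generated_text);<\/code><\/pre>\n\n\n\n<p>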
This makes it simple to swap the backend without touching feature code.<\/p>\n\n\n\n<p>Our steps to achieve this were:<\/p>\n\n\n\n<ol>\n<li><strong>Vendor ONNX Runtime C++<\/strong> into the Firefox build.<\/li>\n\n\n\n<li><strong>Expose<\/strong> it to JavaScript via a thin WebIDL layer.<\/li>\n\n\n\n<li><strong>Wire<\/strong> Transformers.js to the new backend.<\/li>\n<\/ol>\n\n\n\n<p>From the perspective of a feature like PDF alt\u2011text, nothing changed: it still calls <code>await pipeline(\u2026)<\/code>. Underneath, tensors now go straight to native code.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Integration of ONNX Runtime into the build system<\/h2>\n\n\n\n<p>Upstream ONNX Runtime does not support all of our build configurations, and it\u2019s a large amount of code. As a consequence we chose not to check its sources into the tree. Instead, a configuration flag can be used to provide a compiled version of the ONNX runtime, which is automatically downloaded from <a href=\"https:\/\/taskcluster.net\/\">Taskcluster<\/a> (where we build it for a selection of supported configurations) or provided by downstream developers. This provides flexibility without slowing down our usual builds, and keeps maintenance low.<\/p>\n\n\n\n<p>Building ONNX on Taskcluster required some configuration changes and upstream patches. The goal was to find a balance between speed and binary size, while staying compatible with the native code requirements of the Firefox repo.<\/p>\n\n\n\n<p>Most notably:<\/p>\n\n\n\n<ul>\n<li>Building without exceptions and RTTI support required some upstream patches<\/li>\n\n\n\n<li>The default build configuration is MinSizeRel, and compilation uses LTO<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">The payoff<\/h2>\n\n\n\n<p>Because the native backend is a drop\u2011in replacement, we can enable it feature by feature and gather real\u2011world numbers. 
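<\/p>\n\n\n\n<p>One way to gather such numbers (a hypothetical micro-benchmark, not our actual harness) is to time the same pipeline call under each backend and compare medians:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Median wall-clock time of an async inference call over several runs.\nasync function medianTime(run, runs = 5) {\n  const samples = [];\n  for (let i = 0; i &lt; runs; i++) {\n    const t0 = performance.now();\n    await run();\n    samples.push(performance.now() - t0);\n  }\n  samples.sort((a, b) =&gt; a - b);\n  return samples[Math.floor(runs \/ 2)];\n}<\/code><\/pre>\n\n\n\n<p>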
Early benchmarks show <strong>2 to 10\u202f\u00d7 faster<\/strong> inference, with zero WASM warm\u2011up overhead.<\/p>\n\n\n\n<p>For example, the Smart Tab Grouping topic suggestion, which could be laggy on first run, is now noticeably snappier; it is the first feature we gradually moved to the new backend, starting with Firefox 142.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"1018\" height=\"658\" src=\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/image.png\" alt=\"Graph comparing inference latency of the WASM and C++ backends; the C++ backend is much faster\" class=\"wp-image-81582\" srcset=\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/image.png 1018w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/image-300x194.png 300w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/image-768x496.png 768w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/image-1000x646.png 1000w\" sizes=\"(max-width: 1018px) 100vw, 1018px\" \/><\/figure>\n\n\n\n<p>The image-to-text model used for the PDF.js alt-text feature also benefited from this change: on the same hardware, latency went from 3.5s to 350ms.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What\u2019s next<\/h2>\n\n\n\n<p>We\u2019re gradually rolling out this new backend to additional features throughout the summer, so all capabilities built on Transformers.js can take advantage of it.<\/p>\n\n\n\n<p>And with the C++ API at hand, we\u2019re planning to tackle a few long\u2011standing pain points and enable GPU support.<\/p>\n\n\n\n<p>These changes will ship in our vendored ONNX Runtime and give us the best possible performance for Transformers.js-based features in our runtime going forward.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
DequantizeLinear goes multi\u2011threaded<\/h3>\n\n\n\n<p>The DequantizeLinear operation was single\u2011threaded and often dominated inference time. While upstream work recently merged an improvement (<a href=\"https:\/\/github.com\/microsoft\/onnxruntime\/pull\/24818\">PR\u202f#24818<\/a>), we built a patch to spread the work across cores, letting the compiler auto\u2011vectorize the inner loops. The result is an almost linear speedup, especially on machines with many cores.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2. Matrix transposition goes multi-threaded<\/h3>\n\n\n\n<p>Similarly, inference often needs to transpose very large matrices (tens of megabytes). This operation was done naively with nested for loops. Switching to a multi-threaded, cache-aware tiled transposition scheme and leveraging SIMD let us take advantage of modern hardware, speeding this operation up by a supra-linear factor: typically twice the number of threads allocated to the task, for example an 8x speedup using 4 threads.<\/p>\n\n\n\n<p>This is because the naive loops were auto-vectorized but made poor use of CPU caches.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3. Caching the compiled graph<\/h3>\n\n\n\n<p>Before an inference can run, ONNX Runtime compiles the model graph for the current platform. On large models such as <em>Qwen 2.5 0.5B<\/em> this can cost up to five seconds on every launch.<\/p>\n\n\n\n<p>We can cache the compiled graph separately from the weights on the fly, shaving off anywhere from a few milliseconds to the full five seconds.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. Using GPUs<\/h3>\n\n\n\n<p>Currently, we\u2019ve integrated only CPU-based providers. The next step is to support GPU-accelerated ONNX backends, which will require more effort. 
This is because GPU support demands additional sandboxing to interact safely and securely with the underlying hardware.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>What is interesting about this migration is that we could improve performance this much while migrating features gradually, in complete isolation, and without having to change any feature code.<\/p>\n\n\n\n<p>While the speed-ups are already visible from a UX standpoint, we believe a lot of improvement can and will happen in the future, further increasing the efficiency of ML-based features and making them accessible to a wider audience.<\/p>\n\n\n\n<p>Have ideas, questions or bug reports? Ping us on Discord in the firefox-ai channel (<a href=\"https:\/\/discord.gg\/TBZXDKnz\">https:\/\/discord.gg\/TBZXDKnz<\/a>) or file an issue on Bugzilla; we\u2019re all ears.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last year we rolled out the Firefox AI Runtime, the engine that quietly powers features such as PDF.js generated alt text and, more recently, our smart tab grouping. The system worked, but not quite at the speed we wanted. 
This post explains how we accelerated inference by replacing the default onnxruntime\u2011web that powers Transformers.js with [&hellip;]<\/p>\n","protected":false},"author":571,"featured_media":81995,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[464274,464303],"tags":[],"coauthors":[464290,464308,464351],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Speeding up Firefox Local AI Runtime<\/title>\n<meta name=\"description\" content=\"We accelerated inference by replacing the default onnxruntime\u2011web that powers Transformers.js with its native C++ counterpart that now lives inside Firefox.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/\",\"url\":\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/\",\"name\":\"Speeding up Firefox Local AI 
Runtime\",\"isPartOf\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/Firefox-native-onnx.png\",\"datePublished\":\"2025-08-28T17:51:24+00:00\",\"dateModified\":\"2025-10-01T16:19:01+00:00\",\"author\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/aca87dfe79b30da7610884ccd00144ba\"},\"description\":\"We accelerated inference by replacing the default onnxruntime\u2011web that powers Transformers.js with its native C++ counterpart that now lives inside Firefox.\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#primaryimage\",\"url\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/Firefox-native-onnx.png\",\"contentUrl\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/Firefox-native-onnx.png\",\"width\":934,\"height\":916,\"caption\":\"This image shows the logo of Firefox with onnx written in the 
background.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.mozilla.org\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Speeding up Firefox Local AI Runtime\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#website\",\"url\":\"https:\/\/blog.mozilla.org\/en\/\",\"name\":\"The Mozilla Blog\",\"description\":\"News and Updates about Mozilla\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.mozilla.org\/en\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/aca87dfe79b30da7610884ccd00144ba\",\"name\":\"Tarek Ziad\u00e9\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/image\/4b0534bdcf6128df7f4dd34147fcc358\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/a75a18983be643c76d3219fb2ffc9aa7?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/a75a18983be643c76d3219fb2ffc9aa7?s=96&d=mm&r=g\",\"caption\":\"Tarek Ziad\u00e9\"},\"url\":\"https:\/\/blog.mozilla.org\/en\/author\/tziademozilla-com\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Speeding up Firefox Local AI Runtime","description":"We accelerated inference by replacing the default onnxruntime\u2011web that powers Transformers.js with its native C++ counterpart that now lives inside Firefox.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/","url":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/","name":"Speeding up Firefox Local AI Runtime","isPartOf":{"@id":"https:\/\/blog.mozilla.org\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#primaryimage"},"image":{"@id":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/Firefox-native-onnx.png","datePublished":"2025-08-28T17:51:24+00:00","dateModified":"2025-10-01T16:19:01+00:00","author":{"@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/aca87dfe79b30da7610884ccd00144ba"},"description":"We accelerated inference by replacing the default onnxruntime\u2011web that powers Transformers.js with its native C++ counterpart that now lives inside 
Firefox.","breadcrumb":{"@id":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#primaryimage","url":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/Firefox-native-onnx.png","contentUrl":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2025\/08\/Firefox-native-onnx.png","width":934,"height":916,"caption":"This image shows the logo of Firefox with onnx written in the background."},{"@type":"BreadcrumbList","@id":"https:\/\/blog.mozilla.org\/en\/firefox\/firefox-ai\/speeding-up-firefox-local-ai-runtime\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.mozilla.org\/en\/"},{"@type":"ListItem","position":2,"name":"Speeding up Firefox Local AI Runtime"}]},{"@type":"WebSite","@id":"https:\/\/blog.mozilla.org\/en\/#website","url":"https:\/\/blog.mozilla.org\/en\/","name":"The Mozilla Blog","description":"News and Updates about Mozilla","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.mozilla.org\/en\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/aca87dfe79b30da7610884ccd00144ba","name":"Tarek 
Ziad\u00e9","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/image\/4b0534bdcf6128df7f4dd34147fcc358","url":"https:\/\/secure.gravatar.com\/avatar\/a75a18983be643c76d3219fb2ffc9aa7?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a75a18983be643c76d3219fb2ffc9aa7?s=96&d=mm&r=g","caption":"Tarek Ziad\u00e9"},"url":"https:\/\/blog.mozilla.org\/en\/author\/tziademozilla-com\/"}]}},"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts\/81570"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/users\/571"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/comments?post=81570"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts\/81570\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/media\/81995"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/media?parent=81570"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/categories?post=81570"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/tags?post=81570"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/coauthors?post=81570"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}