Overview of pdf.js guts

Andreas posted a general overview of pdf.js. I’d like to briefly cover some more-technical parts of the renderer.

pdf.js (currently) parses raw arrays of bytes into streams of PDF “bytecode”, compiles the bytecode into JavaScript programs, then executes the programs. (Yes, it’s a sort of PDF JIT :).) The side effect of those programs is to draw on an HTML5 <canvas>. The commands in the bytecode include simple things like “draw a curve” and “draw this text here”, and more complicated things like filling areas with “shading patterns” specified by custom functions (see Shaon’s post). Additionally, the stream of commands itself, and other data embedded within it like fonts and images, might be compressed and/or encrypted in the raw byte array. pdf.js has basic support for decompressing some of these streams, with all of that code written in JavaScript.
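
To make the “compile, then execute” step concrete, here is a minimal sketch. It is not pdf.js’s actual code: the operator table, compileCommands, and makeRecordingContext are hypothetical names invented for illustration, and only a handful of PDF path operators are covered. The canvas method names (moveTo, lineTo, bezierCurveTo, rect, stroke, fill) are the real 2D context methods.

```javascript
// Hypothetical, simplified mapping from a few PDF path operators
// to the canvas 2D context methods that draw them.
const OPERATOR_TO_CANVAS = {
  m: "moveTo",
  l: "lineTo",
  c: "bezierCurveTo",
  re: "rect",
  S: "stroke",
  f: "fill",
};

// Compile a list of { op, args } commands into a single JavaScript
// function whose side effect is to draw on the context passed to it.
function compileCommands(commands) {
  const lines = commands.map(({ op, args }) => {
    const method = OPERATOR_TO_CANVAS[op];
    if (!method) {
      throw new Error(`Unsupported operator: ${op}`);
    }
    // Arguments are coerced to numbers before being written into the source,
    // so no arbitrary text from the document ends up in the generated code.
    const argList = args.map(Number).join(", ");
    return `ctx.${method}(${argList});`;
  });
  return new Function("ctx", lines.join("\n"));
}

// Stand-in for a real canvas context: records each call instead of drawing.
// Useful for trying the sketch outside a browser.
function makeRecordingContext() {
  const calls = [];
  const ctx = { calls };
  for (const method of Object.values(OPERATOR_TO_CANVAS)) {
    ctx[method] = (...args) => calls.push([method, ...args]);
  }
  return ctx;
}
```

In a browser, the compiled function would be called with canvas.getContext("2d") instead of the recording stand-in. A real renderer also has to handle graphics state, coordinate transforms, text, and images, none of which appear here.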

The rendering of fonts embedded in PDFs using web technologies is a big enough topic to merit its own blog post. A post might eventually appear on Vivien’s blog, but if not, somewhere else ;).

There are several ways to write a PDF renderer on top of the web platform; pdf.js’s current implementation, drawing to a <canvas>, is just one way. We chose canvas initially because it is (should be) the fastest way for us to draw to the screen. We want the first-paint of pages to be quick so that pdf.js startup feels zippy, and users can start reading content ASAP. Canvas is great for this.

Canvas has problems though. First, it’s missing many features needed to render PDFs. We’ve started adding some of these to the web platform, when it makes sense to do so, and there’s more to come. A second, and much bigger, problem is that while canvas’s immediate-mode design allows us to render with minimal overhead, it means that the user agent doesn’t have enough information to allow users to select text or navigate using accessibility interfaces (through a screen reader, for example). Printing canvases with high fidelity is another issue.

The web platform already offers a (potential) solution to these problems, however: SVG. SVG is richly featured, retained-mode, and has its own DOM. In theory, user agents should have text-selection, a11y, and printing support for SVG. (If they don’t, it needs to be added.) So, SVG provides (in theory) the features missing from <canvas>, just at a higher cost in the form of more overhead.

Putting all this together, we currently plan on doing a fast first-paint of pages using canvas, concurrently building an SVG document for the page in the background, and when the SVG document is ready, switching to that. Or if that doesn’t work well, we could implement text-selection (and hopefully a11y) in pdf.js itself, on top of canvas, possibly creating new web APIs along the way. Or if, say, font loading dominates the critical path to first-paint, we might only use SVG and forget canvas. It’s great to have these options available.
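
The first plan above (canvas for first paint, SVG swapped in later) can be sketched roughly as follows. This is only an illustration under stated assumptions: paintCanvas, buildSvg, and show are hypothetical callbacks that a viewer would supply, both paintCanvas and buildSvg are assumed to return promises, and keeping the canvas when the SVG build fails is just one possible fallback.

```javascript
// Render a page in two stages: a fast canvas paint first, then a swap
// to a richer SVG document once it is ready.
// paintCanvas, buildSvg and show are hypothetical, caller-supplied callbacks.
async function renderPageProgressively({ paintCanvas, buildSvg, show }) {
  // Kick off the slower SVG build right away so it overlaps with the
  // canvas paint. A failure is captured as a value rather than a
  // rejection, so it can be inspected after the first paint.
  const svgResult = buildSvg().then(
    (svg) => ({ ok: true, svg }),
    (error) => ({ ok: false, error })
  );

  // Fast first paint: show the canvas as soon as it is drawn.
  show(await paintCanvas());

  // Swap in the SVG document when (and if) it becomes available.
  const result = await svgResult;
  if (result.ok) {
    show(result.svg);
    return "svg";
  }
  return "canvas"; // keep showing the canvas if the SVG build failed
}
```

The return value tells the caller which representation ended up on screen, which is one way a viewer could decide whether it still needs canvas-based text selection.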

There’s a ton of work left on pdf.js, from implementing features to improving the user interface to exploring crazy ideas (like using WebGL to speed up rendering). The project is open and the code is libre: we’d love for you to get involved! Have a look at our GitHub and our wiki, or talk to us on IRC in #pdfjs.

Comments (31)

  1. Awesome work! I look forward to seeing the performance characteristics of the SVG version once you’ve done that (especially after jwatt’s display list-ification of SVG rendering is complete).

    Wednesday, June 15, 2011 at 21:33 #
  2. Ivan Enderlin wrote::

    Hey :-),

    Just a typo about Vivien’s blog: vingtetun.org/ and not vingetun.org/ ;-).

    Wednesday, June 15, 2011 at 22:43 #
  3. marek wrote::

    In the past I worked a lot with PDF, and I assume that canvas-backed rendering will lead to more or less bitmap-based PDFs rather than vector-based ones, which is where SVG natively leads. All DTP guys hate raster-based PDFs and much prefer vector-based ones, usually for scaling/filesize/disassembling purposes, so I’d be really glad for an SVG-backed library… Good luck with your work…

    Thursday, June 16, 2011 at 00:51 #
  4. Anton wrote::

    “The rendering of fonts embedded in PDFs using web technologies”
    Oh, please, pretty please, that would be one very interesting article!

    Thursday, June 16, 2011 at 02:14 #
  5. cjones wrote::

    We’re focused on displaying a PDF to users, which for us means rasterizing to a bitmap for display on a computer monitor. Both SVG and canvas can render to bitmaps that match the output device resolution, but with canvas a bit more work on the part of the rendering application is required.

    Thursday, June 16, 2011 at 02:29 #
  6. cjones wrote::

    Thanks, fixed. (Sorry, thought I’d posted this earlier …)

    Thursday, June 16, 2011 at 02:29 #
  7. cjones wrote::

    That article will definitely be written :).

    Thursday, June 16, 2011 at 02:30 #
  8. Anton wrote::

    Cool! Looking forward to it! :)

    Thursday, June 16, 2011 at 03:31 #
  9. Anton wrote::

    Btw., do you also plan to support CMYK?

    Thursday, June 16, 2011 at 03:54 #
  10. Rob Crowther wrote::

    > In theory, user agents should have text-selection, a11y, and printing support for SVG

    Does this mean you’re going to have a look at this six year old defect?

    https://bugzilla.mozilla.org/show_bug.cgi?id=292498

    Thursday, June 16, 2011 at 05:55 #
  11. Jeramie wrote::

    How can a client-side browser component only request a specific page from a server-side PDF without some server-side component to extract it individually?

    Thursday, June 16, 2011 at 08:02 #
  12. Brian wrote::

    Could you please quantify the performance difference between displaying a pdf using canvas and using svg?

    Thursday, June 16, 2011 at 08:11 #
  13. Kevin wrote::

    Thank you for going with a non-copyleft license on this. I do hope you will stick with that license as you add more and more features. :)

    About the project on reddit: http://www.reddit.com/r/javascript/comments/i0eke/pdf_reader_in_javascript_pdfjs_is_a_prototype_to/

    Thursday, June 16, 2011 at 10:45 #
  14. cjones wrote::

    (Sorry, wordpress seems to be eating my replies to comments.)

    Anton: Yes, we plan to support CMYK and other color spaces.

    Rob: Someone will, yes. It’s not clear whether browser-builtin SVG text-selection will work well enough for PDF text selection, but SVG text selection is generally useful enough that we should implement it and find out.

    Jeramie: PDFs keep tables that say, for page K, where the bytes of K begin in the .pdf and how many bytes K occupies. With that information, we can make an HTTP byte-range request (more specifically, an XMLHttpRequest with a “Range” header) to fetch only the bytes for page K.
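
    A minimal sketch of such a request, not taken from pdf.js: byteRangeHeader and fetchPageBytes are hypothetical helper names, and the offset and length are assumed to have already been read from the PDF’s tables (that lookup is not shown). It also assumes the server supports range requests and answers with 206 Partial Content.

```javascript
// Build the value of an HTTP Range header for `length` bytes starting at
// byte `offset`. Byte ranges are inclusive at both ends, hence the "- 1".
function byteRangeHeader(offset, length) {
  if (!Number.isInteger(offset) || !Number.isInteger(length) ||
      offset < 0 || length <= 0) {
    throw new RangeError("offset must be >= 0 and length must be > 0");
  }
  return `bytes=${offset}-${offset + length - 1}`;
}

// Fetch only the bytes of one page. `offset` and `length` are assumed to
// come from the PDF's own tables; `onDone` is called as (error, bytes).
function fetchPageBytes(url, offset, length, onDone) {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url);
  xhr.responseType = "arraybuffer";
  xhr.setRequestHeader("Range", byteRangeHeader(offset, length));
  xhr.onload = () => {
    // 206 Partial Content means the server honored the range. A plain 200
    // would mean it ignored the header and sent the whole file; this
    // sketch simply treats that as an error.
    if (xhr.status === 206) {
      onDone(null, new Uint8Array(xhr.response));
    } else {
      onDone(new Error(`Expected 206, got ${xhr.status}`));
    }
  };
  xhr.onerror = () => onDone(new Error("Network error"));
  xhr.send();
}
```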

    Brian: As soon as we know, we’ll post. The SVG backend doesn’t exist yet :).

    Thursday, June 16, 2011 at 11:52 #
  15. Paulius wrote::

    I wonder if any PDF forms technology (AcroForms or XFA) is going to be supported in the future? Not an expert here, but I think PDF forms might be valuable for transporting e-documents between computers.

    Friday, June 17, 2011 at 11:54 #
  16. cjones wrote::

    Paulius: It’s definitely possible, but very low on the list of priorities right now. Stay tuned for future developments :).

    Friday, June 17, 2011 at 13:08 #
  17. Jeramie wrote::

    Much obliged for the info, cjones. I never knew that. I tried your demo and it worked great, by the way. Keep up the good work.

    Friday, June 17, 2011 at 15:26 #
  18. Jens wrote::

    Hi,

    nice idea you have.
    But I think you didn’t understand the idea of PDF.
    The idea is to show content on different platforms and show it on all these platforms in the same way. That is the idea. As long as your prototype only works in a small number of browsers, it never represents the idea of PDF.
    A reference PDF reader for all relevant platforms can be used to show a PDF in the correct way.

    And believe me. You can never show complex PDFs correctly. This can only be done by 2 companies in the world today.

    Acrobat and ACCHSH.

    All other PDF readers work, but not correctly in PDF rendering.

    Sunday, June 19, 2011 at 10:09 #
  19. ThomasW wrote::

    “The rendering of fonts embedded in PDFs using web technologies”

    Why reinvent the wheel? Wouldn’t SVG fonts be perfect for this? In my opinion they *really* should be implemented. Webkit and Opera already support them. Anyone who wants to vote for them?

    https://bugzilla.mozilla.org/show_bug.cgi?id=119490

    Sunday, June 19, 2011 at 13:23 #
  20. cjones wrote::

    Jens: I’m not quite sure how to reply to your comment, because I think we’re arguing from the same ideas but we’re using the same language in different ways. Re: your second point: pdf.js is not intended to achieve feature parity with AcroRead, first because it’s not (sanely/legally) possible since there are proprietary Adobe PDF extensions, and second because our goal is to render the vast majority of ISO-compliant PDFs faithfully, not 100% of all PDFs ever created. I haven’t heard of ACCHSH so I’m not sure how it fits into the conversation.

    ThomasW: Since we (the pdf.js team) haven’t posted about how pdf.js renders fonts, it’s somewhat difficult to have this conversation. But, rest assured that (i) “reinventing the wheel” is the last thing we’re interested in doing; and (ii) SVG fonts would not help us.

    Monday, June 20, 2011 at 01:15 #
  21. J. McNair wrote::

    ThomasW: The biggest problem with the CURRENT VERSION of SVG Fonts is poor support for languages with “complex shaping” needs, like Hindi.

    In 2011, there’s no good reason to implement a “world-wide” web standard that is useless for literally hundreds of millions of people.

    Furthermore, to render the “vast majority of compliant PDFs on the web”, you will need complex shaping.

    Tuesday, June 21, 2011 at 11:31 #
  22. Baz wrote::

    Re the ‘big project’ for text selection: I did some work on fixing this in poppler, which had taken the approach mentioned (identifying & chaining textruns), but it works *really* badly for multicolumn text.

    What I did to fix this was to use the reading-order sort algorithm used in Ocropus [T Breuel: High performance document layout analysis. Symposium on Document Image Understanding Technology, pp.209-218 (2003)]. This is a massive improvement, but poppler still has issues because its paragraph detection is primitive; implementing the gutter detection part of the same paper would help with this.

    Adobe’s solution is tagged pdfs, which contain all the info to determine reading order, but trying to deal with tagged pdfs in poppler’s existing data structures is messy. It would have been better to deal with tagged PDFs first, then apply the reading-order heuristics above to infer a tagging structure for ‘normal’ pdfs.

    Another problem is that poppler suffers from a lack of an accessible API from the outset. Text and selections work by looking at text within a bounding box, rather than finding a start character and end character and selecting all characters in between. (like eg ATK.Text)

    It’d be nice if pdf.js, with its clean slate, can avoid these problems :)

    Monday, June 27, 2011 at 05:54 #
  23. ThomasW wrote::

    O.K., then my comment was a bit short-sighted. I thought that this complex shaping might be handled in form of ligatures, but if it can’t, then I’m remaining silent…

    Monday, June 27, 2011 at 08:56 #
  24. cjones wrote::

    Baz: That is incredibly helpful, thank you. I’ve linked your notes from our wiki. Of course, the code is at https://github.com/andreasgal/pdf.js, if you ever have some free time … ;)

    Monday, June 27, 2011 at 09:41 #
  25. Bashev wrote::

    This version does not support Cyrillic documents.

    Maybe this will be fixed in one of the next releases?

    Sunday, July 10, 2011 at 08:15 #
  26. cjones wrote::

    Bashev: Pdf.js should support Cyrillic documents (it doesn’t really know what alphabet it’s using to render). Have you tried http://andreasgal.github.com/pdf.js/multi_page_viewer.html ? If that’s still broken, can you let us know which PDF file isn’t working for you? Pdf.js still has lots of bugs ;).

    Sunday, July 10, 2011 at 14:16 #
  27. Leif Halvard Silli wrote::

    ACCHSH: http://www.puzzleflow.com/company

    Saturday, July 16, 2011 at 00:29 #
  28. abhilash wrote::

    When do you think you will start working on the SVG version? I would like to contribute to the code once basic text selection is possible.

    Wednesday, September 14, 2011 at 23:00 #
  29. shobo wrote::

    How can a client-side browser component only request a specific page from a server-side PDF without some server-side component to extract it individually?

    Can somebody please answer this question? I would really appreciate it!

    Thank you!

    Wednesday, November 23, 2011 at 07:39 #
  30. cjones wrote::

    abhilash: There’s an issue on file in the pdf.js github tracking the SVG backend. @notmasteryet made a prototype.

    shobo: HTTP byte-range requests.

    Wednesday, November 23, 2011 at 17:45 #
  31. po_chuang wrote::

    This idea is gorgeous!

    Monday, January 9, 2012 at 00:35 #