Towards tagged PDF

Ross Moore
Macquarie University, Sydney, Australia

This talk will demonstrate recent work by the author and Han The Thanh to enrich pdfTeX with primitives that allow the production of “tagged PDF”. As this is still very much work in progress, the talk will concentrate on presenting aspects of tagging that make its advantages easy to appreciate. These advantages include, but are not limited to:

  • substitution of Unicode characters, for glyph combinations from fonts that use encodings other than Unicode, via CMap resources and other techniques;
  • alternative text, to be read by screen-readers;
  • extraction of text from PDFs in XML format;
  • extraction of mathematical content, in MathML format.

Each of these aspects will be illustrated by examples constructed using an enhanced version of pdfTeX.

Also, I’ll try to explain the extra complexity of the internal PDF structures required for generating properly tagged structure and content. If there is sufficient time, this may be followed by a discussion of the adjustments needed to the LaTeX format and packages to facilitate the automatic production of properly tagged PDF that conforms to the ISO 32000-2 standard, also known as PDF/UA (Universal Accessibility, http://pdf.editme.com/PDFUA/). This standard includes MathML tagging of mathematical content; I wish to acknowledge Neil Soiffer (Design Science Inc., http://accessiblemath.dessci.com/) for motivation, much helpful advice, and testing concerning this aspect.