Exporting Content from QuarkXPress
by Danny Sofer, 4 April 2002
QuarkXPress is designed for expressing the physical layout of a document, not its formal structure. It is a tool designed for print, not the Web. This means that extracting content in a meaningful way from a QuarkXPress file is not always a trivial task.
This paper outlines the alternative approaches that can be taken to the large-scale conversion of QuarkXPress documents into a structured form suitable for online delivery, sets out the criteria required of software used for this purpose, and reviews existing products.
The key to exporting content prepared in QuarkXPress lies in translating the spatial layout information contained in the QuarkXPress file into the underlying logical structure of the documents contained in the file.
There are three critical steps to achieving the goal:
1. Identifying the boundaries between different documents or stories within a file;
2. Assembling the text elements within a document in the appropriate order;
3. Distinguishing between the different elements within a document, such as the headline, byline and body copy.
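As a rough illustration, the three steps can be sketched in Python. Everything here is an assumption made for the example: the `TextBox` record, its fields, the `story` label and the style names are hypothetical and do not correspond to any QuarkXPress API. Sorting by page, then top, then left is also a deliberately naive reading order that would need refining for multi-column layouts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextBox:
    """One text box lifted from a page. All fields are illustrative assumptions."""
    story: str   # hypothetical label saying which story the box belongs to
    page: int
    top: float   # distance from the top of the page
    left: float  # distance from the left edge of the page
    style: str   # name of the paragraph style applied to the text
    text: str


# Hypothetical mapping from style names to logical elements.
STYLE_TO_ELEMENT = {"Head 1": "headline", "Byline": "byline", "Body": "body"}


def build_stories(boxes):
    # Step 1: find story boundaries by grouping boxes that share a label.
    grouped = defaultdict(list)
    for box in boxes:
        grouped[box.story].append(box)

    stories = {}
    for name, members in grouped.items():
        # Step 2: put the boxes in reading order (page, then top, then left).
        members.sort(key=lambda b: (b.page, b.top, b.left))
        # Step 3: label each piece of text by the style applied to it.
        stories[name] = [
            (STYLE_TO_ELEMENT.get(b.style, "unknown"), b.text) for b in members
        ]
    return stories


boxes = [
    TextBox("lead", 1, 120.0, 36.0, "Body", "The council voted on Tuesday."),
    TextBox("lead", 1, 40.0, 36.0, "Head 1", "Council approves budget"),
    TextBox("brief", 1, 400.0, 36.0, "Body", "Road works begin Monday."),
    TextBox("lead", 1, 90.0, 36.0, "Byline", "By A. Reporter"),
]

for name, parts in build_stories(boxes).items():
    print(name, parts)
```

In practice the hard part is step 1: unless someone has labelled the boxes, the story boundaries have to be inferred from position, linking or naming conventions.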
A typical QuarkXPress document is unlikely to submit to the accurate completion of these three steps without either very careful preparation in its creation or a good deal of effort and guesswork after its delivery.
There are two broad approaches to conversion: add information to a document at design time, or reconstruct this information after the document has been delivered.
The design-time approach assumes that an organization is willing to make significant changes to its production processes to ensure that content being entered into a QuarkXPress document is appropriately marked up to allow for identification of its logical structure.
It is likely to be disruptive to current production work and require some degree of re-training of staff. It puts a burden of responsibility onto these staff and may add significantly to their existing workload.
On the other hand, documents prepared effectively in this way are likely to be easy to translate without further significant effort or resources.
Software taking the design-time approach is likely to need installing on every production computer.
The post-delivery approach does not necessarily impose significant additional demands on production staff, but it does require some degree of consistency in layout and in the application of styles so that the structure of content can be pieced together after initial work has been completed.
It is therefore likely to be less disruptive to current print-based production work, but it introduces an additional, and potentially laborious, step to the production process.
Software taking this approach will only need to be installed on those computers that are executing the conversion process.
Criteria for success
The aim is to produce content in a format suitable for electronic delivery, without adversely affecting print production.
In order to assist in judging the suitability of different solutions to this problem, it may be helpful to think in terms of the following twelve criteria for success: six of primary importance and six more that are somewhat less critical, more specific in nature or particular to certain circumstances.
• Reliability of tagging mechanism. The accuracy with which documents are tagged; ability to enforce tagging guidelines.
• Content grouping. The work required to ensure that separate elements of text are grouped effectively for output.
• Exception handling. The ease with which content that has been incorrectly or incompletely translated - perhaps text that has an unrecognised style applied to it or a document that has an incomplete set of elements - can be identified and fixed.
• Speed of delivery. The speed with which pages can be marked up, checked and converted.
• Ease of maintenance. Ease with which changes can be made to style mappings, grouping rules and to output formats.
• Impact on production. The additional work required of production staff; training requirements.
• Style mapping. The work required to set up the relationships between QuarkXPress styles and structured elements in the output.
• Character mapping. The ease with which characters that appear in QuarkXPress documents can be mapped to appropriate characters in the output text.
• Image export. Support for the export of images, association with text and conversion to appropriate file formats.
• Table support. Identification and export of table data.
• Addition of metadata. Support for adding metadata such as publication name and issue date.
• Management of output. How data is exported and made available to Web production staff.
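Several of the criteria above (style mapping, character mapping and exception handling) can be made concrete with a short Python sketch. The style names, the character substitutions and the list of required elements are all assumptions chosen for illustration; a real conversion would take them from the publication's own style sheets and output requirements.

```python
# Hypothetical style map: QuarkXPress style name -> output element.
STYLE_MAP = {"Head 1": "headline", "Byline": "byline", "Body": "body"}

# Hypothetical character map: typographic characters -> plain replacements.
CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # single curly quotes
    "\u201c": '"', "\u201d": '"',    # double curly quotes
    "\u2013": "-",                   # en dash
    "\ufb01": "fi", "\ufb02": "fl",  # fi and fl ligatures
})

# Elements that every story is assumed to need before it can be published.
REQUIRED = {"headline", "body"}


def convert(paragraphs):
    """Return (elements, problems) for a list of (style, text) pairs."""
    elements, problems = [], []
    for style, text in paragraphs:
        element = STYLE_MAP.get(style)
        if element is None:
            # Exception handling: report the paragraph rather than guess.
            problems.append(f"unrecognised style: {style!r}")
            continue
        elements.append((element, text.translate(CHAR_MAP)))
    missing = REQUIRED - {name for name, _ in elements}
    for name in sorted(missing):
        problems.append(f"missing element: {name}")
    return elements, problems


elements, problems = convert([
    ("Body", "\u201cIt\u2019s done,\u201d she said."),
    ("Pull quote", "Not mapped yet"),
])
print(elements)
print(problems)
```

Keeping the maps in plain data structures, separate from the conversion logic, is one way to address ease of maintenance: a new style or character can be handled by editing a table rather than the code.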
Products and vendors
There follows a list of vendors of software for extracting content from QuarkXPress documents. This list is unlikely to be exhaustive, but it should be representative.
The products on offer take a variety of approaches to data conversion, but none of them is able to cope with entirely unstructured documents and they all require a significant amount of effort by production personnel to ensure that documents are appropriately marked up.
No product is likely to be ideal in all circumstances; the best solution will depend on your requirements, on resources, and on the adaptability of your staff.
avenue.quark takes a design-time approach to conversion. Time-consuming to set up and use, it is not yet an obvious choice for converting large quantities of content to structured formats.
Price: £10,000 and up.
Developed in the UK by Press Computer Systems, Ed-exemble is the latest incarnation of a product originally called Compositor that has been in use at The Guardian for some years. It changed its name to Story Manager and was adopted by Associated Newspapers before being re-released as Ed-exemble in 2001.
The product allows structural information about a document to be added as it is being created in QuarkXPress, which can then be exported as XML. It is excellent at enforcing style guidelines, but needs to be installed on all production machines and is likely to have a significant effect on the practice of production personnel.
Price: from £5,000.
Developed in the UK by Easypress Technologies.
Like other products that take a post-delivery approach, AtomikXT offers no tools for ensuring the consistent use of styles and relies on their rigorous application at design time.
Its speed of delivery and ease of maintenance are good. Error handling has been improved in version 3.0, released early in 2002.
Price: $30,000 and up.
The Internet Content Publishing Suite has been developed in the US by PCI.
Developed by VisualCom Systems in Spain. Takes a post-delivery approach to conversion that relies on the consistent application of styles and accurate grouping of text boxes.
Price: $60,000 and up.
NAPS Translation System has been developed in the US by North Atlantic Publishing Systems Inc (www.napsys.com). It takes a post-delivery approach that relies on accurate use of styles.
Developed in Sweden by Infomaker.
Developed by Extensis for converting QuarkXPress pages directly into HTML. Not appropriate for large-scale content conversion.
Do it yourself
Pretty effective conversion utilities can be created using AppleScript and other scripting languages, such as Perl. On an ad hoc basis, these can be more effective than packaged solutions.
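To give a flavour of what such a utility might look like, here is a minimal sketch in Python rather than AppleScript or Perl. It assumes the text has already been exported in a simplified tagged format, invented for this example, in which a paragraph may begin with `@Style name:` and an untagged paragraph keeps the style of the one before it. QuarkXPress's own XPress Tags export is richer than this and would need more handling.

```python
import re

# Each paragraph is assumed to start with "@Style name:"; a paragraph with
# no tag keeps the style of the paragraph before it.
TAG = re.compile(r"^@([^:]+):(.*)$")


def parse_tagged(text, default_style="Normal"):
    """Split tagged text into (style, paragraph) pairs."""
    paragraphs = []
    style = default_style
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = TAG.match(line)
        if match:
            style, line = match.group(1), match.group(2)
        paragraphs.append((style, line.strip()))
    return paragraphs


sample = """@Head 1:Council approves budget
@Byline:By A. Reporter
@Body:The council voted on Tuesday.
The vote was unanimous.
"""

for style, paragraph in parse_tagged(sample):
    print(f"{style}: {paragraph}")
```

A script like this only works if styles have been applied consistently, which is the same constraint that applies to the packaged post-delivery products.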
QuarkXPress was never designed for marking up the structure of documents and is ill-equipped for doing so. It may not be the ideal tool for cross-platform content delivery, but while it remains in widespread use, there are ways to work around its limitations.
Different software products take different approaches to the problem. Some attempt to enforce standards at design time; others assume that these standards will be upheld voluntarily.
Either way, there is no immediate alternative to either employing staff to laboriously reconstruct pages after they have been produced or getting existing production staff to take a good deal more care in preparing QuarkXPress documents than they are likely to be used to.
Once the process of translating QuarkXPress files into some sort of structured form has been completed, further steps, such as translation into XML or importation into your favourite CMS product, are, or should be, trivial by comparison.
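To show how little is left to do at that point, the following sketch serialises an already-structured story as XML using Python's standard `xml.etree.ElementTree` module. The element names, the attribute names and the sample values are assumptions for illustration only, not a published schema.

```python
import xml.etree.ElementTree as ET


def story_to_xml(elements, publication, issue_date):
    """Serialise (element, text) pairs as a small XML document."""
    # Element and attribute names are illustrative, not a published schema.
    story = ET.Element("story", publication=publication, issued=issue_date)
    for name, text in elements:
        # ElementTree escapes characters such as "&" when serialising.
        ET.SubElement(story, name).text = text
    return ET.tostring(story, encoding="unicode")


xml_text = story_to_xml(
    [("headline", "Council approves budget"), ("body", "Costs & savings")],
    publication="Example Weekly",
    issue_date="2002-04-04",
)
print(xml_text)
```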