Difference between revisions of "Research/all-in-one publishing"

Revision as of 07:10, 24 September 2015

Introduction

The core of our research is the development of a workflow in which HTML and CSS files are used as the primary source for a publication. This publication is then made available for print, epub and a responsive website. This approach was chosen because it can potentially improve the publication workflow for the following reasons:

Both the design and the content can be updated at any time, fully independently of each other.
Every update in content or design can be exported to every available publication medium.
HTML and CSS are very widespread transparent file formats that have been used for decades without much change, and will remain this way for years to come. This makes them much more suitable for digital archiving than most proprietary formats.
For largely the same reasons, the resulting publication is very suitable for (collaborative) reuse and redesign.

About the process

For this research, we have chosen to take an existing publication and explore how it could be rebuilt from scratch using our newly developed workflow. The original publication was only designed for print (a pdf). Our goal was to replicate this design as closely as possible with our new print version, and adapt the design to also fit the other media.

The global process consisted of the following steps:

1. The original Word documents with the contents of the publication were converted to HTML files using Pandoc conversion software.
2. These HTML files were 'cleaned up', deleting everything design-related and leaving only structural information.
3. Additional HTML was added to improve semantic value and facilitate CSS styling.
4. CSS stylesheets were created for every output format.
5. The outputs were created and tested using Pandoc (for epub), PrinceXML and Weasyprint (for pdf), Chrome (for responsive web) and Firefox (for direct browser print) and checked for consistency. PrinceXML is proprietary software; Weasyprint is not.

Of course the process wasn't as linear as it appears here. It is better described as iterative, looping through the steps continuously. The commit history on GitHub gives a good picture of this iteration: https://github.com/arjensuijker/all-in-one-publishing/commits/master

Challenges

Because this research took us off the beaten track, we encountered many challenges that didn't have readily available solutions. We will discuss the challenges for every step of our process:

From Word to HTML

At this stage, we encountered very few problems. Pandoc proved easy to install and worked flawlessly.
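As a rough sketch of this step, the conversion can be scripted so that every Word file in a folder is converted in one go. The helper name `convert_docx` and the folder layout are assumptions made for illustration only; `-f`, `-t`, `-s` and `-o` are standard Pandoc options, and `-f docx` assumes a Pandoc version that includes a Word reader.

```shell
#!/bin/sh
# Convert every .docx file in a directory to a standalone HTML file.
# convert_docx is a hypothetical helper name, not part of Pandoc.
convert_docx() {
  dir=${1:-.}
  for doc in "$dir"/*.docx; do
    # Skip the unexpanded pattern when the directory holds no .docx files.
    [ -e "$doc" ] || continue
    # Write the HTML file next to its source, with the same base name.
    pandoc -f docx -t html -s "$doc" -o "${doc%.docx}.html"
  done
}
```

Calling `convert_docx source` would, for example, write `source/chapter1.html` next to a hypothetical `source/chapter1.docx`.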

Cleaning up the HTML

In Word, there are basically two methods to style a document: using styles, or manually changing the font properties for every heading and paragraph. The first method results in HTML files with a better (semantic) structure, because the styles can be directly translated to HTML tags. The second method creates very messy HTML files that require a lot of cleaning, because the HTML is littered with unnecessary styling information. The documents that we used as the source were somewhere in between: most styling was done correctly, but a lot of manual cleaning was still needed. Additionally, most figures had to be extracted manually.
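Part of this cleanup could in principle be automated. The following is a minimal sketch, not the method we used: it assumes the unwanted styling sits in double-quoted `style` attributes, and the helper name `strip_inline_styles` is hypothetical. Regular expressions are a fragile way to edit HTML, so the result would still need a manual check.

```shell
#!/bin/sh
# Remove inline style attributes from HTML read on standard input.
# Assumption: attributes are written as style="..." with double quotes.
strip_inline_styles() {
  sed -E 's/[[:space:]]+style="[^"]*"//g'
}
```

For example, `strip_inline_styles < chapter1.html > chapter1.clean.html` would leave tags and classes in place and drop only the `style` attributes (the file names are placeholders).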

Adding semantic structure and classes

Before we started adding this information to the HTML, we walked through the publication. Together we decided what would be the correct semantic structure and added this as annotations to the file. We could then work in a parallel fashion, individually implementing the structure in the html files. An issue we encountered while defining the structure is that most semantic HTML elements are aimed at the web, so deciding on a suitable semantic structure for both print and web was often a matter of compromise. Because we knew what design we were aiming at, it was easier for us to decide what elements should have classes and which should not. If the design is not ready yet, this may be more of a challenge and will probably change during the course of the project.

Creating the CSS

This was by far the most challenging part of this project, so we will discuss the challenges separately for every output format.

Web CSS

For the website, the whole design of the book had to be reimagined. Pages do not exist, information doesn't have to be linear and many interactive possibilities open up. We decided to keep the website largely linear because this seemed to fit the content well. Some interactivity was added: the index was redesigned as a navigation menu and the footnotes were turned into tooltips. Because the font (Metric) was not licensed for online use, we had to find a similar font. This proved to be very difficult, so we had to compromise. We tried to avoid JavaScript where possible, as it can decrease compatibility and increase the complexity of the code. The only exception we made was for the footnotes, because this was not possible in CSS alone. Numbering of figures, pages and tables was also done in CSS. This worked quite well, although browser support is somewhat limited. At a later stage we realized that this may not have been the optimal solution, since this numbering is vital information and not just a matter of styling.

Responsive web CSS

Instead of creating a separate mobile output, we made the regular website responsive to screen size. The challenges we encountered while doing this were not much different from the challenges that come with all mobile websites. Even though these challenges also belong to our research topic, they do not represent the most innovative aspects of it. For this reason, documentation regarding this output seems redundant to us.

Epub CSS

Epub stylesheets proved to be very badly documented. Every device has its own interpretation of CSS, and almost all of them are quite limited. Advanced CSS like numbering works on almost none of the available devices.

We used Pandoc to create the epub, and one limitation was that Pandoc could only include one stylesheet. We had divided our styles into a global stylesheet, a print stylesheet and a specific epub stylesheet, all of which were needed for the epub. To solve this, we used a command-line utility to merge these stylesheets into a single file, which we could then pass to Pandoc.

This resulted in the following commands for Windows:

type "css\style.css" "css\epub.css" "css\print.css" > "css\epub_composite.css"
pandoc container.html -o book.epub --epub-cover-image=img\cover.jpg --epub-stylesheet=css\epub_composite.css
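
On Linux or macOS, the same merge can be done with `cat`. The sketch below is an assumed equivalent of the Windows `type` command above, not something we tested as part of this research; `merge_css` is a hypothetical helper name.

```shell
#!/bin/sh
# Concatenate several stylesheets into one file, in the order given.
# Usage: merge_css OUTPUT INPUT...
# merge_css is a hypothetical helper name used for illustration.
merge_css() {
  out=$1
  shift
  cat "$@" > "$out"
}
```

With the file names from the Windows commands, this would be called as `merge_css css/epub_composite.css css/style.css css/epub.css css/print.css`, followed by the same `pandoc` command with forward slashes in the paths. The order of the input files matters, because later rules override earlier ones of equal specificity. Newer Pandoc releases appear to use `--css` in place of `--epub-stylesheet`, so it is worth checking `pandoc --help` for the installed version.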

Print CSS

Because one of our team members had good experiences in the past with Chrome's print functionality, this was the browser we initially used for testing. This turned out to be a mistake, because Chrome's printing functionality has deteriorated lately (possibly caused by its change of rendering engine). Many hours were wasted trying to get the design to work in Chrome, after which we decided to focus on Firefox. This proved to be much easier, and most CSS worked correctly.

Prince CSS

None of the team members had used PrinceXML before, so the process was largely trial and error in the beginning. There were issues with the CSS files used for the web version: they had to be imported in the CSS file used for Prince, even though they were already being requested in the HTML head.

[Figure: The HTML head, where the CSS files are requested.]

[Figure: The import command in prince.css (seems redundant but is actually necessary).]

Prince has some useful features, such as retrieving content from HTML elements and assigning it to variables (useful if you need to use a chapter's title in the page footer, for example).

One possible downside is that PrinceXML is proprietary software.

To use Prince, a command like the one below is needed:

prince container.html -s css/prince.css book.pdf

Documentation on how to use and install PrinceXML can be found here: http://www.princexml.com/doc/

Weasyprint CSS

Weasyprint works similarly to PrinceXML. However, it is not proprietary software. It is trickier to install than Prince and does not have all the features that we found in PrinceXML.

weasyprint container.html -s css/weasyprint.css weasybook_final.pdf

Documentation:

how to install http://weasyprint.org/docs/install/#by-platform
command-line API http://weasyprint.org/docs/api/#command-line-api

Collaboration

We worked in parallel on the different outputs, which worked well but sometimes caused small hiccups. The most common problem was caused by the fact that we used multiple CSS stylesheets for every output. For example, when one person was working on the print stylesheet, this could accidentally break the pdf export.

Conclusion

Overall, this workflow proved to work well. We needed some time to get used to the process, but it already appears to be a viable alternative to traditional workflows. The biggest challenge is probably that every member of the development team needs a fair bit of technical skill together with an eye for design. If such a team is available, this process can streamline the entire publication process.

Future directions

Because this process is based on HTML and CSS, it can be combined with several existing technologies and platforms. For example, in the future we plan to use a wiki as the source. Wikis allow for the collaborative creation of content, which can easily and automatically be converted to HTML. This HTML can then serve as the source file for the publication process that we just described.

Another possible future research subject is the possibility of developing a WYSIWYG editor that strongly enforces a semantically and structurally correct layout. This would vastly simplify the first two steps of our process, and ideally render them obsolete.

Output previews

All-in-one Publishing
(RESPONSIVE) WEBSITE	EPUB	PRINCE pdf	WEASYPRINT pdf
VISIT	File:All-in-one-publishing-epub.zip

