Publisher:All-in-One Publishing documentation
Step 1: Extracting the structure
Our source material is a PDF document that carries little to no structural information beyond its visual layout. Our first instinct was to convert the PDF directly into an HTML file and add the structural information there. But because we are working in a team, it was important to define the structure before implementing it, to avoid miscommunication later in the process.
Therefore we worked together on defining the structure and added this information as comments to the original PDF file. Defining the structure first made it much easier for team members to work separately on implementing it in the HTML files. Apart from structural information, we also predefined the CSS classes that would be needed for the layout and added these to the PDF file as well. All of the files were placed in a shared Google Drive folder.
Step 2: Implementing the structure
Pandoc does not offer the functionality to convert PDF to HTML. Therefore, we found the source .docx files and converted these to HTML with Pandoc. This worked surprisingly well: the resulting HTML looked very clean. The result was four HTML files, one for each of the four chapters of the book. We started by adding the right elements to these HTML files, along with classes that might be needed for layout. In choosing the HTML elements, we used HTML5 semantic elements as much as we could, both for clarity and for standards compliance.
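The .docx conversion described above could be scripted roughly as follows. This is a minimal sketch, not a record of the commands we actually ran: the function names, the choice of the html5 writer, and the --standalone flag are assumptions, and Pandoc must be installed and on the PATH for the conversion itself to run.

```python
import subprocess
from pathlib import Path
from typing import List


def pandoc_command(docx_path: Path, out_dir: Path) -> List[str]:
    """Build the Pandoc call that turns one .docx chapter into an HTML5 file."""
    # The output file keeps the chapter's name, with an .html extension.
    html_path = out_dir / (docx_path.stem + ".html")
    return [
        "pandoc",
        str(docx_path),
        "--from", "docx",
        "--to", "html5",
        "--standalone",  # assumption: emit a full document, not a fragment
        "--output", str(html_path),
    ]


def convert_chapters(src_dir: Path, out_dir: Path) -> None:
    """Convert every .docx file in src_dir; needs Pandoc on the PATH."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for docx_path in sorted(src_dir.glob("*.docx")):
        # check=True raises CalledProcessError if Pandoc reports a failure.
        subprocess.run(pandoc_command(docx_path, out_dir), check=True)
```

Calling convert_chapters(Path("source"), Path("html")) would then write one HTML file per chapter; both directory names are placeholders. Keeping the command construction in its own function makes it easy to inspect the exact Pandoc call before running it.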