Courses/EPUB/day2



This session will be dedicated to producing ePubs, without having to go through the tedious process of creating them from scratch.

To do that, we will walk through one possible workflow for arriving at an ePub.

Although it is not the only route for creating ePubs, nor necessarily the best, it is a simple and consistent one that can yield quality results and serve as a starting point from which you can develop your own workflows.

Disclaimer: Much of the information contained in this page was taken from

D. P. T. Collective. From Print to Ebooks: A Hybrid Publishing Toolkit for the Arts. Vol. 1. Institute of Network Cultures, 2014. http://networkcultures.org/blog/publication/from-print-to-ebooks-a-hybrid-publishing-toolkit-for-the-arts/.

Tools

Pandoc

Pandoc is the 'Swiss-army knife' of text converters, an open source software application able to convert between a wide variety of document formats or, to be more precise, markup languages.

In the workflow we'll be working on, Pandoc is the central tool for converting documents between different formats, until we arrive at an ePub.

If we happen to start with a Microsoft Word document, Pandoc can be used to convert that document to Markdown, and later to convert the Markdown to EPUB.

Note: Pandoc can only convert between markup languages. If content is not in a markup language, for example if it is in a PDF or doc file, Pandoc cannot convert it. Yet there are ways around this. Microsoft Word's docx and its open source counterpart odt are essentially XML-based file formats (XML being a markup language). This allows Pandoc to convert them to other markup languages.

Markup language

A markup language consists of a set of symbols inserted in a plain text document in order to define additional attributes such as formatting instructions (bold, italics, etc.), structuring instructions (chapters, headings, references, etc.) or metadata. Based on the information contained in the markup code, software applications can render the text document for a specific output medium (such as a screen or print).


Some common markup languages are: HTML, XML, Markdown, Mediawiki syntax, LaTeX, ICML (InCopy files).


What follows is the same section of text marked by different markup languages:

Markdown:

# Revenge of the Text

There is a room in the **Musée d’Orsay** that I call the *room of possibilities*.

That room contains:

* a snow flake
* the end of a cloud
* a bit of nothing

HTML:

<h1 id="revenge-of-the-text">Revenge of the Text</h1>
<p>There is a room in the <strong>Musée d’Orsay</strong> that I call the <em>room of possibilities</em>.</p>
<p>That room contains:</p>
<ul>
<li>a snow flake</li>
<li>the end of a cloud</li>
<li>a bit of nothing</li>
</ul>

Mediawiki:

= Revenge of the Text =

There is a room in the '''Musée d’Orsay''' that I call the ''room of possibilities''.

That room contains:

* a snow flake
* the end of a cloud
* a bit of nothing

ICML (XML):

<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/Header1">
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content>Revenge of the Text</Content>
  </CharacterStyleRange>
</ParagraphStyleRange>
<Br />
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/Paragraph">
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content>There is a room in the </Content>
  </CharacterStyleRange>
  <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/Bold">
    <Content>Musée d’Orsay</Content>
  </CharacterStyleRange>
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content> that I call the </Content>
  </CharacterStyleRange>
  <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/Italic">
    <Content>room of possibilities</Content>
  </CharacterStyleRange>
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content>.</Content>
  </CharacterStyleRange>
</ParagraphStyleRange>
<Br />
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/Paragraph">
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content>That room contains:</Content>
  </CharacterStyleRange>
</ParagraphStyleRange>
<Br />
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/BulList &gt; first" NumberingContinue="false">
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content>a snow flake</Content>
  </CharacterStyleRange>
</ParagraphStyleRange>
<Br />
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/BulList">
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content>the end of a cloud</Content>
  </CharacterStyleRange>
</ParagraphStyleRange>
<Br />
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/BulList">
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content>a bit of nothing</Content>
  </CharacterStyleRange>
</ParagraphStyleRange>

Converting between markups

In the previous examples, the same semantic meaning is expressed using different symbols, depending on the chosen markup language; the meaning (the result in the rendered text) is nevertheless the same.

Because these are representations of the same meaning using different sets of symbols, conversion from one markup to another is relatively straightforward.

Those conversions are Pandoc's job.

Converting with Pandoc

To do a simple conversion we need only an input file in any of the allowed markups.

pandoc input.html --from html --to markdown --standalone --atx-headers -o output.md

Meaning:

  • input.html - Specifies the input file
  • --from=/-f - Specifies the input format
  • --to=/-t - Specifies the output format
  • --standalone/-s - (optional) Produces a standalone output with an appropriate header and footer (important for conversions to HTML or EPUB)
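  • --atx-headers - (optional) Use ATX style headers (## heading2) in markdown. Newer Pandoc releases replace this flag with --markdown-headings=atx, so check pandoc --help if the flag is rejected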
  • --output=/-o - Specifies the output file

Word document

If your starting document is the typical MS Word document, Pandoc does a good job at converting it.

However you need to pay attention to a few details:

  • The Word file must be saved as .docx - this is an XML-based format that Pandoc can read
  • The Word file should make use of styles (Heading 1, Heading 2, etc.), in order to give the document a structure.


When converting docx files to another markup you can specify: --extract-media=DIR. This option extracts images and other media contained in the docx to the directory you specify.

Markdown

John Gruber, with substantial contributions from Aaron Swartz, created the Markdown language in 2004 with the goal of enabling people "to write using an easy-to-read, easy-to-write plain text format, and optionally convert it to structurally valid XHTML (or HTML)".

Markdown is a simple markup language which uses common, easily readable symbols such as #, * and _ to define document formatting.

Markdown was originally developed for blogs, as a quick and easy way of writing texts that would eventually be converted into HTML.

Its ease of reading and writing, but also its limited scope (e.g. titles can only look like titles if they are marked as headings), make it ideal as an intermediary format between the source manuscript and the EPUB publication.

HTML snippets can also be added to a Markdown file. When the file is converted to HTML/EPUB, the added HTML is retained.

Markdown Software

Markdown Syntax

https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet

http://daringfireball.net/projects/markdown/syntax


Markdown Flavors

There is no Markdown standard, but instead a large number of flavors or dialects.

Standard Markdown - a kind of plain flavor Markdown - doesn't include necessary features such as footnotes, tables, mathematical formulas, cross-references and bibliographies. Therefore flavors such as MultiMarkdown and Pandoc Markdown were developed to fulfill these needs.

Note: When you convert something in pandoc to/from markdown, you'll be using Pandoc's Markdown.


Challenge

Create an EPUB from the material you brought.

You should do it in two or three steps:

  1. prepare your material, structuring it semantically
  2. using Pandoc convert your original material to a Markdown file (edit it if you want/need to)
  3. using Pandoc generate an EPUB out of your Markdown file

Pandoc conversion to EPUB3

pandoc \
input.md \
--from markdown \
--to epub3 \
--self-contained \
--epub-chapter-level=2 \
--epub-stylesheet=styles.css \
--epub-cover-image=cover.png \
--epub-metadata=metadata.xml \
--toc-depth=2 \
-o book.epub