Today I Learned: Making ebooks

14 Jan. 2017 - 3 min

So while I was uploading ebooks to my ebook reader, a question popped into my head: can we create ebooks easily? To be clear, I’m talking about making an epub file, not hacking any DRM or whatever.

So, I did some research and found out that the structure of an epub is pretty simple! It is just a bunch of HTML files zipped together, plus a few XML files describing them. I invite you to unzip any DRM-free epub to see the following structure for yourself:

epub
├── META-INF
│    └── container.xml
├── OEBPS
│    ├── cover.jpg
│    ├── 1-cover.xhtml
│    ├── 2-copyright.xhtml
│    ├── ...
│    ├── style.css
│    ├── toc.xhtml or toc.ncx
│    └── content.opf
└── mimetype

Let’s try to understand this architecture.

mimetype

Pretty easy to guess: it declares the MIME type of the file. It contains a single line of text and nothing else, which in this case will be: application/epub+zip
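To make this concrete, here is a minimal Python sketch that checks the mimetype entry of an epub. It assumes the usual EPUB container rules (mimetype is the first entry in the zip, stored without compression, and contains exactly application/epub+zip). The names check_mimetype and build_sample are mine, purely for illustration, and the sample archives are built in memory rather than taken from a real book.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def check_mimetype(epub_file):
    """Return a list of problems found with the mimetype entry (empty if none)."""
    problems = []
    with zipfile.ZipFile(epub_file) as archive:
        entries = archive.infolist()
        if not entries or entries[0].filename != "mimetype":
            problems.append("mimetype is not the first entry")
            return problems
        first = entries[0]
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype is compressed")
        if archive.read(first) != EPUB_MIMETYPE:
            problems.append("mimetype has unexpected content")
    return problems


def build_sample(compress_mimetype):
    """Build a tiny in-memory archive (hypothetical helper, demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        method = zipfile.ZIP_DEFLATED if compress_mimetype else zipfile.ZIP_STORED
        archive.writestr("mimetype", EPUB_MIMETYPE, compress_type=method)
    buffer.seek(0)
    return buffer


print(check_mimetype(build_sample(compress_mimetype=False)))
print(check_mimetype(build_sample(compress_mimetype=True)))
```

To try it on a real file, pass the path of any DRM-free epub to check_mimetype instead of the in-memory sample.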

container.xml

<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
<rootfiles>
	<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>

container.xml tells the reader where to find content.opf (the full-path attribute) and what its media type is. OEBPS stands for Open eBook Publication Structure.
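As a rough illustration of how a reader could use this file, the sketch below parses the container.xml shown above with Python's standard library and extracts the path and media type of each rootfile. The function name find_rootfiles is my own, and a real reader would read the file from the archive instead of a hard-coded string.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the container.xml above.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = b"""<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
<rootfiles>
	<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>
"""


def find_rootfiles(container_xml):
    """Return (full-path, media-type) pairs for every rootfile element."""
    root = ET.fromstring(container_xml)
    return [
        (rootfile.get("full-path"), rootfile.get("media-type"))
        for rootfile in root.findall("c:rootfiles/c:rootfile", CONTAINER_NS)
    ]


print(find_rootfiles(CONTAINER_XML))
```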

content.opf

content.opf is the package file, written in XML. It contains basically all the metadata of the ebook, the manifest (the list of every file in the book) and the spine (the order in which the pages are read).

<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uid" version="2.0">
		<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" 
		xmlns:opf="http://www.idpf.org/2007/opf">
			<dc:title id="en_title" 
			xml:lang="en-gb">ENTER BOOK TITLE HERE</dc:title>
			<dc:creator id="creator_aut" opf:file-as="SURNAME, FIRST NAME"
			opf:role="aut">AUTHOR NAME</dc:creator>
			<dc:identifier xmlns:dc="http://purl.org/dc/elements/1.1/" 
			id="uid" opf:scheme="ISBN">urn:isbn:ISBN NUMBER (no hyphens)
			</dc:identifier>
			<dc:description id="en_description" 
			xml:lang="en-us">eBook, NUMBER pages</dc:description>
			<dc:publisher id="en_publisher" 
			xml:lang="en-us">YOUR/COMPANY</dc:publisher>
			<dc:date id="date_1" opf:event="creation">YEAR</dc:date>
			<dc:rights id="en_rights" 
			xml:lang="en-us">Copyright YEAR AUTHOR/COMPANY NAME</dc:rights>
			<dc:language id="en_language">en-gb</dc:language>
			<dc:type id="en_type_1">Non-fiction</dc:type>
			<meta name="cover" content="coverimage" />
		</metadata>
	<manifest>
		<item id="toc" media-type="application/x-dtbncx+xml" 
		href="toc.ncx"></item>
		<item id="item1" media-type="application/xhtml+xml" 
		href="1-cover.xhtml"></item>
		<item id="item2" media-type="application/xhtml+xml" 
		href="2-copyright.xhtml"></item>
		<!-- other items here -->
		<item id="coverimage" href="cover.jpg" media-type="image/jpeg"></item>
		<item id="image1" href="img-1-1.jpg" media-type="image/jpeg"></item>
		<item id="css" href="style.css" media-type="text/css"></item>
	</manifest>
	<spine toc="toc">
		<itemref idref="item1"/>
		<itemref idref="item2"/>
		<!-- other itemrefs here -->
	</spine>
	<guide>
		<reference href="1-cover.xhtml" type="cover" title="Cover"/>
	</guide>
</package>
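The link between the manifest and the spine is easier to see in code. Here is a small Python sketch that resolves each itemref of the spine to the href declared in the manifest, which gives the reading order. The sample is a trimmed copy of the file above (metadata left out for brevity), and reading_order is a name I made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Trimmed version of the content.opf above: manifest and spine only.
SAMPLE_OPF = b"""<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uid" version="2.0">
  <manifest>
    <item id="toc" media-type="application/x-dtbncx+xml" href="toc.ncx"/>
    <item id="item1" media-type="application/xhtml+xml" href="1-cover.xhtml"/>
    <item id="item2" media-type="application/xhtml+xml" href="2-copyright.xhtml"/>
  </manifest>
  <spine toc="toc">
    <itemref idref="item1"/>
    <itemref idref="item2"/>
  </spine>
</package>
"""


def reading_order(opf_bytes):
    """Map every spine itemref to the href of the matching manifest item."""
    package = ET.fromstring(opf_bytes)
    hrefs_by_id = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    return [
        hrefs_by_id[itemref.get("idref")]
        for itemref in package.findall("opf:spine/opf:itemref", NS)
    ]


print(reading_order(SAMPLE_OPF))
```

Note that toc.ncx is in the manifest but not in the spine: it is referenced through the toc attribute of the spine instead.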

toc.ncx

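TOC stands for Table Of Contents and NCX stands for Navigation Control file for XML. The format itself does not come from EPUB: it is borrowed from the DAISY talking-book standard, as the DOCTYPE below shows. EPUB2 requires it, while EPUB3 replaces it with toc.xhtml and only keeps the NCX for backward compatibility.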

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">

<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">

	<head>
		<meta name="dtb:uid" content="urn:isbn:ISBN NUMBER (no hyphens)"/>
		<meta name="dtb:depth" content="2"/>
		<meta name="dtb:totalPageCount" content="0"/>
		<meta name="dtb:maxPageNumber" content="0"/>
	</head>

	<docTitle>
		<text>ENTER BOOK TITLE HERE</text>
	</docTitle>

	<navMap>
		<navPoint id="navPoint-1" playOrder="1">
			<navLabel><text>Cover</text></navLabel>
			<content src="1-cover.xhtml"/>
		</navPoint>
		<navPoint id="navPoint-2" playOrder="2">
			<navLabel><text>Copyright</text></navLabel>
			<content src="2-copyright.xhtml"/>
		</navPoint>
		<navPoint id="navPoint-3" playOrder="3">
			<navLabel><text>Contents</text></navLabel>
			<content src="3-TOC.xhtml"/>
		</navPoint>
		<!-- etc. -->
	</navMap>

</ncx>
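To see what a reader gets out of this file, here is a minimal Python sketch that lists the entries of the navMap in play order. It only handles top-level navPoint elements (nested ones are ignored), the sample is a shortened copy of the file above without the DOCTYPE and head, and list_nav_points is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Shortened version of the toc.ncx above: navMap only.
SAMPLE_NCX = b"""<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
  <navMap>
    <navPoint id="navPoint-1" playOrder="1">
      <navLabel><text>Cover</text></navLabel>
      <content src="1-cover.xhtml"/>
    </navPoint>
    <navPoint id="navPoint-2" playOrder="2">
      <navLabel><text>Copyright</text></navLabel>
      <content src="2-copyright.xhtml"/>
    </navPoint>
  </navMap>
</ncx>
"""


def list_nav_points(ncx_bytes):
    """Return (playOrder, label, src) tuples for the top-level navPoints."""
    root = ET.fromstring(ncx_bytes)
    entries = []
    for point in root.findall("ncx:navMap/ncx:navPoint", NCX_NS):
        label = point.findtext("ncx:navLabel/ncx:text", namespaces=NCX_NS)
        src = point.find("ncx:content", NCX_NS).get("src")
        entries.append((int(point.get("playOrder")), label, src))
    return sorted(entries)


for order, label, src in list_nav_points(SAMPLE_NCX):
    print(order, label, src)
```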

toc.xhtml

This table of contents is the one defined for EPUB3. By the way, the entries of the nav section must match the files declared in content.opf.

<?xml version='1.0' encoding='UTF-8'?>

<html xmlns="http://www.w3.org/1999/xhtml" 
xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en">
<head>
	<title>TITLE</title>
	<meta name="dtb:uid" content="..."/>
	<meta name="dtb:depth" content="1"/>
	<meta name="dtb:generator" content="..."/>
	<meta name="dtb:totalPageCount" content="0"/>
	<meta name="dtb:maxPageNumber" content="0"/>
</head>
<body epub:type="frontmatter">
	<nav epub:type="toc">
	<ol>
		<li>
		<a href="1-cover.xhtml" id="np-1">Cover</a>
		</li>
		<!-- etc. -->
	</ol>
	</nav>
</body>
</html>
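Since this file is plain XHTML, the same kind of parsing works here too. The sketch below collects the links found inside the nav element whose epub:type is toc. The sample is a simplified version of the file above with a second entry added for the demo, and toc_links is a name I chose for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
# ElementTree exposes the epub:type attribute under its full namespace.
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# Simplified version of the toc.xhtml above (second entry added for the demo).
SAMPLE_NAV = b"""<?xml version='1.0' encoding='UTF-8'?>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en">
<head><title>TITLE</title></head>
<body>
  <nav epub:type="toc">
    <ol>
      <li><a href="1-cover.xhtml" id="np-1">Cover</a></li>
      <li><a href="2-copyright.xhtml" id="np-2">Copyright</a></li>
    </ol>
  </nav>
</body>
</html>
"""


def toc_links(nav_bytes):
    """Return (label, href) pairs for every link inside the toc nav element."""
    root = ET.fromstring(nav_bytes)
    links = []
    for nav in root.iter(f"{{{XHTML}}}nav"):
        if nav.get(EPUB_TYPE) != "toc":
            continue  # skip other nav elements, such as landmarks
        for anchor in nav.iter(f"{{{XHTML}}}a"):
            links.append((anchor.text, anchor.get("href")))
    return links


print(toc_links(SAMPLE_NAV))
```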

Sometimes both the .ncx and the .xhtml are present, because .ncx is used by EPUB2 and .xhtml by EPUB3. Publishers ship both files to make sure the epub can be read everywhere…

ZIP to EPUB

Even if it is easier to create an epub with software like Calibre, you can make one from the CLI. However, you cannot really “just zip it and rename to .epub”, because the mimetype file has to be stored uncompressed.

# Zip options:
# -x   exclude the listed files
# -r   recurse into directories
# -D   do not add directory entries
# -9   compress better
# -X   eXclude eXtra file attributes
# -0   store only (no compression)

# Run these from inside the unzipped epub folder.
# ../book.epub is a placeholder for the path of the new epub file.

# create the epub with only the uncompressed mimetype:
zip -X0 ../book.epub mimetype
# add the other files to the epub:
zip -rDX9 ../book.epub * -x "*.DS_Store" -x mimetype

There are probably plenty of other solutions, and I’m pretty sure you can find one on StackOverflow.
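For instance, the same two steps can be sketched in Python with the standard zipfile module: write mimetype first without compression, then add everything else compressed. This is only a sketch under my own assumptions: build_epub is a made-up name, the output file is expected to live outside the source folder, and the demo at the bottom builds a throwaway book in a temporary directory with a dummy container.xml.

```python
import tempfile
import zipfile
from pathlib import Path


def build_epub(source_dir, epub_path):
    """Zip source_dir into epub_path, with mimetype first and uncompressed."""
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # Step 1: the mimetype entry, stored as-is (like `zip -X0`).
        archive.write(source / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        # Step 2: every other file, compressed (like `zip -rDX9`).
        for path in sorted(source.rglob("*")):
            relative = path.relative_to(source).as_posix()
            if path.is_dir() or relative == "mimetype" or path.name == ".DS_Store":
                continue
            archive.write(path, relative, compress_type=zipfile.ZIP_DEFLATED)


# Demo on a throwaway folder (file contents are dummies).
with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp) / "book"
    (book / "META-INF").mkdir(parents=True)
    (book / "mimetype").write_text("application/epub+zip", encoding="ascii")
    (book / "META-INF" / "container.xml").write_text("<container/>", encoding="utf-8")
    epub = Path(tmp) / "book.epub"
    build_epub(book, epub)
    with zipfile.ZipFile(epub) as archive:
        names = archive.namelist()
        first_is_stored = archive.infolist()[0].compress_type == zipfile.ZIP_STORED

print(names, first_is_stored)
```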

So… an epub is basically a zipped website?

Well, since an epub (following the EPUB3 specs) can include HTML5, CSS and some other assets, EPUB readers have to rely on the same technologies as web browsers, so they need to support web technologies for sure.