Understanding e-Book File Formats

November 2, 2011 - 3:35pm — darnzen

One of the first things that confronts a writer wishing to e-publish their work is the confusing array of file formats and meta-data and the seeming lack of any standardization. In this article I will explore what an e-book really is, information regarding the different formats, tools to convert between formats, and how minor changes made as you write will make things much easier when you are ready to publish.

Why can't I just save my doc file as an e-book?

When I was young, most writers used a typewriter to create a manuscript. There is some nostalgia surrounding the typewriter: the snapping of the keys, the sound of the bell for each finished line; this was the sound of progress being made. When people were using typewriters, content was separated from format, layout, and design. You wrote a manuscript, double spaced, with whatever typeface was in your typewriter, typically in 10 or 12 point type. If you wanted to indicate special formatting, you would add hand-drawn markup to the document, or use some simple character-based markup such as asterisks to indicate *bold*, underscores for _underline_, and slashes for /italics/. At the publishing house, they would also add markup to the hard copy, indicating margins, fonts, page breaks, vertical spacing, table layout, images, etc. Authors worried about content, and publishers, for the most part, handled the presentation.

Nowadays, almost everyone uses a word processor of one type or another, with Microsoft's Word being used by the majority. Publishers still want manuscripts in the same format (double spaced, 1 inch margins, etc.), but with the advent of word-processing, the markup can be embedded in the file. So when you italicize a phrase, as I did in the previous sentence, there is a code embedded in the text stream marking the start and end of the italic text. When viewed or printed, the phrase is shown in italics. This is known as presentational markup, and is what is used most often on word-processors. With presentational markup, you can change type family, size, weight, style, decorations, etc. You can go nuts and dO crazy things. This allows writers to make bold, large titles, and chapter headings, or put the telepathic robot conversation in some odd font / style / weight to differentiate it from normal dialog.

The problem with presentational markup is that it is often used where descriptive or semantic markup should be used. Semantic markup differs from presentational markup in that it labels the individual parts of the document, such as the title, a paragraph, an image caption, or a heading, without defining presentation. For instance, the title is distinguished from the rest of the text by surrounding it with the appropriate markup codes, or tags. In HTML the markup tags are human readable and are written by surrounding the tag name with angle brackets: <tag> opens an element, and the same tag with a leading slash, </tag>, closes it, as follows.

<title>This is a Title</title>

With semantic markup, the presentation is defined elsewhere, either in a separate file (known as a style sheet), or at the beginning of the document. In this way, equivalent parts of the document will have the same styling throughout. It is therefore easy to define and change the styling of every piece of the document that shares the markup, such as paragraphs, or chapter headers. It also becomes easy to generate a table of contents for a book by creating links to each chapter heading. This is important because the concept of a page is generally no longer meaningful due to variations in reading device sizes and capabilities.
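To make the table of contents point concrete, here is a minimal Python sketch that collects every chapter heading from an HTML document. It assumes chapters are marked with <h1> tags, which is my choice for the illustration; the names HeadingCollector and build_toc are hypothetical, and a real tool would also emit links to each heading.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []    # finished heading texts, in document order
        self._in_h1 = False   # True while the parser is inside an <h1>
        self._buffer = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h1:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self.headings.append("".join(self._buffer).strip())
            self._in_h1 = False


def build_toc(html_text):
    """Return a list of (chapter number, heading text) pairs."""
    collector = HeadingCollector()
    collector.feed(html_text)
    return list(enumerate(collector.headings, start=1))


# Assumed sample document: two chapters, each marked with <h1>.
sample = """
<h1>Chapter 1: The Typewriter</h1>
<p>Some text.</p>
<h1>Chapter 2: The Word Processor</h1>
<p>More text.</p>
"""

for number, heading in build_toc(sample):
    print(number, heading)
```

Because the headings are labeled rather than merely styled, the program never has to guess which large bold lines are chapter titles.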

This discussion of markup is necessary because all e-book formats require the document to have some sort of semantic markup. If you are self-publishing or want to understand why you can't simply publish your MS-Word doc file as an e-book, you need to understand a little of what's going on inside the e-book files themselves. The “e-book” is a container that supplies the document text, styling information, cover art, and meta-data to the reading application or hardware. A *.doc file is a document with hardly any semantic markup, containing mostly proprietary presentational markup.

File formats: the big 3

There are three major e-book formats that are supported on the majority of reading systems: ePUB, MOBI, and PDF. ePUB is an open format defined by the International Digital Publishing Forum (IDPF). It is the primary format used on the iPad, Sony Reader, and the Barnes & Noble NOOK, and can be read by any PC or internet-based e-book reading software (e.g. Calibre, Stanza, Bookworm, Ibis). Basically, all e-readers except the Kindle can read ePUB files without fuss.

MOBI, the Mobipocket reader file format now owned by Amazon, can have the *.azw, *.prc, or *.mobi extension. AZW is Amazon's version of the mobi format that can be read on the Kindle. It is essentially the mobi file structure with its own DRM scheme, and no javascript support. The Kindle can also read unprotected *.prc or *.mobi files directly. MOBI grew out of the Open eBook standard, the predecessor of ePUB, and so shares many of the same conventions.

PDF isn't really an e-book format at all; it is a document format based on PostScript (PS). PDF is useful when you need to keep the “page” concept, and positioning on the page is important. It also supports scalable vector graphics, so it is good for rendering technical drawings and diagrams. This really isn't a good format for e-readers: most will read it, but it often requires horizontal panning, which is no fun. It is useful if you need the e-reader version to match the printed version, or you need scalable graphics and mathematical formatting.

There are two other formats worth mentioning at this point, plain text (*.txt) and HTML (*.html, *.htm). Plain text has the advantage that it is readable on all e-readers. There are several formatting issues that need attention with respect to line wrapping, and there is no support for images, links or TOC, but for a simple document, it works well. HTML is important because not only is it the basis for web display, it is the underlying format for both ePUB and MOBI! Plain old HTML files can be viewed by the majority of e-readers without modification.

There are many other e-book formats, but with ePUB and MOBI, you have a book that can be read without conversion on any device currently available. There are methods for reading ePUB on the Kindle utilizing the built-in browser, but they rely on an active internet connection. Converting an ePUB to the MOBI format is simple enough that there is really no reason not to publish in both formats. Publishing only in PDF should be avoided where possible since it is a fixed-layout format and is poorly supported on many devices.

Format        File extensions          Devices that CAN read                  Devices that CAN NOT read
-----------   ----------------------   ------------------------------------   -------------------------
Text          *.txt                    All                                    None
HTML, XHTML   *.htm, *.html, *.xhtml   Kindle, iOS, Android, Nook             Sony, iRex, Kobo
ePUB          *.epub                   Android, iOS, Nook, iRex, Sony, Kobo   Kindle
Mobipocket    *.mobi, *.prc, *.azw     Kindle, Android, iOS, iRex             Nook, Kobo, Sony
PDF           *.pdf                    All                                    None

Note on PDF: All except the Kindle 1.0 and WISEreader. Reading experience varies greatly, typically much worse than other eBook formats.

E-book files: What's inside?

The two e-book formats discussed above, MOBI and ePUB, are really collections of files that are rolled into a single file for distribution. ePUB files are actually standard *.zip archives and can be opened by changing the file name extension to “zip” or by using 7-zip software, which can open ePUB files without changing the extension. MOBI files use a proprietary compression scheme, but are essentially the same concept, so I'll limit the remaining discussion to ePUB.
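Since an ePUB is a zip archive, any zip library can look inside one. As a rough sketch, Python's standard zipfile module lists the contents without renaming the file; the function name list_epub_contents and the file name in the comment are my own placeholders.

```python
import zipfile


def list_epub_contents(epub_path):
    """Return the names of the files stored inside an ePUB archive."""
    # An ePUB is a zip archive, so zipfile opens it without any renaming.
    with zipfile.ZipFile(epub_path) as archive:
        return archive.namelist()


# Example use (hypothetical file name):
#     for name in list_epub_contents("my-book.epub"):
#         print(name)
```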

As an example, let's look inside How to Publish on WritelyDone. The archive contains a number of HTML files, images, a CSS style sheet, and three other files: content.opf, toc.ncx, and mimetype. The images, CSS, and HTML files are the document. The other files describe the document so readers can reconstruct it. The content.opf is an XML file that describes the document and includes metadata (author, language, title, publisher, etc.), along with a list of all the files that comprise the e-book. XML is plain text, so you can open it in Notepad or your browser if you want. The toc.ncx file is also an XML file; it describes the table of contents and links to each file in order. Brave readers may want to look at those files and get a feel for what's in there. The EPUB format documentation can be found on the IDPF web site. EPUB version 2.0 is currently supported by the available readers, but the EPUB 3.0 spec is available if you want to see what features will be available in the next generation of e-books.
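For readers who would rather poke at content.opf with a script than in Notepad, the following sketch pulls the metadata and the file list out of an OPF document using Python's standard XML parser. The sample OPF text is invented for illustration, and read_opf is a hypothetical helper; real package files carry more fields than the four looked up here.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by EPUB 2 package (OPF) files.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_opf(opf_text):
    """Return (metadata dict, list of manifest hrefs) from OPF XML text."""
    root = ET.fromstring(opf_text)
    metadata = {}
    for field in ("title", "creator", "language", "publisher"):
        element = root.find(f"opf:metadata/dc:{field}", NS)
        if element is not None and element.text:
            metadata[field] = element.text.strip()
    # The manifest lists every file that makes up the e-book.
    files = [item.get("href")
             for item in root.findall("opf:manifest/opf:item", NS)]
    return metadata, files


# Invented sample package file, for illustration only.
sample_opf = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
    <dc:creator>A. Writer</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.html" media-type="application/xhtml+xml"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
  </manifest>
</package>"""

metadata, files = read_opf(sample_opf)
print(metadata)
print(files)
```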

Notice how the document is a bunch of html files instead of one. Each html file is a section or a chapter and each section terminates on a page-break. You normally don't want text from chapter 2 to reflow into the bottom of the last page of chapter 1. You want to have it start on a new page. Putting each chapter into a separate file forces the e-reader to do this.

Writing for the Web

Since the two big formats are containers for HTML documents, it makes sense to keep that in mind while you are writing. Converting your document to HTML might take a lot of effort if you aren't planning for the conversion as you write. For instance, if you write an entire novel in one file, separating chapters by inserting page breaks, and typing chapter headings by changing the font size to 24pt and making it bold, you will likely spend a lot of time trying to get your HTML to look right in the ebook.

Here are some tips to make it easier:

  1. Create a separate file for each chapter as you write, or at least make it easy to find chapter boundaries (i.e., include “Chapter” in the heading, or some other indication; don't rely on style alone). This also makes it easy to “give away” a sample chapter by posting it on the internet.

  2. Use styles instead of formats. In other words, use the built-in title, heading 1, text body, caption, etc. styles. If you don't know how already, learn how to modify the styles to adjust the display, rather than formatting each piece of text by changing its typeface, font size, color, etc.

  3. Try to save and edit the document as HTML. You can switch from document view to source view using Word or Open Office. This allows you to edit in the WYSIWYG editor, or edit the HTML directly.

  4. If you need to be able to have the manuscript as a single DOC file (to send to a publisher for instance), you can use “master documents” to combine the front matter, chapter files, index and whatever else you have into one big file at the end. This is a feature in most major word-processors and makes it easier to work on large documents in general.

Putting it all together

Once you know what's in there, making the e-book is straightforward. First convert your manuscript into a set of HTML files, one for each chapter. Then add the content.opf, toc.ncx, and mimetype files, and zip everything up into a single *.zip file. Change the “zip” extension to “epub” and you're done. The challenge is in creating the opf and ncx files. Luckily there are tools to help do this.
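The zipping step can be sketched in a few lines of Python. This is only an outline under some assumptions: the caller already has the text of content.opf, toc.ncx, and the chapter files, and write_epub is a hypothetical helper name. Two details go beyond the file list above: the ePUB container rules expect the mimetype entry to come first and be stored uncompressed, and they expect a small META-INF/container.xml file that points to content.opf.

```python
import zipfile

# Tells the reading system where the package file (content.opf) lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def write_epub(epub_path, files):
    """Zip `files` (a dict of archive name -> text) into an ePUB container.

    The caller supplies content.opf, toc.ncx, and the chapter HTML files;
    this sketch does not generate or check their contents.
    """
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The mimetype entry must come first and must not be compressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        for name, text in files.items():
            archive.writestr(name, text, compress_type=zipfile.ZIP_DEFLATED)


# Example use (file contents elided; the names are placeholders):
#     write_epub("my-book.epub", {
#         "content.opf": opf_text,
#         "toc.ncx": ncx_text,
#         "chapter1.html": chapter1_html,
#     })
```

Writing the archive straight to a name ending in “epub” skips the rename step, which is the same thing as zipping and then changing the extension.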

The most popular free tool for creating and converting ebook files is Calibre. Calibre can convert your manuscript into an epub file with little effort. The opf and ncx files get created automatically. However, there are a lot of options, and any wrong setting could lead to an incomplete book with missing metadata, cover image problems, or a broken TOC. Some of these problems may only become apparent when you try to convert the file to another format, such as MOBI.

Sigil is currently the only tool that allows you to create and edit ePUB files directly, without writing in one format and then converting. It is free, open source software and can work well once you are more comfortable with the ePUB format.

If you use OpenOffice or LibreOffice, you can use an Open Office extension that lets you save your manuscript directly as an ePUB. I chose to link directly to the file since the site is in Italian. If you want to see the original site and give kudos to the author, you can find it at lukesblog.it/ebooks/ebook-tools/writer2epub/ where there is information on installation and use. Google translate does a good job with the translation.

Adobe InDesign is another option, but only if you have $700 to spare. Adobe now offers a monthly subscription license for $35/month if that better fits in your budget.

Once you have a well-formatted, publication quality ePUB, you can convert it to MOBI using Calibre, or use an online service such as 2epub.com or Convert.Files. Both services use the Calibre engine to do the conversion, but 2epub prompts for some metadata and overall is a better experience in my opinion. In order to create a Kindle e-book for distribution on Amazon, you need to convert the epub to the AZW format using their KindleGen application.
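Calibre also ships a command-line converter, ebook-convert, which makes the ePUB to MOBI step easy to script. The sketch below wraps it from Python; the helper names and file names are placeholders of mine, and it assumes Calibre is installed with ebook-convert on the search path. No conversion options are passed, so Calibre's defaults apply.

```python
import shutil
import subprocess


def build_convert_command(source, target):
    """Return the command line that asks Calibre to convert source to target."""
    # ebook-convert chooses the output format from the target's extension.
    return ["ebook-convert", source, target]


def convert(source, target):
    """Run the conversion; raises if Calibre is missing or the tool fails."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    subprocess.run(build_convert_command(source, target), check=True)


# Example use (hypothetical file names):
#     convert("my-book.epub", "my-book.mobi")
```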

If you plan to have a publication quality document, you must make sure you do a thorough quality check by using an ePUB validator in addition to viewing the document in a reading application or better yet, on the target device. If you are not very tech savvy, you may want to delegate this e-book creation and conversion to a professional. It is my hope that you can find the help you need here on WritelyDone.
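Before running a full validator, a quick structural check can catch the most basic packaging mistakes. The following is a rough sketch only, not a substitute for a real ePUB validator: quick_epub_check is a hypothetical helper that looks at the zip layout and never inspects the contents of the opf, ncx, or HTML files.

```python
import zipfile


def quick_epub_check(epub_path):
    """Return a list of problems found; an empty list means none were spotted."""
    problems = []
    with zipfile.ZipFile(epub_path) as archive:
        entries = archive.infolist()
        names = [entry.filename for entry in entries]
        # The mimetype entry must be first, uncompressed, and exact.
        if not entries or entries[0].filename != "mimetype":
            problems.append("mimetype is not the first entry in the archive")
        elif entries[0].compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype entry is compressed")
        elif archive.read("mimetype") != b"application/epub+zip":
            problems.append("mimetype has unexpected content")
        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is missing")
        if not any(name.endswith(".opf") for name in names):
            problems.append("no .opf package file found")
        if not any(name.endswith(".ncx") for name in names):
            problems.append("no .ncx table of contents found")
    return problems


# Example use (hypothetical file name):
#     for problem in quick_epub_check("my-book.epub"):
#         print("Problem:", problem)
```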

EDIT: Since this article was originally published, WritelyDone has added an automated e-book generator. Simply "copy/paste" your document into a book outline and generate both an epub and a Kindle-compatible mobi file automatically. This is free and there are no strings or licensing restrictions. You can find more information in the help documentation.

Comments

Excellent

March 8, 2012 - 9:09pm — Shaun Wallace (not verified)
Dan, this is an excellent read on the guts of how epubs work; obviously there is much more to it, and it is a highly evolving industry. As you commented and many authors realize, the process isn't necessarily simple, and at the end of the day a coder like myself can resolve a validation or other error in a few minutes, as opposed to a writer spending a day or maybe two solving a problem. A couple of notes: you can publish straight to Amazon via mobi; azw is not necessarily required. InDesign also works in the way you described for Sigil when complemented by Dreamweaver. Most of the projects I work on are completed in CS 5.5, and I find Sigil to be a bit insufficient; however, it's free. Also, your format and reader compatibility table is a bit outdated once you consider app translators. For example, you can read anything within modern iOS, and your article even indicated an ePub could be viewed on a Kindle. Anyway, just a couple of thoughts, but this is by far the best intro breakdown on how it all works. I look forward to a similar read on the new iBA format. Shaun

Thanks!

March 8, 2012 - 9:53pm — darnzen

Thanks for the feedback, Shaun. You are correct on all your points. Ebook standards are evolving continuously and it makes sense to keep up to date.

Amazon accepts EPUB but they still recommend conversion to MOBI/AZW using their kindlegen tool. Supposedly all new (post gen 1) Kindles can read EPUB without conversion, but I'm not sure how well. I threw a number of random ePUBs at my Kindle when they first made the claim that it was compatible and had mixed results (though much better than PDF!). I haven't tried this with the newest firmware or the DX or Fire, so it is likely the support is much improved.

ePUB 3.0 provides many possibilities that are interesting (CSS3, embedded media, etc.) but I imagine it will be a long time before mainstream authors even consider using the full potential of the format; likely the ones pushing the technology envelope will be magazines and such.

I read an interesting article on iBA, but have yet to explore the format in detail myself. The fact that Apple chose to make something different from ePUB3 for no technical benefit is worrisome. Basically they are further segmenting the market when openness and portability are in the consumer's best interest.

It is especially frustrating since, as you mentioned, Amazon is moving towards fully supporting ePUB as a format. Convergence on a standard will make everything simpler. 
