There’s no better feeling than putting your heart onto digital paper, uploading a Word file and having your book sold in the largest book store in the world. I’m in love with self-publishing.
However, your Kindle looks like a dog’s dinner because of the crap that Microsoft Word inadvertently puts into it: colors in headings, inexplicable font changes, the bullet points coming out smaller than the body text, too big, too small, you name it.
I’ve spent months pulling my hair out wrestling with this (I wasn’t bald before). Then I discovered how to get rid of Microsoft Word’s rubbish from an ePub with Sigil (free software) and Regular Expressions (a sequence of characters that forms a search pattern).
By the way, Kindle’s Mobi format is almost the same as an ePub, so creating a perfect ePub with Sigil results in a perfect Mobi for Amazon’s Kindle.
E-books should have totally clean HTML. What do I mean by that? I mean that there should be no font, color, text size or line height specified – hardly any styling, in fact. This is because e-books are completely accessible – the reader chooses the font, the color and the size of the text they want to read. And that’s the way they should be.
If you’re a complete HTML novice then don’t worry because it’s real simple. This is the sort of HTML you should be seeing in the body of your e-book:
<h1>This is a chapter heading</h1>
<p>This is a normal paragraph</p>
<h2>This is a subheading</h2>
<p>This is another normal paragraph in which you can have <b>bolds</b> and <i>italics</i> and <a href="https://robcubbon.com">links</a>.</p>
<p>And, here's another paragraph coming up with an image in it.</p>
<p><img alt="alternative text here" src="../Images/image001.jpg" /></p>
You see? Very simple. This is how the above should be created in Microsoft Word.
Notice that the <h1> is created by the Heading 1 style, the <h2> is created by the Heading 2 style and the <p> is created by the Normal style. To get to the Microsoft Word styles, click the Home tab to display the “ribbon” with all the basic formatting on it. You’ll be able to assign styles there by clicking buttons with text selected.
How to extract Word’s rubbish from an ePub and a Kindle Mobi with Sigil
If you open up Sigil after installing it, it will immediately open up a “bare” ePub file (see below).
The above image shows a new ePub in Sigil with a Word document pasted in “Book view” and then switched to “Code view”. As you can see in the left hand pane, Sigil has created all the necessary files and folders for the ePub to successfully validate (and create an awesome Kindle Mobi). You really don’t have to know about how all this works as Sigil has it covered.
You do, however, have to sort out the text and images in your e-book.
As you can see in the above image, Sigil sorts out the beginning of the HTML:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
After this, you should go straight into the <h1> of your first heading. There will be a bunch of crap in green that you’ll need to manually delete.
(You could clean up the Word HTML before you paste it into Sigil with Word2cleanHTML.com.)
Using Regular Expressions in Sigil to clean up a Word doc
You’ll then need to clean up the HTML further. All the styles, spans, classes, divs, etc., have to go. Even in a 10,000-word book this seems like a huge task, but you can do it in a few minutes by using Regular Expressions. I owe a huge debt to Peter Clough of Get Publishing and the London Kindlers Meetup Group for putting me onto this.
The above video shows you how I do the HTML cleaning and ePub creation in Sigil.
Here are the regular expressions I find and replace (change the Find/Replace mode in Sigil to “Regex”):
Find | Replace | What it does |
---|---|---|
<p[^>]*> | <p> | gets rid of unwanted styling after opening p tag |
<span[^>]*> | [nothing] | gets rid of all opening span tags |
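If you’d like to see what these two patterns actually do before letting them loose on your book, here’s a minimal Python sketch using the standard `re` module. The sample string and the function name `strip_word_markup` are made up for illustration, and Sigil’s own regex engine may differ in small details. The key bit is `[^>]*`, which means “any run of characters that isn’t a closing angle bracket”, so each pattern swallows a whole opening tag, attributes and all.

```python
import re

# Hypothetical sample of the kind of markup Word tends to produce.
WORD_HTML = (
    '<p class="MsoNormal" style="margin:0cm"><span style="font-size:11pt">'
    "Hello</span> world</p>"
)


def strip_word_markup(html: str) -> str:
    """Apply the two regex find/replace rows from the table above."""
    # <p ...anything...>  becomes a bare <p>
    html = re.sub(r"<p[^>]*>", "<p>", html)
    # <span ...anything...>  is removed entirely
    html = re.sub(r"<span[^>]*>", "", html)
    return html


cleaned = strip_word_markup(WORD_HTML)
print(cleaned)  # <p>Hello</span> world</p>
```

The closing `</span>` is still there, which is why the plain find and replaces are needed as well. One caveat: as written, `<p[^>]*>` also matches tags such as `<pre>`. If your book contains those, `<p\b[^>]*>` is a stricter variant that only matches real paragraph tags.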
And here are the normal find and replaces (you can keep the Find/Replace mode in Sigil as “Regex”, you don’t have to put it back to “Normal”):
Find | Replace | What it does |
---|---|---|
</span> | [nothing] | gets rid of all closing span tags |
&nbsp; | [space] | gets rid of all non-breaking spaces |
The above will get rid of 90% of the rubbish Word puts in your lovely ePub. You can weed out more unwanted HTML by scrolling through and using Sigil’s excellent validation tool.
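To hunt down the remaining 10% without scrolling, something like the following Python sketch can list whatever is left over. It is only an illustration: the allowed tag and attribute lists are my assumption of what a “clean” e-book needs (based on the sample HTML earlier), the names `LeftoverFinder` and `find_leftovers` are hypothetical, and it is no substitute for Sigil’s validator.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: these are the only tags and attributes a "clean" book needs.
ALLOWED_TAGS = {"html", "head", "title", "link", "body",
                "h1", "h2", "p", "b", "i", "a", "img"}
ALLOWED_ATTRS = {"href", "src", "alt", "xmlns", "rel", "type"}


class LeftoverFinder(HTMLParser):
    """Counts tags and attributes that fall outside the allowed sets."""

    def __init__(self):
        super().__init__()
        self.leftovers = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also arrive here by default.
        if tag not in ALLOWED_TAGS:
            self.leftovers[f"<{tag}>"] += 1
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS:
                self.leftovers[f"{tag}[{name}]"] += 1


def find_leftovers(html: str) -> Counter:
    finder = LeftoverFinder()
    finder.feed(html)
    finder.close()
    return finder.leftovers


sample = ('<div><p>Fine</p><h1 style="color:red">Chapter</h1>'
          '<p><img alt="x" src="a.jpg" /></p></div>')
print(find_leftovers(sample))
```

For the sample above it reports one stray `<div>` and one `style` attribute on an `h1`, which are exactly the sort of things you’d then delete by hand or with another find and replace.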
Images
If you have images in your book, import them into Sigil (Insert > File, Cmd/Ctrl-I) and write the alt text. Images in Kindles can be GIF, JPG or PNG, about 500 pixels wide and with the file size as small as possible.
Import an image in a new paragraph (so after you’ve hit return). Don’t bother trying to flow text around an image. If you want an image to be thumbnail size (a headshot, for example), add a 300-pixel-wide white background to the right of your 200-pixel-wide image, so the whole file is still 500 pixels wide. Credit for this idea goes to ‘Chris_Mac_Artworker’.
Later on we will add a 100% width rule in the CSS to make sure your images fill the width of whatever device your book is being read on.
Table of contents and HTML Table of Contents
Kindle Mobi files should, ideally, have two Tables of Contents (TOCs).
One TOC is visible when you tap the Menu button on your Kindle reading device. To create this TOC, go to Tools > Table of Contents > Generate Table of Contents… Assuming that all your chapter titles are h1s, you can choose “Up to level 1” in the drop-down.
Additionally, many non-fiction authors like to add a TOC as an extra page at the beginning of the book. To do this, go to Tools > Table of Contents > Create HTML Table of Contents.
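If you’re wondering what “Up to level 1” actually picks up, this rough Python sketch collects headings from a page up to a chosen level. It is not how Sigil does it internally (I don’t know that); `HeadingCollector` and `collect_headings` are hypothetical names, and the sample HTML is invented. It just shows why keeping chapter titles as h1s matters.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of headings up to a chosen level (1 = h1 only)."""

    def __init__(self, max_level: int = 1):
        super().__init__()
        self.wanted = {f"h{n}" for n in range(1, max_level + 1)}
        self.current = None   # tag name of the heading being read, if any
        self.entries = []     # list of (level, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag
            self.entries.append((int(tag[1]), ""))

    def handle_data(self, data):
        # Text inside nested tags (e.g. <i>) is appended to the open heading.
        if self.current:
            level, text = self.entries[-1]
            self.entries[-1] = (level, text + data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None


def collect_headings(html: str, max_level: int = 1) -> list:
    collector = HeadingCollector(max_level)
    collector.feed(html)
    collector.close()
    return [(level, text.strip()) for level, text in collector.entries]


sample = ("<h1>Chapter One</h1><p>Text</p><h2>A subheading</h2>"
          "<h1>Chapter <i>Two</i></h1>")
print(collect_headings(sample))               # h1s only
print(collect_headings(sample, max_level=2))  # h1s and h2s
```

With `max_level=1` only the two chapter titles come back; with `max_level=2` the subheading is included too.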
Thank you to my Facebook friend, Hynek Palatin, for help with the above. I met Hynek on Pat’s First Kindle Book which is the best online self-publishing group in the whole world – no arguments!
CSS styling
There are two CSS rules I like to add to the CSS file in the ePub. The CSS file sets the style or formatting for the document. If you created an HTML Table of Contents, you should already have one. Otherwise you can right-click on the Styles folder in the Book Browser pane and Add Blank Stylesheet. In the CSS file I put the following:
img {width: 100%;}
h1 {page-break-before: always;}
The first rule ensures images stretch or squeeze to 100% of the width of the device the e-book is being read on. The second rule ensures that every chapter starts on a new page (assuming that all your chapter titles are h1s).
You have to reference this in the <head> of the HTML with this:
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/>
… making sure that you have the correct name of the CSS file.
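Getting the CSS file name wrong is easy to do and hard to spot, so here is a small Python sketch that checks it for you. An ePub is really just a zip file, so the standard `zipfile` module can open it. Treat this as an illustration under assumptions: the function name `missing_stylesheets` is hypothetical, the `OEBPS` folder layout in the demo is my assumption about how the files are arranged, and the link matching is deliberately simple.

```python
import io
import posixpath
import re
import zipfile


def missing_stylesheets(epub_file) -> list:
    """Return (page, href) pairs whose stylesheet link points at nothing."""
    problems = []
    with zipfile.ZipFile(epub_file) as epub:
        names = set(epub.namelist())
        for name in sorted(names):
            if not name.endswith((".xhtml", ".html")):
                continue
            text = epub.read(name).decode("utf-8")
            for link in re.findall(r"<link\b[^>]*>", text):
                href = re.search(r'href="([^"]*\.css)"', link)
                if href is None:
                    continue
                # Resolve the relative href against the page's own folder.
                target = posixpath.normpath(
                    posixpath.join(posixpath.dirname(name), href.group(1))
                )
                if target not in names:
                    problems.append((name, href.group(1)))
    return problems


# Demo on a tiny in-memory archive; the folder layout is an assumption.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("OEBPS/Styles/Style0001.css", "img {width: 100%;}")
    z.writestr("OEBPS/Text/Section0001.xhtml",
               '<head><link href="../Styles/Style0001.css" '
               'rel="stylesheet" type="text/css"/></head>')
    z.writestr("OEBPS/Text/Section0002.xhtml",
               '<head><link href="../Styles/style1.css" '
               'rel="stylesheet" type="text/css"/></head>')
demo.seek(0)
print(missing_stylesheets(demo))
```

In the demo the second page links to `style1.css`, which doesn’t exist in the archive, so that page is the only one reported. To try it on a real book you would pass the path to your saved ePub instead of `demo`.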
Adding Metadata
Add the title and author and specify the language of the e-book by going to Tools > Metadata editor … (This is actually unnecessary for Kindle Mobi as Amazon will add this for you, but it makes the ePub validate so I always do it.)
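If you want to double-check that the metadata really made it into the saved file, this Python sketch reads it back out. It assumes the ePub stores its metadata in a `.opf` file using the Dublin Core `dc:title`, `dc:creator` and `dc:language` elements, which is what the ePub format specifies; the function name `read_metadata` and the demo book are made up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for title/creator/language in OPF files.
DC = "{http://purl.org/dc/elements/1.1/}"


def read_metadata(epub_file) -> dict:
    """Return title, creator and language from the first .opf file found."""
    with zipfile.ZipFile(epub_file) as epub:
        opf_name = next(n for n in epub.namelist() if n.endswith(".opf"))
        root = ET.fromstring(epub.read(opf_name))
    found = {}
    for field in ("title", "creator", "language"):
        element = root.find(f".//{DC}{field}")
        found[field] = element.text if element is not None else None
    return found


# Demo with an invented book that has no author filled in.
OPF = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>My Book</dc:title>
    <dc:language>en</dc:language>
  </metadata>
</package>"""

demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("OEBPS/content.opf", OPF)
demo.seek(0)
print(read_metadata(demo))
```

Any field that comes back as `None` is one you still need to fill in using the metadata editor.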
Validating your ePub
As I mentioned before, there is a Validation button in Sigil (a large green checkmark at the top right) that lists out the issues. You can arrive at the problem line of the relevant file by double-clicking on the line number.
You can then save and view your ePub in Calibre and Kindle Previewer (both free).
Uploading your Kindle to Amazon
The file format for Amazon’s Kindle is Mobi. A Mobi is created when you upload a Word doc at kdp.amazon.com. Personally, I upload the ePub here because I think their conversion is fine. However, some people advocate using Calibre to convert from ePub to Mobi before uploading.
In the above video I upload a Word document as I go through the publishing process at kdp.amazon.com. But, instead of uploading a Word file, you can upload your perfect, clean, beautiful ePub.
Adding cover image
Technically, all ePubs should have a cover image embedded in the file. However, the cover is uploaded separately during the publishing process above. In Sigil, you add your cover to your ePub in Tools > Add Cover…
You can do it
The most important thing is to not let these formatting and publishing quirks deflect you from writing a great book.
If you are a new author just starting out, my advice to you is to just write that damn book and forget about all this geekery. In Microsoft Word, use the Heading 1, Heading 2 and Normal styles when necessary and write! Uploading a Word file at kdp.amazon.com is not the worst thing you can do. It won’t look that bad.
If, like me, you’ve published a few Kindles and you’re exasperated that your text is not looking the way you want, then I hope this helps. Let me know in the comments.
Cathleen Keene says
Great tutorial! Thank you!
Rob Cubbon says
Thank you, Cathleen 🙂
Lis Sowerbutts says
I found Sigil to be unstable on my old laptop. I do something very similar to what you describe but I just write the HTML in an editor (I like Notepad++) – I start from word2cleanhtml code and then add in the CSS and create the ncx and opf files. Sigil seems OK on my new ultrabook – but I still don’t quite trust it.
Rob Cubbon says
I’m using it on Mac, Lis, and it’s super stable but I have heard it’s buggy on PCs. 🙁
You can use these regular expressions on Notepad++ but the advantage of Sigil is that it creates the ncx and opf files and creates the two types of TOCs automatically. 🙂
Either way, it’s still a confusing process to explain to someone without any HTML experience.
Thanks for your feedback on Sigil, Lis, let’s keep our eyes on this one.
Bobby Burns says
Hello Rob,
Great tutorial and I am keen to try it out. However, I just read your response to Lis and I am a PC user. Will I have to upload my finished work to Kindle before knowing if it’s dodgy?
I’m not going to convert to a Mac (ha!) and I am loath to learn HTML unless I absolutely have to, so this approach of using Sigil seemed like a gift!
Thanks,
~Bobby
Rob Cubbon says
Hi Bobby, I don’t know what the problem with PCs and Sigil is but I think it crashes often. I don’t think it creates dodgy ePubs. But I don’t know. Best of luck and ask questions here if you have trouble.