Creating your Ebook: Step 1 – Word to HTML

Saturday, 16 Feb 2013

Written by J. Abram Barneck in Ebook

This article assumes you have used Microsoft Word to create your manuscript.

Step 1 – Prepare your Word document

Things to note

Find every word that begins with an apostrophe (a space followed by a single quote) and write those words down so you can find them again later. I found these in my novel:
Days of ’47
’cause
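
If your manuscript is also available as plain text, a short script can build this list for you. The following is only a sketch under that assumption; the function name and the sample sentence are made up for illustration. It will also report opening single quotes around quoted dialogue, so review the list by hand.

```python
import re

# A straight or curly single quote that follows whitespace (or starts the
# text) and is immediately followed by letters or digits.
LEADING_QUOTE = re.compile(r"(?:(?<=\s)|^)['‘’]\w+")

def find_leading_quotes(text):
    """Return the sorted, de-duplicated words that begin with a single quote."""
    return sorted({m.group(0) for m in LEADING_QUOTE.finditer(text)})

sample = "It was the Days of '47 parade, 'cause everyone came. Don't worry."
print(find_leading_quotes(sample))
```

The apostrophe inside "Don't" is skipped because it does not follow a space.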

Find and Replace

Find all double spaces and replace them with a single space. Run it multiple times until it replaces nothing. (Ignore the quotes; I used them to show the spaces.)

Find: "  "
Replace: " "

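Find every single quote and replace it with a single quote. This looks like it does nothing, but if Word's option to replace straight quotes with smart quotes is turned on, each replaced quote should come back as a curly quote: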

Find: '
Replace: '

Todo: Now go back to the words you wrote down that begin with a single quote and fix them, because the replace will have turned each leading apostrophe into an opening quote. If you didn’t write them down, search for space quote and change each one to the correct smart quote: ’

Find every double quote and replace it with a double quote:

Find: "
Replace: "

Find a space plus the paragraph special character and replace it with just the special character. Run it multiple times until it replaces nothing. Again, ignore the quotes; I use them to show you the space before the paragraph special character.

Find: " ^p"
Replace: ^p

Find the paragraph special character plus a space and replace it with just the special character. Run it multiple times until it replaces nothing. Again, ignore the quotes; I use them to show you the space after the paragraph special character.

Find: "^p "
Replace: ^p

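Find double paragraph breaks (empty paragraphs) and replace them with a single break. Run it multiple times until it replaces nothing.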

Find: ^p^p
Replace: ^p
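
For anyone who prefers a script, the whitespace passes above can be approximated on a plain-text copy of the manuscript. This is a rough sketch, not a replacement for the Word steps: it assumes one paragraph per line, so "\n" stands in for Word's ^p, and the function name is hypothetical.

```python
import re

def tidy_whitespace(text):
    """Mimic the Word find-and-replace passes on plain text."""
    text = re.sub(r" {2,}", " ", text)    # runs of spaces -> one space
    text = re.sub(r" +\n", "\n", text)    # spaces before a paragraph break
    text = re.sub(r"\n +", "\n", text)    # spaces after a paragraph break
    text = re.sub(r"\n{2,}", "\n", text)  # empty paragraphs
    return text

sample = "First  paragraph. \n\n Second   paragraph.\n"
print(repr(tidy_whitespace(sample)))
```

Because each pattern matches a whole run at once, a single pass here does what Word needs several passes to do.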

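Find all ellipses and replace them with a version that uses non-breaking spaces (^s in Word), so an ellipsis is never split across two lines. I prefer the ellipsis in this format: space dot space dot space dot. (Ignore the quotes.)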

Find: " . . ."
Replace: "^s.^s.^s."

Find all ellipses followed by punctuation and replace them. Repeat the search for each punctuation mark; the question mark is shown here. (Ignore the quotes.)

Find: " . . . ?"
Replace: "^s.^s.^s.^s?"
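
The two ellipsis replacements can be sketched in Python as well. This assumes plain text and uses U+00A0, the non-breaking space that Word writes as ^s. The function name and the set of punctuation marks are my own choices for illustration.

```python
import re

NBSP = "\u00a0"  # non-breaking space, the character Word writes as ^s

# A spaced ellipsis " . . ." optionally followed by a space and one
# punctuation mark, such as " . . . ?".
ELLIPSIS = re.compile(r" \. \. \.(?: ([?!,;:.]))?")

def protect_ellipses(text):
    """Replace the ordinary spaces in spaced ellipses with non-breaking ones."""
    def repl(match):
        out = f"{NBSP}.{NBSP}.{NBSP}."
        if match.group(1):
            out += NBSP + match.group(1)
        return out
    return ELLIPSIS.sub(repl, text)

sample = "Wait . . . what . . . ?"
print(repr(protect_ellipses(sample)))
```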

Step 2 – Convert your Word doc to HTML (hard way)

Most sites recommend that you follow these steps with your Word document:

Open your Word document.
Click File | Save As.
Choose the filtered HTML format (Word lists it as Web Page, Filtered).
Enter a file name.
Click Save.
Clean the HTML the rest of the way yourself.

The problem with this method is the last step, because you end up with HTML that is still heavily styled. For example, we want each paragraph wrapped in an empty p tag (one with no attributes), but Word writes p tags that are not empty.

Empty

<p>Your paragraph here...</p>

Not Empty

<p class='MsoNoSpacing' style='text-indent:.5in'>Your paragraph here...</p>

So this is where all the documentation tells you to “clean up the tags” and do it yourself. This is tedious (even if you are a techie who knows regex, like me), so why do it?

You can avoid much of the tedium using Find and Replace: find all instances of the non-empty p tag and replace them with the empty p tag. However, there are many tags, and with a huge book, how will you ever know whether you replaced them all? You either have to go through your entire book in HTML or learn to search with regular expressions (regex).
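
One way to answer the "did I get them all?" question is to count which tags still carry attributes. Here is a minimal Python sketch; the function name and the sample are hypothetical, and it relies on a regex rather than a real HTML parser, so treat the counts as a quick check only.

```python
import re
from collections import Counter

# An opening tag that still has at least one attribute, e.g. <p class="x">.
# Self-closing tags such as <br /> are not counted.
TAG_WITH_ATTRS = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\s+[^\s/>][^>]*>")

def styled_tag_counts(html):
    """Count, per tag name, the opening tags that still have attributes."""
    return Counter(m.group(1).lower() for m in TAG_WITH_ATTRS.finditer(html))

sample = (
    "<p class='MsoNoSpacing' style='text-indent:.5in'>One</p>"
    "<p>Two</p><br /><span lang='EN-US'>x</span>"
)
print(styled_tag_counts(sample))
```

When the result is empty, every tag in the file is empty too.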

Step 3 – Clean the HTML Using Regex

You can use regex in Find and Replace in Sigil, and also in Notepad++.

Regex examples:

Make all p tags empty

Find: <p[^>]*>
Replace: <p>

Make all h1 tags empty

Find: <h1[^>]*>
Replace: <h1>

Make all span tags empty

Find: <span[^>]*>
Replace: <span>

Note: Notice the pattern? You can do this for any tag you want. In fact, you can do this for all tags at once.

Find: (<[^> ]+)([^>]*)(>)
Replace: \1\3
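
The same all-tags pattern works outside an editor. As a sketch, here it is with Python's re module, where \1 and \3 refer to the first and third groups; the function name is made up for the example.

```python
import re

# Group 1 is "<tagname" (or "</tagname"), group 2 is the attributes,
# and group 3 is the closing ">".
ANY_TAG = re.compile(r"(<[^> ]+)([^>]*)(>)")

def strip_attributes(html):
    """Drop the attributes from every tag by keeping only groups 1 and 3."""
    return ANY_TAG.sub(r"\1\3", html)

sample = (
    "<p class='MsoNoSpacing' style='text-indent:.5in'>"
    "Your <span lang='EN-US'>paragraph</span> here...</p>"
)
print(strip_attributes(sample))
```

Keep in mind that this also removes the href from links and the src from images.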

Or for all tags except the tags you want to leave alone:

Find: (<(?!a|img|\?|!|html|link|/)[^> ]+)( [^>]*)(>)
Replace: \1\3
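
Here is a sketch of the exception pattern in Python. The first pattern is the one above. The second, STRICT, is my own variation and not part of the original recipe: it adds a word boundary so that only the exact names a, img, html, and link are skipped, because the plain (?!a...) also skips any tag whose name merely starts with an a, such as abbr.

```python
import re

# The pattern from the article: skip a, img, <?...>, <!...>, html, link,
# and closing tags.
EXCEPT_SOME = re.compile(r"(<(?!a|img|\?|!|html|link|/)[^> ]+)( [^>]*)(>)")

# Hypothetical stricter variant: \b makes the tag names match exactly.
STRICT = re.compile(r"(<(?!(?:a|img|html|link)\b|[?!/])[^> ]+)( [^>]*)(>)")

def strip_most(html, pattern=EXCEPT_SOME):
    """Drop attributes from every tag that the pattern does not exclude."""
    return pattern.sub(r"\1\3", html)

sample = "<p class='x'>See <a href='ch2.html'>chapter 2</a> <img src='map.png' alt='Map'></p>"
print(strip_most(sample))
print(strip_most("<abbr title='Doctor'>Dr.</abbr>"))
print(strip_most("<abbr title='Doctor'>Dr.</abbr>", STRICT))
```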

Remove all instances of an html tag but leave the text between

Here is an example using the span tag. It assumes you have already made the span tags empty, and it matches only a span that contains no other tags, so you may have to run it multiple times to remove nested tags (tags inside tags).

Find: <span>([^<]*)</span>
Replace: \1

Or you can just use two find and replace steps, leaving the Replace box empty each time:

Find: <span[^>]*>
Replace: (leave empty)
Find: </span>
Replace: (leave empty)
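
In a script, the two steps collapse into one pattern that matches both the opening and the closing tag. This is a small sketch with a hypothetical function name; since it removes each tag on its own rather than matching pairs, nested spans are handled in a single pass.

```python
import re

# Matches <span>, <span ...>, and </span>; \b keeps other tag names that
# merely start with "span" from matching.
SPAN_TAG = re.compile(r"</?span\b[^>]*>")

def unwrap_spans(html):
    """Remove span tags but keep the text between them."""
    return SPAN_TAG.sub("", html)

sample = "<p><span lang='EN-US'><span>Nested</span> text</span> here</p>"
print(unwrap_spans(sample))
```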

Step 3 (Alternate) – Convert your Word doc to HTML (untrusted way)

You could find a site like HTML Tag and Attribute Remover and use it with these settings.

Keep these tags:

<p><i><a><img>

Keep these attributes:

href src alt
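
If you would rather not paste your manuscript into a website, the same keep-list idea can be sketched locally with Python's built-in html.parser module. This is an illustration, not the tool named above: the class and function names are made up, and skipping the contents of style, script, and title is an assumption on my part so that stylesheet text does not end up in the book.

```python
from html import escape
from html.parser import HTMLParser

KEEP_TAGS = {"p", "i", "a", "img"}
KEEP_ATTRS = {"href", "src", "alt"}
VOID_TAGS = {"img"}                           # tags with no closing tag
SKIP_CONTENT = {"script", "style", "title"}   # assumption: drop their text

class TagFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP_TAGS:
            kept = ""
            for name, value in attrs:
                if name in KEEP_ATTRS:
                    kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
            self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP_TAGS and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

def filter_html(html):
    """Keep only the whitelisted tags and attributes; keep all body text."""
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

sample = (
    "<p class='MsoNormal' style='text-indent:.5in'><span lang='EN-US'>See "
    "<i>the</i> <a href='ch2.html' target='_blank'>map</a></span>"
    "<img src='map.png' alt='Map' width='300'></p>"
)
print(filter_html(sample))
```

Comments and the doctype are dropped as well, because the parser's default handlers ignore them.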

Creating your Ebook: Step 2 – Novel to EPUB with Sigil

5 Comments

Sheogorath said:

December 4, 2013 at 2:44 pm

Or just do what I do: write the chapters with unused special characters in certain places on my Android, then just search/replace the special characters with the correct HTML coding and CSS styling before saving the document as plain text. I then run it through my HTML reader before changing the encoding and moving the resulting HTML documents onto my netbook for importation purposes. If all else fails, there’s always creating a plain text file in LibreOffice and copy/pasting it into Sigil. Way easier!

- J. Abram Barneck said:
  
  December 4, 2013 at 3:02 pm
  
  We all find hacks that only work for us, but probably aren’t good to suggest to others. I think it is easier to finish a novel and then export it than to try to use special characters while writing. How is an editor going to respond to those characters?
  
  Also, most people are not going to think about converting from Word to HTML until after they finished the writing.
  
  Also, from your word document, you want to create a high quality PDF for print too. How would those special characters handle that?
  
Miko said:

August 15, 2017 at 4:29 am

Hi J. Abrams
Right now i create my own ebook with your instructions .
May i used your Fire-Light-Sample file epub for my own ebook ?

Regards Miko

- J. Abram Barneck said:
  
  August 15, 2017 at 1:44 pm
  
  Yes. I have a second template, too. http://jabrambarneck.com/2015/02/11/epub-templates/
  
  - Miko said:
    
    August 16, 2017 at 12:54 am
    
    Thx J. Abrams
    

J. Abram Barneck ~ Suspense, Fantasy, and Sci-fi Author