Creating your Ebook: Step 1 – Word to HTML
16 Saturday Feb 2013
This article assumes you have used Microsoft Word to create your manuscript.
Step 1 – Prepare your Word document
Things to note
Find all instances of space quote (a space followed by a single quote that begins a word) and write them down so you can find them again. I found these in my novel.
Days of ’47
Find and Replace
Find all double spaces and replace them with a single space. (Ignore the quotes; I use them to show the spaces.)
Find: "  " Replace: " "
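Find single quote and replace with single quote. With Word's smart quotes AutoFormat option turned on, this turns every straight quote into a smart quote: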
Find: ' Replace: '
Todo: Now go back and find all the single quotes before a word and change them back. If you didn’t write them down, search for space quote and fix them to be the correct smart quote: ’
Find double quote and replace with double quote:
Find: " Replace: "
Find space plus paragraph special character and replace with just the special character. Run it multiple times until it replaces nothing. Again, ignore the quotes; I use them to show you the space before the paragraph special character.
Find: " ^p" Replace: ^p
Find paragraph special character plus space and replace with just the special character. Run it multiple times until it replaces nothing. Again, ignore the quotes, I use them to show you the space after the paragraph special character.
Find: "^p " Replace: ^p
Find double spacing (an empty paragraph between paragraphs) and replace it with a single paragraph special character.
Find: ^p^p Replace: ^p
Find all ellipses and replace them. I prefer the ellipsis in this format: space dot space dot space dot, with nonbreaking spaces (^s) so the dots stay together on one line. (Ignore quotes)
Find: " . . ." Replace: "^s.^s.^s."
Find all ellipses followed by punctuation and replace them. Repeat the search for each punctuation mark. (Ignore quotes)
Find: " . . . ?" Replace: "^s.^s.^s.^s?"
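If you would rather script these cleanups than click through them, here is a minimal Python sketch of the whitespace and ellipsis steps. It assumes the manuscript has been saved as plain text with one paragraph per line, so "\n" stands in for Word's ^p and a nonbreaking space stands in for ^s. The function name tidy_text is made up for illustration, and the smart quote steps are left out because they depend on Word's AutoFormat.

```python
import re

NBSP = "\u00a0"  # nonbreaking space, standing in for Word's ^s
ELLIPSIS = f"{NBSP}.{NBSP}.{NBSP}."


def tidy_text(text: str) -> str:
    """Apply the Step 1 whitespace and ellipsis cleanups to plain text.

    Assumption: one paragraph per line, so "\n" plays the role of ^p.
    """
    text = re.sub(r" {2,}", " ", text)    # double spaces -> single space
    text = re.sub(r" +\n", "\n", text)    # space before a paragraph break
    text = re.sub(r"\n +", "\n", text)    # space after a paragraph break
    text = re.sub(r"\n{2,}", "\n", text)  # empty paragraphs
    # Ellipsis followed by punctuation: " . . . ?" -> "^s.^s.^s.^s?"
    text = re.sub(
        r" \. \. \. ([?!,;:])",
        lambda m: ELLIPSIS + NBSP + m.group(1),
        text,
    )
    # Any remaining plain ellipsis: " . . ." -> "^s.^s.^s."
    return text.replace(" . . .", ELLIPSIS)
```

Because the patterns use + and {2,}, a single pass does the work of running each Word search until it replaces nothing.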
Step 2 – Convert your Word doc to HTML (hard way)
Most sites recommend that you follow these steps with your Word document:
- Open your word document.
- Click File | Save As.
- Choose HTML Filtered.
- Enter a file name.
- Click Save.
- Clean the html the rest of the way yourself.
The problem with this method is the last step, because the HTML you end up with is still heavily styled. We want every paragraph wrapped in a plain p tag with no attributes, like the first line below, but Word produces tags like the second:
<p>Your paragraph here...</p>
<p class='MsoNoSpacing' style='text-indent:.5in'>Your paragraph here...</p>
So this is where all the documentation tells you to “clean up the tags” and do it yourself. This is tedious (even if you are a techie who knows regex, like me), so why do it by hand?
You can avoid much of the tedium using Find and Replace: find all instances of the styled p tag and replace them with the plain p tag. However, there are many different tags, and with a huge book, how will you ever know if you replaced them all? You either have to go through your entire book in HTML or learn to search with regular expressions (regex).
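One way to check whether any styled tags are left is to count them. The sketch below is only an illustration: the function name styled_tag_counts is invented, and the pattern is a rough regex check (it assumes no attribute value contains a ">" character), not a real HTML parser.

```python
import re
from collections import Counter

# An opening tag whose name is followed by whitespace and at least one
# attribute character. "<p>" and "<br />" do not match; "<p class=...>" does.
STYLED_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\s+[^>/][^>]*>")


def styled_tag_counts(html: str) -> Counter:
    """Count, per tag name, the opening tags that still carry attributes."""
    return Counter(m.group(1).lower() for m in STYLED_TAG.finditer(html))


if __name__ == "__main__":
    sample = (
        "<p>Your paragraph here...</p>"
        "<p class='MsoNoSpacing' style='text-indent:.5in'>Your paragraph here...</p>"
    )
    for tag, count in styled_tag_counts(sample).items():
        print(f"{tag}: {count} styled tag(s) remaining")
```

Run it after each round of Find and Replace; when the counter comes back empty for the tags you care about, you are done.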
Step 3 – Clean the HTML Using Regex
You can use regex in Find and Replace in Sigil, and also in Notepad++.
Make all p tags empty
Find: <p\b[^>]*> Replace: <p> (the \b keeps the pattern from also matching other tags that start with p, such as <pre>)
Make all h1 tags empty
Find: <h1[^>]*> Replace: <h1>
Make all span tags empty
Find: <span[^>]*> Replace: <span>
Note: Notice the pattern? You can do this for any tag you want. In fact, you can do this for all tags at once.
Find: (<[^> ]+)([^>]*)(>) Replace: \1\3
Or for all tags except the tags you want to leave alone:
Find: (<(?!a|img|\?|!|html|link|/)[^> ]+)( [^>]*)(>) Replace: \1\3
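The same idea can be sketched in Python for anyone who prefers a script to an editor. This is an illustration under assumptions, not the exact editor regex: the function name strip_attributes is invented, the keep list mirrors the a, img, html and link tags named in the pattern above, and attribute values are assumed not to contain ">". Unlike the editor pattern, it compares whole tag names and keeps the trailing slash of self-closing tags.

```python
import re

# Tags whose attributes are left alone (taken from the keep list above).
KEEP = ("a", "img", "html", "link")

# An opening tag with at least one attribute. Closing tags, comments and
# <?xml ...?> declarations never match because the name must start with a letter.
TAG_WITH_ATTRS = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\s[^>]*?(/?)>")


def strip_attributes(html: str, keep=KEEP) -> str:
    """Remove all attributes from every tag not listed in `keep`."""

    def replace(match: re.Match) -> str:
        name, slash = match.group(1), match.group(2)
        if name.lower() in keep:
            return match.group(0)  # leave this tag untouched
        return f"<{name}{slash}>"  # bare tag, self-closing slash preserved

    return TAG_WITH_ATTRS.sub(replace, html)
```

For example, strip_attributes("<p class='MsoNoSpacing'>Hi</p>") returns "<p>Hi</p>", while an a tag with an href passes through unchanged.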
Remove all instances of an html tag but leave the text between
Here is an example using the span tag. You may have to run it multiple times to remove nested tags (tags inside tags).
Find: <span>([^<]*)</span> Replace: \1
Or you can just use two find and replace steps:
Find: <span[^>]*> Replace: (leave empty)
Find: </span> Replace: (leave empty)
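In a script, the two find and replace steps collapse into one substitution that deletes both the opening and the closing tag. A minimal sketch, with unwrap_tag as a made-up helper name:

```python
import re


def unwrap_tag(html: str, tag: str = "span") -> str:
    """Delete every opening and closing `tag`, keeping the text between them."""
    # </? matches both "<span ...>" and "</span>"; \b stops "span" from
    # matching a longer tag name that merely starts with the same letters.
    pattern = rf"</?{re.escape(tag)}\b[^>]*>"
    return re.sub(pattern, "", html)
```

Because each tag is removed on its own rather than as an open/close pair, nested spans disappear in a single pass.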
Step 3 (Alternate) – Convert your Word doc to HTML (untrusted way)
You could find a site like HTML Tag and Attribute Remover and paste your HTML into it, if you trust the site with your manuscript.
Keep these tags:
Keep these attributes:
href src alt
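If you do not want to trust a website with your manuscript, the same keep-list idea can be sketched locally with Python's standard html.parser module. Everything named here (TagCleaner, clean_html) is hypothetical, and the tags in the usage note are only an example, since which tags to keep is your choice. The kept attributes are the three listed above. Comments and the doctype are dropped in this sketch.

```python
from html import escape
from html.parser import HTMLParser

KEEP_ATTRS = {"href", "src", "alt"}  # the attributes listed above


class TagCleaner(HTMLParser):
    """Re-emit HTML, dropping unlisted tags (but not their text) and attributes."""

    def __init__(self, keep_tags):
        super().__init__(convert_charrefs=False)  # pass entities through as-is
        self.keep_tags = set(keep_tags)
        self.out = []

    def _emit_open(self, tag, attrs, closing=""):
        if tag not in self.keep_tags:
            return
        kept = ""
        for name, value in attrs:
            if name in KEEP_ATTRS:
                # The parser unescapes attribute values, so escape them again.
                kept += f' {name}="{escape(value or "")}"'
        self.out.append(f"<{tag}{kept}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_open(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def clean_html(html: str, keep_tags) -> str:
    cleaner = TagCleaner(keep_tags)
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

For example, clean_html(source, keep_tags={"p", "a", "img"}) would keep paragraphs, links and images, strip span tags while leaving their text, and drop every attribute except href, src and alt.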