Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

 Kindle Publishing Tools


To see a complete explanation of how to make a book available on Amazon Kindle, please see the Amazon Kindle Publishing Guidelines

You can download sample books here.

Amazon provides two main tools for creating ebooks in Kindle Format 8 (KF8): KindleGen and Kindle Previewer.

Your tool chain is not complete with the Amazon tools alone. In addition you need:



Old News ;-)

[Jul 21, 2013] Create rich-layout publications in EPUB 3 with HTML5, CSS3, and MathML by Liza Daly

Jul 05, 2012 | IBM developerWorks

Validating EPUB 3 documents

Because EPUB 3 relies on XML serializations for most content types, it is amenable to automatic validation. The EpubCheck tool is the canonical method for testing the validity and conformance of EPUB documents. EpubCheck is an open source (Berkeley Software Distribution-licensed) Java™ library. A developer preview version is available for use with EPUB 3 documents and is used throughout this article. See Resources for links to the latest version.

It is strongly suggested that you use the .xhtml extension for all EPUB content documents. Browsers will not interpret HTML content as application/xhtml+xml without that extension. XML processing mode is required when working with many of the features demonstrated in this article, such as CSS namespaces.
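As a rough illustration of the extension advice above, the following Python sketch (Python rather than the PHP used later on this page) flags manifest entries that are declared as XHTML content documents but do not carry the .xhtml extension. The function name and the sample entries are hypothetical and are not taken from the article.

```python
def non_xhtml_content_docs(manifest_items):
    """Return hrefs of XHTML content documents that lack the .xhtml extension.

    manifest_items is an iterable of (href, media_type) pairs, for example
    as read from the manifest of an OPF file.
    """
    return [
        href
        for href, media_type in manifest_items
        if media_type == "application/xhtml+xml"
        and not href.lower().endswith(".xhtml")
    ]


# Hypothetical manifest entries, for illustration only.
sample_items = [
    ("part0001.html", "application/xhtml+xml"),
    ("toc.xhtml", "application/xhtml+xml"),
    ("cover.png", "image/png"),
]

flagged = non_xhtml_content_docs(sample_items)
print(flagged)
```

Only entries whose media type is application/xhtml+xml are considered, so images and stylesheets are never reported.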

Typically, you interact with EpubCheck through the command line, as shown below.

$ java -jar epubcheck-3.0b3.jar sample.epub

Epubcheck Version 3.0b3

No errors or warnings detected.

If you get the error java.lang.NoClassDefFoundError: com/thaiopensource/validate/SchemaReader in response, make sure that the lib/ directory that came with the EpubCheck distribution is in the same directory as the EpubCheck JAR file.

EpubCheck 3 can validate a single subcomponent of the EPUB package individually, as in Listing 1. This extremely useful feature, which is used in the examples in this article, can:


Listing 1. Running EpubCheck 3 against a single file type
$ java -jar ~/src/epubcheck-3.0b3.jar sample-toc.xhtml -mode nav 
Epubcheck Version 3.0b3

WARNING: sample-toc.xhtml: File is validated as a single file of type nav and version 3! 
         Only a subset of the available tests is run!

No errors or warnings detected.
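If EpubCheck is to be called from a script rather than typed by hand, the two invocations shown above can be assembled programmatically. The sketch below is in Python and uses only the arguments that appear in the transcripts (the JAR name, the target file, and -mode nav). The helper names are hypothetical, and nothing is assumed here about EpubCheck's exit codes or output format.

```python
import subprocess
from pathlib import Path


def epubcheck_command(jar, target, mode=None):
    """Build the argument list for an EpubCheck run.

    With mode=None the whole EPUB is checked; with a mode such as "nav"
    a single file is validated, as in Listing 1.
    """
    cmd = ["java", "-jar", str(jar), str(target)]
    if mode is not None:
        cmd += ["-mode", mode]
    return cmd


def lib_dir_present(jar):
    """Check that a lib/ directory sits next to the JAR file.

    A missing lib/ directory is the cause of the NoClassDefFoundError
    mentioned above.
    """
    return (Path(jar).resolve().parent / "lib").is_dir()


def run_epubcheck(jar, target, mode=None):
    """Run EpubCheck and return its exit status and combined output."""
    result = subprocess.run(
        epubcheck_command(jar, target, mode),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


full_cmd = epubcheck_command("epubcheck-3.0b3.jar", "sample.epub")
nav_cmd = epubcheck_command("epubcheck-3.0b3.jar", "sample-toc.xhtml", mode="nav")
print(" ".join(full_cmd))
print(" ".join(nav_cmd))
```

run_epubcheck is defined but not called here, because it needs a Java runtime and the EpubCheck JAR on the local machine.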

Polish the EPUB by Colin Beckingham

Jul 05, 2012 | IBM developerWorks

Review page numbers

Although you can use Sigil to find and review page breaks and numbering, doing so in a document of more than 100 pages might be tedious. An easier way is to iterate through the documents with PHP and review the numbering.

The script in Listing 1 finds and reviews the HTML pages and runs through the page breaks. It finds the number for the first page, which is quite often different from page 1, and verifies that each subsequent page is an increment from the first page. Although the page numbering test is fairly simple, it is an example of how to use the OPF file to find and examine the component HTML.


Listing 1. Page checking the EPUB with PHP and SimpleXML
<?php
/* epub is a zipped package containing many files
  the file "content.opf" contains the pointers to the constituent files
  inside content.opf you have 

  package (root)
    -> manifest
      -> item
          which we need to filter for media-type="application/xhtml+xml"
          and to check these are real text pages, not just full page images

  these are the text chapters which need to be checked one by one
*/
$firstpage = 0;
$oldpage = 0;
// look for the text to be checked
$opf_file = "./OEBPS/content.opf";
if (!file_exists($opf_file)) {
  //cleanup();
  die("Cannot find the OPF file\n");
} else {
  echo "Found it!\n";
  $xml = simplexml_load_file($opf_file);
  // get the manifest items
  foreach ($xml->manifest->item as $mi) {
    if ($mi['media-type']=='application/xhtml+xml') {
      echo "Found ".$mi['href']."\n";
      if (substr($mi['href'],0,4) == 'part') {
          echo "Page number check in document ".$mi['href']."\n";
          echo scan_chap("./OEBPS/".$mi['href']);
      }
    }
  }
}
function scan_chap($chap) {
global $firstpage, $oldpage;
  echo "Trying to page num check section $chap \n";
  if (!file_exists($chap)) {
    echo "Cannot find the chapter $chap\n";
  } else {
    echo "Found it!\n";
    $xml = simplexml_load_file($chap);
    //$i = 0;
    foreach ($xml->body->div->div as $pagnumdiv) {
      if ($pagnumdiv["class"]=='newpage') {
          echo $pagnumdiv["id"]."\n";
          $page = (int) substr($pagnumdiv["id"],5);
          if ($firstpage == 0) {
            // first page break seen: let the book set the starting number
            $firstpage = $oldpage = $page;
          } else {
            if ($page != $oldpage+1) echo "Problem at page after $oldpage\n";
            $oldpage++;
          }
      }
    }
  }
  return "Done...\n";
}
?>

The code first sets up global variables for the number of the first logical page found (set once at the beginning of the loop) and the number of the previous page checked (which changes with each iteration). It then declares the name of the OPF file, looks for that file, and, if it cannot find it, ends with an error. If the file is found, the script opens it as an XML object and uses the media-type attribute to pick out the manifest entries that appear to be HTML. In this particular EPUB document, some HTML files contain only a full-page image and can therefore be ignored. The file names of these pages contain the string leaf; the files that contain extended text have names that begin with part. The code keeps only the latter by testing the first four characters of each file name.
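The manifest filtering described above can also be sketched in Python instead of PHP. The sample OPF fragment and the function name below are made up for illustration. The sketch assumes the manifest elements are in the standard OPF namespace, which Python's ElementTree requires to be named explicitly.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPF package documents.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def text_chapters(opf_xml, prefix="part"):
    """Return hrefs of XHTML manifest items whose file name starts with prefix."""
    root = ET.fromstring(opf_xml)
    hrefs = []
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        if item.get("media-type") != "application/xhtml+xml":
            continue  # skip images, stylesheets, and so on
        href = item.get("href", "")
        if href.startswith(prefix):
            hrefs.append(href)
    return hrefs


# Hypothetical OPF content; a real file would be read from OEBPS/content.opf.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="l1" href="leaf0001.html" media-type="application/xhtml+xml"/>
    <item id="p1" href="part0001.html" media-type="application/xhtml+xml"/>
    <item id="p2" href="part0002.html" media-type="application/xhtml+xml"/>
    <item id="c1" href="cover.jpg" media-type="image/jpeg"/>
  </manifest>
</package>"""

chapters = text_chapters(SAMPLE_OPF)
print(chapters)
```

The leaf page and the cover image are skipped, so only the two part files remain for checking.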

Now that you know the name of the file, you can read this file into its own simpleXML object. Iterating through the <div> tags and filtering for those that have a class attribute of newpage, you can find the value of the id attribute that contains the page number. You need to let the book tell you which number is the first page because this is often not page 1, and after this value is stored in the global first page variable, you can go on to predict what the number of the next page should be. If it happens not to be the expected number, the script generates an error and continues checking.

This script does not attempt to make changes to the text. It merely flags what it thinks might need your attention.
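The numbering test itself is independent of XML parsing and can be isolated as a small function. The Python sketch below follows the same logic as the PHP loop in Listing 1: the first id sets the starting number, and each later id is compared with a counter that increases by one per page break. The id format "page-N" is an assumption; the listing shows only that the number follows a five-character prefix.

```python
def check_page_sequence(ids, prefix_len=5):
    """Return a list of messages for page ids that break the expected sequence.

    ids is the list of id attribute values of the page-break elements,
    in document order. The number is taken from position prefix_len onward.
    """
    problems = []
    expected = None
    for page_id in ids:
        page = int(page_id[prefix_len:])
        if expected is None:
            # The book decides where numbering starts; it is often not 1.
            expected = page
            continue
        expected += 1
        if page != expected:
            problems.append(f"Problem at page after {expected - 1}")
    return problems


# Hypothetical ids: page 9 is missing, and numbering is back in step afterwards.
sample_ids = ["page-7", "page-8", "page-10", "page-10"]
report = check_page_sequence(sample_ids)
print(report)
```

Because the counter advances by one regardless of what was found, a single skipped number is reported once only if the numbering falls back into step afterwards; otherwise every following page is flagged as well.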

Spell checking using PHP, XML, and Enchant

Spelling is a different problem. In this case, you are really after errors such as the word Upon, which the OCR has read as TJpon or IJpon: close, but not correct. The same word might come in as a number of alternatives, and the spelling routine may see it as so strange that the suggestions it offers are not close or helpful.

A spelling routine examines words one by one and compares them to a standard known list, pointing out those that don't match, making suggestions, and allowing you to make changes. Sigil can make replacements of specific strings across multiple documents in the EPUB package, but you need the power of a scripting engine such as PHP, Perl, Python, and so on, together with specialist libraries, for finer-grained control.

Newer versions of PHP now contain the hooks necessary not only for digging into XML and HTML files using SimpleXML but also for using the Enchant spelling manager library. Enchant is capable of managing multiple different base spelling lists. It helps to differentiate UK English from US English spellings, for example.

The script in Listing 2 examines each of the manifest files separately using the same method as in Listing 1, this time going through paragraph by paragraph and word by word checking each against the known spelling list. It uses the same method of iterating through the HTML component files as in Listing 1 and adds the required instructions to access the dictionaries.


Listing 2. Spell checking the EPUB with PHP, SimpleXML, and Enchant
<?php
  // spell check an epub
/* epub is a zipped package containing many files
  the file "content.opf" contains the pointers to the constituent files
  inside content.opf we have 

  package (root)
    -> manifest
      -> item
          which we need to filter for media-type="application/xhtml+xml"
          and to check these are real text pages, not just full page images

  these are the text chapters that need to be checked one by one

  Acknowledgment: Some of the dictionary-related code
  was copied from the PHP Enchant manual page

*/
// set up console for input
$console = fopen("php://stdin","r");
// set up enchant (from PHP manual)
$tag = 'en_CA';
$r = enchant_broker_init();
$bprovides = enchant_broker_describe($r);
echo "Current broker provides the following backend(s):\n";
print_r($bprovides);
$dicts = enchant_broker_list_dicts($r);
print_r($dicts);
if (enchant_broker_dict_exists($r,$tag)) {
    $d = enchant_broker_request_dict($r, $tag);
    $dprovides = enchant_dict_describe($d);
    echo "dictionary $tag provides:\n";
} else {
  cleanup();
  die ("Cannot set up the spell checker\n");
}
// look for the text to be checked
$opf_file = "./OEBPS/content.opf";
if (!file_exists($opf_file)) {
  cleanup();
  die("Cannot find the OPF file\n");
} else {
  echo "Found it!\n";
  $xml = simplexml_load_file($opf_file);
  foreach ($xml->manifest->item as $mi) {
    if ($mi['media-type']=='application/xhtml+xml') {
      echo "Found ".$mi['href']."\n";
      if (substr($mi['href'],0,4) == 'part') {
          echo "Need to spell check ".$mi['href']."\n";
          echo scan_chap("./OEBPS/".$mi['href']);
      }
    }
  }
}
function cleanup() {
  global $d, $r;
  // the dictionary may not have been opened when cleanup() runs on an error path
  if (isset($d)) enchant_broker_free_dict($d);
  enchant_broker_free($r);
}
function scan_chap($chap) {
  echo "Trying to spell check section $chap \n";
  if (!file_exists($chap)) {
    echo "Cannot find the chapter $chap\n";
  } else {
    echo "Found it!\n";
    $xml = simplexml_load_file($chap);
    $i = 0;
    foreach ($xml->body->div->p as $para) {
      echo $para."\n";
      // need to spell check the contents of $para
      spell_check(trim($para));
      $i++;
      if ($i > 5) break;
    }
  }
  return "Done...\n";
}
function spell_check($para) {
global $console, $d;
  $para = str_replace("  "," ",$para);
  $para = str_replace(".","",$para);
  $para = $para." ";
  echo "Checking text : $para\n";
  $start = 0;
  // walk the paragraph one space-delimited word at a time
  while (($pos = strpos($para," ",$start)) !== false) {
    $len = $pos-$start;
    $theword = substr($para,$start,$len);
    // tidy up theword which may contain punctuation
    $punc = array(':',';',',','"','?','!');
    $theword = str_replace($punc,"",$theword);
    //
    if ((strlen($theword) > 0) and (!is_numeric($theword))) {
      if ($wordcorrect = enchant_dict_check($d, $theword)) {
          echo "$theword is OK!\n";
      } else {
          $suggs = enchant_dict_suggest($d, $theword);
          echo "Suggestions for <$theword>:\n";
          //print_r($suggs);
          $max = 5;
          // show only the first $max suggestions
          foreach ($suggs as $k=>$sugg) {
            if ($k >= $max) break;
            echo "$k => $sugg\n";
          }
          $inp = fgets($console,1024);
      }
    }
    $start += $len+1;
  }
}
?>

In this code, you start by declaring a file pointer to standard input so that you can get interactive information from the keyboard during the spell-check process. The next section establishes the connection to the dictionaries. Note that the tag variable is set to en_CA, which, in this instance, puts a preference on Canadian English. The result is that the checker chooses colour over color, acknowledgement over acknowledgment, and so on. A more standard setting for the tag is en_US. After the dictionary is connected, the script performs the same search for HTML text files as in Listing 1, but this time, instead of looking for page number <div> tags, it looks for paragraphs with real text.

Before performing the actual spell check, the script cleans up the paragraph text to make it more manageable by collapsing double spaces and removing periods, because the goal is to examine the text word by word. Other punctuation (colons, semicolons, commas, quotation marks, question marks, and exclamation marks) is stripped from each word as it is extracted. Then the actual spell checking starts by moving from word to word in the paragraph, ignoring words that are numbers and comparing each remaining word to the dictionary. Where the dictionary does not contain the word, the script suggests words that might be a better substitute. In this case, the script presents only the first five alternates. The script halts at each problem word and waits for user input from the keyboard. At this point, you can add code to change, ignore once, ignore for the session, and so on.
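The clean-up and word-by-word pass described above can be sketched without Enchant. In the Python version below, a plain set of known words stands in for the Enchant dictionary, so no suggestions are produced; this is a simplification and not the article's method. The function names and the sample sentence are hypothetical, and the numeric test (isdigit) is narrower than PHP's is_numeric.

```python
# Punctuation stripped from each word, matching the $punc array in Listing 2.
PUNCTUATION = ':;,"?!'
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)


def words_to_check(paragraph):
    """Split a paragraph into the words that should be spell checked."""
    text = paragraph.strip().replace(".", "")
    words = []
    for raw in text.split():  # split() also absorbs runs of spaces
        word = raw.translate(STRIP_TABLE)
        if word and not word.isdigit():
            words.append(word)
    return words


def unknown_words(paragraph, known_words):
    """Return the words of the paragraph that are not in known_words.

    known_words is a set of lower-case words standing in for a real
    dictionary; the lookup ignores case.
    """
    return [w for w in words_to_check(paragraph) if w.lower() not in known_words]


# Hypothetical OCR output and a tiny stand-in dictionary.
sample = 'TJpon the hill, in 1843, stood a house.  "Upon my word!"'
known = {"upon", "the", "hill", "in", "stood", "a", "house", "my", "word"}

flagged_words = unknown_words(sample, known)
print(flagged_words)
```

A real checker would replace the set lookup with a dictionary query and offer suggestions at this point, as Listing 2 does through Enchant.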


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...


Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of the Google privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. This site is perfectly usable without JavaScript.

Last modified: March, 12, 2019