anned, "OCRized" and proofread.
Digitization in "text format" means a book can be copied, indexed, searched,
analyzed and compared with other books. It is possible to search the content of
the book with the "Find" button available in any browser and any software,
without a specific search engine. Project Gutenberg provides a "Nearly Full
Text" search (on the first 100 K of each file) using Google, with a database
updated approximately monthly. It also provides a search of book metadata
(author, title, brief description, keywords) as a participant in Yahoo!'s
Content Acquisition Program, with a database updated weekly. (Please see the
bottom of the Online Book Catalog.) In the Advanced Search, several fields can
be filled in: author, title, subject, language, category (any, audio book,
music, pictures), LoCC (Library of Congress Classification), filetype (text,
PDF, HTML, XML, JPEG, etc.), and eText/eBook No. A "Full Text" field was
recently added as an experimental feature.
The advantages of digitization in "text format" are numerous. It produces a
smaller file that is easier to send, unlike digitization in "image format",
which produces a bulky "photo" file. Unlike other formats, the files are
accessible for low-bandwidth use. They can be copied as much as needed to
produce new digital or print versions for free. Typos pointed out after the
text is released can be fixed at any time. Readers can change the font and size
of characters, the margins or the number of lines per page. Visually impaired
readers can increase the letter size. Blind readers can use speech synthesis
software. All this is very difficult, if not impossible, with many other
formats.
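
As a small illustration of why plain text is easy to adapt to the reader, the
sketch below re-wraps the same passage to two different line widths using
Python's standard textwrap module. The function name reflow and the sample
sentence are illustrative assumptions, not part of any Project Gutenberg tool.

```python
import textwrap


def reflow(text: str, width: int) -> str:
    """Re-wrap plain text to the given line width, paragraph by paragraph."""
    paragraphs = text.split("\n\n")
    # Collapse existing line breaks inside each paragraph, then wrap again.
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped)


# Sample sentence chosen for illustration only.
sample = (
    "Digitization in text format means a book can be copied, indexed, "
    "searched, analyzed and compared with other books."
)

narrow = reflow(sample, 40)  # e.g. a small screen or a large font
wide = reflow(sample, 72)    # e.g. a conventional plain-text line width

print(narrow)
print()
print(wide)
```

The words are identical in both versions; only the line breaks change, which is
exactly what a fixed page image cannot offer.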
The eBooks released are meant to be 99.9% accurate in the eyes of the general
reader. The goal is not to create authoritative editions, nor to argue with a
picky reader over whether a certain sentence should have a colon instead of a
semi-colon between its clauses.
Project Gutenberg is convinced that proofreading by human beings is a very
important step, and that this step makes all the difference. The use of scanned
books as is --converted to text format by OCR software with no proofreading--
gives a much lower quality result. After running OCR software, the text is 99%
reliable, in the best of cases. After proofreading, the text becomes 99.95%
reliable (a high percentage which is also the standard at the Library of
Congress).
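
To give a rough sense of what these percentages mean in practice, the sketch
below converts them into an expected number of errors for one book. It assumes
that the reliability figures apply per character, which the text does not
state, and it uses a hypothetical book length of 500,000 characters.

```python
def expected_errors(num_chars: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given per-character accuracy."""
    return num_chars * (1.0 - accuracy)


# Hypothetical book length (an assumption, not a figure from the text).
BOOK_CHARS = 500_000

# Reliability figures quoted above, read here as per-character accuracy.
ocr_errors = expected_errors(BOOK_CHARS, 0.99)          # raw OCR, best case
proofread_errors = expected_errors(BOOK_CHARS, 0.9995)  # after proofreading
ratio = ocr_errors / proofread_errors

print(f"raw OCR:   about {ocr_errors:.0f} errors")
print(f"proofread: about {proofread_errors:.0f} errors")
print(f"ratio:     about {ratio:.0f} times fewer errors after proofreading")
```

Under these assumptions, the step from 99% to 99.95% cuts the errors in such a
book from about 5,000 to about 250, roughly a twentyfold reduction.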
For this reason, Project Guten