How to scan and digitize books: a tutorial

From ProleWiki, the proletarian encyclopedia

by Open source
Published: 2023-05-23 (last update: 2023-09-23)
10-20 minutes

ProleWiki has a plain-text Library of books, and to feed it, we need digitized books that we can copy from.

Digitizing books is something very few of us have ever had to do, yet it's not especially complicated.

In this guide, we'll cover various ways you can scan and digitize your own books, which will fit all budgets and technical capabilities.

ProleWiki has a library, comrade! If you digitize your book and it belongs to our categories (socialist works), we'd love to have it!

Some technical considerations

Scanned book being turned into its digitized format

Scanning a book isn't inherently difficult despite its bulky format, but there are some things you need to take into consideration.

Strictly speaking, scanning refers to the process of taking digital pictures of your book. The result is literally a picture, with all the artefacts in it, as the illustration shows.

Digitizing is the extra step after that and refers to the process of rendering your book into a native digital format. This normally means a PDF with proper text (which you can search in), white pages, etc.

Scanner specs

Your scanner should have a resolution of at least 300 dpi (dots per inch) to ensure good results, and especially to ensure proper OCR (optical character recognition, which we'll see at the end of this guide). All modern scanners achieve that resolution without any issues and sometimes even exceed it, so this only matters if you have an older scanner.

On top of dpi, you need to know how many megapixels your scanner has. Most common dedicated book scanners start at 13 megapixels (13 million pixels) and are also able to OCR your book at this resolution.

Both of those parameters determine the end quality of your scan, and also allow you to scan bigger books. This makes sense: if the book is bigger in size, then you need to move the camera further away to take a full picture of it and so you need more pixels to take a good picture of it.
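To make the link between page size, dpi and megapixels concrete, here is a minimal sketch in Python. It assumes ISO paper sizes (A4 is 210 x 297 mm, A3 is 297 x 420 mm) and a camera whose picture frames the page exactly, which a real overhead scanner won't quite manage. The function names are made up for this illustration.

```python
import math

MM_PER_INCH = 25.4


def megapixels_needed(width_mm, height_mm, dpi=300):
    """Megapixels required to capture a page of this size at the given dpi."""
    width_px = width_mm / MM_PER_INCH * dpi
    height_px = height_mm / MM_PER_INCH * dpi
    return width_px * height_px / 1_000_000


def effective_dpi(megapixels, width_mm, height_mm):
    """Best-case dpi when a camera's whole picture covers exactly this page.

    Assumption: the picture has the same proportions as the page, so no
    pixels are wasted on the background around the book.
    """
    area_sq_in = (width_mm / MM_PER_INCH) * (height_mm / MM_PER_INCH)
    return math.sqrt(megapixels * 1_000_000 / area_sq_in)


print(round(megapixels_needed(210, 297), 1))  # one A4 page at 300 dpi
print(round(megapixels_needed(297, 420), 1))  # an A3 area at 300 dpi
print(round(effective_dpi(13, 210, 297)))     # 13 MP camera framing one A4 page
```

Under these assumptions a single A4 page at 300 dpi needs roughly 8.7 megapixels, and the same camera gives fewer dots per inch as the area it has to cover grows, which is why bigger books call for more megapixels.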

Should you scan with your phone?

Most smartphones nowadays offer a document scanning camera mode (or there are apps for it). There are also cheap stands you can buy (or make yourself) to rest your phone on a raised, transparent surface with the book under it.

However, the resolution of your phone camera is likely not enough to take good pictures of a book. You can try this yourself with a phone you own at home, but phones on the market generally can't take pictures of high enough quality for this, especially not of documents with lots of tiny details (the font). These document scanners are mostly good when you need to send a picture to someone, but not for digitizing your books.

Which scanner and method to pick

A flatbed scanner

The at-home way: the destructive process

The easiest way to scan a book relies on a piece of equipment you probably already have at home: a flatbed scanner. Yours is probably attached to your home printer. Otherwise, they can be quite cheap to acquire and are so common that you will likely be able to rent or borrow one for free (ask public libraries!).

However, to get the best results, you'll need to destroy your book. Although opinions vary, this is okay to do on books that are still in print as they are easily replaced. If your books are older or even antiques however, it would be a shame to destroy them when there are non-destructive processes that exist.

In this process, you will simply unbind the book, scan all the pages one by one, and reassemble them in a PDF. Because the scanner is flat, you need a flat document to put in it.

Note that we don't call the destructive process easy or quick, as you first need to unbind the book and then recompose it in the scanner (i.e. scanning the pages in the right order)! It's an easy solution provided you have the equipment at home already, but there are definitely better ways to do this.

You could certainly try to scan your bound book one page at a time by holding half the book out of the scanner, but you will likely see terrible results with lots of artefacts and blurry characters near the inner margin.

Likewise, you can't really scan two pages at once with this type of scanner (laying the book face-down on the entire optical area) as it will create shadows and blur. This can also damage the book; they are not meant to have pressure on the spine.

How to unbind a book

If you're interested in scanning your book this way by destroying it, you need to know what kind of spine your book has and undo it. I'll just link to wikiHow's tutorial on how to unbind a book, as they explain it much better than I could.

Using a consumer book scanner

CZUR book scanner, one of the most recognizable brands in this field

Thankfully, there are affordable dedicated book scanners that exist nowadays. Models usually cost between 100 and 200 USD: not exactly "cheap", but also affordable if you plan on digitizing several books. They can also serve as your normal everyday scanner, which is something to consider.

As with a flatbed though, if you don't want to purchase one for yourself, you should look around and in libraries to rent or borrow a scanner! Your college or place of work might even have one.

The scanner I recommend is the CZUR Shine Ultra starting at 13 megapixels. It can scan up to A3 format (essentially two letter-sized pages) and has all the upsides listed below, except that the LED lights are useless as they shine too bright on the book which creates a glare.

Upsides

These scanners are able to scan your books quickly and efficiently, without destroying them.

Most of them are also able to OCR your book in one click, with usually good results.

They are definitely faster than a flatbed, with some models being able to scan a page every second. Most also have an "automatic scan" function, whereby the software will detect when you move to the next page and automatically take a picture of it.

To help you hold the book down, these scanners usually have an option in the software settings to digitally remove your fingers from the picture. Then, you can take a picture with the provided foot pedal.

Once you're done using the scanner, you can fold it in two so it takes up very little space.

Let's be honest, scanning a book is boring, repetitive work. Having something like this to speed up the process is definitely helpful: mathematically, if your book is 300 pages long and you scan 2 pages at a time, that's 150 scans, so at one to two seconds per scan you will be done in around 150-300 seconds, or at most about 5 minutes. That is if you don't have to redo pages and work very efficiently. In any case, they are much faster than flatbed scanners which, excluding prep time, can take 10 seconds to scan a single page.
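The arithmetic above can be sketched in a few lines of Python. The helper name is made up for this illustration, and the two-second pace is an assumption used to get the upper end of the 150-300 second range; the 10 seconds per page for a flatbed comes from the text.

```python
def scan_time_seconds(pages, pages_per_scan, seconds_per_scan):
    """Total scanning time, ignoring prep time and redone pages."""
    scans = -(-pages // pages_per_scan)  # ceiling division: a last odd page still needs a scan
    return scans * seconds_per_scan


book_pages = 300

fast = scan_time_seconds(book_pages, pages_per_scan=2, seconds_per_scan=1)
slow = scan_time_seconds(book_pages, pages_per_scan=2, seconds_per_scan=2)
flatbed = scan_time_seconds(book_pages, pages_per_scan=1, seconds_per_scan=10)

print(f"Book scanner: {fast}-{slow} seconds")
print(f"Flatbed: {flatbed} seconds (about {flatbed / 60:.0f} minutes)")
```

For the same 300-page book, the flatbed works out to 3000 seconds, or about 50 minutes of pure scanning time.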

Downsides

However, as these scanners are effectively a mounted camera with lots of software processing, you need to take into account the maximum size of books they can scan, as well as proper lighting. Flatbed scanners create their own lighting, so it's not a problem for them. Most of these overhead scanners also have LED lights, but on some models the lights are so poorly positioned that they are useless, and you will need an external source to properly light up the book (ideally, the whole page is lit up evenly).

These scanners don't work on every operating system! Check compatibility before buying one.

They also represent an investment, of course, especially if you are only going to use them once or twice. Most communist parties though have tons of books and documents in their own library, and might be interested in purchasing such a scanner! Run the idea by your party, comrade.

Handheld scanners

Example of a handheld book scanner

Handheld scanners are also a solution, although they are about as expensive as the above book scanners with more downsides.

Upsides

These scanners are absolutely portable, and you can take them anywhere with you (for example scanning books at the library without checking them out).

They also have good resolution as they are meant to scan books, obviously.

As far as I know, they provide their own lighting and ensure a proper scan every time.

They also save your pages to a memory card (normally a microSD card), which can be an upside or downside. The upside is you don't need to connect it to a PC and it will work on every operating system.

Downsides

There are a lot of downsides to these scanners though. They work on batteries and will run out before you've done a whole book: usually after 100 pages.

Learning how to "drag" the scanner across the page (it's mounted on small wheels) takes some getting used to, but it's nothing too difficult.

Your book size is also limited by the length of the scanner, although they should be able to scan every book out there because it's rare that books actually go above an A4 size, and most are usually pocket size (close to A5).

Finally, I imagine they must have issues properly scanning the middle of a book, when the inner margin is at its deepest and it's difficult to lay the pages completely flat.

I would recommend this scanner only if you find other uses for it in your daily life, like quickly scanning things at work or school. Otherwise, I would advise going for the book scanner mentioned just earlier.

Professional book scanners

Professional book scanner with the V-shape pedestal for a book

In all the scanners we've seen so far, there is an obvious problem: you have to hold the pages down somehow, which is just not how books work (especially the more pages they have, as the spine gets bigger).

Professional book scanners have a simple workaround: you lay the book on a V-shaped surface, which is better for the spine and alignment.

These scanners, however, are not meant for home use: the one pictured costs 15,000 USD.

There are services, maybe even around you, that can scan books for you. It's worth checking them out before making a purchase as the price might be cheaper than getting a whole scanner just for this.

This method, by the way, is how the Internet Archive scans its books.

Scanning your book

In any case, once you've selected your method to scan your book, you need to actually get to scanning it.

Since we're also going to digitize the book eventually, the best end format is PDF.

Book scanners with integrated OCR will guide you through all of this in their software. If you use a flatbed scanner, it might be able to save to PDF, or it might produce individual pictures.

Most scanner software will let you see your picture after it's scanned and will ask you if you want to take more, so you shouldn't run into issues. If a page looks improperly scanned (too dark, too blurry), then just retake it and delete the bad scan as you go along.

Since we're talking about scanning hundreds of pages, you should definitely be methodical about scanning so as not to leave any mistakes in your batch.

Creating a PDF if needed

If your scanner makes individual pictures, you'd better hope that it names them in order (e.g. SCAN_000001, SCAN_000002, SCAN_000003) and doesn't try to duplicate a name (SCAN_000001 (1)), as that will make them easier to combine into a single PDF.
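Before combining anything, it can help to check the file names for gaps and stray duplicates. The following is only a sketch: it assumes the scans are named like SCAN_000001.jpg (taken from the example above, with a file extension added), and the function name and pattern are hypothetical, so adjust them to whatever your scanner software actually produces.

```python
import re
from pathlib import Path

# Assumption: well-formed scans look like SCAN_000001.jpg (any digit count).
SCAN_NAME = re.compile(r"^SCAN_(\d+)\.(?:jpe?g|png|tiff?)$", re.IGNORECASE)


def check_scan_names(filenames):
    """Return (ordered, suspicious, missing) for a list of scan file names.

    ordered    -- well-formed names sorted by page number
    suspicious -- names that don't fit the pattern, e.g. "SCAN_000001 (1).jpg",
                  or that repeat a number already seen
    missing    -- numbers absent between the lowest and highest scan
    """
    numbered = {}
    suspicious = []
    for name in filenames:
        match = SCAN_NAME.match(Path(name).name)
        if match is None:
            suspicious.append(name)
            continue
        number = int(match.group(1))
        if number in numbered:
            suspicious.append(name)
        else:
            numbered[number] = name

    ordered = [numbered[n] for n in sorted(numbered)]
    missing = []
    if numbered:
        missing = [n for n in range(min(numbered), max(numbered) + 1)
                   if n not in numbered]
    return ordered, suspicious, missing


names = ["SCAN_000003.jpg", "SCAN_000001.jpg", "SCAN_000001 (1).jpg", "SCAN_000004.jpg"]
ordered, suspicious, missing = check_scan_names(names)
print(ordered)
print(suspicious)
print(missing)
```

Anything listed as suspicious or missing is worth looking at by hand before you build the PDF.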

To make a PDF if your scanner does individual pictures, you can upload them on ilovepdf (a favourite of college students around the world), although I'm not sure how many jpegs they can process at once.

Otherwise, you can download software. I know of Xodo PDF Reader and Editor on Windows (through the Microsoft Store), although it's a bit bulky to create PDFs with it.
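If you are comfortable running a small script, another option (not one of the tools mentioned above) is the Pillow imaging library for Python. This is a minimal sketch, assuming Pillow is installed (pip install Pillow) and that you pass the image paths already in page order; the function name is made up for this example.

```python
def images_to_pdf(image_paths, pdf_path, dpi=300):
    """Combine scanned images, in the order given, into one PDF file."""
    if not image_paths:
        raise ValueError("no images to combine")

    # Imported here so the rest of a script still works without Pillow.
    from PIL import Image  # third-party: pip install Pillow

    # Convert every scan to RGB so all pages share a mode the PDF writer accepts.
    pages = [Image.open(path).convert("RGB") for path in image_paths]
    first, rest = pages[0], pages[1:]

    # save_all + append_images writes every page into the same PDF.
    first.save(pdf_path, "PDF", save_all=True, append_images=rest, resolution=dpi)
    return pdf_path


# Example call (hypothetical file names):
# images_to_pdf(["SCAN_000001.jpg", "SCAN_000002.jpg"], "my_book.pdf")
```

Note that this loads every page into memory at once, so for a very long book at high resolution you may want to work chapter by chapter.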

Once your book is made into a single PDF, you should definitely run through the file to catch any mistakes (duplicates, improperly rotated pages, unreadable pages that weren't scanned properly, etc.).

Afterwards, we can move on to OCR'ing your book and properly digitizing it.

How to do Optical Character Recognition (OCR)

OCR is by no means new in tech, but it's only recently that it has started being a viable tool to digitize books.

OCR essentially turns a picture of text into actual text that you can select, copy, paste, search in, etc.

If you use a book scanner, it will likely be able to do that for you. It will also be able to keep the formatting intact (like page number position, margins, etc) and can even turn your book into a Word document. Otherwise, there are several free tools that exist online, with varying success.

OnlineOCR.net gives good results and is able to extract the result to a Word format (which you can then open with say LibreOffice and export as a PDF once again).
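For those who would rather run OCR offline, here is a hedged sketch using Tesseract, an open-source OCR engine that this guide does not otherwise cover. It assumes the Tesseract program, the pytesseract package and Pillow are all installed; the function names are made up, and the OCR step is passed in as a parameter so you can swap in any other engine.

```python
def tesseract_ocr(image_path, lang="eng"):
    """OCR one page image with Tesseract and return its text."""
    # Third-party imports kept local: pip install pytesseract Pillow,
    # plus the Tesseract program itself.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def ocr_book(image_paths, ocr=tesseract_ocr):
    """Run OCR over every page image, in order, and return one string."""
    pages = [ocr(path).strip() for path in image_paths]
    return "\n\n".join(pages)


# Example call (hypothetical file names):
# text = ocr_book(["SCAN_000001.jpg", "SCAN_000002.jpg"])
```

Unlike the scanner software described above, this only extracts plain text and does not try to keep the page layout.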

It's important to note that no OCR solution is 100% perfect. The higher the contrast of your source material (e.g. black-on-white), the better the results. Sometimes, though, the software will substitute the wrong character if it can't understand what it's looking at.

It's possible to go through the book and proofread it of course, but for most use cases OCR mistakes are left in, as the results still approach a 99.9% precision rate.
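To get a feel for what a 99.9% rate means in practice, here is a rough back-of-the-envelope calculation. The figure of 1800 characters per page is purely an assumption for illustration (real books vary a lot), as is the function name.

```python
def expected_ocr_errors(pages, chars_per_page, accuracy):
    """Rough number of wrong characters to expect in a whole book."""
    total_chars = pages * chars_per_page
    return round(total_chars * (1 - accuracy))


# Assumption: a 300-page book with about 1800 characters per page.
errors = expected_ocr_errors(pages=300, chars_per_page=1800, accuracy=0.999)
print(errors)
print(errors / 300)  # average wrong characters per page
```

Under these assumptions that is a few hundred wrong characters over the whole book, or roughly two per page, which is why proofreading is optional but still worthwhile for important texts.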

Sharing your digitized book (on ProleWiki?)

Once your book is properly digitized, how about uploading it to our library? We aim to host as many Marxist, socialist, socialist-aligned and historical documents as possible!

To properly share your book on ProleWiki, you will (at this time) need to request an account. Keep your .docx (Word) OCR document nearby. Once your account is approved, you can create a page for the book you digitized, select everything in your Word document (Ctrl+A), copy it and paste it on the book's page, and it just works!
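If you don't have a word processor at hand, the text can also be pulled out of the .docx file with a short script, since a .docx is a zip archive containing an XML file. This is only a sketch using Python's standard library: the function name is made up, it keeps plain paragraph text only (no formatting, footnotes or headers), and it is an alternative to the copy-paste route described above, not a ProleWiki requirement.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by Word for the main document body.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the paragraphs of a .docx file as plain text."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of each run.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


# Example call (hypothetical file name):
# print(docx_to_text("my_book.docx"))
```

The blank line between paragraphs keeps them separate once the text is pasted into a wiki page.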