©2025 Poal.co

This is why I have been stopping in old, small local bookstores for years, looking for early editions of many books. You can't edit what is already printed on the page. I am even considering buying multiple editions, building one of those book-scanner bots, hooking it up to some OCR, and doing a diff between the various print versions to show exactly what was changed between editions.

What do you think? Good idea or waste of time?
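
For what it's worth, the diff step itself is the easy part once the OCR text exists. Here is a minimal sketch using Python's standard `difflib`, assuming each edition has already been OCR'ed into one plain-text string. The function names (`normalize`, `edition_diff`) and the sentence-level splitting are illustrative choices, not anything a particular scanner or OCR tool provides. Splitting on sentences rather than lines matters because two printings rarely break lines in the same places.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Turn raw OCR text into a list of sentence-like units."""
    # Rejoin words hyphenated across a line break ("si-\nlence" -> "silence").
    # Assumption: a hyphen followed by a newline and a lowercase letter is a
    # soft hyphen. Real compounds broken at a line end will be merged too.
    text = re.sub(r"-\n(?=[a-z])", "", text)
    # Collapse all remaining whitespace so line layout no longer matters.
    text = re.sub(r"\s+", " ", text).strip()
    # Split after sentence-ending punctuation (crude, but good enough to diff).
    return re.split(r"(?<=[.!?]) ", text)


def edition_diff(old: str, new: str,
                 old_name: str = "edition A",
                 new_name: str = "edition B") -> list[str]:
    """Return unified-diff lines showing only the sentences that changed."""
    return list(difflib.unified_diff(
        normalize(old), normalize(new),
        fromfile=old_name, tofile=new_name,
        lineterm="", n=0,  # n=0: no unchanged context sentences
    ))


if __name__ == "__main__":
    first = "It was a dark night. The men walked\nhome in si-\nlence. Nobody spoke."
    later = "It was a dark night. The people walked home in silence. Nobody spoke."
    print("\n".join(edition_diff(first, later)))
```

OCR errors will show up as differences as well, so in practice the output still needs a human pass to separate real edits from misreads.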

(post is archived)

[–] 1 pt

That sounds awesome, but I'd worry about burning out if you decide to do too many books at once without anticipating the potential workload.

If I were going to do something like this, here's what I'd do:

  • Get a high-quality camera that's computer-controlled (focus, f-stop, exposure, maybe zoom). If it's a commercial camera make sure it can be easily removed from the frame, since someone will see the camera and want to use it outside the frame.
  • Mark out the maximum available frame on the capture plate. You don't want to capture an entire book only to realize that the top 10% of every page is cut off.
  • Make a set of weights to hold down the book. Left and right side are essential, a middle weight may be redundant unless you're capturing newer books with less-flexible spines. Some sort of gripping surface on the outside weights is essential as you don't want to be messing with weight placement to hold the pages in place (which would increase time-per-page). Maybe some sort of silicone, or if you set up metal plates hinged on the long axis on an adjustable rack you could use felt.
  • Set up a dedicated computer for image capture, with an interface for specifying the book's metadata, and a dedicated button to start a capture. Multi-tasking it on an existing computer would increase the chances of failure. You could get away with building a RasPi into the frame for this function, but its ability to perform OCR would be limited, so the processing may need to be done on another machine.
  • Set up dedicated storage with encrypted offsite backups. You don't want a fire to destroy all your work.
  • Write programs to automate every possible aspect of processing each page. Python would be the language of choice due to its flexibility and ease of writing. E.g.:
    • A capture program on the dedicated computer specifically for accurate image capture and storage. It detects out-of-focus images and recaptures them, strips EXIF data, splits images into left and right pages, and stores the images in a hierarchical folder structure for future retrieval. It also stores information on each page in a database.
    • A processing program that monitors the database for any new images and runs them through OCR, runs the result through a spelling and grammar checker for consistency, and stores both the raw output and a restructured output as text files within the folder structure (i.e. remove non-paragraph line breaks, and remove headers, footers, and page numbers). It also updates the database to specify the page has been OCR'ed, and adds any spelling/grammar inconsistencies to a separate table that references the book/page number.
  • Create some sort of interface for verification of every page. Have every page image and its OCR available for side-by-side viewing, and have a task queue of all spelling/grammar errors available for direct access for error correction.
  • (Optional) Make the above interface available to the public, with appropriate protections. Have all user submissions added to a new table for manual verification by a trusted user. This will be costly, either by dint of maintaining a self-hosted server (risk of DDoS, doxing, and hackers gaining a foothold), or by running it in the cloud (massive server and storage costs).
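
To make the processing program a bit more concrete, here's a rough sketch of how it could look, using only the Python standard library. Everything named here is an assumption for illustration: the `pages` table and its columns, the `restructure` and `process_pending` helpers, and the `ocr` argument, which stands in for whatever OCR engine you end up using (any callable that takes an image path and returns text). For brevity it keeps the raw and restructured text in the database instead of text files, and it leaves out the spelling/grammar pass.

```python
import re
import sqlite3

PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")  # a line holding only a number


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) page-tracking table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               book TEXT, page INTEGER, image_path TEXT,
               raw_text TEXT, clean_text TEXT,
               ocr_done INTEGER DEFAULT 0,
               PRIMARY KEY (book, page))"""
    )
    return db


def restructure(raw, running_heads=frozenset()):
    """Drop headers, footers and page numbers; join non-paragraph line breaks."""
    def is_furniture(line):
        return (not line.strip() or bool(PAGE_NUMBER.match(line))
                or line.strip() in running_heads)

    lines = raw.splitlines()
    # Only trim from the top and bottom of the page, where headers live.
    while lines and is_furniture(lines[0]):
        lines.pop(0)
    while lines and is_furniture(lines[-1]):
        lines.pop()

    paragraphs, current = [], ""
    for line in (l.strip() for l in lines):
        if not line:                      # blank line ends the paragraph
            if current:
                paragraphs.append(current)
            current = ""
        elif current.endswith("-") and line[:1].islower():
            current = current[:-1] + line  # rejoin a word split by a hyphen
        else:
            current = f"{current} {line}" if current else line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)


def process_pending(db, ocr, running_heads=frozenset()):
    """OCR every page not yet processed; return how many were handled."""
    rows = db.execute(
        "SELECT book, page, image_path FROM pages WHERE ocr_done = 0"
    ).fetchall()
    for book, page, image_path in rows:
        raw = ocr(image_path)
        db.execute(
            "UPDATE pages SET raw_text = ?, clean_text = ?, ocr_done = 1 "
            "WHERE book = ? AND page = ?",
            (raw, restructure(raw, running_heads), book, page),
        )
    db.commit()
    return len(rows)
```

The capture program would insert a row per page image, and this loop could run on a timer or on a more capable machine than the one attached to the camera. Keeping the raw OCR output next to the cleaned text means the cleanup rules can be changed and re-run later without re-scanning anything.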