Americans are reading more words per day than ever, yet becoming less literate every year. That's because, while we consume more text, the quality and value of that text keep falling. Humanity has gone from a world where only the highly educated could read, to one where only the highly educated could write and distribute, to one where millions of people read content from the Hawk Tuah girl.
Enter generative AI. As someone who has spent my entire career in the AI space, including a PhD in large neural networks completed back in 2010, I am generally skeptical of new AI fads. However, generative AI (and forward models more generally) is extremely powerful, and I am convinced that significant value will be unlocked by these models over the next five to ten years.
For much of our civilization, we have operated in information scarcity. Having the most popular books of the past 500 years available at the press of a button was a pipe dream for academics. It was so out of reach that it wasn't even a dream for most. Now it's available to all, but we face an attention scarcity instead. Alice in Wonderland simply can't compete with Hawk Tuah girl, who has a team of marketing professionals and an army of psychological weapons, not the least of which is online advertising: a tool that has minted at least two of the Fortune 5 companies. Can we use generative AI to level the playing field? Can we make the information-rich books of our history pop and capture us the way a well-curated YouTube thumbnail can?
This is the hypothesis that Scott Wilcox ( https://hereandtomorrow.com/ ) and I set out to prove, and it was an incredible journey that cemented a new friendship and resulted in the hands-free illustration of over 30 books.
Getting started
Fortunately, epub files are actually just zip files with a specification that guarantees the presence and schema of certain files inside the zip, namely the OEBPS/content.opf file, an XML manifest. After getting some public domain books from Project Gutenberg (PG), we hacked the epub to strip out the PG headers and replace them with a UUID, which allowed us to have multiple copies of the same book in our epub readers without conflict.
import zipfile

with zipfile.ZipFile(output_book_path, 'w') as new_book_zip:
    with zipfile.ZipFile(input_book_path) as original_book_zip:
        # Tag the identifier and title so multiple illustrated copies of the
        # same book can coexist in an epub reader without conflict.
        manifest = original_book_zip.read("OEBPS/content.opf").decode('utf-8')
        manifest = manifest.replace(
            "</dc:identifier>", f"/{model_details}/{run_id}</dc:identifier>")
        manifest = manifest.replace(
            "</dc:title>", f" (Version {model_details}/{run_id})</dc:title>")
        for file in original_book_zip.namelist():
            if file == "OEBPS/content.opf":
                # Already handled; the updated manifest is written last.
                continue
            if file.endswith('.xhtml'):
                with original_book_zip.open(file) as myfile:
                    xhtml_contents = myfile.read()
                # Illustrate this chapter and register any new images
                # in the manifest.
                manifest, new_xhtml = process_xhtml(
                    new_book_zip, xhtml_contents, manifest,
                    text_model, image_model)
                new_book_zip.writestr(file, new_xhtml)
            else:
                # Copy everything else (CSS, fonts, existing images) verbatim.
                with original_book_zip.open(file) as myfile:
                    new_book_zip.writestr(file, myfile.read())
        new_book_zip.writestr("OEBPS/content.opf", manifest)
With that out of the way, we set about creating classes for text-to-image and chat, a caching layer so we can rerun the same book without waiting for the same images to render, and html parsing to extract the content from the book and inject the images:
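The caching layer can be sketched as a small disk cache keyed by a hash of the prompt and seed, so a rerun of the same book skips any image that was already rendered. The class and directory names here are illustrative, not our exact implementation:

```python
import hashlib
import os

class ImageCache:
    """Disk cache keyed by a hash of (prompt, seed), so rerunning a book
    does not re-render images it has already produced. Illustrative
    sketch; the real class wrapped our text-to-image model directly."""

    def __init__(self, cache_dir="image_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, prompt, seed):
        key = hashlib.sha256(f"{prompt}|{seed}".encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{key}.png")

    def get(self, prompt, seed):
        # Return cached bytes if this exact prompt/seed was rendered before.
        path = self._path(prompt, seed)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        return None

    def put(self, prompt, seed, image_bytes):
        with open(self._path(prompt, seed), "wb") as f:
            f.write(image_bytes)
```

The generation wrapper simply checks `get` before calling the model and calls `put` afterward.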
One major challenge that we found up-front was that, sometimes, the images were garish or completely out of place. Here’s an example of a kitten with a foot for a head:
Here’s a man with two extra pairs of legs:
The way we ended up solving this was through rejection sampling. We passed each newly created image to a vision-language model (VLM) and asked a series of questions: Is there a person with the wrong number of legs? Is there a person with an animal head? Is there excessive blood or gore? Was this image inspired by Cthulhu or some other Deep One? You get the idea. Images that didn't pass inspection were thrown out and regenerated with a different random seed until we either got a passing image or gave up.
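The rejection-sampling loop looks roughly like the sketch below. `render` and `inspect` are stand-ins for the image model and the VLM; the question list is a shortened, hypothetical version of ours:

```python
import random

# Yes/no flaw checks posed to the VLM (illustrative subset).
VALIDATION_QUESTIONS = [
    "Is there a person with the wrong number of limbs?",
    "Is there a person with an animal head?",
    "Is there excessive blood or gore?",
]

def generate_with_rejection(prompt, render, inspect, max_attempts=5):
    """Rejection sampling: re-render with fresh random seeds until the
    VLM inspector finds no flaw, or give up after max_attempts.

    render(prompt, seed) -> image; inspect(image, question) -> True if
    the flaw the question describes is present."""
    for _ in range(max_attempts):
        seed = random.randrange(2**32)
        image = render(prompt, seed)
        if not any(inspect(image, q) for q in VALIDATION_QUESTIONS):
            return image  # passed every check
    return None  # caller can fall back to leaving the passage unillustrated
```

Giving up cleanly matters: a few passages reliably produce flawed images, and skipping them beats shipping a kitten with a foot for a head.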
Another problem was the theme. Here are some pictures from the same illustrated copy of Around the World in 80 Days:
The protagonist changed from an 1800s gentleman (first) to a 1950s businessman (second) to a disheveled dock worker (third). The protagonist's assistant changed races and genders throughout the book. The general art style and theme shifted massively from image to image. This was completely immersion-breaking.
Our solution was to recursively summarize the book and characters. We accumulate summaries of each section and then summarize those into shorter summaries until we build a summarization pyramid. The top of this pyramid is included with each chunk of text to ground the characters and the theme.
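The pyramid construction can be sketched as repeated fan-in summarization; `summarize` is a stand-in for a chat-model call, and the fan-in of 4 is a hypothetical choice:

```python
def build_summary_pyramid(chunks, summarize, fan_in=4):
    """Recursively summarize: each level condenses groups of `fan_in`
    summaries from the level below, until a single top-level summary
    remains. summarize(list_of_texts) -> str is a chat-model stand-in."""
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        merged = [summarize(prev[i:i + fan_in])
                  for i in range(0, len(prev), fan_in)]
        levels.append(merged)
    return levels  # levels[-1][0] is the book-wide grounding summary
```

The top of the pyramid is then prepended to every chunk's image prompt so characters and theme stay grounded across the whole book.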
For some attributes, such as the era, we kept a running record as we discovered it. Thus, even if a chunk never revealed the time period it was set in, we could infer it from previous chunks of text.
The covers were created from the pyramid of summaries and had their own set of VLM validators.
I could write for hours about all the technically clever things we did, but let’s put a bookmark in it and move to: Reading!
Making the classics fun again
Personally, the added illustrations and covers made these books an entirely new experience. I'm genuinely intrigued to see what the AI imagines from the text and how it enhances my own imagination of the story. Building this has permanently renewed my interest in reading fiction, making it an extremely satisfying personal project.
Is it a great business? No. Sadly, the total addressable market for classic books simply isn't there to justify this as a sustainable business. But I look at it as a community project, similar to spending a few months' worth of weekends cleaning up one's local river or park.
Here are some GIFs showing the evolution of our system over time on the same section of the book:
Conclusion
The Good: The books are delightful, and the epub format was relatively simple to hack.
The Bad: After buying an expensive GPU, I still paid over $800 out of pocket to do all the image generation in parallel. Even a powerful GPU is too slow compared to letting someone with a server farm go wild.
The Ugly: Many of the misshapen people and chimeras haunt me when I get migraines. You can’t un-see Alice in Wonderland with two horse butts instead of a head.