How the Book Genome Project Works:
“The Book Genome Project is an objective, computer-based analysis of the written word, applied evenly across tens of thousands of published books.”
The Book Genome Project is so fundamentally different from what most readers are used to that it’s easy to be confused about how BookLamp and the Book Genome Project work. I’m hoping to clarify that a bit here.
To start, BookLamp does not categorize or label books, as you would expect with genre or BISAC codes, nor do we do so through human or community tagging. Instead we do the exact opposite: we ignore genre and super-classifications and pay attention only to the page-by-page components that the author combined to make up the book. We don’t look at what category the book is in, but at the DNA elements that are in the book, and at how those elements make one book similar to another regardless of what shelf it sits on in the library or bookstore.
The Librarian With Perfect Recall:
“… even the best of us [humans] will have a hard time perfectly recalling what happened on page 37, paragraph 4 of the book we read 3,749 books ago.”
In a perfect world, a skilled and trained human would be able to do this. That becomes an issue, though, because we’re not only interested in rating just a single book, but every word in every scene in every chapter in every book we can get our hands on.
This quickly becomes a problem for a human; even the best of us will have a hard time perfectly recalling what happened on page 37, paragraph 4 of the book we read 3,749 books ago. And when humans try to do this collectively, the simple truth is that you need many, many people working together to get an accurate picture of all the books available for discovery. By many, many people, I’m not talking about 5 or 6 million people over a few years; I’m talking about hundreds of millions of people month in and month out – a number that simply isn’t feasible. Without that many people, some books – like Harry Potter – get lots of ratings, while most books get virtually none and disappear into the Social Void, which is what we call that space in social networks where invisible books go to be lonely.
Consequently, when we say that a book has 65% Vampires, it’s because the computer is telling us, with a great deal of Apples-to-Apples information, that the book has more Vampires in it than 65% of the other books in our corpus that also contain Vampires. And if you start layering that information – such as knowing that one book has 65% Vampires and 15% Forests, compared to another with 63% Vampires and 15% Urban Environments – you get a sense of why this information is valuable when comparing titles.
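To make the "65% Vampires" idea concrete, here is a minimal sketch of percentile ranking. The function name, the raw theme scores, and the tiny corpus are all hypothetical, invented for illustration; this is not BookLamp's actual code or data.

```python
def percentile_rank(raw_score, corpus_scores):
    """Fraction of theme-bearing books in the corpus that score below raw_score."""
    # Only compare against books that contain the theme at all (score > 0).
    others = [s for s in corpus_scores if s > 0]
    below = sum(1 for s in others if s < raw_score)
    return below / len(others)

# Hypothetical raw "Vampires" measurements for a tiny seven-book corpus;
# the last two books contain no vampires and are excluded from the ranking.
corpus = [0.10, 0.25, 0.40, 0.55, 0.70, 0.0, 0.0]

# A book scoring 0.55 out-ranks 3 of the 5 vampire-bearing books: 0.6
print(percentile_rank(0.55, corpus))
```

The point of the percentile framing is that raw theme measurements need not be comparable across books of different lengths; ranking a book against the rest of the corpus is what makes the numbers Apples-to-Apples.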
Jurassic Park as an Example:
Even if you had hundreds of millions of trained human readers, there are some things that are simply impossible for humans to do accurately at scale. As an example, let’s look at a writing-style graph of the book Jurassic Park, by Michael Crichton.
Both the movie and the book of Jurassic Park focus on technology, DNA, and the security systems of the island. In other words, a good portion of the book is spent talking about the science behind cloning and the wonders of the park itself. Then, about 43% of the way into the book, the power gets turned off to the fences, the dinosaurs get out, and people get eaten. The book shifts into an action-adventure novel.
The graph above maps the Pacing and Density writing style variables from the beginning to the end of Jurassic Park. It’s sort of like a writing style timeline for the book. What you see is that at the start of the novel, as Michael Crichton spends much of his time focusing on these technologies, the Density and Pacing scores stay near each other. Then, right about the point the dinosaurs escape, the Pacing goes up, the Density falls, and the book becomes easier to read.
This is an example of the author choosing to change their style of writing to match the contents of the story. So, if you do start reading Jurassic Park, don’t stop reading until at least about 45% of the way through the book, because the action really picks up.
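The timeline idea above can be sketched in code. This is an illustrative toy, not BookLamp's actual method: I'm standing in for "Density" with average word length and for "Pacing" with sentences per word (shorter sentences reading as faster), and the sample text is invented. The real variables are far more sophisticated, but the shape of the computation – split the book into chunks, score each chunk, watch the curves diverge – is the same.

```python
import re

def style_timeline(text, n_chunks=10):
    """Score crude Density and Pacing proxies for successive chunks of a text."""
    words = text.split()
    size = max(1, len(words) // n_chunks)
    timeline = []
    for i in range(0, len(words), size):
        chunk_words = words[i:i + size]
        chunk = " ".join(chunk_words)
        # Density proxy: average word length in the chunk.
        density = sum(len(w) for w in chunk_words) / len(chunk_words)
        # Pacing proxy: sentences per word (more, shorter sentences = faster).
        sentences = [s for s in re.split(r"[.!?]+", chunk) if s.strip()]
        pacing = len(sentences) / len(chunk_words)
        timeline.append((round(density, 2), round(pacing, 2)))
    return timeline

# Invented sample: technical exposition up front, terse action afterward.
sample = ("The recombinant DNA sequencing apparatus hummed. " * 5 +
          "He ran. It roared. They fled fast. " * 5)
for density, pacing in style_timeline(sample, n_chunks=4):
    print(density, pacing)
```

Run on the sample, the density values fall and the pacing values rise as the text shifts from exposition to action, which is the same divergence the Jurassic Park graph shows around the 43% mark.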
It is possible for a human to create this level of detail for a single book, but doing so for 1,000,000 books – and doing so in a way that compares Apples to Apples across all of them – is practically impossible. Considering that more than 300,000 books are published each year, this is a big problem for the future of book discovery.
Discovery in the Next Generation:
This level of granularity is important. The Book Genome Project is not really concerned with whether a book is highly Dense in absolute terms, but with whether it is more or less dense than the other books around it. Knowing that allows you to say, “Book A is similar in Density to Book B.” You can’t create an objective map of where one book fits compared to the others unless you have a consistent understanding of every page in every book.
Because much of our work is in the publishing industry, where the discussion tends to revolve around metadata, we refer to what we do as Multidimensional Metadata. Let me define that for you:
Multidimensional Metadata is:
“Any metadata that is a generational leap in DEPTH and SCOPE beyond the capabilities of a publisher to assign manually, or the crowd to describe effectively.”
In practice, by “depth” we mean that you have to pay attention to information beyond the surface of the book – data found equally on page 2, 3, 4, and 250. We look DEEP into the book, from beginning to end. By “scope” we mean that we have equal data across the entire scope of the corpus. Social networks tend to have lots of data on the really popular books, and insufficient data on the vast majority of books on the market. We collect the same data on every book in our database, regardless of how popular or well known it is. Because it’s a computer-based analysis, the site is equally effective with a single user as it is with millions. We like to say, “Our books introduce each other.” This is true. The content of the books themselves is the connecting thread between them.
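One consequence of equal data across the corpus is that "our books introduce each other" can be literal: with the same measurements taken for every book, similarity becomes a comparison of theme vectors. The sketch below uses simple Euclidean distance over hypothetical percentile scores (the theme names and numbers are invented for illustration, echoing the Vampires/Forests example earlier, and are not BookLamp's actual metric).

```python
import math

def distance(book_a, book_b):
    """Euclidean distance between two books' theme-score dictionaries."""
    themes = set(book_a) | set(book_b)
    # A theme absent from a book counts as a score of 0.0.
    return math.sqrt(sum((book_a.get(t, 0.0) - book_b.get(t, 0.0)) ** 2
                         for t in themes))

book_1 = {"Vampires": 0.65, "Forests": 0.15}
book_2 = {"Vampires": 0.63, "Urban Environments": 0.15}
book_3 = {"Cooking": 0.80, "Romance": 0.40}

# book_1 sits closer to book_2 than to book_3, regardless of genre shelf.
print(distance(book_1, book_2) < distance(book_1, book_3))  # True
```

Because the same themes are measured for every title, this comparison works just as well for an obscure book as for a bestseller – no crowd of raters is needed for the connection to exist.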
In future articles on our blog, we’ll talk more about what we call the Social Void and the Glass Castle, ways of describing the content hole most people are not even aware exist. We’ll talk about the reasons that the future of book discovery lies in the combination of both human powered information AND the content based approach used by the Book Genome Project. But for now, if you have additional questions or comments, please check out our FAQ, or feel free to contact us.
In the meantime, best of luck,
Aaron Stanton – Founder and CEO