Book Data Difficulties
The most comprehensive and influential book sales data in the industry is basically inaccessible to anyone beyond the publishing industry.
As a Romance Book Data Analyst, I quickly learned that my main challenge would be . . . finding book data!
As data scientist and literary scholar Melanie Walsh explains in her essay, Where is All the Book Data?, most book sales data is "proprietary and purposefully locked away."
BookScan
The most comprehensive and influential book sales data in the industry—an exclusive subscription service called BookScan—is basically inaccessible to anyone beyond the publishing industry.
And this top source of book sales data still has significant gaps. According to Walsh: "BookScan numbers, despite their significance and hold on the marketplace, are not completely accurate. BookScan claims to capture 85% of physical book purchases from retailers (including Amazon, Walmart, Target, and independent bookstores) and 80% of top ebook sales."
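To get a feel for what that gap could mean in practice, here is a minimal sketch in Python that scales a reported figure by the claimed coverage rate. The function name and the 8,500-copy figure are hypothetical, and the calculation assumes coverage is uniform across titles, which is unlikely to hold for any individual book. Treat it as back-of-the-envelope arithmetic, not as a description of how BookScan itself works.

```python
def estimate_total_sales(reported_units: int, coverage: float) -> float:
    """Scale a reported unit count up by an assumed coverage rate.

    coverage is the fraction of real purchases the data source captures,
    e.g. 0.85 for the 85% of physical book purchases quoted above.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be greater than 0 and at most 1")
    return reported_units / coverage


# Hypothetical example: a title shows 8,500 print copies in the reported data.
reported = 8_500
estimated = estimate_total_sales(reported, coverage=0.85)
unreported = estimated - reported

print(f"Estimated total copies: {estimated:,.0f}")
print(f"Copies missing from the reported data: {unreported:,.0f}")
```

Even under this generous uniform-coverage assumption, a noticeable share of sales never shows up in the reported number.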
Read more
Where is All the Book Data? (Public Books)
Bestseller Lists
Another potential source of book data is bestseller lists, but I soon learned these are also fraught with issues. In 2022, Esquire published an article outlining many of the problems with The New York Times Best Seller list: "No one outside The New York Times knows exactly how its best sellers are calculated—and the list of theories is longer than the actual list of best sellers."
In Bestseller Lists are Broken, Kathleen Schmidt explains that bestseller lists are disappearing and the ones that still exist aren't equitable. For example, the Wall Street Journal dropped its bestseller list in 2023 even though it actually "used data from Bookscan to rank books, making it one of the most accurate consumer-facing lists."
Read more
The Murky Path To Becoming a New York Times Best Seller (Esquire)
Bestseller Lists are Broken (Publishing Confidential)
Amazon
I thought Amazon might be another possible source of book data: maybe not sales figures, but at least a listing of how many books in a certain genre were published each year. After all, Amazon has an Advanced Search tool that lets you search by subject and publication date. However, when I tried using it to find lists of romance books by year, the results were so incomplete, inconsistent, and messy that I eventually gave up. Amazon also has a separate Kindle Store, which further complicates search results (not to mention that the Kindle Store's own Advanced Search tool appeared to be broken and wouldn't return results by publication date at the time this post was published).
As Christine M. Larson explains in her book, Love in the Time of Self-Publishing:
"Authors may be interested in making their books easy to find, but Amazon is also interested in selling advertising, so it benefits from the clutter. Because Amazon has its own e-book imprints, including the Montlake romance imprint, indie authors and publishers are competing against Amazon—which holds all sales data about all books from all publishers on its site, a huge marketing advantage."
One example of this clutter and messiness on Amazon is the overwhelming amount of duplication and repetition when it comes to book genres and subgenres. While recently analyzing a Paranormal Romance bestseller list, I came across the following overlapping list of book categories (see image below).
Monetization of Data
One of the factors behind this lack of access is the monetization of data.
According to a report from SNS Insider, a market research and consulting agency, the data monetization market is expected to reach USD 17.33 billion by 2032, growing at a rate of almost 20%.
Their news release for the report summarizes it succinctly: "As organizations generate massive volumes of data, there is a significant push to leverage this data for revenue-generating opportunities. Data monetization enables businesses to optimize data assets, generate insights, and create new streams of income."
With so little book sales data available, I thought I could analyze social media data related to books and authors instead. I remember using free tools ten years ago to track hashtags and keywords, for example. However, I quickly realized that this type of data is also essentially "locked away," in this case behind expensive social media management tools that generally only large companies can afford.
Conclusion
As a result of these limitations, I have to be more creative and resourceful when it comes to finding data sources for romance books. I'm hopeful that initiatives like the Post45 Data Collective (mentioned in the Where is All the Book Data? essay), an open-access site that peer-reviews and publishes literary and cultural data, will continue to grow and provide more data to help us better understand the publishing industry.