Scraping and Exploring the Entire English Audible Catalog

Toby Manders, Jul 2

Last week I wrote a script using the Requests-HTML package for Python to scrape information from Audible about programs included in their current 2-for-1 credit sale.
After experimenting with threading hyperparameters I managed to get my script’s performance up to about 55 entries per second, which meant Audible’s entire English-language catalog of ~400k programs could be scraped in about two hours.
A couple hours later I had a complete dataset.
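The real code lives in the scraping notebook. As a rough illustration only, here is a minimal sketch of fanning page fetches out over a thread pool, along with the back-of-the-envelope time estimate above. The names `scrape_all`, `fake_fetch`, and `estimate_hours` are hypothetical, the worker count is arbitrary, and the fetch function is a stand-in rather than a real request to Audible.

```python
from concurrent.futures import ThreadPoolExecutor


def estimate_hours(n_entries, entries_per_second):
    """Back-of-the-envelope scrape time in hours."""
    return n_entries / entries_per_second / 3600


def scrape_all(urls, fetch_page, max_workers=16):
    """Apply fetch_page to every URL using a pool of worker threads.

    executor.map returns results in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_page, urls))


def fake_fetch(url):
    # Stand-in (assumption) for a real HTTP request plus HTML parsing.
    return {"url": url, "title": "placeholder"}


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    pages = scrape_all(urls, fake_fetch)
    print(len(pages))
    print(round(estimate_hours(400_000, 55), 2))
```

At 55 entries per second, the estimate for ~400k programs comes out to roughly two hours, which matches the figure quoted above.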
(Note: If you like code, you can jump straight to the scraping notebook and the analysis notebook. You can also download the entire dataset.)

After loading and processing the data to get rid of duplicates and convert object types, I had 352,686 entries with the following fields: title, author, category (genre), length, narrator, price, rating (rounded at the source to increments of 0.5), rating count, release date, and URL.

# What’s the total length of audio represented in our data?
230 years, 235 days, 15 hours

# Which are the largest categories by hours and ratings?
Interestingly, the top three categories by total length of offerings are the reverse of the top three by number of ratings.
In other words, the engagement (i.e., number of reviews) per hour of content is much higher for ‘Sci-Fi & Fantasy’ programs than for ‘Fiction’.
Let’s see if we can quantify further.
Since the number of reviews per hour may be a less fair metric than the number of reviews per title (we assume some genres have longer titles than others, on average), let’s adopt the latter.
Let’s see if the engagement trend continues with this alternate metric.
Categories sorted from left to right by mean number of reviews.
A couple of interesting findings here: First, ‘Sci-Fi & Fantasy’ continues to have the highest engagement by average number of reviews per title.
Yet now it loses to ‘Romance’ when considering the median number of reviews rather than the mean (22 vs.
Furthermore, the distribution of ‘Romance’ reviews is tighter than that of any other category here.
These findings, taken together, suggest that ‘Romance’ has a more bimodal distribution than other categories.
We should expect that a larger percentage of titles have few or no reviews, bringing the mean down.
However, the titles that are popular are getting a lot of reviews each, bringing the median up.
In other words, ‘Romance’ listeners are less adventurous in their program selection than some other genre listeners.
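Since this comparison leans on the gap between the mean and the median, a toy example may help. The review counts below are made up for illustration and are not taken from the dataset. The sketch only shows how a single runaway hit can pull one category's mean far above another's, even when its median is lower.

```python
from statistics import mean, median

# Hypothetical review counts per title (not real Audible data).
steady_category = [15, 18, 20, 22, 25, 30, 35]      # no runaway hits
blockbuster_category = [0, 1, 2, 5, 8, 12, 5000]    # one runaway hit

for name, counts in [("steady", steady_category),
                     ("blockbuster", blockbuster_category)]:
    print(f"{name}: mean={mean(counts):.1f}, median={median(counts)}")
```

Here the "steady" list has the higher median while the "blockbuster" list has the much higher mean. That is the same pattern as ‘Romance’ leading on the median and ‘Sci-Fi & Fantasy’ leading on the mean.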

# How are ratings distributed across titles?
Note: that first plot is not a mistake.
Plotting without nonlinear axes means that the number of titles with very few reviews dominates the plot.
In fact, more than 200k of our ~350k programs have fewer than 10 total reviews.
The second plot, which has a linear x-axis and a log-scale y-axis, shows how few titles have more than 100,000 reviews: 7, to be exact.
The third plot shows us that the log of the rating count varies roughly linearly with the log of the frequency of that rating count.
In other words, it’s lonely at the top.
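A roughly straight line on a log-log plot is what a power-law relationship looks like. As a hedged sketch of how one might check this, the snippet below fits a least-squares line to log-transformed points. The data are synthetic (generated from an exact power law with exponent -2, chosen arbitrarily), and `loglog_slope` is a hypothetical helper, not something from the analysis notebook.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


# Synthetic example (assumption): frequency = 1e6 * rating_count ** -2
rating_counts = [1, 10, 100, 1000]
frequencies = [1e6 * c ** -2 for c in rating_counts]
slope = loglog_slope(rating_counts, frequencies)
print(round(slope, 3))
```

On real data the points would scatter around the line, so the fitted slope is only an estimate of the exponent.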

# What are those rarified few books with >100k ratings?
Ready Player One dominates the field in number of ratings.
Some things to note: All of these titles are significantly more expensive than the $15.42 mean price of the entire Audible catalog.
The categories represented are more diverse than we might expect: two science fiction, one fantasy, one thriller, two autobiographical, and one self-help title.
All but one title were released in the past decade.
Ready Player One has an enormous number of reviews.
In fact, that one title has more reviews than 174,000 other programs combined.

# Which authors have the greatest number of ratings overall?
Stephen King wins for the total number of reviews across his books, with more than 600,000 of them.
Impressively, his title with the greatest number of ratings has only 51,695.
His place at the top of the chart here is a testament to how prolific he’s been.
He has 130 separate recordings on Audible not screened out by our duplicate search.
In the second plot we consider the average number of reviews per program for authors with more than two titles on Audible (in order to screen ‘one-hit wonders’ and their samples/alternate versions).
Andy Weir takes the cake, buoyed by The Martian, which was the number two title overall.
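The per-author average described above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's code: `mean_ratings_per_author` is a hypothetical helper, the rows are invented, and "more than two titles" is expressed as a minimum of three.

```python
from collections import defaultdict


def mean_ratings_per_author(rows, min_titles=3):
    """Average rating count per title, for authors with at least min_titles titles.

    rows is an iterable of (author, rating_count) pairs, one per title.
    """
    counts_by_author = defaultdict(list)
    for author, rating_count in rows:
        counts_by_author[author].append(rating_count)
    return {
        author: sum(counts) / len(counts)
        for author, counts in counts_by_author.items()
        if len(counts) >= min_titles
    }


# Made-up rows for illustration only.
rows = [
    ("Author A", 900), ("Author A", 600), ("Author A", 300),
    ("Author B", 50_000),  # single title, screened out
    ("Author C", 10), ("Author C", 20), ("Author C", 30), ("Author C", 40),
]
averages = mean_ratings_per_author(rows)
top_author = max(averages, key=averages.get)
print(top_author, averages[top_author])
```

The threshold matters: without it, an author with a single heavily rated title would top the chart.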

# Which authors have the greatest number of titles and recorded hours?
Continuing to investigate the most prolific authors in our dataset, we look at the total duration of audio and the number of recordings (excluding periodicals).
Charles Dickens dominates the total duration battle with nearly 5,000 hours on Audible.
He and Arthur Conan Doyle, number two, seem to both be benefitting from the many renditions of their works on Audible.
A Tale of Two Cities alone has 37 different versions!
James Patterson doesn’t enjoy that advantage, yet he wins among contemporary authors.
Slicing the cake slightly differently, Doyle and Dickens swap places for number of recordings on Audible.
Arthur Conan Doyle has 706 versions of his works on Audible in total.

# Which narrators have the greatest number of ratings?
The titles Scott Brick has narrated (658 of them!) have racked up the greatest number of ratings by nearly a factor of two.
As with the authors above, the plot on the right shows the narrators with the greatest ‘impact’ (by number of ratings per work, at least).
Roy Dotrice (with 10 titles) takes the top spot for his work narrating the A Song of Ice and Fire books, the first of which was on our list of the seven titles to break 100,000 ratings.

# Which narrators have the greatest number of hours on Audible?
We see two sides of the same coin: Scott Brick has the greatest output by number of hours, with 658 titles and an average of 12 hours 39 minutes per title.
Cathy Dobson wins by number of titles, with 1,161 at an average of 2 hours 53 minutes each.

# How does the title release date relate to the number of reviews?
We can see a spike for each of the 7 outliers above.
While we might expect older titles to benefit from having more time to accumulate ratings, we see the opposite trend.
More recent programs tend to garner more reviews.

# How many titles have been released in each quarter since 1996?
Histogram of release dates with one bin per quarter.
Notice that there’s a large bin in the last quarter of 1999.
In fact, the most frequent date in our dataset is 12/16/1999.
Although you might think that date is retroactively applied to all earlier titles, that’s not the case.
The minimum date in the dataset is 12/1/1995 — the same year Audible was founded.
Moreover, a Google search doesn’t turn up much news about Audible from December 1999, although the company was having an eventful year.
Its CEO had died that October.
The other interesting finding here is the three-quarters-long dip in releases during 2015.
In fact, that dip makes 2015 the only year since 2002 in which fewer titles were released than in the previous year.
Side note: the latest date in the dataset is 12/10/2020 for The Poisoner by Sharon Bolton.
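For readers curious how a one-bin-per-quarter histogram can be built, here is a minimal sketch. It assumes release dates are already parsed into `datetime.date` objects. `quarter_of` is a hypothetical helper, and the sample input is just three dates mentioned in this section rather than the full dataset.

```python
from collections import Counter
from datetime import date


def quarter_of(d):
    """Return (year, quarter) for a date, with quarters numbered 1 to 4."""
    return (d.year, (d.month - 1) // 3 + 1)


# Three dates mentioned in the text, used here only as sample input.
release_dates = [date(1995, 12, 1), date(1999, 12, 16), date(2020, 12, 10)]
titles_per_quarter = Counter(quarter_of(d) for d in release_dates)
print(sorted(titles_per_quarter.items()))
```

Counting by `(year, quarter)` keys keeps the bins sortable in chronological order, which is convenient for plotting.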

# What is the distribution of prices over all programs?
The mean price of all books is $15.43, which incidentally is near the ~$15 price per credit for some of the Audible monthly plans.
There are spikes at every $5 interval, as we’d expect given _4.99 and _9.99 pricing schemes.

# How do prices vary by category?
Categories ranked, left to right, by mean price.
‘Mysteries & Thrillers’ has the highest average price at $20.
‘Newspapers & Magazines’ has the lowest average price at $2.
No big surprises here.
Our boxplot also reveals what is likely the most expensive title: the outlier in the ‘Language Instruction’ category.
Let’s find out what it is.

# What is the most expensive program on Audible?
$173.27 buys you the priciest English title on all of Audible: 982 minutes (~16 hours) of Spanish language instruction.

# What is the most expensive program per hour?
The most expensive title per unit length is the 1-minute-long program Masterpiece by Jeff Hathaway, which costs $10.49, or $629.40 per hour.
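The per-hour figure is simple arithmetic: divide the price by the length in minutes and multiply by 60. A small sketch, using only the two prices and lengths quoted above (`price_per_hour` is a hypothetical helper name):

```python
def price_per_hour(price_dollars, length_minutes):
    """Price normalized to one hour of audio."""
    return price_dollars / length_minutes * 60


masterpiece = price_per_hour(10.49, 1)        # the 1-minute program
spanish_course = price_per_hour(173.27, 982)  # the priciest title overall
print(f"{masterpiece:.2f} {spanish_course:.2f}")
```

By this measure the $173.27 language course works out to roughly $10.59 per hour, far below the one-minute title.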

# What’s the distribution of program lengths (excluding periodicals)?
‘Newspapers & Magazines’, ‘Radio & TV’, ‘Live Events’, and ‘Drama & Poetry’ were excluded from this chart as they are dominated by very short programs.
We can see that there are still quite a few short (<5 hour) programs, but there’s a nice bell curve of longer titles with a median right around 7 hours.
We might infer that this length plot is a composite result of combining a long-form distribution with a shorter-form distribution.
Let’s look at categories and see if we can pinpoint which are which.

# What are the length distributions by category?
Categories ordered, left to right, by decreasing mean length.
Indeed, we find disparate length distributions by category.
‘Newspapers & Magazines’ has the shortest mean program length at 28 minutes.
‘Sci-Fi & Fantasy’ has the longest mean length at just over 9 hours.
Right away we see some length outliers in our boxplot, and we’ll explore those in a moment, but first — let’s validate our theory about the composite distribution.

# Finally, what are the longest books on Audible?
df.nlargest(5, 'length')
4 of the 5 longest programs on Audible are in the ‘Religion & Spirituality’ category.
The longest is 154 hours.
The only title outside of that category is Edward Gibbon’s expansive 18th-century history, ‘The Decline and Fall of the Roman Empire,’ which has incidentally led to the decline and fall of my eyelids at night for years.
So there we have it.
We’ve looked at most of our features in turn: author, narrator, rating count, release date, length, price and category.
We looked at the most prolific names in the industry as well as the longest and most expensive titles.
The only feature we haven’t explored quite yet is the ‘rating’ column itself, and that’s because when we were scraping, we were only able to capture gradations of 0.5 from the HTML we had access to.
With the links in hand, we are free to rescrape more granular ratings from the pages for individual programs, but we’ll save that for another day.