The dumps released by Discogs are great for datamining, but they are only published once a month. Because they are also not released on the first day of the month (usually on the 4th), you are at minimum a few days behind, and possibly 30+ days.
The Discogs data can also be accessed through their API, allowing monitoring of releases in a more continuous way: as soon as something is added it can be downloaded and checked.
The JSON coming from the API exposes more information than the XML from the data dump. For example, it contains timestamps for when the release was added and when it was last changed. It also contains some information about who contributed to it, and some elements (such as labels) carry extra information that is not present in the XML. The structure also seems a lot saner than the XML, which I can only process with SAX, because building a DOM of a 32 GiB XML file is just insane.
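For example, the per-release timestamps can be read straight from the parsed JSON. A minimal sketch, assuming `release` is the parsed JSON object of a single release (the field names `date_added` and `date_changed` follow the Discogs API documentation):

```python
def change_timestamps(release):
    """Return when the release was added and last changed (ISO 8601 strings).

    `release` is assumed to be the parsed JSON of one release as
    returned by the Discogs API; neither field exists in the XML dump.
    """
    return release.get("date_added"), release.get("date_changed")
```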
Accessing the Discogs API for the purposes I want to use it for is actually very simple:
- create a security token and use it as described on the Discogs developer webpage
- send requests to the right endpoints and provide the token
- process the JSON results sent by Discogs
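The three steps above can be sketched as follows. The token value is a placeholder, the `User-Agent` string is made up, and the endpoint and `Authorization` header format are taken from the Discogs developer documentation:

```python
import json
import urllib.request

API_BASE = "https://api.discogs.com"
TOKEN = "your-token-here"  # personal access token from the Discogs developer page

def build_request(release_id, token=TOKEN):
    """Build a request for a single release, passing the token in a header."""
    url = f"{API_BASE}/releases/{release_id}"
    return urllib.request.Request(
        url,
        headers={
            # Header format as described on the Discogs developer page
            "Authorization": f"Discogs token={token}",
            # Discogs requires an identifying User-Agent; this name is made up
            "User-Agent": "DiscogsSmellChecker/0.1",
        },
    )

def fetch_release(release_id):
    """Download and parse the JSON for one release."""
    with urllib.request.urlopen(build_request(release_id)) as response:
        return json.load(response)
```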
Checking releases via the API is a lot slower than checking a local (very big) XML file. This is mostly because of the download speed and the rate limiting imposed by the Discogs server. I did some download tests and I could download around 3800 releases in two hours. Doing the math: if you would be able to download 4000 releases in two hours (being optimistic), it would take half a year or longer to download the whole database through the API. If I were to do this, then during the same timeframe Discogs would release 5 or 6 dump files which would be more up to date than the oldest data downloaded through the API.
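Spelled out, the arithmetic looks like this. The total release count is my assumption (the post only mentions that the XML dump is 32 GiB), so adjust it to the actual database size:

```python
# Back-of-the-envelope: how long a full download through the API would take.
RELEASES_TOTAL = 15_000_000   # assumed database size, order of magnitude
RELEASES_PER_HOUR = 2000      # the optimistic rate from above: 4000 per two hours

hours = RELEASES_TOTAL / RELEASES_PER_HOUR
days = hours / 24
print(f"{days:.0f} days")     # → 312 days, i.e. well over half a year
```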
This is why a continuous checker should not be a replacement for scripts that check releases in bulk. Instead, treat them as complementary.
(Of course, you could speed it up with a distributed download mechanism using a few virtual machines somewhere, but for me that's not worth the effort or the investment in time and resources.)
Implementing continuous checking
That being said, the API is very convenient for quickly catching errors. I wrote a script that does the following:
- determine the latest release in Discogs by scraping a webpage that lists releases (as the Discogs API does not allow me to discover the latest release easily, see below)
- download all releases starting from a certain offset, ending with the release number from step 1
- check the releases from step 2 for known smells and report the errors (log file, notices on Linux desktop if enabled)
- write the releases from step 3 to a file for offline analysis and to avoid redownloading the same data
- scrape the website from step 1, sleep for 10 minutes, set the offset to start from to the last release checked in step 3 and use this as the new starting point for step 2
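The steps above can be sketched as a polling loop. This is not the actual script: the scraping, downloading, smell-checking, and storage helpers are injected as parameters, so every name below is a stand-in for real code:

```python
import time

def run_checker(offset, get_latest, download, check, store, notify=None,
                sleep_seconds=600, max_rounds=None):
    """Repeat the five steps; pass max_rounds to stop after a few iterations.

    get_latest() -> latest release number (step 1, scraped from the site)
    download(offset, latest) -> iterable of releases (step 2)
    check(release) -> iterable of smells found in it (step 3)
    store(release) -> save for offline analysis (step 4)
    notify(release, smell) -> optional desktop notification hook
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        latest = get_latest()                     # step 1: scrape the site
        for release in download(offset, latest):  # step 2: download via the API
            for smell in check(release):          # step 3: known smells
                if notify:
                    notify(release, smell)
            store(release)                        # step 4: keep an offline copy
        offset = latest                           # step 5: new starting point
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(sleep_seconds)             # sleep for 10 minutes
    return offset
```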
One of the things I have added to the script is desktop notifications (only tested on a Linux desktop) whenever something wrong is found. This is not recommended if you have work to do, because it is very distracting. That's why I only run this script when I have time to focus on doing things with Discogs.
Possible improvements by Discogs
To make my scripts better there are two things that Discogs could do:
- publish a list of timestamps of when each release was last changed, so I can invalidate earlier results, redownload the releases that have changed since I last checked them, and check them again. Such a list should not cost more than 30 MiB of disk space (gzip compressed).
- add an endpoint to the API that would allow me to find out the number of the latest release with a single call, so I don't have to scrape it from the Discogs website.
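As a rough sanity check on that 30 MiB figure, here is a back-of-the-envelope estimate. Every number is an assumption (database size, bytes per line, compression ratio), but it lands in the same order of magnitude:

```python
# One line of "release_id,timestamp" per release, gzip compressed.
RELEASES = 15_000_000   # assumed database size
BYTES_PER_LINE = 30     # e.g. "12345678,2021-06-01T12:00:00\n"
GZIP_RATIO = 10         # optimistic for such repetitive text

uncompressed_mib = RELEASES * BYTES_PER_LINE / 2**20
compressed_mib = uncompressed_mib / GZIP_RATIO
print(f"{compressed_mib:.0f} MiB")  # → 43 MiB, tens of MiB as claimed
```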