
Continuously checking Discogs for errors (part 1)

The dumps released by Discogs are great for data mining, but they are only published once a month. Because they are also not released on the first day of the month (usually on the 4th), you are at minimum a few days behind, and possibly 30+ days.

The Discogs data can also be accessed through their API, allowing monitoring of releases in a more continuous way: as soon as something is added it can be downloaded and checked.

The JSON coming from the API exposes more information than the XML from the data dump. For example, it contains timestamps for when the release was added and when it was last changed. It also contains some information about who contributed to it, and some elements (like labels) contain a bit more information that is not present in the XML. The structure also seems a lot more sane than the XML, which I can only process with SAX because building a DOM of a 32 GiB XML file is just insane.

Accessing the Discogs API for the purposes I want to use it for is actually very simple:
  1. create a security token and use it as described on the Discogs developer webpage
  2. send requests to the right endpoints and provide the token
  3. process the JSON results sent by Discogs
Using, for example, the Requests library for Python, it only takes a few minutes to create such a script.
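The three steps above can be sketched roughly as follows. This is a minimal illustration, not the script discussed in this post: it assumes the `/releases/{id}` endpoint and the `Authorization: Discogs token=...` header as described in the Discogs developer documentation, and it assumes the timestamps appear in the JSON as `date_added` and `date_changed`. The client name in the `User-Agent` header and the function names are made up for the example. It uses only the standard library so it runs without extra packages; with Requests the fetch would be a single `requests.get(url, headers=headers)` call.

```python
import json
import urllib.request

API_ROOT = "https://api.discogs.com"


def build_request(release_id: int, token: str) -> urllib.request.Request:
    """Step 1 and 2: address the release endpoint and attach the token."""
    url = f"{API_ROOT}/releases/{release_id}"
    headers = {
        # Header format assumed from the Discogs developer documentation.
        "Authorization": f"Discogs token={token}",
        # Hypothetical client name; use something that identifies your script.
        "User-Agent": "ExampleDiscogsChecker/0.1",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_release(release_id: int, token: str) -> dict:
    """Step 3: download one release and parse the JSON body."""
    request = build_request(release_id, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize(release: dict) -> dict:
    """Pick out the timestamps that the XML dump does not contain."""
    return {
        "id": release.get("id"),
        "added": release.get("date_added"),      # assumed field name
        "changed": release.get("date_changed"),  # assumed field name
    }
```

Calling `summarize(fetch_release(1, "YOUR_TOKEN"))` with a valid token should then give a small dictionary with the release number and both timestamps.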

Checking releases via the API is a lot slower than checking a local (very big) XML file. This is mostly because of the download speed and the rate limiting imposed by the Discogs server. I did some download tests and could download around 3800 releases in two hours. Doing the math: even if you could download 4000 releases in two hours (being optimistic), it would take half a year or longer to download the whole database through the API. If I were to do this, then during the same time frame Discogs would release 5 or 6 dump files, which would be more up to date than the oldest data downloaded through the API.
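The estimate above is simple arithmetic that can be redone for other download rates. The sketch below is only an illustration: the database size passed in is a placeholder chosen for the example, not a figure taken from Discogs or from my tests.

```python
def days_to_download(total_releases: int, releases_per_two_hours: int) -> float:
    """Estimate how many days a full download takes at a constant rate."""
    releases_per_day = releases_per_two_hours * 12  # twelve two-hour slots per day
    return total_releases / releases_per_day


# Placeholder database size (an assumption for illustration only).
assumed_total = 9_600_000

print(days_to_download(assumed_total, 4000))  # optimistic rate
print(days_to_download(assumed_total, 3800))  # rate measured in my tests
```

At 4000 releases per two hours that is 48,000 releases per day, so any database with more than roughly nine million releases takes more than half a year.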

This is why a continuous checker should not be a replacement for scripts that check releases in bulk. Instead, treat the two as complementary.

(Of course, you could speed it up by using a distributed download mechanism using a few virtual machines somewhere but for me that's not worth the effort or investment in time and resources.)

Implementing continuous checking

That being said, the API is very convenient for catching errors quickly. I wrote a script that does the following:
  1. determine the latest release in Discogs by scraping a webpage that lists releases (as the Discogs API does not allow me to discover the latest release easily, see below).
  2. download all releases starting from a certain offset, ending with the release number from step 1
  3. check the releases from step 2 for known smells and report the errors (log file, notices on Linux desktop if enabled)
  4. write the releases from step 3 to a file for offline analysis and to avoid downloading the same data again
  5. scrape the webpage from step 1 again, sleep for 10 minutes, set the starting offset to the last release checked in step 3 and repeat from step 2
This allows me to stay reasonably up to date with what is going on with the new releases. What I noticed is that the most common problem with the new releases is one I already wrote about before (part 1, part 2).
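The loop described in the steps above could be structured roughly like this. It is a simplified sketch, not my actual script: `latest_release_id`, `fetch_release` and `find_smells` are hypothetical callables that you would have to supply yourself (the scraper, the API download and the checks), and the sketch resumes at the release after the last one checked.

```python
import json
import time
from typing import Callable, Iterable, Optional


def check_range(
    first_id: int,
    last_id: int,
    fetch_release: Callable[[int], Optional[dict]],
    find_smells: Callable[[dict], Iterable[str]],
    store: Callable[[dict], None],
) -> int:
    """Download, check and store releases first_id..last_id (steps 2 to 4).

    Returns the release number to start from in the next round.
    """
    for release_id in range(first_id, last_id + 1):
        release = fetch_release(release_id)
        if release is None:  # release not available, skip it
            continue
        for smell in find_smells(release):
            print(f"release {release_id}: {smell}")
        store(release)
    # Never move the offset backwards if the scraped number was too low.
    return max(first_id, last_id + 1)


def run_forever(start_id, latest_release_id, fetch_release, find_smells,
                path="releases.jsonl", pause_seconds=600):
    """Steps 1 and 5: find the latest release, check, sleep, repeat."""
    offset = start_id
    while True:
        latest = latest_release_id()
        with open(path, "a", encoding="utf-8") as out:
            offset = check_range(
                offset, latest, fetch_release, find_smells,
                lambda release: out.write(json.dumps(release) + "\n"),
            )
        time.sleep(pause_seconds)  # 10 minutes by default
```

Keeping the download and check functions as parameters makes it possible to test the loop with fake data before pointing it at the real API.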

One of the things I have added to the script is the use of notifications (only tested on a Linux desktop) when errors are found. This is not recommended if you have work to do, because it is very distracting. That's why I only run this script whenever I have time to focus on doing things with Discogs.

Possible improvements by Discogs

To make my scripts better there are two things that Discogs could do:
  1. publish a list of timestamps of when each release was last changed, so I can invalidate earlier results, redownload the releases that have changed since I last checked them and check them again. Such a list should not take more than 30 MiB of disk space (gzip compressed).
  2. add an endpoint to the API that would allow me to find out the number of the latest release with a single call, so I don't have to scrape it from the Discogs website.

Caveats

There are a few things that you need to keep in mind, like rate limiting and authentication. You can easily see in my script how I did this. The script itself has been released under GPL3, but if you get inspired by it and make your own reimplementation then pick whatever license you feel comfortable with (as I am not doing anything special anyway in that script).

Authentication

Some functionality will only be available if you have authenticated. For example, if you have not authenticated, the URLs for images will be empty. Also, rate limiting will be much stricter and you won't be able to download as much data from the database if you have not authenticated.

Rate limiting

If you hammer the Discogs API too hard, then at some point it will stop sending data and instead send you the HTTP 429 response code ("Too Many Requests"). By checking what Discogs returns (such as the headers indicating how many requests can be made in a certain time frame) and acting upon it (by pausing your script for a few seconds if you are about to hit the limits), you can more easily avoid this problem.
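A small helper for this could look like the sketch below. It assumes that the number of remaining requests is reported in the `X-Discogs-Ratelimit-Remaining` response header, as described in the Discogs developer documentation; the thresholds and pause lengths are arbitrary values picked for the example, not the ones from my script.

```python
from typing import Mapping


def seconds_to_pause(
    status_code: int,
    headers: Mapping[str, str],
    min_remaining: int = 2,
    short_pause: float = 5.0,
    long_pause: float = 60.0,
) -> float:
    """Decide how long to wait before sending the next request."""
    if status_code == 429:
        # Already over the limit: back off for longer.
        return long_pause
    # Header name assumed from the Discogs developer documentation.
    remaining = headers.get("X-Discogs-Ratelimit-Remaining")
    if remaining is not None and remaining.isdigit():
        if int(remaining) <= min_remaining:
            # About to hit the limit: pause for a few seconds.
            return short_pause
    return 0.0
```

After each response you would call `time.sleep(seconds_to_pause(status, headers))` before sending the next request.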
