Skip to main content

Finding differences between monthly Discogs data dumps (part 1)

One cool thing about Discogs is that you can see the whole history of the data, including who edited when and what was changed. For a software engineer like me that is very convenient, as I am used to working with version control.

Unfortunately this data is missing from the Discogs datadump, so I can not easily see which data has changed compared to a previous month. This makes it a bit more difficult to for example see if things are being changed incorrectly, or spot other problems. Having a field with a timestamp of the last change to the release would be great to have. I suggested it to Discogs, but I am not sure if they will add it to the XML, or maybe they will save it for a next update of the format.

In any case, I wanted to be able to see the differences between months now, so I came up with an awful hack, but one that would also make it possible for me to do some very crude filtering, as well as find differences between any two months, and not just consecutive months, at the exchange of a lot of diskspace.

Splitting the XML

The first task is to break down the XML data in chunks so the individual releases can be easily accessed. For that I used the xml_split tool, which for each release spits out a separate file containing the XML for that one particular release. I invoke it like this:

$ zcat discogs_20170901_releases.xml.gz | xml_split

which writes the files to the current directory, so you might want to make sure that you have a separate directory and change to that directory first. Also be aware that you need enough diskspace and you should count on something like 40 GiB per month (to be on the safe side).

Splitting the XML file is a very I/O intensive task, so that is something that you want to optimize for. I tried on a system with a regular disk and had to stop it after running for a long time. That same system also has SSD and then it still took 142 minutes to process, so SSD is an absolute must. Also, as the program will be spitting out many files it could happen that you run out of inodes if you are already low on inodes (it happened to me), so make sure that your disk has been properly prepared.
SSD is a "must have" when splitting the XML data.
In case you use Linux: make sure your system is up to date so you can take advantage of performance fixes in the latest Linux kernels.

The cool thing is that now I actually can process chunks of XML in parallel, and also much easier use a DOM parser instead of a SAX parser. The first task after this is to make sure that each XML snippet can easily be linked to a release id. The easiest way to do that is to embed the release id in the filename. For example, the data of release 249504 would be kept in a file named 249504.xml and so on.

Computing checksums

Then I simply compute SHA256 checksums for the content of each of the XML files and store that in a separate file, together with the name of the release.

For this I wrote a script that works with a few threads computing checksums of files and grabbing information from the XML DOM and writing them to some separate files, for easier access to some information.

The code for it can be found in my GitHub repository. The script currently spits out three files per month:
  1. a mapping of file name to SHA256 checksum
  2. a mapping of file name to country
  3. a mapping of file name to status if the status is anything else than Accepted
The rest of the scripts then manipulate these output files (although the script to process country data still needs to be made).

Be warned that the script can easily run for more than two hours (depending on configuration), longer if you have fewer resources available. Also, the scripts still contain some hardcoded paths, but this will be changed as soon as I find the time to fix it.

It is a crude hack, but it works.

Finding differences between monthly Discogs data

The next step is to find differences between the data from the monthly dumps. This is as simple as:
  1. get the entries plus corresponding checksums for one month
  2. get the entries plus corresponding checksums for the other month
  3. compare the sets in 1 and 2
and that really is all there is to it.  With this information the following questions can now be easily answered:
  1. how many entries are new?
  2. how many entries have been removed (merged, etc.)?
  3. how many entries have been changed?

Test results

I compared the August 2017 (containing all data up to August 1) and September 2017 (containing all data up to September 1) dumps. Some statistics about both months:
From August 1- 31 2017 107,347 new releases were added to the Discogs database.
  • the August 2017 dump had 8,773,483 entries, whereas the September 2017 dump  had 8,878,391 entries. This is a difference of 104,908 entries (new entries in September - entries removed in August).
  • 107,347 releases were new in the September 2017 dump
  • 2,439 releases from the August 2017 dump could no longer be found in the September 2017 dump.
  • in the August 2017 dump 108 entries were not Accepted, in the September 2017 dump 200 entries were not Accepted. This is a difference of 92 entries.
  • of the 108 entries that were not accepted in the August 2017 dump 97 could not be found in the September 2017 dump. 189 from the September 2017 dump could not be found in the August 2017 dump. This means that only 11 were shared.
  • one entry that was set to Draft in the August 2017 dump was Accepted in the September 2017 dump, namely release 10642344. It was a draft that had been added before it officially had been released.
  • 8,355,640 entries that could be found in the August 2017 dump were unchanged in the September 2017 dump. 415,404 entries from the August 2017 dump were changed in the September 2017 dump.
From August 1 - 31 2017 415,404 entries were changed in the Discogs database.

Future work

There are a few things that still need to be done and that I will work on soon:
  • process only the new files to see if there are any smells in particular and, where appropriate change the information in the database.
  • further split the files, for example per artist or per label
  • check how many of the new releases from the September 2017 dump will have been removed in the October 2017 dump
  • and probably a few more things...
In the future I want to make it easier to unlock this data is by putting it in a database. This will likely take some time to get right.

Comments

Popular posts from this blog

SID codes (part 1)

One thing that I only learned about after using Discogs is the so called Source Identification Code, or SID. These codes were introduced in 1994 to combat piracy and to find out on which machines a CD was made. It was introduced by Philips and adopted by IFPI, and specifications are publicly available which clearly describe the two available SID codes (mastering SID code and mould SID code). Since quite a few months Discogs has two fields available in the " Barcode and Other Identifiers " (BaOI) section: Mould SID code Mastering SID code A few questions immediately popped up in my mind: how many releases don't have a SID field defined when there should be (for example, the free text field indicates it is a SID field)? how many releases have a SID field with values that should not be in the SID field? how many release have a SID field, but a wrong year (as SID codes were only introduced in 1994) how many vinyl releases have a SID code defined (which is impossi...

SPARS codes (part 1)

Let's talk about SPARS codes used on CDs (or CD-like formats). You have most likely seen it used, but maybe don't know its name. The SPARS code is a three letter code indicating if recording, mixing and mastering were analogue or digital. For example they could look like the ones below. There is not a fixed format, so there are other variants as well. Personally I am not paying too much attention to these codes (I simply do not care), but in the classical music world if something was labeled as DDD (so everything digital) companies could ask premium prices. That makes it interesting information to mine and unlock, which is something that Discogs does not allow people to do when searching (yet!) even though it could be a helpful filter. I wanted to see if it can be used as an identifier to tell releases apart (are there similar releases where the only difference is the SPARS code?). SPARS code in Discogs Since a few months SPARS is a separate field in the Discogs ...

Country statistics (part 2)

One thing I wondered about: for how many releases is the country field changed? I looked at the two most recent data dumps (covering February and March 2019) and see where they differed. In total 5274 releases "moved". The top 20 moves are: unknown -> US: 454 Germany -> Europe: 319 UK & Europe -> Europe: 217 unknown -> UK: 178 UK -> Europe: 149 Netherlands -> Europe: 147 unknown -> Europe: 139 unknown -> Germany: 120 UK -> US: 118 Europe -> Germany: 84 US -> UK: 79 USA & Canada -> US: 76 US -> Canada: 65 unknown -> France: 64 UK -> UK & Europe: 62 UK & Europe -> UK: 51 France -> Europe: 51 Saudi Arabia -> United Arab Emirates: 49 US -> Europe: 46 unknown -> Japan: 45 When you think about it these all make sense (there was a big consolidation in Europe in the 1980s and releases for multiple countries were made in a single pressing plant) but there are also a few weird changes:...