One cool thing about Discogs is that you can see the whole history of the data, including who changed what and when. For a software engineer like me that is very convenient, as I am used to working with version control.
Unfortunately this data is missing from the Discogs data dump, so I cannot easily see which data has changed compared to a previous month. That makes it more difficult to, for example, see whether things are being changed incorrectly, or to spot other problems. A field with a timestamp of the last change to each release would be great to have. I suggested it to Discogs, but I am not sure whether they will add it to the XML, or perhaps save it for a next update of the format.
In any case, I wanted to be able to see the differences between months now, so I came up with an awful hack. It also makes it possible to do some very crude filtering, and to find differences between any two months rather than just consecutive months, at the cost of a lot of disk space.
Splitting the XML
The first task is to break the XML data down into chunks so the individual releases can be easily accessed. For that I used the xml_split tool, which spits out a separate file containing the XML for each particular release. I invoke it like this:

$ zcat discogs_20170901_releases.xml.gz | xml_split
which writes the files to the current directory, so you may want to create a separate directory and change to it first. Also be aware that you need enough disk space: count on something like 40 GiB per month, to be on the safe side.
Splitting the XML file is a very I/O intensive task, so that is what you want to optimize for. I first tried on a system with a regular hard disk and had to stop after it had been running for a long time. The same system also has an SSD, and even then processing took 142 minutes, so an SSD is an absolute must. Also, as the program spits out millions of small files, you could run out of inodes if you are already running low (it happened to me), so make sure your file system has been prepared properly.
SSD is a "must have" when splitting the XML data.In case you use Linux: make sure your system is up to date so you can take advantage of performance fixes in the latest Linux kernels.
The cool thing is that now I can actually process chunks of XML in parallel, and also use a DOM parser instead of a SAX parser much more easily. The first task after splitting is to make sure that each XML snippet can easily be linked to a release id. The easiest way to do that is to embed the release id in the file name: for example, the data of release 249504 would be kept in a file named 249504.xml, and so on.
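To illustrate, here is a minimal sketch (not my actual script) of how that renaming could be done in Python, assuming the files produced by xml_split are sitting in the current working directory; whatever names xml_split gave them is irrelevant, since the release id is read from the XML itself:

import os
import xml.etree.ElementTree as ET

# Walk over the files produced by xml_split and rename each one to
# <release id>.xml, based on the id attribute of the release element.
for filename in os.listdir('.'):
    if not filename.endswith('.xml'):
        continue
    try:
        root = ET.parse(filename).getroot()
    except ET.ParseError:
        continue
    # skip anything that is not a release, such as the xml_split index file
    if root.tag != 'release':
        continue
    release_id = root.get('id')
    if release_id is not None:
        os.rename(filename, '%s.xml' % release_id)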
Computing checksums
Then I simply compute SHA256 checksums for the contents of each of the XML files and store them in a separate file, together with the name of the release. For this I wrote a script that uses a few threads to compute checksums of the files, grab information from the XML DOM and write it to a few separate files, for easier access to some of the information.
The code for it can be found in my GitHub repository (a simplified sketch follows the list below). The script currently spits out three files per month:
- a mapping of file name to SHA256 checksum
- a mapping of file name to country
- a mapping of file name to status, if the status is anything other than Accepted
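For reference, this is roughly what such a pass could look like. It is a simplified sketch and not the script from the repository; the output file names (sha256, country, status) and the tab-separated format are assumptions made for illustration:

import hashlib
import os
import sys
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

def process(filename):
    # read the file once, then compute the checksum and parse the XML
    with open(filename, 'rb') as infile:
        data = infile.read()
    checksum = hashlib.sha256(data).hexdigest()
    root = ET.fromstring(data)
    country = root.findtext('country', default='')
    status = root.get('status', '')
    return (filename, checksum, country, status)

# directory with the per-release XML files, given as the first argument
os.chdir(sys.argv[1])
xmlfiles = [f for f in os.listdir('.') if f.endswith('.xml')]

with ThreadPoolExecutor(max_workers=4) as executor, \
     open('sha256', 'w') as sha_out, \
     open('country', 'w') as country_out, \
     open('status', 'w') as status_out:
    for name, checksum, country, status in executor.map(process, xmlfiles):
        sha_out.write('%s\t%s\n' % (name, checksum))
        if country:
            country_out.write('%s\t%s\n' % (name, country))
        if status and status != 'Accepted':
            status_out.write('%s\t%s\n' % (name, status))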
Be warned that the script can easily run for more than two hours (depending on configuration), and longer if you have fewer resources available. Also, the scripts still contain some hardcoded paths; this will be changed as soon as I find the time to fix it.
It is a crude hack, but it works.
Finding differences between monthly Discogs data
The next step is to find differences between the data from the monthly dumps. This is as simple as (a small sketch follows the list below):
- get the entries plus corresponding checksums for one month
- get the entries plus corresponding checksums for the other month
- compare the two sets:
- how many entries are new?
- how many entries have been removed (merged, etc.)?
- how many entries have been changed?
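A rough sketch of this comparison in Python, assuming the checksum files use the tab-separated "file name, checksum" layout from the sketch above; the paths to the two monthly files are made up for the example:

def read_checksums(path):
    # build a dictionary mapping file name to SHA256 checksum
    checksums = {}
    with open(path, 'r') as infile:
        for line in infile:
            name, checksum = line.strip().split('\t')
            checksums[name] = checksum
    return checksums

august = read_checksums('201708/sha256')
september = read_checksums('201709/sha256')

new = set(september) - set(august)
removed = set(august) - set(september)
common = set(august) & set(september)
changed = {name for name in common if august[name] != september[name]}

print('new: %d' % len(new))
print('removed: %d' % len(removed))
print('changed: %d' % len(changed))
print('unchanged: %d' % (len(common) - len(changed)))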
Test results
I compared the August 2017 dump (containing all data up to August 1) and the September 2017 dump (containing all data up to September 1). Some statistics about both months:
- the August 2017 dump had 8,773,483 entries, whereas the September 2017 dump had 8,878,391 entries. This is a difference of 104,908 entries (new entries in September minus entries removed from the August dump).
- 107,347 releases were new in the September 2017 dump
- 2,439 releases from the August 2017 dump could no longer be found in the September 2017 dump.
- in the August 2017 dump 108 entries were not Accepted; in the September 2017 dump 200 entries were not Accepted. This is a difference of 92 entries.
- of the 108 entries that were not Accepted in the August 2017 dump, 97 could not be found in the September 2017 dump; 189 of the 200 from the September 2017 dump could not be found in the August 2017 dump. This means that only 11 were shared.
- one entry that was set to Draft in the August 2017 dump was Accepted in the September 2017 dump, namely release 10642344. It was a draft that had been added before the release was officially out.
- 8,355,640 entries from the August 2017 dump were unchanged in the September 2017 dump, and 415,404 entries were changed.
Future work
There are a few things that still need to be done and that I will work on soon:
- process only the new files to see if there are any smells in particular and, where appropriate, change the information in the database
- further split the files, for example per artist or per label
- check how many of the new releases from the September 2017 dump will have been removed in the October 2017 dump
- and probably a few more things...