One cool thing about Discogs is that you can see the whole history of the data, including who changed what and when. For a software engineer like me that is very convenient, as I am used to working with version control.
Unfortunately this data is missing from the Discogs data dump, so I cannot easily see which data has changed compared to a previous month. That makes it more difficult to, for example, see whether things are being changed incorrectly, or to spot other problems. A field with a timestamp of the last change to each release would be great to have. I suggested it to Discogs, but I am not sure whether they will add it to the XML, or perhaps save it for a next update of the format.
In any case, I wanted to be able to see the differences between months now, so I came up with an awful hack. It also makes it possible to do some very crude filtering, and to find differences between any two months rather than just consecutive months, at the cost of a lot of disk space.
Splitting the XML
The first task is to break the XML data down into chunks so the individual releases can be easily accessed. For that I used the xml_split tool, which spits out a separate file containing the XML for each particular release. I invoke it like this:

$ zcat discogs_20170901_releases.xml.gz | xml_split
which writes the files to the current directory, so you may want to create a separate directory and change to it first. Also be aware that you need enough disk space: count on something like 40 GiB per month, to be on the safe side.
Splitting the XML file is a very I/O intensive task, so that is what you want to optimize for. I first tried on a system with a regular hard disk and had to stop after it had been running for a long time. The same system also has an SSD, and even then processing took 142 minutes, so an SSD is an absolute must. Also, as the program spits out millions of small files, you could run out of inodes if you are already running low (it happened to me), so make sure your file system has been prepared properly.
SSD is a "must have" when splitting the XML data.In case you use Linux: make sure your system is up to date so you can take advantage of performance fixes in the latest Linux kernels.
The cool thing is that now I can actually process chunks of XML in parallel, and also use a DOM parser instead of a SAX parser much more easily. The first task after splitting is to make sure that each XML snippet can easily be linked to a release id. The easiest way to do that is to embed the release id in the file name: for example, the data of release 249504 would be kept in a file named 249504.xml, and so on.
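To illustrate, here is a minimal sketch (not my actual script) of how that renaming could be done in Python, assuming the files produced by xml_split are sitting in the current working directory; whatever names xml_split gave them is irrelevant, since the release id is read from the XML itself:

import os
import xml.etree.ElementTree as ET

# Walk over the files produced by xml_split and rename each one to
# <release id>.xml, based on the id attribute of the release element.
for filename in os.listdir('.'):
    if not filename.endswith('.xml'):
        continue
    try:
        root = ET.parse(filename).getroot()
    except ET.ParseError:
        continue
    # skip anything that is not a release, such as the xml_split index file
    if root.tag != 'release':
        continue
    release_id = root.get('id')
    if release_id is not None:
        os.rename(filename, '%s.xml' % release_id)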
Computing checksums
Then I simply compute SHA256 checksums for the contents of each of the XML files and store them in a separate file, together with the name of the release. For this I wrote a script that uses a few threads to compute checksums of the files, grab information from the XML DOM and write it to a few separate files, for easier access to some of the information.
The code for it can be found in my GitHub repository (a simplified sketch follows the list below). The script currently spits out three files per month:
- a mapping of file name to SHA256 checksum
- a mapping of file name to country
- a mapping of file name to status, if the status is anything other than Accepted
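For reference, this is roughly what such a pass could look like. It is a simplified sketch and not the script from the repository; the output file names (sha256, country, status) and the tab-separated format are assumptions made for illustration:

import hashlib
import os
import sys
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

def process(filename):
    # read the file once, then compute the checksum and parse the XML
    with open(filename, 'rb') as infile:
        data = infile.read()
    checksum = hashlib.sha256(data).hexdigest()
    root = ET.fromstring(data)
    country = root.findtext('country', default='')
    status = root.get('status', '')
    return (filename, checksum, country, status)

# directory with the per-release XML files, given as the first argument
os.chdir(sys.argv[1])
xmlfiles = [f for f in os.listdir('.') if f.endswith('.xml')]

with ThreadPoolExecutor(max_workers=4) as executor, \
     open('sha256', 'w') as sha_out, \
     open('country', 'w') as country_out, \
     open('status', 'w') as status_out:
    for name, checksum, country, status in executor.map(process, xmlfiles):
        sha_out.write('%s\t%s\n' % (name, checksum))
        if country:
            country_out.write('%s\t%s\n' % (name, country))
        if status and status != 'Accepted':
            status_out.write('%s\t%s\n' % (name, status))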
Be warned that the script can easily run for more than two hours (depending on configuration), and longer if you have fewer resources available. Also, the scripts still contain some hardcoded paths; this will be changed as soon as I find the time to fix it.
It is a crude hack, but it works.
Finding differences between monthly Discogs data
The next step is to find differences between the data from the monthly dumps. This is as simple as (a small sketch follows the list below):
- get the entries plus corresponding checksums for one month
- get the entries plus corresponding checksums for the other month
- compare the two sets:
- how many entries are new?
- how many entries have been removed (merged, etc.)?
- how many entries have been changed?
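A rough sketch of this comparison in Python, assuming the checksum files use the tab-separated "file name, checksum" layout from the sketch above; the paths to the two monthly files are made up for the example:

def read_checksums(path):
    # build a dictionary mapping file name to SHA256 checksum
    checksums = {}
    with open(path, 'r') as infile:
        for line in infile:
            name, checksum = line.strip().split('\t')
            checksums[name] = checksum
    return checksums

august = read_checksums('201708/sha256')
september = read_checksums('201709/sha256')

new = set(september) - set(august)
removed = set(august) - set(september)
common = set(august) & set(september)
changed = {name for name in common if august[name] != september[name]}

print('new: %d' % len(new))
print('removed: %d' % len(removed))
print('changed: %d' % len(changed))
print('unchanged: %d' % (len(common) - len(changed)))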
Test results
I compared the August 2017 dump (containing all data up to August 1) and the September 2017 dump (containing all data up to September 1). Some statistics about both months:
- the August 2017 dump had 8,773,483 entries, whereas the September 2017 dump had 8,878,391 entries. This is a difference of 104,908 entries (new entries in September minus entries removed from the August dump).
- 107,347 releases were new in the September 2017 dump
- 2,439 releases from the August 2017 dump could no longer be found in the September 2017 dump.
- in the August 2017 dump 108 entries were not Accepted; in the September 2017 dump 200 entries were not Accepted. This is a difference of 92 entries.
- of the 108 entries that were not Accepted in the August 2017 dump, 97 could not be found in the September 2017 dump; 189 of the 200 from the September 2017 dump could not be found in the August 2017 dump. This means that only 11 were shared.
- one entry that was set to Draft in the August 2017 dump was Accepted in the September 2017 dump, namely release 10642344. It was a draft that had been added before the release was officially out.
- 8,355,640 entries from the August 2017 dump were unchanged in the September 2017 dump, and 415,404 entries were changed.
Future work
There are a few things that still need to be done and that I will work on soon:
- process only the new files to see if there are any smells in particular and, where appropriate, change the information in the database
- further split the files, for example per artist or per label
- check how many of the new releases from the September 2017 dump will have been removed in the October 2017 dump
- and probably a few more things...