What could be done if the Discogs edit history was made available

Discogs makes its database data available under a CC0 license, but one important aspect of the data is missing: the edit history. For me as a software engineer, having the edit history available is very natural, as it allows me to dive into the history and see where people went wrong, when certain errors were introduced, and so on. For Discogs this is only possible if I am logged into the website and go to the history page of a release, where I am presented with a view of the edit history that I cannot interact with (run queries against, and so on), which I find frustrating.

The edit history is a very rich source of information. For each release a lot of information is kept, such as:
  1. creation date/modification date
  2. origin (whether it was created fresh, or if "copy to draft" was used)
  3. merge information (if merged with another release)
  4. content of edits
  5. contributors and votes
While some of this information is partially available through the Discogs API, the full information is currently unavailable except to Discogs itself. This is unfortunate, as there is a lot of very interesting information that could be mined to discover patterns where people are going wrong, to avoid duplicate work, to keep wrong information from persisting in the data for a long time, to find problematic contributors, and so on.

I want to walk through a few scenarios that I would love to be able to research with this data.

Creation date/modification date

Having access to the creation and modification dates of each release would uncover interesting information, such as how many releases were added or changed per month, day or even hour, and where the hotspots of activity (or inactivity) are.
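If such timestamps were available, the aggregation itself would be simple. A minimal sketch in Python, assuming the dates arrive as UNIX epoch seconds (the function names and the sample values are made up for illustration):

```python
from collections import Counter
from datetime import datetime, timezone

def activity_by_month(timestamps):
    """Count releases per calendar month (UTC) from UNIX epoch seconds."""
    counts = Counter()
    for ts in timestamps:
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        counts[month] += 1
    return counts

def activity_by_hour(timestamps):
    """Count releases per hour of the day (UTC) from UNIX epoch seconds."""
    counts = Counter()
    for ts in timestamps:
        counts[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    return counts

# Made-up sample: two timestamps on 1 March 2019, one on 1 April 2019 (UTC).
sample = [1551398400, 1551402000, 1554076800]
by_month = activity_by_month(sample)
by_hour = activity_by_hour(sample)
```

The same counters could be keyed on weekday or on the full date to look for quiet periods instead of busy ones.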

Right now this information can only be aggregated at a very coarse level when looking at the monthly dumps (basically: what is the first release in the dump file and what is the last?).

The API gives a little more information, as the JSON it returns contains an element called 'date_added', but using it would require downloading all the information via the API, which can take a very long time: downloading 9 million releases using an authenticated connection would take approximately 100 days (unless you are using a distributed downloader).
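The estimate of roughly 100 days follows from simple arithmetic. A quick check, assuming one request per release and a limit of 60 authenticated requests per minute (the limit is an assumption on my part; check the current API documentation):

```python
def download_days(releases, requests_per_minute):
    """Days needed to fetch every release once, at one request per release."""
    minutes = releases / requests_per_minute
    return minutes / (60 * 24)

# Assumed rate limit: 60 requests per minute for an authenticated client.
days = download_days(9_000_000, 60)
print(round(days, 1))  # 104.2
```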

To verify which releases have been changed you would then either have to wait until you can cross-correlate with the monthly data dump, or download all the data again and check when releases were updated (and for more than 95% of the releases the answer would be that the release was not changed). By the time you are done some of the releases will have changed again, but you won't know which ones, so you would have to start all over again. If only Discogs could publish a list of when each release was last changed, it would make this a lot easier.

A file with information about releases and their last change date could be as simple as two tab-separated columns: a release number and a UNIX epoch timestamp. When compressed with gzip such a file would be only about 30 MiB in size for the current database (I tested this with a mockup).
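To show how little is needed on either side, here is a sketch of writing and reading such a file in Python. The function names and the exact layout (one release per line, no header) are my assumptions, not an existing Discogs format:

```python
import gzip

def write_last_changed(path, rows):
    """Write (release_id, epoch_seconds) pairs as gzip-compressed TSV."""
    with gzip.open(path, "wt", encoding="ascii") as f:
        for release_id, ts in rows:
            f.write(f"{release_id}\t{ts}\n")

def read_last_changed(path):
    """Read the file back into a dict mapping release_id to epoch_seconds."""
    result = {}
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            release_id, ts = line.rstrip("\n").split("\t")
            result[int(release_id)] = int(ts)
    return result

def changed_since(last_changed, cutoff):
    """Release ids whose last change is at or after the cutoff (epoch seconds)."""
    return sorted(r for r, ts in last_changed.items() if ts >= cutoff)
```

With a file like this, a downloader would only need to fetch the releases returned by changed_since() for the time of its previous run, instead of all releases.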

Origin of new releases and merging releases

In Discogs there are two ways to add a release to the database:
  1. create a new release from scratch
  2. copy an existing release and use it as a template ("copy to draft")
The second method is used a lot because entering information in Discogs is a ton of work and starting from a baseline that has most information helps save time. The downside is that some users are sloppy and a lot of the information is not adapted and survives even if it should not be there. Knowing the creation flow for releases makes it easier to detect these cases.

In addition it would make it much easier to track changes that should be applied to multiple releases. To speak in software engineering terms: a "copy to draft" can be seen as a fork. Edits to the release can be seen as patches. In the software world it is common to cherry pick changes and port them to other branches of the software. This is also something that might be useful for releases in Discogs. It happens often that a variant is added and that "copy to draft" is used. Things are then added to or changed in the fork, but not in the original, or vice versa. For some of the information this makes total sense (like catalog numbers, country, and so on) but for other information (artists, titles of tracks, and so on) many times this does not make sense and the information should be backported or forward ported to the other release.

If the history could actually be followed this would be a lot easier to detect.
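As a rough illustration of the fork and patch idea: if it were known that one release was copied from another, finding candidates for back- or forward-porting would come down to comparing fields. The sketch below assumes releases are plain dictionaries; the field names, the set of fields that are allowed to differ, and the sample data are all hypothetical:

```python
# Fields that are expected to differ between variants (an assumption for this sketch).
VARIANT_FIELDS = {"catno", "country", "barcode"}

def port_candidates(original, fork):
    """Return shared, non-variant fields whose values differ between a release and its fork."""
    candidates = {}
    for field in original.keys() & fork.keys():
        if field in VARIANT_FIELDS:
            continue
        if original[field] != fork[field]:
            candidates[field] = (original[field], fork[field])
    return candidates

# Made-up data: a typo in the title was fixed in the fork but not in the original.
original = {"title": "Exmaple Album", "country": "UK", "catno": "ABC 001"}
fork = {"title": "Example Album", "country": "US", "catno": "ABC 001-US"}
diff = port_candidates(original, fork)
```

This only flags differences. Deciding which side is correct would still need the edit history (which side changed, and when) or a human.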

Similarly, information about merges (when two duplicate releases are merged into one) would help enormously in tracking these changes consistently.

Contents of edits

To be able to port changes from one release to another it is necessary that the edits themselves can also be mined. It would also be interesting to see how many edits releases have on average (my gut feeling: very few).

I am pretty sure that looking at this there would be a few surprises, especially when combined with information about contributors and votes.

Contributors and votes

It would be interesting to be able to see which contributors did what. When using the API you can already get a bit of information as the people who contributed are listed (this is not in the monthly XML data dump), but it doesn't specify who changed what, who the original submitter was, and so on. What I would find interesting is to see if there are certain users who keep making specific errors that other users then fix.

As an example: I know that there is one user who added hundreds of releases with watermarked images and that there was one other user who then disabled all the watermarked images. I have also seen users who consistently add releases by forking an older release and then don't add them to the corresponding "master" release (which is why knowing how releases are created would be so useful).

There might also be users whose contributions are consistently rejected because they are, for example, hijacking releases. Other users might get lots of comments and votes about changes that need to be made, but ignore these and add more releases instead (also with errors).

Being able to detect these instances automatically and earlier would be useful for database hygiene.
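What such a detection could look like depends entirely on the shape of the data, which is unknown. As a crude sketch, assume the edit history could be flattened into chronological (release, user, field) records (a hypothetical format) and count how often a user's edit to a field was followed by someone else's edit to the same field:

```python
from collections import Counter

def corrected_counts(edits):
    """Count, per user, how often the next edit to the same field of the same
    release came from a different user.

    `edits` is an iterable of (release_id, user, field) tuples in chronological order.
    """
    last_editor = {}
    corrected = Counter()
    for release_id, user, field in edits:
        key = (release_id, field)
        previous = last_editor.get(key)
        if previous is not None and previous != user:
            corrected[previous] += 1
        last_editor[key] = user
    return corrected

# Made-up history: "bob" touches the images of two releases after "alice" did.
edits = [
    (1, "alice", "images"),
    (1, "bob", "images"),
    (2, "alice", "images"),
    (2, "bob", "images"),
    (2, "bob", "notes"),
]
counts = corrected_counts(edits)
```

A follow-up edit is not necessarily a correction, so a count like this could only be a starting point for a closer look, ideally combined with the votes on each edit.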

So please Discogs, let us have some more information (of course under an acceptable license). I am sure you will not regret it.
