
Why I love Discogs and why I hate Discogs

I have been using Discogs for a few years now, and I love it and hate it at the same time. For those of you who do not know it: Discogs is an online catalogue that tries to collect as much information as possible about every music-related release ever made. It works as follows: people enter information about releases into the database and collaborate on enhancing the data, fixing errors, uploading pictures and so on, until the information is complete and correct.

Data quality: or why I hate Discogs

At least, that's the theory. In practice it turns out that Cory Doctorow was right: the data quality of the releases in the catalogue varies a lot. For some releases every tiny detail has been described and there are clear pictures; for others you get a catalogue number and possibly the right country (if you're lucky!). Some of this bad data has been in there for many years and no one is fixing it, even though for some releases over 100 people have indicated that they have a copy. This is what I really hate about Discogs.

Discogs marketplace

This wouldn't be such a big problem (just annoying) if it weren't for the Discogs marketplace. Discogs isn't just a catalogue: it also has an associated marketplace where people buy and sell music items (and which has apparently taken a lot of the vinyl market away from eBay). In theory, sellers browse the catalogue for the item they want to sell and then indicate that they have a copy of that item for sale. If it isn't in the catalogue, they add it first and then offer it for sale. Unfortunately, this is not what happens. Too often I see that a seller has listed a certain item for sale, but then the description reveals that it is actually a different item. For example, for a Peruvian item that I saw offered for sale, the seller said:

"my copy is from Argentina"

but the catalogue did not contain information about the pressing from Argentina. This is against the Discogs terms of service, which say:

"Discogs allows for the submission of all unique versions of a release to the database. This means that all items listed for sale in the Marketplace must correspond with the correct release in the database. If the correct release does not yet exist on Discogs it must be submitted to the database before it can be sold. Commenting on the differences between the item being sold and the one detailed on the Discogs release page is not permitted."

There are literally thousands and thousands of items for sale that do not correspond with the correct release in the database.

Discogs actually offers a way to flag this so they can take action, but punishing sellers is not really in their own interest: Discogs takes a cut of every sale made through the site. In fact, that is their business model! Every sale means money for Discogs, so I understand them. But for people who expect to buy a certain item and then don't get the item as advertised, or who miss out on an item because it was never added to the database, it can be a bitter experience.

I don't blame the sellers: at the moment there are still lots of releases missing from Discogs, and adding a release is a lot of work if you want to do it right. If you have thousands of items for sale, it is very time-consuming to check the releases and add the ones that are missing.

Using data: or why I love Discogs

As I said, I really hate the data quality issues in Discogs, but at the same time the site has a lot of potential. There are a few ways to react to this:
  1. I could get angry
  2. I could ignore it and hope someone will fix the problems with the data
  3. I could fix the data
Getting angry will not get me any further. Ignoring the problem is the most convenient option and costs the least energy, but it might take a long time for things to improve. Or I can fix it.

Maybe foolishly, I have decided to take that last option and fix the data wherever I can. Luckily, and this is why I love them, Discogs makes this quite easy: every month the data of the website (apart from the pictures, user data and sales data) is made available as a set of XML files under the CC0 license, so you can do basically whatever you want with it.

With some scripts and domain-specific knowledge it is quite easy to flag entries that have problems. This is what this blog will be about. The data in the Discogs database has enormous potential; we just need to unlock it.
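To give a rough idea of what such a script could look like, here is a minimal Python sketch that streams through a releases file and flags entries without a country or release date. The element names (`release`, `country`, `released`), the `id` attribute and the inline sample data are assumptions made for illustration, and `flag_releases` is a hypothetical helper; check them against the actual dump before relying on any of it.

```python
import io
import xml.etree.ElementTree as ET


def flag_releases(xml_source):
    """Yield (release_id, problem) pairs for releases that look incomplete.

    Assumes each entry is a <release id="..."> element with optional
    <country> and <released> children (an assumption, not a spec).
    """
    # iterparse streams the file, so a large dump is not parsed in one go
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "release":
            continue
        release_id = elem.get("id")
        country = (elem.findtext("country") or "").strip()
        released = (elem.findtext("released") or "").strip()
        if not country:
            yield release_id, "missing country"
        if not released:
            yield release_id, "missing release date"
        # drop the children of a processed release to keep memory use down
        elem.clear()


# Made-up sample data, only here to show the shape the sketch expects
SAMPLE = b"""<releases>
<release id="1"><title>A</title><country>Peru</country><released>1975</released></release>
<release id="2"><title>B</title><country></country><released>1994-05-00</released></release>
<release id="3"><title>C</title><country>Argentina</country></release>
</releases>"""

if __name__ == "__main__":
    for release_id, problem in flag_releases(io.BytesIO(SAMPLE)):
        print(release_id, problem)
```

The same function should accept a path or an opened file instead of the in-memory sample. Real checks would of course need more domain knowledge than "is the field empty", but the overall pattern (stream, inspect, flag) stays the same.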

The next few posts in this blog will be about the Discogs XML data and what can be done to discover errors in the data so they can be fixed.

