Looking back on 9 months of digging into Discogs

I have been digging into the Discogs dataset for over 9 months now and blogging about it since September. During this period I have made a few observations.

In short I must say that I have mixed feelings about Discogs because it isn't clear what Discogs actually is and what they want to achieve. I talked about this earlier, but it basically comes down to this: Is it a catalog? Is it a marketplace? Is it a place to organize your collection?

Discogs is trying to be all three but is not succeeding, because there are tensions between the different use cases: enforcing correctness chases sellers away from the marketplace, and selling records is what brings in the money for Discogs and keeps the other fires burning. But the cost of this is that the data in the catalog is sometimes blatantly incorrect, which vastly reduces its value.

Personally I don't care about the marketplace, as I am a collector and I care about the catalog. Discogs has enormous potential and for collectors with a bit of an obsession with different record pressings (like myself) a correct catalog would be a really cool thing to have and use.

But at the moment Discogs is a very, very long way from that. And, to be honest, I am not sure if it will ever get there. Let me tell you why I think so.

Discogs' data model is incorrect and hard to fix

Discogs is far from perfect. It is very clear that the data model has grown organically and you can almost see which design choices were made to accommodate new use cases (or at least, I think I can).

Fixing this would be an enormous operation. At the moment there are close to ten million releases in the Discogs database, with a lot of associated metadata, and releases are being added all the time. They cannot just redesign the data model, migrate the database to the new structure and change users' workflow without major disruptions to the operation of the website. So unfortunately I fear that this is something that we have to live with.

Some people don't appreciate help

Some people have responded very negatively to changes to their releases. There is not a single reason: some people are very possessive of their data ("this is my release"), while other people simply do not seem to understand that the pages they see are just a representation of data, and they focus on the layout of the data as if it were in print. Others object to changes by pointing to old guidelines that were in place when the release was edited, and insist that the release should be kept as it was or judged according to the guidelines that were in place when the release was added, which to me makes even less sense.

Some people don't want to take ownership of problems

What I noticed is that some people point out problems and describe them in full detail (with references to forum discussions, and so on) and then keep pinging until someone (one of the contributors, or someone else) fixes the problems. To me this is just absurd: in a collaborative system you simply fix the problem yourself if you already have all the necessary information, instead of waiting and pinging for (in some instances) years. It is truly mind-boggling.

Discogs' voting system is too coarse and harsh

Discogs allows people with voting rights to vote on the correctness of releases. In some cases the changes are then automatically reverted. Voting is done for the entire release. I have seen it happen that someone corrected a lot of the information, but not all of it (due to lack of information, and so on), and then someone voted "Entirely Incorrect" because some other information that was already present had not been corrected. The release was then rolled back to its previous state, even though that state was in fact more incorrect.

Cataloguing vinyl releases is hard...

There are so many tiny differences in releases and the way they were pressed that it is almost as if they did it on purpose to confuse collectors: "if we just change this tiny bit of text here, then in 30 years people will be really confused!" or something like that.

Of course in the earlier days these releases were not made with collectors in mind: they were just a mass product. That meant that if a product was successful, new batches were made in which errors were corrected, or designs were updated, or different labels were used just because the plant had a pile of them, and so on. A lot of the knowledge, especially from the early days, has been lost and now we just have to guess.

...cataloguing CDs is even harder...

I always thought that distinguishing vinyl records from each other was hard and that with CDs it would actually be easier, as there are fewer pressing plants. But no, this is not the case at all. There are so many represses that can only be distinguished from each other by looking at the releases very accurately and being very precise. Pressing plant names, slightly different matrix numbers, all kinds of codes (such as SID codes), and sometimes no clarity or agreement about which country a release is from all make it very difficult.

...and the way Discogs works is not exactly helping.

What really pisses me off at times is that Discogs could do a lot more to prevent incorrect releases from popping up, but instead basically expects the Discogs community to clear up the mess that others have made (which could be seen as a tragedy of the commons).

For example, it would be fairly trivial to have a few basic checks (which are already in place for some pieces of data) that would flag the most obvious mistakes.
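To give an idea of what I mean, here is a minimal sketch of such checks in Python. It is not how Discogs works internally: the `Release` class, its field names and the `basic_checks` function are all made up for illustration. The rules rest on two facts: SID codes were only introduced in 1994, and they are found on CDs, not on vinyl. A warning here means "worth a second look", not "definitely wrong".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

SID_INTRODUCED = 1994  # SID codes did not exist before this year


@dataclass
class Release:
    """Hypothetical, simplified release record (NOT the Discogs schema)."""
    release_id: int
    formats: List[str]
    year: Optional[int] = None
    # identifier type -> value, e.g. {"Mould SID Code": "IFPI 1234"}
    identifiers: Dict[str, str] = field(default_factory=dict)


def basic_checks(release: Release) -> List[str]:
    """Return warnings for the most obvious mistakes in a submission."""
    warnings = []
    has_sid = any("sid" in name.lower() for name in release.identifiers)
    if has_sid and "Vinyl" in release.formats:
        warnings.append("vinyl release with a SID code")
    if has_sid and release.year is not None and release.year < SID_INTRODUCED:
        warnings.append("SID code on a release dated before 1994")
    if not release.formats:
        warnings.append("no format given")
    return warnings


if __name__ == "__main__":
    suspicious = Release(1, ["Vinyl"], 1985, {"Mould SID Code": "IFPI 1234"})
    for warning in basic_checks(suspicious):
        print(f"release {suspicious.release_id}: {warning}")
```

Checks like these would not block a submission. They would only ask the submitter to confirm the data before it ends up in the catalog.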

But Discogs is asking for it, by basically dumping people into "expert mode" straightaway. I think that a guided submission system with hints and suggestions (think: a wizard) would give higher quality results.

So why am I bothering?

After all the negativity above you might wonder why I am actually bothering: I could just leave it and focus on something else that is more rewarding or useful.

And, to be honest, you would be right. Quite often I actually am not bothering at all and Discogs, or errors in Discogs data, are very far from my mind (and increasingly so).

Still, I think it is a really great dataset to dig into. I have learned a lot about music releases in general and have been able to explore a few new technologies that I can also reuse for other purposes, so it has been a good playground.
