
How Discogs can prevent wrong data (part 2)

For people who care about the correctness of data in Discogs, it sometimes seems like an uphill battle. Once an error has been introduced, it gets copied and spreads, and fixing it becomes almost impossible. It very much resembles fixing errors in the waterfall model of software engineering: stopping errors at the beginning is much easier than fixing them later.

One way errors spread is through the "copy to draft" functionality in Discogs, where information from an existing release can be copied and serve as a template when adding a new variant of a release. Although this is extremely useful (it speeds up entering information), people copy not only the correct information but also the errors, or they leave in information that is irrelevant to the new release (example: SID codes for vinyl or other releases). When the new release is in turn used as a template, the wrong information spreads further through the database.

Detecting errors is quite trivial using some simple checks that verify the contents of the data in a release and report if something is wrong.
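To give an idea of how simple such a check can be, here is a minimal sketch in Python that flags SID codes on a vinyl-only release. The release dictionary, its field names, the identifier type names and the function `find_sid_on_vinyl` are all assumptions made up for illustration; this is not the actual Discogs data model.

```python
# Identifier types treated as SID codes (names are assumptions for this sketch).
SID_TYPES = {"Mastering SID Code", "Mould SID Code"}


def find_sid_on_vinyl(release):
    """Return the SID identifiers of a release that consists only of vinyl.

    `release` is a hypothetical, simplified dictionary with a list of format
    names and a list of identifiers.
    """
    formats = set(release.get("formats", []))
    # Only flag releases that are purely vinyl: a vinyl + CD set can
    # legitimately carry SID codes for the CD.
    if formats != {"Vinyl"}:
        return []
    return [i for i in release.get("identifiers", []) if i.get("type") in SID_TYPES]


# Made-up example release
release = {
    "id": 1,
    "formats": ["Vinyl"],
    "identifiers": [
        {"type": "Mould SID Code", "value": "IFPI 0123"},
        {"type": "Matrix / Runout", "value": "A1"},
    ],
}

suspicious = find_sid_on_vinyl(release)
```

For the made-up release above only the Mould SID Code entry is reported; the matrix entry is left alone.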

Fixing errors is another story. People using scripts (like myself) already have a strong motivation to find and fix these errors, but the vast number of errors currently in the data (my scripts already detect close to 400,000 individual errors without trying hard) makes this an impossible task for just a few users, no matter how eager they are.

So you need to try to motivate other users to fix errors as well. In this blogpost I am going to explore a solution that I think could be implemented in a fairly non-intrusive and very lightweight way, perhaps as some sort of "janitor mode" for users who want to help with fixing.

Helping users find errors more easily

Guiding users to fix errors is key. Let's look at an error that is quite common in the "Barcode and Other Identifiers" (BaOI) section, namely wrong information in the Barcode field. The reason there are so many errors in this field is that Barcode is the default type when adding a new identifier to this section, and quite a few people simply do not change the default (even though it takes minimal effort). The screenshot below shows what that looks like when visiting a release page on Discogs:

Common error: the default value Barcode has not been changed to Rights Society

As I said, detecting the error is quite trivial, but getting people to fix the error is more difficult. The first step is that the user has to recognize the error. Because there are so many different guidelines and rules, I can totally understand that some users (including myself) take a conservative approach when seeing something that doesn't look quite right, think "I'll not touch it, because I am unsure about it" and leave it as is. Or perhaps they do not want to touch other people's submissions (also, some people are quite possessive, which simply does not make sense for a collaborative site), or they think that because it is already in the database it must be correct. Whatever the reason for the error, making certain that the user actually recognizes it as an error is a very important first step in fixing it.
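As an illustration of how trivial the detection side is, the Barcode error above could be caught with something like the following Python sketch. The list of rights societies is deliberately short and only illustrative, and the identifier dictionary and the function name `barcode_looks_like_rights_society` are assumptions, not part of Discogs.

```python
import re

# A few well-known rights societies; a real check would need a longer list.
RIGHTS_SOCIETIES = {"GEMA", "BIEM", "STEMRA", "SACEM", "MCPS", "JASRAC"}


def barcode_looks_like_rights_society(identifier):
    """Return True if a Barcode identifier seems to hold a rights society name.

    `identifier` is a hypothetical dictionary with "type" and "value" keys.
    """
    if identifier.get("type") != "Barcode":
        return False
    value = identifier.get("value", "").upper()
    # Split on anything that is not a letter, so "BIEM/STEMRA" yields
    # the tokens "BIEM" and "STEMRA".
    tokens = re.split(r"[^A-Z]+", value)
    return any(token in RIGHTS_SOCIETIES for token in tokens)


# Made-up examples
wrong = {"type": "Barcode", "value": "BIEM/STEMRA"}
fine = {"type": "Barcode", "value": "5 099749 423824"}
```

Here `wrong` is flagged and `fine` is not. Matching on whole tokens instead of substrings keeps the check conservative, which matters if the result is going to be shown to users as a hint.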

Most humans are very good at picking up visual cues, so adding something to the site indicating that there might be something wrong could be very effective. It could possibly look something like the picture below. I must admit that my graphical skills are very bad, so this can use a lot of improvement:

Mock up to indicate that there is possibly something wrong with the data

The possible error on the page now has a red rectangle around it, with some text indicating that the value is likely wrong. The text should most probably go somewhere else, or be shown in another way, and would very likely also contain a suggestion like "This is likely a Rights Society" instead of the terse message that I wrote.

Possible implementation

Personally I would likely use a script on the server to generate small bits of JSON (describing errors) for every release that has errors, and use some extra client side code with CSS to download the JSON and indicate on the page where the errors are, either when users have some sort of "janitor mode" enabled or when they are editing releases. The JSON I would either generate on the fly or every night, and I would invalidate it every time the release is changed and then regenerate it (either on the fly, or the next night).
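A rough sketch of what the server side of this could produce, in Python. The layout of the JSON (keys such as `release_id`, `section`, `index` and `message`) and the function `build_error_report` are purely hypothetical; nothing like this exists in Discogs as far as this post is concerned.

```python
import json


def build_error_report(release_id, errors):
    """Serialize the detected errors for one release as a small JSON document.

    `errors` is a list of (section, index, message) tuples, where `index`
    is the position of the offending entry within that section.
    """
    report = {
        "release_id": release_id,
        "errors": [
            {"section": section, "index": index, "message": message}
            for section, index, message in errors
        ],
    }
    return json.dumps(report, sort_keys=True)


# Made-up example: the first identifier of release 1234 looks wrong
report = build_error_report(
    1234,
    [("identifiers", 0, "This is likely a Rights Society")],
)
```

The client side code would then only need the section and index to know which element on the page to put the red rectangle around, and the message to show next to it.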

There is probably a lot more that could be done to prevent errors from sneaking into the data. I might talk about that in a future post: I have a few vague ideas, but they still need some more time.
