
Barcodes (part 1)

One thing that I still haven't properly researched is barcodes. This is not because the barcode field in Discogs is used correctly. Quite the contrary: I am seeing lots of releases where the barcode field is used for all kinds of information that is not a barcode.

No, the real reason is that I am afraid of how much bad stuff I will find. The number of known possible errors in Discogs is already quite large, even when just looking at part of the "Barcodes and Other Identifiers" section! I haven't even started on verifying other data such as labels, companies, and so on, but I am very sure that as soon as I start processing that, the floodgates will open.

But, I have to take the plunge one day, so today is as good as any other. Before I can show some useful results it is good to dive into some specifics about barcodes and how these are used in Discogs.

Barcodes

My guess is that most people are familiar with barcodes, but if not there is an excellent article on Wikipedia about barcodes. In short: there are many different kinds of barcodes, so that is something to keep in mind when looking at the data in Discogs. Most of the barcodes will be either EAN-13 or JAN (Japanese releases), although I am confident there are others as well. One example I saw is likely an EAN-13 with the check digit missing.
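To make the "check digit" part concrete, here is a minimal sketch of the standard EAN-13 check digit calculation (the first twelve digits are weighted 1, 3, 1, 3, ... and the check digit pads the total to a multiple of ten). The function names are my own and purely illustrative, and stripping spaces and hyphens is an assumption about how values tend to be typed in.

```python
def ean13_check_digit(first12: str) -> int:
    """Compute the EAN-13 check digit for the first 12 digits."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """Check a 13-digit code, ignoring spaces and hyphens (an assumption)."""
    digits = code.replace(" ", "").replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    return ean13_check_digit(digits[:12]) == int(digits[12])


print(is_valid_ean13("4009880364925"))  # a digit string that shows up in the data
print(is_valid_ean13("400988036492"))   # same digits with the check digit missing
```

A 12-digit value that fails this test is not necessarily wrong: it could also be a different kind of barcode, so a check like this can only flag candidates for a closer look.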

Barcodes in Discogs

Discogs uses barcodes for lookups. For example, the Discogs app on iPhone lets you scan a barcode and then takes you to a list of releases that have that particular barcode. The lookup runs against the Discogs database and it is a nifty feature, but it depends on the barcode information being correct. And this is where the fun starts. There are three questions I wanted to answer:

Barcode field abused for other data

In Discogs there is a section "Barcodes and Other Identifiers", or "BaOI" for short. This section is used to store all kinds of information, such as label codes (which does not mean "anything printed on a label"), rights society information, runouts, and more. When you enter a new item in this section you have to select the right type for the field you are entering, but some people don't know which type to pick, or simply don't bother to change it, and leave it set to the default, which is "Barcode". This slightly complicates things, as for example a SPARS code is obviously not a barcode.

So the first question I had is: how many barcode values are there that obviously cannot be barcodes?
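One rough way to approach this question is a purely syntactic filter: strip the separators people type and see whether what remains is a plausible run of digits. The sketch below is an assumption-laden heuristic, not the method behind the results that follow. The accepted lengths (8, 12 and 13 digits, for EAN-8, UPC-A and EAN-13) and the name `looks_like_barcode` are my own choices for illustration.

```python
import re

# Spaces and hyphens are commonly typed between digit groups (assumption).
SEPARATORS = re.compile(r"[\s\-]")

# Assumed plausible lengths: EAN-8, UPC-A and EAN-13.
PLAUSIBLE_LENGTHS = {8, 12, 13}


def looks_like_barcode(value: str) -> bool:
    """Return True if the value is digits only and has a plausible length."""
    compact = SEPARATORS.sub("", value)
    return (
        compact.isascii()
        and compact.isdigit()
        and len(compact) in PLAUSIBLE_LENGTHS
    )


for value in ["4009880364925", "4 009880 364925", "DDD", "none", "none4009880364925"]:
    print(repr(value), looks_like_barcode(value))
```

Anything that fails this test "obviously" is not a barcode in the forms assumed here, but a value that passes can of course still be wrong for the release.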

Incorrect barcodes

The second question that I had is: how many barcodes do not actually correspond to the release? One of the sources of errors in Discogs is the so-called "copy to draft" function, where you can use an existing release in Discogs as a template for entering data, to make it easier. Some people are extremely sloppy and don't remove the incorrect data from the release they copied it from. I am seeing plenty of errors with this, but I am not sure how prevalent it really is.

Wrongly dated releases

Barcodes only started to become more prominent in the mid-1980s and before that almost no releases (according to my knowledge) had barcodes. Old releases obviously cannot have a barcode, so having a barcode on an old release either means that the barcode is incorrect (see above) or that someone entered the wrong release date and it is actually a repress or reissue. The third question is: how many releases have a valid barcode, but a release year that would make it impossible to have a barcode?

So with these questions in mind I started to dig into the data.

Barcode distribution in Discogs

I took the data dump covering all data until (not including) May 1 2019 and looked at which releases have a barcode field defined, regardless of whether or not they actually are correct:

Distribution of releases with barcodes in the Discogs dataset

For each million release numbers there are about 200,000 releases with barcode fields, with the first million release numbers having about 275,000. In total there are 2,881,281 releases with the barcode field. This only counts the field "Barcode" in the "Barcodes and Other Identifiers" section; it excludes releases where a barcode is present but has ended up in other fields or sections (like the 'Notes'). Together these releases have 3,403,322 barcode fields.
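Counts like these can be produced by streaming through the XML dump. The following is a minimal sketch, not the script that produced the numbers above. It assumes that each release is a `release` element with an `id` attribute and that identifiers are stored as `identifier` elements with `type` and `value` attributes. The sample data is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_barcode_fields(source):
    """Stream through a releases dump and count "Barcode" identifier fields.

    Returns (releases with a barcode field, total barcode fields,
    Counter of releases with a barcode field per million release numbers).
    """
    releases_with_barcode = 0
    barcode_fields = 0
    per_million = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "release":
            continue
        # Assumed layout: <identifier type="Barcode" value="..."/>
        n = sum(1 for ident in elem.iter("identifier")
                if ident.get("type") == "Barcode")
        if n:
            releases_with_barcode += 1
            barcode_fields += n
            per_million[int(elem.get("id")) // 1_000_000] += 1
        elem.clear()  # drop the children of processed releases to save memory
    return releases_with_barcode, barcode_fields, per_million


# Made-up sample in the assumed layout, not real Discogs data.
SAMPLE = b"""<releases>
<release id="1"><identifiers>
  <identifier type="Barcode" value="4009880364925"/>
  <identifier type="Barcode" value="4 009880 364925"/>
  <identifier type="Label Code" value="LC 0000"/>
</identifiers></release>
<release id="2"><identifiers/></release>
<release id="1000005"><identifiers>
  <identifier type="Barcode" value="none"/>
</identifiers></release>
</releases>"""

with_barcode, fields, per_million = count_barcode_fields(io.BytesIO(SAMPLE))
print(with_barcode, fields, dict(per_million))
```

For a real gzip-compressed dump you would pass something like `gzip.open(path, "rb")` instead of the in-memory sample.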

"No barcode"

There are quite a few releases that have no barcode. Because Discogs makes it really hard to flag non-existent data, and there is no single way to do this, I looked into variations of "none" (ignoring differences in case):
  • none : 24753
  • [none] : 196
  • (none) : 125
  • none  : 97
  • none. : 14
  •  none : 10
  • none (promo) : 6
  • "none" : 6
  • none present : 5
  • #cat: none : 3
  • none shown : 2
  • none available : 2
  • none on cover : 2
  • none - pre barcode era : 2
  • none barcode : 2
  • none   : 2
  • none (club edition) : 2
  • none                 : 1
  • (none as promo) : 1
  • none as promo : 1
  • cat# none : 1
  • none (digital) : 1
  • none (erased trace) : 1
  • none given : 1
  •   none : 1
  • none because release date : 1
  • none2008 : 1
  • none 9 : 1
  • none on label : 1
  • none - test pressing : 1
  • nonexistent  : 1
  • none / promotional only - not for sale : 1
  • none [white rectangle] : 1
  • none on cd or box : 1
  • 'none' : 1
  • none on cd case : 1
  • none1 : 1
  • none listed : 1
  • none on disc : 1
  • none4009880364925 : 1
  • -none- : 1
Luckily most use "none", but there are of course extra spaces, quotes, extra data, and so on. Extending it to "no" (and not showing all results):
  • no barcode : 2490
  • no : 261
  • no barcode  : 46
  • non : 42
  • no bar code : 41
  • no barcode. : 30
  • unknown : 30
  • not on label : 27
  • {digits not printed} : 14
  • no code : 14
  • not barcode : 14
  • (no barcode) : 9
  • no bar code promo only not for sale : 6
  • no barcode ! : 6
  • [no barcode] : 6
  • no barcode on the sleeve : 5
  • not : 5
  • no  : 5
  • nobarcode : 5
  • no barecode : 5
  • not on barcode : 5
  • not for sale : 4
  • not available : 4
  • no barcode available : 4
  • no barcode! : 3
  • [unknown] : 3
  • [no text] : 3
  • non barcode : 3
  • non specified : 3
  • no barcode on cover : 3
Sigh. Seriously Discogs, just follow my suggestion, it would make the data a lot more clean and make it easier for me to skip these entries. It would make the world a better place.
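Until then, skipping these entries takes some normalisation. Below is a minimal sketch of how the most common variants from the lists above could be folded together: lowercase the value, collapse whitespace and strip wrapping punctuation. The set of placeholder strings and the function names are my own picks for illustration and are far from complete.

```python
import re
from collections import Counter

# A few of the placeholder spellings seen above, after normalisation (incomplete).
PLACEHOLDERS = {"none", "no", "no barcode", "no bar code", "nobarcode", "unknown"}


def normalize(value: str) -> str:
    """Lowercase, collapse whitespace and strip wrapping punctuation."""
    text = re.sub(r"\s+", " ", value).strip().lower()
    return text.strip(" []()\"'-.!")


def is_placeholder(value: str) -> bool:
    """Return True if the value only says that there is no barcode."""
    return normalize(value) in PLACEHOLDERS


values = ["none", "[none]", "None ", "-none-", "no barcode !", "none4009880364925"]
tally = Counter(normalize(v) for v in values if is_placeholder(v))
print(tally)
```

Values with extra information such as "none (promo)" or "none4009880364925" deliberately fall through, since they need a human (or a smarter rule) to decide what was meant.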

In the next part I will dive further into the data.
