One thing that I still haven't properly researched is barcodes. This is not because the barcode field in Discogs is used correctly. Quite the contrary: I am seeing lots of releases where the barcode field is used for all kinds of information that is not a barcode.
No, the real reason is that I am afraid of how much bad stuff I will find. The number of known possible errors in Discogs is already quite large, even when just looking at part of the "Barcodes and Other Identifiers" section! I haven't even started on verifying other data such as labels, companies, and so on, but I am very sure that as soon as I start processing those the floodgates will open.
But, I have to take the plunge one day, so today is as good as any other. Before I can show some useful results it is good to dive into some specifics about barcodes and how these are used in Discogs.
Barcodes
My guess is that most people are familiar with barcodes, but if not there is an excellent article on Wikipedia about barcodes. In short: there are many different kinds of barcodes, so that is something to keep in mind when looking at the data in Discogs. Most of the barcodes will be either EAN-13 or JAN (Japanese releases), although I am confident there are others as well. One example I saw is likely an EAN-13 with the check digit missing.
Barcodes in Discogs
Barcodes are actively used in Discogs. For example, the Discogs app on iPhone lets you scan a barcode and then takes you to a list of releases that have that particular barcode. The lookup is done against the Discogs database. It is a nifty feature, but it depends on the barcode information being correct. And this is where the fun starts. There are three questions I wanted to answer:
Barcode field abused for other data
In Discogs there is a section "Barcodes and Other Identifiers", or "BaOI" for short. This section is used to store all kinds of information, such as label codes (which does not mean "anything printed on a label"), rights society information, runouts, and more. When you enter a new item in this section you have to select the right type for the field you are entering, but some people don't know what to fill in, or simply don't bother to change it, and leave it set to the default, which is "Barcode". This slightly complicates things, as for example a SPARS code is obviously not a barcode.
So the first question I had is: how many barcode values are there that obviously cannot be barcodes?
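One way to make "obviously cannot be a barcode" concrete is sketched below. This is not the method used for the numbers in this post, just an illustration: it strips spaces and hyphens, checks whether only digits remain, and verifies the check digit for 13-digit (EAN-13) and 12-digit (UPC-A) values. The function names and the bucket labels are made up for this example.

```python
import re


def ean13_check_digit(first12: str) -> int:
    """Return the EAN-13 check digit for a string of 12 ASCII digits."""
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def classify(value: str) -> str:
    """Roughly sort a raw field value into a few buckets (hypothetical labels)."""
    # Barcodes are often entered with spaces or hyphens, as printed.
    compact = re.sub(r"[\s-]", "", value)
    if not re.fullmatch(r"[0-9]+", compact):
        return "not a barcode"
    if len(compact) == 13:
        ok = ean13_check_digit(compact[:12]) == int(compact[12])
        return "valid EAN-13" if ok else "13 digits, bad check digit"
    if len(compact) == 12:
        # UPC-A is checked like an EAN-13 with a leading zero.
        ok = ean13_check_digit("0" + compact[:11]) == int(compact[11])
        return "valid UPC-A" if ok else "12 digits, possibly EAN-13 without check digit"
    return "digits, other length"
```

A value such as "AAD" (a SPARS code) ends up in the "not a barcode" bucket, while a 12-digit value that fails the UPC-A check is only flagged as a candidate for a missing check digit, not as a definite error.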
Incorrect barcodes
The second question that I had is: how many barcodes do not actually correspond to the release? One of the sources of errors in Discogs is the so-called "copy to draft" function, where you can use an existing release in Discogs as a template for entering data, to make it easier. Some people are extremely sloppy and don't remove the incorrect data carried over from the release they copied. I am seeing plenty of errors with this, but I am not sure how prevalent it really is.
Wrongly dated releases
Barcodes only started to become more prominent in the mid-1980s, and before that almost no releases (to my knowledge) had barcodes. Releases that predate barcodes obviously cannot have one, so a barcode on an old release either means that the barcode is incorrect (see above) or that someone entered the wrong release date and it is actually a repress or reissue. The third question is: how many releases have a valid barcode, but a release year that would make it impossible to have a barcode?
So with these questions in mind I started to dig into the data.
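The third question boils down to a simple filter, sketched here under a few assumptions: the `Release` record is a hypothetical stand-in and not the Discogs schema, and the cutoff year is a parameter because I have not settled on a year before which a barcode should count as impossible.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Release(NamedTuple):
    """Hypothetical, simplified record; not the actual Discogs schema."""
    release_id: int
    year: Optional[int]   # None when the release has no usable year
    barcodes: List[str]   # values of the "Barcode" fields, if any


def suspicious_dates(releases: Iterable[Release], cutoff_year: int) -> Iterator[Release]:
    """Yield releases that carry a barcode but are dated before cutoff_year."""
    for release in releases:
        if release.barcodes and release.year is not None and release.year < cutoff_year:
            yield release
```

Calling `suspicious_dates(releases, cutoff_year=1980)` would yield the releases worth a manual look. The value 1980 is an arbitrary example, and a hit only says that either the barcode or the year deserves checking, not which of the two is wrong.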
Barcode distribution in Discogs
I took the data dump covering all data until (not including) May 1 2019 and looked at which releases have a barcode field defined, regardless of whether or not they actually are correct:
[Figure: Distribution of releases with barcodes in the Discogs dataset]
For each million release numbers there are about 200,000 releases with barcode fields, with the first million release numbers having about 275,000. In total there are 2,881,281 releases with the barcode field. This counts purely the field "Barcode" in the "Barcodes and Other Identifiers" section, so it excludes releases where a barcode is present but has ended up in other fields or sections (like the 'Notes'). Together these releases have 3,403,322 barcode fields.
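A count like this can be produced by streaming through the releases dump. The sketch below is a minimal version of that idea, not the script behind the numbers above. It assumes that every release is a `<release id="...">` element and that identifiers are stored as `<identifier type="Barcode" value="..."/>` elements somewhere below it; check these names against the dump you are using. The function name and the bucket size parameter are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def barcode_distribution(source, bucket_size=1_000_000):
    """Count releases with a Barcode field per bucket of release numbers.

    source is a file name or a binary file object with the releases XML.
    Returns (releases per bucket, total number of Barcode fields).
    """
    releases_per_bucket = Counter()
    total_fields = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "release" or elem.get("id") is None:
            continue
        # Assumed layout: <identifier type="Barcode" value="..."/> below the release.
        fields = sum(1 for ident in elem.iter("identifier")
                     if ident.get("type") == "Barcode")
        if fields:
            # Release numbers 1..1,000,000 go into bucket 0, and so on.
            releases_per_bucket[(int(elem.get("id")) - 1) // bucket_size] += 1
            total_fields += fields
        elem.clear()  # drop the children of the release we are done with
    return releases_per_bucket, total_fields
```

Because a release can have more than one "Barcode" field, the total number of fields is tracked separately from the number of releases.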
"No barcode"
There are quite a few releases that have no barcode. Because Discogs makes it really hard to flag non-existent data and there is no single way to do this, I looked into variations of "none" (excluding case variations):
- none : 24753
- [none] : 196
- (none) : 125
- none : 97
- none. : 14
- none : 10
- none (promo) : 6
- "none" : 6
- none present : 5
- #cat: none : 3
- none shown : 2
- none available : 2
- none on cover : 2
- none - pre barcode era : 2
- none barcode : 2
- none : 2
- none (club edition) : 2
- none : 1
- (none as promo) : 1
- none as promo : 1
- cat# none : 1
- none (digital) : 1
- none (erased trace) : 1
- none given : 1
- none : 1
- none because release date : 1
- none2008 : 1
- none 9 : 1
- none on label : 1
- none - test pressing : 1
- nonexistent : 1
- none / promotional only - not for sale : 1
- none [white rectangle] : 1
- none on cd or box : 1
- 'none' : 1
- none on cd case : 1
- none1 : 1
- none listed : 1
- none on disc : 1
- none4009880364925 : 1
- -none- : 1
- no barcode : 2490
- no : 261
- no barcode : 46
- non : 42
- no bar code : 41
- no barcode. : 30
- unknown : 30
- not on label : 27
- {digits not printed} : 14
- no code : 14
- not barcode : 14
- (no barcode) : 9
- no bar code promo only not for sale : 6
- no barcode ! : 6
- [no barcode] : 6
- no barcode on the sleeve : 5
- not : 5
- no : 5
- nobarcode : 5
- no barecode : 5
- not on barcode : 5
- not for sale : 4
- not available : 4
- no barcode available : 4
- no barcode! : 3
- [unknown] : 3
- [no text] : 3
- non barcode : 3
- non specified : 3
- no barcode on cover : 3
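A tally like the one above can be sketched in a few lines. This is an illustration and not the exact query I ran: it folds case, collapses whitespace, and counts values that start with "no" or "unknown" after any leading punctuation. The prefix rule is deliberately crude and is an assumption of this example; it would miss a value like "#cat: none" and could catch unrelated words that happen to start with "no".

```python
import re
from collections import Counter

# Optional leading punctuation, then "no..." or "unknown..." (crude heuristic).
PLACEHOLDER = re.compile(r"[^a-z0-9]*(no|unknown)")


def normalize(value: str) -> str:
    """Fold case and collapse runs of whitespace."""
    return " ".join(value.split()).casefold()


def count_placeholders(values):
    """Count normalized values that look like a 'no barcode' marker."""
    counts = Counter()
    for value in values:
        key = normalize(value)
        if PLACEHOLDER.match(key):
            counts[key] += 1
    return counts
```

Feeding it all "Barcode" values and calling `most_common()` on the result gives a ranked list of placeholder spellings.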
In the next part I will dive further into the data.