Posts

Showing posts from November, 2019

Which Label Code is used the most in Discogs?

The field in the Discogs database that is definitely the most misunderstood is the so-called "Label Code". Many people think that this is a generic "catch all" field for any code printed on a label, but that is COMPLETELY WRONG. The Label Code is a (somewhat) unique identifier for a record label. It was introduced in Europe somewhere in the 1970s and is now in widespread use. This is all explained in the other articles I have written about this field and I recommend you read those first (especially the first article). Many Discogs contributors think that just because there is a code on a label it should be in the Label Code field. In the latest download (all releases up until October 31 2019) there are a bit over 44,000 releases that have a wrong value in this field. That's quite a few. Then there are also releases where a Label Code should be present, but isn't (and of course I don't know how many), plus around 9,000 where someone indicated that

Homoglyphs (update October 2019)

About 1.5 years ago I discovered that some people use the wrong character set in the "Rights Society" field, because characters in various character sets (Latin, Cyrillic, Greek) look alike (the so-called "homoglyphs"). Some people even mixed characters from different character sets. You can read the articles I wrote here and here and I would recommend you read those first as I will be making comparisons with the data presented there. I was wondering if this situation had actually changed, or if things got worse, so I ran my scripts and found 1146 releases where I know that the wrong character set(s) were used. The distribution across the data set is as follows:

[Figure: Distribution of releases using the wrong character set(s) for Rights Society]

Comparing it to the data from 1.5 years ago it seems to have gotten a little bit worse (though not much). As long as there is no search functionality for rights societies on the Discogs website it won't matter muc
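To illustrate the idea of mixed character sets, here is a minimal sketch. It is not the script used for this post. It assumes that looking at the Unicode name of each letter is good enough to tell Latin, Cyrillic and Greek apart, and the function names are made up for this example.

```python
import unicodedata

# Assumption: only these three scripts are relevant for this check.
SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def scripts_used(value: str) -> set:
    """Return the scripts (from SCRIPTS) that the letters in value belong to."""
    found = set()
    for char in value:
        if not char.isalpha():
            continue  # skip digits, spaces and punctuation
        # Unicode names start with the script, e.g. "CYRILLIC CAPITAL LETTER A"
        name = unicodedata.name(char, "")
        for script in SCRIPTS:
            if name.startswith(script):
                found.add(script)
    return found


def is_mixed_script(value: str) -> bool:
    """True when letters from more than one script appear in the same value."""
    return len(scripts_used(value)) > 1


# "GEM" in Latin followed by a Cyrillic capital A (U+0410)
print(is_mixed_script("GEM\u0410"))
```

A value written entirely in the wrong script (for example a Latin name typed with only Cyrillic lookalikes) is not "mixed", so catching those would need an extra comparison against the expected script for that rights society.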

Why merging in Discogs is broken

Sometimes duplicate entries get added to the Discogs database, for various reasons:

- inexperienced users: it takes some time to understand the Discogs workflow. This happened to me as well when I was starting out and it is because the Discogs edit interface throws you into expert mode, instead of trying to guide you through the process (which I have written about before here and here).
- stupid sellers: some sellers still don't understand that Discogs is a catalog where you simply pick a release (unlike for example eBay), so they add releases that are already in the catalog.
- entry errors: sometimes errors are made, which makes it hard to find that an entry has already been added, for example when it isn't clear what the correct label is, or the artist, and so on and then someone adds the same release twice.
- disagreement about when a release is a variation: people on Discogs frequently disagree about when a release is actually a different release, or simply a variation. For

Which SPARS code is most used in Discogs?

I haven't looked at SPARS codes for a while but I thought it would be interesting to see which SPARS code is used the most. If you haven't read my previous articles about SPARS codes I suggest you do that first. What I wondered is: how often is each SPARS code used on a release? I looked at the releases where I have verified that there is a valid SPARS code. There are more releases with SPARS codes in Discogs, but a lot of releases (at least 24,000) still need to be fixed at the time of writing and these haven't been looked at. In total I looked at 113,975 releases from data up to October 31 2019. The breakdown:

- DDD: 52,082 releases
- AAD: 33,821 releases
- ADD: 28,564 releases
- DAD: 480 releases
- AAA: 120 releases
- DDA: 87 releases
- ADA: 15 releases
- DAA: 12 releases

If you add these numbers you will get more than 113,975. This is because some releases have more than one SPARS code as the mix can actually be different per track. What is clear is that the "regul
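The remark that the per-code numbers add up to more than 113,975 can be checked with a few lines of arithmetic. This sketch only uses the counts listed in the breakdown; the variable names are made up for this example.

```python
# Counts per SPARS code, copied from the breakdown in this post.
spars_counts = {
    "DDD": 52_082,
    "AAD": 33_821,
    "ADD": 28_564,
    "DAD": 480,
    "AAA": 120,
    "DDA": 87,
    "ADA": 15,
    "DAA": 12,
}
releases_examined = 113_975

# Every release is counted once for each distinct SPARS code it carries.
total_codes = sum(spars_counts.values())
surplus = total_codes - releases_examined

print(total_codes)  # 115181
print(surplus)      # 1206
```

So the codes were counted 115,181 times in total, which is 1,206 more than the number of releases examined. That surplus comes from releases that carry more than one SPARS code.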

SPARS codes on releases dated before SPARS codes were introduced (October 2019 update)

Back in 2018 I looked at how many releases had a SPARS code, and a release date before SPARS codes were actually introduced. You can read about that here. Back then I found 30 releases and I promised I would revisit this topic sometime later. I just looked at the data up to October 31 and I found 61 releases. That's actually an increase. Bah.

[Figure: Releases in Discogs with a SPARS code and dated before SPARS codes were introduced]

It seems there is still some work to do!

Hey Discogs, are you listening?

I started this blog in September 2017 with the intention of finding interesting patterns in the Discogs data. It fairly quickly turned into a blog pointing out what doesn't work in Discogs, which I must admit is a bit negative and not what I originally wanted. I am honestly trying to keep it positive, but this simply isn't always possible. The reason: in the last two years very little has been done to get rid of data entry errors that can be prevented. The only thing that I have seen is the automatic correction of capitalization errors (which does a good job, although it is a bit difficult to trigger). Due to work obligations I couldn't look at the data for a few months and had hoped that things would have been different when I returned to processing the data, but this wasn't the case. I have had conversations with people at Discogs and those conversations were actually very nice, but it doesn't seem to lead to improvements. Why this is I really don't know

How big are changes in Discogs (1)?

One thing I keep asking myself is how big changes in Discogs releases typically are: are the changes just small tweaks, or are they really big changes? I decided to compare the size of the changes of releases in two months using a very simple method: looking at the size of the differences. I decided to not go for a very advanced approach that would require me to parse the XML and compare each field, but instead go for a much simpler approach and look at the size differences of the raw XML data. I took two data dumps from the Discogs data website (namely discogs_20191003_releases.xml.gz and discogs_20191101_releases.xml.gz). I split these two dumps into individual XML files (one per release) and renamed them to have the right release number in the file name. I then compared the files with a regular cryptographic hash (SHA256) to see which files were different. These files (439,651 in total) were compared using a locality sensitive hash, specifically TLSH as I am quite familiar wi
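The first step described here (finding out which per-release files differ between two dumps with SHA256) could look roughly like the sketch below. It is not the code used for this post. It assumes both dumps have already been split into directories containing one `.xml` file per release with the same file name in both, and the function names are made up. The TLSH comparison is not shown.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA256 hex digest of the contents of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(old_dir: Path, new_dir: Path) -> list:
    """Names of files present in both directories whose contents differ."""
    changed = []
    for old_file in sorted(old_dir.glob("*.xml")):
        new_file = new_dir / old_file.name
        # Files that exist only in the old directory were removed, not changed.
        if new_file.is_file() and sha256_of(old_file) != sha256_of(new_file):
            changed.append(old_file.name)
    return changed
```

With more than eleven million files per dump, reading each file fully into memory is fine because individual releases are small, but walking the directories will take a while.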

Why voting in Discogs is broken

I just need to rant again about something in Discogs that I really do not like and that is the voting system. In Discogs you can vote on the quality of the data of a release. For example, if someone completely screwed up an existing release, you can vote that it is "Entirely Incorrect" and that will then revert the commit and restore it to the previous state. Or, you could vote that it is "Complete and Correct", signalling that all the data that can be on there is on there and it is correct as well. There are also:

- Correct - not all information might be there, but at least it is correct
- Needs Minor Changes - information is there, just needs a quick brush up
- Needs Major Changes - the release needs a lot of work

This is to have some sort of "self cleansing" mechanism, to alert people to mistakes, and to weed out bad contributors, as getting many bad votes will put you into "Discogs school" (the "Contributor Improvement Program", o

What happened in Discogs in October 2019?

Phew, I finally managed to process all the data that I wanted to and I should be up to date, at least for the time being. So, let's look at what happened in Discogs in October 2019! For those of you who already know what to expect: it is the same as in previous months. For those of you who don't know, please read the post about September 2019 and work your way back.

Release statistics

I looked at the dump file with data covering October 3 - October 31 2019 (the previous data dump was for a little bit more than September). This dump file has 11,751,696 releases, whereas the previous one had 11,653,094 releases. That means 98,602 releases more, which is less than normal, but also keep in mind that this dump file covers fewer days. Also:

- 11,211,165 releases stayed the same
- 439,651 releases were changed
- 100,880 releases were added
- 2,278 releases were removed from the database
- 223 releases had status Draft, Deleted or Rejected
- 0 releases that were not Accepted were in
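The release numbers for October can be cross-checked against each other. The sketch below assumes that every release in the new dump is either unchanged, changed or added, and that every release in the previous dump is either unchanged, changed or removed. That reading is mine, and the class and field names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class DumpComparison:
    previous_total: int
    current_total: int
    unchanged: int
    changed: int
    added: int
    removed: int

    def is_consistent(self) -> bool:
        """Check that the categories add up to both dump totals."""
        in_both = self.unchanged + self.changed
        return (
            in_both + self.added == self.current_total
            and in_both + self.removed == self.previous_total
        )

    def net_growth(self) -> int:
        """Added minus removed releases."""
        return self.added - self.removed


october = DumpComparison(
    previous_total=11_653_094,
    current_total=11_751_696,
    unchanged=11_211_165,
    changed=439_651,
    added=100_880,
    removed=2_278,
)
print(october.is_consistent())  # True
print(october.net_growth())     # 98602
```

The net growth of 98,602 matches the difference between the two dump totals, so the October numbers are consistent under these assumptions.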

What happened in Discogs in September 2019?

Normally I don't think too much about statistics of a particular month, but September is different, because of Discogs' "September Pledge Initiative" or "S.P.IN". You can read Discogs' own take on it on the Discogs blog, but let's see what the data says. If you don't know what everything here means, just look at the previous blog post and work your way back.

Release statistics

I looked at the dump file with data covering September 1 - October 3 2019 (according to the name of the data dump). This dump file has 11,653,094 releases, whereas the previous one had 11,530,540 releases. That means 122,554 releases more, which is actually not more than in a normal month (at least, that's what I think). Also:

- 11,021,531 releases stayed the same
- 506,609 releases were changed
- 124,954 releases were added
- 2,400 releases were removed from the database
- 477 releases had status Draft, Deleted or Rejected
- 11 releases that were not Accepted were in

What happened in Discogs in August 2019?

Time to dig into some statistics about how Discogs was changed in August 2019. Did the excessive heat reduce the number of contributions, or was it business as usual? If you haven't read it yet I would recommend reading the blog post about contributions in July 2019 first.

Release statistics

I looked at the dump file with data covering August 1 - 31 2019. This dump file has 11,530,540 releases, whereas the previous one had 11,430,638 releases. That means 99,902 releases more, which is only a little bit less than normal. Also:

- 10,897,333 releases stayed the same
- 529,657 releases were changed
- 103,550 releases were added
- 3,648 releases were removed from the database
- 403 releases had status Draft, Deleted or Rejected
- 11 releases that were not Accepted were in both dumps
- 3 releases were moved from Draft to Accepted

What is interesting: quite a few more releases were removed from the database, so my guess is that there was some cleanup of sorts.

Changes in the data (t

What happened in Discogs in July 2019?

It is already November, so it is time to write an update about July. I couldn't do it earlier because I got completely swamped in work, but I will be making a few extra blog posts to compensate for the lack of blog posts. Anyway, back to the monthly statistics! If you don't know how this works, you should read the previous entries, for example about what happened in June 2019 and then work your way through history. You will notice certain patterns.

Release statistics

I looked at the dump file with data covering July 1 - 31 2019. This dump file has 11,430,638 releases, whereas the previous one had 11,334,008 releases. That means 96,630 releases more, which is a bit less than normal, but that can probably be explained by summer in the northern hemisphere. Also:

- 10,853,019 releases stayed the same
- 478,710 releases were changed
- 98,909 releases were added
- 2,279 releases were removed from the database
- 344 releases had status Draft, Deleted or Rejected
- 0 11 releases t

"Rights Society" in the "Barcode" field in Discogs (October 2019)

One mistake that I see quite often in the Discogs data is that people use the "Barcode" field for basically everything in the section "Barcode and Other Identifiers" (or "BaOI" in Discogs lingo). This is not very surprising, as "Barcode" is the default value and no checking is done. Almost two years ago I already checked in how many releases the "Barcode" field was used to store the value of a rights society and found around 400 releases, although I honestly cannot remember how robust my checks were at that time (I believe they missed quite a bit of data). That resulted in the following graph:

[Figure: Rights societies in the "Barcode" field in January 2018]

I thought it would be interesting to see what I would find in the current data set. The result: 1118 unique releases, distributed across the data as follows:

[Figure: Rights societies in the "Barcode" field in October 2019]

So that is significantly more. A few things
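A check for rights societies in the "Barcode" field could be sketched as follows. This is not the check used for this post. The list of names is a small illustrative subset (a real check would need a much longer list), and the normalization rules and function names are assumptions made for this example.

```python
# Illustrative subset of rights society names, not a complete list.
RIGHTS_SOCIETIES = {"GEMA", "BIEM", "SACEM", "STEMRA", "SABAM", "JASRAC", "MCPS"}


def normalize(value: str) -> str:
    """Uppercase and drop everything except letters, digits and slashes."""
    return "".join(ch for ch in value.upper() if ch.isalnum() or ch == "/")


def looks_like_rights_society(barcode_value: str) -> bool:
    """True when the value consists only of known rights society names.

    Combined values such as "BIEM/STEMRA" are split on the slash.
    """
    parts = [part for part in normalize(barcode_value).split("/") if part]
    return bool(parts) and all(part in RIGHTS_SOCIETIES for part in parts)


print(looks_like_rights_society("BIEM/STEMRA"))      # True
print(looks_like_rights_society("5 012345 678900"))  # False
```

Because the comparison is done on the normalized value, spellings like "S.A.C.E.M." or "gema" are recognized as well.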

Duplicate ISRC codes in Discogs (October 2019)

Since the ISRC codes were introduced people have gradually been adding them, or fixing them, which is a good thing. It also means that I can more easily detect errors. Back in April 2019 I already wrote about detecting duplicate ISRC codes in a release so you probably want to read that and some older articles first. So, what's the situation like half a year later? I took the Discogs dump covering data up until October 31 2019 and found 188 unique releases, and created the following chart, which shows where in the data set these releases can be found (but I cannot conclude anything useful from it):

[Figure: Releases with duplicate ISRC codes in Discogs in October 2019]

So it is slowly increasing and there are about 60 more than in April. I still have to go through these releases to see if these are data entry errors by Discogs users, if the label made a mistake (which also happens), or (in case of CDs) perhaps a drive with malfunctioning firmware was used, as apparently there are C
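Finding duplicate ISRC codes within a single release boils down to counting. The sketch below is not the script used for this post. It assumes the ISRC values of one release are available as a list of strings and that the same code can be written with or without hyphens and spaces. The function names are made up for this example.

```python
from collections import Counter


def normalize_isrc(value: str) -> str:
    """Uppercase and strip hyphens, spaces and other separators."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


def duplicate_isrcs(isrc_values: list) -> list:
    """Return the normalized ISRC codes that occur more than once."""
    counts = Counter(normalize_isrc(value) for value in isrc_values)
    return sorted(code for code, count in counts.items() if code and count > 1)


# Made-up ISRC values for a single release
release_isrcs = ["NL-A12-19-00001", "NLA121900001", "NLA121900002"]
print(duplicate_isrcs(release_isrcs))  # ['NLA121900001']
```

A hit only says that a code appears twice. Whether that is a data entry error or a mistake on the release itself still has to be checked by hand.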