Posts

Showing posts from 2019

Storing data "as written on release" or not?

Having looked through many releases in Discogs I often see that data is not entered as it is on the release. One example: rights societies are typeset in many different ways, where sometimes you will have an all upper case name (for example: "BIEM"), sometimes there is spacing ("B I E M"), sometimes there are dots in the name ("B.I.E.M."), and so on. In the examples mentioned above the values are all basically referencing the same rights society so it is worth asking if the values entered on each release page should be "as written on release" or if they should be stored as the value of whatever it is pointing to? Said differently: should the release page focus on "syntax" (how it is written on the release) or "semantics" (what it means)? There are things to be said for both: there are releases where the person designing the sleeve or label made an error, for example: releases with "BIEN" instead of "BIEM"

Are ISRC codes being fixed? Yes they are!

One field in the "barcode and other identifiers" section in Discogs that is fairly new is the ISRC field. I have written about ISRC before, so you might want to read those posts first. Before there was a dedicated ISRC field the ISRC codes would be put into fields like "Other" or "Barcode" so I wondered: how well are these being updated and replaced by ISRC fields? I checked the dumps from the last 8 months:
May 2019 dump: 22,391 releases
June 2019 dump: 21,096 releases
July 2019 dump: 19,214 releases
August 2019 dump: 15,161 releases
September 2019 dump: 13,390 releases
October 2019 dump: 12,090 releases
November 2019 dump: 11,360 releases
December 2019 dump: 10,578 releases
So in 8 months' time the number of problematic releases has more than halved, while the number of releases in the database with ISRC codes has increased. With a bit of luck all of these will have been fixed in the next half year. Impressive.
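A minimal sketch of how ISRC-shaped values sitting in the wrong identifier field could be spotted. It assumes the identifiers of a release have already been extracted into (field type, value) pairs; the function names and the sample values are hypothetical and this is not necessarily how the counts above were produced. The only fixed fact used is the shape of an ISRC: two letters, three alphanumerics, a two-digit year and a five-digit designation code.

```python
import re

# An ISRC is 12 characters: 2 letters (country), 3 alphanumerics (registrant),
# 2 digits (year of reference) and 5 digits (designation code).
ISRC_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{3}[0-9]{7}$")


def looks_like_isrc(value):
    """Return True if value is shaped like an ISRC once separators are removed."""
    compact = re.sub(r"[\s.\-]", "", value.upper())
    return bool(ISRC_RE.match(compact))


def misplaced_isrcs(identifiers):
    """Return (field_type, value) pairs where an ISRC-shaped value is not in an ISRC field."""
    return [
        (field_type, value)
        for field_type, value in identifiers
        if field_type != "ISRC" and looks_like_isrc(value)
    ]
```

A release would count as "problematic" in this sketch if `misplaced_isrcs` returns a non-empty list for it. The pattern is deliberately strict, so values with extra text around the code would need additional cleanup.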

Translations in Discogs

In some countries (especially in Spain) it was/is customary to translate titles of songs and print them on the sleeve (sometimes without the original title). The cool thing about this is that these releases have unique sleeves and labels. One example is Queen's "Hot Space", which was translated to "Espacio Caliente". Frequently songs are also covered in a different language than the original and given a new title. From a collector's point of view this could be interesting: there are actually people who collect covers of specific songs. The challenge is that if only the translated title is printed on the release (or entered in Discogs) it is harder to find out which track it actually is. For regular releases (where the title simply has been translated) it is usually not that difficult. But for cover versions it can be. How to indicate translations has been a matter of debate inside Discogs. At some point the solution was to enter both the original title and

What happened in Discogs in November 2019?

A few days ago another data dump was released covering all releases up until (and including) November 30 2019, so I can look at some statistics again. If you don't know how this works, I would recommend reading the previous articles about it, for example the article about October 2019.
Release statistics
I looked at the dump file with new data entered and changes made from November 1 - November 30 2019. This dump file has 11,854,877 releases, whereas the previous one had 11,751,696. That means 103,181 releases more. Also:
11,249,513 releases stayed the same
500,202 releases were changed
105,162 releases were added
1,981 releases were removed from the database
588 releases had status Draft, Deleted or Rejected
0 releases that were not Accepted were in both dumps
3 releases were moved from Draft to Accepted
Changes in the data (that is: changes to already existing releases) are distributed as follows:
(Figure: Existing releases changed in November 2019)
Smells I fou

Which Label Code is used the most in Discogs?

The field in the Discogs database that is definitely the most misunderstood is the so-called "Label Code". Many people think that this is a generic "catch all" field for any code printed on a label, but that is COMPLETELY WRONG. The Label Code is a (somewhat) unique identifier for a record label. It was introduced in Europe sometime in the 1970s and is now in widespread use. This is all explained in the other articles I have written about this field and I recommend you read those first (especially the first article). Many Discogs contributors think that just because there is a code on a label it should be in the Label Code field. In the latest download (all releases up until October 31 2019) there are a bit over 44,000 releases that have a wrong value in this field. That's quite a few. Then there are also releases where a Label Code should be present, but isn't (and of course I don't know how many), plus around 9,000 where someone indicated that

Homoglyphs (update October 2019)

About 1.5 years ago I discovered that some people use the wrong character set in the "Rights Society" field, because characters in various character sets (Latin, Cyrillic, Greek) look alike (the so-called "homoglyphs"). Some people even mixed characters from different character sets. You can read the articles I wrote here and here and I would recommend you read those first as I will be making comparisons with the data presented there. I was wondering if this situation had actually changed, or if things got worse, so I ran my scripts and found 1146 releases where I know that the wrong character set(s) were used. The distribution across the data set is as follows:
(Figure: Distribution of releases using the wrong character set(s) for Rights Society)
Comparing it to the data from 1.5 years ago it seems to have gotten a little bit worse (though not much). As long as there is no search functionality for rights societies on the Discogs website it won't matter muc
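A minimal sketch of how mixed character sets in a value could be detected, using only Python's standard `unicodedata` module. The function names are hypothetical, and treating any mix of Latin, Cyrillic and Greek letters as suspicious is an assumption made for illustration; the actual scripts behind the numbers above may use different rules (for example a list of known rights society names).

```python
import unicodedata

# Scripts whose letters are easily confused with each other.
SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def scripts_used(text):
    """Return the set of scripts (from SCRIPTS) that the letters in text belong to."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names start with the script, e.g. "CYRILLIC CAPITAL LETTER VE".
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                found.add(script)
    return found


def has_mixed_scripts(text):
    """Return True if letters from more than one script appear in text."""
    return len(scripts_used(text)) > 1
```

A value written entirely in the wrong script (say, an all-Cyrillic lookalike of a Latin name) is not caught by `has_mixed_scripts`; comparing `scripts_used` against the script expected for a known rights society would be needed for that.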

Why merging in Discogs is broken

Sometimes duplicate entries get added to the Discogs database, for various reasons:
inexperienced users: it takes some time to understand the Discogs workflow. This happened to me as well when I was starting out and it is because the Discogs edit interface throws you into expert mode, instead of trying to guide you through the process (which I have written about before here and here).
stupid sellers: some sellers still don't understand that Discogs is a catalog where you simply pick a release (unlike for example eBay), so they add releases that are already in the catalog.
entry errors: sometimes errors are made, which makes it hard to find that an entry has already been added, for example when it isn't clear what the correct label is, or the artist, and so on, and then someone adds the same release twice.
disagreement about when a release is a variation: people on Discogs frequently disagree about when a release is actually a different release, or simply a variation. For

Which SPARS code is most used in Discogs?

I haven't looked at SPARS codes for a while but I thought it would be interesting to see which SPARS code is used the most. If you haven't read my previous articles about SPARS codes I suggest you do that first. What I wondered is: how often is each SPARS code used on a release? I looked at the releases where I have verified that there is a valid SPARS code. There are more releases with SPARS codes in Discogs, but a lot of releases (at least 24,000) still need to be fixed at the time of writing and these haven't been looked at. In total I looked at 113,975 releases from data up to October 31 2019. The breakdown:
DDD: 52,082 releases
AAD: 33,821 releases
ADD: 28,564 releases
DAD: 480 releases
AAA: 120 releases
DDA: 87 releases
ADA: 15 releases
DAA: 12 releases
If you add these numbers you will get more than 113,975. This is because some releases have more than one SPARS code as the mix can actually be different per track. What is clear is that the "regul
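A minimal sketch of such a per-code count, to illustrate why the per-code numbers can add up to more than the number of releases. It assumes the verified SPARS codes are available as a mapping from release id to a list of codes; that data layout and the function name are hypothetical.

```python
import re
from collections import Counter

# A SPARS code is three letters, each either A (analogue) or D (digital).
SPARS_RE = re.compile(r"^[AD]{3}$")


def spars_breakdown(codes_by_release):
    """Count, per SPARS code, the number of releases that carry it at least once."""
    counts = Counter()
    for codes in codes_by_release.values():
        # A set, so that a code repeated on one release is only counted once,
        # while a release with two different codes counts towards both.
        counts.update({code for code in codes if SPARS_RE.match(code)})
    return counts
```

With three releases carrying `["DDD"]`, `["AAD", "ADD"]` and `["DDD", "DDD"]` the counts sum to 4, one more than the number of releases, which is the same effect as in the breakdown above.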

SPARS codes on releases dated before SPARS codes were introduced (October 2019 update)

Back in 2018 I looked at how many releases had a SPARS code and a release date before SPARS codes were actually introduced. You can read about that here. Back then I found 30 releases and I promised I would revisit this topic sometime later. I just looked at the data up to October 31 and I found 61 releases. That's actually an increase. Bah.
(Figure: Releases in Discogs with a SPARS code and dated before SPARS codes were introduced)
It seems there is still some work to do!

Hey Discogs, are you listening?

I started this blog in September 2017 with the intention of finding interesting patterns in the Discogs data. It fairly quickly turned into a blog pointing out what doesn't work in Discogs, which I must admit is a bit negative and not what I originally wanted. I am honestly trying to keep it positive, but this simply isn't always possible. The reason: in the last two years very little has been done to get rid of data entry errors that can be prevented. The only thing that I have seen is the automatic correction of capitalization errors (which does a good job, although it is a bit difficult to trigger). Due to work obligations I couldn't look at the data for a few months and had hoped that things would have been different when I returned to processing the data, but this wasn't the case. I have had conversations with people at Discogs and those conversations were actually very nice, but they don't seem to lead to improvements. Why this is I really don't know

How big are changes in Discogs (1)?

One thing I keep asking myself is how big changes in Discogs releases typically are: are the changes just small tweaks, or are they really big changes? I decided to compare the size of the changes of releases in two months using a very simple method: looking at the size of the differences. I decided to not go for a very advanced approach that would require me to parse the XML and compare each field, but instead go for a much simpler approach and look at the size differences of the raw XML data. I took two data dumps from the Discogs data website (namely discogs_20191003_releases.xml.gz and discogs_20191101_releases.xml.gz). I split these two dumps into individual XML files (one per release) and renamed them to have the right release number in the file name. I then compared the files with a regular cryptographic hash (SHA256) to see which files were different. These files (439,651 in total) were compared using a locality sensitive hash, specifically TLSH as I am quite familiar wi
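A minimal sketch of the SHA256 comparison step, assuming the per-release XML of each dump has been loaded into a dictionary keyed by release number. The function names and the in-memory layout are hypothetical (the real data is far too large for this without streaming from disk), and the TLSH step is not shown here.

```python
import hashlib


def digest(data):
    """Return the SHA256 hex digest of the raw XML bytes of one release."""
    return hashlib.sha256(data).hexdigest()


def compare_dumps(old, new):
    """Classify release ids given two mappings of release id -> raw XML bytes."""
    old_ids, new_ids = set(old), set(new)
    common = old_ids & new_ids
    # A release counts as changed if its raw XML hashes differently in the two dumps.
    changed = {rid for rid in common if digest(old[rid]) != digest(new[rid])}
    return {
        "same": common - changed,
        "changed": changed,
        "added": new_ids - old_ids,
        "removed": old_ids - new_ids,
    }
```

Only the releases in the `changed` set would then need to go through the more expensive similarity comparison. A cryptographic hash only says that two files differ, not by how much, which is why a second, locality sensitive hash is needed for the size of the change.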

Why voting in Discogs is broken

I just need to rant again about something in Discogs that I really do not like and that is the voting system. In Discogs you can vote on the quality of the data of a release. For example, if someone completely screwed up an existing release, you can vote that it is "Entirely Incorrect" and that will then revert the commit and restore it to the previous state. Or, you could vote that it is "Complete and Correct", signalling that all the data that can be on there is on there and it is correct as well. There are also:
Correct - not all information might be there, but at least it is correct
Needs Minor Changes - information is there, just needs a quick brush up
Needs Major Changes - the release needs a lot of work
This is to have some sort of "self cleansing" mechanism, to alert people to mistakes, and to weed out bad contributors, as getting many bad votes will put you into "Discogs school" (the "Contributor Improvement Program", o

What happened in Discogs in October 2019?

Phew, I finally managed to process all the data that I wanted to and I should be up to date, at least for the time being. So, let's look at what happened in Discogs in October 2019! For those of you who already know what to expect: it is the same as in previous months. For those of you who don't know, please read the post about September 2019 and work your way back.
Release statistics
I looked at the dump file with data covering October 3 - October 31 2019 (the previous datadump was for a little bit more than September). This dump file has 11,751,696 releases, whereas the previous one had 11,653,094 releases. That means 98,602 releases more which is less than normal, but also keep in mind that this dump file covers fewer days. Also:
11,211,165 releases stayed the same
439,651 releases were changed
100,880 releases were added
2,278 releases were removed from the database
223 releases had status Draft, Deleted or Rejected
0 releases that were not Accepted were in

What happened in Discogs in September 2019?

Normally I don't think too much about statistics of a particular month, but September is different, because of Discogs' "September Pledge Initiative" or "S.P.IN". You can read Discogs' own take on it on the Discogs blog, but let's see what the data says. If you don't know what everything here means, just look at the previous blog post and work your way back.
Release statistics
I looked at the dump file with data covering September 1 - October 3 2019 (according to the name of the datadump). This dump file has 11,653,094 releases, whereas the previous one had 11,530,540 releases. That means 122,554 releases more which is actually not more than in a normal month (at least, that's what I think). Also:
11,021,531 releases stayed the same
506,609 releases were changed
124,954 releases were added
2,400 releases were removed from the database
477 releases had status Draft, Deleted or Rejected
11 releases that were not Accepted were in

What happened in Discogs in August 2019?

Time to dig into some statistics about how Discogs was changed in August 2019. Did the excessive heat reduce the number of contributions, or was it business as usual? If you haven't read it yet I would recommend reading the blog post about contributions in July 2019 first.
Release statistics
I looked at the dump file with data covering August 1 - 31 2019. This dump file has 11,530,540 releases, whereas the previous one had 11,430,638 releases. That means 99,902 releases more which is only a little bit less than normal. Also:
10,897,333 releases stayed the same
529,657 releases were changed
103,550 releases were added
3,648 releases were removed from the database
403 releases had status Draft, Deleted or Rejected
11 releases that were not Accepted were in both dumps
3 releases were moved from Draft to Accepted
What is interesting: quite a few more releases were removed from the database, so my guess is that there was some cleanup of sorts. Changes in the data (t

What happened in Discogs in July 2019?

It is already November, so it is time to write an update about July. I couldn't do it earlier because I got completely swamped in work, but I will be making a few extra blog posts to compensate for the lack of blog posts. Anyway, back to the monthly statistics! If you don't know how this works, you should read the previous entries, for example about what happened in June 2019 and then work your way through history. You will notice certain patterns.
Release statistics
I looked at the dump file with data covering July 1 - 31 2019. This dump file has 11,430,638 releases, whereas the previous one had 11,334,008 releases. That means 96,630 releases more which is a bit less than normal, but that can probably be explained by summer in the northern hemisphere. Also:
10,853,019 releases stayed the same
478,710 releases were changed
98,909 releases were added
2,279 releases were removed from the database
344 releases had status Draft, Deleted or Rejected
11 releases t

"Rights Society" in the "Barcode" field in Discogs (October 2019)

One mistake that I see quite often in the Discogs data is that people use the "Barcode" field for basically everything in the section "Barcode and Other Identifiers" (or "BaOI" in Discogs lingo). This is not very surprising, as "Barcode" is the default value and no checking is done. Almost two years ago I already checked in how many releases the "Barcode" field was used to store the value of a rights society and found around 400 releases, which resulted in the following graph (Figure: Rights societies in the "Barcode" field in January 2018), although I honestly cannot remember how robust my checks were at that time (I believe they missed quite some data). I thought it would be interesting to see what I would find in the current data set. The result: 1118 unique releases, distributed across the data as follows:
(Figure: Rights societies in the "Barcode" field in October 2019)
So that is significantly more. A few things

Duplicate ISRC codes in Discogs (October 2019)

Since the ISRC codes were introduced people have gradually been adding them, or fixing them, which is a good thing. It also means that I can more easily detect errors. Back in April 2019 I already wrote about detecting duplicate ISRC codes in a release so you probably want to read that and some older articles first. So, what's the situation like half a year later? I took the Discogs dump covering data up until October 31 2019 and found 188 unique releases, and created the following chart, which shows where in the data set these releases can be found (but I cannot conclude anything useful from it):
(Figure: Releases with duplicate ISRC codes in Discogs in October 2019)
So it is slowly increasing and there are about 60 more than in April. I still have to go through these releases to see if these are data entry errors by Discogs users, if the label made a mistake (which also happens), or (in case of CDs) perhaps a drive with malfunctioning firmware was used, as apparently there are C

How Discogs' email notification could be improved (part 2)

Whenever a change is made in Discogs a mail is sent. In my opinion the system behind these e-mails is broken, as I have already said before. One of the biggest issues that I have with the current e-mails being sent by Discogs is the subject line of the e-mail, saying: "Recent changes affecting releases in your collection" or "Recent changes affecting your contributions" but nothing else. I first have to click on the message to see the full content. It might be that I find some of the changes more interesting than the others. For example, if images are changed, then it is likelier that I will want to take a look because sometimes people change images, with ugly consequences. So I would like to have a little bit more information in the e-mail subject, for example: "[IMAGES] Recent changes affecting releases in your collection" would already be an improvement over the existing mails being sent as it would allow me to prioritize which mails to check first (at the expense of some e

What happened in Discogs in June 2019?

A new datadump was released last night, so time to look at what happened in Discogs in June 2019. If you don't know how this works, you should read the previous entries, for example about last month. Having done this for nearly two years I am pretty sure that it will be nearly identical to the previous few months.
Release statistics
I looked at the dump file with data covering June 1 - 30 2019. This dump file has 11,334,008 releases, whereas the previous one had 11,231,882 releases. That means 102,126 releases more. Also:
10,723,325 releases stayed the same
506,571 releases were changed
104,112 releases were added
1,986 releases were removed from the database
167 releases had status Draft, Deleted or Rejected
11 releases that were not Accepted were in both dumps
1 release was moved from Draft to Accepted
Changes in the data (that is: changes to already existing releases) are distributed as follows:
(Figure: Existing releases changed in June 2019)
Although it still h

Hijacked releases, or: "Why can't people read or check pictures?"

I am going through a few hundred 7" to see if they are already on Discogs (which in itself is already a big task) and I am seeing quite a few that are not on there (so setting them apart to make scans, which is even more work). What I see way too often is so called "release hijacking" where an existing release is changed into a different release. This seriously pisses me off and I have written about it before . Some examples: someone posts pictures saying "alternate labels" (which according to the guidelines should be turned into a different release) or "better pictures" which turn out to be for a different release altogether (different texts, labels, etc.). Or, they don't read the notes that are for a release that say something like "This is the release with X, for the release with Y, see Z" and then add pictures for the release with Y. Sigh. What these people do not seem to realise is this vastly reduces the value of the database f

Finding out of date formatting codes in Discogs

Like many other websites Discogs allows you to use certain formatting codes that are then automatically expanded. For example, to link to a release you can use [r000000] (where of course "000000" should be a real release number) which is then expanded to a link to the release. There are also other formatting codes. The benefit is that if necessary Discogs can rearrange their infrastructure and then the links will (or at least should) still be working. Unfortunately they don't have it implemented correctly everywhere (yet), as on the iOS app it doesn't seem to be working (bug report has been sent). These codes are not automatically updated when releases are merged or removed, and then the links basically become useless. So I was wondering: how many releases are there with formatting codes pointing to releases that no longer exist, that is: how many dangling pointers are still there? I grabbed the latest data dump and first looked at all the existing releases that were not drafts
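A minimal sketch of how such dangling pointers could be found, assuming the free text of each release (for example the notes) and the set of release numbers that still exist have already been extracted from the dump. Only the [r000000] form mentioned above is matched; the function name and the data layout are hypothetical.

```python
import re

# Matches release formatting codes such as [r123456] and captures the number.
RELEASE_REF_RE = re.compile(r"\[r(\d+)\]")


def dangling_references(text_by_release, existing_ids):
    """Map release id -> referenced release numbers that are not in existing_ids."""
    dangling = {}
    for release_id, text in text_by_release.items():
        missing = [
            int(number)
            for number in RELEASE_REF_RE.findall(text)
            if int(number) not in existing_ids
        ]
        if missing:
            dangling[release_id] = missing
    return dangling
```

Passing the ids of accepted releases only as `existing_ids` means that references to drafts, deleted or merged releases all show up as dangling, which may or may not be the desired definition.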

Home recordings from the 1940s

I was going through a box of old 78s that we got quite a few years back (15 years or longer) to put on Discogs (and found quite a few which aren't on Discogs and need to be added). The records were from someone from Leeuwarden (up north in the Netherlands) according to the stickers the record shop put on it. Most of these were jazz, foxtrots, waltzes, some classical and Dutch, but there were three records that I found very interesting: they were self-produced and two of them had a date of October 8 1948. So first of all, I didn't even know people were recording things themselves back then, so that was a bit of a surprise. As said there were three records: one 7" and two 10" records. I could find out what was on the 7", even though I didn't play it: organ music from the Dutch Reformed church in Dronrijp (which is very close to Leeuwarden) and which apparently has the oldest church organ in the province it's in. The record is a so called Simplex recor

Making a living with Discogs: some thoughts

If you search the Discogs forums you will sometimes see posts from people who dream of giving up their regular job and making their money with Discogs and wonder if that is a smart idea. I was asking myself the same question, not because I want to become a full time record dealer, but because I wondered if it actually is possible. After researching it a bit my answer is: most likely it is not possible, unless you are willing to put in a lot of time and effort, accept a cut in income, add uncertainty and perhaps move to another country to cut costs. I live in the Netherlands, which is definitely one of the more expensive countries in Europe to live in: incomes are higher than in most parts of Europe, but taxes are also higher, as are prices for many things, such as housing, restaurants, postage, flights (dynamic pricing, sigh), and so on. So that is already a bit of a challenge. According to the Dutch government the minimum wage (from July 1 2019 onwards) is € 1635.60 per month (bef

Maintaining a collection with Discogs: what could improve

Some people, including me, use Discogs to keep an overview of which releases they have. It is not perfect, because people can, and do, change releases so they match their own copy (even when it should have been a separate release), but it is better than nothing. There are a few things that annoy me in the way that Discogs organizes collections. One of them is the way that releases can be organized into folders. I understand the rationale behind it: you can organize releases into folders and then look at these folders separately instead of at a big list which would work for a few items, but not when having thousands or tens of thousands of items. Except: for me it is too restrictive. Sometimes I would like to have releases in more than one folder, for example I might want to have a release in a folder "death metal", a folder "dutch", a folder "coloured vinyl" and a folder "football" at the same time (although I am not aware of any football

Using data to find out when EMI moved from Barcelona to Madrid

I am going to deviate from my usual path a bit, mostly to show the power of having a lot of (correct) data. At some point in the 1980s EMI moved its Spanish office from Barcelona to Madrid. I wanted to see if I could pinpoint when this happened. According to the EMI-Odeon label page on Discogs there were a few addresses in/around Barcelona and one in Madrid. What I noticed when looking at labels of EMI releases is that at some point a few things in the label designs and on the sleeves changed:
the region for the depósito legal
the name of the company
the address of the company
An example of a release with the old address is Duran Duran's "Hungry Like The Wolf" from 1982. An example of a release with the new address is Duran Duran's "Ordinary World" from 1993. So the change of address must have happened sometime between those two releases, but when exactly? I took the data dump released in June 2019 as a basis, extracted all the Spanish releases

What happened in Discogs in May 2019?

A new data dump has been released by Discogs, so time to look at some statistics again, although I am certain that it is very similar to previous months. In fact, I can already tell that it is almost exactly the same: the contributions to and changes in Discogs are very very consistent.
Release statistics
I looked at the dump file with data covering May 1 - 31 2019. This dump file has 11,231,882 releases, whereas the previous one had 11,123,192 releases. That means 108,690 releases more. Also:
10,619,089 releases stayed the same
500,776 releases were changed
112,017 releases were added
3,327 releases were removed from the database
180 releases had status Draft, Deleted or Rejected
11 releases that were not Accepted were in both dumps
14 releases were moved from Draft to Accepted
The distribution of changes in the data (that is: changes to already existing releases) is, again, almost the same graph as in previous months. Smells

Adding missing distribution companies to releases

One thing that I have hardly looked into is the section "Label, Company, Catalog Number, Etc." or "LCCN" in Discogs lingo (although it is only called that in the data entry form, on regular pages it says "Companies, etc."). In this section information such as the label, record company, distributor and much more should be entered. But of course, this often doesn't happen, partially also because people don't know what to enter. I believe that with a little bit of data mining it should be possible to find at least some of the data. I already showed that it works for depósito legal values so why shouldn't it for others? Let's look at an example from, again, Spain, namely this release on the Pye label. On the rear sleeve the following can be seen: Looking at the label page for Discos Belter: "Discos Belter, S.A. also offered distribution services and was a licensee for Motown, Pye Records, Salsoul Records, Prelude Records,

A case for guided data entry (part 2)

I have been thinking a bit about how to increase data quality when entering data. There is already a lot that could be done by using wizards and asking users the right questions . But I think that it can be made even easier when using graphical hints and explicitly pointing out to users what information should be entered in the database. In the previous article I mentioned using a wizard and guiding users through the process of entering data. When the right questions have been answered (such as country and label) the user could be asked what the label of the release looks like . For example they could be given the following choice (left: typical label EMI used in Spain in the 1980s, right: typical label EMI used in Spain in the 1970s): Examples of labels EMI used in the 1980s (left) and 1970s (right) Based on this they could then be guided through the process of picking the right data. Also, using which label was picked already means that some checks can be applied. For exa

A case for guided data entry (part 1)

I just helped someone add a few releases to Discogs and it was, again, a quite frustrating experience, even though I have added releases to Discogs before. Adding releases to Discogs is actually a lot of work. These are the steps I typically take for a 7" single:
see if the release is already there on Discogs
see if the release I have differs from any of the listed releases
copy an existing release to draft, or start from scratch
fill in all the details that I am sure about
make scans, crop the scans, scale them
add scans
Adding a release properly, with all the details, can easily take 15 to 20 minutes, and then I still only have a fairly barebones release. What I typically don't get right due to lack of knowledge are things like:
composers
printing companies
manufacturing companies
pressing plants
country specific peculiarities
etc.
These usually require specialist knowledge about releases from a certain country, label or artist, which I simply don't

ISRC codes in Discogs (part 8)

A bit over a year ago I looked into ISRC codes for checking release dates. The ISRC code actually contains a year component which can be used to check dates, as (except in rare old cases) the release is always after the ISRC code has been assigned, so it cannot be earlier. A few people on Discogs seem to disagree, but meh. For more details you can read an earlier post. So back in February 2018 I got the following graph:
(Figure: Distribution of smells in releases with known ISRC fields in February 2018)
which also included some other smells and not just date mismatches. Back then I concluded that there were 273 releases that had a date mismatch. So I decided to rerun my analysis with the latest data dump (up until and including April 30 2019) now that it seems that there are many more ISRC codes in the database, because people seem to be fixing the ISRC field. As expected the number of releases with a date mismatch is now a lot higher: 1414. They are distributed across the database
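A minimal sketch of such a date check, assuming the release year and the ISRC codes of a release are already available. The year component of an ISRC is only two digits, so the `pivot` parameter used to decide between 19xx and 20xx is an assumption made for illustration, as are the function names; the actual analysis may handle this differently.

```python
def isrc_year(isrc, pivot=40):
    """Return the four-digit year encoded in an ISRC.

    Characters 6 and 7 of the compact 12-character code hold the year.
    Two-digit years at or above `pivot` are read as 19xx, the rest as 20xx
    (the pivot value is an assumption, not part of the ISRC format).
    """
    compact = "".join(ch for ch in isrc if ch.isalnum())
    yy = int(compact[5:7])
    return 1900 + yy if yy >= pivot else 2000 + yy


def has_date_mismatch(release_year, isrcs):
    """Return True if the release is dated before the year of any of its ISRCs."""
    return any(release_year < isrc_year(code) for code in isrcs)
```

For example, a release dated 1990 that carries an ISRC with year component 93 would be flagged, while the same ISRC on a 1995 reissue would not.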

Barcodes (part 4)

Continuing with my research into barcodes in the Discogs database: which barcodes appear most often? If you haven't already I would recommend reading part 1, part 2 and part 3 of this series first. One thing that I was wondering about is which barcodes are found the most. The top 10 values for the barcode field look like this:
1411: 4820011260011
1252: 4 820011 260011
827: 6456489431561
826: 6 456489 431561
460: 4 64043 51662 9
455: 464043516629
447: 4 60980-06754 5
437: 4 619497 411525
429: 4619497411525
427: 460980067545
Because one release could have multiple barcode fields (frequently one for the text representation and one for the scanned version) there is some overlap and the top 10 is actually just 5 different barcodes. I was wondering which release was at number one and to my surprise it was a release by Dissection. To my knowledge the majority of the people on Discogs are not into extreme metal, so I looked a bit further at what other releases ha
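A minimal sketch showing how the top 10 collapses to 5 different barcodes once spaces and dashes are ignored. The values are the ones listed above; the function names are hypothetical, and stripping everything except digits is just one simple way to normalize.

```python
# The top 10 values from the barcode field, as (count, value) pairs.
TOP_10 = [
    (1411, "4820011260011"),
    (1252, "4 820011 260011"),
    (827, "6456489431561"),
    (826, "6 456489 431561"),
    (460, "4 64043 51662 9"),
    (455, "464043516629"),
    (447, "4 60980-06754 5"),
    (437, "4 619497 411525"),
    (429, "4619497411525"),
    (427, "460980067545"),
]


def normalize(barcode):
    """Keep only the digits, so text and scanned representations compare equal."""
    return "".join(ch for ch in barcode if ch.isdigit())


def distinct_barcodes(entries):
    """Return the set of normalized barcodes in a list of (count, value) pairs."""
    return {normalize(value) for _, value in entries}
```

The counts per representation are deliberately not added up here: a single release can carry both representations, so the sum would count such releases twice.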