
Posts

Showing posts from November, 2017

Releases under a Creative Commons license in Discogs (part 1)

While digging through releases in Discogs I spotted that some of them had a reference to a Creative Commons license. As I deal with open source licensing (of software) on a daily basis, it of course immediately drew my attention and I started to wonder how many of the releases in the Discogs database have a reference to Creative Commons. I (again) grabbed the latest data dump from Discogs (which itself is under a CC0 license, so quite fitting), and adapted my scripts to report any references to Creative Commons in the free text fields, the notes, or some of the descriptions in the "Barcode and Other Identifiers" (BaOI) sections, and found 7884 releases. I plotted the results by release number, which you can see below (first bar: release numbers 1 - 999,999, second bar: release numbers 1,000,000 - 1,999,999, and so on). [Figure: Distribution of possibly CC-licensed works in the Discogs database] What strikes me is that for older releases there are

SID codes (part 3)

And again it is time to dive into the world of SID codes. Before reading this post it is advised to read part 1 and part 2. Before I used Discogs I had not heard of SID codes and never realized that they can (somewhat) be used to help date a release, as SID codes were only introduced in 1994. I found out by accident when I was browsing through Discogs and found this release that I have had for many years. I bought it back in 1994 or 1995 and always wondered why it showed up in the shop (new) then, even though it had supposedly already been available for a number of years. This year, after learning about SID codes, I found out that my copy is a later repress. This made me wonder: how many releases in the Discogs database have some SID code defined, but have the year set to something earlier than 1994? I added a check to my scripts and analyzed the latest available Discogs data dump (with data until November 1 2017) and found 3092 releases with a release date ea
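The check described above could be sketched roughly as follows. This is only a minimal illustration, not the actual script: the dictionary layout (`year`, `identifiers`) and the function name `has_sid_before_1994` are hypothetical, and the field names are matched case-insensitively because the exact spelling in the dump is an assumption here.

```python
SID_INTRODUCED = 1994  # year SID codes were introduced, according to the post

# Identifier types treated as SID fields (compared in lowercase;
# the exact spelling in the data dump is an assumption).
SID_TYPES = {"mould sid code", "mastering sid code"}


def has_sid_before_1994(release):
    """Return True if a release has a SID identifier but a year before 1994.

    `release` is a hypothetical dict with an integer `year` (or None when
    the year is unknown) and a list of (identifier_type, value) tuples
    under `identifiers`.
    """
    year = release.get("year")
    if year is None or year >= SID_INTRODUCED:
        return False
    return any(
        id_type.lower() in SID_TYPES
        for id_type, _value in release.get("identifiers", [])
    )


suspect = {"year": 1991, "identifiers": [("Mould SID Code", "IFPI 1234")]}
print(has_sid_before_1994(suspect))
```

A hit does not automatically mean the year is wrong on purpose: as with the repress mentioned above, it often means the entry describes a later pressing than the year suggests.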

SID codes (part 2)

Another round of digging into SID codes. Before you read this it is highly advised to first read part 1 about SID codes. In the first part I wondered how many releases have SID fields defined, but have something in there that isn't an actual SID code. For that it is important to know what a valid SID code actually looks like. Valid SID codes According to the IFPI specifications this is a valid SID code: "The SID Code consists of the letters 'IFPI', followed by either four or five additional characters, which may be alphabetical or numerical [...]" which seems very clear. There are a few extra restrictions: the additional characters in the mastering SID code always have to start with "L", and mould SID codes cannot use the characters "I", "O", "S" or "Q" (it is unclear from the document if this is just for mould SID codes, or also for mastering SID codes). With that background information I (partially) adapt
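The rules quoted above translate fairly directly into a small validity check. The following is a minimal sketch under a few assumptions of my own: whitespace is ignored and input is uppercased before checking, the forbidden characters are only applied to mould SID codes (the post notes the specification is unclear on this), and the name `check_sid` is hypothetical.

```python
import re

# "IFPI" followed by four or five alphanumeric characters, as quoted above.
SID_RE = re.compile(r"IFPI([A-Z0-9]{4,5})")

# Characters that mould SID codes cannot use, as quoted above.
MOULD_FORBIDDEN = set("IOSQ")


def check_sid(value, kind):
    """Check a SID code against the quoted rules.

    `kind` is "mastering" or "mould". Whitespace is stripped and the value
    is uppercased first, which is a leniency assumed for this sketch.
    """
    normalized = re.sub(r"\s+", "", value).upper()
    match = SID_RE.fullmatch(normalized)
    if match is None:
        return False
    suffix = match.group(1)
    if kind == "mastering":
        # The additional characters of a mastering SID code start with "L".
        return suffix.startswith("L")
    if kind == "mould":
        # Assumption: the forbidden characters apply to mould codes only.
        return not (MOULD_FORBIDDEN & set(suffix))
    raise ValueError(f"unknown SID kind: {kind!r}")


print(check_sid("IFPI L123", "mastering"))
print(check_sid("IFPI 9S12", "mould"))
```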

How sparse is Discogs?

In Discogs each release is assigned a number. At the moment (November 27 2017) the number of releases in Discogs is a bit over 9,200,000, while the latest release number is a bit over 11,200,000. That means that around 2 million release numbers (around 17.86%) have gone from the database and about 82.14% of releases that were added are still in the database. [EDIT 2017-11-28: I was told by Discogs that every Draft is also assigned a release number and many drafts don't ever get submitted to the database, so they simply do not appear in the database] In a previous post I wondered if Discogs is getting increasingly "sparse". I kept thinking about it, so I decided to just look at the data to see if my suspicion was right. Why releases disappear from Discogs There are several reasons why releases disappear from Discogs. The most common reason is
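The share of unused release numbers follows from the two rounded figures given above. A small sketch of the arithmetic (the function name `missing_share` is just for illustration, and the inputs are the approximate numbers from the text, not exact counts):

```python
def missing_share(releases_present, latest_release_number):
    """Fraction of assigned release numbers that are not in the database."""
    missing = latest_release_number - releases_present
    return missing / latest_release_number


# Approximate figures from the text ("a bit over" 9.2 and 11.2 million).
share = missing_share(9_200_000, 11_200_000)
print(f"missing: {share:.2%}, present: {1 - share:.2%}")
```

Because both inputs are rounded, the result is only good for a rough impression: a bit under a fifth of all release numbers do not correspond to a release in the database.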

SID codes (part 1)

One thing that I only learned about after using Discogs is the so-called Source Identification Code, or SID. These codes were introduced in 1994 to combat piracy and to find out on which machines a CD was made. They were introduced by Philips and adopted by IFPI, and specifications are publicly available which clearly describe the two available SID codes (mastering SID code and mould SID code). For quite a few months now Discogs has had two fields available in the "Barcode and Other Identifiers" (BaOI) section: Mould SID code and Mastering SID code. A few questions immediately popped up in my mind: how many releases don't have a SID field defined when there should be one (for example, the free text field indicates it is a SID field)? how many releases have a SID field with values that should not be in the SID field? how many releases have a SID field, but a wrong year (as SID codes were only introduced in 1994)? how many vinyl releases have a SID code defined (which is impossi

Label Code (part 3)

Today a short post, again about label codes. Before reading this post it is recommended to first read part 1 and part 2 about Label Code fields in Discogs. Label Code in format free text field One thing I wondered about is: how often would a valid label code pop up in the free text field of the format description? This was fairly easy to check. In the latest Discogs dump: just 4, where 1 actually had a Label Code field and the free text field in Format just contained duplicate information. These were all fixed very easily. To be honest I had expected quite a few more, so this was a positive surprise.

The importance of submission notes in Discogs

Ranting time. Something that I think Discogs got right is that when editing a release you can see the entire editing history and view earlier versions of the page. This makes it a lot easier to see when a particular change was introduced and allows you to "debug" a page: some people make changes that turn out to be wrong, or information got lost and needs to be restored, or a page got vandalized (there are a few disgruntled ex-users that come and go) and changes need to be reverted. Having the entire history available makes that possible. Other collaborative systems like Wikipedia have this feature as well, where it has proven to be incredibly useful. In Discogs it is mandatory to describe changes to releases in the so-called submission notes. For me as a software engineer using open source software this is completely normal: you create a change, and describe the change, otherwise it becomes cumbersome down the line and you are asking yourself "what is this change and wh

SPARS Codes (part 3)

It is time for me to dig into SPARS codes again. Before reading this article it is advised to first read part 1 and part 2 of this series about SPARS codes. One thing that I noticed when digging into possible smells for SPARS codes is that sometimes I would see a code named CDC that was tagged as a SPARS code. As the letter C actually isn't in the SPARS code specifications I was wondering what it was and how often it appeared in the Discogs database. As it turns out this is a code specific to Sony Music that somehow indicates the format. There are more of these codes, which have been discussed several times in the Discogs forums already. I updated my scripts to see how many of these Sony Music codes were in SPARS Code fields. Surprisingly few: I could find just 23, but lots of SPARS Code fields still need to be fixed, so I am expecting more to pop up once these get "fixed". As always, scripts have been updated to check for these Sony codes in SPARS code fi

Digging into the Spanish Depósito Legal identifier (part 3)

Time again to dive a bit further into using the Spanish depósito legal identifiers to date releases. It is highly recommended to read part 1 and part 2 first before reading this article. Using depósito legal to date a release The year that is embedded in the depósito legal can be used to date a release, but there are a few things to keep in mind. For that it is necessary to know a little bit more about the depósito legal. To get a depósito legal number on a release you first have to apply for it. The release is then assigned a number by the library. This number is then to be printed on the release. This means that the release could never have come out prior to the year embedded in the depósito legal number, unless the depósito legal number has been misprinted. However, the year in the depósito legal number could lie before the actual release year: maybe a depósito legal was applied for, and then the release was postponed. Or, it was applied for in December and the release was made in January the next year.

ASIN field in Discogs (part 2)

One of the identifiers used in Discogs is the Amazon Standard Identification Number or ASIN. It is advised to first read part 1. ASIN used in wrong place One thing I wondered about after my previous post about the ASIN was: how many ASIN identifiers could be found in places other than the ASIN field, like Barcode or Other, where it was tagged as ASIN in the description? I adapted my scripts and ran some tests. The script reported 51 hits, but there were a few false positives where both the field and description were set to ASIN. There were a few releases (about 15 or so) where something other than an ASIN was used (most likely some barcode). This was a surprising find, as I actually had expected many more. There must be a few more in hiding but I will leave that for another time.

ASIN field in Discogs (part 1)

What I like about Discogs is that I keep learning about things I never knew about. One field in Discogs that I didn't know about is the ASIN field. The Discogs guidelines describe it as: "This is a unique identification number assigned by Amazon.com and its partners for product identification within the Amazon.com organization." and apparently it is embedded in product URLs at Amazon. The guidelines say: "This number should only be applied to releases manufactured by Amazon (CD-Rs), physical releases exclusive to Amazon, and digital files sold by Amazon." Despite that, many people seem to have extracted this number from the product URLs, as there were more than 42,000 ASIN entries in the Discogs database on October 31 2017. I would be extremely surprised if there were that many Amazon exclusives. ASIN in Discogs According to the Wikipedia page about ASIN the identifier is a ten-character identifier. A bit of mining revealed: most, if not all, ASI

Weird data in the "Released" field

In the Discogs database each release has a Released field that can contain either a year, a year plus month, or a year, month and day, or nothing (if the date is unknown). At some point in time Discogs started to enforce the format for this field, but before then it was more or less a free text field. I looked at how often the Released field contained an invalid year: 1766 times, which is surprisingly low. Some results that I found: different date notations (DD-MM-YYYY, MM-DD-YYYY) with different separators (-, /, .); the word "unknown", in various forms, including misspellings; a question mark; whitespace; multiple years; month names; combinations of the above. It is interesting to see that there is a very clear point when invalid dates are no longer seen in the data (likely somewhere in 2016), as this is when the checks were put in place, showing that automated gatekeepers to keep data clean do help. Checks that I made will soon be added to the cleanup scripts on Gi
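A check for the Released field along these lines could look roughly like the sketch below. It is not the actual cleanup script: the function name `is_valid_released` is hypothetical, and it assumes that the enforced notation is `YYYY`, `YYYY-MM` or `YYYY-MM-DD` with hyphens. It also rejects a month or day of `00`, which would need to be relaxed if the data uses that as a placeholder for an unknown month or day.

```python
import re

# Assumed enforced notation: YYYY, YYYY-MM or YYYY-MM-DD.
RELEASED_RE = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")


def is_valid_released(value):
    """Return True if `value` looks like a well-formed Released field.

    An empty string counts as valid, since the date may be unknown.
    Day numbers are only range-checked (1-31), not checked per month.
    """
    if value == "":
        return True
    match = RELEASED_RE.fullmatch(value)
    if match is None:
        return False
    _year, month, day = match.groups()
    if month is not None and not 1 <= int(month) <= 12:
        return False
    if day is not None and not 1 <= int(day) <= 31:
        return False
    return True


for candidate in ["1994", "1994-11-05", "05-11-1994", "unknown", "?"]:
    print(repr(candidate), is_valid_released(candidate))
```

All the kinds of junk listed above (other notations, "unknown", question marks, whitespace, month names) fail the pattern, so they end up in the list of entries to review.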

SPARS Codes (part 2)

Before reading this article it is recommended to first read part 1 about SPARS codes. SPARS codes in Format free text field One thing I saw a few times is that people used the free text field in the Format field to store the SPARS code. I was wondering how often that happened. Using my scripts and the latest data dump that was quite easy to find out. With a few simple additions (which will be pushed to GitHub soon) I looked at whether there was a SPARS code in the free text field for the format, but only if the SPARS code was the only thing in the free text field. In total I found 441 releases, but I did not check how many of these have a SPARS Code field, because regardless of that it is an error: if there is no SPARS Code field, there should be one, and if there is one, then the SPARS code in the format should not be there. On October 31 2017 there were 441 releases in the Discogs database where the SPARS code was recorded in the Format free text field. What is surprising is that even

Digging into the Spanish Depósito Legal identifier (part 2)

It is recommended to read part 1 before reading this article. Depósito Legal as catalog number One thing that I wondered about is: for how many releases did people use a depósito legal identifier as the catalog number? I had seen it a couple of times, and wondered if it was a structural error or not. After all, it is an easy mistake to make, as many people are not familiar with what these depósito legal numbers actually are, and mistake them for just some number. I adapted my scripts (changes will be added to the repository soon after some cleanups), ran the scripts, and then verified with the data on Discogs: just 65 releases were using a depósito legal as a catalog number, sometimes in addition to the actual catalog number (so should be easy to fix), sometimes instead of the actual catalog number (harder to fix, especially if there are no pictures, or only tiny pictures on which nothing can be seen). This is actually fewer than I expected, so I am pleasantly surprised. Now

What happened in Discogs in October 2017?

A new dumpfile was uploaded by Discogs, so I launched my scripts again to see what happened in Discogs in October 2017. Release statistics The new dumpfile ("the November dump") was released on November 4 2017 and has 9,107,428 releases. The previous dump ("the October dump") was released on October 4 2017 and had 8,996,419 releases. That means 111,009 more releases in the database. Of these: 8,455,978 releases stayed the same; 537,679 releases were changed; 113,771 releases were added; 2,762 releases were removed; 222 releases had the status Draft, Deleted or Rejected set; 11 releases that were not Accepted were present in the October and November data dump; 1 release moved from Draft to Accepted. What stands out is that fewer new releases were added than in September even though October was 1 day longer. But in September Discogs had organized a contest called S.P.IN (September Pledge INitiative), where people were encouraged to add new releases to the
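The "stayed the same / changed / added / removed" counts above can be derived by comparing two dumps by release number. The sketch below only illustrates that idea and is not the actual script: it assumes each dump has already been reduced to a dictionary mapping a release number to some comparable representation of the release (for example a hash of its XML), and the name `compare_dumps` is hypothetical.

```python
def compare_dumps(old, new):
    """Compare two dumps given as {release_number: representation} dicts.

    Returns counts of unchanged, changed, added and removed releases.
    """
    old_ids, new_ids = set(old), set(new)
    common = old_ids & new_ids
    changed = {release_id for release_id in common
               if old[release_id] != new[release_id]}
    return {
        "unchanged": len(common) - len(changed),
        "changed": len(changed),
        "added": len(new_ids - old_ids),
        "removed": len(old_ids - new_ids),
    }


# Tiny made-up example: release 2 changed, 3 was removed, 4 was added.
october = {1: "aaa", 2: "bbb", 3: "ccc"}
november = {1: "aaa", 2: "bbX", 4: "ddd"}
print(compare_dumps(october, november))
```

The numbers in the post are consistent with this bookkeeping: 113,771 added minus 2,762 removed gives the net growth of 111,009 releases.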

Unofficial Discogs rankhunting guide

After a friend who helps a lot on Discogs got accused of making minor edits just to get more votes I felt that something really important was missing, namely an unofficial "rankhunting guide" for Discogs. So here it is! In this article you will find some best practices for increasing your rank on Discogs, without pissing off everyone. With the advice here and too much free time on your hands it should be easy to get into the top 1000 in no time. Use scripts to discover problematic entries First you need to have a list of entries you want to fix, as it is much more efficient and you can very easily click your way through entries. The easiest way to do this is to download a monthly dump of the releases in Discogs and then run scripts that will output a list of smells that need to be fixed. Spread your "risk" These scripts output a list of smells in the same order as in which the releases were added to Discogs. Because every ten seconds or so a new release