Posts

Showing posts from 2017

Releases with incorrect tracklistings on Discogs (part 3)

Another thing that I wondered about when looking at tracklistings: in how many of those tracklistings have people duplicated numbers or letters? I saw at least one, but there have to be many more. Then again, I thought that this is probably something I should not want to wonder about, based on my earlier experiences with errors found in tracklistings. Nevertheless, I pushed ahead and found what I feared: there are lots of releases in which tracklist positions are duplicated. I found 101,474 instances in 31,056 releases.

Releases with possible duplication of tracklist positions

While most of these seem to be actual errors there are also a few exceptions, which are not always easy to identify automatically:

- some vinyl releases (like singles) are a double A side and will have 'A' on both sides of the label
- some releases don't have A/B sides, but instead have "This/That" or use something else
- some promotional releases have the same song on both side
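A check like the one described above could be sketched roughly as follows. This is only an illustration, not the script used for the post: the function name `duplicated_positions` and the sample tracklist are made up, and normalising case and whitespace before comparing is an assumption.

```python
from collections import Counter


def duplicated_positions(positions):
    """Return tracklist positions that occur more than once, in first-seen order.

    Positions are compared case-insensitively and empty positions are
    ignored (both are assumptions made for this sketch).
    """
    counts = Counter(p.strip().upper() for p in positions if p.strip())
    return [position for position, count in counts.items() if count > 1]


# Hypothetical tracklist positions for a single release.
tracklist = ["A1", "A2", "A2", "B1", "b1", ""]
print(duplicated_positions(tracklist))
```

A double A side single with positions "A" and "A" would be flagged by this sketch as well, which is exactly the kind of exception that is hard to identify automatically.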

Finding spelling errors in Czechoslovak and Czech releases on Discogs

Time for a short, but positive post! Today I got validation that what I do is actually useful for people working with Discogs. While sharing the findings of the previous article about Czechoslovakian manufacturing date codes we were asked if we could also search for a particular misspelling in Czech releases that is impossible to find with the Discogs search functionality. Apparently the Czech alphabet has a character (ě) that looks a lot like another character (ĕ) and it is difficult for non-Czechs to spot the difference (it took me some time as well), so there was the suspicion that there would be releases where one was used instead of the other, but Discogs does not allow you to search for these characters (according to one user on the Czechoslovak forum). Adding another check to my scripts was fairly trivial (I only had to take care to not search the YouTube playlists as well, which are probably not that interesting). The result: around 90 releases, the results of which hav

Manufacturing date codes from Czechoslovakia

Another country that put dates on its releases is Czechoslovakia (I am using past tense, as the country no longer exists). On most releases from this country from the late 1960s to the early 1990s you can find a so-called "manufacturing date code", or something similar to that. The page for the Opus label on Discogs says: "Most Czechoslovak vinyl records pressed between ca. 1967 and 1992 include a three-digit code on the side A center label. The first two digits represent the year, the third digit the half-year of the pressing. For example, a record with code '75 2' on the label has been manufactured in the 2nd half of 1975." and goes on to explain that this code does not necessarily mean that the record was also released in that year, but that it could have been held back for whatever reason. But it does mean that it can be used for checking the release year to see if it is perhaps wrong (too early). So I did just that. I looked at a few things in t
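The date code described in the quote above can be decoded with a few lines of Python. This is a minimal sketch, not the script used for the post: the function names are invented for illustration, and it assumes every code falls in the 1900s (which fits the quoted 1967 to 1992 range) and that the code may be written with or without a space.

```python
import re

# Two digits for the year, then 1 or 2 for the half-year, e.g. "75 2".
DATE_CODE = re.compile(r"^\s*(\d{2})\s*([12])\s*$")


def parse_date_code(code):
    """Return (year, half) for a code such as '75 2', or None if it does not match."""
    match = DATE_CODE.match(code)
    if match is None:
        return None
    # Assumption: all codes belong to the 20th century.
    return 1900 + int(match.group(1)), int(match.group(2))


def release_year_too_early(release_year, code):
    """True if the release year is before the year in which the record was pressed."""
    parsed = parse_date_code(code)
    if parsed is None:
        return False
    return release_year < parsed[0]


print(parse_date_code("75 2"))
print(release_year_too_early(1974, "75 2"))
```

A release year later than the code is not flagged, since a record could have been held back after pressing.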

Releases with incorrect tracklistings on Discogs (part 2)

Time to dig a bit deeper into the tracklistings, as I believe there is a bit more to the story than what I wrote about in an earlier post. One thing that I kept wondering about is: is this mostly a problem for vinyl releases or for cassettes? I adapted my scripts to also output the format so I can answer this question. As an extra I also checked for shellac records and 8 track cartridges, which also have sides. My scripts found 944 shellac records that possibly have a wrong tracklist. There are even more 8 track cartridges than shellac records with a possible tracklist issue: 1144. This means that the total number of releases I found with possible tracklist problems is now 148,013. The shellac releases with possibly wrong tracklistings are distributed over the data as follows: Distribution of shellac records with a possibly wrong tracklist in the Discogs data. For 8 track cartridges it looks like this: Distribution of 8 track cartridges with a possibly wrong tracklist

Releases with incorrect tracklistings on Discogs (part 1)

The tracklisting is one of the most essential parts of a release in Discogs, but for quite a few people it is proving difficult to get it right. What I often see with new entries in the database is that they are copied from an already existing release (using the powerful but dangerous "copy to draft" functionality in Discogs) and that information is adapted, but not all of the information. For the tracklist I see for example that for vinyl records or cassettes an already existing entry of a CD is used as a template and that the tracklisting is not adapted. This is important: CDs have a single side, but vinyl records and cassettes have two sides (or are single sided). Yet it is very common to find vinyl releases or cassettes in Discogs with no indication what side the tracks are on. What happens is that some people then ask the submitter to fix it. So I wondered: how often does this really happen? It wouldn't be the first time for me to think "that has to be a hug

How many releases in Discogs are Christmas related?

Because it is the season I was wondering: how many releases in Discogs are Christmas related? So I decided to run a very simple and crude test and search tracklists for a few keywords (all lower case, for easier searching):

- christmas
- x-mas
- kerst (Dutch)
- weinachten (German)

This is, of course, not a very good test and I could also have checked for specific song titles, or different languages, but I didn't have the time to research it in depth. I also did not check if every result is actually a 'real' Christmas song. So you should take this with a big heap of salt.

German-language Christmas releases

As it turns out German speaking people don't really like Christmas: just 31 releases have some reference to Weinachten. Maybe they don't have that many Christmas songs that actually mention it.

Dutch-language Christmas releases

The Dutch and Flemish (and the occasional low-German) are doing a lot better: a bit over 1130 releases. I roughly verified that these

Wrong months in the 'Released' field in Discogs

Time for me to go back to a very boring error in the Discogs database: release dates. In an earlier post I already wrote about weird data in the Released field (this is the date field) but that is not what I want to talk about now (as I have already done that). This time it is about something a lot simpler: wrong months in Discogs.

Month equals 0

As it turns out there are quite a few releases in the data where the value of the month is no longer correct. Apparently in the past it was customary (or even mandatory) to have releases in full YYYY-MM-DD format, and if the month and date were not known the value 00 was inserted. This is no longer accepted and if you edit a release and the value of the month is 00 then the following error will be displayed:

Error message for wrong month

This happens only for older releases that haven't been updated for years. It can be argued that this is mostly a cosmetic bug as the old dates with month values of 00 will display just fine. But, t
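Spotting such months in a dump could look something like the sketch below. It is an assumption-laden illustration: the function name `month_problem` is made up, the Released value is assumed to be a plain string in (part of) the YYYY-MM-DD format, and the out-of-range check is an extra guess at what else could go wrong rather than something taken from the post.

```python
def month_problem(released):
    """Describe what is wrong with the month in a Released value, or return None."""
    parts = released.strip().split("-")
    if len(parts) < 2:
        # Only a year (or nothing) was given, so there is no month to check.
        return None
    month = parts[1]
    if not month.isdigit():
        return "month is not a number"
    if int(month) == 0:
        return "month is 00"
    if int(month) > 12:
        return "month is out of range"
    return None


for value in ["1984-00-00", "1984-13-01", "1984-05", "1984"]:
    print(value, month_problem(value))
```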

ISRC codes in Discogs (part 2)

It is time to dive into ISRC codes a bit more. If you don't know what these codes are, then it is probably good to read part 1 first. In this article I will not really look at ISRC codes in Discogs (OK, just a tiny bit), but instead look at how to extract ISRC codes on a Linux machine using Python. For this I will use libdiscid and its Python bindings.

What you need for this:

- a computer with a CD/DVD drive (which is getting surprisingly hard to find in laptops these days)
- a Linux distribution (in my case Fedora 27, but any recent Linux distribution will do)
- libdiscid with associated Python bindings (either Python 2 or Python 3)

It is actually very simple and can be done from either the Python prompt or from a simple script. It is literally as simple as this (Python 3 code, I stored it in a file called isrc.py):

    import libdiscid
    disc = libdiscid.read()
    try:
        track_isrcs = disc.track_isrcs
        for i in range(0,len(track_isrcs)):
            print("Track %d: %s" %

ISRC codes in Discogs (part 1)

One thing that I had never heard about is the International Standard Recording Code (ISRC). According to the Wikipedia page about ISRC (worth reading) it is actually already almost 30 years old. These codes are interesting for datamining for a few reasons:

- the year in which the code was assigned is embedded in it, even though there are some exceptions: apparently in the early days the recording year was used
- they are specific to a single recording

These two characteristics make it an ideal piece of data for checks and comparisons. Sometimes the ISRC code is printed on releases, but sometimes it is also embedded in some of the metadata on CDs. I never knew about ISRC, as I don't have releases with the code (or at least, I never paid attention) and I never have had a (standalone) CD-player that displayed these codes by default.

ISRC in Discogs

The ISRC field is a relatively new addition to the Barcode and Other Identifiers (BaOI) section. In retrospect it might have been better to
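To make the embedded year concrete, here is a small sketch that splits an ISRC into its parts. It assumes the usual 12-character layout (two-letter prefix, three-character registrant code, two-digit year, five-digit designation), which is not spelled out in the post itself; the function name `parse_isrc` and the sample code are for illustration only.

```python
import re

# Assumed layout: CC-XXX-YY-NNNNN, with the hyphens being optional.
ISRC = re.compile(r"^([A-Z]{2})([A-Z0-9]{3})(\d{2})(\d{5})$")


def parse_isrc(text):
    """Split an ISRC into its parts, or return None if it does not look like one."""
    compact = text.replace("-", "").replace(" ", "").upper()
    match = ISRC.match(compact)
    if match is None:
        return None
    prefix, registrant, year, designation = match.groups()
    return {
        "prefix": prefix,
        "registrant": registrant,
        # Only two digits are stored, so the century has to come from context.
        "year": int(year),
        "designation": designation,
    }


print(parse_isrc("US-RC1-76-07839"))
```

Because only two digits of the year are stored, a check against the Released field would have to guess the century, and the early exceptions mentioned above would still need manual review.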

Fixing wrong credits in Discogs (part 1)

For releases in Discogs it is customary that people add the credits for a release, like who wrote songs, produced the record, mastered the recording, and so on. In some cases this information can be important, as some reissues or versions of records have been mastered by different people (I don't have an example, but I remember bumping into this while browsing through releases). Discogs has published a list of credits and roles that can be used in the database. These are the only ones that should be used and when editing a release that has a different role an error saying that something "does not match the credit list" (or similar to that) will be shown and you are expected to fix it first. Since a friend saw this error very frequently it got me wondering how many releases in Discogs there are with a credit that is not on the official credit list. For my clean up scripts I created a script to extract the credits list (so I wouldn't need to hardcode it), added a

Social interactions on Discogs

One fascinating side to Discogs is the interaction with other people. As you are probably aware the whole database is crowdsourced, with people adding and improving (at least, that's the idea) data. At the moment I am writing this there are 388,098 users listed in the contributor ranking, meaning they contributed to one or more releases. That's a lot of people: taken together they would be the fourth largest city in the Netherlands, before Utrecht . This also means that you might meet a wide range of characters, from many different cultures, with different motivations, and different opinions about what is the correct behaviour on Discogs. One of my friends is doing a lot of janitor work on Discogs and he gets to see quite a few comments from friendly to annoyed, very angry, hostile, or even downright absurd. Luckily my friend has been hardened by years of online disputes and, as the saying goes: "If you can't stand the heat, get out of the kitchen!" so if th

Indian PKD codes in Discogs

Some countries require that music releases have certain identifiers on them. A well known example is the depósito legal identifier in Spain that I wrote about several times before. But other countries have similar requirements as well. One of them is India, which for new releases requires a so-called "PKD". This code contains the month and year in which the release was made, or at least when it was packed (I have seen it being talked about as both a production date, as well as a packaging date). One of my friends at Discogs (gerjolp) looked into these codes a bit, but could not find many references, just a few forum threads, and the exact meaning remains a bit unclear, also because sometimes the date seems to have been stamped instead of printed, and this is not clear from the data in the database.

PKD codes in Discogs

There are 132 releases from India in the latest Discogs dump where it is easy to find the PKD code in BaOI. There are likely many more releases in Disco

Releases under a Creative Commons license in Discogs (part 2)

It's time for a very short post. In the last few days I spent some more time digging into Creative Commons licensed releases in the Discogs database. In a previous post I already looked a bit into it, but it was far from complete and likely I missed a few releases that have content under Creative Commons, but that I didn't identify. One thing that I added to my scripts recently was to look for a very specific word that is used in many Creative Commons license statements, namely ShareAlike . I added this to my scripts , grabbed the latest datadump (which is newer than the one I used last time, and also contains data added in November 2017) and reran my scripts. The result is that I found 82 more releases than last time. It should be noted that in one release the Creative Commons statement was actually removed and 31 releases seem to be new. That leaves around 50 releases which have the ShareAlike statement, but which I didn't recognize before as a released licensed und

What happened in Discogs in November 2017? (part 1)

A new month, a new data dump...and new statistics! Before reading this post it might be good to read the post about what happened in Discogs in October 2017.

Release statistics

For this blogpost I downloaded the latest datadump (the "December dump") containing data from November 1 to November 30 (inclusive). The previous dump file had 9,107,428 releases, the new dump file has 9,217,123 releases. That means 109,695 more releases in the database.

- 3,868,975 releases stayed the same
- 5,235,697 releases were changed
- 112,451 releases were added
- 2,756 releases were removed from the database
- 222 releases had status Draft, Deleted or Rejected
- 11 releases that were not Accepted were in both the November dump and the December dump
- 2 releases were moved from Draft to Accepted

What immediately stands out is that an enormous number of releases has been changed compared to previous months, when it was about 10 times fewer. As I don't believe that Discog
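The release figures quoted above can be cross-checked against each other with a bit of arithmetic. The sketch below assumes that "stayed the same", "changed" and "removed" all refer to releases that were in the previous dump; the function name `reconcile` is made up for this illustration.

```python
def reconcile(previous_total, new_total, stayed, changed, added, removed):
    """Recompute the dump totals from the individual counts."""
    return {
        "net_growth": new_total - previous_total,
        "added_minus_removed": added - removed,
        # Everything in the new dump either stayed, changed or was added.
        "new_total_from_parts": stayed + changed + added,
        # Everything in the previous dump either stayed, changed or was removed.
        "previous_total_from_parts": stayed + changed + removed,
    }


figures = reconcile(
    previous_total=9_107_428,
    new_total=9_217_123,
    stayed=3_868_975,
    changed=5_235_697,
    added=112_451,
    removed=2_756,
)
print(figures)
```

Under that assumption the numbers line up: the net growth of 109,695 equals added minus removed, and both dump totals can be rebuilt from the individual counts.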

How to better flag non-existent information in Discogs

I know a few collectors of vinyl records who can best be described as "completists" (although others would describe them as "completely nuts") who try to collect every variant of an album of a certain artist, no matter how small the difference. For these people it is very important to know how to tell different releases apart from each other. This is something that Discogs is currently unfortunately quite bad at, especially when it comes to indicating that certain information is not present on a release. Let's look at an example. In the 1980s Metallica released a few picture discs. Some of these picture discs were released with a barcode, and others without. For the collectors knowing whether or not a barcode is on the release really matters. Another example would be knowing if there are SID codes on a release. In an earlier blogpost about SID codes I wrote that SID codes can sometimes indicate that a CD was released after 1994. The SID codes are

How Discogs can prevent wrong data (part 2)

For people caring about correctness of data in Discogs, it sometimes seems like an uphill battle. Once an error has been introduced, it gets copied and spreads, and fixing it becomes almost impossible. It very much resembles fixing errors in the waterfall model of software engineering: stopping errors at the beginning is much easier than fixing later. One way errors are spreading is because of the "copy to draft" functionality in Discogs, where information from an existing release can be copied and serve as a template when adding a new variant of a release. Although extremely useful (it speeds up entering information) people not only copy the correct information, but also the errors, or they leave information that is irrelevant to a release and don't remove it (example: SID codes for vinyl or other releases ). When the new release in turn is used as a template the wrong information spreads further through the database. Detecting errors is quite trivial using some simp

SID codes (part 4)

I wanted to call this post "Return of the SID", but there is only so much nerdiness you can squeeze into one topic. But, I am going to talk about SID codes once again, so it is best to first read part 1, part 2 and part 3 about SID codes if you are not familiar with them. SID codes are inherently tied to CDs, or CD-like media (DVD, Blu-Ray, and so on) and have not been used anywhere else. One thing I wondered: for how many releases in the Discogs database have SID codes been defined when it actually is a different format for which SID codes do not make any sense at all, such as vinyl, or cassettes? So I got the latest Discogs data dump (releases until November 1 2017), adapted my scripts, ran some tests and got quite interesting results:

- 332 vinyl records
- 151 cassettes
- 24 files (digital music files)
- 13 shellac discs
- 2 DCC releases
- 1 VHS release
- 1 Memory Stick release
- 1 Edison Disc

Especially the Edison Disc made me chuckle, as it is such an ancient format, an

Releases under a Creative Commons license in Discogs (part 1)

While digging through releases in Discogs I spotted that some of them had a reference to a Creative Commons license . As I am dealing with open source licensing (of software) on a daily basis it of course immediately drew my attention and I started to wonder how many of the releases in the Discogs database have a reference to Creative Commons. I (again) grabbed the latest data dump from Discogs (which itself is under a CC0 license, so quite fitting), and adapted my scripts to report if there were any references to Creative Commons in either the free text fields, the notes or in some of the descriptions in the "Barcode and Other Identifiers" (BaOI) sections and found 7884 releases. I plotted the results of the release numbers, which you can see below (first bar: release number 1 - 999,999, second bar release number 1,000,000 - 1,999,999, and so on): Distribution of possibly CC-licensed works in the Discogs database What strikes me is that for older releases there are

SID codes (part 3)

And again it is time to dive into the world of SID codes. Before reading this post it is advised to read part 1 and part 2. Before I used Discogs I had not heard of SID codes and never realized that they can (somewhat) be used to help date a release, as SID codes were only introduced in 1994. I found out by accident when I was browsing through Discogs and found this release that I have had for many years. I bought it back in 1994 or 1995 and always wondered why it showed up in the shop (new) then even though it had supposedly been available already a number of years before. This year I found out after learning about SID codes that my copy is a later repress. This made me wonder: how many releases in the Discogs database are there that have some SID code defined, but have the year set to something earlier than 1994? I added a check to my scripts and analyzed the latest available Discogs data dump (with data until November 1 2017) and found 3092 releases with a release date ea

SID codes (part 2)

Another round of digging into SID codes. Before you read this it is highly advised to first read part 1 about SID codes. In the first part I wondered how many releases have SID fields defined, but have something in there that isn't an actual SID code. For that it is important to know what a valid SID code actually means. Valid SID codes According to the IFPI specifications this is a valid SID code: "The SID Code consists of the letters 'IFPI', followed by either four or five additional characters, which may be alphabetical or numerical [...]" which seems very clear. There are a few extra restrictions: the additional characters in the mastering SID code always have to start with "L", and mould SID codes cannot use the characters "I", "O", "S" or "Q" (it is unclear from the document if this is just for mould SID codes, or also for mastering SID codes). With that background information I (partially) adapt
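The rules quoted above translate into a fairly small check. The following is only a sketch under a few assumptions: the function name `classify_sid` is invented, an optional space after "IFPI" is allowed, any code whose additional characters start with "L" is treated as a mastering SID code, and the forbidden characters are applied to mould SID codes only (the document leaves that open).

```python
import re

# 'IFPI' followed by four or five additional alphanumeric characters.
SID = re.compile(r"^IFPI\s?([A-Z0-9]{4,5})$")
MOULD_FORBIDDEN = set("IOSQ")


def classify_sid(text):
    """Return 'mastering', 'mould' or 'invalid' for a candidate SID code."""
    match = SID.match(text.strip().upper())
    if match is None:
        return "invalid"
    suffix = match.group(1)
    if suffix.startswith("L"):
        # Assumption: a leading 'L' always means a mastering SID code.
        return "mastering"
    if MOULD_FORBIDDEN & set(suffix):
        # Assumption: the forbidden characters only apply to mould SID codes.
        return "invalid"
    return "mould"


for candidate in ["IFPI L123", "IFPI 0123", "IFPI 01S3", "IFPI", "ifpi 94z7"]:
    print(candidate, classify_sid(candidate))
```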

How sparse is Discogs?

In Discogs each release is assigned a number. At the moment (November 27 2017) the number of releases in Discogs is a bit over 9,200,000, while the latest release number is a bit over 11,200,000. That means that 2 million release numbers (around 17.85%) have gone from the database and about 82.15% of releases that were added are still in the database. Around 2 million release numbers have been removed from the Discogs database. [EDIT 2017-11-28: I was told by Discogs that every Draft is also assigned a release number and many drafts don't ever get submitted to the database, so they simply do not appear in the database] In a previous post I wondered if Discogs is getting increasingly "sparse". I kept thinking about it, so I decided to just look at the data to see if my suspicion ("Discogs is getting increasingly sparse") was right. Why releases disappear from Discogs There are several reasons why releases disappear from Discogs. The most common reason is

SID codes (part 1)

One thing that I only learned about after using Discogs is the so-called Source Identification Code, or SID. These codes were introduced in 1994 to combat piracy and to find out on which machines a CD was made. They were introduced by Philips and adopted by IFPI, and specifications are publicly available which clearly describe the two available SID codes (mastering SID code and mould SID code). For quite a few months now Discogs has had two fields available in the "Barcode and Other Identifiers" (BaOI) section:

- Mould SID code
- Mastering SID code

A few questions immediately popped up in my mind:

- how many releases don't have a SID field defined when there should be one (for example, the free text field indicates it is a SID field)?
- how many releases have a SID field with values that should not be in the SID field?
- how many releases have a SID field, but a wrong year (as SID codes were only introduced in 1994)?
- how many vinyl releases have a SID code defined (which is impossi

Label Code (part 3)

Today a short post, again about label codes. Before reading this post it is recommended to first read part 1 and part 2 about Label Code fields in Discogs.

Label Code in format free text field

One thing I wondered about is: how often would a valid label code pop up in the free text field of the format description? This was fairly easy to check. In the latest Discogs dump: just 4, where 1 actually had a Label Code field and the free text field in Format just contained duplicate information. These were all fixed very easily. To be honest I had expected quite a few more, so this was a positive surprise.

The importance of submission notes in Discogs

Ranting time. Something that I think Discogs got right is that when editing a release you can see the entire editing history and view earlier versions of the page. This makes it a lot easier to see when a particular change was introduced and allows you to "debug" a page: some people make changes that turn out to be wrong, or information got lost and it needs to be restored, or a page got vandalized (there are a few disgruntled ex-users that come and go) and changes need to be reverted. Having the entire history available makes that possible. Other collaborative systems like Wikipedia have this feature as well, where it has proven to be incredibly useful. In Discogs it is mandatory to describe changes to releases in the so-called submission notes. For me as a software engineer using open source software this is completely normal: you create a change, and describe the change, otherwise it becomes cumbersome down the line and you are asking yourself "what is this change and wh

SPARS Codes (part 3)

It is time for me to dig into SPARS codes again. Before reading this article it is advised to first read part 1 and part 2 of this series about SPARS codes. One thing that I noticed when digging into possible smells for SPARS codes is that sometimes I would see a code named CDC that was tagged as a SPARS code. As the letter C actually isn't in the SPARS code specifications I was wondering what it was and how often it appeared in the Discogs database. As it turns out this is a code specific to Sony Music that somehow indicates the format. There are more of these codes, which have been discussed several times in the Discogs forums already. I updated my scripts to see how many of these Sony Music codes were in SPARS Code fields. Surprisingly few: I could find just 23, but lots of SPARS Code fields still need to be fixed, so I am expecting more to pop up once these get "fixed". As always, scripts have been updated to check for these Sony codes in SPARS code fi

Digging into the Spanish Depósito Legal identifier (part 3)

Time again to dive a bit further into the Spanish depósito legal identifiers to date releases. It is highly recommended to read part 1 and part 2 first before reading this article.

Using depósito legal to date a release

The year that is embedded in the depósito legal can be used to date a release, but there are a few things to keep in mind. For that it is necessary to know a little bit more about the depósito legal. To get a depósito legal number on a release you first have to apply for it. The release is then assigned a number by the library. This number is then to be printed on the release. This means that the release could never have been released prior to the year embedded in the depósito legal number, unless the depósito legal number has been misprinted. However, the year in the depósito legal number could be earlier than the actual release year: maybe a depósito legal was applied for, and then the release was postponed. Or, it was applied for in December and the release was made in January the next year.

ASIN field in Discogs (part 2)

One of the identifiers used in Discogs is the Amazon Standard Identification Number or ASIN. It is advised to first read part 1 . ASIN used in wrong place One thing I wondered about after my previous post about the ASIN was: how many ASIN identifiers could be found in places other than the ASIN field, like Barcode , or Other and where it was tagged as ASIN in the description. I adapted my scripts and ran some tests. The script reported 51 hits, but there were a few false positives where both the field and description were set to ASIN . There were a few releases (about 15 or so) where something else than ASIN was used (most likely some barcode). This was a surprising find, as I actually had expected many more. There must be a few more in hiding but I will leave that for another time.

ASIN field in Discogs (part 1)

What I like about Discogs is that I keep learning about things I never knew about. One field in Discogs that I didn't know about is the ASIN field. The Discogs guidelines describe it as: "This is a unique identification number assigned by Amazon.com and its partners for product identification within the Amazon.com organization." and apparently it is embedded in product URLs at Amazon. Many people seem to have extracted this number from the product URLs, despite the guidelines saying: "This number should only be applied to releases manufactured by Amazon (CD-Rs), physical releases exclusive to Amazon, and digital files sold by Amazon." as there were more than 42,000 ASIN entries in the Discogs database on October 31 2017. I would be extremely surprised if there were that many Amazon exclusives. ASIN in Discogs According to the Wikipedia page about ASIN the identifier is a ten character identifier. A bit of mining revealed: most, if not all, ASI