
Posts

Showing posts from 2018

How to "hijack" releases in Discogs that were voted correct without getting caught

There are some things in Discogs that really irritate me. One of the things that I personally dislike a lot is the voting system, as it is not granular enough. But what's worse is that there is a way to (partially) edit a release that bypasses the voting system: pictures. Normally when you edit a release in Discogs that has received a vote (correct, incorrect, etc.) its status is set back to "needs vote", except when only the images are changed. In that case the voting status stays as it is. I have seen instances where an entry was voted "complete and correct" 7 years ago, after which the original pictures were replaced by completely different ones and no one noticed. This so-called "hijacking" of releases is very much frowned upon and unfortunately it happens far too often. So what Discogs should do is quite simple: make sure that every edit, including an edit related to images, causes the status to be set to "needs vote".

How are release formats distributed over Discogs?

In Discogs each release has one or more Format fields, in which a contributor has to indicate what format (or formats, if there are several) a release has. I simply looked at all the releases in the database and counted, and this is the list I got (from most releases to fewest releases):
Vinyl: 4,819,258
CD: 2,913,785
File: 917,569
Cassette: 647,604
CDr: 386,541
Shellac: 146,494
DVD: 85,475
Box Set: 52,634
All Media: 29,217
Flexi-disc: 20,814
VHS: 18,674
8-Track Cartridge: 15,206
Acetate: 9,580
DVDr: 8,581
Lathe Cut: 8,023
SACD: 6,380
Reel-To-Reel: 5,022
Blu-ray: 3,705
Laserdisc: 3,077
Memory Stick: 1,701
Minidisc: 1,613
Edison Disc: 1,308
Cylinder: 1,290
Betacam SP: 1,060
Hybrid: 1,012
Floppy Disk: 1,003
Blu-ray-R: 674
CDV: 601
4-Track Cartridge: 593
DCC: 397
Pathé Disc: 367
U-matic: 362
Betamax: 263
DAT: 209
PlayTape: 144
Microcassette: 135
HD DVD: 65
MiniDV: 57
UMD: 51
VHD: 40
SelectaVision: 37
Tefifon: 3

Digital file releases in Discogs (part 1)

One category of releases in Discogs is the "digital releases". Basically: MP3s or other digital formats from stores, iTunes releases, and so on. Call me old-fashioned, but personally I don't see these as collectables, as to me they are just files on a computer or music player. But apparently many people disagree with that and collect them. In Discogs the digital releases have "File" in the format field. This makes them quite easy to recognize. So I wondered: how many of these "file" releases are there and where are they in the data? Is it mostly the newer releases, or are there also many older file releases? So I looked at everything in the latest data dump and found 917,569 releases that have "File" in the format field, which is about 10% of the releases in Discogs and 40% more than, for example, cassettes.
Figure: Distribution of releases tagged as "File" in the Discogs data set
So only in the very early days there were few "

Looking back on 9 months of digging into Discogs

I have been digging into the Discogs dataset for over 9 months now and blogging about it since September. During this period I have made a few observations. In short I must say that I have mixed feelings about Discogs because it isn't clear what Discogs actually is and what they want to achieve. I talked about this earlier, but it basically comes down to this: Is it a catalog? Is it a marketplace? Is it a place to organize your collection? Discogs is trying to be all three but is not succeeding, because there are tensions between the different use cases: enforcing correctness chases sellers away from the marketplace, and selling records is what brings in the money for Discogs and keeps the other fires burning. But the cost of this is that the data in the catalog is sometimes blatantly incorrect, which vastly reduces its value. Personally I don't care about the marketplace, as I am a collector and I care about the catalog. Discogs has enormous potential and for collecto

How many known errors in Discogs were fixed in May 2018?

Whenever a new datadump from Discogs has become available I try to find out how the data has been improved, and see what else I can find that can be improved in the data. Yesterday I looked at how many releases that previously did not have errors (detected using my scripts) were now flagged as having errors. The results actually weren't that encouraging. To give it a positive twist I wanted to look at the exact opposite: how many releases that were flagged as having an error in the previous dump did not have an error (as detected by my scripts of course) in the new dump? For this I ignored tracklisting, role and artist errors and focused on the "real" errors (even though tracklisting errors are very real, the other two fall more in the "meh" category). Also, if a release with errors was removed, then that is also considered a fix (because the error is no longer there). In total 2169 releases (with 3283 errors) were fixed. Old releases that were fixed in

Introducing new errors in old releases

One thing that I noticed when looking at older releases is that sometimes new errors are introduced. These are then picked up by my scripts which detect them just fine, but I was wondering how often this happens, as that is something my scripts do not detect. So I grabbed the dumps released in May and June and simply counted how many errors in old releases were in the later dump, but not in the earlier one. As it turned out, quite a few: 1692 errors in 1046 unique releases, and there were only about two weeks between the two dump files. Extrapolating a bit, that means that each month new errors are probably introduced in around 2000 older releases, around 3000 errors in total (and most of these errors are preventable). When ignoring Artist and Tracklisting errors there are still about 630 errors left in 474 releases.
Figure: Old releases in which errors were introduced in the second half of May 2018
What is clear is that this time it is mostly the recent releases that are adapted. What this also means is th
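
The comparison between two dumps can be sketched roughly as follows. This is not the script used for the post, just a minimal illustration. It assumes that the errors found in each dump are already available as a mapping from release id to a set of error labels; the function name, the labels and the data layout are all made up for the example.

```python
def newly_introduced_errors(earlier_ids, earlier_errors, later_errors):
    """Return errors that appear in the later dump but not in the earlier one.

    earlier_ids:    set of all release ids present in the earlier dump
    earlier_errors: dict mapping release id -> set of error labels (earlier dump)
    later_errors:   dict mapping release id -> set of error labels (later dump)
    Only releases that already existed in the earlier dump ("old" releases)
    are considered. The layout is an assumption for this sketch.
    """
    introduced = {}
    for release_id, errors in later_errors.items():
        if release_id not in earlier_ids:
            # Release was added after the earlier dump, so it is not "old".
            continue
        new_errors = errors - earlier_errors.get(release_id, set())
        if new_errors:
            introduced[release_id] = new_errors
    return introduced


def summarize(introduced):
    """Return (number of errors, number of unique releases)."""
    return sum(len(e) for e in introduced.values()), len(introduced)
```

Calling `summarize()` on the result gives the two numbers of the kind reported in the post: the count of new errors and the count of unique releases they are in.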

What happened in Discogs in May 2018?

Another month, so another round of statistics. This time Discogs was fairly quick to release a new data dump, unlike last month. If you don't know how this works it is best to first read that overview, plus perhaps a few older ones.
Release statistics
The latest dump (which I call "the June dump", as it was released then) covers all data added from May 15 - May 31 2018 (inclusive). The previous dump had 9,843,513 releases, the new dump has 9,906,032 releases. That means 62,519 releases more in the database. Of those:
9,520,889 releases stayed the same
320,855 releases were changed
64,288 releases were added
1,769 releases were removed from the database
365 releases had status Draft, Deleted or Rejected
11 releases that were not Accepted were in both dumps
1 release was moved from Draft to Accepted
What strikes me is that the number of releases is a lot lower than last time, although last time it was a lot higher. So I guess that last month's dump was actua
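
As a sanity check, the numbers above can be tied together with a bit of arithmetic. The sketch below assumes that every release in the previous dump is either unchanged, changed or removed in the new one, and that the net growth equals added minus removed. That reading is an assumption, but it does hold for the figures quoted here. The function name is made up for the example.

```python
def net_growth(previous_total, new_total, unchanged, changed, added, removed):
    """Check that the dump statistics are consistent and return the net growth.

    Assumes: previous_total = unchanged + changed + removed
             new_total - previous_total = added - removed
    """
    if unchanged + changed + removed != previous_total:
        raise ValueError("previous dump does not add up")
    if added - removed != new_total - previous_total:
        raise ValueError("growth does not match added minus removed")
    return new_total - previous_total


# Figures for the June dump as quoted in the post.
growth = net_growth(
    previous_total=9_843_513,
    new_total=9_906_032,
    unchanged=9_520_889,
    changed=320_855,
    added=64_288,
    removed=1_769,
)
print(growth)
```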

What happened in Discogs in April 2018?

I actually had wanted to write this post a few weeks ago, but for some reason it took Discogs a lot longer to publish the new data dump. I can only guess why, but I would not be surprised if GDPR had something to do with it. I then got swamped with other tasks so now, almost a month too late, I can finally tell you what happened in Discogs in April 2018. First of all, if this is the first time you read one of these posts, please read the one from last month first.
Release statistics
The latest dump (which I call the "May dump") covers the period of April 1 - May 14 2018 (inclusive). The previous dump had 9,680,263 releases, the new dump has 9,843,513 releases. That means 163,250 releases more in the database. Of these:
8,983,224 releases stayed the same
693,583 releases were changed
166,706 releases were added
3,456 releases were removed from the database
162 releases had status Draft, Deleted or Rejected
11 releases that were not Accepted were in both

Contributor ranking in Discogs (part 2)

What I like about working with the Discogs data is making data visible that isn't visible. In an earlier post I mentioned that I suspected that the Discogs contributor ranking followed the 80/20 rule, but I didn't have enough data yet to confirm that. I crawled more data from Discogs (very slowly, as Discogs doesn't make it easy with their anti-crawling measures, so I crawled from multiple locations over quite a few hours) and reran the scripts that I wrote to crunch the numbers and see what share of the top contributors was responsible for 80% of the accumulated points in Discogs. When looking at the contributions of the top 1000 contributors, 60% of the contributors accounted for about 80% of the points. The more data I got the more this moved towards 20% and it became clear very quickly that Discogs indeed seems to follow the 80/20 rule: when looking at the points of the top 36,000 contributors 80% of the points accumulated belong to the top 21.3% contributo
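
The number crunching described here boils down to sorting the contributors by points and walking down the list until 80% of the total is reached. A minimal sketch of that idea, assuming the crawled data is simply a list of point totals (the function name and the example numbers are invented for illustration, and this is not the original script):

```python
def share_of_contributors(points, target_percent=80):
    """Return the fraction of top contributors that together hold
    at least target_percent of all points.

    points: list of point totals, one per contributor (any order).
    """
    if not points:
        raise ValueError("no contributors given")
    ranked = sorted(points, reverse=True)
    total = sum(ranked)
    running = 0
    for count, value in enumerate(ranked, start=1):
        running += value
        # Integer comparison avoids floating point rounding surprises.
        if running * 100 >= target_percent * total:
            return count / len(ranked)
    return 1.0


# Invented example: one contributor holds 80 of the 100 points.
print(share_of_contributors([2, 80, 5, 10, 3]))
```

A result close to 0.2 would match the 80/20 rule; a result of 0.6, as with the top 1000 contributors, means the sample is still too flat.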

Contributor ranking in Discogs (part 1)

In Discogs you can get points for every contribution. There are a few people who have made it a sport to score as many points as possible, the so-called "rank hunters", which some people look down on for reasons that I do not understand (in case you want to be a rankhunter or you want to be a more effective one, please check out the "Unofficial Discogs rankhunting guide" to maximize your efforts). It works like this: adding a new release gets you three points and each edit (regardless of how much you edit) gives you one point and there is some sort of leaderboard/ranking for all contributing users to Discogs. There are a few people that have an enormous amount of points and who seem to live for the site. The number one currently has over 356,000 points. When looking at the graphs they really reminded me of power laws, and the 80/20 rule because when looking at the first page of the contributors in Discogs and power law pictures there is a striking resemblance so

ISRC codes in Discogs (part 5)

I am still not done digging into the ISRC data from Discogs, as I see it as a source of errors and I am in ranting mode. At the moment having errors in ISRC codes is not a big problem, as Discogs is not using it yet (right now you cannot specifically search on ISRC codes). Unlocking this data could actually be quite handy in the future ("Oh, I like this track that I downloaded and want it on a physical release. On which physical releases was it published?") but I will let them discover that business case themselves. If you don't know what ISRC codes are I suggest you start with reading one of my previous posts about the subject and follow the links there. What I wondered is: how many times can you find the same ISRC code on a single release, for example if a contributor makes a copy/paste error and forgets to change the code? So I adapted my scripts and ran a test to see in which releases (where ISRC codes are actually marked as such with a proper ISRC field) there are dup
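
A check like this can be sketched in a few lines. The example below is an illustration only: it assumes the identifiers of one release are available as (type, value) pairs, with "ISRC" as the type label, and it normalizes the codes by dropping hyphens and spaces before comparing. Those details are assumptions, as are the made-up codes in the example.

```python
from collections import Counter


def normalize_isrc(value):
    """Uppercase the code and drop hyphens and whitespace."""
    return "".join(ch for ch in value.upper() if ch not in "- \t")


def duplicate_isrcs(identifiers):
    """Return the ISRC codes that occur more than once on a single release.

    identifiers: list of (type, value) pairs for one release; the
    layout and the "ISRC" type label are assumptions for this sketch.
    """
    codes = [normalize_isrc(value) for kind, value in identifiers if kind == "ISRC"]
    counts = Counter(codes)
    return sorted(code for code, n in counts.items() if n > 1)


# Made-up identifiers: the second ISRC is a copy/paste of the first.
release = [
    ("Barcode", "1234567890123"),
    ("ISRC", "NL-A12-95-00001"),
    ("ISRC", "nla129500001"),
    ("ISRC", "NL-A12-95-00003"),
]
print(duplicate_isrcs(release))
```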

ISRC codes in Discogs (part 4)

Time to look at ISRC codes again because I am still not done researching them. If you don't know what they are, I would suggest that you first read part 1, part 2 and part 3. To better understand some of my complaints below you should also read about why I think that Discogs stores the ISRC fields in the wrong place. So now that we've got that out of the way we can start. The ISRC field in Discogs is in the "Barcodes and Other Identifiers" (BaOI) section, which, as I already explained, is not the correct place: it should be with the individual tracks. Right now people are using descriptions (a free text field) to indicate which ISRC code belongs to which track. Because I know how bad people are at typing in correct information (and of course I am not immune to this) I was wondering in how many releases people make mistakes. I added a very simple check to my scripts to see if descriptions were being reused (copy/paste errors, or "off by one" errors).

What happened in Discogs in March 2018?

A new month, so that means that there is a new dumpfile available that I can do analysis on. If this is the first time you see one of these posts I would highly recommend first reading the posts from previous months.
Release statistics
The latest dump ("the April dump") covers the period from March 1 - March 31 (inclusive). The previous dump had 9,554,069 releases, the new dump has 9,680,263 releases. That means 126,194 releases more in the database.
8,962,136 releases stayed the same
589,131 releases were changed
128,996 releases were added
2,802 releases were removed from the database
247 releases had status Draft, Deleted or Rejected
11 releases that were not Accepted were in both dumps
0 releases were moved from Draft to Accepted
Luckily this time Discogs did not change the XML format, so it is relatively easy to compare to the previous month. It looks very similar to previous months: the amount of releases edited is very similar, but perhaps with a sl

Where the current Discogs datamodel doesn't work (part 2)

Time to look into a few more aspects of the Discogs data model and where I think it could be improved. If you haven't read the first part I would suggest you do that first. This time I want to look at a few more: rights society, mastering and mould SID codes, and pressing plant.
Rights Society
Currently rights societies are stored in the "Barcodes and Other Identifiers" (BaOI) section which covers the whole release. The correct place for this information would actually be per track. There are some releases in Discogs (like some compilations) where it is indicated which rights society applies to which track.
Mastering and Mould SID codes
Like rights societies the mastering and mould SID codes are stored in the BaOI section that covers the whole release. The correct place for these identifiers would actually be per physical disc, as they frequently differ between discs in releases with multiple discs. The solution that people use is to tag it in the free text fie

Release with artists not in the Discogs database (part 1)

When adding a new artist to Discogs and the artist is unknown, it is normally added to the database and the release is added to the artist page. Artist actually is a bit of a misnomer, as it could also mean a sound engineer, a producer, and so on. For each of the "artists" there is a number in the database and a corresponding artist page. I was looking through some of the entries in Discogs and spotted a few instances where the number of the artist was 0 and on the release page there was no link to an artist page. The credits list on the Discogs website explains that there are five credits for which the artist is not recorded in the database and there is no artist page. Personally I don't see the need for three of these ("artwork by", "photography" and "executive producer") and think they should be completely replaced by linked credits. I was wondering how many releases there actually are where one or more "artists" were not li

Where the current Discogs datamodel doesn't work (part 1)

After looking at many releases in Discogs and the edit history of various releases and how things have changed through time it is quite clear to see how the data model has changed. Some of these changes have turned out well, while others were, in retrospect, probably not the right change to make. I am not blaming the developers as I know from experience how difficult it is to get a data model right the first time and how hard it is to change: as soon as something is in use it is non-trivial to change, especially with a database the size of Discogs. Still I want to go through a few examples where I think the current datamodel doesn't work. Most of my examples will focus on the fields from the "Barcodes and Other Identifiers" (BaOI) section. In this post I am looking at three of them: ISRC, SPARS codes and matrix/runout.
ISRC
Currently ISRC codes (International Standard Recording Code, see previous articles about it for more information: 1, 2, 3) are stored in the

Homoglyph confusion in the Discogs database (part 2)

I decided to look into character sets again, because the last time I only looked at a few instances of "homoglyphs" (characters that look like characters in other languages, but which are different). It is recommended to first read the first article about this. I looked into rights societies again. Last time I mostly focused on Greek characters, this time I looked at where Latin characters were used where Cyrillic characters should have been used. The rights society in Russia is called RAO which in Russian is spelled РАО, which looks a lot like PAO, but it is different (first one is using the Cyrillic alphabet, second one the Latin alphabet). They are in the data as follows:
Figure: Distribution of wrong entries for Russian rights society РАО
What is interesting is the peak for recent releases. I don't know whether this is because more Russian releases have been added recently, or if this error has mostly been corrected for earlier releases.
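
Because the two spellings look identical on screen, a check has to look at the actual code points. The following is a rough sketch of one way to do that with Python's standard `unicodedata` module, using the first word of each character's Unicode name as a stand-in for its script. That shortcut, and the function names, are assumptions made for this example and not necessarily how the original scripts work.

```python
import unicodedata

# The Russian rights society, written with Cyrillic code points.
RAO_CYRILLIC = "\u0420\u0410\u041e"
# The lookalike, written with plain Latin letters.
PAO_LATIN = "PAO"


def scripts_used(text):
    """Return the set of scripts of the letters in text, e.g. {"LATIN"}.

    Uses the first word of the Unicode character name as the script,
    which is a simplification that works for Latin, Greek and Cyrillic.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


def suspicious_rights_society(value):
    """Flag values that mix scripts or that spell the Latin lookalike."""
    value = value.strip()
    return len(scripts_used(value)) > 1 or value == PAO_LATIN


print(scripts_used(PAO_LATIN), scripts_used(RAO_CYRILLIC))
```

A flagged value is only a candidate for review, not proof of an error.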

What happened in Discogs in February 2018?

It is a new month, so that means that there is a new dumpfile available that I can do analysis on. If this is the first time you see one of these posts I would highly recommend first reading a similar post from last month.
Release statistics
The latest dump ("the March dump") contains data from February 1 - February 28 (inclusive). The previous dump had 9,442,719 releases, the new dump has 9,554,069 releases. That means 111,350 releases more in the database.
3,391,491 releases stayed the same
6,048,861 releases were changed
113,717 releases were added
2,367 releases were removed from the database
209 releases had status Draft, Deleted or Rejected
11 releases that were not Accepted were in both dumps
1 release was moved from Draft to Accepted
So, looking at the releases that were changed it is obvious: Discogs changed the dump format again! ARGH! This time it is in the "joiners" for the artist names. Discogs changing the internal XML format makes it

Using matrix numbers of CDs to verify release years (2)

I am still not done looking at matrix numbers to verify Discogs release data and I am uncovering more and more. In a previous post I wrote about how Cinram has embedded the manufacturing date of the glass master in the matrix and how effective it can be to find releases with an incorrect release year. But there are also other companies that did this, such as P+O. In typical German fashion they were very thorough with their matrices, and until 2007 they put a few markers possibly indicating the date in the matrix (later they used a different matrix and since then it is not so easy). Each old style P+O matrix has:
a number identifying the release
a character indicating the master (for example A, or B)
an optional number indicating the stamper
a month/year combination indicating the production date
The two interesting parts are the number identifying the release and the month/year. For each of the numbers it is known which number was used in which year. The month/year combina

Pressing plant misspellings in CD matrix fields

One thing that I learned about data entry: it is difficult! People are sloppy and tend to make silly mistakes and then overlook them, because the brain is great at error correction, making you blind to these mistakes. I bumped into one particular error in Discogs twice while looking at possible errors for specific pressing plants. And, I thought, if I can already see it twice by sampling a small section of releases, then it has to be more common. The error that I found is that the matrix numbers of the PMDC plants were misspelled as "PDMC" (interestingly, PDMC was another completely unrelated plant). Some likely reasons:
sometimes the matrix is mirrored (tricky for the brain)
copy and paste errors from elsewhere
(personal) PDMC is easier to say than PMDC
Using my scripts and the latest datadump I found 62 releases with this particular error. They are distributed as follows in Discogs:
Figure: Releases where PMDC is likely misspelled as PDMC in the matrix
I

Using pressing plant identifiers to date releases (2)

Using pressing plant information to find wrong information in releases in Discogs is something that I explored in an earlier post. I only looked at a single pressing plant and that already uncovered 76 releases that were obviously incorrect. I added a few more checks for pressing plants to my scripts and processed the latest Discogs data dump that I could find. Immediately hundreds more errors popped up, and I didn't even add that many companies to my checks. With the new checks I could find 610 releases that are wrong, but I am sure that as soon as I start adding more checks many more errors will pop up.
Figure: Releases in Discogs where manufacturing plants and release years don't match
What often seems to be the case is that people have combined the original release with reissues, making the entries completely useless for a correct classification. One thing that I noticed when looking at the pages for the manufacturing plants is that there are quite a few where it is mentioned w

SPARS Codes (part 4)

In the last few weeks I have been digging a lot into CDs and I have come to the conclusion that correctly identifying CDs is difficult: there are so many things that you have to take into account and it is easy to make a mistake. Luckily automation helps! I decided to look at SPARS codes again, as I felt that story was not yet complete. If you don't know what SPARS codes are, I would suggest first reading my earlier posts about them: part 1, part 2 and part 3. On Wikipedia it says that SPARS codes were introduced in 1984. Of course, we all know that if it is on Wikipedia it just has to be true! Wikipedia has a reference to a physical magazine from 1984 and with a bit of searching I found someone who actually describes the same magazine and who seems to confirm what Wikipedia says. So, I wondered: how many releases are there in the Discogs database with a defined SPARS code field (containing a valid SPARS code) and a declared release date prior to 1984? I was surprised to f
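
The question above translates into a very small check. In the sketch below a "valid SPARS code" is assumed to be exactly three letters, each either A or D; the post does not spell out its own definition of valid, so treat that rule, the function names and the cut-off parameter as assumptions for illustration.

```python
import re

# Assumption: a SPARS code is three letters, each A (analog) or D (digital).
SPARS_RE = re.compile(r"[AD]{3}")
SPARS_INTRODUCED = 1984  # year taken from the Wikipedia claim in the post


def is_valid_spars(code):
    """Return True if code looks like a SPARS code such as AAD or DDD."""
    return SPARS_RE.fullmatch(code.strip().upper()) is not None


def spars_before_introduction(code, release_year, introduced=SPARS_INTRODUCED):
    """Flag releases with a valid SPARS code but a year before its introduction."""
    return is_valid_spars(code) and release_year < introduced


print(spars_before_introduction("DDD", 1983))
```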

Using pressing plant identifiers to date releases (1)

Something that has turned out to be surprisingly hard is categorizing CDs. While at first it seemed easier than vinyl it has turned out to be much more difficult, as CDs have been repressed in different years by different plants (or by the same plant under a different name), with everything the same except for some very minor details, such as SID codes or the CD matrix. I have already written about SID codes and how they can be used to very roughly date CD releases. But there is also other information that can be used to see if releases have been dated (somewhat) correctly. Recently I decided to look at the companies that are listed, see when they were operational and use that information for checking release years. Quite a few CDs were pressed at a particular plant in the US. During its lifetime this plant operated under different names:
PDO, USA from 1986 - 1992
PMDC, USA from 1992 - 1999
UML from 1999 - 2005
EDC, USA from 2005 - 2009
There is some overlap: some g
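
The plant names and years listed in this post are enough for a simple plausibility check. The sketch below treats both ends of each range as inclusive, since the names changed during the boundary years; the table layout and function name are invented for this example, and the check is deliberately naive.

```python
# Years of operation per plant name, as listed in the post.
PLANT_YEARS = {
    "PDO, USA": (1986, 1992),
    "PMDC, USA": (1992, 1999),
    "UML": (1999, 2005),
    "EDC, USA": (2005, 2009),
}


def year_matches_plant(plant, release_year):
    """Return True if the release year falls in the plant's years of operation.

    Returns None for plants that are not in the table, so unknown
    plants are not reported as errors.
    """
    years = PLANT_YEARS.get(plant)
    if years is None:
        return None
    first, last = years
    return first <= release_year <= last


# A release credited to PDO, USA but dated 1985 is worth a second look.
print(year_matches_plant("PDO, USA", 1985))
```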

Using matrix numbers of CDs to verify release years (1)

One of the first steps when making CDs is to produce a "glass master". From that (using a few more steps) a "stamper" is created which is then used to press the actual CDs. When looking at a CD you can often see text in a ring in the middle of the CD. This is the so-called matrix, which comes from the glass master. Apart from the matrix other text (like the IFPI mastering SID code) could possibly also come from the glass master. A bit of background information can be found on page 7 of the IFPI SID code implementation guide, although I would also highly recommend watching some of the clips about glass masters on YouTube, which are highly informative. One company making a lot of these glass masters is Cinram. For most of their glass masters they stored the production date in the matrix. This is good news, because it means that it could possibly be used to verify releases and see if the declared year of the release is right: it could never have been released before

DMM records in Discogs

Somewhere in the 1980s records started to appear that were made with the "DMM" (Direct Metal Mastering) method. The whole background of DMM is explained on Wikipedia much better than I could do, so I will just focus on DMM in Discogs. In Discogs there is no special field to indicate that a record was made using DMM, so some people have used the free text field in the "Format" section for it. I looked at how often this happened: 571 times, which are distributed like this in the Discogs database:
Figure: Distribution of releases tagged DMM
This seems to be quite low as there should be many many more (especially the 1980s pop records). Of course, one explanation could be that it is actually not relevant information. I am not aware of records that were both pressed as DMM and non-DMM releases and where the only difference is that DMM was used (but I could be wrong). One thing that I also looked at is how many releases were in Discogs that have DMM in the free text field but whic

ISRC codes in Discogs (part 3)

Time to revisit the ISRC codes. I already talked about these twice, namely how many errors for these codes there are in Discogs, as well as how to extract them from a CD. In the first of those posts I already hinted that there is actually a year component in the ISRC codes. The ISRC code for a single track has 12 characters. Characters 6 and 7 should be digits indicating the year the code was assigned (although in the early days some ISRC codes were handed out where these digits indicate the year the song was written or when the recording was made). It would of course have been easier if they had used 4 characters, but they didn't. That means that the year component from the ISRC code can be used to check whether or not a release is correctly dated, because a release cannot be from earlier than the year the ISRC was assigned. I adapted my scripts to check for a few things:
is the format conforming to the ISRC standard?
is the year recorded in the ISRC not later than the date of the rele
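
The year check can be sketched as below. Two things in this example are assumptions rather than something stated in the post: the pattern used for the format check (two letters, three letters or digits, then seven digits) and the way the two-digit year is expanded to four digits, which here simply picks the most recent year that is not later than the year of the dump being checked. The example code itself is made up.

```python
import re

# Assumed layout: 2 letters, 3 alphanumerics, 2 year digits, 5 digits.
ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{2}[0-9]{5}")


def isrc_year(code, latest_year):
    """Return the four-digit year from characters 6 and 7 of an ISRC.

    latest_year is the newest year that can plausibly occur (for
    example the year of the data dump); two-digit years that would
    land after it are placed in the previous century.
    Returns None if the code does not match the assumed format.
    """
    normalized = code.upper().replace("-", "").replace(" ", "")
    if ISRC_RE.fullmatch(normalized) is None:
        return None
    year = (latest_year // 100) * 100 + int(normalized[5:7])
    if year > latest_year:
        year -= 100
    return year


def release_predates_isrc(code, release_year, latest_year=2018):
    """Flag a release dated earlier than the year its ISRC was assigned."""
    assigned = isrc_year(code, latest_year)
    return assigned is not None and release_year < assigned


print(isrc_year("NL-A12-95-00001", 2018))
```

Given the caveat about early codes, a flagged release is a candidate for a closer look rather than a certain error.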

CD+G releases in Discogs

For many years to me an audio CD was just a silver disc with music on it. But as I already discovered with for example the ISRC codes on CDs there can be more data on an audio CD than most people know. One thing is that on some CDs there are also graphics that can be displayed on a system capable of displaying some rudimentary graphics, such as karaoke machines. The format is called CD+G (CD plus graphics) and it can be recognized (amongst others) by the compact disc logo with the word "graphics" included. So I wanted to know: how many CD+G releases are there in Discogs? Plus, how many have possibly gone undetected and can I detect them? In the format section there is a checkbox for CD+G, but some people have entered it into the free text field (which is incorrect), so I checked both. There are currently 352 releases in the database where the release either has the CD+G checkbox checked (350 releases) or where it is in the free text field (2 releases, but fixed now)

EMI catalog numbers and countries (1)

When I was a bit more fanatic about collecting records of a certain artist who was on EMI (and related labels) I quickly learned how EMI's labels worked and it all seemed very clear: catalog numbers starting with 5C mean The Netherlands, 1A is either The Netherlands or Europe (in the 1980s), 5A and 1C are Germany, 4C is Belgium, 2C France, and so on, as mentioned on this page in the Discogs reference site. So I thought that it would be trivial to cross reference EMI label numbers and countries in Discogs to see if they would match and clearly see if anyone would have entered the wrong data and then rightfully scold them for it. But reality turns out to be a bit more complicated than anticipated. What I did is that I looked at all the releases where the catalog number starts with either "5C" or "5c" and checked if the country is The Netherlands. I could find a bit over 6400 releases from the Netherlands: Distribution of Dutch releases with EMI
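
The naive cross reference described here could look something like the sketch below. The prefix table only contains the prefixes mentioned in this post, the country strings are assumed to match how Discogs spells them, and the catalog number in the example is made up. As the post says, reality is more complicated, so a mismatch is only a hint.

```python
# Prefixes and countries as mentioned in the post (not a complete list).
EMI_PREFIX_COUNTRIES = {
    "5C": {"Netherlands"},
    "1A": {"Netherlands", "Europe"},
    "5A": {"Germany"},
    "1C": {"Germany"},
    "4C": {"Belgium"},
    "2C": {"France"},
}


def country_matches_prefix(catalog_number, country):
    """Compare the country of a release with its EMI catalog number prefix.

    Returns True or False for known prefixes and None when the prefix
    is not in the table. The prefix is matched case-insensitively, so
    "5c" and "5C" are treated the same.
    """
    prefix = catalog_number.strip()[:2].upper()
    expected = EMI_PREFIX_COUNTRIES.get(prefix)
    if expected is None:
        return None
    return country in expected


# Made-up catalog number with the Dutch prefix.
print(country_matches_prefix("5c 062-00000", "Netherlands"))
```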