Posts

Showing posts from May, 2020

What happened in Discogs in April 2020?

I am going to talk about the monthly statistics again, but I am going to skip a few months: that's right, no drill-down statistics for January, February and March 2020. The reason is that Discogs changed the dump format and I first needed to fix my scripts and give them a good brush-up. The dump file that I examined contains the data as it was on May 1 2020 (I think as it was at 00:00 UTC, but I am not entirely sure). I compared this to the database from April 1 2020:

- there were 12,458,766 releases, up from 12,302,145 releases, so that's 156,621 more
- 159,735 releases were added
- 3,114 releases were removed
- 11,676,243 releases stayed the same
- 622,788 releases were modified

The modifications in the database were distributed across the data as follows:

Releases in Discogs changed in April 2020

As I found out recently, there are certain edits that are not very relevant to me, such as the YouTube links. The 268,187 releases with only irrelevant edits can...
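The four categories in the list above (added, removed, unchanged, modified) can be illustrated with a small sketch. It assumes each dump has already been reduced to a mapping from release id to some checksum of that release's data; the function name and the dictionary layout are hypothetical and not taken from the scripts used for these statistics.

```python
def classify_releases(old, new):
    """Compare two snapshots, each a dict of release id -> checksum (assumed layout)."""
    old_ids, new_ids = set(old), set(new)
    common = old_ids & new_ids
    # A release counts as modified when it exists in both snapshots
    # but its checksum differs.
    modified = {rid for rid in common if old[rid] != new[rid]}
    return {
        "added": new_ids - old_ids,
        "removed": old_ids - new_ids,
        "modified": modified,
        "unchanged": common - modified,
    }


# Tiny made-up snapshots, only to show the shape of the result.
april = {1: "aa", 2: "bb", 3: "cc"}
may = {2: "bb", 3: "c2", 4: "dd"}
result = classify_releases(april, may)

# Consistency check on the numbers reported in the post:
# net growth equals releases added minus releases removed.
net_growth = 159_735 - 3_114
```

The same bookkeeping also explains why the categories add up: removed, unchanged and modified releases together account for every release in the older snapshot.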

Are ISRC codes being fixed? Yes they are! (3)

Time to return to an older subject: ISRC codes. I have written about ISRC codes before, so if you don't know what I am talking about, you might want to read the older posts first and then return to this one. I wondered how much the number of obvious ISRC errors had gone down in the past few months. In my previous post about this subject I saw a massive decrease:

- May 2019 dump: 22,391 releases
- June 2019 dump: 21,096 releases
- July 2019 dump: 19,214 releases
- August 2019 dump: 15,161 releases
- September 2019 dump: 13,390 releases
- October 2019 dump: 12,090 releases
- November 2019 dump: 11,360 releases
- December 2019 dump: 10,578 releases
- January 2020 dump: 8,893 releases

and I wanted to know if this trend continued:

- February 2020 dump: 6,840 releases (but this dump covered changes in January and most of February, not just January)
- March 2020 dump: 6,722 releases
- April 2020 dump: 6,468 releases
- May 2020 dump: 6,313 releases

The trend is still downwards, but the pace seems to have slowed ...
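To put numbers on the slowdown, the change between consecutive dumps can be computed from the counts listed above. This is only a sketch of the arithmetic: the counts are copied from the lists, the helper name is made up, and the February 2020 dump covers more than one month, so its drop is not directly comparable to the others.

```python
# Releases with obvious ISRC errors per dump, May 2019 through May 2020,
# copied from the lists above.
counts = [
    22_391, 21_096, 19_214, 15_161, 13_390, 12_090, 11_360,
    10_578, 8_893, 6_840, 6_722, 6_468, 6_313,
]


def dump_to_dump_changes(values):
    """Return the difference between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


changes = dump_to_dump_changes(counts)
for change in changes:
    print(change)
```

Every entry is negative, which matches the downward trend, and the last three entries are much smaller in magnitude than the ones before them.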

How big are changes in Discogs (2)?

Recently Discogs changed its data dump format, making it more difficult to compare releases the way I used to, as described in a previous blog post. So the only thing I could do was dive into the XML a bit more and compare elements in the XML output for releases that have been changed. The results are quite interesting. I took the data dumps of March 1 2020 and April 1 2020, split these data dumps into individual XML files, computed SHA256 checksums for each of these files and ignored the ones that were the same in both datasets, leaving me with 531,444 releases to look at:

Releases that were changed in March 2020

This graph is the same graph that I have been seeing for the last 2.5 years, but what I wanted to know is: what are the releases with relevant changes and where are those changes? Personally I don't think that the YouTube videos are very relevant, and since that was the biggest change in the Discogs data dumps I decided to filter these out and then compare. What I ...
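The checksum comparison described above could look roughly like the sketch below. It is not the script used for this post: instead of writing one XML file per release it hashes each release element in memory, and it assumes that every release is a `release` element carrying an `id` attribute. Function names are made up for the example.

```python
import hashlib
import io
import xml.etree.ElementTree as ET


def release_checksums(xml_stream):
    """Map release id -> SHA256 hex digest of that release's serialized XML."""
    checksums = {}
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "release":  # assumed element name
            data = ET.tostring(elem, encoding="utf-8")
            checksums[elem.get("id")] = hashlib.sha256(data).hexdigest()
            elem.clear()  # free memory; real dumps are very large
    return checksums


def changed_ids(old, new):
    """Ids present in both snapshots whose checksums differ."""
    return sorted(rid for rid in old.keys() & new.keys() if old[rid] != new[rid])


# Two tiny made-up dumps standing in for the March and April files.
march = io.BytesIO(
    b'<releases><release id="1"><title>A</title></release>'
    b'<release id="2"><title>B</title></release></releases>'
)
april = io.BytesIO(
    b'<releases><release id="1"><title>A</title></release>'
    b'<release id="2"><title>B2</title></release></releases>'
)
old_sums = release_checksums(march)
new_sums = release_checksums(april)
changed = changed_ids(old_sums, new_sums)
```

A checksum only says that something changed, not what changed, which is why the next step is to look inside the XML of the releases that differ.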

New errors introduced in existing (correct) releases in January, February, March and April 2020

Because Discogs changed the format of the data dumps and I have not yet fixed the scripts that I use, there are no monthly statistics yet. But I can still say something about how many previously correct releases actually had errors introduced. As I said (or at least implied) in the previous article, this doesn't necessarily mean that the data of the release is now more wrong: it could be that data was added that wasn't previously there and that the added data contains an error (this is what I mostly see). In January 2020 the 1,564 unique releases that had errors introduced were distributed over the data as follows:

Releases in the Discogs database where an error was introduced in January 2020 and which were previously correct

Now, it should be said that this data actually seems to cover a large part of February 2020 as well. The name of the data dump file seems to suggest that it includes data up to February 20. The 321 unique releases that were changed between February 20 202...

New errors introduced in existing data in December 2019

One thing that I am always interested in: how many releases that my scripts thought were OK now have errors? I compared the smells data of the December 2019 data dump with that of the January 2020 data dump and got the following chart (as always, the columns indicate the range of release numbers in the database: first column is everything with release number < 1,000,000, second column is everything between 1,000,000 and 2,000,000, and so on):

Releases in the Discogs database where an error was introduced in December 2019 and which were previously correct

So actually: not too bad. It is also consistent with the change patterns I have seen over the years. The reasons for the peaks for older releases and newer releases:

- older releases are expanded with new information. This also means that more errors are introduced there.
- newer releases are usually added first, and expanded later by other people. This also increases the chances of errors being introduced.
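The column layout used in the chart (one column per million release numbers) comes down to an integer division. A minimal sketch, assuming the release numbers are available as plain integers; the function name and the sample ids are made up:

```python
from collections import Counter


def bucket_counts(release_ids, bucket_size=1_000_000):
    """Count releases per range of release numbers.

    Bucket 0 holds release numbers below 1,000,000, bucket 1 holds
    1,000,000 up to 1,999,999, and so on.
    """
    counts = Counter(rid // bucket_size for rid in release_ids)
    return dict(sorted(counts.items()))


# Made-up release numbers, only to show the bucketing.
sample = [5, 999_999, 1_000_000, 14_500_000]
buckets = bucket_counts(sample)
```

Each key in the result corresponds to one column of the chart, so peaks at the low and high ends show up as large counts for the smallest and largest keys.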

What happened in Discogs in January 2020?

Normally I would have expected this post to have been published earlier, but Discogs didn't release its data earlier. In the past the data dumps would be released around the 4th or 5th of the month, but now it took until February 24th before there was a data dump. I then looked at the data and just threw it into a corner and left it there for a few months. This is going to be a little bit different from the other blog posts in this series. The reason is that the internal format seems to have changed again. In the past the dump file would grow by around 100 MiB per month, and it has done this very consistently over the years. So when I saw that in one month the archive had grown by 1 GiB (about ten times the expected increase) I knew something was up. A short inspection: a lot of the YouTube video information is now included, so comparing the XML data as I have done so far actually makes no sense at all, as with the new data that is included a release would have been marked as ...