
What happened in Discogs in April 2020?

I am going to talk about the monthly statistics again, but I am skipping a few months: that's right, no drill-down statistics for January, February and March 2020. The reason is that Discogs changed the dump format, so I first needed to fix my scripts and give them a good brush-up.

The dump file that I researched contains the data as it was on May 1 2020 (I think as it was at 00:00 UTC, but I am not entirely sure). I compared this to the database from April 1 2020.

  • there were 12,458,766 releases up from 12,302,145 releases, so that's 156,621 more
  • 159,735 releases were added
  • 3,114 releases were removed
  • 11,676,243 releases stayed the same
  • 622,788 releases were modified
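
The counts above can be reproduced with plain set arithmetic. Below is a minimal sketch, assuming each dump has already been reduced to a mapping from release id to a hash of the release's content; the function name `compare_dumps` and the toy data are hypothetical and not taken from my actual scripts.

```python
def compare_dumps(old, new):
    """Classify release ids, given two {release_id: content_hash} mappings."""
    old_ids, new_ids = set(old), set(new)
    added = new_ids - old_ids        # only in the newer dump
    removed = old_ids - new_ids      # only in the older dump
    common = old_ids & new_ids
    modified = {rid for rid in common if old[rid] != new[rid]}
    unchanged = common - modified
    return added, removed, modified, unchanged


# Toy data standing in for two monthly dumps (hypothetical values).
old = {1: "aa", 2: "bb", 3: "cc"}
new = {2: "bb", 3: "zz", 4: "dd"}
added, removed, modified, unchanged = compare_dumps(old, new)

# Consistency check on the April 2020 figures quoted above.
assert 12_302_145 + 159_735 - 3_114 == 12_458_766
assert 11_676_243 + 622_788 == 12_302_145 - 3_114
```

The two assertions at the end confirm that the published numbers add up: old total plus added minus removed gives the new total, and unchanged plus modified gives the releases present in both dumps.
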
The modifications in the database were distributed across the data as follows:

Releases in Discogs changed in April 2020

As I found out recently, certain edits are not very relevant to me, such as changes to the YouTube links. The 268,187 releases with only such irrelevant edits can be found here:

Releases in Discogs changed in April 2020 with only YouTube link edits

When only looking at relevant changes, it looks like this:

Releases in Discogs changed in April 2020 with relevant changes


The sections that were changed in these releases:

  1. changed identifiers: 80378
  2. changed companies: 78147
  3. changed tracklist: 72928
  4. changed images: 71951
  5. changed extraartists: 63530
  6. changed formats: 50135
  7. changed data_quality: 48549
  8. changed notes: 39229
  9. changed labels: 34148
  10. added master_id: 29704
  11. added images: 18137
  12. changed title: 17626
  13. changed artists: 16958
  14. added notes: 14630
  15. changed styles: 11448
  16. changed released: 8648
  17. added released: 7334
  18. changed genres: 6675
  19. changed master_id: 6447
  20. added styles: 5690
  21. changed country: 4212
  22. removed notes: 3836
  23. removed released: 3023
  24. added country: 1766
  25. removed images: 1010
  26. removed master_id: 800
  27. removed styles: 354
  28. removed country: 337
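
Labels such as "added", "changed" and "removed" can be derived by comparing the top-level sections of the old and new version of a release. The following is a rough sketch of that idea, not my actual script. It assumes each release is available as an XML string whose direct children are the sections, and that the YouTube links live in a `<videos>` element; the names `IGNORED`, `sections` and `classify` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: YouTube links are stored in <videos> and are not interesting.
IGNORED = {"videos"}


def sections(release_xml):
    """Map each top-level section tag to its serialized content."""
    root = ET.fromstring(release_xml)
    return {
        child.tag: ET.tostring(child).strip()  # strip trailing whitespace
        for child in root
        if child.tag not in IGNORED
    }


def classify(old_xml, new_xml):
    """Return labels like 'changed title' for two versions of one release."""
    old, new = sections(old_xml), sections(new_xml)
    labels = []
    for tag in sorted(old.keys() | new.keys()):
        if tag not in old:
            labels.append(f"added {tag}")
        elif tag not in new:
            labels.append(f"removed {tag}")
        elif old[tag] != new[tag]:
            labels.append(f"changed {tag}")
    return labels


old_release = ('<release id="1"><title>A</title><country>US</country>'
               '<videos><video src="x"/></videos></release>')
new_release = ('<release id="1"><title>B</title><notes>hi</notes>'
               '<videos/></release>')
labels = classify(old_release, new_release)
```

Feeding the labels for every modified release into a `collections.Counter` would give a ranking like the one above. A release whose only difference is in an ignored section yields an empty list, which is one way to separate the YouTube-only edits from the relevant ones.
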
Most of the changes were very small. Using TLSH I could compute the distance between the old and new versions of the XML elements that were changed in each release. Quite a few were too small for TLSH, but of the ones where a distance could be computed the 20 most common distances were as follows:
  1. TLSH distance 0: 204573
  2. TLSH distance 1: 4855
  3. TLSH distance 3: 4175
  4. TLSH distance 4: 4029
  5. TLSH distance 2: 3713
  6. TLSH distance 5: 3683
  7. TLSH distance 6: 3265
  8. TLSH distance 7: 2821
  9. TLSH distance 8: 2516
  10. TLSH distance 9: 1947
  11. TLSH distance 10: 1706
  12. TLSH distance 17: 1533
  13. TLSH distance 16: 1517
  14. TLSH distance 18: 1492
  15. TLSH distance 19: 1437
  16. TLSH distance 20: 1416
  17. TLSH distance 11: 1397
  18. TLSH distance 15: 1328
  19. TLSH distance 21: 1235
  20. TLSH distance 14: 1179
The lower the TLSH distance, the closer the two versions of the data are to each other. This means that most of the changes are very small. Only a few releases had massive edits.
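
A tally like the one above could be produced along the following lines. This is only a sketch: the function `distance_histogram` is hypothetical, and the hashing and distance functions are passed in as parameters so the counting logic does not depend on a particular TLSH binding. It assumes the hash function signals "too little data" with an empty value or the string `TNULL`.

```python
from collections import Counter


def distance_histogram(pairs, hash_fn, diff_fn, null_hash="TNULL"):
    """Count distances between (old, new) data pairs.

    Pairs for which no hash can be computed (too little data) are skipped
    and counted separately.
    """
    histogram = Counter()
    skipped = 0
    for old_data, new_data in pairs:
        old_hash, new_hash = hash_fn(old_data), hash_fn(new_data)
        if not old_hash or not new_hash or null_hash in (old_hash, new_hash):
            skipped += 1
            continue
        histogram[diff_fn(old_hash, new_hash)] += 1
    return histogram, skipped
```

With the `py-tlsh` package installed (an assumption about the environment), the call would look like `distance_histogram(pairs, tlsh.hash, tlsh.diff)`, where `pairs` holds the old and new bytes of each changed release, and `histogram.most_common(20)` would then give the 20 most frequent distances.
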

Smells


I found 3062 possible smells in newly added releases (ignoring tracklist errors, which remain the bulk of the errors). That is quite a bit more than in earlier months, most likely an effect of the lockdown. As before, a bit less than half are label codes, around a quarter are SID codes, and the rest are other smells, so the distribution is very similar to previous months.
