I am going to talk about the monthly statistics again, but I am going to skip a few months: that's right, no drill-down statistics for January, February and March 2020. The reason is that Discogs changed the dump format, and I first needed to fix my scripts and give them a good brush-up.
The dump file that I researched contains the data as it was on May 1, 2020 (I think as of 00:00 UTC, but I am not entirely sure). I compared this to the database from April 1, 2020.
- there were 12,458,766 releases, up from 12,302,145 releases, so that's 156,621 more
- 159,735 releases were added
- 3,114 releases were removed
- 11,676,243 releases stayed the same
- 622,788 releases were modified
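The numbers above are consistent with each other: 159,735 added minus 3,114 removed gives the net growth of 156,621, and 11,676,243 unchanged plus 622,788 modified gives the 12,299,031 releases present in both dumps. A minimal sketch of how such a comparison could be computed follows. It is not the script used for these statistics: it assumes each dump has already been reduced to a mapping from release id to a fingerprint of the release's XML, and the names `digest` and `compare_snapshots` are made up for illustration.

```python
import hashlib


def digest(xml_text: str) -> str:
    """Return a stable fingerprint of one release's serialized XML."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def compare_snapshots(old: dict[int, str], new: dict[int, str]) -> dict[str, int]:
    """Count added, removed, unchanged and modified releases.

    Both arguments map a release id to a fingerprint of that release.
    """
    old_ids, new_ids = set(old), set(new)
    common = old_ids & new_ids
    # A release present in both dumps is modified if its fingerprint differs.
    modified = sum(1 for rid in common if old[rid] != new[rid])
    return {
        "added": len(new_ids - old_ids),
        "removed": len(old_ids - new_ids),
        "unchanged": len(common) - modified,
        "modified": modified,
    }


# Toy data standing in for the April 1 and May 1 dumps.
april = {
    1: digest("<release id='1'>a</release>"),
    2: digest("<release id='2'>b</release>"),
    3: digest("<release id='3'>c</release>"),
}
may = {
    1: digest("<release id='1'>a</release>"),
    2: digest("<release id='2'>B</release>"),
    4: digest("<release id='4'>d</release>"),
}

counts = compare_snapshots(april, may)
print(counts)
```

A useful sanity check on any such run is that the difference in total size equals added minus removed, as it does for the April figures.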
The modifications in the database were distributed across the data as follows:
Releases in Discogs changed in April 2020
As I found out recently, certain edits, such as changes to the YouTube links, are not relevant for me. The 268,187 releases with only irrelevant edits can be found here:
Releases in Discogs changed in April 2020 with only YouTube link edits
When only looking at relevant changes, it looks like this:
Releases in Discogs changed in April 2020 with relevant changes
The sections that were changed in these releases:
- changed identifiers: 80378
- changed companies: 78147
- changed tracklist: 72928
- changed images: 71951
- changed extraartists: 63530
- changed formats: 50135
- changed data_quality: 48549
- changed notes: 39229
- changed labels: 34148
- added master_id: 29704
- added images: 18137
- changed title: 17626
- changed artists: 16958
- added notes: 14630
- changed styles: 11448
- changed released: 8648
- added released: 7334
- changed genres: 6675
- changed master_id: 6447
- added styles: 5690
- changed country: 4212
- removed notes: 3836
- removed released: 3023
- added country: 1766
- removed images: 1010
- removed master_id: 800
- removed styles: 354
- removed country: 337
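A per-section breakdown like the one above could be produced along the following lines. This is only a sketch under stated assumptions, not the actual script: each release is assumed to be parsed into a dict keyed by section name, the YouTube links are assumed to live in a section called `videos`, and `classify_changes` and `tally` are hypothetical helper names.

```python
from collections import Counter

# Assumption: YouTube links live in a "videos" section of the parsed release.
IRRELEVANT_FIELDS = {"videos"}


def classify_changes(old: dict, new: dict) -> dict[str, str]:
    """Map each differing section to 'added', 'removed' or 'changed'."""
    kinds = {}
    for field in old.keys() | new.keys():
        before, after = old.get(field), new.get(field)
        if before == after:
            continue
        if before is None:
            kinds[field] = "added"
        elif after is None:
            kinds[field] = "removed"
        else:
            kinds[field] = "changed"
    return kinds


def tally(pairs):
    """Return (per-section counts, number of releases with only irrelevant edits)."""
    counts = Counter()
    only_irrelevant = 0
    for old, new in pairs:
        kinds = classify_changes(old, new)
        relevant = {f: k for f, k in kinds.items() if f not in IRRELEVANT_FIELDS}
        if kinds and not relevant:
            only_irrelevant += 1
        counts.update(f"{kind} {field}" for field, kind in relevant.items())
    return counts, only_irrelevant


# Toy (old, new) pairs standing in for releases present in both dumps.
pairs = [
    ({"title": "A", "notes": "x"}, {"title": "B"}),
    ({"title": "C"}, {"title": "C", "country": "NL"}),
    ({"title": "D", "videos": ["u1"]}, {"title": "D", "videos": ["u1", "u2"]}),
]
section_counts, youtube_only = tally(pairs)
for label, count in section_counts.most_common():
    print(f"{label}: {count}")
```

Releases whose only differences fall in the irrelevant sections are counted separately and contribute nothing to the per-section tally, which mirrors the split between the YouTube-only releases and the relevant changes.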
Most of the changes were very small. Using TLSH I could compute the distance between the old and new versions of the XML elements that were changed in each release. Quite a few were too small for TLSH, but of the ones where it could be computed, the 20 most frequent distances were as follows:
- TLSH distance 0: 204573
- TLSH distance 1: 4855
- TLSH distance 3: 4175
- TLSH distance 4: 4029
- TLSH distance 2: 3713
- TLSH distance 5: 3683
- TLSH distance 6: 3265
- TLSH distance 7: 2821
- TLSH distance 8: 2516
- TLSH distance 9: 1947
- TLSH distance 10: 1706
- TLSH distance 17: 1533
- TLSH distance 16: 1517
- TLSH distance 18: 1492
- TLSH distance 19: 1437
- TLSH distance 20: 1416
- TLSH distance 11: 1397
- TLSH distance 15: 1328
- TLSH distance 21: 1235
- TLSH distance 14: 1179
The lower the TLSH distance, the closer the data from two releases is to each other. This means that most of the changes are very small. There are some releases with massive edits, but only very few of them.
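The frequency table above can be built with a simple counter once a distance is available for each pair of old and new elements. The sketch below takes the distance function as a parameter and uses a crude stand-in (`toy_distance`, which is not TLSH) so that it runs without extra packages. The assumption is that the real thing would wrap the `tlsh.hash` and `tlsh.diff` calls from the py-tlsh bindings and return `None` when the input is too small to hash. All function names here are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def tally_distances(
    pairs: Iterable[tuple[bytes, bytes]],
    distance: Callable[[bytes, bytes], Optional[int]],
    top: int = 20,
) -> tuple[list[tuple[int, int]], int]:
    """Return the `top` most frequent distances and the number of skipped pairs."""
    counts = Counter()
    skipped = 0
    for old, new in pairs:
        d = distance(old, new)
        if d is None:
            # The distance could not be computed, e.g. the input was too small.
            skipped += 1
        else:
            counts[d] += 1
    return counts.most_common(top), skipped


def toy_distance(old: bytes, new: bytes) -> Optional[int]:
    """Stand-in for TLSH: refuse short inputs, otherwise count differing bytes."""
    if min(len(old), len(new)) < 8:
        return None
    return sum(a != b for a, b in zip(old, new)) + abs(len(old) - len(new))


# Toy (old, new) element pairs; the last one is too short for the stand-in.
pairs = [
    (b"<notes>first pressing</notes>", b"<notes>first pressing</notes>"),
    (b"<notes>first pressing</notes>", b"<notes>First pressing</notes>"),
    (b"<notes>second pressing</notes>", b"<notes>second pressing</notes>"),
    (b"<a/>", b"<b/>"),
]
ranking, skipped = tally_distances(pairs, toy_distance)
for dist, count in ranking:
    print(f"distance {dist}: {count}")
```

Sorting by frequency rather than by distance explains why the list above is not in strictly ascending order of distance.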
Smells
I found 3062 possible smells in newly added releases (ignoring tracklist errors, which remain the bulk of the errors), which is quite a bit more than earlier months (this is most likely the effect of the lockdown). Like before: a bit less than half are label codes, around a quarter are SID codes, and then the rest, so it is very similar to previous months.