Recently Discogs changed its data dump format, making it more difficult to compare releases the way I used to, as described in a previous blog post.
So, the only thing I could do: dive into the XML a bit more and compare elements in the XML output for releases that have been changed. The results are quite interesting.
I took the data dumps of March 1 2020 and April 1 2020, split them into individual XML files (one per release), computed SHA256 checksums for each of these files and ignored the ones that were the same in both datasets, leaving me with 531,444 releases to look at:
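The checksum comparison could look roughly like the sketch below. It assumes each dump has already been split into a directory with one XML file per release, and that a release has the same file name in both directories. The directory layout and the function names are my own illustration, not a description of the actual scripts. Files that exist in only one of the two dumps are skipped here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_releases(old_dir: Path, new_dir: Path) -> list[str]:
    """List file names present in both directories whose checksums differ.

    Assumption: one XML file per release, same file name in both dumps.
    """
    changed = []
    for new_file in sorted(new_dir.glob("*.xml")):
        old_file = old_dir / new_file.name
        if old_file.is_file() and sha256_of(old_file) != sha256_of(new_file):
            changed.append(new_file.name)
    return changed
```

Calling `changed_releases(Path("march"), Path("april"))` with two such (hypothetical) directories would return the file names that need a closer look.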
Releases that were changed in March 2020
This graph is the same graph that I have been seeing for the last 2.5 years, but what I wanted to know is: which releases have relevant changes, and where are those changes? Personally I don't think that the YouTube videos are very relevant, and since they were the biggest change in the Discogs data dumps I decided to filter them out and then compare.
What I did is the following:
- read the XML for the release in both data dumps
- compare each element that is a direct child of the top level "release" element
- ignore the "videos" element
- record if any elements were added, removed or changed
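As a rough sketch of these steps, using Python's standard `xml.etree.ElementTree`: the function names are made up for illustration, and the sketch assumes that each tag occurs at most once as a direct child of "release" and that comparing the serialized text of each child is a good enough test for "changed".

```python
import xml.etree.ElementTree as ET

# Elements that are skipped in the comparison.
IGNORED = {"videos"}


def children_by_tag(release_xml: str) -> dict[str, str]:
    """Map each direct child tag of <release> to its serialized XML.

    Assumption: every tag occurs at most once as a direct child.
    """
    root = ET.fromstring(release_xml)
    return {
        child.tag: ET.tostring(child, encoding="unicode").strip()
        for child in root
        if child.tag not in IGNORED
    }


def diff_release(old_xml: str, new_xml: str) -> set[tuple[str, str]]:
    """Return (change, element) pairs such as ("changed", "title")."""
    old = children_by_tag(old_xml)
    new = children_by_tag(new_xml)
    changes = set()
    for tag in old.keys() | new.keys():
        if tag not in old:
            changes.add(("added", tag))
        elif tag not in new:
            changes.add(("removed", tag))
        elif old[tag] != new[tag]:
            changes.add(("changed", tag))
    return changes
```

An empty result means the two versions of a release are identical once "videos" is ignored.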
As it turns out, quite a few releases that were marked as different are actually identical if you ignore the "videos" element: this was the case for 224,352 of the 531,444 releases. That leaves 307,092 releases that had relevant changes:
Releases that had relevant changes in March 2020
The graph is interesting to see, as it is quite different to the other graph that I am quite familiar with. It seems like most of the changes in old releases are actually only for YouTube videos:
Releases with only YouTube video changes in March 2020
What I wanted to know next is where these changes were made in the releases. I recorded three types of changes:
- element removed
- element added
- element changed
and also which element was changed.
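Counting these change/element pairs over all releases is a small job for `collections.Counter`. The sketch below assumes that the changes for each release are available as a collection of (change, element) tuples. That representation, and the function name, are my own choice for illustration.

```python
from collections import Counter


def tally_changes(per_release_changes):
    """Count how often each (change, element) pair occurs across releases."""
    counts = Counter()
    for changes in per_release_changes:
        counts.update(changes)
    return counts


def format_tally(counts):
    """Return one "- change element: count" line per pair, most common first."""
    return [f"- {kind} {tag}: {n}" for (kind, tag), n in counts.most_common()]
```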
There were 29 different change/element pairs. From most to least changes:
- changed identifiers: 69427
- changed tracklist: 63399
- changed companies: 62219
- changed images: 59756
- changed extraartists: 54986
- changed data_quality: 42555
- changed formats: 40751
- changed notes: 33513
- changed labels: 31066
- added master_id : 27111
- added images : 15907
- changed artists : 15472
- changed title : 14463
- added notes : 12813
- changed styles : 10165
- changed released : 7535
- changed genres : 6597
- added released : 6214
- added styles : 5759
- changed master_id : 5139
- changed country : 4069
- removed notes : 3495
- removed released : 2353
- added country : 1629
- removed images : 1047
- removed master_id : 898
- removed styles : 329
- removed country : 260
- added genres : 1
So the most common type of change is definitely "identifiers", which translates to "Barcode and Other Identifiers" on the Discogs webpage. This is not entirely unexpected. What is interesting to see is that very few sections are removed completely, and if a section is removed completely it is mostly "notes" (which makes sense, as there are plenty of releases with bogus notes) and "released" (release dates/years).
There are probably still some other tweaks possible here. For example, personally, I think that an element such as "data_quality" doesn't add much information, except which releases to possibly ignore (I have ranted about this before) and there were a bit over 12,500 releases where the only change was this particular field. Others that I would probably also ignore: releases where only "genres" (264 releases) or "styles" (1480 releases) were changed.
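Finding the releases where a single field is the only thing that changed is a simple filter. A minimal sketch, again assuming that the changes of each release are stored as (change, element) tuples (my own representation, with hypothetical function names):

```python
def only_changed(changes, tag):
    """True if the single recorded change is ("changed", tag)."""
    return set(changes) == {("changed", tag)}


def count_only_changed(per_release_changes, tag):
    """Count releases whose only change is a change to the given element."""
    return sum(1 for changes in per_release_changes if only_changed(changes, tag))
```

For example, `count_only_changed(all_changes, "data_quality")` would give the number of releases that could be skipped if "data_quality" is treated as irrelevant, where `all_changes` stands for whatever list of per-release changes was collected earlier.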
My next goal will be to dive a bit deeper into how big the differences are using TLSH. I will leave that for next time.