One thing I keep asking myself is how big the changes to Discogs releases typically are: are they just small tweaks, or really big changes? I decided to measure the changes to releases between two consecutive monthly data dumps using a very simple method: looking at the size of the differences.
I decided not to go for a very advanced approach that would require me to parse the XML and compare each field, but instead to go for a much simpler approach and look at the size of the differences in the raw XML data.
I took two data dumps from the Discogs data website (namely discogs_20191003_releases.xml.gz and discogs_20191101_releases.xml.gz). I split these two dumps into individual XML files (one per release) and named each file after its release number. I then compared the files from the two dumps with a regular cryptographic hash (SHA-256) to see which files were different.
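The SHA-256 comparison step can be sketched roughly as follows. This is a minimal sketch, not the script I actually ran: it assumes the per-release files from each dump sit in two separate directories and share the same file name (for example 12345.xml), and the function names `sha256_of` and `changed_releases` are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_releases(old_dir: Path, new_dir: Path) -> list[str]:
    """List file names present in both directories whose contents differ.

    Assumption: both directories use the same naming scheme, one file
    per release. Releases that exist in only one dump are skipped.
    """
    changed = []
    for old_file in sorted(old_dir.glob("*.xml")):
        new_file = new_dir / old_file.name
        if new_file.is_file() and sha256_of(old_file) != sha256_of(new_file):
            changed.append(old_file.name)
    return changed
```

A cryptographic hash only says whether two files differ, not by how much, which is why a second, distance-based comparison is needed for the files that come out of this step.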
The files that turned out to be different (439,651 in total) were then compared using a locality sensitive hash, specifically TLSH, as I am quite familiar with it. With this hashing algorithm you can compare two hashes and compute a number. The lower the number that TLSH gives you, the closer the two pieces of data are. Translating it to the Discogs domain: the lower the TLSH number, the smaller the differences between the two versions of a release. So this allows you to compute a distance.
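Computing such a distance could look something like the sketch below. It assumes the third-party py-tlsh Python bindings (`tlsh.hash` and `tlsh.diff`), which may not be what you have installed; the function names `tlsh_distance` and `is_small_edit` are hypothetical, and the default threshold of 20 is my own rule of thumb rather than anything defined by TLSH.

```python
from typing import Optional


def tlsh_distance(old_data: bytes, new_data: bytes) -> Optional[int]:
    """Return the TLSH distance between two byte strings.

    Returns None when TLSH cannot hash one of the inputs (it needs a
    minimum amount of sufficiently varied data).
    """
    # Assumption: the py-tlsh package is installed. It is imported here,
    # inside the function, so the helper below works without it.
    import tlsh

    old_hash = tlsh.hash(old_data)
    new_hash = tlsh.hash(new_data)
    # Depending on the version, an unusable hash is reported as an
    # empty string or as "TNULL".
    if old_hash in ("", "TNULL") or new_hash in ("", "TNULL"):
        return None
    return tlsh.diff(old_hash, new_hash)


def is_small_edit(distance: int, threshold: int = 20) -> bool:
    """Treat a distance at or below the threshold as a small edit."""
    return distance <= threshold
```

Very short release files may not be hashable at all, so a real run would need to decide what to do with the `None` cases.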
Based on previous experience with TLSH I would say that a score of 20 or lower means that the two pieces of data are very close. My gut feeling was that a lot of the changes in Discogs would actually be very small, and it turns out that this is indeed the case. The 15 most common TLSH distances, with the number of releases found at each distance, are:
- distance: 1, releases: 25858
- distance: 5, releases: 14788
- distance: 4, releases: 14611
- distance: 6, releases: 14150
- distance: 3, releases: 13893
- distance: 7, releases: 13608
- distance: 8, releases: 12611
- distance: 2, releases: 12288
- distance: 9, releases: 11291
- distance: 10, releases: 10189
- distance: 11, releases: 8993
- distance: 12, releases: 7791
- distance: 13, releases: 6933
- distance: 19, releases: 6848
- distance: 18, releases: 6813
So indeed: a very significant portion of the edits in Discogs are small edits. The 15 distances listed above, all below 20, together cover 180,665 of the 439,651 compared releases (about 41%). I haven't yet looked at where in the database these small edits happen, or whether older releases see bigger or smaller edits than newer releases. I also don't know yet which parts of a release are updated the most. I will leave that for another time.