The new monthly datadump is available, so I downloaded it and processed it with my scripts to see what happened in September 2017.
Release statistics
The new dumpfile was published on October 4 2017 and has 8,996,419 releases. The previous dump (published September 4 2017) had 8,878,391 releases. That means 118,028 more releases in the database.
- 3,158 releases were removed in the new dumpfile
- 121,186 releases were added in September
- 8,456,324 releases remained the same
- 418,909 releases were changed
- 205 releases had the status Draft, Deleted or Rejected set
- 11 releases that were not Accepted were present in both the September 2017 and October 2017 data dump
- 1 release moved from Draft to Accepted
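How the two dumps were compared is not shown here, so the following is only a minimal sketch of one way to arrive at the four categories in the list above (removed, added, unchanged, changed). It assumes each dump has already been reduced to a mapping from release id to a hash of that release's content; the names `fingerprint` and `compare_dumps` are made up for illustration and are not taken from my scripts.

```python
import hashlib


def fingerprint(release_xml: str) -> str:
    """Return a stable hash of one release's serialized content."""
    return hashlib.sha256(release_xml.encode("utf-8")).hexdigest()


def compare_dumps(old: dict[int, str], new: dict[int, str]) -> dict[str, int]:
    """Classify releases by comparing two {release_id: fingerprint} maps."""
    old_ids, new_ids = set(old), set(new)
    shared = old_ids & new_ids
    # A release present in both dumps counts as changed if its hash differs.
    changed = sum(1 for rid in shared if old[rid] != new[rid])
    return {
        "removed": len(old_ids - new_ids),
        "added": len(new_ids - old_ids),
        "changed": changed,
        "unchanged": len(shared) - changed,
    }


# Toy data: release 1 disappears, 3 is edited, 4 is new, 2 stays the same.
old = {1: fingerprint("a"), 2: fingerprint("b"), 3: fingerprint("c")}
new = {2: fingerprint("b"), 3: fingerprint("c, edited"), 4: fingerprint("d")}
result = compare_dumps(old, new)
print(result)
```

The reported figures are consistent with this kind of bookkeeping: 121,186 added minus 3,158 removed gives the net growth of 118,028, and 8,456,324 unchanged plus 418,909 changed plus 3,158 removed adds up to the 8,878,391 releases in the previous dump.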
In September 2017 there was roughly one edit in the Discogs catalogue every five seconds.
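The "every five seconds" figure can be roughly reproduced from the numbers above. This is a back-of-the-envelope sketch that assumes every added, changed or removed release counts as exactly one edit; the dumps do not expose individual edits, so the real edit count may well be higher.

```python
# Assumption: September has 30 days and each touched release is one edit.
SECONDS_IN_SEPTEMBER = 30 * 24 * 60 * 60

added = 121_186
changed = 418_909
removed = 3_158

touched = added + changed + removed
seconds_per_edit = SECONDS_IN_SEPTEMBER / touched

print(f"{touched:,} releases touched, one every {seconds_per_edit:.1f} seconds")
# prints: 543,253 releases touched, one every 4.8 seconds
```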
Releases with known smells
Using another script I looked at some of the known smells in the data. The good news is that the number seems to be going down. There were around 15,000 fewer known smells on October 1 than on September 1, even though there were more releases (but there are still around 218,000 releases where I know there is a problem).
On October 1 2017 there were 15,000 fewer releases with known smells than on September 1 2017.
One field with fewer errors was the Depósito Legal field, with about 7,500 fewer bugs. Almost 4,000 SPARS Code fields were corrected.
Of course this is just the low-hanging fruit: it is very easy to find out that something that is listed is syntactically incorrect (wrong field, wrong formatting, and so on), but not whether it is factually correct or whether information is missing. The numbers above should therefore always be taken with a grain of salt, as many more releases are in desperate need of improvement. Doing more thorough checks requires a lot more (human) effort, but it still seems to be heading in the right direction, so I am quite positive about what I see.
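To make the difference between a syntactic and a factual check concrete, here is a hedged sketch of what a purely syntactic smell check could look like for the SPARS Code field. It assumes a SPARS code is exactly three letters, each either A (analogue) or D (digital); the function name `spars_smell` is hypothetical and this is not the actual check from my scripts.

```python
import re

# Assumption: a valid SPARS code is three letters, each "A" or "D" (e.g. "AAD").
SPARS_PATTERN = re.compile(r"[AD]{3}")


def spars_smell(value: str) -> bool:
    """Return True if the value does not look like a SPARS code."""
    return SPARS_PATTERN.fullmatch(value.strip()) is None


for candidate in ["DDD", "AAD", "ADDD", "Stereo"]:
    print(candidate, "smell" if spars_smell(candidate) else "ok")
```

A check like this flags "ADDD" or "Stereo" in a SPARS Code field, but it will happily accept "DDD" on a release that really is "AAD". Catching that needs someone to look at the actual release.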
Also, the scripts that I wrote are not perfect and do not catch everything. There are many more checks that can be done, and will be done in the future, so I am expecting the numbers to go up again.
As soon as the dump with the data of October has been released (in early November) I will rerun the (improved) scripts (for both old and new data) and see if there is continued improvement.