Discogs makes its database available under a CC0 license, but one important aspect of the data is missing: the edit history. For me as a software engineer, having the edit history available is very natural, as it allows me to dive into the history and perhaps see where people went wrong, when certain errors were introduced, and so on. On Discogs this is only possible if I am logged into the website and go to the history page of a release, where I am presented with a view of the edit history without being able to interact with it (run queries, and so on), which I find frustrating.
The edit history is a very rich source of information. For each release a lot of information is kept, such as:
- creation date/modification date
- origin (whether it was created fresh, or if "copy to draft" was used)
- merge information (if merged with another release)
- content of edits
- contributors and votes
I want to walk through a few scenarios that I would love to be able to research with this data.
Creation date/modification date
Having access to the creation and modification dates of each release would allow interesting information to be uncovered, such as how many releases were added or changed per month, day or even hour, and where the hotspots of activity (or inactivity) are.
Right now this information can only be aggregated at a very coarse level when looking at the monthly dumps (basically: what is the first release in the dump file and what is the last?).
When using the API it is possible to get a little more information, as the JSON retrieved from the API contains an element called 'date_added', but this would require downloading all the information via the API, which can take a very long time: downloading 9 million releases from the API using an authenticated connection would take approximately 100 days (unless you are using a distributed downloader).
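The 100 day figure follows from simple arithmetic. Here is a minimal sketch, assuming one API request per release and a sustained rate of about 60 requests per minute. That rate is my assumption for an authenticated client, so adjust it to whatever limit actually applies to you.

```python
def days_to_download(num_releases: int, requests_per_minute: float) -> float:
    """Days needed to fetch every release once, at one request per release."""
    minutes = num_releases / requests_per_minute
    return minutes / (60 * 24)


# Assumed values: 9 million releases, 60 requests per minute.
estimate = days_to_download(9_000_000, 60)
print(f"{estimate:.1f} days")  # prints "104.2 days"
```

Doubling the request rate halves the time, which is why a distributed downloader helps.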
Then, to verify which releases have been changed, you would either have to wait until you can cross-correlate with the monthly data dump, or download all the data again and check when releases were updated (and for more than 95% of the releases the answer would be that the release was not changed). By the time you are done some of the releases will have changed again, but you won't know which ones, so you would have to start all over again. If only Discogs could make available a list of when each release was last changed, it would make this a lot easier.
A file with information about releases and their last change date could be as simple as a file with two tab-separated columns: a release number and a UNIX epoch timestamp. When compressed with gzip such a file would be only about 30 MiB in size for the current database (I tested this with a mockup).
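To show how little code a consumer of such a file would need, here is a minimal sketch. It assumes the layout described above (one release per line: release number, a tab, UNIX epoch timestamp, gzip compressed). No such file exists today, so the function names and the file name in the usage note are hypothetical.

```python
import gzip
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Iterator, List, Tuple


def read_last_changed(path: str) -> Iterator[Tuple[int, int]]:
    """Yield (release_id, unix_timestamp) pairs from a gzipped two-column TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            release_id, timestamp = line.split("\t")
            yield int(release_id), int(timestamp)


def changed_since(rows: Iterable[Tuple[int, int]], last_sync: int) -> List[int]:
    """Return the ids of releases changed after the last_sync timestamp."""
    return [release_id for release_id, timestamp in rows if timestamp > last_sync]


def changes_per_month(rows: Iterable[Tuple[int, int]]) -> Counter:
    """Count last-change timestamps per calendar month (UTC), keyed as YYYY-MM."""
    counts: Counter = Counter()
    for _, timestamp in rows:
        month = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")
        counts[month] += 1
    return counts
```

With this, something like `changed_since(read_last_changed("last_changed.tsv.gz"), last_sync)` would give exactly the releases that need to be downloaded again, instead of all 9 million.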
Origin of new releases and merging releases
In Discogs there are two ways to add a release to the database:
- create a new release from scratch
- copy an existing release and use it as a template ("copy to draft")
Knowing how a release was created would make it much easier to track changes that should be applied to multiple releases. To speak in software engineering terms: a "copy to draft" can be seen as a fork. Edits to the release can be seen as patches. In the software world it is common to cherry-pick changes and port them to other branches of the software. This might also be useful for releases in Discogs. It often happens that a variant is added using "copy to draft". Things are then added to or changed in the fork, but not in the original, or vice versa. For some of the information this makes total sense (like catalog numbers, country, and so on), but for other information (artists, titles of tracks, and so on) it often does not, and the information should be backported or forward-ported to the other release.
If the history could actually be followed this would be a lot easier to detect.
Similarly, information about merges (when two duplicate releases are merged into one) would help enormously in tracking these changes in a consistent way.
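To make the fork and patch analogy a bit more concrete, here is a minimal sketch that compares a release with a copy made from it and reports the differences that look like candidates for porting. The dictionaries, the field names and the set of fields that are expected to differ between variants are all made up for illustration and do not reflect the actual Discogs data model.

```python
from typing import Any, Dict, Set, Tuple

# Fields that legitimately differ between variants (assumed, for illustration).
VARIANT_FIELDS: Set[str] = {"catno", "country"}


def port_candidates(
    original: Dict[str, Any], fork: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (original_value, fork_value)} for differing shared fields."""
    candidates = {}
    for field in sorted(set(original) | set(fork)):
        if field in VARIANT_FIELDS:
            continue  # expected to differ between variants
        old, new = original.get(field), fork.get(field)
        if old != new:
            candidates[field] = (old, new)
    return candidates


# Hypothetical example: a typo in the artist name was fixed only in the fork.
original = {"title": "Example Album", "artists": ["Example Artst"],
            "catno": "EX 001", "country": "UK"}
fork = {"title": "Example Album", "artists": ["Example Artist"],
        "catno": "EX 001 CD", "country": "US"}

print(port_candidates(original, fork))
# prints {'artists': (['Example Artst'], ['Example Artist'])}
```

Without the origin information there is no reliable way to know which pairs of releases to compare in the first place, which is the whole point.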
Contents of edits
To be able to port changes from one release to another, the edits themselves need to be available for mining. It would also be interesting to see how many edits releases have on average (my gut feeling: very few).
I am pretty sure that looking at this there would be a few surprises, especially when combined with information about contributors and votes.
Contributors and votes
It would be interesting to be able to see which contributors did what. When using the API you can already get a bit of information, as the people who contributed are listed (this is not in the monthly XML data dump), but it doesn't specify who changed what, who the original submitter was, and so on. What I would find interesting is to see if there are certain users who keep making specific errors that other users then fix.
As an example: I know that there is one user who added hundreds of releases with watermarked images and that there was one other user who then disabled all the watermarked images. I have also seen users who consistently add releases by forking an older release and then don't add them to the corresponding "master" release (and this is why having information about how releases are created would be so useful).
There might also be users whose contributions are consistently rejected because they are, for example, hijacking releases. Or some users might get lots of comments and votes about changes that need to be made, but ignore these and add more releases instead (also with errors).
Being able to detect these instances automatically and earlier would be useful for database hygiene.
So please Discogs, let us have some more information (of course under an acceptable license). I am sure you will not regret it.