I have already talked a few times about my love/hate relationship with Discogs. I love it because there is so much data, but I dislike the quality of that data.
I like that the Discogs blog now finally has frequent posts about how the database evolves, with pretty graphs (as I told some staff members they should consider), but I don't like that those posts don't look at errors or at improving quality. I also don't like that they don't publish the actual data sets used to generate those pictures, which brings me to the core of this post:
What I love is that (most of) the data is available under a CC0 license (and others share this view), but as I have ranted about before, this is not actually all of the data from the catalog (I am not interested in the sales data). Specifically, all of the historical edit information is missing. That history potentially contains very valuable hints about how releases have evolved over time, and it would let people run very specialized searches to identify which parts of the database are not receiving the love they need, or to find problematic patterns. Instead, we are given a monthly snapshot in a difficult-to-process format (more than 6 GiB of gzip-compressed XML) that lacks essential information that Discogs has access to but I don't, for example when a release was first added to the database or last changed. I could potentially get this information from the API, but hammering the Discogs API for updates (or even initial data) on 10 million+ records is discouraging, and the whole setup feels like a form of openwashing.
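To give an idea of what "difficult to process" means in practice: a dump of this size cannot sensibly be loaded into memory in one go, so it has to be streamed. Below is a minimal sketch of how that could look in Python, using only the standard library. The function name `iter_records` is made up for this example, and the element name `release` and the idea that each record carries its fields as direct children are assumptions about the dump layout, not something taken from official documentation.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_records(path, tag="release"):
    """Stream a gzip-compressed XML file, yielding one record at a time.

    Yields (attributes, fields) where `fields` maps each direct child
    tag to its text. The default tag name "release" is an assumption
    about the dump layout.
    """
    with gzip.open(path, "rb") as handle:
        context = ET.iterparse(handle, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root element
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                # Note: repeated child tags overwrite each other here;
                # a real script would need to handle nested structures.
                fields = {child.tag: (child.text or "") for child in elem}
                yield dict(elem.attrib), fields
                # Drop already-processed records so memory use stays flat.
                root.clear()
```

Something like `for attrs, fields in iter_records("releases.xml.gz"): ...` (the file name is a placeholder) would then walk the snapshot record by record. Note that this only makes the snapshot readable; it does nothing to recover information that is not in the snapshot in the first place.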
I do understand that there may be technological and infrastructural challenges to making all of this data available (although I don't know this for sure, as I haven't actually seen the real data), but right now it is not a level playing field, even though Discogs makes it seem so. Not happy with that.