The dumps released by Discogs are great for datamining, but they are only published once a month. Because they are also not released on the first day of the month (usually on the 4th), you are at minimum a few days behind, and possibly 30+ days.
The Discogs data can also be accessed through their API, allowing monitoring of releases in a more continuous way: as soon as something is added it can be downloaded and checked.
The JSON coming from the API exposes more information than the XML from the data dump. For example, it contains timestamps for when the release was added and when it was last changed. It also contains some information about who contributed to it, and some elements (such as labels) carry extra information that is not present in the XML. The structure also seems a lot saner than the XML, which I can only process with SAX, because building a DOM of a 32 GiB XML file is just insane.
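For example, the per-release timestamps can be read straight from the parsed JSON. A minimal sketch, assuming `release` is the parsed JSON object of a single release (the field names `date_added` and `date_changed` follow the Discogs API documentation):

```python
def change_timestamps(release):
    """Return when the release was added and last changed (ISO 8601 strings).

    `release` is assumed to be the parsed JSON of one release as
    returned by the Discogs API; neither field exists in the XML dump.
    """
    return release.get("date_added"), release.get("date_changed")
```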
Accessing the Discogs API for the purposes I want to use it for is actually very simple:
- create a security token and use it as described on the Discogs developer webpage
- send requests to the right endpoints and provide the token
- process the JSON results sent by Discogs
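The three steps above can be sketched as follows. The token value is a placeholder, the `User-Agent` string is made up, and the endpoint and `Authorization` header format are taken from the Discogs developer documentation:

```python
import json
import urllib.request

API_BASE = "https://api.discogs.com"
TOKEN = "your-token-here"  # personal access token from the Discogs developer page

def build_request(release_id, token=TOKEN):
    """Build a request for a single release, passing the token in a header."""
    url = f"{API_BASE}/releases/{release_id}"
    return urllib.request.Request(
        url,
        headers={
            # Header format as described on the Discogs developer page
            "Authorization": f"Discogs token={token}",
            # Discogs requires an identifying User-Agent; this name is made up
            "User-Agent": "DiscogsSmellChecker/0.1",
        },
    )

def fetch_release(release_id):
    """Download and parse the JSON for one release."""
    with urllib.request.urlopen(build_request(release_id)) as response:
        return json.load(response)
```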
Checking releases via the API is a lot slower than checking a local (very big) XML file. This is mostly because of the download speed and the rate limiting imposed by the Discogs server. I did some download tests and I could download around 3800 releases in two hours. Doing the math: if you would be able to download 4000 releases in two hours (being optimistic), it would take half a year or longer to download the whole database through the API. If I were to do this, then during the same timeframe Discogs would release 5 or 6 dump files which would be more up to date than the oldest data downloaded through the API.
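Spelled out, the arithmetic looks like this. The total release count is my assumption (the post only mentions that the XML dump is 32 GiB), so adjust it to the actual database size:

```python
# Back-of-the-envelope: how long a full download through the API would take.
RELEASES_TOTAL = 15_000_000   # assumed database size, order of magnitude
RELEASES_PER_HOUR = 2000      # the optimistic rate from above: 4000 per two hours

hours = RELEASES_TOTAL / RELEASES_PER_HOUR
days = hours / 24
print(f"{days:.0f} days")     # → 312 days, i.e. well over half a year
```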
This is why a continuous checker should not be a replacement for scripts that check releases in bulk. Instead, treat them as complementary.
(Of course, you could speed it up with a distributed download mechanism using a few virtual machines somewhere, but for me that's not worth the effort or the investment in time and resources.)
Implementing continuous checking
That being said, the API is very convenient for quickly catching errors. I wrote a script that does the following:
- determine the latest release in Discogs by scraping a webpage that lists releases (as the Discogs API does not allow me to discover the latest release easily, see below)
- download all releases starting from a certain offset, ending with the release number from step 1
- check the releases from step 2 for known smells and report the errors (log file, notices on Linux desktop if enabled)
- write the releases from step 3 to a file for offline analysis and to avoid redownloading the same data
- scrape the website from step 1, sleep for 10 minutes, set the offset to start from to the last release checked in step 3 and use this as the new starting point for step 2
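The steps above can be sketched as a polling loop. This is not the actual script: the scraping, downloading, smell-checking, and storage helpers are injected as parameters, so every name below is a stand-in for real code:

```python
import time

def run_checker(offset, get_latest, download, check, store, notify=None,
                sleep_seconds=600, max_rounds=None):
    """Repeat the five steps; pass max_rounds to stop after a few iterations.

    get_latest() -> latest release number (step 1, scraped from the site)
    download(offset, latest) -> iterable of releases (step 2)
    check(release) -> iterable of smells found in it (step 3)
    store(release) -> save for offline analysis (step 4)
    notify(release, smell) -> optional desktop notification hook
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        latest = get_latest()                     # step 1: scrape the site
        for release in download(offset, latest):  # step 2: download via the API
            for smell in check(release):          # step 3: known smells
                if notify:
                    notify(release, smell)
            store(release)                        # step 4: keep an offline copy
        offset = latest                           # step 5: new starting point
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(sleep_seconds)             # sleep for 10 minutes
    return offset
```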
One of the things I have added to the script is desktop notifications (only tested on a Linux desktop) whenever something wrong is found. This is not recommended if you have work to do, because it is very distracting. That's why I only run this script when I have time to focus on doing things with Discogs.
Possible improvements by Discogs
To make my scripts better there are two things that Discogs could do:
- publish a list of timestamps of when each release was last changed, so I can invalidate earlier results, redownload the releases that have changed since I last checked them, and check them again. Such a list should not cost more than 30 MiB of disk space (gzip compressed).
- add an endpoint to the API that would allow me to find out the number of the latest release with a single call, so I don't have to scrape it from the Discogs website.
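As a rough sanity check on that 30 MiB figure, here is a back-of-the-envelope estimate. Every number is an assumption (database size, bytes per line, compression ratio), but it lands in the same order of magnitude:

```python
# One line of "release_id,timestamp" per release, gzip compressed.
RELEASES = 15_000_000   # assumed database size
BYTES_PER_LINE = 30     # e.g. "12345678,2021-06-01T12:00:00\n"
GZIP_RATIO = 10         # optimistic for such repetitive text

uncompressed_mib = RELEASES * BYTES_PER_LINE / 2**20
compressed_mib = uncompressed_mib / GZIP_RATIO
print(f"{compressed_mib:.0f} MiB")  # → 43 MiB, tens of MiB as claimed
```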