
Posts

Showing posts from January, 2018

Homoglyph confusion in the Discogs database (part 1)

A challenge in many data sets is how to properly deal with internationalization and characters. Discogs is no exception. In quite a few character sets there are characters that look exactly the same as characters in other character sets, but they aren't. Take for example the following two instances of a Rights Society name that I found in Discogs: BIEM and ΒΙΕΜ. To the naked eye they might look the same (depending on the font the website is using), but they actually are not the same. The first one is all ASCII, but in the second one all the characters are from the Greek alphabet. From a data point of view these characters are very different! Characters that look a lot like other characters are called homoglyphs. I wrote about this before when looking at misspellings in Czechoslovak and Czech releases (which was essentially the same problem). There are many examples of homoglyphs in the Discogs database and it can get confusing. For example, let's look at entries for a Gree
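As a rough illustration of how such look-alike values could be flagged, here is a minimal Python sketch. It assumes that the first word of a character's Unicode name (LATIN, GREEK, CYRILLIC and so on) is a good enough stand-in for its script. The function names are hypothetical and are not the checks used for this post.

```python
import unicodedata


def scripts_used(text):
    """Return the set of script names used by the letters in text.

    Assumption: the first word of the Unicode character name
    (e.g. 'LATIN', 'GREEK') approximates the script of the letter.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def has_non_latin_letters(value):
    """True if at least one letter in value is not a Latin letter."""
    return bool(scripts_used(value) - {"LATIN"})


ascii_biem = "BIEM"
# Greek capital letters Beta, Iota, Epsilon, Mu: looks like "BIEM" in many fonts.
greek_biem = "\u0392\u0399\u0395\u039c"

print(ascii_biem == greek_biem)           # the two strings are different data
print(scripts_used(ascii_biem))
print(scripts_used(greek_biem))
print(has_non_latin_letters(greek_biem))
```

A check like this only says that a value contains letters from another script. Whether that is an error still depends on the release: a Greek release may legitimately carry Greek text.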

Rights Society fields in the Discogs database (part 3)

Today it's time for a very short post. I decided to look a little bit deeper into the wrong values of rights societies that I uncovered (I wrote about this earlier). I was surprised by the large number of errors. One error that I found is that there are quite a few releases with homoglyphs (more about that in a later article, it will be interesting), and I was wondering if these errors really are just made by the people submitting the releases, or if there really are releases where the companies made a mistake. The answer: a bit of both. While most of the time it actually is the submitters who made mistakes, there are a few releases where the labels themselves indeed seem wrong. I looked into STEMPRA, which probably should be STEMRA, but there are at least two releases where this is not the case and the labels actually say STEMPRA: this unofficial release and this official release. It certainly doesn't make things easier, as you cannot assume that it is indeed a misspelling.

Rights Society fields in the Discogs database (part 2)

The Rights Society field in Discogs is a big potential source of issues. The reason is that almost all releases out there have one or more rights societies printed on them. I already looked into where the Rights Society field perhaps should have been used, but wasn't. What I didn't look at is whether the values in releases that do have one or more Rights Society fields are actually correct. I took the latest data dump and ran some tests. I haven't added the check to my scripts because there were too many false positives (although I will add a few sanity checks soon), but I still found a few interesting things.

BIEM

One rights society (at least according to the list of rights societies on Discogs, as it appears to be more of an umbrella organization) is BIEM. Some misspellings that I found: BOEM: 2 times; BEIM: 122 times (although for at least one release it is printed like that on the disc); BIME: 2 times; BIEN: 82 times.

STEMRA

Another rights soci
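A minimal sketch of how near-miss spellings like the ones above could be surfaced, using `difflib.get_close_matches` from the Python standard library. The list of known names and the cutoff of 0.7 are assumptions chosen for this example, and `suggest_society` is a hypothetical helper, not the test that was actually run on the data dump.

```python
from difflib import get_close_matches

# Assumption: a tiny hand-picked list; a real list would be much longer.
KNOWN_SOCIETIES = ["ASCAP", "BIEM", "BUMA", "GEMA", "SGAE", "STEMRA"]


def suggest_society(value, known=KNOWN_SOCIETIES, cutoff=0.7):
    """Return the closest known name for a value that is not in the list.

    Returns None if the value is already a known name or if nothing
    is similar enough (similarity below the cutoff).
    """
    candidate = value.strip().upper()
    if candidate in known:
        return None
    matches = get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for value in ["BOEM", "BEIM", "BIME", "BIEN", "BIEM", "ZZZZ"]:
    print(value, "->", suggest_society(value))
```

A suggestion is only a candidate for manual review. As noted above, at least one release really has BEIM printed on the disc, so a close match is not proof of a typo.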

Rights Society fields in the Discogs database (part 1)

One of the fields in the "Barcode and Other Identifiers" section of a release in Discogs is the Rights Society field. These societies are for things like collecting royalties for artists, although I have no idea how exactly they work (and, to be honest, I am not that interested in knowing at the moment). Examples of rights societies are ASCAP, GEMA, BUMA, SGAE and more. The Rights Society field has been around in Discogs for some time, but quite a few errors are still being made. For example, I often see the Other field being used instead, but also ISRC, Barcode and other fields. A while ago I added support for detecting these to my scripts, and recently I refined the detection so I could write this blog post with more accurate information.

Rights Society errors

In the latest Discogs data dump you will find about 30,000 releases where the Rights Society field is not used where it possibly should have been used. This excludes instances where the value is actually not a rig

Continuously checking Discogs for errors (part 1)

The dumps released by Discogs are great for data mining, but they are only published once a month. Because they are also not released on the first day of the month (usually on the 4th), you are at minimum a few days behind, and possibly up to 30+ days. The Discogs data can also be accessed through their API, allowing monitoring of releases in a more continuous way: as soon as something is added it can be downloaded and checked. The JSON coming from the API exposes more information than the XML from the data dump. For example, it contains timestamps for when the release was added and when it was last changed. It also contains some information about who contributed to it, and some elements (like labels) contain a bit more information that is not present in the XML. The structure also seems a lot more sane than the XML, which I can only process with SAX because building a DOM of a 32 GiB XML file is just insane. Accessing the Discogs API for the purposes I want to use it for
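To illustrate why SAX suits a file of that size, here is a minimal sketch of a streaming handler in Python. The parser hands over one element at a time, so memory use stays flat regardless of the size of the file. The element and attribute names (`release`, `identifier`, `type`, `value`) and the sample document are assumptions about the layout of the dump made for this example, and `RightsSocietyHandler` is a hypothetical class, not one of the scripts mentioned in these posts.

```python
import xml.sax


class RightsSocietyHandler(xml.sax.ContentHandler):
    """Collect (release id, value) pairs for Rights Society identifiers."""

    def __init__(self):
        super().__init__()
        self.current_release = None
        self.found = []

    def startElement(self, name, attrs):
        if name == "release":
            # Remember which release the following elements belong to.
            self.current_release = attrs.get("id")
        elif name == "identifier" and attrs.get("type") == "Rights Society":
            self.found.append((self.current_release, attrs.get("value")))

    def endElement(self, name):
        if name == "release":
            self.current_release = None


# Assumed, heavily simplified sample of what the dump could look like.
SAMPLE = b"""<releases>
  <release id="1">
    <identifiers>
      <identifier type="Barcode" value="0000000000000"/>
      <identifier type="Rights Society" value="BIEM"/>
    </identifiers>
  </release>
  <release id="2">
    <identifiers>
      <identifier type="Rights Society" value="STEMPRA"/>
    </identifiers>
  </release>
</releases>"""


def rights_societies(xml_bytes):
    """Stream through the XML and return all Rights Society values found."""
    handler = RightsSocietyHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.found


print(rights_societies(SAMPLE))
```

For a real dump, `xml.sax.parse()` accepts a file name or an open file object instead of a byte string, so the file never has to be loaded into memory in one piece.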

Detecting wrong label information in Discogs (part 1)

When looking at a music release in general (not just in Discogs) it can be quite confusing which company did what, as quite a few companies are listed on a release. There is the label, the production company, the rights owner and perhaps also others, like a printing company. Then there are companies that use names which are very similar and which at different points in time might have meant something else (as companies are renamed, sold, repurposed, and so on). This is guaranteed to lead to errors. Recently I bumped into one example and that triggered me to look into it a bit more to see how many errors I could find. The company in question is London, which is associated with the label called London Records. The webpage for London (the company) says: "For the record label, please see London Records." which means as much as "don't use this as the label". So how often does this happen in the data? I grabbed the latest data dump (released January 2018), adapte

What could be done if the Discogs edit history was made available (part 2)

After having seen quite a few releases on Discogs and having dug through the data that Discogs makes available every month, I still sometimes find things that make me wonder. In a previous post I already talked a bit about what I could do if the whole edit history were made available, but I forgot one important thing: some people tend to put information that is relevant to the release in the submission notes. The submission notes are just metadata describing an edit. They should not be used to describe a release. Yet some people do exactly this. They do not realize that unless you are logged into Discogs and actively click on "edit page" you will not see the submission history and you will not see that particular information. It is effectively hidden from people who do not know this or who are not logged in. To find releases with that information it would be really useful to have the Discogs edit history. On the other hand, maybe it should also be a signal to

What could be done if the Discogs edit history was made available

Discogs is making its database data available under a CC0 license, but one important aspect of the data is lacking, and that is the edit history. For me as a software engineer, having the edit history available is very natural, as it allows me to dive into the history and perhaps see where people went wrong, when certain errors were introduced, and so on. For Discogs this is only possible if I am logged into the website and go to the release page history, where I am presented with a view of the edit history without being able to interact with it (queries, and so on), which I find frustrating. The edit history is a very rich source of information. For each release a lot of information is kept, such as: creation date/modification date; origin (whether it was created fresh, or if "copy to draft" was used); merge information (if merged with another release); content of edits; contributors and votes. While some of this information is partially available through the Discogs API the ful

Images in Discogs: what could be improved (part 1)

One feeling that I have when I am looking through Discogs is "if only they had done XYZ". I understand that there is always more to do than there is manpower available and that what I perceive as a big problem might be very low priority (and vice versa), but that doesn't stop me from ranting about it. This is one of those posts. In Discogs it is expected that people upload images. The guidelines for images describe how this should be done and they make a lot of sense. For example, if a release (like a 7" single) has a sleeve and two sides, the first image you should show is the front cover, then the back cover, then the A-side label, then the B-side label. Of course, humans being humans, it is completely random at times (although many times it is perfectly fine). Some things I have observed: the order can be quite random; it is unclear what the images are of; some people feel the need to upload images of their own copy; people add images of another release or worse