A challenge in many data sets is how to properly deal with internationalization and characters. Discogs is no exception. Quite a few scripts contain characters that look exactly the same as characters in other scripts, but are in fact different characters. Take for example the following two instances of a Rights Society name that I found in Discogs:

BIEM
ΒΙΕΜ

To the naked eye they might look the same (depending on the font the website is using), but they actually are not the same. The first one is all ASCII, but in the second one all the characters are from the Greek alphabet. From a data point of view these characters are very different!

Characters that look a lot like other characters are called homoglyphs. I wrote about this before when looking at misspellings in Czechoslovak and Czech releases (which was essentially the same problem).

There are many examples of homoglyphs in the Discogs database and it can get confusing. For example, let's look at entries for a Gree...
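To make the difference between the two BIEM spellings visible, here is a minimal Python sketch using only the standard library. It is not taken from my cleanup scripts: the helper names `describe` and `scripts` are made up for this illustration, and taking the first word of a character's Unicode name as its "script" is a rough heuristic, not a proper script lookup.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in text]


def scripts(text):
    """Rough heuristic: use the first word of each letter's Unicode name
    (for example LATIN or GREEK) as a stand-in for its script."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in text if ch.isalpha()}


latin = "BIEM"                      # plain ASCII
greek = "\u0392\u0399\u0395\u039c"  # Greek capital Beta, Iota, Epsilon, Mu

# The strings may render identically, but they do not compare equal.
print(latin == greek)

for name in (latin, greek):
    print(name, sorted(scripts(name)))
    for row in describe(name):
        print("   ", row)
```

A field where `scripts()` returns something other than what you expect (Greek letters in a name that should be Latin, or both scripts mixed in one word) is a good candidate for a closer look.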
A blog dedicated to two of my hobbies, vinyl records and digital data, and to exploring where the two intersect. This blog is not affiliated with Discogs, but uses a lot of its data. On Discogs you can find me as metalmijn. I also get a lot of help from a friend you can find on Discogs as gerjolp. Check out my Discogs cleanup scripts at: https://github.com/armijnhemel/cleanup-for-discogs/