Vinyl & Data

Posts

Showing posts from April, 2019

Using image processing to automatically detect labels (part 4)

This is the last blog post (at least for now) about using image processing to automatically detect labels of 7" singles. I advise reading part 1, part 2 and part 3 before reading this post. In my previous blog post I looked at histogram comparisons, and for my test image that worked really well. In this post I am going to look at how well it works on some other images. The first label image that I tried is a Wham! promo single with an (almost) white label, with a perfect white circle around it (either a very good scanner, or the uploader cut the label from the scan), so I was not expecting a lot. The masked versions look like this: I was expecting quite a bad result here, but the comparison for the full image is:

correlation: 0.07351815513756556
chi squared: 15.131605789349916
intersection: 0.08877012991160882
Bhattacharyya distance: 0.8772594089938355

That is even better than with my test image, which puzzles me to be honest. The next image

Using image processing to automatically detect labels (part 3)

Finally getting to the meat of using image processing to detect labels automatically. For background information you should read part 1 and part 2 before reading this. Just computing a histogram won't get me very far. For it to be useful I need to compare it to something. As said, what I will do is compare one part of the image (where I think the label is) with the rest of the image, minus the center hole, as that isn't part of the record. The idea is that the overlap between the two will be minimal, except possibly in cases where the vinyl and the label have the same or a similar colour. OpenCV has built-in functions to compare histograms, and basically what I did was follow this blog post at PyImageSearch about comparing histograms and use it on my test images. After computing the histograms (32 buckets) I computed the four types of histogram comparisons from the PyImageSearch post for the red, green and blue channels, as well as the full ima
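To make the comparison step concrete, here is a minimal sketch in plain Python. It does not call OpenCV; it reimplements the four measures (correlation, chi squared, intersection, Bhattacharyya distance) following the formulas documented for OpenCV's histogram comparison, on 32-bucket histograms. The function names `histogram` and `compare_histograms` are hypothetical, and the inputs are assumed to be flat lists of 0..255 channel values for the masked label area and for the rest of the record.

```python
import math


def histogram(values, buckets=32, max_value=256):
    """Normalised histogram of 0..255 channel values (sums to 1)."""
    counts = [0] * buckets
    width = max_value / buckets
    for v in values:
        counts[min(int(v / width), buckets - 1)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def compare_histograms(h1, h2):
    """Compare two equally sized histograms with four measures."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n

    # Correlation: 1.0 means identical shape, values near 0 mean unrelated.
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1) *
                    sum((b - m2) ** 2 for b in h2))
    correlation = num / den if den else 0.0  # undefined case reported as 0.0

    # Chi squared: 0.0 means identical; buckets where h1 is empty are skipped.
    chi_squared = sum((a - b) ** 2 / a for a, b in zip(h1, h2) if a > 0)

    # Intersection: the shared mass of the two histograms.
    intersection = sum(min(a, b) for a, b in zip(h1, h2))

    # Bhattacharyya distance: 0.0 means identical, 1.0 means no overlap.
    coefficient = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    coefficient /= math.sqrt(m1 * m2 * n * n)
    bhattacharyya = math.sqrt(max(0.0, 1.0 - coefficient))

    return {
        "correlation": correlation,
        "chi_squared": chi_squared,
        "intersection": intersection,
        "bhattacharyya": bhattacharyya,
    }
```

With a bright label and dark vinyl the two histograms barely overlap, so the intersection should be close to 0 and the Bhattacharyya distance close to 1. When label and vinyl have similar colours the measures move the other way.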

Using image processing to automatically detect labels (part 2)

Time to look more into histograms and how I can use them to automatically detect if a picture is a label. It is highly advised to first read part 1 before reading this blog post.

More selective histograms

In the first blog post I looked into histograms, but there I was (mostly) looking at the entire image, which in the case of the label is not helpful, as there is usually a lot of other stuff in the picture too (center hole, parts of the record, etc.). By selectively looking at parts of the image and analyzing those it might be possible to make a much better guess. For example, in the corners of the image I would expect to find dark/black (unless the vinyl was a different colour), then a circle with usually a handful of colours, and then something in the middle (in this case white, because a scanner was used). One approach to detect this is to slide a square of 100x100 pixels diagonally across the image and then analyze the contents of that square. To visualise that, f
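The sliding square can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the image is taken to be a list of rows of grayscale values, the step size of 50 pixels is made up, and the names `diagonal_windows` and `mean_brightness` are hypothetical. Mean brightness stands in for whatever analysis is done on each square.

```python
def diagonal_windows(image, size=100, step=50):
    """Yield (offset, window) for squares along the main diagonal.

    `image` is a list of rows, each row a list of grayscale values.
    The square starts in the top-left corner and moves `step` pixels
    right and down each time, until it no longer fits in the image.
    """
    height, width = len(image), len(image[0])
    limit = min(height, width) - size
    for offset in range(0, limit + 1, step):
        window = [row[offset:offset + size]
                  for row in image[offset:offset + size]]
        yield offset, window


def mean_brightness(window):
    """Average pixel value of one square."""
    pixels = [p for row in window for p in row]
    return sum(pixels) / len(pixels)
```

For a scanned label the expectation from the text would be low values for the squares in the corners (dark vinyl), different values over the label, and high values around the center hole.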

Using image processing to automatically detect labels (part 1)

One thing that I don't like about Discogs is that the images in the database are not tagged with metadata about the contents. It would be a lot easier if I knew whether an image is a label, or a sleeve, or something else. I actually talked about this before, but so far my suggestion hasn't been picked up. One of the reasons I want this is that it would help my quest for finding promising candidates for fixing releases: I would not have to click on a release only to find out that there is no label image. It is a really repetitive process, so I wanted to see how easy it is to automatically identify that an image contains a label. The quality of the images in Discogs varies a lot, but typically good quality images are (at least) 600x600. At least a few years ago images were automatically scaled, which confused people, as they uploaded what they thought were "better images" only to see them scaled to the same dimensions! I picked a release from Discogs

Country statistics (part 2)

One thing I wondered about: for how many releases is the country field changed? I looked at the two most recent data dumps (covering February and March 2019) to see where they differed. In total 5274 releases "moved". The top 20 moves are:

unknown -> US: 454
Germany -> Europe: 319
UK & Europe -> Europe: 217
unknown -> UK: 178
UK -> Europe: 149
Netherlands -> Europe: 147
unknown -> Europe: 139
unknown -> Germany: 120
UK -> US: 118
Europe -> Germany: 84
US -> UK: 79
USA & Canada -> US: 76
US -> Canada: 65
unknown -> France: 64
UK -> UK & Europe: 62
UK & Europe -> UK: 51
France -> Europe: 51
Saudi Arabia -> United Arab Emirates: 49
US -> Europe: 46
unknown -> Japan: 45

When you think about it these all make sense (there was a big consolidation in Europe in the 1980s and releases for multiple countries were made in a single pressing plant) but there are also a few weird changes:
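As an aside on how such a comparison between two dumps could be counted, here is a minimal sketch. It assumes each dump has already been reduced to a dictionary mapping release id to country, which is not how the Discogs dumps are actually stored; the function name `country_moves` is hypothetical, and treating a missing country as "unknown" is an assumption as well.

```python
from collections import Counter


def country_moves(old, new):
    """Count (old country, new country) pairs for releases that changed.

    `old` and `new` map release id -> country (or None when not set).
    Releases that appear in only one of the two dumps are ignored.
    """
    moves = Counter()
    for release_id, old_country in old.items():
        if release_id not in new:
            continue
        before = old_country or "unknown"
        after = new[release_id] or "unknown"
        if before != after:
            moves[(before, after)] += 1
    return moves
```

Calling `country_moves(february, march).most_common(20)` on two such dictionaries would give a top 20 in the same shape as the list above.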

Finding promising releases for fixing (part 1)

So far these blog posts have been about detecting wrong information that people entered into the database, but there are also other errors that are much harder to find, namely releases that are missing information and are therefore incomplete. Many releases in Discogs are missing information. One of the things we did for a while was track new Spanish releases that were added, look at the pictures (if any) and then add the missing information, such as the depósito legal, or a rights society. But this does not scale, it is quite labour intensive, and it is not easy to remember by heart which releases have already been checked. So I want to optimize for success: which releases will likely be the easiest to fix? I ran a test, where I confined my search to:

releases from Spain
7" releases (vinyl, flexi-disc)

with the specific goal to fill in the depósito legal field(s) and other things like rights society fields (if not already there). The following was done: first

Introducing new errors in old releases (part 3)

Although I see the number of obvious errors decrease (albeit veeeery slowly), there are always releases where people introduce new errors. I already looked at this several times before, but it never hurts to keep pointing it out. In March 2019 an error was introduced in a total of 1241 releases, either because information that contained an error was added, or because fields were changed to something wrong. This excludes possible tracklist errors. The errors are distributed like this across the data.

[Figure: Errors introduced in March 2019]

In my opinion this looks quite a bit like the monthly changes, and my guess is that if you would normalize it (look at the number of errors that were introduced, compared to the number of releases that were changed) it might look different. I should look into that.
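The normalization mentioned above comes down to a division per bucket. A minimal sketch, assuming the data is available as two dictionaries keyed by the same buckets (however the data is sliced for the graph); the name `error_rates` is hypothetical and no real numbers from the dumps are used here.

```python
def error_rates(errors, changed):
    """Errors introduced per changed release, for each bucket.

    `errors` maps bucket -> number of releases with a new error,
    `changed` maps bucket -> number of releases that were changed.
    Buckets without any changed releases are left out.
    """
    rates = {}
    for bucket, n_changed in changed.items():
        if n_changed:
            rates[bucket] = errors.get(bucket, 0) / n_changed
    return rates
```

If the resulting rates are roughly flat across the buckets, the error graph simply follows the volume of edits; peaks would point at parts of the data where edits go wrong more often.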

Country statistics (part 1)

A quick post because I had the data in front of me: some statistics about countries. In a recent blog post on the Discogs blog it said "Many of the release countries expanding the fastest in the database are from Central America, South America, and Southeast Asia." As I am already keeping some of that data it is easy to just plot a few simple graphs of some countries using the data from March 31, 2019 and see if this is true (spoiler: maybe not).

Germany

Let's look at German releases first (note: this does not mean releases of German artists, or releases in German, but releases from Germany).

[Figure: Releases from Germany]

Clearly it seems that many releases from Germany were added in the early days and then fewer and fewer were added.

Spain

Then Spain, which I feel probably still has many releases missing (although there seems to be a very active Spanish community on Discogs).

[Figure: Releases from Spain]

There are far fewer releases from Spain than from Germany in