
Posts

Showing posts from May, 2019

Adding missing distribution companies to releases

One thing that I have hardly looked into is the section "Label, Company, Catalog Number, Etc." or "LCCN" in Discogs lingo (although it is only called that in the data entry form; on regular pages it says "Companies, etc."). In this section information such as the label, record company, distributor and much more should be entered. But of course, this often doesn't happen, partly because people don't know what to enter. I believe that with a little bit of data mining it should be possible to find at least some of the data. I already showed that it works for depósito legal values, so why shouldn't it for others? Let's look at an example from, again, Spain, namely this release on the Pye label. On the rear sleeve the following can be seen: Looking at the label page for Discos Belter: "Discos Belter, S.A. also offered distribution services and was a licensee for Motown, Pye Records, Salsoul Records, Prelude Records,

A case for guided data entry (part 2)

I have been thinking a bit about how to increase data quality when entering data. There is already a lot that could be done by using wizards and asking users the right questions. But I think that it can be made even easier by using graphical hints and explicitly pointing out to users what information should be entered in the database. In the previous article I mentioned using a wizard and guiding users through the process of entering data. When the right questions have been answered (such as country and label) the user could be asked what the label of the release looks like. For example they could be given the following choice (left: typical label EMI used in Spain in the 1980s, right: typical label EMI used in Spain in the 1970s): [Figure: examples of labels EMI used in the 1980s (left) and 1970s (right)] Based on this they could then be guided through the process of picking the right data. Also, knowing which label was picked already means that some checks can be applied. For exa

A case for guided data entry (part 1)

I just helped someone add a few releases to Discogs and it was, again, a quite frustrating experience, even though I have added releases to Discogs before. Adding releases to Discogs is actually a lot of work. These are the steps I typically take for a 7" single:

- see if the release is already there on Discogs
- see if the release I have differs from any of the listed releases
- copy an existing release to draft, or start from scratch
- fill in all the details that I am sure about
- make scans, crop the scans, scale them
- add scans

Adding a release properly, with all the details, can easily take 15 to 20 minutes, and then I still only have a fairly barebones release. What I typically don't get right due to lack of knowledge are things like:

- composers
- printing companies
- manufacturing companies
- pressing plants
- country specific peculiarities
- etc.

These usually require specialist knowledge about releases from a certain country, label or artist, which I simply don't

ISRC codes in Discogs (part 8)

A bit over a year ago I looked into ISRC codes for checking release dates. The ISRC code actually contains a year component which can be used to check dates, as (except in rare old cases) the release is always after the ISRC code has been assigned, so it cannot be earlier. A few people on Discogs seem to disagree, but meh. For more details you can read an earlier post. So back in February 2018 I got the following graph ("Distribution of smells in releases with known ISRC fields in February 2018"), which also included some other smells and not just date mismatches. Back then I concluded that there were 273 releases that had a date mismatch. So I decided to rerun my analysis with the latest data dump (up until and including April 30 2019) now that it seems that there are many more ISRC codes in the database, because people seem to be fixing the ISRC field. As expected the number of releases with a date mismatch is now a lot higher: 1414. They are distributed across the database
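As a rough illustration of the date check described in this excerpt, here is a minimal Python sketch. It is not the script used for the analysis; the function names, the example codes and the `pivot` used to expand the two-digit year into a full year are all assumptions made for this example.

```python
import re

# An ISRC has 12 characters: 2-letter country code, 3-character registrant
# code, 2-digit year of reference and a 5-digit designation code.
ISRC_RE = re.compile(r"^([A-Z]{2})([A-Z0-9]{3})(\d{2})(\d{5})$")


def isrc_year(isrc, pivot=40):
    """Return the four-digit year encoded in an ISRC, or None if malformed.

    The pivot is an assumption: two-digit years >= pivot are read as 19xx,
    everything below it as 20xx.
    """
    cleaned = re.sub(r"[\s\-]", "", isrc.upper())
    match = ISRC_RE.match(cleaned)
    if match is None:
        return None
    yy = int(match.group(3))
    return 1900 + yy if yy >= pivot else 2000 + yy


def has_date_mismatch(isrc, release_year):
    """True if the release year is earlier than the year in the ISRC."""
    year = isrc_year(isrc)
    if year is None:
        return False  # a malformed code is a different smell, not a date mismatch
    return release_year < year


# Hypothetical example values, not taken from the Discogs data
print(isrc_year("GB-AYE-98-00123"))
print(has_date_mismatch("GB-AYE-98-00123", 1995))
```

A release year that is equal to or later than the ISRC year is not flagged, which matches the reasoning that a release cannot predate the assignment of its ISRC code.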

Barcodes (part 4)

Continuing with my research into barcodes in the Discogs database: which barcodes appear most often? If you haven't already I would recommend reading part 1, part 2 and part 3 of this series first. One thing that I was wondering about is which barcodes are found the most. The top 10 values for the barcode field look like this:

1411: 4820011260011
1252: 4 820011 260011
827: 6456489431561
826: 6 456489 431561
460: 4 64043 51662 9
455: 464043516629
447: 4 60980-06754 5
437: 4 619497 411525
429: 4619497411525
427: 460980067545

Because one release could have multiple barcode fields (frequently one for the text representation and one for the scanned version) there is some overlap and the top 10 is actually just 5 different barcodes. I was wondering which release was at number one and to my surprise it was a release by Dissection. To my knowledge the majority of the people on Discogs are not into extreme metal, so I looked a bit further at what other releases ha
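The overlap between the text representation and the scanned version disappears when the values are normalised before counting. The following is a minimal Python sketch of that idea, not the code behind the numbers above; `normalize_barcode` and `top_barcodes` are hypothetical helper names, and stripping every non-digit character is an assumption that would be too aggressive for fields containing free text.

```python
import re
from collections import Counter


def normalize_barcode(value):
    """Drop everything that is not a digit (spaces, hyphens, dots, ...)."""
    return re.sub(r"\D", "", value)


def top_barcodes(values, n=10):
    """Count normalised barcode values and return the n most common ones."""
    counts = Counter(normalize_barcode(value) for value in values)
    counts.pop("", None)  # ignore fields that contained no digits at all
    return counts.most_common(n)


# A few of the representations from the top 10 above
sample = [
    "4820011260011",
    "4 820011 260011",
    "4 64043 51662 9",
    "464043516629",
    "4 60980-06754 5",
    "460980067545",
]
for barcode, count in top_barcodes(sample):
    print(count, barcode)
```

With this normalisation each pair of representations in the sample collapses into a single barcode.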

Barcodes (part 3)

Continuing with poking into the barcode field: in this post I am going to look at the releases that I think have a valid barcode. Before you read this it is good to first read part 1 and part 2. After processing all the entries with barcode fields I have 3,313,415 potentially valid barcodes, although I am pretty sure that not all of them will be valid. I looked at the length of the possible barcodes and got this list (after cleaning up whitespace, hyphens, etc.): 13 : 1647416 12 : 1513850 11 : 90840 10 : 30464 14 : 16631 15 : 2817 6 : 2628 8 : 1672 7 : 1410 5 : 1292 18 : 1251 9 : 957 4 : 485 3 : 426 17 : 353 16 : 339 2 : 190 20 : 114 19 : 113 1 : 71 24 : 49 21 : 21 23 : 8 25 : 8 22 : 7 26 : 2 37 : 1 The first two values actually aren't surprising, as they correspond to EAN-13 and UPC respectively. The others need a little bit more attention: codes that are shorter than 5 digits I can safely ignore. A barcode with size 37 is certainly intriguing, so I w
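Having the right length does not make a value a valid barcode, because both EAN-13 and UPC-A end in a check digit. Below is a minimal Python sketch of the standard check digit calculation for those two lengths. It is an illustration only and not the validation used for the numbers in this post; `check_digit_ok` is a hypothetical name and the input is assumed to be already stripped of whitespace and hyphens.

```python
def check_digit_ok(code):
    """Validate the check digit of a 12-digit (UPC-A) or 13-digit (EAN-13) code."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (12, 13):
        return False
    digits = [int(char) for char in code]
    body, check = digits[:-1], digits[-1]
    # Starting from the rightmost digit of the body, weights alternate 3, 1, 3, ...
    total = sum(
        digit * (3 if index % 2 == 0 else 1)
        for index, digit in enumerate(reversed(body))
    )
    return (10 - total % 10) % 10 == check


print(check_digit_ok("4006381333931"))  # a commonly used EAN-13 example value
print(check_digit_ok("036000291452"))   # a commonly used UPC-A example value
```

Walking the body from the right means the same function covers both lengths, since the digit next to the check digit always gets weight 3.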

Barcodes (part 2)

Time to dive further into the barcode information that is stored in Discogs. As I said in part 1: I didn't look much into this subject before, because of the potentially huge number of errors I would uncover. After digging a bit further into the data I can confirm that there are indeed many releases with errors. But on the way I found a few interesting things that I want to share.

People can't enter data properly

It is surprising to see how many people added a barcode in the Barcode field and then added a '.' that is not part of the barcode and that cannot even be found in the picture: around 150. I have not even counted things like trailing spaces, soft hyphens (whyyyyy?), and so on.

Text representations of barcodes are not consistent

I looked into some of the descriptions of several barcodes, but they don't seem to describe the barcodes that are in the wild. For example, some EAN-13 barcodes have a '.' in them between the first 12 digits and the check

Barcodes (part 1)

One thing that I still haven't properly researched is barcodes. This is not because the barcode field in Discogs is used correctly. Quite the contrary: I am seeing lots of releases where the barcode field is used for all kinds of information that is not a barcode. No, the real reason is that I am afraid of how much bad stuff I will find. The current number of possible known errors in Discogs is already quite large, even when just looking at part of the "Barcodes and Other Identifiers" section! I haven't even started on verifying other data such as labels, companies, and so on, but I am very sure that as soon as I start processing that the floodgates will open. But, I have to take the plunge one day, so today is as good as any other. Before I can show some useful results it is good to dive into some specifics about barcodes and how these are used in Discogs.

Barcodes

My guess is that most people are familiar with barcodes, but if not there is an excellent ar

How Discogs' email notification could be improved

I think it is time for another usability rant. Although I myself am far from a usability expert (the best courses I took at university were about usability, and they taught me to stay far away from it) I do suffer from user interfaces with quirks and Discogs is no exception. I have written about this subject before (images, data entry, etc.) but it bites me and my friend gerjolp every time we are on Discogs. While we were editing releases for correctness (such as the depósito legal fixes described in an earlier post) people made snarky remarks about whether or not we had anything better to do. This was a bit strange, as the data for the releases is more complete now and this should be encouraged. My guess is that the real reason is not that the releases were edited but the fact that Discogs sends a notification e-mail for every change that is made and some people got hundreds of e-mails, all with an identical message about a similar change in many releases. Instead of flagg

Finding promising releases for fixing (part 2)

Time to bring two topics together: finding promising releases for fixing and image processing. The first topic I wrote about in this post, the second topic I wrote about in a series of posts, ending with a large test of images of a specific Spanish label. I specifically focused on cherry picking the releases with 1 image to see if that image contained a label. I found 1299 releases with 1 image for this label and compared it with the releases that I found a label for. The results were, as expected, very poor: for only 9 releases with 1 image the image was actually a label. Of these a few already had the right depósito legal information on the release page (and I hadn't filtered for that this time), some had already been updated and others didn't show the label with the depósito legal information on it, but the label of the other side. Still I would consider this a resounding success, because now I don't have to look at the 1290 other releases, but can very efficient

Label detection: testing with a large collection of images

In the past few weeks I have been experimenting a bit with OpenCV and other techniques to find out if pictures contain labels and wrote about that (part 1, part 2, part 3, part 4, part 5 and part 6). As you know the proof of the pudding is in the eating and I do like pudding, so I decided to run a test to see if I could say if an image has a label on it, or could not say it (so basically the result is "has label" or "don't know if there is a label"), as up until now I had only tested with about 30 images. I wanted to keep my test small because of time constraints (I still need to add multiprocessing to my labeling script) but big enough to draw interesting conclusions, so I chose one particular Spanish label called Belter. This is not because I like the music they released but because it is one that I encountered many times and that has a rather simple clean label design, so it seemed like a good candidate, although it could also mean that my results need to be

What happened in Discogs in April 2019?

A new data dump has been released by Discogs, so time to look at some statistics again, although I am sure that it is very similar to previous months.

Release statistics

I looked at the dump file with data covering April 1 - 30 2019, although judging by the name of the previous dump file it might actually be April 2. This dump file has 11,123,192 releases, whereas the previous one had 11,019,051 releases. That means 104,141 releases more. Also:

- 10,536,611 releases stayed the same
- 479,953 releases were changed
- 106,628 releases were added
- 2,487 releases were removed from the database
- 380 releases had status Draft, Deleted or Rejected
- 11 releases that were not Accepted were in both dumps
- 0 releases were moved from Draft to Accepted

What is surprising is the high number of draft/deleted/rejected releases. Of these 353 are Draft releases. I looked at a handful of releases, but I couldn't find a clear pattern. Changes in the data are distributed as follows: Again,

Using image processing to automatically detect labels (part 6)

It is time to wrap up my quest to automatically detect labels from images in Discogs. If you haven't read it yet, I highly advise reading the previous post for more background information. In the previous post I explained about edge detection and finding the outer edge. What I did next is that I cropped the image, adapted the size of the mask and then used my proven method with histograms on the cropped image. The complete method that I use is like this:

1. use a donut shaped mask and compute histograms to see if there is a label (only works for perfectly cropped and centered images of labels)
2. if no label was found try to detect outer edges
3. if only one outer edge is found, crop the image, resize the donut shaped mask and try the histogram method again

Steps 2 and 3 are actually repeated several times for different values of the size of the kernel used for blurring. For my extremely small set of test images this method allows me to correctly find an extra two labels that I

Using image processing to automatically detect labels (part 5)

I wanted to dig a bit further into image processing and see what else I could do to automatically detect labels. Please read part 1, part 2, part 3 and part 4 first, if you haven't already. My previous methods focused on histograms using masks but this only works well if the image is perfectly centered and cropped properly. This is not always the case, as I already showed in the last articles. It would work better if I knew where the label would be in the image, so I can create a different mask for the right part of the image and then work with histograms. Detecting where a label starts can be done with edge detection. This should work in most cases, because the label has a different colour than the vinyl. In OpenCV there are various ways to detect outer edges and I am going to use the "Canny edge detector". For this I first translate the image to greyscale, and then run the edge detector (for the threshold values I used 40 and 200). For one of my testing i

Face detection for record sleeves

I have been digging a bit into image processing in the last few weeks using OpenCV. OpenCV is very powerful and comes with lots of built-in goodies, such as a face detector. So, I was thinking: what can I do with that? How easy is it to detect sleeves that have people's faces on them? As it turns out: trivial. I grabbed my copy of Practical Python and OpenCV and followed chapter 14. I then downloaded an image and fed it to my code: Not perfect as only three out of four faces get recognized, probably due to a limitation in the classifier, but not bad at all for less than 50 lines of Python code. Another example: So, why is this useful? Let's make this a bit more concrete with another example. Some users collect sleeves with a certain theme. One possible theme could be cats. At least one user at Discogs has a list with sleeves that have cats on them. And, believe it or not, OpenCV also has a built-in cat classifier (for frontal cat faces only), so I used that on one