Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues while importing the November 2020 dump to PostgreSQL #131

Open
lighterowl opened this issue Nov 22, 2020 · 4 comments
Open

Issues while importing the November 2020 dump to PostgreSQL #131

lighterowl opened this issue Nov 22, 2020 · 4 comments

Comments

@lighterowl
Copy link

Hi there. I wanted to use the importer with this month's dump, and ran into some problems along the way.
The CSV files were generated by the C# converter. PostgreSQL version is 12.5. Running on Arch Linux.

While importing artist.csv :

psycopg2.errors.NotNullViolation: null value in column "name" violates not-null constraint
DETAIL:  Failing row contains (66827, null, null, [b]DO NOT USE.[/b], Needs Vote).
CONTEXT:  COPY artist, line 78360: "66827,,,[b]DO NOT USE.[/b],Needs Vote"

While importing artist_url.csv :

psycopg2.errors.NotNullViolation: null value in column "url" violates not-null constraint
DETAIL:  Failing row contains (771, 321, null).
CONTEXT:  COPY artist_url, line 772: "321,"

While importing release_track_artist.csv :

psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: "A1"
CONTEXT:  COPY release_track_artist, line 24, column track_sequence: "A1"
psycopg2.errors.NotNullViolation: null value in column "track_sequence" violates not-null constraint
DETAIL:  Failing row contains (35478, 7479.13, 7479, null, 2332, Ray Keith, t, null, 1, null, DJ Mix, null).
CONTEXT:  COPY release_track_artist, line 35479: "7479,,7479.13,2332,Ray Keith,1,,1,,DJ Mix,"

While importing release_track.csv :

psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: "723.11"
CONTEXT:  COPY release_track, line 4299, column parent: "723.11"

parent should be text, since track_id is text as well. Or is the CSV output wrong and this should just be 11 instead of 723.11?

While creating FK constraints :

ERROR:  insert or update on table "release_artist" violates foreign key constraint "release_artist_fk_artist"
DETAIL:  Key (artist_id)=(0) is not present in table "artist".
ERROR:  insert or update on table "release_track_artist" violates foreign key constraint "release_track_artist_fk_artist"
DETAIL:  Key (artist_id)=(118760) is not present in table "artist".
ERROR:  insert or update on table "release_company" violates foreign key constraint "release_company_fk_company"
DETAIL:  Key (company_id)=(1670202) is not present in table "label".

While creating indices :

ERROR:  index row size 5376 exceeds btree version 4 maximum 2704 for index "release_track_idx_title"
DETAIL:  Index row references tuple (1075406,47) in relation "release_track".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

Furthermore, the label_id column in the release_label is null for all records :

discogs=# select count(0) from release_label;
  count   
----------
 15166771
(1 row)

discogs=# select * from release_label where label_id is not null;
 id | release_id | label_id | label_name | catno 
----+------------+----------+------------+-------
(0 rows)
@philipmat
Copy link
Owner

philipmat commented Nov 24, 2020

Could you try with the October file and see if you have the same issues, please?

If it is not too much, could you also test the October files with the changes in your PR?

@lighterowl
Copy link
Author

The results are similar when running the process with the October 2020 dumps. No errors are reported while importing with the database created by CreateTables.sql with changes from my PR, though the following errors remain :

$ python3 psql.py < sql/CreateFKConstraints.sql 
ERROR:  insert or update on table "release_artist" violates foreign key constraint "release_artist_fk_artist"
DETAIL:  Key (artist_id)=(0) is not present in table "artist".
ERROR:  insert or update on table "release_track_artist" violates foreign key constraint "release_track_artist_fk_artist"
DETAIL:  Key (artist_id)=(118760) is not present in table "artist".
ERROR:  insert or update on table "release_company" violates foreign key constraint "release_company_fk_company"
DETAIL:  Key (company_id)=(1610443) is not present in table "label".
$ python3 psql.py < sql/CreateIndexes.sql 
ERROR:  index row size 5376 exceeds btree version 4 maximum 2704 for index "release_track_idx_title"
DETAIL:  Index row references tuple (1075344,100) in relation "release_track".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

A zero artist_id is used when the artist name appears in what Discogs calls an "non-linked credit", i.e. a credit whose specifying does not generate an artist name. It is used for some credits which are not necessarily related to music, and can be seen in this release, where it is used as "Other [Spirits Lifted By]". Having a null here instead of zero makes sense, as there is no row in the artist table that would provide additional information.

Trying to visit artist 118760 page on Discogs directly results in "This artist is used as a placeholder entry and does not link to any artist.". The artist name for that ID is "No Artist", so having null in this place actually makes sense.

I have no idea what to do about that bogus company_id, though. The release that references it is here, the company is supposed to be called "Terry's 83rd Street Music" and appears properly linked in a "Published By" credit. However, clicking that credit leads to a label whose ID is 1502386. Trying to visit label 1610443 directly leads to a 404, but it can still be accessed via the API.

label_id in release_label is still null. However, it seems like the label_id column isn't generated at all in release_label.csv, which is probably a bug in the C# converter, and not the scripts themselves :

release_id,label_name,catno
1,Svek,SK032

Since the import script takes the list of columns from the CSV file header, the list of columns is set to (release_id,label_name,catno) and label_id remains null.

@philipmat
Copy link
Owner

Superb analysis, thank you.

So what do you think we should do?

@lighterowl
Copy link
Author

lighterowl commented Nov 25, 2020

Something like this for starters - this includes a fix for missing values in the label_id column of the release_label table :

diff --git a/alternatives/dotnet/discogs/DiscogsRelease.cs b/alternatives/dotnet/discogs/DiscogsRelease.cs
index d554759..75d0b54 100644
--- a/alternatives/dotnet/discogs/DiscogsRelease.cs
+++ b/alternatives/dotnet/discogs/DiscogsRelease.cs
@@ -10,7 +10,7 @@ namespace discogs.Releases
         {
             { "release", "id title released country notes data_quality master_id status".Split(" ") },
             { "release_genre", "release_id genre".Split(" ") },
-            { "release_label", "release_id label_name catno".Split(" ") },
+            { "release_label", "release_id label_id label_name catno".Split(" ") },
             { "release_style", "release_id style".Split(" ") },
             { "release_image", "release_id type width height".Split(" ") },
             { "release_format", "release_id name qty text_string descriptions".Split(" ") },
@@ -22,6 +22,16 @@ namespace discogs.Releases
             { "release_track_artist", "release_id track_sequence track_id artist_id artist_name extra anv position join_string role tracks".Split(" ") },
         };
 
+        private static string SanitizeArtistId(string artistId)
+        {
+            if (artistId == "0" /* non-linked credits in credit lists have this set to zero */
+                || artistId == "118760" /* No Artist */)
+            {
+                return "";
+            }
+            return artistId;
+        }
+
         [XmlAttribute]
         public string id { get; set; }
 
@@ -82,7 +92,7 @@ namespace discogs.Releases
                 foreach (var l in labels)
                 {
                     if (l == null) continue;
-                    yield return ("release_label", new[] { id, l.name, l.catno });
+                    yield return ("release_label", new[] { id, l.id, l.name, l.catno });
                 }
             }
             if (styles?.Length > 0)
@@ -138,7 +148,7 @@ namespace discogs.Releases
                 foreach (var a in artists)
                 {
                     if (a == null) continue;
-                    yield return ("release_artist", new[] { id, a.id, a.name, "0", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_artist", new[] { id, SanitizeArtistId(a.id), a.name, "0", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
                 }
             }
             if (extraartists?.Length > 0)
@@ -147,7 +157,7 @@ namespace discogs.Releases
                 foreach (var a in extraartists)
                 {
                     if (a == null) continue;
-                    yield return ("release_artist", new[] { id, a.id, a.name, "1", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_artist", new[] { id, SanitizeArtistId(a.id), a.name, "1", a.anv, (position++).ToString(), a.join, a.role, a.tracks });
                 }
             }
             int seq = 0;
@@ -159,13 +169,13 @@ namespace discogs.Releases
                 foreach (var a in (t.artists ?? System.Array.Empty<artist>())) {
                     if (a == null) continue;
                     artistSeq += 1;
-                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, a.id, a.name, "0", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, SanitizeArtistId(a.id), a.name, "0", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
                 }
                 artistSeq = 0;
                 foreach (var a in (t.extraartists ?? System.Array.Empty<artist>())) {
                     if (a == null) continue;
                     artistSeq += 1;
-                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, a.id, a.name, "1", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
+                    yield return ("release_track_artist", new[] { id, t.position, t.track_id, SanitizeArtistId(a.id), a.name, "1", a.anv, artistSeq.ToString(), a.join, a.role, a.tracks });
                 }
             }
         }

C# is not my native language, so I'm sure this can be expressed in a better way, but the intent should be clear. The real question is, how many "magic" artist IDs there are : I'm not aware of any list of "artists who have their IDs but aren't artists who have a page and thus an entry in artists.xml" or anything like that, which would make the job easier.
The only thing that comes to mind is resorting to correcting entries at import time, i.e. "while importing releases, look if the artist with the given ID exists, and if not, replace artist_id with null", but that just sounds plain wrong.

The project I've used a few times before actually commented out creating FK constraints in its database initialisation script, possibly because they just weren't worth the trouble. I'm not experienced enough with DBs to say for sure what kind of benefits they bring to the engine, as opposed to the human who can figure out the relations more easily. Perhaps this is really the best way to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants