Don't distinct where it's not needed. #4924

mr-russ · 2023-04-20T13:27:45Z

DISTINCT and UNION remove duplicates, which forces SQLite to add a sort, a uniqueness check, and a temporary table step. Queries that use them can't be flattened or optimized in other ways, so this change removes them where they aren't needed, allowing SQLite to optimize the searches.
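The extra step described above can be observed with `EXPLAIN QUERY PLAN`. The following is a minimal sketch using Python's standard `sqlite3` module; the table is a toy stand-in with no index on `VolumeID` (an assumption, not the real Duplicati schema), and the exact plan text may vary between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy table (assumption): no index on "VolumeID".
conn.execute('CREATE TABLE "Block" ("ID" INTEGER PRIMARY KEY, "VolumeID" INTEGER)')
conn.executemany(
    'INSERT INTO "Block" ("VolumeID") VALUES (?)',
    [(i % 3,) for i in range(30)],
)

def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row is the readable detail text.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plain_plan = plan('SELECT "VolumeID" FROM "Block"')
distinct_plan = plan('SELECT DISTINCT "VolumeID" FROM "Block"')

print("plain:   ", plain_plan)
print("distinct:", distinct_plan)  # includes a temporary B-tree step for DISTINCT
```

Whether that extra step costs measurable time depends on the data and on available indexes, so it is worth timing each query rather than assuming a gain.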

ts678 · 2023-06-22T00:02:37Z

I'm wondering how these specific ones were chosen and by what process they're guaranteed to not change behavior badly.
Long ago I tested two (possibly just the DISTINCT part rather than the final result) on an example database, and the result changed.

Jojo-1000 · 2023-06-22T06:24:35Z

I started running all of these queries against a copy of my main backup database (with 100kB blocksize, so lots of blocks) to see how much improvement there is. In some cases the new query is about 30% faster, while in others there is no difference.

> what process they're guaranteed to not change behavior badly

What I see in a lot of places is a NOT IN (SELECT DISTINCT ...). I am pretty sure at least those are safe to change (IN should not care about duplicates). The others need careful consideration, which is why I am trying to see whether it is worth it in terms of performance.
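A quick way to check the IN claim is to run both forms against a subquery that really does contain duplicates. This is a sketch with hypothetical table names (`Item`, `Excluded`), not tables from the Duplicati schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Item" ("ID" INTEGER PRIMARY KEY, "VolumeID" INTEGER);
CREATE TABLE "Excluded" ("VolumeID" INTEGER);
INSERT INTO "Item" ("ID", "VolumeID") VALUES (1, 10), (2, 10), (3, 20), (4, 30);
-- The subquery source deliberately contains duplicate values.
INSERT INTO "Excluded" ("VolumeID") VALUES (10), (10), (30), (30), (30);
""")

with_distinct = conn.execute(
    'SELECT "ID" FROM "Item" WHERE "VolumeID" NOT IN '
    '(SELECT DISTINCT "VolumeID" FROM "Excluded") ORDER BY "ID"'
).fetchall()

without_distinct = conn.execute(
    'SELECT "ID" FROM "Item" WHERE "VolumeID" NOT IN '
    '(SELECT "VolumeID" FROM "Excluded") ORDER BY "ID"'
).fetchall()

# Membership tests only ask whether a value is present, so duplicates
# in the subquery cannot change which outer rows match.
print(with_distinct, without_distinct)
```

Both queries return only the row with `ID` 3, because `IN` and `NOT IN` test set membership.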

duplicatibot · 2023-06-23T07:29:09Z

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/reducing-time-spent-deleting-blocks/16403/9

Jojo-1000 · 2023-06-21T17:42:57Z

Duplicati/Library/Main/Database/LocalBugReportDatabase.cs

+ SELECT DISTINCT ""Path"" AS ""RealPath"",
+ ? || length(""RealPath"") || ? || row_number() OVER () || 
+ CASE WHEN substr(""RealPath"", length(""RealPath"")) = ? THEN ? ELSE ? END) AS ""Obfuscated"" FROM ""File""",
+ Platform.IsClientPosix ? "/" : "X:\\", Util.DirectorySeparatorString, Util.DirectorySeparatorString, ".bin");


This query is missing a ( before line 3 and lacks a third directory separator in the parameters.
On my main backup database, the combined execution time of the three old queries was 3938ms; the new query takes 6030ms.

Jojo-1000 · 2023-06-21T17:58:12Z

Duplicati/Library/Main/Database/LocalDatabase.cs


 // Create a temporary table to cache subquery result, as it might take long (SQLite does not cache at all). 
 deletecmd.ExecuteNonQuery(string.Format(@"CREATE TEMP TABLE ""{0}"" (""ID"" INTEGER PRIMARY KEY)", blocksetidstable));
 deletecmd.ExecuteNonQuery(string.Format(@"INSERT OR IGNORE INTO ""{0}"" (""ID"") {1}", blocksetidstable, bsIdsSubQuery));
- bsIdsSubQuery = string.Format(@"SELECT DISTINCT ""ID"" FROM ""{0}"" ", blocksetidstable);
+ bsIdsSubQuery = string.Format(@"SELECT ""ID"" FROM ""{0}"" ", blocksetidstable);


No difference in performance

Jojo-1000 · 2023-06-21T17:58:45Z

Duplicati/Library/Main/Database/LocalDatabase.cs

+  UNION ALL
+  SELECT ""BlocksetID"" FROM ""BlocklistHash""
+  WHERE ""Hash"" IN (SELECT ""Hash"" FROM ""Block"" WHERE ""VolumeID"" IN ({volIdsSubQuery}))";
+


Test with 4 random volumes
Old: 98ms
New: 86ms (contains duplicates, but should not matter)

In combined temp table insert:
Old: 100ms
New: 90ms
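Why the duplicates "should not matter" can be sketched as follows, assuming the result is fed into an `INSERT OR IGNORE` on a table whose `ID` column is the primary key, as in the temp-table code quoted earlier in this review. The source tables `A` and `B` are hypothetical stand-ins for the two halves of the real query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "A" ("BlocksetID" INTEGER);
CREATE TABLE "B" ("BlocksetID" INTEGER);
INSERT INTO "A" VALUES (1), (2), (2);
INSERT INTO "B" VALUES (2), (3);
CREATE TEMP TABLE "Ids" ("ID" INTEGER PRIMARY KEY);
""")

subquery = 'SELECT "BlocksetID" FROM "A" UNION ALL SELECT "BlocksetID" FROM "B"'

# UNION ALL keeps every row, duplicates included.
union_all_rows = conn.execute(subquery).fetchall()

# The primary key rejects repeated IDs and OR IGNORE skips them silently,
# so the temp table ends up with one row per ID.
conn.execute('INSERT OR IGNORE INTO "Ids" ("ID") ' + subquery)
ids = [row[0] for row in conn.execute('SELECT "ID" FROM "Ids" ORDER BY "ID"')]

print(len(union_all_rows), ids)
```

The deduplication still happens, but it is done by the primary-key insert rather than by a separate UNION step.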

Jojo-1000 · 2023-06-21T21:12:22Z

Duplicati/Library/Main/Database/LocalDatabase.cs

@@ -777,7 +777,7 @@ public void VerifyConsistency(long blocksize, long hashsize, bool verifyfilelist
 }

 var real_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM ""BlocklistHash""", 0);
- var unique_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECT DISTINCT ""BlocksetID"", ""Index"" FROM ""BlocklistHash"")", 0);
+ var unique_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECT ""BlocksetID"", ""Index"" FROM ""BlocklistHash"")", 0);


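In my dataset both queries returned the same count, but this check specifically compares unique entries to all entries, and (BlocksetID, Index) is not guaranteed to be unique in the table. Without DISTINCT, unique_count always equals real_count, so the check can no longer detect duplicates.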
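The difference shows up as soon as the table holds a duplicate pair. This is a sketch on a simplified `BlocklistHash` table (the column list is an assumption made for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "BlocklistHash" ("BlocksetID" INTEGER, "Index" INTEGER, "Hash" TEXT);
-- The pair (1, 1) appears twice: the inconsistency the check should catch.
INSERT INTO "BlocklistHash" VALUES (1, 0, 'a'), (1, 1, 'b'), (1, 1, 'c');
""")

def scalar(sql):
    return conn.execute(sql).fetchone()[0]

real_count = scalar('SELECT Count(*) FROM "BlocklistHash"')
unique_with_distinct = scalar(
    'SELECT Count(*) FROM (SELECT DISTINCT "BlocksetID", "Index" FROM "BlocklistHash")'
)
unique_without_distinct = scalar(
    'SELECT Count(*) FROM (SELECT "BlocksetID", "Index" FROM "BlocklistHash")'
)

# With DISTINCT the counts differ, which flags the duplicate pair.
# Without DISTINCT the two counts are equal by construction.
print(real_count, unique_with_distinct, unique_without_distinct)
```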

Jojo-1000 · 2023-06-21T21:20:26Z

Duplicati/Library/Main/Database/LocalDatabase.cs

@@ -799,7 +799,7 @@ public void VerifyConsistency(long blocksize, long hashsize, bool verifyfilelist
 using (var cmd2 = m_connection.CreateCommand(transaction))
 foreach (var filesetid in cmd.ExecuteReaderEnumerable(@"SELECT ""ID"" FROM ""Fileset"" ").Select(x => x.ConvertValueToInt64(0, -1)))
 {
- var expandedCmd = string.Format(@"SELECT COUNT(*) FROM (SELECT DISTINCT ""Path"" FROM ({0}) UNION SELECT DISTINCT ""Path"" FROM ({1}))", LocalDatabase.LIST_FILESETS, LocalDatabase.LIST_FOLDERS_AND_SYMLINKS);
+ var expandedCmd = string.Format(@"SELECT COUNT(DISTINCT ""Path"") FROM (SELECT ""Path"" FROM ({0}) UNION ALL SELECT ""Path"" FROM ({1}))", LocalDatabase.LIST_FILESETS, LocalDatabase.LIST_FOLDERS_AND_SYMLINKS);


Old: 13398ms
New: 14160ms
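Apart from timing, the two forms can be compared for equivalence. In the sketch below (hypothetical tables `Files` and `Folders` standing in for the two list queries), they agree on ordinary data but diverge if a path is NULL, because `COUNT(DISTINCT x)` skips NULL values while `COUNT(*)` over a UNION counts the NULL row once. Whether `Path` can ever be NULL in the real schema is not something this thread establishes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Files" ("Path" TEXT);
CREATE TABLE "Folders" ("Path" TEXT);
INSERT INTO "Files" VALUES ('/a'), ('/b'), ('/b');
INSERT INTO "Folders" VALUES ('/b'), ('/c');
""")

OLD = ('SELECT COUNT(*) FROM (SELECT DISTINCT "Path" FROM "Files" '
       'UNION SELECT DISTINCT "Path" FROM "Folders")')
NEW = ('SELECT COUNT(DISTINCT "Path") FROM (SELECT "Path" FROM "Files" '
       'UNION ALL SELECT "Path" FROM "Folders")')

def counts():
    return (conn.execute(OLD).fetchone()[0], conn.execute(NEW).fetchone()[0])

before_null = counts()   # same count from both forms

conn.execute('INSERT INTO "Files" VALUES (NULL)')
after_null = counts()    # old form counts the NULL row, new form ignores it

print(before_null, after_null)
```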

Jojo-1000 · 2023-06-21T22:15:53Z

Duplicati/Library/Main/Database/LocalListAffectedDatabase.cs

@@ -65,7 +65,7 @@ public IEnumerable<Duplicati.Library.Interface.IListResultFileset> GetFilesets(I
 var sql = string.Format(
 @"SELECT DISTINCT ""FilesetID"" FROM (" +
 @"SELECT ""FilesetID"" FROM ""FilesetEntry"" WHERE ""FileID"" IN ( SELECT ""ID"" FROM ""FileLookup"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""BlocksetEntry"" WHERE ""BlockID"" IN ( SELECT ""ID"" From ""Block"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0})))))" +
- " UNION " +
+ " UNION ALL " +
 @"SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))" +
 ")",


No difference in performance
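The results also stay the same here, since the outer `SELECT DISTINCT` already removes whatever duplicates `UNION ALL` lets through. A sketch with hypothetical single-column tables in place of the two nested subqueries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "EntrySide" ("FilesetID" INTEGER);
CREATE TABLE "FilesetSide" ("FilesetID" INTEGER);
INSERT INTO "EntrySide" VALUES (1), (1), (2);
INSERT INTO "FilesetSide" VALUES (2), (3);
""")

TEMPLATE = (
    'SELECT DISTINCT "FilesetID" FROM ('
    'SELECT "FilesetID" FROM "EntrySide" {op} '
    'SELECT "FilesetID" FROM "FilesetSide") ORDER BY "FilesetID"'
)

def run(op):
    return [row[0] for row in conn.execute(TEMPLATE.format(op=op))]

with_union = run("UNION")
with_union_all = run("UNION ALL")

# The outer DISTINCT makes the inner duplicate removal redundant.
print(with_union, with_union_all)
```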

Jojo-1000 · 2023-06-21T22:19:41Z

Duplicati/Library/Main/Database/LocalListAffectedDatabase.cs

 @"SELECT ""Path"" FROM ""File"" WHERE ""MetadataID"" IN (SELECT ""ID"" FROM ""Metadataset"" WHERE ""BlocksetID"" IN (SELECT ""BlocksetID"" FROM ""BlocksetEntry"" WHERE ""BlockID"" IN (SELECT ""ID"" FROM ""Block"" WHERE ""VolumeID"" IN (SELECT ""ID"" from ""RemoteVolume"" WHERE ""Name"" IN ({0})))))" +
- @" UNION " +
+ @" UNION ALL " +
 @"SELECT ""Path"" FROM ""File"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))" +
 @") ORDER BY ""Path"" ",
 string.Join(",", items.Select(x => "?"))


Old: 12914ms
New: 6123ms

Jojo-1000 · 2023-06-21T22:23:22Z

Duplicati/Library/Main/Database/LocalListAffectedDatabase.cs

@@ -139,7 +139,7 @@ public IEnumerable<Duplicati.Library.Interface.IListResultRemoteVolume> GetVolum
 var sql = string.Format(
 @"SELECT DISTINCT ""Name"" FROM ( " +
 @" SELECT ""Name"" FROM ""Remotevolume"" WHERE ""ID"" IN ( SELECT ""VolumeID"" FROM ""Block"" WHERE ""ID"" IN ( SELECT ""BlockID"" FROM ""BlocksetEntry"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""FileLookup"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))))) " +
- @" UNION " +
+ @" UNION ALL " +
 @" SELECT ""Name"" FROM ""Remotevolume"" WHERE ""ID"" IN ( SELECT ""VolumeID"" FROM ""Block"" WHERE ""ID"" IN ( SELECT ""BlockID"" FROM ""BlocksetEntry"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""Metadataset"" WHERE ""ID"" IN ( SELECT ""MetadataID"" FROM ""FileLookup"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))))))" +
 @")",


Old: 19458ms
New: 13540ms

Jojo-1000 · 2023-06-21T22:45:36Z

Duplicati/Library/Main/Database/LocalListChangesDatabase.cs

@@ -161,7 +161,7 @@ public void AddFromDb(long filesetId, bool asNew, Library.Utility.IFilter filter
 cmd.ExecuteNonQuery();
 }

- cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECT DISTINCT ""Path"" FROM ""{2}"") ", tablename, combined, filenamestable), filesetId);
+ cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECT ""Path"" FROM ""{2}"") ", tablename, combined, filenamestable), filesetId);


Old: 13285ms
New: 10527ms

Jojo-1000 · 2023-06-22T07:49:30Z

Duplicati/Library/Main/Database/LocalListDatabase.cs

@@ -73,7 +73,7 @@ public FileSets(LocalListDatabase owner, DateTime time, long[] versions)

 using(var cmd = m_connection.CreateCommand())
 {
- cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECT DISTINCT ""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" " + query, m_tablename), args);
+ cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECT ""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" " + query, m_tablename), args);


Both run quickly in my test, but ID is unique, so there should be no difference in results.

Jojo-1000 · 2023-06-23T14:56:31Z

I tried to run each query individually to see how much faster it is, but I gave up partway through, so I don't have times for the others. However, in my test case that runs all database queries, this PR gave no significant performance improvement.

Don't distinct where it's not needed.

d1f95da

mr-russ force-pushed the reducedistinct branch from 9e1e98e to d1f95da Compare April 20, 2023 13:29

Quote table name.

c56e0c6

Jojo-1000 reviewed Jun 23, 2023
