-
Notifications
You must be signed in to change notification settings - Fork 872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't distinct where it's not needed. #4924
base: master
Are you sure you want to change the base?
Conversation
DISTINCT and UNION will remove duplicates, requiring a forced sort, unique and temporary table step. It can't be flattened or optimized in other ways. So they are removed to allow SQLite to do optimization of searches.
I'm wondering how these specific ones were chosen and by what process they're guaranteed to not change behavior badly. |
I started to run through all of these queries on a copy of my main backup database (with 100kB blocksize, so lots of blocks) to see what improvement is there. It seems like in some cases it can be 30% faster while in others it is just the same.
What I see in a lot of places is a |
This pull request has been mentioned on Duplicati. There might be relevant details there: https://forum.duplicati.com/t/reducing-time-spent-deleting-blocks/16403/9 |
SELECT DISTINCT ""Path"" AS ""RealPath"", | ||
? || length(""RealPath"") || ? || row_number() OVER () || | ||
CASE WHEN substr(""RealPath"", length(""RealPath"")) = ? THEN ? ELSE ? END) AS ""Obfuscated"" FROM ""File""", | ||
Platform.IsClientPosix ? "/" : "X:\\", Util.DirectorySeparatorString, Util.DirectorySeparatorString, ".bin"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query has a missing (
before line 3 and lacks a 3rd dir separator in parameters.
Combined execution time on my main backup database of old 3 queries was 3938ms, new is 6030ms.
|
||
// Create a temporary table to cache subquery result, as it might take long (SQLite does not cache at all). | ||
deletecmd.ExecuteNonQuery(string.Format(@"CREATE TEMP TABLE ""{0}"" (""ID"" INTEGER PRIMARY KEY)", blocksetidstable)); | ||
deletecmd.ExecuteNonQuery(string.Format(@"INSERT OR IGNORE INTO ""{0}"" (""ID"") {1}", blocksetidstable, bsIdsSubQuery)); | ||
bsIdsSubQuery = string.Format(@"SELECT DISTINCT ""ID"" FROM ""{0}"" ", blocksetidstable); | ||
bsIdsSubQuery = string.Format(@"SELECT ""ID"" FROM ""{0}"" ", blocksetidstable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No difference in performance
UNION ALL | ||
SELECT ""BlocksetID"" FROM ""BlocklistHash"" | ||
WHERE ""Hash"" IN (SELECT ""Hash"" FROM ""Block"" WHERE ""VolumeID"" IN ({volIdsSubQuery}))"; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test with 4 random volumes
Old: 98ms
New: 86ms (contains duplicates, but should not matter)
In combined temp table insert:
Old: 100ms
New: 90ms
@@ -777,7 +777,7 @@ public void VerifyConsistency(long blocksize, long hashsize, bool verifyfilelist | |||
} | |||
|
|||
var real_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM ""BlocklistHash""", 0); | |||
var unique_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECT DISTINCT ""BlocksetID"", ""Index"" FROM ""BlocklistHash"")", 0); | |||
var unique_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECT ""BlocksetID"", ""Index"" FROM ""BlocklistHash"")", 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my dataset both queries returned the same, but this specifically compares unique entries to all entries and (BlocksetID, Index) is not guaranteed unique in the table
@@ -799,7 +799,7 @@ public void VerifyConsistency(long blocksize, long hashsize, bool verifyfilelist | |||
using (var cmd2 = m_connection.CreateCommand(transaction)) | |||
foreach (var filesetid in cmd.ExecuteReaderEnumerable(@"SELECT ""ID"" FROM ""Fileset"" ").Select(x => x.ConvertValueToInt64(0, -1))) | |||
{ | |||
var expandedCmd = string.Format(@"SELECT COUNT(*) FROM (SELECT DISTINCT ""Path"" FROM ({0}) UNION SELECT DISTINCT ""Path"" FROM ({1}))", LocalDatabase.LIST_FILESETS, LocalDatabase.LIST_FOLDERS_AND_SYMLINKS); | |||
var expandedCmd = string.Format(@"SELECT COUNT(DISTINCT ""Path"") FROM (SELECT ""Path"" FROM ({0}) UNION ALL SELECT ""Path"" FROM ({1}))", LocalDatabase.LIST_FILESETS, LocalDatabase.LIST_FOLDERS_AND_SYMLINKS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old: 13398ms
New: 14160ms
@@ -65,7 +65,7 @@ public IEnumerable<Duplicati.Library.Interface.IListResultFileset> GetFilesets(I | |||
var sql = string.Format( | |||
@"SELECT DISTINCT ""FilesetID"" FROM (" + | |||
@"SELECT ""FilesetID"" FROM ""FilesetEntry"" WHERE ""FileID"" IN ( SELECT ""ID"" FROM ""FileLookup"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""BlocksetEntry"" WHERE ""BlockID"" IN ( SELECT ""ID"" From ""Block"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0})))))" + | |||
" UNION " + | |||
" UNION ALL " + | |||
@"SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))" + | |||
")", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No difference in performance
@"SELECT ""Path"" FROM ""File"" WHERE ""MetadataID"" IN (SELECT ""ID"" FROM ""Metadataset"" WHERE ""BlocksetID"" IN (SELECT ""BlocksetID"" FROM ""BlocksetEntry"" WHERE ""BlockID"" IN (SELECT ""ID"" FROM ""Block"" WHERE ""VolumeID"" IN (SELECT ""ID"" from ""RemoteVolume"" WHERE ""Name"" IN ({0})))))" + | ||
@" UNION " + | ||
@" UNION ALL " + | ||
@"SELECT ""Path"" FROM ""File"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))" + | ||
@") ORDER BY ""Path"" ", | ||
string.Join(",", items.Select(x => "?")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old: 12914ms
New: 6123ms
@@ -139,7 +139,7 @@ public IEnumerable<Duplicati.Library.Interface.IListResultRemoteVolume> GetVolum | |||
var sql = string.Format( | |||
@"SELECT DISTINCT ""Name"" FROM ( " + | |||
@" SELECT ""Name"" FROM ""Remotevolume"" WHERE ""ID"" IN ( SELECT ""VolumeID"" FROM ""Block"" WHERE ""ID"" IN ( SELECT ""BlockID"" FROM ""BlocksetEntry"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""FileLookup"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))))) " + | |||
@" UNION " + | |||
@" UNION ALL " + | |||
@" SELECT ""Name"" FROM ""Remotevolume"" WHERE ""ID"" IN ( SELECT ""VolumeID"" FROM ""Block"" WHERE ""ID"" IN ( SELECT ""BlockID"" FROM ""BlocksetEntry"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""Metadataset"" WHERE ""ID"" IN ( SELECT ""MetadataID"" FROM ""FileLookup"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))))))" + | |||
@")", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old: 19458ms
New: 13540ms
@@ -161,7 +161,7 @@ public void AddFromDb(long filesetId, bool asNew, Library.Utility.IFilter filter | |||
cmd.ExecuteNonQuery(); | |||
} | |||
|
|||
cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECT DISTINCT ""Path"" FROM ""{2}"") ", tablename, combined, filenamestable), filesetId); | |||
cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECT ""Path"" FROM ""{2}"") ", tablename, combined, filenamestable), filesetId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old: 13285ms
New: 10527ms
@@ -73,7 +73,7 @@ public FileSets(LocalListDatabase owner, DateTime time, long[] versions) | |||
|
|||
using(var cmd = m_connection.CreateCommand()) | |||
{ | |||
cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECT DISTINCT ""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" " + query, m_tablename), args); | |||
cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECT ""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" " + query, m_tablename), args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both run quickly in my test, but ID is unique so there should be no difference in results
I tried to run each query individually to see how much faster it is, but I gave up in the middle so I don't have times for the others. However, in my test case that runs all database queries there was no significant performance improvement from this PR. |
DISTINCT and UNION will remove duplicates, requiring a forced sort, unique and temporary table step. It can't be flattened or optimized in other ways. So they are removed to allow SQLite to do optimization of searches.