Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't distinct where it's not needed. #4924

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mr-russ
Copy link
Contributor

@mr-russ mr-russ commented Apr 20, 2023

DISTINCT and UNION will remove duplicates, requiring a forced sort, unique and temporary table step. It can't be flattened or optimized in other ways. So they are removed to allow SQLite to do optimization of searches.

DISTINCT and UNION will remove duplicates, requiring
a forced sort, unique and temporary table step.  It can't
be flattened or optimized in other ways.  So they are removed
to allow SQLite to do optimization of searches.
@ts678
Copy link
Collaborator

ts678 commented Jun 22, 2023

I'm wondering how these specific ones were chosen and by what process they're guaranteed to not change behavior badly.
Long ago I tested two (possibly just the DISTINCT part rather than final result) on an example database and result changed.

@Jojo-1000
Copy link
Contributor

I started to run through all of these queries on a copy of my main backup database (with 100kB blocksize, so lots of blocks) to see what improvement is there. It seems like in some cases it can be 30% faster while in others it is just the same.

what process they're guaranteed to not change behavior badly

What I see in a lot of places is a NOT IN (SELECT DISTINCT ...). I am pretty sure at least those are save to change (IN should not care about duplicates). The others need some careful consideration, which is why I am trying to see if it is worth it in terms of performance.

@duplicatibot
Copy link

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/reducing-time-spent-deleting-blocks/16403/9

SELECT DISTINCT ""Path"" AS ""RealPath"",
? || length(""RealPath"") || ? || row_number() OVER () ||
CASE WHEN substr(""RealPath"", length(""RealPath"")) = ? THEN ? ELSE ? END) AS ""Obfuscated"" FROM ""File""",
Platform.IsClientPosix ? "/" : "X:\\", Util.DirectorySeparatorString, Util.DirectorySeparatorString, ".bin");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query has a missing ( before line 3 and lacks a 3rd dir separator in parameters.
Combined execution time on my main backup database of old 3 queries was 3938ms, new is 6030ms.


// Create a temporary table to cache subquery result, as it might take long (SQLite does not cache at all).
deletecmd.ExecuteNonQuery(string.Format(@"CREATE TEMP TABLE ""{0}"" (""ID"" INTEGER PRIMARY KEY)", blocksetidstable));
deletecmd.ExecuteNonQuery(string.Format(@"INSERT OR IGNORE INTO ""{0}"" (""ID"") {1}", blocksetidstable, bsIdsSubQuery));
bsIdsSubQuery = string.Format(@"SELECT DISTINCT ""ID"" FROM ""{0}"" ", blocksetidstable);
bsIdsSubQuery = string.Format(@"SELECT ""ID"" FROM ""{0}"" ", blocksetidstable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No difference in performance

UNION ALL
SELECT ""BlocksetID"" FROM ""BlocklistHash""
WHERE ""Hash"" IN (SELECT ""Hash"" FROM ""Block"" WHERE ""VolumeID"" IN ({volIdsSubQuery}))";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test with 4 random volumes
Old: 98ms
New: 86ms (contains duplicates, but should not matter)

In combined temp table insert:
Old: 100ms
New: 90ms

@@ -777,7 +777,7 @@ public void VerifyConsistency(long blocksize, long hashsize, bool verifyfilelist
}

var real_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM ""BlocklistHash""", 0);
var unique_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECT DISTINCT ""BlocksetID"", ""Index"" FROM ""BlocklistHash"")", 0);
var unique_count = cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECT ""BlocksetID"", ""Index"" FROM ""BlocklistHash"")", 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my dataset both queries returned the same, but this specifically compares unique entries to all entries and (BlocksetID, Index) is not guaranteed unique in the table

@@ -799,7 +799,7 @@ public void VerifyConsistency(long blocksize, long hashsize, bool verifyfilelist
using (var cmd2 = m_connection.CreateCommand(transaction))
foreach (var filesetid in cmd.ExecuteReaderEnumerable(@"SELECT ""ID"" FROM ""Fileset"" ").Select(x => x.ConvertValueToInt64(0, -1)))
{
var expandedCmd = string.Format(@"SELECT COUNT(*) FROM (SELECT DISTINCT ""Path"" FROM ({0}) UNION SELECT DISTINCT ""Path"" FROM ({1}))", LocalDatabase.LIST_FILESETS, LocalDatabase.LIST_FOLDERS_AND_SYMLINKS);
var expandedCmd = string.Format(@"SELECT COUNT(DISTINCT ""Path"") FROM (SELECT ""Path"" FROM ({0}) UNION ALL SELECT ""Path"" FROM ({1}))", LocalDatabase.LIST_FILESETS, LocalDatabase.LIST_FOLDERS_AND_SYMLINKS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old: 13398ms
New: 14160ms

@@ -65,7 +65,7 @@ public IEnumerable<Duplicati.Library.Interface.IListResultFileset> GetFilesets(I
var sql = string.Format(
@"SELECT DISTINCT ""FilesetID"" FROM (" +
@"SELECT ""FilesetID"" FROM ""FilesetEntry"" WHERE ""FileID"" IN ( SELECT ""ID"" FROM ""FileLookup"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""BlocksetEntry"" WHERE ""BlockID"" IN ( SELECT ""ID"" From ""Block"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0})))))" +
" UNION " +
" UNION ALL " +
@"SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))" +
")",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No difference in performance

@"SELECT ""Path"" FROM ""File"" WHERE ""MetadataID"" IN (SELECT ""ID"" FROM ""Metadataset"" WHERE ""BlocksetID"" IN (SELECT ""BlocksetID"" FROM ""BlocksetEntry"" WHERE ""BlockID"" IN (SELECT ""ID"" FROM ""Block"" WHERE ""VolumeID"" IN (SELECT ""ID"" from ""RemoteVolume"" WHERE ""Name"" IN ({0})))))" +
@" UNION " +
@" UNION ALL " +
@"SELECT ""Path"" FROM ""File"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))" +
@") ORDER BY ""Path"" ",
string.Join(",", items.Select(x => "?"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old: 12914ms
New: 6123ms

@@ -139,7 +139,7 @@ public IEnumerable<Duplicati.Library.Interface.IListResultRemoteVolume> GetVolum
var sql = string.Format(
@"SELECT DISTINCT ""Name"" FROM ( " +
@" SELECT ""Name"" FROM ""Remotevolume"" WHERE ""ID"" IN ( SELECT ""VolumeID"" FROM ""Block"" WHERE ""ID"" IN ( SELECT ""BlockID"" FROM ""BlocksetEntry"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""FileLookup"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))))) " +
@" UNION " +
@" UNION ALL " +
@" SELECT ""Name"" FROM ""Remotevolume"" WHERE ""ID"" IN ( SELECT ""VolumeID"" FROM ""Block"" WHERE ""ID"" IN ( SELECT ""BlockID"" FROM ""BlocksetEntry"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""Metadataset"" WHERE ""ID"" IN ( SELECT ""MetadataID"" FROM ""FileLookup"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))))))" +
@")",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old: 19458ms
New: 13540ms

@@ -161,7 +161,7 @@ public void AddFromDb(long filesetId, bool asNew, Library.Utility.IFilter filter
cmd.ExecuteNonQuery();
}

cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECT DISTINCT ""Path"" FROM ""{2}"") ", tablename, combined, filenamestable), filesetId);
cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECT ""Path"" FROM ""{2}"") ", tablename, combined, filenamestable), filesetId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old: 13285ms
New: 10527ms

@@ -73,7 +73,7 @@ public FileSets(LocalListDatabase owner, DateTime time, long[] versions)

using(var cmd = m_connection.CreateCommand())
{
cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECT DISTINCT ""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" " + query, m_tablename), args);
cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECT ""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" " + query, m_tablename), args);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both run quickly in my test, but ID is unique so there should be no difference in results

@Jojo-1000
Copy link
Contributor

Jojo-1000 commented Jun 23, 2023

I tried to run each query individually to see how much faster it is, but I gave up in the middle so I don't have times for the others. However, in my test case that runs all database queries there was no significant performance improvement from this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants