Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: #6146 - utf8/utf8mb3 alias causes redundant ALTER TABLE changes #6370

Open
wants to merge 2 commits into
base: 4.3.x
Choose a base branch
from

Conversation

fisharebest
Copy link

@fisharebest fisharebest commented Apr 29, 2024

Q A
Type bug
Fixed issues #6146

Summary

In MySQL (and MariaDB), utf8 and utf8mb3 are aliases. Depending on the database server version, the information schema may contain one or the other. Similarly, column definitions may contain one or the other.

When performing schema-comparisons, a mismatch between utf8 and utf8mb3 causes an ALTER TABLE statement to be generated - even when columns are identical.

The code already contains a function to normalize the charset/collation options:

private function normalizeOptions(array $options): array

My solution is to extend this function so that it also replaces utf8 with utf8mb3 (or vice-versa), to match the current database connection.

To detect the prefered character set for the connection, I've added:
AbstractMySQLPlatform::informationSchemaUsesUtf8mb3()
MariaDBPlatform::informationSchemaUsesUtf8mb3()

The obvious place for the new normalization logic is the CharsetMetadataProvider and CollationMetadataProvider classes. These are interfaces, so this is technically a BC break, however, they are marked @internal.

I've added tests which should cover all the new/modified code. I've run it against MySQL. I don't have MariaDB to hand.

@fisharebest
Copy link
Author

Just pushed a change to fix the psalm issues

@greg0ire
Copy link
Member

It seems there are CI jobs failing. Please take a look at this guide for more on how to handle those.

@greg0ire
Copy link
Member

ಠ_ಠ

@greg0ire
Copy link
Member

greg0ire commented May 1, 2024

I've added tests which should cover all the new/modified code. I've run it against MySQL. I don't have MariaDB to hand.

Codecov says that some of the new code is untested, most notably the parts using version_compare. Please add functional tests for this change. As you can see, our test suite features, among other platforms MariaDB 10.5 and MariaDB 10.6, so you should be able to test that schema comparisons on either version of MariaDB indeed doesn't cause ALTER TABLE statements to be generated. Same goes for MySQL.

You can learn more about such tests at https://www.doctrine-project.org/projects/doctrine-dbal/en/stable/reference/testing.html

@fisharebest
Copy link
Author

Codecov says that some of the new code is untested, most notably the parts using version_compare. Please add functional tests for this change.

Running the tests locally says the code is covered. (Ad hoc testing confirms this).

XDEBUG_MODE=coverage vendor/bin/phpunit -c ci/github/phpunit/pdo_mysql.xml --coverage-html=coverage

Screenshot 2024-05-01 at 13 47 44

However, Codecov says it isn't.

Screenshot 2024-05-01 at 13 48 31

@greg0ire
Copy link
Member

greg0ire commented May 1, 2024

I think I found the reason: [2024-05-01T09:48:28.648Z] ['error'] There was an error running the uploader: Error uploading to [https://codecov.io:](https://codecov.io/) Error: There was an error fetching the storage URL during POST: 429 - {'detail': ErrorDetail(string='Rate limit reached. Please upload with the Codecov repository upload token to resolve issue. Expected available in 1935 seconds.', code='throttled')}

Let me retry this.

greg0ire added a commit to greg0ire/dbal that referenced this pull request May 1, 2024
In a recent PR, we got reports from CodeCov about several lines not
being covered. After further inspection, I found an error message in the
coverage file upload saying we hit a rate limit.
This will make it more likely we spot such issues in the future.

The PR: doctrine#6370
@greg0ire
Copy link
Member

greg0ire commented May 1, 2024

From the Codecov UI:

Coverage data is based on head 9c87c57(2 uploads) compared to base 7fb00fd(5 uploads)

… we currently send 69 uploads, so there is no way we are going to be below the upload limit. I will have to look into this separately.

greg0ire
greg0ire previously approved these changes May 1, 2024
Copy link
Member

@greg0ire greg0ire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, my feedback is not a blocker.

public function normalizeCollation(string $collation): string
{
if ($this->useUtf8mb3 && str_starts_with($collation, 'utf8_')) {
return 'utf8mb3' . substr($collation, 4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make things more obvious here:

Suggested change
return 'utf8mb3' . substr($collation, 4);
return 'utf8mb3' . substr($collation, strlen('utf8'));

}

if (! $this->useUtf8mb3 && str_starts_with($collation, 'utf8mb3_')) {
return 'utf8' . substr($collation, 7);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return 'utf8' . substr($collation, 7);
return 'utf8' . substr($collation, strlen('utf8mb3'));

@greg0ire
Copy link
Member

greg0ire commented May 1, 2024

Ah there is one issue though: since this is a fix, you should target 3.8.x

@fisharebest
Copy link
Author

Ah there is one issue though: since this is a fix, you should target 3.8.x

I can create another PR for this branch.

@greg0ire
Copy link
Member

greg0ire commented May 1, 2024

Please don't, we usually do merges up. Instead, let me retarget your PR, and please rebase your commits on 3.8.x and force push.

@greg0ire greg0ire changed the base branch from 4.0.x to 3.8.x May 1, 2024 16:36
@greg0ire greg0ire dismissed their stale review May 1, 2024 16:36

The base branch was changed.

greg0ire added a commit to greg0ire/dbal that referenced this pull request May 1, 2024
In a recent PR, we got reports from CodeCov about several lines not
being covered. After further inspection, I found an error message in the
coverage file upload saying we hit a rate limit.
This will make it more likely we spot such issues in the future.

The PR: doctrine#6370
@fisharebest
Copy link
Author

Please don't, we usually do merges up. Instead, let me retarget your PR, and please rebase your commits on 3.8.x and force push.

The PR uses Connection::getServerVersion() to select the prefered alias for utf8/utf8mb3.
In 3.8.x, this method isn't available, so I'll need to find another approach.

I'm going to be offline for the next week or so, and will look at this on my return.

@greg0ire
Copy link
Member

greg0ire commented May 2, 2024

Enjoy your time off 👍

fisharebest added a commit to fisharebest/dbal that referenced this pull request Jun 4, 2024
@greg0ire
Copy link
Member

greg0ire commented Jun 4, 2024

@fisharebest do you need help with the rebase?

@fisharebest
Copy link
Author

I have rewritten my patch to use the information_schema instead of server version numbers. This seems more robust, and it should also continue to work if MySQL remove utf8mb3 in version 9.

However, I have been unable to back-port this PR to the 3.8.x branch. (I can, but it doesn't fix the bug...).

The handling of charset/collation is different. I'd probably need to copy lots of other changes from the 4.0.x branch to make it work.

Options are:

  • apply this patch to 4.0.x only
  • attempt to backport the changes in charset/collation handling from 4.0.x to 3.8.x and then try again
  • ????

@greg0ire
Copy link
Member

greg0ire commented Jun 4, 2024

I'd be surprised if there were lots of changes you could backport that are not breaking changes, so I'd say we should consider applying this to 4.0.x although given the usage statistics of 4.0.x, it won't have much effect at first.

@fisharebest fisharebest changed the base branch from 3.8.x to 4.0.x June 4, 2024 21:13
Copy link

@arbor95 arbor95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All seems ok

@SenseException
Copy link
Member

I have rewritten my patch to use the information_schema instead of server version numbers. This seems more robust, and it should also continue to work if MySQL remove utf8mb3 in version 9.

As far as I understand this, utf8 would stay utf8 once utf8mb3 isn't supported anymore. I didn't find information about when this alias handling was introduced on a short search though.

@derrabus derrabus changed the base branch from 4.0.x to 4.1.x August 14, 2024 11:03
@fisharebest
Copy link
Author

it should also continue to work if MySQL remove utf8mb3 in version 9.

Now that MySQL 9.0 is released, we can see that the logic still works, and the tests pass.

@derrabus
Copy link
Member

Can you please rebase onto 4.1? That should trigger a CI pipeline that includes MySQL 9.

@fisharebest
Copy link
Author

All the tests are passing. Looks like a missing token in the CI tools.

@greg0ire
Copy link
Member

greg0ire commented Sep 4, 2024

You can ignore the Codecov failure, that issue is tracked here: codecov/codecov-action#1548

@derrabus
Copy link
Member

derrabus commented Sep 4, 2024

Yes sorry, we lost track of this PR, apparently. I'll do a review of the PR.

Comment on lines +860 to +868
/**
* INFORMATION_SCHEMA.CHARACTER_SETS will contain either 'utf8' or 'utf8mb3' but not both.
*/
public function informationSchemaUsesUtf8mb3(Connection $connection): bool
{
$sql = "SELECT character_set_name FROM INFORMATION_SCHEMA.CHARACTER_SETS WHERE character_set_name = 'utf8mb3'";

return $connection->fetchOne($sql) === 'utf8mb3';
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having my issues with this method.

First of all, it is highly unusual that our platform classes interact with the connection, in this case send a query to the server. Our platform classes are meant to collect and normalize differences between database systems and versions. Since you don't ever override this method, there's no reason why this method should exist on a platform class.

Moreover, you wrote that…

In MySQL (and MariaDB), utf8 and utf8mb3 are aliases. Depending on the database server version, the information schema may contain one or the other. Similarly, column definitions may contain one or the other.

If this behavior really depends on the database version and taking into account that we know the database version: Why do we need to perform feature detection by running a query on the server?

Comment on lines +18 to +29
public function normalizeCharset(string $charset): string
{
if ($this->useUtf8mb3 && $charset === 'utf8') {
return 'utf8mb3';
}

if (! $this->useUtf8mb3 && $charset === 'utf8mb3') {
return 'utf8';
}

return $charset;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that CharsetMetadataProvider is the right abstraction for this logic. The purpose of CharsetMetadataProvider is to query a piece of configuration from the server in a cachable manner. A normalization method is something entirely different.

Copy link
Member

@derrabus derrabus Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about it: This normalization logic belongs into the platform classes. And if you put it there, you don't need that detection logic of yours and you don't that useUtf8mb3 flag. Give it a try. 🙂

Comment on lines +23 to +27
$utf8mb3Provider = new ConnectionCharsetMetadataProvider($connection, true);

self::assertSame('utf8mb3', $utf8mb3Provider->normalizeCharset('utf8'));
self::assertSame('utf8mb3', $utf8mb3Provider->normalizeCharset('utf8mb3'));
self::assertSame('foobar', $utf8mb3Provider->normalizeCharset('foobar'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a separate test.

Comment on lines +23 to +27
$utf8mb3Provider = new ConnectionCollationMetadataProvider($connection, true);

self::assertSame('utf8mb3_unicode_ci', $utf8mb3Provider->normalizeCollation('utf8_unicode_ci'));
self::assertSame('utf8mb3_unicode_ci', $utf8mb3Provider->normalizeCollation('utf8mb3_unicode_ci'));
self::assertSame('foobar_unicode_ci', $utf8mb3Provider->normalizeCollation('foobar_unicode_ci'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a separate test.

@derrabus derrabus changed the base branch from 4.1.x to 4.2.x October 10, 2024 13:13
@derrabus
Copy link
Member

derrabus commented Nov 24, 2024

@fisharebest Will you have time to pick up the work on this PR? From what I gathered, we don't need the detection logic that you've built at all.

The charsets utf8 and utf8mb3 are the same charsets in MySQL and most likely will be for eternity. MySQL considers one of the identifiers as canonical and the other one as alias. No matter which of the both identifiers we use when creating a table, MySQL will always report the canonical name.

  • In MySQL 5.7 and below, utf8 is the canonical name and utf8mb3 the alias.
  • In MySQL 8.0 and above, utf8mb3 is the canonical name and utf8 the alias.

This means, that if we normalize both schemas to let's say the long version utf8mb3, the comparison should always succeed.

Note that MySQL 5.7 support is deprecated by now, so I would really like to keep logic that works around behavioral changes of MySQL 8.0 as simple as possible.

@derrabus derrabus changed the base branch from 4.2.x to 4.3.x November 24, 2024 16:16
@fisharebest
Copy link
Author

fisharebest commented Nov 25, 2024

  • In MySQL 5.7 and below, utf8 is the canonical name and utf8mb3 the alias.
  • In MySQL 8.0 and above, utf8mb3 is the canonical name and utf8 the alias.

Not quite. The change in canonical/alias happened between 8.0.29 and 8.0.30.
See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

For MariaDB it is a little more complicated. The same change was made in 10.6.1, but a run-time variable (old_mode) allows the previous canonical/alias to be used. This setting still exists on the latest version of MariaDB.
See https://mariadb.com/kb/en/old-mode/#utf8_is_utf8mb3

This means, that if we normalize both schemas to let's say the long version utf8mb3, the comparison should always succeed.

Yes, but there is still the small issue of DBAL looking up default collations for charsets. See the second part of my comment at #6146 (comment)

We'd need to normalise the result of these queries too.

@derrabus
Copy link
Member

Not quite. The change in canonical/alias happened between 8.0.29 and 8.0.30. See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

Oh right. MySQL's versioning policy during 8.0.x times was wild.

For MariaDB it is a little more complicated. The same change was made in 10.6.1, but a run-time variable (old_mode) allows the previous canonical/alias to be used. This setting still exists on the latest version of MariaDB. See https://mariadb.com/kb/en/old-mode/#utf8_is_utf8mb3

WTF. Thank you very much for your thorough research. So, we could set that variable before we run our queries and reset it afterwards, right?

This means, that if we normalize both schemas to let's say the long version utf8mb3, the comparison should always succeed.

Yes, but there is still the small issue of DBAL looking up default collations for charsets. See the second part of my comment at #6146 (comment)

We'd need to normalise the result of these queries too.

Fine by me. But again, if we do all this normalization, we don't need run any detection logic on the server because we enable DBAL to work with MySQL/MariaDB responsing with old and new character set names (and corresponsing collation names), correct?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants