-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards #1802
base: trunk
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## trunk #1802 +/- ##
==========================================
+ Coverage 57.38% 57.44% +0.05%
==========================================
Files 84 84
Lines 6517 6530 +13
==========================================
+ Hits 3740 3751 +11
- Misses 2777 2779 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I left a couple alternative suggestions.
Please add some test coverage for the lines not covered be tests.
if ( isset( $parsed_url['path'] ) ) { | ||
$path_segments = explode( '/', $parsed_url['path'] ); | ||
$last_segment = array_pop( $path_segments ); | ||
$encoded_last_segment = rawurlencode( $last_segment ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this conditionally only encode the $last_segment
if it contains any characters which are not ASCII?
$parsed_url = wp_parse_url( $link['href'] ); | ||
if ( isset( $parsed_url['path'] ) ) { | ||
$path_segments = explode( '/', $parsed_url['path'] ); | ||
$last_segment = array_pop( $path_segments ); | ||
$encoded_last_segment = rawurlencode( $last_segment ); | ||
|
||
$encoded_path = implode( '/', $path_segments ) . '/' . $encoded_last_segment; | ||
|
||
$scheme = isset( $parsed_url['scheme'] ) ? $parsed_url['scheme'] : ''; | ||
$host = isset( $parsed_url['host'] ) ? $parsed_url['host'] : ''; | ||
|
||
$link['href'] = esc_url_raw( $scheme . '://' . $host . $encoded_path ); | ||
|
||
// Append query and fragment if they exist. | ||
$link['href'] .= isset( $parsed_url['query'] ) ? '?' . $parsed_url['query'] : ''; | ||
$link['href'] .= isset( $parsed_url['fragment'] ) ? '#' . $parsed_url['fragment'] : ''; | ||
} else { | ||
$link['href'] = esc_url_raw( $link['href'] ); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of parsing the URL and then re-constructing it, what if you just check if the href
has any non-ASCII chars and then encode the entire URL?
$parsed_url = wp_parse_url( $link['href'] ); | |
if ( isset( $parsed_url['path'] ) ) { | |
$path_segments = explode( '/', $parsed_url['path'] ); | |
$last_segment = array_pop( $path_segments ); | |
$encoded_last_segment = rawurlencode( $last_segment ); | |
$encoded_path = implode( '/', $path_segments ) . '/' . $encoded_last_segment; | |
$scheme = isset( $parsed_url['scheme'] ) ? $parsed_url['scheme'] : ''; | |
$host = isset( $parsed_url['host'] ) ? $parsed_url['host'] : ''; | |
$link['href'] = esc_url_raw( $scheme . '://' . $host . $encoded_path ); | |
// Append query and fragment if they exist. | |
$link['href'] .= isset( $parsed_url['query'] ) ? '?' . $parsed_url['query'] : ''; | |
$link['href'] .= isset( $parsed_url['fragment'] ) ? '#' . $parsed_url['fragment'] : ''; | |
} else { | |
$link['href'] = esc_url_raw( $link['href'] ); | |
} | |
if ( 1 === preg_match( '/[^\x20-\x7E]/', $link['href'] ) ) { | |
$link['href'] = rawurlencode( urldecode( $link['href'] ) ); | |
} else { | |
$link['href'] = esc_url_raw( $link['href'] ); | |
} |
The regular expression range there is to match everything from a space to a tilde. This might not be the right range.
What about when an internationalized domain name is used? Couldn't this also cause problems with encoding? |
Additionally, on multisite subdirectory installs, in theory the path before
|
Summary
This PR addresses an issue where non-ASCII characters in URL filenames caused HTTP headers to break reverse proxies and violate standards. The solution encodes only the filename part of URLs, ensuring compliance with ISO-8859-1 character requirements. This change maintains URL integrity while preventing potential issues with reverse proxies.
Example URL :
https://testsite.com/wp-content/uploads/2025/01/חנות-scaled.avif
Corrected URL:
https://testsite.com/wp-content/uploads/2025/01/%D7%97%D7%A0%D7%95%D7%AA-scaled.avif
Fixes #1775
Relevant technical choices
rawurlencode()
since the rest of the path is the standard uploads path in WordPress, ensuring that only ASCII characters are present in the HTTP Link headers.wp_parse_url()
to decompose URLs into components, allowing for precise encoding of the filename while preserving the rest of the URL structure.