Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards #1802

Draft: wants to merge 1 commit into trunk

Conversation

AhmarZaidi
Contributor

Summary

This PR fixes an issue where non-ASCII characters in URL filenames were emitted unencoded in the HTTP Link header, which breaks some reverse proxies and violates HTTP standards. The fix percent-encodes only the filename part of the URL, so the header value satisfies the ISO-8859-1 character requirement while the rest of the URL is left unchanged.

Example URL: https://testsite.com/wp-content/uploads/2025/01/חנות-scaled.avif
Corrected URL: https://testsite.com/wp-content/uploads/2025/01/%D7%97%D7%A0%D7%95%D7%AA-scaled.avif
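
The encoding in the example above can be reproduced with plain PHP. This is only an illustrative check, not the code in this PR, and the variable names are assumptions.

```php
<?php
// Illustrative check of the example above; variable names are assumptions.
$filename = 'חנות-scaled.avif';

// rawurlencode() percent-encodes every byte outside the RFC 3986 unreserved
// set, leaving letters, digits, '-', '.', '_' and '~' untouched.
$encoded = rawurlencode( $filename );

echo $encoded, "\n"; // %D7%97%D7%A0%D7%95%D7%AA-scaled.avif
```

Each Hebrew letter is two bytes in UTF-8, which is why every character becomes two `%XX` escapes.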

Fixes #1775

Relevant technical choices

  • Non-ASCII characters in URLs, such as Hebrew characters in filenames, were being emitted in HTTP Link headers, breaking reverse proxies and violating HTTP standards.
  • Added a function that encodes only the filename part of the URL with rawurlencode(), on the assumption that the rest of the path is the standard WordPress uploads path, so that only ASCII characters appear in the Link header.
  • Used wp_parse_url() to split the URL into components so the filename can be encoded while the rest of the URL structure is preserved.
  • Rebuilt the URL from the optional scheme and host plus the encoded path, then appended the query and fragment components if they exist.
  • The resulting header value satisfies the ISO-8859-1 character requirement for HTTP headers.

@AhmarZaidi AhmarZaidi changed the title from "Fix: Optimize URL encoding logic in get_response_header" to "Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards" on Jan 14, 2025

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 57.44%. Comparing base (6ca5c4b) to head (700ec77).
Report is 63 commits behind head on trunk.

Files with missing lines                                 Patch %   Lines
...ptimization-detective/class-od-link-collection.php    86.66%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            trunk    #1802      +/-   ##
==========================================
+ Coverage   57.38%   57.44%   +0.05%     
==========================================
  Files          84       84              
  Lines        6517     6530      +13     
==========================================
+ Hits         3740     3751      +11     
- Misses       2777     2779       +2     
Flag        Coverage Δ
multisite   57.44% <86.66%> (+0.05%) ⬆️
single      34.44% <0.00%> (-0.07%) ⬇️


Member

@westonruter westonruter left a comment


Thanks for the PR! I left a couple of alternative suggestions.

Please add some test coverage for the lines not covered by tests.

    if ( isset( $parsed_url['path'] ) ) {
        $path_segments        = explode( '/', $parsed_url['path'] );
        $last_segment         = array_pop( $path_segments );
        $encoded_last_segment = rawurlencode( $last_segment );
Member


Should this conditionally only encode the $last_segment if it contains any characters which are not ASCII?
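
One way the conditional encoding asked about here could look is sketched below. The helper name `od_example_maybe_encode_segment()` is hypothetical and is not part of the plugin. The sketch assumes that "non-ASCII" means any byte outside the printable range 0x20 to 0x7E.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): encodes a path segment only
 * when it contains bytes outside printable ASCII (0x20-0x7E).
 */
function od_example_maybe_encode_segment( string $segment ): string {
	// Without the /u modifier the pattern is matched byte by byte, so any
	// byte of a multi-byte UTF-8 character triggers the encoding branch.
	if ( 1 !== preg_match( '/[^\x20-\x7E]/', $segment ) ) {
		return $segment; // Already ASCII; existing %XX escapes stay as they are.
	}

	// Decode first so escapes already present are not double-encoded.
	return rawurlencode( rawurldecode( $segment ) );
}

echo od_example_maybe_encode_segment( 'חנות-scaled.avif' ), "\n";
```

A practical benefit of the check is that a segment such as `my%20file.avif` passes through unchanged, whereas unconditional `rawurlencode()` would turn `%20` into `%2520`.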

Comment on lines +253 to +271
    $parsed_url = wp_parse_url( $link['href'] );
    if ( isset( $parsed_url['path'] ) ) {
        $path_segments        = explode( '/', $parsed_url['path'] );
        $last_segment         = array_pop( $path_segments );
        $encoded_last_segment = rawurlencode( $last_segment );

        $encoded_path = implode( '/', $path_segments ) . '/' . $encoded_last_segment;

        $scheme = isset( $parsed_url['scheme'] ) ? $parsed_url['scheme'] : '';
        $host   = isset( $parsed_url['host'] ) ? $parsed_url['host'] : '';

        $link['href'] = esc_url_raw( $scheme . '://' . $host . $encoded_path );

        // Append query and fragment if they exist.
        $link['href'] .= isset( $parsed_url['query'] ) ? '?' . $parsed_url['query'] : '';
        $link['href'] .= isset( $parsed_url['fragment'] ) ? '#' . $parsed_url['fragment'] : '';
    } else {
        $link['href'] = esc_url_raw( $link['href'] );
    }
Member


Instead of parsing the URL and then re-constructing it, what if you just check if the href has any non-ASCII chars and then encode the entire URL?

Suggested change
-   $parsed_url = wp_parse_url( $link['href'] );
-   if ( isset( $parsed_url['path'] ) ) {
-       $path_segments        = explode( '/', $parsed_url['path'] );
-       $last_segment         = array_pop( $path_segments );
-       $encoded_last_segment = rawurlencode( $last_segment );
-       $encoded_path = implode( '/', $path_segments ) . '/' . $encoded_last_segment;
-       $scheme = isset( $parsed_url['scheme'] ) ? $parsed_url['scheme'] : '';
-       $host   = isset( $parsed_url['host'] ) ? $parsed_url['host'] : '';
-       $link['href'] = esc_url_raw( $scheme . '://' . $host . $encoded_path );
-       // Append query and fragment if they exist.
-       $link['href'] .= isset( $parsed_url['query'] ) ? '?' . $parsed_url['query'] : '';
-       $link['href'] .= isset( $parsed_url['fragment'] ) ? '#' . $parsed_url['fragment'] : '';
-   } else {
-       $link['href'] = esc_url_raw( $link['href'] );
-   }
+   if ( 1 === preg_match( '/[^\x20-\x7E]/', $link['href'] ) ) {
+       $link['href'] = rawurlencode( urldecode( $link['href'] ) );
+   } else {
+       $link['href'] = esc_url_raw( $link['href'] );
+   }

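The character range in the regular expression covers everything from a space (\x20) to a tilde (\x7E), i.e. printable ASCII, so the negated class matches any character outside that range. This might not be the right range.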
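
One caveat with encoding the entire URL is that `rawurlencode()` also escapes `:` and `/`, so `https://a/b` becomes `https%3A%2F%2Fa%2Fb`. A possible alternative, sketched below under the assumption that only the offending bytes need escaping, is to percent-encode just the bytes outside printable ASCII. The helper name `od_example_encode_non_ascii()` is hypothetical, and the range `\x21-\x7E` is an assumption chosen here so that spaces are escaped as well.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): percent-encodes only the
 * bytes outside 0x21-0x7E, leaving ':', '/', '?', '#' and existing %XX
 * escapes intact.
 */
function od_example_encode_non_ascii( string $url ): string {
	return (string) preg_replace_callback(
		'/[^\x21-\x7E]/', // No /u modifier, so each match is a single byte.
		static function ( array $matches ): string {
			return sprintf( '%%%02X', ord( $matches[0] ) );
		},
		$url
	);
}

echo od_example_encode_non_ascii(
	'https://testsite.com/wp-content/uploads/2025/01/חנות-scaled.avif'
), "\n";
```

This sketch does not treat the host specially, so it would also percent-encode a non-ASCII host name rather than converting it to its ASCII form.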

@westonruter
Member

> The solution encodes only the filename part of URLs

What about when an internationalized domain name is used? Couldn't this also cause problems with encoding?
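
For the host part, percent-encoding is generally not what is wanted; internationalized domain names are normally converted to their ASCII (Punycode) form instead. A minimal sketch follows, assuming PHP's intl extension may or may not be installed. The helper name `od_example_host_to_ascii()` is hypothetical, and the fallback behaviour is an assumption.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): converts a host name to its
 * ASCII form when the intl extension is available.
 */
function od_example_host_to_ascii( string $host ): string {
	// Pure-ASCII hosts need no conversion.
	if ( 1 !== preg_match( '/[^\x20-\x7E]/', $host ) ) {
		return $host;
	}

	// idn_to_ascii() is provided by the intl extension, which is optional.
	if ( function_exists( 'idn_to_ascii' ) ) {
		$ascii = idn_to_ascii( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
		if ( false !== $ascii ) {
			return $ascii;
		}
	}

	// Assumption: without intl, the host is returned unchanged and the
	// caller decides whether to omit the header or handle it another way.
	return $host;
}

echo od_example_host_to_ascii( 'testsite.com' ), "\n";
```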

@westonruter westonruter added the "[Type] Bug" (An existing feature is broken) and "[Plugin] Optimization Detective" (Issues for the Optimization Detective plugin) labels on Jan 14, 2025
@westonruter
Member

Additionally, on multisite subdirectory installs, in theory the path before wp-content could also include non-ASCII chars:

https://testsite.com/חנות/wp-content/uploads/2025/01/example-scaled.avif
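
To cover a case like this one, every path segment would need the same treatment as the filename. The sketch below is one way to do that; the helper name `od_example_encode_path()` is hypothetical and is not part of the plugin.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): percent-encodes every
 * non-ASCII path segment, not just the filename.
 */
function od_example_encode_path( string $path ): string {
	$segments = array_map(
		static function ( string $segment ): string {
			// Leave ASCII segments, including existing %XX escapes, untouched.
			if ( 1 !== preg_match( '/[^\x20-\x7E]/', $segment ) ) {
				return $segment;
			}
			return rawurlencode( rawurldecode( $segment ) );
		},
		explode( '/', $path )
	);

	// Splitting and rejoining on '/' keeps the separators unencoded.
	return implode( '/', $segments );
}

echo od_example_encode_path( '/חנות/wp-content/uploads/2025/01/example-scaled.avif' ), "\n";
```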

2 participants