Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards #1802

AhmarZaidi · 2025-01-14T07:30:47Z

Summary

This PR addresses an issue where non-ASCII characters in URL filenames caused HTTP headers to break reverse proxies and violate standards. The solution encodes only the filename part of URLs, ensuring compliance with ISO-8859-1 character requirements. This change maintains URL integrity while preventing potential issues with reverse proxies.

Example URL: https://testsite.com/wp-content/uploads/2025/01/חנות-scaled.avif
Corrected URL: https://testsite.com/wp-content/uploads/2025/01/%D7%97%D7%A0%D7%95%D7%AA-scaled.avif

Fixes #1775

Relevant technical choices

Implemented a solution to address the issue where non-ASCII characters in URLs, such as Hebrew characters in filenames, were causing HTTP headers to break reverse proxies and violate HTTP standards.
Added a function that encodes only the filename part of the URL with rawurlencode(), on the assumption that the rest of the path is the standard WordPress uploads path, so that only ASCII characters appear in the HTTP Link header.
Utilized wp_parse_url() to decompose URLs into components, allowing for precise encoding of the filename while preserving the rest of the URL structure.
Reconstructed the URL with optional scheme and host, and appended query and fragment components if they exist, ensuring full URL integrity.
This change ensures compliance with ISO-8859-1 character requirements in HTTP headers.
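
The filename-only approach described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `example_encode_filename_in_url()` is hypothetical, plain `parse_url()` stands in for `wp_parse_url()` so the snippet runs outside WordPress, and user/password components are ignored.

```php
<?php
/**
 * Hypothetical helper (illustration only): percent-encodes just the last
 * path segment of a URL and leaves every other component untouched.
 */
function example_encode_filename_in_url( string $url ): string {
	// Assumption: parse_url() is used here in place of wp_parse_url().
	$parts = parse_url( $url );
	if ( ! is_array( $parts ) || ! isset( $parts['path'] ) ) {
		return $url; // No path, so there is no filename to encode.
	}

	// Split the path, encode only the final segment, and join it back.
	$segments   = explode( '/', $parts['path'] );
	$filename   = array_pop( $segments );
	$segments[] = rawurlencode( $filename );
	$path       = implode( '/', $segments );

	// Rebuild the URL, adding each optional component only if it was present.
	$result = '';
	if ( isset( $parts['scheme'] ) ) {
		$result .= $parts['scheme'] . ':';
	}
	if ( isset( $parts['host'] ) ) {
		$result .= '//' . $parts['host'];
		if ( isset( $parts['port'] ) ) {
			$result .= ':' . $parts['port'];
		}
	}
	$result .= $path;
	if ( isset( $parts['query'] ) ) {
		$result .= '?' . $parts['query'];
	}
	if ( isset( $parts['fragment'] ) ) {
		$result .= '#' . $parts['fragment'];
	}
	return $result;
}
```

Applied to the example URL from the summary, this sketch should yield the corrected URL shown there. Note that it leaves non-ASCII characters in any other path segment, or in the host, untouched.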

codecov · 2025-01-14T07:38:46Z

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 57.44%. Comparing base (6ca5c4b) to head (700ec77).
Report is 63 commits behind head on trunk.

Files with missing lines	Patch %	Lines
...ptimization-detective/class-od-link-collection.php	86.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##            trunk    #1802      +/-   ##
==========================================
+ Coverage   57.38%   57.44%   +0.05%     
==========================================
  Files          84       84              
  Lines        6517     6530      +13     
==========================================
+ Hits         3740     3751      +11     
- Misses       2777     2779       +2

Flag	Coverage Δ
multisite	`57.44% <86.66%> (+0.05%)`	⬆️
single	`34.44% <0.00%> (-0.07%)`	⬇️


westonruter

Thanks for the PR! I left a couple alternative suggestions.

Please add some test coverage for the lines not covered by tests.

westonruter · 2025-01-14T17:31:43Z

plugins/optimization-detective/class-od-link-collection.php

+				if ( isset( $parsed_url['path'] ) ) {
+					$path_segments        = explode( '/', $parsed_url['path'] );
+					$last_segment         = array_pop( $path_segments );
+					$encoded_last_segment = rawurlencode( $last_segment );


Should this conditionally only encode the $last_segment if it contains any characters which are not ASCII?
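
One way to make the encoding conditional, as asked above, is to check for bytes outside the ASCII range first. The helper names below are hypothetical and the snippet is only a sketch of the idea, not the change that was adopted.

```php
<?php
/**
 * Hypothetical helper (illustration only): reports whether a string
 * contains any byte outside the 7-bit ASCII range.
 */
function example_has_non_ascii( string $value ): bool {
	// Without the /u modifier PCRE matches byte by byte, so every byte of a
	// multi-byte UTF-8 character falls outside \x00-\x7F.
	return 1 === preg_match( '/[^\x00-\x7F]/', $value );
}

/**
 * Hypothetical helper: encodes a path segment only when it needs it.
 */
function example_maybe_encode_segment( string $segment ): string {
	return example_has_non_ascii( $segment ) ? rawurlencode( $segment ) : $segment;
}
```

A practical reason for the guard: calling rawurlencode() unconditionally on a segment that is already percent-encoded, such as `my%20file.png`, would encode the `%` again and produce `my%2520file.png`.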

westonruter · 2025-01-14T17:44:00Z

plugins/optimization-detective/class-od-link-collection.php

+				$parsed_url = wp_parse_url( $link['href'] );
+				if ( isset( $parsed_url['path'] ) ) {
+					$path_segments        = explode( '/', $parsed_url['path'] );
+					$last_segment         = array_pop( $path_segments );
+					$encoded_last_segment = rawurlencode( $last_segment );
+
+					$encoded_path = implode( '/', $path_segments ) . '/' . $encoded_last_segment;
+
+					$scheme = isset( $parsed_url['scheme'] ) ? $parsed_url['scheme'] : '';
+					$host   = isset( $parsed_url['host'] ) ? $parsed_url['host'] : '';
+
+					$link['href'] = esc_url_raw( $scheme . '://' . $host . $encoded_path );
+
+					// Append query and fragment if they exist.
+					$link['href'] .= isset( $parsed_url['query'] ) ? '?' . $parsed_url['query'] : '';
+					$link['href'] .= isset( $parsed_url['fragment'] ) ? '#' . $parsed_url['fragment'] : '';
+				} else {
+					$link['href'] = esc_url_raw( $link['href'] );
+				}


Instead of parsing the URL and then re-constructing it, what if you just check if the href has any non-ASCII chars and then encode the entire URL?

Suggested change

-				$parsed_url = wp_parse_url( $link['href'] );
-				if ( isset( $parsed_url['path'] ) ) {
-					$path_segments        = explode( '/', $parsed_url['path'] );
-					$last_segment         = array_pop( $path_segments );
-					$encoded_last_segment = rawurlencode( $last_segment );
-
-					$encoded_path = implode( '/', $path_segments ) . '/' . $encoded_last_segment;
-
-					$scheme = isset( $parsed_url['scheme'] ) ? $parsed_url['scheme'] : '';
-					$host   = isset( $parsed_url['host'] ) ? $parsed_url['host'] : '';
-
-					$link['href'] = esc_url_raw( $scheme . '://' . $host . $encoded_path );
-
-					// Append query and fragment if they exist.
-					$link['href'] .= isset( $parsed_url['query'] ) ? '?' . $parsed_url['query'] : '';
-					$link['href'] .= isset( $parsed_url['fragment'] ) ? '#' . $parsed_url['fragment'] : '';
-				} else {
-					$link['href'] = esc_url_raw( $link['href'] );
-				}
+				if ( 1 === preg_match( '/[^\x20-\x7E]/', $link['href'] ) ) {
+					$link['href'] = rawurlencode( urldecode( $link['href'] ) );
+				} else {
+					$link['href'] = esc_url_raw( $link['href'] );
+				}

The character class in that regular expression is negated, so it matches any character outside the range from a space (\x20) to a tilde (\x7E). This might not be the right range.
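
One caveat about encoding the entire URL: rawurlencode() also encodes URL delimiters such as `:` and `/`, so passing a whole URL through it turns `https://` into `https%3A%2F%2F`. A possible alternative, sketched below with a hypothetical function name, is to percent-encode only the bytes that fall outside the allowed range and leave everything else as-is. The range `\x21-\x7E` used here is an assumption (it additionally encodes spaces). Characters that are ASCII but still unsafe inside a Link header, such as `<` or `>`, are not handled by this sketch.

```php
<?php
/**
 * Hypothetical helper (illustration only): percent-encodes every byte that
 * is not a printable, non-space ASCII character, leaving delimiters such as
 * ":" and "/" and existing percent-escapes intact.
 */
function example_encode_unsafe_bytes( string $url ): string {
	$encoded = preg_replace_callback(
		'/[^\x21-\x7E]/', // Byte-wise match (no /u modifier): controls, space, and bytes >= 0x80.
		static function ( array $match ): string {
			return rawurlencode( $match[0] ); // A single byte becomes %XX.
		},
		$url
	);

	// preg_replace_callback() returns null on error; fall back to the input.
	return $encoded ?? $url;
}
```

Because existing `%XX` sequences consist only of ASCII characters, running this on an already encoded URL leaves it unchanged.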

westonruter · 2025-01-14T17:46:08Z

> The solution encodes only the filename part of URLs

What about when an internationalized domain name is used? Couldn't this also cause problems with encoding?
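
For the host part, internationalized domain names are conventionally converted to their ASCII (Punycode) form rather than percent-encoded. A rough sketch of that conversion follows. It assumes PHP's intl extension is available for `idn_to_ascii()` and falls back to returning the host unchanged when it is not. The function name `example_host_to_ascii()` is hypothetical.

```php
<?php
/**
 * Hypothetical helper (illustration only): converts a host name to its
 * ASCII form when it contains non-ASCII bytes.
 */
function example_host_to_ascii( string $host ): string {
	// Pure-ASCII hosts need no conversion.
	if ( 1 !== preg_match( '/[^\x00-\x7F]/', $host ) ) {
		return $host;
	}

	// Assumption: idn_to_ascii() comes from the optional intl extension.
	if ( function_exists( 'idn_to_ascii' ) ) {
		$ascii = idn_to_ascii( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
		if ( false !== $ascii ) {
			return $ascii;
		}
	}

	// Conversion unavailable or failed: return the host unchanged.
	return $host;
}
```

A caller would still need to decide what to do when the fallback is hit, for example omitting the Link header entry entirely.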

westonruter · 2025-01-21T04:32:15Z

Additionally, on multisite subdirectory installs, in theory the path before wp-content could also include non-ASCII chars:

https://testsite.com/חנות/wp-content/uploads/2025/01/example-scaled.avif
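
To cover a case like this, every path segment would need encoding and not only the filename. A minimal sketch follows, using a hypothetical helper name. Decoding each segment before re-encoding it is a design choice made here so that already encoded paths are not double-encoded.

```php
<?php
/**
 * Hypothetical helper (illustration only): percent-encodes every segment of
 * a URL path while keeping the "/" separators.
 */
function example_encode_path_segments( string $path ): string {
	$segments = explode( '/', $path );
	$encoded  = array_map(
		static function ( string $segment ): string {
			// Decode first so that "%D7%97" is not turned into "%25D7%2597".
			return rawurlencode( rawurldecode( $segment ) );
		},
		$segments
	);
	return implode( '/', $encoded );
}
```

Only the path component should be passed in. Feeding a full URL through this helper would also encode the `:` after the scheme.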

Optimize URL encoding logic in get_response_header

700ec77

AhmarZaidi mentioned this pull request Jan 14, 2025

Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards. #1775

Open

AhmarZaidi changed the title ~~Fix: Optimize URL encoding logic in get_response_header~~ Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards Jan 14, 2025

westonruter reviewed Jan 14, 2025


westonruter added this to the optimization-detective n.e.x.t milestone Jan 14, 2025

westonruter added [Type] Bug An existing feature is broken [Plugin] Optimization Detective Issues for the Optimization Detective plugin labels Jan 14, 2025

westonruter modified the milestones: optimization-detective 1.0.0-beta1, optimization-detective n.e.x.t Jan 23, 2025
