Incorrect title detection on some pages (SVG title instead of HTML title) #1934

Open
francoishumanoid opened this issue Jan 24, 2023 · 1 comment
Labels
bug it's broken!

Comments

@francoishumanoid

When I share a URL on Shaarli, it uses the <title> inside an SVG element instead of the HTML page title.

https://www.numerama.com/sciences/1247146-aller-sur-mars-en-45-jours-cest-la-promesse-de-ce-propulseur-nucleaire.html

@nodiscc
Member

nodiscc commented Jan 24, 2023

On my instance (Shaarli 0.12.1 installed from release zip),

  • Shaaring this page from the bookmarklet or Firefox extension: the title is correctly detected Aller sur Mars en 45 jours, c'est la promesse de ce propulseur nucléaire - Numerama (copied from the tab title)
  • Shaaring this page from the + Shaare button -> paste URL -> Add link -> the title is populated as Numerama.

I think the title extraction regex

    function html_extract_title($html)
    {
        if (preg_match('!<title.*?>(.*?)</title>!is', $html, $matches)) {
            return trim(str_replace("\n", '', $matches[1]));
        }
        return false;
    }

indeed matches something it shouldn't (the SVG <title>):

$ curl -s https://www.numerama.com/sciences/1247146-aller-sur-mars-en-45-jours-cest-la-promesse-de-ce-propulseur-nucleaire.html|grep '<title>'
	<title>Aller sur Mars en 45 jours, c'est la promesse de ce propulseur nucléaire - Numerama</title>
        <symbol id="logo-full" viewbox="0 0 509 49" xmlns="http://www.w3.org/2000/svg"><title>Numerama</title><path d="M240.55 48.462c6.658 0 13.499-1.651...
        <title>Numerama, le média de référence sur la société numérique et l'innovation technologique</title>

Since preg_match() stops at the first match, return trim(str_replace("\n", '', $matches[1])); should return the first <title> on the page, but in this case it returns the second one (the SVG <title>).
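
For discussion, here is a minimal sketch of one way the extraction could be made to ignore SVG titles. It is not Shaarli's actual fix, and it does not explain why the second match is returned here. The function name html_extract_title_skip_svg is hypothetical. It assumes the unwanted <title> always sits inside an <svg> or <symbol> block, and it still relies on regexes rather than a real HTML parser.

```php
<?php
// Hypothetical variant of html_extract_title(); a sketch, not the project's code.
function html_extract_title_skip_svg(string $html)
{
    // Drop <svg>...</svg> and <symbol>...</symbol> blocks so that any
    // <title> nested inside them can no longer be matched.
    $stripped = preg_replace('!<(svg|symbol)\b.*?</\1\s*>!is', '', $html);
    if ($stripped === null) {
        // preg_replace() returns null on error; fall back to the raw input.
        $stripped = $html;
    }

    // Require whitespace or '>' right after "<title" so that other tags
    // whose name merely starts with "title" are not matched.
    if (preg_match('!<title(?:\s[^>]*)?>(.*?)</title\s*>!is', $stripped, $matches)) {
        return trim(str_replace("\n", '', $matches[1]));
    }
    return false;
}
```

With this variant, an input where an inline SVG comes before the page title, such as '<svg><title>Logo</title></svg><title>Page</title>', yields Page, whereas the current regex would yield Logo. Parsing the document with a real HTML parser and reading the <title> under <head> would likely be more robust, at the cost of more code.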

@nodiscc nodiscc added the bug it's broken! label Jan 24, 2023
@nodiscc nodiscc changed the title Shaarli is using svg title instead of html title Incorrect title detection on some pages (SVG title instead of HTML title) Jan 24, 2023
@nodiscc nodiscc added this to the backlog to the future milestone Feb 2, 2023