
v1.0.5

@janreges released this 03 Dec 23:20
· 99 commits to main since this release

This is the first version already used by the Electron-based desktop application: https://github.com/janreges/siteone-crawler-gui

Changes

  • option: replace placeholders like '%domain' also in the validateValue() method, because it also checks whether the path is writable by attempting mkdir 329143f
  • swoole in cygwin: improved getBaseDir() to work better even with the version of Swoole that does not have SCRIPT_DIR 94cc5af
  • html processor: it must also process pages with a redirect, because the URL in the meta redirect tag also needs to be replaced 9ce0eee
  • sitemap: use formatted output path (primarily for better output in the Cygwin environment, where the C:/foo <-> /cygdrive/c/foo conversion is needed) 6297a7f
  • file exporter: use formatted output path (primarily for better output in the Cygwin environment, where the C:/foo <-> /cygdrive/c/foo conversion is needed) 426cfb2
  • options: in the case of dir/file validation, we want to work with absolute paths for more precise error messages 6df228b
  • crawler.php: improved baseDir detection - we want to work with absolute paths in all scenarios 9d1b2ce
  • utils: improved getAbsolutePath() for Cygwin and added getOutputFormattedPath() with reverse logic for Cygwin (C:/foo/bar <-> /cygdrive/c/foo/bar; see the sketch after this list) 161cfc5
  • offline export: renamed --offline-export-directory to --offline-export-dir for consistency with --http-cache-dir or --result-storage-dir 26ef45d
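
For illustration, a minimal sketch of the C:/foo/bar <-> /cygdrive/c/foo/bar conversion mentioned in the utils item above. The function names are hypothetical and this is not the crawler's actual getAbsolutePath()/getOutputFormattedPath() code; it only assumes the standard /cygdrive/<letter>/ mapping used by Cygwin:

```php
<?php
declare(strict_types=1);

// Hypothetical helpers (not the crawler's API) illustrating the conversion.

// /cygdrive/c/foo/bar -> C:/foo/bar
function cygwinPathToWindowsPath(string $path): string
{
    if (preg_match('~^/cygdrive/([a-z])(/.*)?$~i', $path, $m)) {
        return strtoupper($m[1]) . ':' . ($m[2] ?? '/');
    }
    return $path; // not a Cygwin-style path, keep unchanged
}

// C:/foo/bar or C:\foo\bar -> /cygdrive/c/foo/bar
function windowsPathToCygwinPath(string $path): string
{
    if (preg_match('~^([a-z]):[\\\\/](.*)$~i', $path, $m)) {
        return '/cygdrive/' . strtolower($m[1]) . '/' . str_replace('\\', '/', $m[2]);
    }
    return $path; // not a Windows-style path, keep unchanged
}

echo cygwinPathToWindowsPath('/cygdrive/c/foo/bar') . "\n"; // C:/foo/bar
echo windowsPathToCygwinPath('C:\\foo\\bar') . "\n";        // /cygdrive/c/foo/bar
```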

Changes in 1.0.4 (skipped release)

  • dom parsing: handle warnings when some DOM elements cannot be parsed correctly, fixes #3
  • version: 1.0.4.20231201 + changelog 8e15781
  • options: ignore empty values for directives that can be defined repeatedly 5e30c2f
  • http-cache: now the http cache is turned off using the 'off' value (it's more understandable) 9508409
  • core options: added --console-width to enforce a specific console width and disable automatic detection via 'tput cols' on macOS/Linux or 'mode con' on Windows (used by the Electron GUI) 8cf44b0
  • gui support: added base-dir detection for Windows where the GUI crawler runs in Cygwin 5ce893a
  • renaming: renamed 'siteone-website-crawler' to 'siteone-crawler' and 'SiteOne Website Crawler' to 'SiteOne Crawler' 64ddde4
  • utils: fixed color-support detection 62dbac0
  • core options: added --force-color option to bypass tty detection (used by the Electron GUI) 607b4ad
  • best practice analysis: when checking an image (e.g. for the existence of a WebP/AVIF variant), external images are also checked, because websites very often link images from external domains or from image modification/optimization services 6100187
  • html report: set scaleDown as default object-fit for image gallery 91cd300
  • offline exporter: added short -oed as alias to --offline-export-directory 22368d9
  • image gallery: a list of all images on the website (except those from srcset, which would only be duplicates in other sizes or formats), including SVG, with rich filtering options (by image format, size and source tag/attribute) and the option of choosing small/medium/view and scale-down/contain/cover for the object-fit CSS property 43de0af
  • core options: added a shortened form of each option name consisting of a single hyphen and the first letters of the words of the full option (e.g. --memory-limit has the short form -ml; see the sketch after this list), added getInitialScheme() eb9a3cc
  • visited url: added 'sourceAttr' with information about where the given URL was found and useful helper methods 6de4e39
  • found urls: when one URL occurs in several places/attributes, the first occurrence is considered the main one (typically the same URL in src and then also in srcset) 660bb2b
  • url parsing: added better recognition of which attribute a given URL was parsed from (we need to distinguish src and srcset for the ImageGallery in particular) 802c3c6
  • supertable and urls: when removing the redundant hostname for more compact URL output, the scheme (http:// or https://) of the initial URL is also taken into account (otherwise some URLs looked like duplicates) + prevent ANSI color definitions for bash from appearing in the HTML output 915469e
  • title/description/keywords parsing: added HTML entity decoding, because some websites use encoded entities for characters such as í, –, etc. 920523d
  • crawler: added 'sourceAttr' to the swoole table queue and to already visited URLs (used in the Image Gallery for filtering, so that duplicate images from srcsets that differ only in resolution or format are not displayed unnecessarily) 0345abc
  • url parameter: the scheme can now be omitted and https:// or http:// will be added automatically (http:// e.g. for localhost) 85e14e9
  • disabled images: when image removal is requested, replace their body with a 1x1px transparent GIF and place a semi-transparent hatch with the crawler logo as a background c1418c3
  • url regex filtering: added an option that allows you to limit the list of crawled pages to the declared regexps, while still crawling and downloading assets (js, css, images, fonts, documents, etc.) from any URL (with respect to allowed domains) 21e67e5
  • img srcset parsing: a valid URL can itself contain a comma (various dynamic parametric image generators use them), and in srcset a comma followed by whitespace should separate multiple values, so the srcset parsing now reflects this (see the sketch after this list) 0db578b
  • websocket server: added the --websocket-server option, which starts a parallel process with a websocket server through which the crawler sends various information about crawling progress (this will also be used by Electron UI applications) 649132f
  • http client: handle scenario when content loaded from cache is not valid (is_bool) 1ddd099
  • HTML report: updated logo with final look 2a3bb42
  • mailer: shortening and simplifying email content e797107
  • robots.txt: added info about loaded robots.txt to the summary (limited to 10 domains in the case of huge multi-domain crawling) 00f9365
  • redirects analyzer: handled edge case with empty url e9be1e3
  • text output: added fancy banner with crawler logo (thanks to great SiteOne designers!) and smooth effect e011c35
  • content processors: added applyContentChangesBeforeUrlParsing() and better NextJS chunks handling e5c404f
  • url searches: added ignoring data:, mailto:, tel:, file:// and other non-requestable resources also to FoundUrls 5349be2
  • crawler: added declare(strict_types=1) and banner 27134d2
  • heading structure analysis: highlighting and calculating errors for duplicate <h1> + added help cursor with a hint f5c7db6
  • core options: added --help and --version, colorized help 6f1ada1
  • ./crawler binary - send output of cd - to /dev/null and hide unwanted printed script path 16fe79d
  • README: updated paths in the documentation - it is now possible to use the ERROR: Option --url () must be valid URL 86abd99
  • options: --workers default for Cygwin runtime is now 1 (instead of 3), because Cygwin runtime is highly unstable when workers > 1 f484960
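
For illustration, a minimal sketch of how the one-hyphen short forms described in the core options item above (e.g. --memory-limit -> -ml, --offline-export-dir -> -oed) can be derived from the first letters of the long option's words; the function name is hypothetical, not the crawler's actual code:

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not the crawler's code): derive the one-hyphen short
// alias from the first letters of the long option's words.
function shortOptionName(string $longOption): string
{
    $words = explode('-', ltrim($longOption, '-'));
    $initials = array_map(fn(string $word): string => substr($word, 0, 1), $words);
    return '-' . implode('', $initials);
}

echo shortOptionName('--memory-limit') . "\n";       // -ml
echo shortOptionName('--offline-export-dir') . "\n"; // -oed
```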
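
Similarly, a minimal sketch of the srcset parsing behaviour described above: candidates are split on a comma followed by whitespace (so commas inside URLs survive), and the URL is everything before the first whitespace of each candidate. The function name is hypothetical, not the crawler's actual implementation:

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not the crawler's implementation): extract URLs from a
// srcset value, tolerating commas inside the URLs themselves.
function parseSrcsetUrls(string $srcset): array
{
    $urls = [];
    // Split candidates on "comma + whitespace" so commas inside URLs survive.
    foreach (preg_split('~,\s+~', trim($srcset)) as $candidate) {
        $candidate = trim($candidate);
        if ($candidate === '') {
            continue;
        }
        // The URL is everything before the first whitespace; the remainder
        // (e.g. "640w" or "2x") is the width/density descriptor.
        $parts = preg_split('~\s+~', $candidate, 2);
        $urls[] = $parts[0];
    }
    return $urls;
}

print_r(parseSrcsetUrls('/img.php?w=640,q=80 640w, /img.php?w=1280,q=80 2x'));
// Array ( [0] => /img.php?w=640,q=80 [1] => /img.php?w=1280,q=80 )
```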