
v1.0.5

@janreges released this 03 Dec 23:20
· 99 commits to main since this release

This is the first version already used by the Electron-based desktop application: https://github.com/janreges/siteone-crawler-gui

Changes

  • option: replace placeholders like '%domain' also in the validateValue() method, because it also checks whether the path is writable by attempting mkdir 329143f
  • swoole in cygwin: improved getBaseDir() to work better even with the version of Swoole that does not have SCRIPT_DIR 94cc5af
  • html processor: it must also process pages with a redirect, because the URL in the meta redirect tag also needs to be replaced 9ce0eee
  • sitemap: use formatted output path (primarily for better output in the Cygwin environment, where the C:/foo <-> /cygdrive/c/foo conversion is needed) 6297a7f
  • file exporter: use formatted output path (primarily for better output in the Cygwin environment, where the C:/foo <-> /cygdrive/c/foo conversion is needed) 426cfb2
  • options: in the case of dir/file validation, we want to work with absolute paths for more precise error messages 6df228b
  • crawler.php: improved baseDir detection - we want to work with absolute paths in all scenarios 9d1b2ce
  • utils: improved getAbsolutePath() for Cygwin and added getOutputFormattedPath() with reverse logic for Cygwin (C:/foo/bar <-> /cygdrive/c/foo/bar; see the sketch after this list) 161cfc5
  • offline export: renamed --offline-export-directory to --offline-export-dir for consistency with --http-cache-dir or --result-storage-dir 26ef45d
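
For illustration, a minimal sketch of the C:/foo/bar <-> /cygdrive/c/foo/bar conversion mentioned in the utils item above. The function names are hypothetical and this is not the crawler's actual getAbsolutePath()/getOutputFormattedPath() code; it only assumes the standard /cygdrive/<letter>/ mapping used by Cygwin:

```php
<?php
declare(strict_types=1);

// Hypothetical helpers (not the crawler's API) illustrating the conversion.

// /cygdrive/c/foo/bar -> C:/foo/bar
function cygwinPathToWindowsPath(string $path): string
{
    if (preg_match('~^/cygdrive/([a-z])(/.*)?$~i', $path, $m)) {
        return strtoupper($m[1]) . ':' . ($m[2] ?? '/');
    }
    return $path; // not a Cygwin-style path, keep unchanged
}

// C:/foo/bar or C:\foo\bar -> /cygdrive/c/foo/bar
function windowsPathToCygwinPath(string $path): string
{
    if (preg_match('~^([a-z]):[\\\\/](.*)$~i', $path, $m)) {
        return '/cygdrive/' . strtolower($m[1]) . '/' . str_replace('\\', '/', $m[2]);
    }
    return $path; // not a Windows-style path, keep unchanged
}

echo cygwinPathToWindowsPath('/cygdrive/c/foo/bar') . "\n"; // C:/foo/bar
echo windowsPathToCygwinPath('C:\\foo\\bar') . "\n";        // /cygdrive/c/foo/bar
```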

Changes in 1.0.4 (skipped release)

  • dom parsing: handle warnings when some DOM elements cannot be parsed correctly, fixes #3
  • version: 1.0.4.20231201 + changelog 8e15781
  • options: ignore empty values for directives that can be defined repeatedly 5e30c2f
  • http-cache: now the http cache is turned off using the 'off' value (it's more understandable) 9508409
  • core options: added --console-width to enforce a specific console width and disable automatic detection via 'tput cols' on macOS/Linux or 'mode con' on Windows (used by the Electron GUI) 8cf44b0
  • gui support: added base-dir detection for Windows where the GUI crawler runs in Cygwin 5ce893a
  • renaming: renamed 'siteone-website-crawler' to 'siteone-crawler' and 'SiteOne Website Crawler' to 'SiteOne Crawler' 64ddde4
  • utils: fixed color-support detection 62dbac0
  • core options: added --force-color option to bypass tty detection (used by the Electron GUI) 607b4ad
  • best practice analysis: when checking an image (e.g. for the existence of a WebP/AVIF variant), external images are also checked, because websites very often link images from external domains or from image modification/optimization services 6100187
  • html report: set scaleDown as default object-fit for image gallery 91cd300
  • offline exporter: added short -oed as alias to --offline-export-directory 22368d9
  • image gallery: a list of all images on the website (except those from srcset, which would only be duplicates in other sizes or formats), including SVG, with rich filtering options (by image format, size and source tag/attribute) and the option of choosing small/medium/view and scale-down/contain/cover for the object-fit CSS property 43de0af
  • core options: added a shortened form of each option name consisting of a single hyphen and the first letters of the words of the full option (e.g. --memory-limit has the short form -ml; see the sketch after this list), added getInitialScheme() eb9a3cc
  • visited url: added 'sourceAttr' with information about where the given URL was found and useful helper methods 6de4e39
  • found urls: when one URL occurs in several places/attributes, the first occurrence is considered the main one (typically the same URL in src and then also in srcset) 660bb2b
  • url parsing: added better recognition of which attribute a given URL was parsed from (we need to distinguish src and srcset for the ImageGallery in particular) 802c3c6
  • supertable and urls: when removing the redundant hostname for more compact URL output, the scheme (http:// or https://) of the initial URL is also taken into account (otherwise some URLs looked like duplicates) + prevent ANSI color definitions for bash from appearing in the HTML output 915469e
  • title/description/keywords parsing: added HTML entity decoding, because some websites use encoded entities for characters such as í, –, etc. 920523d
  • crawler: added 'sourceAttr' to the swoole table queue and to already visited URLs (used in the Image Gallery for filtering, so that duplicate images from srcsets that differ only in resolution or format are not displayed unnecessarily) 0345abc
  • url parameter: the scheme can now be omitted and https:// or http:// will be added automatically (http:// e.g. for localhost) 85e14e9
  • disabled images: when image removal is requested, replace their body with a 1x1px transparent GIF and place a semi-transparent hatch with the crawler logo as a background c1418c3
  • url regex filtering: added an option that allows you to limit the list of crawled pages to the declared regexps, while still crawling and downloading assets (js, css, images, fonts, documents, etc.) from any URL (with respect to allowed domains) 21e67e5
  • img srcset parsing: a valid URL can itself contain a comma (various dynamic parametric image generators use them), and in srcset a comma followed by whitespace should separate multiple values, so the srcset parsing now reflects this (see the sketch after this list) 0db578b
  • websocket server: added the --websocket-server option, which starts a parallel process with a websocket server through which the crawler sends various information about crawling progress (this will also be used by Electron UI applications) 649132f
  • http client: handle scenario when content loaded from cache is not valid (is_bool) 1ddd099
  • HTML report: updated logo with final look 2a3bb42
  • mailer: shortening and simplifying email content e797107
  • robots.txt: added info about loaded robots.txt to the summary (limited to 10 domains in the case of huge multi-domain crawling) 00f9365
  • redirects analyzer: handled edge case with empty url e9be1e3
  • text output: added fancy banner with crawler logo (thanks to great SiteOne designers!) and smooth effect e011c35
  • content processors: added applyContentChangesBeforeUrlParsing() and better NextJS chunks handling e5c404f
  • url searches: added ignoring data:, mailto:, tel:, file:// and other non-requestable resources also to FoundUrls 5349be2
  • crawler: added declare(strict_types=1) and banner 27134d2
  • heading structure analysis: highlighting and calculating errors for duplicate <h1> + added help cursor with a hint f5c7db6
  • core options: added --help and --version, colorized help 6f1ada1
  • ./crawler binary - send output of cd - to /dev/null and hide unwanted printed script path 16fe79d
  • README: updated paths in the documentation - it is now possible to use the ERROR: Option --url () must be valid URL 86abd99
  • options: --workers default for Cygwin runtime is now 1 (instead of 3), because Cygwin runtime is highly unstable when workers > 1 f484960
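
For illustration, a minimal sketch of how the one-hyphen short forms described in the core options item above (e.g. --memory-limit -> -ml, --offline-export-dir -> -oed) can be derived from the first letters of the long option's words; the function name is hypothetical, not the crawler's actual code:

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not the crawler's code): derive the one-hyphen short
// alias from the first letters of the long option's words.
function shortOptionName(string $longOption): string
{
    $words = explode('-', ltrim($longOption, '-'));
    $initials = array_map(fn(string $word): string => substr($word, 0, 1), $words);
    return '-' . implode('', $initials);
}

echo shortOptionName('--memory-limit') . "\n";       // -ml
echo shortOptionName('--offline-export-dir') . "\n"; // -oed
```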
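
Similarly, a minimal sketch of the srcset parsing behaviour described above: candidates are split on a comma followed by whitespace (so commas inside URLs survive), and the URL is everything before the first whitespace of each candidate. The function name is hypothetical, not the crawler's actual implementation:

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not the crawler's implementation): extract URLs from a
// srcset value, tolerating commas inside the URLs themselves.
function parseSrcsetUrls(string $srcset): array
{
    $urls = [];
    // Split candidates on "comma + whitespace" so commas inside URLs survive.
    foreach (preg_split('~,\s+~', trim($srcset)) as $candidate) {
        $candidate = trim($candidate);
        if ($candidate === '') {
            continue;
        }
        // The URL is everything before the first whitespace; the remainder
        // (e.g. "640w" or "2x") is the width/density descriptor.
        $parts = preg_split('~\s+~', $candidate, 2);
        $urls[] = $parts[0];
    }
    return $urls;
}

print_r(parseSrcsetUrls('/img.php?w=640,q=80 640w, /img.php?w=1280,q=80 2x'));
// Array ( [0] => /img.php?w=640,q=80 [1] => /img.php?w=1280,q=80 )
```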