
Releases: autogram-is/spidergram

v0.10.0 — Ham

10 May 20:04

This release is dedicated to Peter Porker of Earth-8311, an innocent pig raised by animal scientist May Porker. After a freak accident with the world's first atomic powered hairdryer, Peter was bitten by the scientist and transformed into a crime-fighting superhero pig.

New Additions

  • Custom queries and multi-query reports can be defined in the Spidergram config files; Spidergram now ships with a handful of simple queries and an overview report as part of its core configuration.
  • Spidergram can run an Axe Accessibility Report on every page as it crawls a site; this behavior can be turned on and off via the spider.auditAccessibility config property.
  • Spidergram can now save cookies, performance data, and remote API requests made during page load using the config.spider.saveCookies, .savePerformance, and .saveXhr config properties.
  • Spidergram can identify and catalog design patterns during the post-crawl page analysis process; pattern definitions can also include rules for extracting pattern properties like a card's title and CTA link.
  • Resources with attached downloads can be processed using file parsing plugins; Spidergram 0.10.0 comes with support for PDF and .docx content and metadata, image EXIF metadata, and audio/video metadata in a variety of formats.
  • The config.spider.seed setting lets you set one or more URLs as the default starting points for crawling.
  • For large crawls, an experimental config.offloadBodyHtml setting has been added to Spidergram's global configuration. When it's set to 'db', all body HTML is stored in a dedicated key-value collection rather than the resources collection. On sites with many large pages (50k+ pages of 500k+ HTML or more), this can significantly improve the speed of filtering, querying, and reporting.
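As a sketch of how the settings above fit together in a JSON config file (the property paths follow the names in these notes; treat the overall file shape as illustrative, not exhaustive):

```json
{
  "spider": {
    "seed": ["https://example.com"],
    "saveCookies": true,
    "savePerformance": true,
    "saveXhr": true
  },
  "offloadBodyHtml": "db"
}
```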

Changes

  • Spidergram's CLI commands have been overhauled; vestigial commands from the 0.5.0 era have been removed and replaced. Of particular interest:
    • spidergram status summarizes the current config and DB state
    • spidergram init generates a fresh configuration file in the current directory
    • spidergram ping tests a remote URL using the current analysis settings
    • spidergram query displays and saves filtered snapshots of the saved crawl graph
    • spidergram report outputs a collection of query results as a combined workbook or JSON file
    • spidergram go crawls one or more URLs, analyzes the crawled files, and generates a report in a single step
    • spidergram url test tests a URL against the current normalizer and filter settings
    • spidergram url tree replaces the old urls command for building site hierarchies
  • CLI consistency is significantly improved. For example: analyze, query, report, and url tree all support the same --filter syntax for controlling which records are loaded from the database.

Fixes and under-the-hood improvements

  • URL matching and filtering have been smoothed out, and a host of tests have been added to ensure things stay solid. Previously, filter strings were treated as globs matched against the entire URL. Now, { property: 'hostname', glob: '*.foo.com' } objects can be used to explicitly specify glob or regex matches against individual URL components.
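A minimal sketch of what component-level filter matching looks like in practice. This is not Spidergram's implementation — the `UrlFilter` type and `matchesFilter` helper are illustrative, and the glob conversion only handles `*`:

```typescript
import { URL } from 'node:url';

// Illustrative filter shape, mirroring the { property, glob } objects above.
type UrlFilter = {
  property: 'hostname' | 'pathname' | 'protocol';
  glob?: string;
  regex?: string;
};

// Convert a simple glob (only '*' supported here) into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\*/g, '.*'); // '*' becomes 'match anything'
  return new RegExp(`^${escaped}$`);
}

// Test one URL component against a glob or regex filter.
function matchesFilter(url: string, filter: UrlFilter): boolean {
  const component = new URL(url)[filter.property];
  if (filter.glob) return globToRegExp(filter.glob).test(component);
  if (filter.regex) return new RegExp(filter.regex).test(component);
  return false;
}
```

Matching against a single component means `*.foo.com` can never accidentally match a path segment, which is the kind of surprise the old whole-URL globs produced.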

v0.9.0

09 Mar 04:11

Spidergram 0.9.0: Gwen

This release is dedicated to teen crime-fighter Gwen Stacy of Earth-65. She juggles high school, her band, and wisecracking web-slinging until her boyfriend Peter Parker becomes infatuated with Spider-Woman. Unable to reveal her secret identity, Spider-Woman is blamed for Peter's tragic lizard-themed death on prom night… and Gwen goes on the run.

Like Gwen Stacy's social calendar, this version of Spidergram has a lot going on. Hold onto your seats!

Major Changes

  • Vertice and Edge have been renamed to Entity and Relationship to avoid confusion with ArangoDB graph traversal and storage concepts. With the arrival of the Dataset and KeyValueStore classes (see below), we also needed the clarity when dealing with full-fledged Entities vs random datatypes.
  • Improved report/query helpers. The GraphWorker and VerticeQuery — both of which relied on raw snippets of AQL for filtering — have been replaced by a new query-builder system. A unified Query class can take a query definition in JSON format, or construct one piecemeal using fluent methods like filterBy() and sort(). A related EntityQuery class returns pre-instantiated Entity instances to eliminate boilerplate code, and a WorkerQuery class executes a worker function against each query result while emitting progress events for easy monitoring.
  • HtmlTools.getPageContent() and .getPageData() are both async now, allowing them to use some of the async parsing and extraction tools in our toolbox. If your extracted data and content suddenly appear empty, make sure you're awaiting the results of these two calls in your handlers and scripts.
  • The Project class has been replaced by the Spidergram class, as part of the configuration management overhaul mentioned below. In most code, changing const project = await Project.config(); to const spidergram = await Spidergram.load(); and const db = await project.graph(); to const db = spidergram.arango; should be sufficient.
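As an in-memory sketch of the fluent query pattern described above — the `filterBy()` and `sort()` names come from these notes, but this `MiniQuery` class is illustrative; the real Query class builds AQL and runs it against ArangoDB rather than filtering arrays:

```typescript
type Doc = Record<string, unknown>;

// Tiny fluent query-builder: accumulate filters and a sort key, then run.
class MiniQuery {
  private filters: Array<(d: Doc) => boolean> = [];
  private sortKey?: string;

  filterBy(property: string, value: unknown): this {
    this.filters.push(d => d[property] === value);
    return this; // returning `this` is what makes the chaining work
  }

  sort(property: string): this {
    this.sortKey = property;
    return this;
  }

  run(docs: Doc[]): Doc[] {
    let out = docs.filter(d => this.filters.every(f => f(d)));
    if (this.sortKey) {
      const k = this.sortKey;
      out = [...out].sort((a, b) => String(a[k]).localeCompare(String(b[k])));
    }
    return out;
  }
}

// Usage: chain conditions, then execute.
const pages = [
  { url: 'https://b.example', code: 200 },
  { url: 'https://a.example', code: 200 },
  { url: 'https://c.example', code: 404 },
];
const ok = new MiniQuery().filterBy('code', 200).sort('url').run(pages);
```

The JSON-definition form mentioned above would simply be a serialized version of the same filter/sort state.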

New Additions

  • Spidergram configuration can now live in .json, .js, or .ts files — and can control a much wider variety of internal behaviors. JS and TS configuration files can also pass in custom functions where appropriate, like the urlNormalizer and spider.requestHandlers settings. Specific environment variables, or .env files, can also be used to supply or override sensitive properties like API account credentials.
  • Ad-hoc data storage with the Dataset and KeyValueStore classes. Both offer static open methods that give quick access to default or named data stores -- creating new storage buckets if needed, or pulling up existing ones. Datasets offer pushItem(anyData) and getItems() methods, while KeyValueStores offer setItem(key, value) and getItem(key) methods. Behind the scenes, they create and manage dedicated ArangoDB collections that can be used in custom queries.
  • PDF and DocX parsing via FileTools.Pdf and FileTools.Document, based on the pdf-parse and mammoth projects. Those two formats are a first trial run for more generic parsing/handling of arbitrary formats; both can return filetype-specific metadata, and plaintext versions of file contents. For consistency, the Spreadsheet class has also been moved to FileTools.Spreadsheet.
  • Site technology detection via BrowserTools.Fingerprint. Fingerprinting is currently based on the Wappalyzer project and uses markup, script, and header patterns to identify the technologies and platforms used to build/host a page.
  • CLI improvements. The new spidergram report command can pull up filtered, aggregated, and formatted versions of Spidergram crawl data. It can output to tabular overviews on the command line, raw JSON files for use with data visualization tools, or ready-to-read Excel worksheets. The spidergram probe command allows the new Fingerprint tool to be run from the command line, as well.
  • Groundwork for cleaner CLI code. While it's not as obvious to end users, we're moving more and more code away from the Oclif-dependent SgCommand class and putting it into the shared SpiderCli helper class where it can be used in more contexts. In the next version, we'll be leveraging these improvements to make Spidergram's built-in CLI tools take better advantage of the new global configuration settings.
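The Dataset and KeyValueStore interfaces described above can be sketched in-memory like this. The method names (open, pushItem, getItems, setItem, getItem) follow the release notes, but these classes are illustrative — the real ones create and manage dedicated ArangoDB collections:

```typescript
// In-memory stand-in for Dataset: named buckets of pushed items.
class MiniDataset<T> {
  private static stores = new Map<string, MiniDataset<unknown>>();
  private items: T[] = [];

  // Static open() returns the default or a named store, creating it if needed.
  static open<T>(name = 'default'): MiniDataset<T> {
    if (!this.stores.has(name)) this.stores.set(name, new MiniDataset<T>());
    return this.stores.get(name) as MiniDataset<T>;
  }

  pushItem(item: T): void {
    this.items.push(item);
  }

  getItems(): T[] {
    return [...this.items];
  }
}

// In-memory stand-in for KeyValueStore: simple keyed access.
class MiniKeyValueStore<T> {
  private map = new Map<string, T>();

  setItem(key: string, value: T): void {
    this.map.set(key, value);
  }

  getItem(key: string): T | undefined {
    return this.map.get(key);
  }
}
```

Because each store maps to its own collection behind the scenes, anything pushed here is also queryable from custom AQL.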

Fixes and minor improvements

  • Internal errors (e.g., pre-request DNS problems or errors thrown during response processing) now save a wider range of error codes rather than mapping everything to -1. Any thrown errors are also saved in Resource.errors for later reference.
  • A subtle but long-standing issue meant the downloadHandler (and by extension sitemapHandler and robotsTxtHandler) choked on most downloads but "properly" persisted status records rather than erroring out. The improved error handling caught it, and downloads now work consistently.
  • A handful of request handlers were awaiting promises unnecessarily, clogging up the Spider's request queue. Crawls with multiple concurrent browser sessions will see some performance improvements.

v0.8.3

28 Jan 01:27

The inevitable forehead-slapping release to fix the urls CLI command; the --children flag was being ignored due to a typo, which made several of the display presets a bit useless. Ta-da!

v0.8.0

27 Jan 17:10

Spidergram 0.8.0: Miles

Brooklyn teenager Miles Morales successfully juggled school, friends, and family — until his uncle Aaron's break-in at an Osborne Labs facility brought a genetically engineered spider to Miles' doorstep. Once bitten, Miles developed a range of varyingly spider-related powers and a much, much busier day planner.

Thrust into the role of super-hero by the death of Peter Parker, Miles is forced to balance the safety of his loved ones against his responsibilities as a crime-fighter, yoinked from his home reality in a sweeping Multiversal disaster, and cloned a bunch of times because Marvel. Miles is the protagonist of — and this is a fact, not opinion — the best Spider-Man movie ever produced.

What's Changed

Lots of quality-of-life improvements, including bug fixes for report generation and finessing of the return/input types that had made a number of helper functions difficult to use in conjunction with each other.

Streamlined structured data parsing

The HtmlTools collection of helpers now includes a one-shot getPageData() helper function that attempts to parse out all the standard HTML stuff: HEAD subtags like <base> and <title>, meta tags, <script> and <style> tags, JSON and LDJSON data present in any script tags, etc. Options to toggle various chunks of that data on and off can be passed into the function to avoid spamming yourself.
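As a rough illustration of the kind of data getPageData() pulls out of a page's HEAD, here is a hypothetical, regex-based extractor. The real helper is async, handles far more (JSON/LD+JSON in script tags, toggle options, etc.), and the `extractHeadData` name and return shape here are assumptions, not Spidergram's API:

```typescript
// Hypothetical sketch: pull <title> and name/content meta tags from raw HTML.
function extractHeadData(html: string): {
  title?: string;
  meta: Record<string, string>;
} {
  const meta: Record<string, string> = {};
  // Grab the first <title> element's text content.
  const title = /<title>([^<]*)<\/title>/i.exec(html)?.[1];
  // Collect <meta name="..." content="..."> pairs.
  for (const m of html.matchAll(/<meta\s+name="([^"]+)"\s+content="([^"]*)"/gi)) {
    meta[m[1]] = m[2];
  }
  return { title, meta };
}
```

A real implementation would use a proper HTML parser rather than regexes, which is part of why the library version is async.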

Streamlined content extraction

Similarly, HtmlTools.getPageContent() can now act as a quick wrapper for standard extraction of on-page content. Pass in a list of CSS selectors to help it find a page's "primary content" and it will spit out a scrubbed plaintext version, then calculate a readability score.

In the next release we'll be adding some general-purpose "find element X on the page, and if it's there, add its text to the content results" helper functions to HtmlTools; when that happens, it will be possible to include those instructions in the getPageContent() options, allowing custom analyzer code to use that function for most garden variety extraction.

Improved pattern/component extraction

Finding and saving sub-page patterns to their own pool of data for querying is a bit simpler, and also uses the same underlying code as getPageData() for extracting element attributes and content. The HtmlTools.findPattern() function accepts an array of pattern descriptions, each of which can include CSS selectors, instructions on what data to pull from the markup that's found, and (optionally) a callback to tweak the data before it's returned. The resulting pattern instances can be saved straight to the fragments collection, with links back to the pages they occurred on, for easy analysis.

URL hierarchy parsing and reporting

A new HierarchyTools utility pack has been added, with base classes to simplify hierarchy parsing, manipulation, and reporting code. The first example is UrlHierarchyBuilder, which accepts a giant array of strings (or any array of objects that have 'url' properties). It will use the URL paths to construct a complete tree, with configurable options for filling gaps in the tree, dealing with multiple subdomains under a single TLD, and so on.

The resulting hierarchy has a host of helper functions and convenience properties for pulling out top-level root URLs, orphans disconnected from the rest of the tree, leaf and branch nodes, etc. Every item in the hierarchy also has a render() function that can output a nicely-formatted tree view of the item and its children; a number of rendering presets are included, from 'only show me directories, but summarize how many leaf URLs are in each directory' to 'show me everything, and format it as markdown'.
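The core idea — turning a flat list of URLs into a path tree — can be sketched as follows. This is a minimal illustration, not UrlHierarchyBuilder itself, which additionally handles gap-filling, multi-subdomain grouping, and the rendering presets described above:

```typescript
import { URL } from 'node:url';

interface TreeNode {
  name: string;
  children: Map<string, TreeNode>;
  isLeaf: boolean; // true if a crawled URL ends at this node
}

// Build a tree keyed by hostname, then path segments.
function buildUrlTree(urls: string[]): TreeNode {
  const root: TreeNode = { name: '', children: new Map(), isLeaf: false };
  for (const u of urls) {
    const { hostname, pathname } = new URL(u);
    const segments = [hostname, ...pathname.split('/').filter(Boolean)];
    let node = root;
    for (const seg of segments) {
      if (!node.children.has(seg)) {
        node.children.set(seg, { name: seg, children: new Map(), isLeaf: false });
      }
      node = node.children.get(seg)!;
    }
    node.isLeaf = true;
  }
  return root;
}

// Example of the kind of summary helper the hierarchy exposes.
function countLeaves(node: TreeNode): number {
  let n = node.isLeaf ? 1 : 0;
  for (const child of node.children.values()) n += countLeaves(child);
  return n;
}
```

From a structure like this, "directories only, with leaf counts" and "full markdown tree" renderings are both straightforward traversals.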

Finally, the spidergram urls CLI command has been updated to use the new hierarchy tools. It can quickly spit out a summary view of the URLs that have been crawled (or discovered); it can pull in a URL tree from a text, csv, or sitemap.xml file for formatting.

Sample output from the 'urls' command

Full Changelog: v0.7.1...v0.8.0

0.7.1

09 Jan 17:30

A handful of minor fixes and a license update. Spidergram is released under the GPL, but previous NPM releases had a dangling MIT license in package.json. If you keep your old MIT copy of Spidergram, it may eventually be quite valuable to collectors on the secondary market.

v0.7.0

05 Jan 06:32
80d727f

Spidergram 0.7.0: Cindy

While attending a public exhibition on safe handling of nuclear waste, teenager Cindy Moon was bitten by a particle-accelerator irradiated spider. After manifesting the usual laundry list of spider powers, Cindy was locked in an underground bunker to protect her from trans-dimensional vampire spider hunters. Peter Parker discovered the bunker thirteen years later, opened it, and was immediately attacked by Cindy. She subsequently created a cool costume, started fighting crime as the superhero Silk, and secured a job as social media manager for the Daily Bugle. Cindy is one of the few superheroes with a full-time digital content gig.

What's New in Spidergram 0.7.0

  • Design pattern/component extraction
  • Companion create-spidergram project with example project templates
  • More helpers for common analysis tasks
    • Schema.org page metadata
    • Reusable query-builders and data visualizations
    • Google Analytics queries
    • Sitemap and Robots.txt parsing
  • Internal improvements (linting and formatting rules, limited tests, fewer dependencies)
  • Still a mind-boggling lack of documentation

Full Changelog: 0.6.0...0.7.0

0.6.0 (Peter)

02 Dec 21:12
a010500
Pre-release

Initial semi-public version of Spidergram. Please do not fold, spindle, or mutilate.