OLD IDEA / no longer active: New modular plugin design + 3 new plugins: browsertrix-crawler, ReplayWeb.page, gallery-dl #1327

Closed
wants to merge 62 commits into from
62 commits
16adff4
add ncat to docker container to use for ipc tunnel
pirate Jan 17, 2024
1d4ec6f
wip new plugins system with browsertrix
pirate Jan 17, 2024
91c3458
wip attempt to tweak uwsgi to try to serve archive media files with b…
pirate Jan 17, 2024
216e0b7
add ipc listener server script
pirate Jan 17, 2024
820c152
add note about issue 1191 and shell requoting or special chars
pirate Jan 19, 2024
77075b3
add ncat to docker container to use for ipc tunnel
pirate Jan 17, 2024
5cf94a3
wip new plugins system with browsertrix
pirate Jan 17, 2024
64f6816
wip attempt to tweak uwsgi to try to serve archive media files with b…
pirate Jan 17, 2024
a61c9cf
add ipc listener server script
pirate Jan 17, 2024
aeaefe8
persist snapshot index header collapse state
pirate Jan 19, 2024
624fd2d
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
861d44d
fix sorting by Size or by Files to sort by number of archive results
pirate Jan 19, 2024
17fdf13
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
99bb02c
add fallback to check wget output dir with port stripped
pirate Jan 19, 2024
19e9c1c
include more output file locations when considering whether snapshot.…
pirate Jan 19, 2024
a6b241f
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
3ebac0f
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
475217a
minor formatting cleanup
pirate Jan 19, 2024
0f5cc78
add django solo models for plugin config
pirate Jan 19, 2024
3421833
replaywebpage js files
pirate Jan 19, 2024
b652024
add gallerydl plugin
pirate Jan 19, 2024
ea2c5a2
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
3234a36
wip refactoring
pirate Jan 19, 2024
4c8db99
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
7aa0dd1
Merge branch 'dev' into plugins-browsertrix
pirate Jan 20, 2024
d2ec6d7
Merge branch 'dev' into plugins-browsertrix
pirate Jan 23, 2024
f1b5df3
Merge branch 'dev' into plugins-browsertrix
pirate Jan 23, 2024
2ba72f1
Merge branch 'dev' into plugins-browsertrix
pirate Jan 23, 2024
0c878eb
Merge branch 'dev' into plugins-browsertrix
pirate Jan 24, 2024
d0e3c95
add defaults and system plugins
pirate Jan 24, 2024
8e41aec
fix plugin loading and admin config display
pirate Jan 24, 2024
4aee3b5
Merge branch 'dev' into plugins-browsertrix
pirate Jan 24, 2024
54ae6a0
Merge branch 'dev' into plugins-browsertrix
pirate Jan 26, 2024
d96d986
Merge branch 'dev' into plugins-browsertrix
pirate Jan 28, 2024
beb83f2
mypy fixes
pirate Jan 25, 2024
eaa4a9c
fix django core auth
pirate Jan 25, 2024
9861a4f
more mypy fixes
pirate Jan 25, 2024
17213ea
chown /var/spool/cron/crontabs in docker entrypoint
pirate Jan 28, 2024
308b493
add pudb and type hints
pirate Jan 28, 2024
ef667a4
Merge branch 'dev' into plugins-browsertrix
pirate Jan 28, 2024
63b2c9e
Merge branch 'dev' into plugins-browsertrix
pirate Jan 30, 2024
c4888ac
Merge branch 'dev' into plugins-browsertrix
pirate Jan 30, 2024
c6faa9a
add extra information to headers extractor output
pirate Feb 7, 2024
342ecd2
update gitignore to ignore all data dirs
pirate Feb 7, 2024
6faf7aa
ignore vscode dir
pirate Feb 7, 2024
b56bfe5
add CUSTOM_TEMPLATES=/data/templates default config in Docker
pirate Feb 8, 2024
97b1859
add TODO to support archive.org-style urls
pirate Feb 8, 2024
777694e
add type hints to plugin config models
pirate Feb 8, 2024
95e866c
fix dockerfile syntax trailing slash
pirate Feb 8, 2024
fbb3c84
Merge branch 'dev' into plugins-browsertrix
pirate Feb 13, 2024
edabc47
Merge branch 'dev' into plugins-browsertrix
pirate Feb 13, 2024
93ed633
Merge branch 'dev' into plugins-browsertrix
pirate Feb 18, 2024
2b99dcd
Update setup.sh
pirate Feb 18, 2024
50d52ea
fix requirements.txt so docks build doesnt crash on missing ldap c he…
pirate Feb 19, 2024
ddc639e
bump required python version to 3.10 to match brew and apt
pirate Feb 19, 2024
15d1865
Merge branch 'dev' into plugins-browsertrix
pirate Feb 21, 2024
1ea7ac1
Merge branch 'dev' into plugins-browsertrix
pirate Feb 22, 2024
11b067a
Merge branch 'dev' into plugins-browsertrix
pirate Mar 15, 2024
c22df0b
Merge branch 'dev' into plugins-browsertrix
pirate Mar 18, 2024
b5311d2
Merge branch 'dev' into plugins-browsertrix
pirate Mar 26, 2024
33e8273
Merge branch 'dev' into plugins-browsertrix
pirate Apr 24, 2024
e594065
Merge branch 'dev' into plugins-browsertrix
pirate Apr 24, 2024
4 changes: 1 addition & 3 deletions .gitignore
@@ -26,11 +26,9 @@ dist/

 # Data folders
 data/
-data1/
-data2/
-data3/
+data*/
 output/

 # vim
 *.sw?
 .vscode/
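
The `data*/` glob replaces the three numbered entries and is broader than the list it replaces. A rough sketch of which directory names it covers, assuming Python's `fnmatch` as a stand-in for gitignore matching (the two agree for simple single-segment names like these, but gitignore has extra rules this does not model). The candidate names are hypothetical:

```python
from fnmatch import fnmatchcase

# Entries covered before this change vs. the single glob that replaces the numbered ones
OLD_NAMES = ["data", "data1", "data2", "data3"]
NEW_PATTERN = "data*"  # the trailing "/" in .gitignore only restricts the match to directories


def is_ignored(dirname: str) -> bool:
    """Approximate the gitignore check for a single directory name."""
    return fnmatchcase(dirname, NEW_PATTERN)


# Hypothetical directory names, for illustration only
candidates = ["data", "data1", "data3", "data_backup", "database", "output", "mydata"]
for name in candidates:
    print(f"{name:12} ignored={is_ignored(name)}")
```

Anything whose name starts with `data` is now ignored, including directories such as `database/`, and the existing `data/` entry becomes redundant.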
6 changes: 3 additions & 3 deletions Dockerfile
@@ -15,8 +15,8 @@
# Read more about [developing Archivebox](https://github.com/ArchiveBox/ArchiveBox#archivebox-development).


# Use Debian 12 w/ faster package updates: https://packages.debian.org/bookworm-backports/
FROM python:3.11-slim-bookworm
# Uses Debian 12 w/ faster-updating apt-lists added below: https://packages.debian.org/bookworm-backports/

LABEL name="archivebox" \
maintainer="Nick Sweeting <[email protected]>" \
@@ -127,9 +127,9 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked,id=apt-$TARGETARCH$T
# 1. packaging dependencies
apt-transport-https ca-certificates apt-utils gnupg2 curl wget \
# 2. docker and init system dependencies
zlib1g-dev dumb-init gosu cron unzip grep \
zlib1g-dev dumb-init gosu cron unzip grep ncat \
# 3. frivolous CLI helpers to make debugging failed archiving easier
# nano iputils-ping dnsutils htop procps jq yq
# nano iputils-ping dnsutils htop procps jq yq \
&& rm -rf /var/lib/apt/lists/*

######### Language Environments ####################################
22 changes: 16 additions & 6 deletions archivebox/config.py
@@ -38,7 +38,7 @@
 from pathlib import Path
 from datetime import datetime, timezone
 from typing import Optional, Type, Tuple, Dict, Union, List
-from subprocess import run, PIPE, DEVNULL
+from subprocess import run, PIPE, STDOUT, DEVNULL
 from configparser import ConfigParser
 from collections import defaultdict
 import importlib.metadata
@@ -854,7 +854,7 @@ def hint(text: Union[Tuple[str, ...], List[str], str], prefix=' ', config: Op


 # Dependency Metadata Helpers
-def bin_version(binary: Optional[str]) -> Optional[str]:
+def bin_version(binary: Optional[str], cmd=None) -> Optional[str]:
     """check the presence and return valid version line of a specified binary"""

     abspath = bin_path(binary)
@@ -863,11 +863,21 @@ def bin_version(binary: Optional[str]) -> Optional[str]:

     try:
         bin_env = os.environ | {'LANG': 'C'}
-        version_str = run([abspath, "--version"], stdout=PIPE, env=bin_env).stdout.strip().decode()
+        is_cmd_str = cmd and isinstance(cmd, str)
+        version_str = run(cmd or [abspath, "--version"], shell=is_cmd_str, stdout=PIPE, stderr=STDOUT, env=bin_env).stdout.strip().decode()
         if not version_str:
-            version_str = run([abspath, "--version"], stdout=PIPE).stdout.strip().decode()
-        # take first 3 columns of first line of version info
-        return ' '.join(version_str.split('\n')[0].strip().split()[:3])
+            version_str = run(cmd or [abspath, "--version"], shell=is_cmd_str, stdout=PIPE, stderr=STDOUT).stdout.strip().decode()
+
+        version_ptn = re.compile(r"\d+?\.\d+?\.?\d*", re.MULTILINE)
+        try:
+            version_nums = version_ptn.findall(version_str.split('\n')[0])[0]
+            if version_nums:
+                return version_nums
+            else:
+                raise IndexError
+        except IndexError:
+            # take first 3 columns of first line of version info
+            return ' '.join(version_str.split('\n')[0].strip().split()[:3])
     except OSError:
         pass
         # stderr(f'[X] Unable to find working version of dependency: {binary}', color='red')
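
One thing worth checking in the new version pattern: the quantifiers in `\d+?\.\d+?\.?\d*` are lazy, so a multi-digit minor component can end the match before the patch number is reached. A minimal sketch comparing it with a greedy alternative. `GREEDY_PTN` and `extract_version` are hypothetical names for illustration, not part of this PR, and the sample version lines are made up:

```python
import re
from typing import Optional

# Pattern exactly as written in the bin_version() change
PR_PTN = re.compile(r"\d+?\.\d+?\.?\d*", re.MULTILINE)

# Hypothetical greedy alternative: major.minor with an optional .patch
GREEDY_PTN = re.compile(r"\d+\.\d+(?:\.\d+)?")


def extract_version(version_str: str, pattern: re.Pattern = GREEDY_PTN) -> Optional[str]:
    """Return the first version-like token on the first line, or None if there is none."""
    lines = version_str.splitlines()
    first_line = lines[0] if lines else ""
    match = pattern.search(first_line)
    return match.group(0) if match else None


# Made-up sample outputs of a `--version` call
for sample in ("curl 7.88.1 (x86_64-pc-linux-gnu)", "tool 1.2.3", "no digits here"):
    print(repr(sample), "->", extract_version(sample, PR_PTN), "vs", extract_version(sample, GREEDY_PTN))
```

With the `7.88.1` sample the lazy pattern stops at `7.88`, while single-digit components such as `1.2.3` come through intact. Output with no version-like token returns `None` here, which corresponds to the three-column fallback path in `bin_version`.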
81 changes: 68 additions & 13 deletions archivebox/config_stubs.py
@@ -9,17 +9,19 @@
SimpleConfigValueGetter = Callable[[], SimpleConfigValue]
ConfigValue = Union[SimpleConfigValue, SimpleConfigValueDict, SimpleConfigValueGetter]

SHArgs = List[str] # shell command args list e.g. ["--something=1", "--someotherarg"]


class BaseConfig(TypedDict):
pass

class ConfigDict(BaseConfig, total=False):
"""
# Regenerate by pasting this quine into `archivebox shell` 🥚
from archivebox.config import ConfigDict, CONFIG_DEFAULTS
from archivebox.config import ConfigDict, CONFIG_SCHEMA
print('class ConfigDict(BaseConfig, total=False):')
print(' ' + '"'*3 + ConfigDict.__doc__ + '"'*3)
for section, configs in CONFIG_DEFAULTS.items():
for section, configs in CONFIG_SCHEMA.items():
for key, attrs in configs.items():
Type, default = attrs['type'], attrs['default']
if default is None:
@@ -32,24 +34,51 @@ class ConfigDict(BaseConfig, total=False):
USE_COLOR: bool
SHOW_PROGRESS: bool
IN_DOCKER: bool
IN_QEMU: bool
PUID: int
PGID: int

PACKAGE_DIR: Path
OUTPUT_DIR: Path
CONFIG_FILE: Path
OUTPUT_DIR: Optional[str]
CONFIG_FILE: Optional[str]
ONLY_NEW: bool
TIMEOUT: int
MEDIA_TIMEOUT: int
OUTPUT_PERMISSIONS: str
RESTRICT_FILE_NAMES: str
URL_DENYLIST: str
URL_ALLOWLIST: Optional[str]
ADMIN_USERNAME: Optional[str]
ADMIN_PASSWORD: Optional[str]
ENFORCE_ATOMIC_WRITES: bool
TAG_SEPARATOR_PATTERN: str

SECRET_KEY: Optional[str]
BIND_ADDR: str
ALLOWED_HOSTS: str
DEBUG: bool
PUBLIC_INDEX: bool
PUBLIC_SNAPSHOTS: bool
PUBLIC_ADD_VIEW: bool
FOOTER_INFO: str
SNAPSHOTS_PER_PAGE: int
CUSTOM_TEMPLATES_DIR: Optional[str]
TIME_ZONE: str
TIMEZONE: str
REVERSE_PROXY_USER_HEADER: str
REVERSE_PROXY_WHITELIST: str
LOGOUT_REDIRECT_URL: str
PREVIEW_ORIGINALS: bool
LDAP: bool
LDAP_SERVER_URI: Optional[str]
LDAP_BIND_DN: Optional[str]
LDAP_BIND_PASSWORD: Optional[str]
LDAP_USER_BASE: Optional[str]
LDAP_USER_FILTER: Optional[str]
LDAP_USERNAME_ATTR: Optional[str]
LDAP_FIRSTNAME_ATTR: Optional[str]
LDAP_LASTNAME_ATTR: Optional[str]
LDAP_EMAIL_ATTR: Optional[str]
LDAP_CREATE_SUPERUSER: bool

SAVE_TITLE: bool
SAVE_FAVICON: bool
@@ -58,25 +87,50 @@ class ConfigDict(BaseConfig, total=False):
SAVE_SINGLEFILE: bool
SAVE_READABILITY: bool
SAVE_MERCURY: bool
SAVE_HTMLTOTEXT: bool
SAVE_PDF: bool
SAVE_SCREENSHOT: bool
SAVE_DOM: bool
SAVE_HEADERS: bool
SAVE_WARC: bool
SAVE_GIT: bool
SAVE_MEDIA: bool
SAVE_ARCHIVE_DOT_ORG: bool
SAVE_ALLOWLIST: dict
SAVE_DENYLIST: dict

RESOLUTION: str
GIT_DOMAINS: str
CHECK_SSL_VALIDITY: bool
MEDIA_MAX_SIZE: str
CURL_USER_AGENT: str
WGET_USER_AGENT: str
CHROME_USER_AGENT: str
COOKIES_FILE: Union[str, Path, None]
CHROME_USER_DATA_DIR: Union[str, Path, None]
COOKIES_FILE: Optional[str]
CHROME_USER_DATA_DIR: Optional[str]
CHROME_TIMEOUT: int
CHROME_HEADLESS: bool
CHROME_SANDBOX: bool
YOUTUBEDL_ARGS: list
WGET_ARGS: list
CURL_ARGS: list
GIT_ARGS: list
SINGLEFILE_ARGS: Optional[list]
FAVICON_PROVIDER: str

USE_INDEXING_BACKEND: bool
USE_SEARCHING_BACKEND: bool
SEARCH_BACKEND_ENGINE: str
SEARCH_BACKEND_HOST_NAME: str
SEARCH_BACKEND_PORT: int
SEARCH_BACKEND_PASSWORD: str
SEARCH_PROCESS_HTML: bool
SONIC_COLLECTION: str
SONIC_BUCKET: str
SEARCH_BACKEND_TIMEOUT: int
FTS_SEPARATE_DATABASE: bool
FTS_TOKENIZERS: str
FTS_SQLITE_MAX_LENGTH: int

USE_CURL: bool
USE_WGET: bool
@@ -85,21 +139,22 @@ class ConfigDict(BaseConfig, total=False):
USE_MERCURY: bool
USE_GIT: bool
USE_CHROME: bool
USE_NODE: bool
USE_YOUTUBEDL: bool
USE_RIPGREP: bool
CURL_BINARY: str
GIT_BINARY: str
WGET_BINARY: str
SINGLEFILE_BINARY: str
READABILITY_BINARY: str
MERCURY_BINARY: str
YOUTUBEDL_BINARY: str
NODE_BINARY: str
RIPGREP_BINARY: str
CHROME_BINARY: Optional[str]

YOUTUBEDL_ARGS: List[str]
WGET_ARGS: List[str]
CURL_ARGS: List[str]
GIT_ARGS: List[str]
TAG_SEPARATOR_PATTERN: str
POCKET_CONSUMER_KEY: Optional[str]
POCKET_ACCESS_TOKENS: dict
READWISE_READER_TOKENS: dict


ConfigDefaultValueGetter = Callable[[ConfigDict], ConfigValue]
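
Because `ConfigDict` is declared with `total=False`, every key in the stub is optional to a type checker, and a getter of type `ConfigDefaultValueGetter` may receive a partially filled dict. A small sketch of that pattern under stated assumptions: `ExampleConfigDict` and `default_wget_args` are hypothetical, reusing a few key names from the stub, and the `--timeout=` argument is only illustrative:

```python
from typing import Callable, List, Optional, TypedDict

SHArgs = List[str]  # shell command args list, same alias as in the diff


class BaseConfig(TypedDict):
    pass


class ExampleConfigDict(BaseConfig, total=False):
    """Hypothetical cut-down stand-in for ConfigDict; all keys are optional."""
    TIMEOUT: int
    WGET_ARGS: SHArgs
    COOKIES_FILE: Optional[str]


ExampleDefaultGetter = Callable[[ExampleConfigDict], SHArgs]


def default_wget_args(config: ExampleConfigDict) -> SHArgs:
    """Derive a default from other keys, tolerating keys that are not set yet."""
    timeout = config.get('TIMEOUT', 60)
    return [f'--timeout={timeout}']


getter: ExampleDefaultGetter = default_wget_args

partial_config: ExampleConfigDict = {'TIMEOUT': 30}  # valid: no key is required
print(getter(partial_config))
print(getter({}))
```

At runtime a `TypedDict` is a plain `dict`, so the annotations only matter to tools like mypy. The getter still has to use `.get()` with a fallback for any key it reads.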
1 change: 1 addition & 0 deletions archivebox/core/__init__.py
@@ -1,2 +1,3 @@
__package__ = 'archivebox.core'

# default_app_config = 'core.apps.CoreAppConfig'
2 changes: 2 additions & 0 deletions archivebox/core/admin.py
@@ -12,6 +12,7 @@
from django.utils.safestring import mark_safe
from django.shortcuts import render, redirect
from django.contrib.auth import get_user_model
from django.contrib.auth.models import Group, Permission
from django import forms

from ..util import htmldecode, urldecode, ansi_to_html
@@ -159,6 +160,7 @@ class SnapshotAdmin(SearchResultsAdminMixin, admin.ModelAdmin):

action_form = SnapshotActionForm


def changelist_view(self, request, extra_context=None):
extra_context = extra_context or {}
return super().changelist_view(request, extra_context | GLOBAL_CONTEXT)
9 changes: 8 additions & 1 deletion archivebox/core/apps.py
@@ -1,9 +1,16 @@
from django.apps import AppConfig


class CoreConfig(AppConfig):
class CoreAppConfig(AppConfig):
name = 'core'

# label = 'Archive Data'
verbose_name = "Archive Data"

# WIP: broken by Django 3.1.2 -> 4.0 migration
# default_auto_field = 'django.db.models.UUIDField'


def ready(self):
from .auth import register_signals

6 changes: 4 additions & 2 deletions archivebox/core/auth.py
@@ -1,5 +1,7 @@
import os
from django.conf import settings
__package__ = 'archivebox.core'



from ..config import (
LDAP
)
10 changes: 9 additions & 1 deletion archivebox/core/models.py
@@ -50,7 +50,7 @@ class Tag(models.Model):

class Meta:
verbose_name = "Tag"
verbose_name_plural = "Tags"
verbose_name_plural = "🏷️ Tags"

def __str__(self):
return self.name
@@ -98,6 +98,10 @@ class Snapshot(models.Model):

keys = ('url', 'timestamp', 'title', 'tags', 'updated')

class Meta:
verbose_name = "Snapshot"
verbose_name_plural = "⭐️ Archived Webpages (Snapshots)"

def __repr__(self) -> str:
title = self.title or '-'
return f'[{self.timestamp}] {self.url[:64]} ({title[:64]})'
@@ -282,5 +286,9 @@ class ArchiveResult(models.Model):

objects = ArchiveResultManager()

class Meta:
verbose_name = "ArchiveResult"
verbose_name_plural = "📑 Logs (ArchiveResults)"

def __str__(self):
return self.extractor