OLD IDEA / no longer active: New modular plugin design + 3 new plugins: browsertrix-crawler, ReplayWeb.page, gallery-dl #1327

Draft
wants to merge 62 commits into
base: dev
Changes from 4 commits
Commits
16adff4
add ncat to docker container to use for ipc tunnel
pirate Jan 17, 2024
1d4ec6f
wip new plugins system with browsertrix
pirate Jan 17, 2024
91c3458
wip attempt to tweak uwsgi to try to serve archive media files with b…
pirate Jan 17, 2024
216e0b7
add ipc listener server script
pirate Jan 17, 2024
820c152
add note about issue 1191 and shell requoting or special chars
pirate Jan 19, 2024
77075b3
add ncat to docker container to use for ipc tunnel
pirate Jan 17, 2024
5cf94a3
wip new plugins system with browsertrix
pirate Jan 17, 2024
64f6816
wip attempt to tweak uwsgi to try to serve archive media files with b…
pirate Jan 17, 2024
a61c9cf
add ipc listener server script
pirate Jan 17, 2024
aeaefe8
persist snapshot index header collapse state
pirate Jan 19, 2024
624fd2d
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
861d44d
fix sorting by Size or by Files to sort by number of archive results
pirate Jan 19, 2024
17fdf13
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
99bb02c
add fallback to check wget output dir with port stripped
pirate Jan 19, 2024
19e9c1c
include more output file locations when considering whether snapshot.…
pirate Jan 19, 2024
a6b241f
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
3ebac0f
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
475217a
minor formatting cleanup
pirate Jan 19, 2024
0f5cc78
add django solo models for plugin config
pirate Jan 19, 2024
3421833
replaywebpage js files
pirate Jan 19, 2024
b652024
add gallerydl plugin
pirate Jan 19, 2024
ea2c5a2
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
3234a36
wip refactoring
pirate Jan 19, 2024
4c8db99
Merge branch 'dev' into plugins-browsertrix
pirate Jan 19, 2024
7aa0dd1
Merge branch 'dev' into plugins-browsertrix
pirate Jan 20, 2024
d2ec6d7
Merge branch 'dev' into plugins-browsertrix
pirate Jan 23, 2024
f1b5df3
Merge branch 'dev' into plugins-browsertrix
pirate Jan 23, 2024
2ba72f1
Merge branch 'dev' into plugins-browsertrix
pirate Jan 23, 2024
0c878eb
Merge branch 'dev' into plugins-browsertrix
pirate Jan 24, 2024
d0e3c95
add defaults and system plugins
pirate Jan 24, 2024
8e41aec
fix plugin loading and admin config display
pirate Jan 24, 2024
4aee3b5
Merge branch 'dev' into plugins-browsertrix
pirate Jan 24, 2024
54ae6a0
Merge branch 'dev' into plugins-browsertrix
pirate Jan 26, 2024
d96d986
Merge branch 'dev' into plugins-browsertrix
pirate Jan 28, 2024
beb83f2
mypy fixes
pirate Jan 25, 2024
eaa4a9c
fix django core auth
pirate Jan 25, 2024
9861a4f
more mypy fixes
pirate Jan 25, 2024
17213ea
chown /var/spool/cron/crontabs in docker entrypoint
pirate Jan 28, 2024
308b493
add pudb and type hints
pirate Jan 28, 2024
ef667a4
Merge branch 'dev' into plugins-browsertrix
pirate Jan 28, 2024
63b2c9e
Merge branch 'dev' into plugins-browsertrix
pirate Jan 30, 2024
c4888ac
Merge branch 'dev' into plugins-browsertrix
pirate Jan 30, 2024
c6faa9a
add extra information to headers extractor output
pirate Feb 7, 2024
342ecd2
update gitignore to ignore all data dirs
pirate Feb 7, 2024
6faf7aa
ignore vscode dir
pirate Feb 7, 2024
b56bfe5
add CUSTOM_TEMPLATES=/data/templates default config in Docker
pirate Feb 8, 2024
97b1859
add TODO to support archive.org-style urls
pirate Feb 8, 2024
777694e
add type hints to plugin config models
pirate Feb 8, 2024
95e866c
fix dockerfile syntax trailing slash
pirate Feb 8, 2024
fbb3c84
Merge branch 'dev' into plugins-browsertrix
pirate Feb 13, 2024
edabc47
Merge branch 'dev' into plugins-browsertrix
pirate Feb 13, 2024
93ed633
Merge branch 'dev' into plugins-browsertrix
pirate Feb 18, 2024
2b99dcd
Update setup.sh
pirate Feb 18, 2024
50d52ea
fix requirements.txt so docks build doesnt crash on missing ldap c he…
pirate Feb 19, 2024
ddc639e
bump required python version to 3.10 to match brew and apt
pirate Feb 19, 2024
15d1865
Merge branch 'dev' into plugins-browsertrix
pirate Feb 21, 2024
1ea7ac1
Merge branch 'dev' into plugins-browsertrix
pirate Feb 22, 2024
11b067a
Merge branch 'dev' into plugins-browsertrix
pirate Mar 15, 2024
c22df0b
Merge branch 'dev' into plugins-browsertrix
pirate Mar 18, 2024
b5311d2
Merge branch 'dev' into plugins-browsertrix
pirate Mar 26, 2024
33e8273
Merge branch 'dev' into plugins-browsertrix
pirate Apr 24, 2024
e594065
Merge branch 'dev' into plugins-browsertrix
pirate Apr 24, 2024
4 changes: 2 additions & 2 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -113,9 +113,9 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked,id=apt-$TARGETARCH$T
# 1. packaging dependencies
apt-transport-https ca-certificates apt-utils gnupg2 curl wget \
# 2. docker and init system dependencies
zlib1g-dev dumb-init gosu cron unzip grep \
zlib1g-dev dumb-init gosu cron unzip grep ncat \
# 3. frivolous CLI helpers to make debugging failed archiving easier
# nano iputils-ping dnsutils htop procps jq yq
# nano iputils-ping dnsutils htop procps jq yq \
&& rm -rf /var/lib/apt/lists/*

######### Language Environments ####################################
Expand Down
27 changes: 26 additions & 1 deletion archivebox/core/settings.py
Expand Up @@ -61,6 +61,12 @@
'django.contrib.admin',

'core',

# Plugins
'plugins.replaywebpage',
# ...
# someday we may have enough plugins to justify dynamic loading:
# *(path.parent.name for path in (Path(PACKAGE_DIR) / 'plugins').glob('*/apps.py')),

'django_extensions',
]
Expand Down Expand Up @@ -162,7 +168,7 @@
'debug_toolbar.panels.request.RequestPanel',
'debug_toolbar.panels.sql.SQLPanel',
'debug_toolbar.panels.staticfiles.StaticFilesPanel',
# 'debug_toolbar.panels.templates.TemplatesPanel',
# 'debug_toolbar.panels.templates.TemplatesPanel', # buggy/slow
'debug_toolbar.panels.cache.CachePanel',
'debug_toolbar.panels.signals.SignalsPanel',
'debug_toolbar.panels.logging.LoggingPanel',
Expand All @@ -178,16 +184,35 @@

STATIC_URL = '/static/'

STATIC_ROOT = Path(PACKAGE_DIR) / 'collected_static'

STATICFILES_DIRS = [
*([str(CUSTOM_TEMPLATES_DIR / 'static')] if CUSTOM_TEMPLATES_DIR else []),
str(Path(PACKAGE_DIR) / TEMPLATES_DIR_NAME / 'static'),

# Plugins
str(Path(PACKAGE_DIR) / 'plugins/replaywebpage/static'),
# ...
# someday if there are many more plugins / user-addable plugins:
# *(str(path) for path in (Path(PACKAGE_DIR) / 'plugins').glob('*/static')),
]

MEDIA_URL = '/archive/'
MEDIA_ROOT = OUTPUT_DIR / 'archive'


TEMPLATE_DIRS = [
*([str(CUSTOM_TEMPLATES_DIR)] if CUSTOM_TEMPLATES_DIR else []),
str(Path(PACKAGE_DIR) / TEMPLATES_DIR_NAME / 'core'),
str(Path(PACKAGE_DIR) / TEMPLATES_DIR_NAME / 'admin'),
str(Path(PACKAGE_DIR) / TEMPLATES_DIR_NAME),

# Plugins
str(Path(PACKAGE_DIR) / 'plugins/replaywebpage/templates'),
# ...
#
# someday if there are many more plugins / user-addable plugins:
# *(str(path) for path in (Path(PACKAGE_DIR) / 'plugins').glob('*/templates')),
]

TEMPLATES = [
Expand Down
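The commented-out generator expressions in INSTALLED_APPS, STATICFILES_DIRS, and TEMPLATE_DIRS hint at loading plugins dynamically later on. A minimal sketch of that idea follows, assuming every plugin lives in plugins/<name>/ and ships an apps.py. The helper names discover_plugin_apps and discover_plugin_dirs are hypothetical and not part of this PR.

```python
from pathlib import Path


def discover_plugin_apps(package_dir: Path) -> list[str]:
    """Return 'plugins.<name>' for every plugins/<name>/apps.py under package_dir."""
    plugins_dir = Path(package_dir) / 'plugins'
    return sorted(f'plugins.{path.parent.name}' for path in plugins_dir.glob('*/apps.py'))


def discover_plugin_dirs(package_dir: Path, subdir: str) -> list[str]:
    """Return every existing plugins/<name>/<subdir> directory (e.g. 'static' or 'templates')."""
    plugins_dir = Path(package_dir) / 'plugins'
    return sorted(str(path) for path in plugins_dir.glob(f'*/{subdir}') if path.is_dir())
```

In settings.py these could be splatted into the existing lists, e.g. `*discover_plugin_apps(Path(PACKAGE_DIR))` inside INSTALLED_APPS. Sorting keeps the load order stable between runs, since glob order is not guaranteed.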
4 changes: 4 additions & 0 deletions archivebox/core/urls.py
Expand Up @@ -8,6 +8,7 @@

from core.views import HomepageView, SnapshotView, PublicIndexView, AddView, HealthCheckView


# GLOBAL_CONTEXT doesn't work as-is, disabled for now: https://github.com/ArchiveBox/ArchiveBox/discussions/1306
# from config import VERSION, VERSIONS_AVAILABLE, CAN_UPGRADE
# GLOBAL_CONTEXT = {'VERSION': VERSION, 'VERSIONS_AVAILABLE': VERSIONS_AVAILABLE, 'CAN_UPGRADE': CAN_UPGRADE}
Expand All @@ -26,6 +27,9 @@
path('archive/', RedirectView.as_view(url='/')),
path('archive/<path:path>', SnapshotView.as_view(), name='Snapshot'),

path('plugins/replaywebpage/', include('plugins.replaywebpage.urls')),
# ... dynamic load these someday if there are more of them

path('admin/core/snapshot/add/', RedirectView.as_view(url='/add/')),
path('add/', AddView.as_view(), name='add'),

Expand Down
2 changes: 1 addition & 1 deletion archivebox/manage.py
Expand Up @@ -7,7 +7,7 @@
# versions of ./manage.py commands whenever possible. When that's not possible
# (e.g. makemigrations), you can comment out this check temporarily

if not ('makemigrations' in sys.argv or 'migrate' in sys.argv):
if not ('makemigrations' in sys.argv or 'migrate' in sys.argv or 'collectstatic' in sys.argv):
print("[X] Don't run ./manage.py directly (unless you are a developer running makemigrations):")
print()
print(' Hint: Use these archivebox CLI commands instead of the ./manage.py equivalents:')
Expand Down
3 changes: 3 additions & 0 deletions archivebox/plugins/__init__.py
@@ -0,0 +1,3 @@
__package__ = 'archivebox.plugins'


1 change: 1 addition & 0 deletions archivebox/plugins/replaywebpage/__init__.py
@@ -0,0 +1 @@
__package__ = 'archivebox.plugins.replaywebpage'
8 changes: 8 additions & 0 deletions archivebox/plugins/replaywebpage/apps.py
@@ -0,0 +1,8 @@
from django.apps import AppConfig


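class ReplayWebPageConfig(AppConfig):
    # name must be the importable module path listed in INSTALLED_APPS
    name = "plugins.replaywebpage"
    # label must be a valid Python identifier, so the display name goes in verbose_name
    label = "plugin_replaywebpage"
    verbose_name = "ReplayWeb.Page"

    default_auto_field = "django.db.models.BigAutoField"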
50 changes: 50 additions & 0 deletions archivebox/plugins/replaywebpage/extractors.py
@@ -0,0 +1,50 @@
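# browsertrix extractor (WIP)

import shlex
from pathlib import Path

# NOTE: import locations are assumed to match the ones the built-in extractors use
from archivebox.system import run, copy_and_overwrite
from archivebox.logging_util import TimedProgress


def save_browsertrix(link, out_dir, timeout, config):
    browsertrix_dir = Path(out_dir) / 'browsertrix'
    browsertrix_dir.mkdir(exist_ok=True)

    crawl_id = link.timestamp

    browsertrix_crawler_cmd = [
        'crawl',
        '--url', link.url,
        f'--collection={crawl_id}',
        '--scopeType=page',
        '--generateWACZ',
        '--text=final-to-warc',
        '--timeLimit=60',
    ]

    # shell script sent over the ipc tunnel; shlex quotes special chars in the URL
    remote_cmd = f"""
    rm /tmp/dump.rdb;
    rm -rf /crawls/collections;
    mkdir /crawls/collections;
    env CRAWL_ID={shlex.quote(crawl_id)} {shlex.join(browsertrix_crawler_cmd)}
    """

    local_cmd = ['nc', 'browsertrix', '2222']

    status = 'succeeded'
    output = None
    timer = TimedProgress(timeout, prefix=' ')
    try:
        result = run(local_cmd, cwd=str(out_dir), input=remote_cmd.encode(), timeout=timeout)
        cmd_output = result.stdout.decode()

        # TODO: confirm this matches where the crawler writes under the shared volume
        wacz_output_file = Path('/browsertrix/crawls') / crawl_id / f'{crawl_id}.wacz'
        output = browsertrix_dir / wacz_output_file.name
        copy_and_overwrite(wacz_output_file, output)
    except Exception as err:
        status = 'failed'
        output = err
    finally:
        timer.end()

    # TODO: wrap this in an ArchiveResult like the built-in extractors do
    return status, output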

TEMPLATE = """

"""

# rm /tmp/dump.rdb;
# rm -rf /crawls/collections;
# mkdir /crawls/collections;
# env CRAWL_ID=tec2342 crawl --url 'https://example.com' --scopeType page --generateWACZ --collection tec2342 --text final-to-warc --timeLimit 60
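The commented-out command above places the URL directly inside a shell string, which is where the requoting problems noted in the "issue 1191" commit tend to surface. As a rough sketch of one way to keep special characters literal, the script could be assembled with the standard library's shlex module. The helper name build_remote_crawl_script and its parameters are hypothetical and not part of this PR.

```python
import shlex


def build_remote_crawl_script(url: str, crawl_id: str, time_limit: int = 60) -> str:
    """Build the shell script sent to the browsertrix container (hypothetical helper)."""
    crawl_cmd = [
        'crawl',
        '--url', url,
        f'--collection={crawl_id}',
        '--scopeType=page',
        '--generateWACZ',
        '--text=final-to-warc',
        f'--timeLimit={time_limit}',
    ]
    return '\n'.join([
        'rm /tmp/dump.rdb;',
        'rm -rf /crawls/collections;',
        'mkdir /crawls/collections;',
        # shlex.join() quotes each argument, so '&', '$', quotes and spaces stay literal
        f'env CRAWL_ID={shlex.quote(crawl_id)} {shlex.join(crawl_cmd)}',
    ])
```

Splitting the last line of the result with shlex.split() should give back the original argument list, which makes the quoting easy to check in a unit test.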
124 changes: 124 additions & 0 deletions archivebox/plugins/replaywebpage/static/sw.js

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions archivebox/plugins/replaywebpage/static/test.txt
@@ -0,0 +1 @@
test content this should be visible
Binary file not shown.
3,392 changes: 3,392 additions & 0 deletions archivebox/plugins/replaywebpage/static/ui.js

Large diffs are not rendered by default.

@@ -0,0 +1,39 @@
{% load tz core_tags static %}

<!DOCTYPE html>
<html lang="en">
<head>
<title>{{title}}</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">

<style>
html, body {
width: 100%;
height: 100%;
background-color: #ddd;
}

</style>
</head>
<body>
ReplayWeb.page for: {{snapshot.url}} ({{timestamp}}) /{{warc_filename}}

{{snapshot}}

<script>
// https://cdn.jsdelivr.net/npm/[email protected]/sw.min.js
// https://cdn.jsdelivr.net/npm/[email protected]/ui.min.js
</script>

<style>
replay-web-page {
width: 100%;
height: 900px;
}
</style>
<script src="/static/ui.js"></script>

<replay-web-page style="height: 600px" embed="replay-with-info" replayBase="/static/" source="{% static 'test.wacz' %}" url="https://example.com/"></replay-web-page>
</body>
</html>
7 changes: 7 additions & 0 deletions archivebox/plugins/replaywebpage/urls.py
@@ -0,0 +1,7 @@
from django.urls import path

from .views import ReplayWebPageViewer

urlpatterns = [
path('<path:path>', ReplayWebPageViewer.as_view(), name='plugin_replaywebpage__viewer'),
]
47 changes: 47 additions & 0 deletions archivebox/plugins/replaywebpage/views.py
@@ -0,0 +1,47 @@
import os
import sys
from pathlib import Path

from django.views import View
from django.shortcuts import render, redirect
from django.db.models import Q

from core.models import Snapshot

# from archivebox.config import PUBLIC_SNAPSHOTS
PUBLIC_SNAPSHOTS = True


class ReplayWebPageViewer(View):
    template_name = 'plugin_replaywebpage__viewer.html'

    # render static html index from filesystem archive/<timestamp>/index.html

    def get_context_data(self, **kwargs):
        return {
            # **super().get_context_data(**kwargs),
            # 'VERSION': VERSION,
            # 'COMMIT_HASH': COMMIT_HASH,
            # 'FOOTER_INFO': FOOTER_INFO,
        }


    def get(self, request, path):
        if not request.user.is_authenticated and not PUBLIC_SNAPSHOTS:
            return redirect(f'/admin/login/?next={request.path}')

        try:
            timestamp, warc_filename = path.split('/', 1)
        except (IndexError, ValueError):
            # no '/' in the path: only a timestamp was given
            timestamp, warc_filename = path.split('/', 1)[0], ''

        snapshot = Snapshot.objects.get(Q(timestamp=timestamp) | Q(id__startswith=timestamp))

        context = self.get_context_data()
        context.update({
            "snapshot": snapshot,
            "timestamp": timestamp,
            "warc_filename": warc_filename,
        })
        return render(template_name=self.template_name, request=self.request, context=context)
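
The try/except around path.split('/', 1) in get() exists only to handle a path with no slash. A small sketch of the same parsing with str.partition, which never raises, might look like this. The helper name split_replay_path is hypothetical and not part of this PR.

```python
def split_replay_path(path: str) -> tuple[str, str]:
    """Split '<timestamp>/<warc_filename>' into its two parts (hypothetical helper).

    partition() returns an empty filename when there is no '/',
    so no exception handling is needed.
    """
    timestamp, _, warc_filename = path.partition('/')
    return timestamp, warc_filename
```

Only the first slash splits the path, so a nested filename such as browsertrix/crawl.wacz stays intact in the second element.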

38 changes: 38 additions & 0 deletions bin/docker_ipc_listener.py
@@ -0,0 +1,38 @@
#!/usr/bin/env python3

# Allow another docker container to run commands on this container
# This is the script to run on the server container.
# The client can connect and run a command like so:
# $ echo whoami | nc servercontainername 2222
# root

import socket
import subprocess as sp
from datetime import datetime

LISTEN_PORT = 2222

s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s1.bind(("0.0.0.0", LISTEN_PORT))
s1.listen(1)
print(f"Listening for shell commands on 0.0.0.0:{LISTEN_PORT}", flush=True)

conn, addr = s1.accept()
while True:
    cmd = conn.recv(1024).decode()
    if not cmd:
        # client disconnected without sending a command, wait for the next one
        conn.close()
        conn, addr = s1.accept()
        continue

    timestamp = datetime.now().isoformat()
    client_ip, client_port = addr  # address of the connecting client, as returned by accept()
    print(f'\n[{timestamp}][{client_ip}:{client_port}] $', cmd)

    with sp.Popen(cmd, shell=True, stdout=sp.PIPE, stderr=sp.STDOUT, stdin=sp.PIPE, bufsize=1, universal_newlines=True) as p:
        for line in p.stdout:
            print(line.strip(), flush=True)
            conn.sendall(line.encode("utf-8"))

    conn.close()
    conn, addr = s1.accept()
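
The header comment shows the client side as `echo whoami | nc servercontainername 2222`. For callers that would rather not shell out to nc, a rough Python equivalent could look like the sketch below. It assumes the listener's behaviour as written above: one command per connection, read in a single recv(1024), with the connection closed once the command finishes. The function name run_remote_command is hypothetical and not part of this PR.

```python
import socket


def run_remote_command(host: str, port: int, cmd: str, timeout: float = 60.0) -> str:
    """Send one shell command to the listener and return everything it prints."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(cmd.encode('utf-8') + b'\n')
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # listener closed the connection: the command has finished
                break
            chunks.append(chunk)
    return b''.join(chunks).decode('utf-8', errors='replace')
```

Because the listener reads at most 1024 bytes per command, longer scripts would need to be shortened or the protocol extended. The timeout applies to each socket operation rather than to the whole command.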
12 changes: 11 additions & 1 deletion docker-compose.yml
Expand Up @@ -13,12 +13,14 @@ version: '3.9'
services:
archivebox:
#image: ${DOCKER_IMAGE:-archivebox/archivebox:dev}
image: archivebox/archivebox:dev
image: archivebox:test
# image: archivebox/archivebox:dev
command: server --quick-init 0.0.0.0:8000
ports:
- 8000:8000
volumes:
- ./data:/data
- /Volumes/OPT/browsertrix:/browsertrix:z
# - ./etc/crontabs:/var/spool/cron/crontabs # uncomment this and archivebox_scheduler below to set up automatic recurring archive jobs
# - ./archivebox:/app/archivebox # uncomment this to mount the ArchiveBox source code at runtime (for developers working on archivebox)
# build: . # uncomment this to build the image from source code at buildtime (for developers working on archivebox)
Expand Down Expand Up @@ -48,6 +50,14 @@ services:
# dns:
# - 172.20.0.53

browsertrix:
image: webrecorder/browsertrix-crawler:latest
command: /bin/docker_ipc_listener.py
expose:
- 2222
volumes:
- /Volumes/OPT/browsertrix:/crawls:z
- ./bin/docker_ipc_listener.py:/bin/docker_ipc_listener.py

######## Optional Addons: tweak examples below as needed for your specific use case ########

Expand Down
7 changes: 6 additions & 1 deletion etc/uwsgi.ini
Expand Up @@ -7,7 +7,12 @@ wsgi-file = archivebox/core/wsgi.py
processes = 4
threads = 1
stats = 127.0.0.1:9191
static-map /static=./archivebox/templates/static
static-map = /static=./archivebox/templates/static
static-map = /static=./archivebox/plugins/replaywebpage/static
static-map = /archive=$(PWD)/archive
static-index = index.html
harakiri = 172800
post-buffering = 1
disable-logging = True
check-static
honour-range = True