Perform on-host conversion for the pixels to PDF stage #748

Merged
merged 19 commits into from
Oct 17, 2024
67 changes: 62 additions & 5 deletions .github/workflows/ci.yml
@@ -66,8 +66,32 @@ jobs:
sudo apt-get install -y python3-poetry
python3 ./install/common/build-image.py

  download-tessdata:
    name: Download and cache Tesseract data
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Cache Tessdata
        id: cache-tessdata
        uses: actions/cache@v4
        with:
          path: share/tessdata/
          key: v1-tessdata-${{ hashFiles('./install/common/download-tessdata.py') }}
          enableCrossOsArchive: true
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Download Tessdata
        run: |-
          if [ -d "share/tessdata" ]; then
            echo "Already cached, skipping"
          else
            python3 ./install/common/download-tessdata.py
          fi

windows:
runs-on: windows-latest
needs: download-tessdata
env:
DUMMY_CONVERSION: 1
steps:
@@ -77,6 +101,13 @@ jobs:
python-version: "3.12"
- run: pip install poetry
- run: poetry install
- name: Restore cached tessdata
uses: actions/cache/restore@v4
with:
path: share/tessdata/
enableCrossOsArchive: true
fail-on-cache-miss: true
key: v1-tessdata-${{ hashFiles('./install/common/download-tessdata.py') }}
- name: Run CLI tests
run: poetry run make test
# Taken from: https://github.com/orgs/community/discussions/27149#discussioncomment-3254829
@@ -90,6 +121,7 @@ jobs:
macOS:
name: "macOS (${{ matrix.arch }})"
runs-on: ${{ matrix.runner }}
needs: download-tessdata
strategy:
matrix:
include:
@@ -104,6 +136,13 @@ jobs:
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Restore cached tessdata
uses: actions/cache/restore@v4
with:
path: share/tessdata/
enableCrossOsArchive: true
fail-on-cache-miss: true
key: v1-tessdata-${{ hashFiles('./install/common/download-tessdata.py') }}
- run: pip install poetry
- run: poetry install
- name: Run CLI tests
@@ -174,7 +213,7 @@ jobs:
if: matrix.distro == 'debian' && matrix.version == 'bookworm'
uses: actions/upload-artifact@v4
with:
-          name: dangerzone.deb
+          name: dangerzone-${{ matrix.distro }}-${{ matrix.version }}.deb
path: "deb_dist/dangerzone_*_*.deb"
if-no-files-found: error
compression-level: 0
@@ -214,7 +253,7 @@ jobs:
- name: Download Dangerzone .deb
uses: actions/download-artifact@v4
with:
-          name: dangerzone.deb
+          name: dangerzone-debian-bookworm.deb
path: "deb_dist/"

- name: Build end-user environment
@@ -227,7 +266,7 @@ jobs:
run: |
./dev_scripts/env.py --distro ${{ matrix.distro }} \
--version ${{ matrix.version }} \
-              run dangerzone-cli dangerzone/tests/test_docs/sample-pdf.pdf
+              run dangerzone-cli dangerzone/tests/test_docs/sample-pdf.pdf --ocr-lang eng

- name: Check that the Dangerzone GUI imports work
run: |
@@ -291,7 +330,7 @@ jobs:
- name: Run a test command
run: |
./dev_scripts/env.py --distro ${{ matrix.distro }} --version ${{ matrix.version }} \
-              run dangerzone-cli dangerzone/tests/test_docs/sample-pdf.pdf
+              run dangerzone-cli dangerzone/tests/test_docs/sample-pdf.pdf --ocr-lang eng

- name: Check that the Dangerzone GUI imports work
run: |
@@ -301,7 +340,9 @@ jobs:
run-tests:
name: "run tests (${{ matrix.distro }} ${{ matrix.version }})"
runs-on: ubuntu-latest
-    needs: build-container-image
+    needs:
+      - build-container-image
+      - download-tessdata
strategy:
matrix:
include:
@@ -360,6 +401,22 @@ jobs:
share/image-id.txt
fail-on-cache-miss: true

- name: Restore cached tessdata
uses: actions/cache/restore@v4
with:
path: share/tessdata/
enableCrossOsArchive: true
fail-on-cache-miss: true
key: v1-tessdata-${{ hashFiles('./install/common/download-tessdata.py') }}

- name: Setup xvfb (Linux)
run: |
# Stuff copied wildly from several stackoverflow posts
sudo apt-get install -y xvfb libxkbcommon-x11-0 libxcb-icccm4 libxcb-image0 libxcb-keysyms1 libxcb-randr0 libxcb-render-util0 libxcb-xinerama0 libxcb-xinput0 libxcb-xfixes0 libxcb-shape0 libglib2.0-0 libgl1-mesa-dev '^libxcb.*-dev' libx11-xcb-dev libglu1-mesa-dev libxrender-dev libxi-dev libxkbcommon-dev libxkbcommon-x11-dev

# start xvfb in the background
sudo /usr/bin/Xvfb $DISPLAY -screen 0 1280x1024x24 &

- name: Run CI tests
run: |-
# Pass the -ac Xserver flag, to disable host-based access controls.
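
The `download-tessdata` job and every "Restore cached tessdata" step above build the same cache key from `hashFiles('./install/common/download-tessdata.py')`, so editing the download script invalidates the cache for all platforms at once. The sketch below approximates that idea in Python: it hashes each file with SHA-256 and then hashes the collected digests. The `cache_key` helper and the throwaway script contents are assumptions made for this illustration, and the resulting string is not expected to match the value GitHub Actions computes.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(prefix: str, *paths: str) -> str:
    """Build a cache key that changes whenever any listed file changes."""
    combined = hashlib.sha256()
    for path in sorted(paths):
        # Hash each file on its own, then fold that digest into the final hash.
        file_digest = hashlib.sha256(Path(path).read_bytes()).digest()
        combined.update(file_digest)
    return f"{prefix}-{combined.hexdigest()}"


# Demo with a throwaway file standing in for the download script.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "download-tessdata.py"
    script.write_text("RELEASE = 'A'\n")
    key_before = cache_key("v1-tessdata", str(script))
    script.write_text("RELEASE = 'B'\n")
    key_after = cache_key("v1-tessdata", str(script))

print(key_before != key_after)  # True: editing the script yields a new key
```

Because the restore steps set `fail-on-cache-miss: true`, a key that differs between the producing job and a consuming job would fail the run rather than silently skip the OCR data.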
1 change: 1 addition & 0 deletions .gitignore
@@ -22,6 +22,7 @@ var/
wheels/
pip-wheel-metadata/
share/python-wheels/
share/tessdata/
*.egg-info/
.installed.cfg
*.egg
29 changes: 25 additions & 4 deletions BUILD.md
@@ -97,6 +97,12 @@ Build the latest container:
python3 ./install/common/build-image.py
```

Download the OCR language data:

```sh
python3 ./install/common/download-tessdata.py
```

Run from source tree:

```sh
@@ -174,6 +180,12 @@ Build the latest container:
python3 ./install/common/build-image.py
```

Download the OCR language data:

```sh
python3 ./install/common/download-tessdata.py
```

Run from source tree:

```sh
@@ -278,10 +290,7 @@ test it.
cd dangerzone
```

-2. Follow the Fedora instructions for setting up the development environment with the particularity of running the following instead of `poetry install`:
-   ```
-   poetry install --with qubes
-   ```
+2. Follow the Fedora instructions for setting up the development environment.

3. Build a dangerzone `.rpm` for qubes with the command

@@ -379,6 +388,12 @@ Build the dangerzone container image:
python3 ./install/common/build-image.py
```

Download the OCR language data:

```sh
python3 ./install/common/download-tessdata.py
```

Run from source tree:

```sh
@@ -440,6 +455,12 @@ Build the dangerzone container image:
python3 .\install\common\build-image.py
```

Download the OCR language data:

```sh
python3 .\install\common\download-tessdata.py
```

After that you can launch dangerzone during development with:

```
25 changes: 0 additions & 25 deletions Dockerfile
@@ -21,30 +21,6 @@ RUN case "$ARCH" in \
RUN pip install -vv --break-system-packages --require-hashes -r /tmp/requirements.txt


###########################################
# Download Tesseract data

FROM alpine:latest as tessdata-dl
ARG TESSDATA_CHECKSUM=d0e3bb6f3b4e75748680524a1d116f2bfb145618f8ceed55b279d15098a530f9

# Download the trained models from the latest GitHub release of Tesseract, and
# store them under /usr/share/tessdata. This is basically what distro packages
# do under the hood.
#
# Because the GitHub release contains more files than just the trained models,
# we use `find` to fetch only the '*.traineddata' files in the top directory.
#
# Before we untar the models, we also check if the checksum is the expected one.
RUN mkdir /usr/share/tessdata/ && mkdir tessdata && cd tessdata \
&& TESSDATA_VERSION=$(wget -O- -nv https://api.github.com/repos/tesseract-ocr/tessdata_fast/releases/latest \
| sed -n 's/^.*"tag_name": "\([0-9.]\+\)".*$/\1/p') \
&& wget https://github.com/tesseract-ocr/tessdata_fast/archive/$TESSDATA_VERSION/tessdata_fast-$TESSDATA_VERSION.tar.gz \
&& echo "$TESSDATA_CHECKSUM tessdata_fast-$TESSDATA_VERSION.tar.gz" | sha256sum -c \
&& tar -xzvf tessdata_fast-$TESSDATA_VERSION.tar.gz -C . \
&& find . -name '*.traineddata' -maxdepth 2 -exec cp {} /usr/share/tessdata/ \; \
&& cd .. && rm -r tessdata


###########################################
# Download H2ORestart
FROM alpine:latest as h2orestart-dl
@@ -74,7 +50,6 @@ RUN apk --no-cache -U upgrade && \
COPY --from=pymupdf-build /usr/lib/python3.12/site-packages/fitz/ /usr/lib/python3.12/site-packages/fitz
COPY --from=pymupdf-build /usr/lib/python3.12/site-packages/pymupdf/ /usr/lib/python3.12/site-packages/pymupdf
COPY --from=pymupdf-build /usr/lib/python3.12/site-packages/PyMuPDF.libs/ /usr/lib/python3.12/site-packages/PyMuPDF.libs
COPY --from=tessdata-dl /usr/share/tessdata/ /usr/share/tessdata
COPY --from=h2orestart-dl /libreoffice_ext/ /libreoffice_ext

RUN install -dm777 "/usr/lib/libreoffice/share/extensions/"
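
The Dockerfile stage removed above downloaded a `tessdata_fast` release tarball, checked its SHA-256 checksum, and copied only the top-level `*.traineddata` files. With the conversion moved on-host, the same verify-then-extract pattern has to happen outside the image. The following Python sketch shows that pattern under stated assumptions: `extract_traineddata` is a hypothetical helper written for this illustration, the archive is a tiny stand-in built in memory, and none of it is taken from `install/common/download-tessdata.py`, whose contents are not shown in this diff.

```python
import hashlib
import io
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def extract_traineddata(archive: bytes, expected_sha256: str, dest: Path) -> list[str]:
    """Verify the archive checksum, then copy top-level *.traineddata files to dest."""
    actual = hashlib.sha256(archive).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")

    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            # Keep only regular files directly under the archive's root directory,
            # mirroring `find . -name '*.traineddata' -maxdepth 2`.
            if member.isfile() and len(parts) == 2 and parts[1].endswith(".traineddata"):
                (dest / parts[1]).write_bytes(tar.extractfile(member).read())
                extracted.append(parts[1])
    return sorted(extracted)


# Demo: build a tiny stand-in archive in memory instead of downloading one.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for name in (
        "tessdata_fast-x/eng.traineddata",
        "tessdata_fast-x/README.md",
        "tessdata_fast-x/script/Latin.traineddata",
    ):
        payload = b"dummy"
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
archive = buffer.getvalue()

with tempfile.TemporaryDirectory() as tmp:
    names = extract_traineddata(archive, hashlib.sha256(archive).hexdigest(), Path(tmp))
print(names)  # ['eng.traineddata']
```

Checking the digest before opening the tarball means a tampered or truncated download is rejected before any file is written to the destination directory.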
2 changes: 1 addition & 1 deletion Makefile
@@ -45,7 +45,7 @@ test:
# Make each GUI test run as a separate process, to avoid segfaults due to
# shared state.
# See more in https://github.com/freedomofpress/dangerzone/issues/493
-	pytest --co -q tests/gui | grep -v ' collected' | xargs -n 1 pytest -v
+	pytest --co -q tests/gui | grep -e '^tests/' | xargs -n 1 pytest -v
pytest -v --cov --ignore dev_scripts --ignore tests/gui --ignore tests/test_large_set.py
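
The Makefile change above replaces a deny-list filter (`grep -v ' collected'`) with an allow-list filter (`grep -e '^tests/'`) on the output of `pytest --co -q tests/gui`. The sketch below reproduces both filters in Python to show why the allow-list is stricter. The sample output is invented for this illustration, including the test names and the warning lines, and real pytest output will differ.

```python
import re

# Hypothetical output of `pytest --co -q tests/gui`; real output will differ.
sample_output = """\
tests/gui/test_example.py::test_one
tests/gui/test_example.py::test_two

=============== warnings summary ===============
some/dependency.py:1: DeprecationWarning: example
2 tests collected in 0.10s
"""

lines = sample_output.splitlines()

# Old filter, like `grep -v ' collected'`: drop only the summary line.
deny_list = [line for line in lines if " collected" not in line]

# New filter, like `grep -e '^tests/'`: keep only lines that look like test IDs.
allow_list = [line for line in lines if re.match(r"tests/", line)]

print(len(deny_list), len(allow_list))  # 5 2
```

With the old filter, any extra line that pytest prints would be handed to `xargs -n 1 pytest -v` as if it were a test ID; the new filter passes along only lines that start with `tests/`.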


16 changes: 0 additions & 16 deletions README.md
@@ -92,19 +92,3 @@ Dangerzone gets updates to improve its features _and_ to fix problems. So, updat
1. Check which version of Dangerzone you are currently using: run Dangerzone, then look for a series of numbers to the right of the logo within the app. The format of the numbers will look similar to `0.4.1`
2. Now find the latest available version of Dangerzone: go to the [download page](https://dangerzone.rocks/#downloads). Look for the version number displayed. The number will be using the same format as in Step 1.
3. Is the version on the Dangerzone download page higher than the version of your installed app? Go ahead and update.

### "I get `invalid json returned from container` on MacOS Big Sur or newer (MacOS 11.x.x or higher)"

Are you using the latest version of Dangerzone? See the FAQ for: "I'm experiencing an issue while using Dangerzone."

You _may_ be attempting to convert a file in a directory to which Docker Desktop does not have access. Dangerzone for Mac requires Docker Desktop for conversion. Docker Desktop, in turn, requires permission from MacOS to access the directory in which your target file is located.

To grant this permission:

1. On MacOS 13, choose Apple menu > System Settings. On lower versions, choose System Preferences.
2. Tap into Privacy & Security in the sidebar. (You may need to scroll down.)
3. In the Privacy section, tap into Files & Folders. (Again, you may need to scroll down.)
4. Scroll to the entry for Docker. Tap the > to expand the entry.
5. Enable the toggle beside the directory where your file is present. For example, if the file to be converted is in the Downloads folder, enable the toggle beside Downloads.

(Full Disk Access permission has a similar effect, but it's enough to give Docker access to _only_ the directory containing the intended file(s) to be converted. Full Disk is unnecessary. As of 2023.04.28, granting one of these permissions continues to be required for successful conversion. Apologies for the extra steps. Dangerzone depends on Docker, and the fix for this issue needs to come from upstream. Read more on [#371](https://github.com/freedomofpress/dangerzone/issues/371#issuecomment-1516863056).)
9 changes: 0 additions & 9 deletions dangerzone/conversion/common.py
@@ -13,15 +13,6 @@ def running_on_qubes() -> bool:
    return os.path.exists("/usr/share/qubes/marker-vm")


def get_tessdata_dir() -> str:
    if os.environ.get("TESSDATA_PREFIX"):
        return os.environ["TESSDATA_PREFIX"]
    elif running_on_qubes():
        return "/usr/share/tesseract/tessdata/"
    else:
        return "/usr/share/tessdata/"


class DangerzoneConverter:
    def __init__(self, progress_callback: Optional[Callable] = None) -> None:
        self.percentage: float = 0.0