Add proxy-DHCP support #73

Open: wants to merge 6 commits into base: master

@rjocoleman rjocoleman commented Oct 10, 2024

This PR adds proxy-DHCP support:

  • What is proxy-DHCP?
    Proxy-DHCP allows a secondary DHCP server to provide boot configuration (such as next-server and boot file) while the primary DHCP server continues to assign IP addresses. This is useful in environments where modifying the primary DHCP server is not feasible, or the primary DHCP server doesn't have a static IP.

  • How proxy-DHCP works
    When a client sends out a DHCP request, the proxy-DHCP service will respond with boot options such as the next-server and boot filename, while leaving the IP address assignment to the primary DHCP server. This allows the client to chainload iPXE without requiring modifications to the existing DHCP server.

  • Pairs well with netboot.xyz PR #953
    This PR works alongside #953, which adds support for proxy-DHCP in the iPXE menus, allowing users to press a key to select the proxy offer and load netboot.xyz from there.

  • How to use it
    Set the DHCP_RANGE_START environment variable to the first IP in your network’s DHCP range. This will enable the optional proxy-DHCP mode. When enabled, dnsmasq calculates the range and handles proxy requests automatically.

    Edit: Ensure the docker container is on the same network e.g. --network host (or ipvlan, macvlan) so that it can receive broadcast DHCP messages and respond with its own broadcasts.

  • Moved dnsmasq config to a file
    To enable this functionality cleanly, the dnsmasq configuration has been moved into a config file, allowing for different config based on the presence of env DHCP_RANGE_START and substitution of some values via envsubst.

  • Proxy-DHCP behaviour
    When DHCP_RANGE_START is set, the provided dnsmasq will behave in proxy-DHCP mode (in addition to tftp), with the following key sections in the configuration:

    # DHCP Proxy range and enable verbose DHCP logging
    dhcp-range=${DHCP_RANGE_START},proxy
    log-dhcp
    leasefile-ro
    
    # Detect iPXE requests via iPXE's DHCP option 175
    dhcp-match=set:ipxe-bios,175,33
    dhcp-match=set:ipxe-efi,175,36
    
    # Serve appropriate bootloaders for non-iPXE clients (initial PXE boot)
    pxe-service=tag:bios,tag:!ipxe-ok,X86PC,"Legacy BIOS",netboot.xyz-undionly.kpxe
    ...

    This configuration sets up the proxy-DHCP to respond only to PXE clients (non-iPXE), serving the appropriate bootloaders for BIOS, UEFI, ARM64, and Raspberry Pi clients, while iPXE clients will be served an HTTP boot script.

  • Dynamic IP handling with envsubst
    The CONTAINER_IP is dynamically injected into the configuration using envsubst, after retrieving the container’s IP address at runtime from the container itself via init.sh. This ensures that the correct container next-server IP is set in the configuration.

  • User experience
    Users can start the container with the relevant environment variables set (DHCP_RANGE_START and optionally others). When a DHCP request is detected, this container sends a proxy offer with the next-server and boot file. With PR #953, netboot.xyz will detect the proxy next-server, allowing users to press p to boot from the proxy-DHCP server.

Because proxy-DHCP mode is only enabled when the new DHCP_RANGE_START env var is set, this should be backwards compatible.
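
As a rough illustration of the behaviour described above (not the actual init.sh from this branch), the startup logic could look something like the sketch below. The template paths and the function names select_dnsmasq_template and render_dnsmasq_config are hypothetical; only DHCP_RANGE_START, CONTAINER_IP and the use of envsubst come from the description.

```shell
#!/bin/sh
# Hypothetical template locations; the real paths in the image may differ.
TEMPLATE_TFTP_ONLY="/defaults/dnsmasq-tftp.conf.template"
TEMPLATE_PROXY_DHCP="/defaults/dnsmasq-proxy.conf.template"

# Print the template to use: proxy-DHCP only when DHCP_RANGE_START is non-empty.
select_dnsmasq_template() {
    if [ -n "${DHCP_RANGE_START:-}" ]; then
        echo "$TEMPLATE_PROXY_DHCP"
    else
        echo "$TEMPLATE_TFTP_ONLY"
    fi
}

# Render the chosen template to the file named in $1. Only the two listed
# variables are substituted, so any other "$" in the config is left untouched.
render_dnsmasq_config() {
    out_file="$1"
    export DHCP_RANGE_START CONTAINER_IP
    envsubst '${DHCP_RANGE_START} ${CONTAINER_IP}' \
        < "$(select_dnsmasq_template)" > "$out_file"
}
```

Leaving DHCP_RANGE_START unset selects the TFTP-only template, which is why existing setups should see no change.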

Docs & resources:
https://www.ipxe.org/appnote/proxydhcp
https://gist.github.com/NiKiZe/5c181471b96ac37a069af0a76688944d
https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html

@roger-

roger- commented Nov 2, 2024

Does this accomplish the same as https://github.com/samdbmg/dhcp-netboot.xyz ?

EDIT

Gave it a test and got this error from docker compose:

process is missing required capability NET_ADMIN

Fixed by adding this to compose.yaml:

    cap_add:
      - NET_ADMIN

However spinning up a proxmox VM gives PXE-E16: No valid offer received so something isn't working.

@rjocoleman
Author

> Does this accomplish the same as https://github.com/samdbmg/dhcp-netboot.xyz ?

Yes and no.

Yes - it offers up proxyDHCP with netboot.xyz files via dnsmasq

No:

  • samdbmg/dhcp-netboot.xyz is based on linuxserver/docker-netbootxyz, this is in netbootxyz/docker-netbootxyz
  • samdbmg/dhcp-netboot.xyz doesn't work with the current linuxserver/docker-netbootxyz due to changes in the underlying docker image structure (my understanding is the linuxserver release on 12.10.22 made it incompatible)
  • samdbmg/dhcp-netboot.xyz supports fewer architectures:
pxe-service=tag:!ipxe-ok,X86PC,PXE,netboot.xyz-undionly.kpxe
pxe-service=tag:!ipxe-ok,BC_EFI,PXE,netboot.xyz.efi
pxe-service=tag:!ipxe-ok,X86-64_EFI,PXE,netboot.xyz.efi

This PR supports more:

# Legacy BIOS (not iPXE)
pxe-service=tag:bios,tag:!ipxe-ok,X86PC,"Legacy BIOS",netboot.xyz-undionly.kpxe
# UEFI 32-bit (not iPXE)
pxe-service=tag:efi32,tag:!ipxe-ok,BC_EFI,"UEFI 32-bit",netboot.xyz.efi
# UEFI 64-bit (not iPXE)
pxe-service=tag:efi64,tag:!ipxe-ok,X86-64_EFI,"UEFI 64-bit",netboot.xyz.efi
# ARM64 UEFI (not iPXE)
pxe-service=tag:arm64-efi,tag:!ipxe-ok,ARM64_EFI,"ARM64 UEFI",netboot.xyz-arm64.efi
# Raspberry Pi Boot (using rpi4 tag, not iPXE)
pxe-service=tag:rpi4,tag:!ipxe-ok,0,"Raspberry Pi Boot",netboot.xyz-rpi4-snp.efi

> EDIT
>
> Gave it a test and got this error from docker compose:
>
> process is missing required capability NET_ADMIN
>
> Fixed by adding this to compose.yaml:
>
>     cap_add:
>       - NET_ADMIN
>
> However spinning up a proxmox VM gives PXE-E16: No valid offer received so something isn't working.

Thanks for the feedback on the cap_add. I hadn't used the example docker compose file and missed it.

I was also missing the required networking configuration.
The error you saw means the proxy-DHCP response never reached your VM, because it was never sent.
ProxyDHCP works by observing broadcast DHCP requests and responding to them, so the docker container needs to be on the network in a way that lets it receive broadcast messages.

The simplest way to do this is network_mode: host in the docker compose yaml (or --network host for the docker cli). ipvlan/macvlan are better choices in some environments, but they're slightly more involved for the user to set up.

I have added a commit that documents these requirements.

For reference my minimal docker-compose.yml based off the example from this repo is:

services:
  netbootxyz:
    build: .
    container_name: netbootxyz
    environment:
      - MENU_VERSION=2.0.82 # optional
      - NGINX_PORT=80 # optional
      - WEB_APP_PORT=3000 # optional
      - DHCP_RANGE_START=192.168.0.1  # optional, enables DHCP Proxy mode. set to your network's DHCP range first IP.
    # volumes:
    #   - /path/to/config:/config # optional
    #   - /path/to/assets:/assets # optional
    ports:  # note: port mappings do not apply with network_mode: host
      - 3000:3000  # optional, destination should match ${WEB_APP_PORT} variable above.
      - 69:69/udp
      - 8080:80  # optional, destination should match ${NGINX_PORT} variable above.
    restart: unless-stopped
    cap_add:
      - NET_ADMIN  # required for DHCP Proxy mode.
    network_mode: host

As long as the machines/VMs are on the same network, proxydhcp responses should be offered up.

@roger-

roger- commented Nov 2, 2024

Thanks, adding network_mode: host helped, but port redirection doesn't work in host mode so I had to disable that and edit the environment variables.

When I boot up the VM now I do see some activity in the docker logs, but nothing happens on the VM side:

netbootxyz  | 2024-11-02 21:38:25,513 INFO success: messages-log entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
netbootxyz  | 2024-11-02 21:38:22 info dnsmasq[18]: started, version 2.90 DNS disabled
netbootxyz  | 2024-11-02 21:38:51 warning dnsmasq[18]: overflow: 4 log entries lost
netbootxyz  | 2024-11-02 21:38:55 warning dnsmasq-dhcp[18]: no address range available for DHCP request via eth0
netbootxyz  | 2024-11-02 21:39:03 warning dnsmasq-dhcp[18]: no address range available for DHCP request via eth0
netbootxyz  | 2024-11-02 21:39:19 warning dnsmasq-dhcp[18]: no address range available for DHCP request via eth0

@rjocoleman
Author

rjocoleman commented Nov 2, 2024

The warning cited means dnsmasq is seeing a DHCP request on a network that the netboot.xyz proxydhcp dnsmasq is not configured to handle.

The most likely cause is that the DHCP_RANGE_START env var in docker compose isn't set to the first IP address in your DHCP range.
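
One way to sanity-check this, sketched in POSIX shell: confirm that DHCP_RANGE_START falls in the same subnet as the address the container's interface actually has. The helper names ip_to_int and same_subnet and the example addresses are illustrative assumptions, not part of this PR.

```shell
# Convert a dotted-quad IPv4 address to an integer.
# The function body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeeds when address $1 and address $2 share a subnet under netmask $3.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

# Example (placeholder addresses): interface IP, DHCP_RANGE_START, netmask
#   same_subnet 192.168.11.50 192.168.0.1 255.255.255.0 || echo "range is on another subnet"
```

With a /24 netmask, 192.168.11.50 and 192.168.0.1 are on different subnets, so a range starting at 192.168.0.1 would not match an interface at 192.168.11.50.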

Note that I retained the project pattern of not overwriting generated config, so if DHCP_RANGE_START is changed you need to delete /config/dnsmasq/dnsmasq.conf (if it's persisted to a volume, etc.) for it to be regenerated with the changed env value.
(I don't like this behaviour, but it is the pattern in this project for files generated in the same way.)
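
A quick check along these lines could flag a stale file. It is only a sketch: the function name dnsmasq_conf_is_stale is made up, and it assumes the generated file contains a line of the form dhcp-range=<start>,proxy as shown in the PR description.

```shell
# Succeeds (exit 0) when the config file at $1 exists and its proxy dhcp-range
# start address differs from the current DHCP_RANGE_START value.
dnsmasq_conf_is_stale() {
    conf_file="$1"
    # Nothing generated yet, so there is nothing stale to delete.
    [ -f "$conf_file" ] || return 1
    current=$(sed -n 's/^dhcp-range=\([^,]*\),proxy.*$/\1/p' "$conf_file" | head -n 1)
    [ "$current" != "${DHCP_RANGE_START:-}" ]
}

# Example:
#   if dnsmasq_conf_is_stale /config/dnsmasq/dnsmasq.conf; then
#       echo "dnsmasq.conf does not match DHCP_RANGE_START; delete it to regenerate" >&2
#   fi
```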

@rjocoleman
Author

> port redirection doesn't work in host mode so I had to disable that and edit the environment variables.

As an aside, this is why macvlan would be preferable in some cases: it's likely you have something on port 80 or 3000 on the host already. macvlan would give the container its own MAC address and functionally let it sit on the network as if it were directly connected (to get its own IP and interact with broadcast messages). It is more involved to set up and environment-specific, so I think it's out of scope for the minimum functional examples presented in the docs here.
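
For anyone who wants to try it anyway, here is a hedged sketch of what a macvlan setup might look like with the docker cli. The function only prints the commands so they can be reviewed first; the function name, the network name pxe_macvlan and every argument value are placeholders that depend on your environment, and the image name must be one you built from this branch.

```shell
# Print (do not run) the docker commands for a macvlan-attached container.
# Arguments: parent interface, subnet (CIDR), gateway, container IP,
#            DHCP_RANGE_START value, image name. All are site-specific.
print_macvlan_commands() {
    parent_if="$1"; subnet="$2"; gateway="$3"
    container_ip="$4"; range_start="$5"; image="$6"
    echo "docker network create -d macvlan --subnet=$subnet --gateway=$gateway -o parent=$parent_if pxe_macvlan"
    echo "docker run -d --name netbootxyz --network pxe_macvlan --ip $container_ip --cap-add NET_ADMIN -e DHCP_RANGE_START=$range_start $image"
}

# Example with placeholder values:
#   print_macvlan_commands eth0 192.168.0.0/24 192.168.0.1 192.168.0.250 192.168.0.1 my-netbootxyz-build
```

Pick a container IP outside the range your primary DHCP server hands out, so the two do not collide.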

@roger-

roger- commented Nov 2, 2024

Thanks, deleting ./config/ and checking the range fixed it. Also had to go to my Proxmox VM BIOS in Device Manager -> Secure Boot Configuration and then delete keys.

@h1ght

h1ght commented Nov 3, 2024

this looks promising

@ColinHebert

@antonym any chance to have this reviewed/merged?

@antonym
Member

antonym commented Dec 8, 2024

Yeah, I looked at it a few weeks ago and it looks good, there were one or two things I wanted to bring up, but they slip my mind for the moment. I'll try and look at it once more and provide some feedback.

@david-sharer

david-sharer commented Jan 14, 2025

This is great stuff. +1 to this. Hoping you get a chance to review this soon.
(Sorry for the wall of text)

In the meantime, I have attempted to get it working in my own setup after passing thru a sea of failed config attempts.
In the process, I have created a couple of docker-compose examples which allow monkeypatching this config into the current images. I intend to submit those as a PR once I get it working suitably well.

In the process, I found a number of things that didn't quite work right or make sense... but those may well be just a result of "being new to DHCP". It ultimately did boot, but not from my LAN (after the initial netboot.xyz image load).

Here's what I found:


The default value of CONTAINER_IP seems incorrect.
It gets a "strange" (incorrect?) IP like 127.0.1.1 on my device.
It looks like it can be overridden from the existing PR tho, so maybe that should be documented.

I assume the command is meant to retrieve something like 192.168.11.1 -- the LAN IP of this machine.
Regardless, it appeared to properly serve up a file and I managed to boot into the menu.

CONTAINER_IP=$(hostname -i) produces 127.0.1.1 on the container.

me@homelab-pi:~/homelab 
$ ./external/start-docker-netbootxyz.sh 
 ✔ Container netbootxyz  Recreated
Attaching to netbootxyz
netbootxyz  | 127.0.1.1 # hostname -i

Inspecting from outside the container yields similar results

me@homelab-pi:~/homelab 
$ hostname -i
127.0.1.1
me@homelab-pi:~/homelab 
$ hostname -I | tr ' ' '\n'
192.168.11.1
172.22.0.1
172.21.0.1
172.17.0.1
172.19.0.1
172.20.0.1
172.23.0.1
172.18.0.1

The dhcp-range config does not allow specifying netmask.
dhcp-range=${DHCP_RANGE_START},proxy
Many other examples I've seen do specify netmask:
dhcp-range=192.168.0.1,proxy,255.255.240.0

Altho the manpage suggests this does not matter:

For directly connected networks (ie, networks on which the machine running dnsmasq has an interface) the netmask is optional: dnsmasq will determine it from the interface configuration.

I don't know enough about this setup to tell either way.


MENU_VERSION=2.0.47 attempted to boot everything from a local IP that is not serving up files.
Naturally, this did not work.

I know that version is quite old. But it was in the sample file and uncommented.
It should probably just be commented out to make it clearer that it's a placeholder that should be avoided.

(Not specifying MENU_VERSION pulled 2.0.84, which booted successfully into a Mint Live CD)


The machine I am attempting to PXE boot really wants some files from 192.168.0.1.
192.168.0.1 is my router, while the image was loaded off 192.168.11.1
This causes the process to sit around for a long time, waiting for responses that will never come.

I could find no resolution w/in dnsmasq configs.
It appears to be the result of an interaction between many things...

  • The netboot.xyz.j2 startup script's structure
    • Appears to load autoexec.ipxe from correct source, but doesn't execute it? (meaning I can't customize the rest of this from w/in the iPXE context)
    • re-initializes next-server with dhcp command, which produces a confused var configuration
    • attempts to get local-vars.ipxe first from next-server, then timeout-defaults to bothering next-server again instead of proxydhcp/next-server
  • ipxe's dhcp command resulting in odd configuration (the configuration is as-expected before then. and autoboot works)
  • My router emitting next-server (when it shouldn't be?)
  • dnsmasq stripping dhcp-option=encap:175,1,42b (priority) when in proxy mode, eliminating an easy fix for all this by providing increased priority to the proxy values

Quick fixes I can see are either of these:

  • Boot to a direct ipxe.efi build, dumped into the directory as part of setup. Use autoexec.ipxe to initialize + chainload
  • Use ISC DHCP (now EOL?) in a sibling container, since it supports sending any arbitrary DHCP payload
    • Which ought to allow sending encapsulated priority option (which the iPXE dhcp command seems to respect)

But a "proper" resolution is probably going to be resolving something from that first list.
I can dump my more-complete investigation... but this is already wall-of-texty enough.

@rjocoleman
Author

I have resolved the merge conflict in this branch.

It should be noted that a potential major source of confusion with this container is https://github.com/netbootxyz/docker-netbootxyz/blob/master/root/init.sh#L13 - this pattern (which I have copied in this branch for dnsmasq) checks whether nginx.conf (and in my branch dnsmasq.conf) exists, and if it does, the file is not overwritten.

In my opinion that is a trap for users: those files are generated from ENV variables that can be changed at container runtime, but the files are only generated if they do not already exist. This means users have to manually delete the files, or the volume that houses them, to get a refreshed config that matches the ENV the container was invoked with.

I'd suggest this is non-obvious UX and could lead users down unexpected paths when debugging, assuming that changing the ENV was enough. I think having the files generated with the current ENV values on each run would be preferable, but I didn't design the container and don't understand the use-case that design decision was intended to address!

@david-sharer It's difficult to follow and understand your situation (the dump of info is fine but it's not the same as being at the console and able to poke it), however:

> The default value of CONTAINER_IP seems incorrect.

Good catch, hostname -i seems overly simplistic in some environments. I was able to reproduce this and have changed the way it looks for the container IP.
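
For illustration only (this is not necessarily what the updated commit does), one simple approach is to take a list of candidate addresses and skip anything in 127.0.0.0/8. The function name is hypothetical, and hostname -I, used on the host earlier in this thread, may not exist in BusyBox-based images, so the source of the address list is an assumption too.

```shell
# Print the first address in the space-separated list $1 that is neither a
# 127.x.x.x loopback address nor IPv6. Returns non-zero if none is found.
first_non_loopback_ipv4() {
    for addr in $1; do
        case "$addr" in
            127.*|*:*) continue ;;
            *) echo "$addr"; return 0 ;;
        esac
    done
    return 1
}

# Possible use, keeping any value the user already set:
#   CONTAINER_IP=${CONTAINER_IP:-$(first_non_loopback_ipv4 "$(hostname -I)")}
```

The order of the list is not guaranteed to put the LAN address first on hosts with several bridges, so being able to override CONTAINER_IP explicitly remains useful.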

> The dhcp-range config does not allow specifying netmask.

Per the cited man page this isn't an issue, it certainly hasn't been a problem in my testing or usage (dnsmasq seems to do a fine job of working it out in my varied environments)

> MENU_VERSION=2.0.47 attempted to boot everything from a local IP that is not serving up files.

I don't know anything about this, the value was in that file already but I have bumped the example MENU_VERSION in this branch.

> The machine I am attempting to PXE boot really wants some files from 192.168.0.1.

My memory is hazy on this. I do think it's something around autoexec.ipxe or netbootxyz/netboot.xyz#953 meaning that p has to be pushed to use the proxydhcp response. This is currently a hard requirement.

However 192.168.0.1 is coming from somewhere, indicating that someone in your network is responding with that and likely a cause of your issues at this point.

If next-server is being emitted by another DHCP server it could confuse things, I'd be looking at the DHCP traffic in verbose mode from dhclient or tcpdump'ing dhcp. Resolving this would eliminate your need for priority and the other proposed workarounds - I don't think that complexity is needed in a generic implementation.

I'm not keen to change the flow of things around autoexec.ipxe, to avoid the p keypress, as that changes this container more significantly and diverges from how the rest of netboot.xyz works.

@david-sharer

Wow that was way faster of a reply than I was expecting.

> > The dhcp-range config does not allow specifying netmask.
>
> Per the cited man page this isn't an issue, it certainly hasn't been a problem in my testing or usage (dnsmasq seems to do a fine job of working it out in my varied environments)

Excellent. Just making sure.
(I don't have enough context in this space to know how to separate relevant-differences from irrelevant-differences yet)


> > MENU_VERSION=2.0.47 attempted to boot everything from a local IP that is not serving up files.
>
> I don't know anything about this, the value was in that file already but I have bumped the example MENU_VERSION in this branch.

Yeah I think it's just very old behavior. Just calling it out. Thanks for updating the example!


> > The machine I am attempting to PXE boot really wants some files from 192.168.0.1.
>
> My memory is hazy on this. I do think it's something around autoexec.ipxe or netbootxyz/netboot.xyz#953 meaning that p has to be pushed to use the proxydhcp response. This is currently a hard requirement.

Yes. Even in the latest netboot.xyz.j2 it does the same.
I think this is not anything to change for your PR but rather a separate issue, but I'm not entirely sure where to take it.

I think the potential fault in netboot.xyz.j2 is that next-server is set by the primary DHCP server, but it did not provide filename (I thought that was required?) and so next-server is not really a tftp server.

Immediate (naive) thoughts on that:

  1. After the first tftp://${next-server}/local-vars.ipxe request fails, I expect it should not be re-attempted
  2. autoexec.ipxe appears to be loaded in the console, but not executed? And so I was unable to (easily) set the suggested flag use_proxydhcp_settings. Tho I tested this by throwing a few prompt commands in autoexec.ipxe, and perhaps those are somehow disabled by netboot.xyz's startup? I can investigate further if you think this is the route.

> However 192.168.0.1 is coming from somewhere, indicating that someone in your network is responding with that and likely a cause of your issues at this point.
>
> If next-server is being emitted by another DHCP server it could confuse things, I'd be looking at the DHCP traffic in verbose mode from dhclient or tcpdump'ing dhcp. Resolving this would eliminate your need for priority and the other proposed workarounds - I don't think that complexity is needed in a generic implementation.

I have captured these packets previously thru tcpdump -i eth0 port 67 or port 68 -e -n -vvv -s 0 -w ./logs/tcpdump-$(date '+%Y.%m.%d.%H%M').$1.pcap -U --immediate-mode.

It did appear that the router was producing next-server values (screenshots at end).
I saw no option that looked like it controlled that.
Perhaps the router's firmware is doing things incorrectly? It is unclear to me. (nvm, it's definitely wrong)
But the end result is that this proxydhcp setup did not work smoothly (i.e. without pressing p on startup).
(For reference, I am using Omada ER605 v2.0 w/ firmware version 2.2.6, which I believe is latest)
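
For anyone wanting to repeat the check without opening the capture in a GUI, a small filter like the sketch below lists which next-server (siaddr) values appear. It assumes tcpdump's verbose BOOTP/DHCP text output prints that field on a line starting with Server-IP; the function name and the capture file name are placeholders.

```shell
# Read tcpdump -vvv text output on stdin and print each distinct
# "Server-IP" (siaddr / next-server) value that appears in it.
list_next_servers() {
    awk '$1 == "Server-IP" { print $2 }' | sort -u
}

# Example (capture.pcap is a placeholder file name):
#   tcpdump -n -vvv -r capture.pcap | list_next_servers
```

More than one address in the output would suggest that something other than the proxy-DHCP container is also advertising a next-server.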

> I'm not keen to change the flow of things around autoexec.ipxe, to avoid the p keypress, as that changes this container more significantly and diverges from how the rest of netboot.xyz works.

Yes. I don't think it's the responsibility of this PR or of the docker-netbootxyz repo's setup.
Tho I found no combination of router config, dnsmasq config, or file config of the tftp server to work.

I think ultimately this would have to be resolved upstream

  • perhaps in dnsmasq (to allow sending encap:175 options) ?
  • perhaps in Omada's router firmware ? (if it is indeed doing the incorrect thing) jk it seems it is
  • perhaps in iPXE's dhcp command (assuming it's unexpected for dhcp to produce different results than were available on startup) ?
  • perhaps in netboot.xyz.j2, as a follow-up to that PR you linked
    • to respect autoexec.ipxe options ?
    • to work around lying(?) DHCP servers ?
      • thru some sort of "liveness check" against tftp://${next-server}/... before bothering it a bunch?
      • thru automatic failover to proxydhcp ?
      • thru skipping next-server if filename was not also provided (...somehow..?)

Here's screenshots from the network dump, tho I don't expect you to do much with this.

[Screenshots from the packet capture, arranged in three columns: Request, DHCP, Proxy DHCP]
