Autoconfigure network without DHCP #64

Open: wants to merge 1 commit into base: main

slp
Collaborator

@slp commented Sep 26, 2024

Passt includes a DHCP server to allow guests to easily configure the network without being aware passt is on the other side. But we are aware of that, so we can take advantage of it.

Instead of using DHCP, read the configuration output from passt, marshall it into some environment variables, pass it to the guest, and have krun-guest read it and apply it using rtnetlink.

By doing that, we cut the startup time by more than half, from...

$ time krun /bin/false
(...)
real 0m4,301s

... to ...

$ time krun /bin/false
(...)
real 0m1,966s
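
The marshalling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `NetConfig` struct and the `EXAMPLE_NET_*` variable names are hypothetical and are not taken from the patch. The host side turns the values scraped from passt into key/value pairs, and the guest side rebuilds them, refusing anything missing or malformed.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Network settings as they might be scraped from passt's output (hypothetical shape).
#[derive(Debug, PartialEq)]
struct NetConfig {
    address: Ipv4Addr,
    prefix_len: u8,
    router: Ipv4Addr,
}

// Hypothetical variable names, chosen for this sketch only.
const ENV_ADDR: &str = "EXAMPLE_NET_ADDRESS";
const ENV_PREFIX: &str = "EXAMPLE_NET_PREFIX_LEN";
const ENV_ROUTER: &str = "EXAMPLE_NET_ROUTER";

impl NetConfig {
    /// Host side: turn the configuration into key/value pairs for the guest environment.
    fn to_env(&self) -> Vec<(String, String)> {
        vec![
            (ENV_ADDR.to_string(), self.address.to_string()),
            (ENV_PREFIX.to_string(), self.prefix_len.to_string()),
            (ENV_ROUTER.to_string(), self.router.to_string()),
        ]
    }

    /// Guest side: rebuild the configuration, or return None if a value
    /// is missing, does not parse, or the prefix length exceeds 32.
    fn from_env(env: &HashMap<String, String>) -> Option<Self> {
        Some(NetConfig {
            address: env.get(ENV_ADDR)?.parse().ok()?,
            prefix_len: env
                .get(ENV_PREFIX)?
                .parse::<u8>()
                .ok()
                .filter(|p| *p <= 32)?,
            router: env.get(ENV_ROUTER)?.parse().ok()?,
        })
    }
}
```

Passing the map explicitly, instead of reading the process environment inside `from_env`, keeps the parsing testable and independent of any `env::set_var` ordering concerns.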

crates/krun/Cargo.toml (outdated, resolved)
@@ -15,7 +15,8 @@ use krun::utils::env::find_in_path;
use log::debug;
use rustix::process::{getrlimit, setrlimit, Resource};

-fn main() -> Result<()> {
+#[tokio::main]
Collaborator

@teohhanhui Sep 26, 2024

If we go multi-threaded, some of our safety assumptions would no longer hold, e.g.

https://github.com/slp/krun/blob/44417f6e802a934f9ca164ace647f99da009546d/crates/krun/src/guest/user.rs#L21-L23

Collaborator Author

Using tokio is a requirement of the rtnetlink crate. I think we can avoid the safety issues with something like this:

    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        if let Err(err) = configure_network().await {
            eprintln!("Failed to configure network, continuing without it: {err}");
        }
    });

Collaborator

@teohhanhui Sep 27, 2024

Could we use tokio's current-thread scheduler?

Apparently it could be selected like this:

#[tokio::main(flavor = "current_thread")]

But it looks like we will still need to make sure nothing uses spawn_blocking: https://users.rust-lang.org/t/why-does-tokios-current-thread-flavor-not-be-single-threaded/85129

How? Probably by using Handle::block_on? But hmm... There's not really any point in choosing the current-thread scheduler then? lol

Oh, apparently this can't work:

> When this is used on a current_thread runtime, only the Runtime::block_on method can drive the IO and timer drivers, but the Handle::block_on method cannot drive them. This means that, when using this method on a current_thread runtime, anything that relies on IO or timers will not work unless there is another thread currently calling Runtime::block_on on the same runtime.

Collaborator Author

If we aren't planning on using tokio anywhere else, I think I'd feel more confident isolating its use to configure_network.

@slp force-pushed the avoid-dhcp branch 2 times, most recently from e79250b to 2cf1578 on September 27, 2024 15:44
@@ -42,7 +42,12 @@ fn main() -> Result<()> {

setup_fex()?;

-    configure_network()?;
+    let rt = tokio::runtime::Runtime::new().unwrap();
+    rt.block_on(async {
Collaborator

If I understand the docs correctly, this will NOT have the desired effect of ensuring the safety of the env::set_var calls.

> This runs the given future on the current thread, blocking until it is complete, and yielding its resolved result. Any tasks or timers which the future spawns internally will be executed on the runtime.
>
> When the multi thread scheduler is used this will allow futures to run within the io driver and timer context of the overall runtime.
>
> Any spawned tasks will continue running after block_on returns.

https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#method.block_on

Collaborator Author

I think you're right. Another approach could be, since I don't think any of the other tasks krun-guest puts in motion depends on the network, simply move the network configuration to krun-server which already uses tokio and doesn't use env::set_var.

}
handle.route().add().v4().gateway(router).execute().await?;
fs::write("/etc/resolv.conf", format!("nameserver {}", router))
Collaborator

You are assuming that the default gateway and the dns server are the same. While true on most residential networks, it is not a correct assumption in general, and even there, some of us very intentionally override the ISP dns server to something else.

Collaborator

@teohhanhui Oct 17, 2024

This is inside the VM, so passt is doing its thing. In our current usage (dhclient), /etc/resolv.conf inside the VM would contain a nameserver entry pointing to the host's default gateway...

https://man.archlinux.org/man/passt.1.en#Handling_of_traffic_with_local_destination_and_source_addresses

So as far as I can tell, the code here is correct.

Collaborator

> /etc/resolv.conf inside the VM would contain a nameserver entry pointing to the host's default gateway.

That is the problem I am pointing out: in general, nameserver != default gateway.

Collaborator

More correct explanation here: #17 (comment)

Collaborator

@WhatAmISupposedToPutHere Uhh, but I think I get what you mean now... The code here is only correct in the case of passt doing such DNS forwarding... If the resolver on the host is not a loopback address, this code is wrong.

Collaborator Author

Good catch. I've updated the PR to actually pick up all the DNS configuration from passt, which was the right thing to do.
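
As a rough illustration of the point settled in this thread (resolvers come from the configuration passt reports, not from the default gateway), a minimal sketch of rendering /etc/resolv.conf could look like this. The function name and its parameters are assumptions made for the example, not the code in the PR.

```rust
use std::net::IpAddr;

/// Render resolv.conf contents from explicit resolver and search-domain lists.
/// The router address is deliberately not an input: it is not assumed to be a resolver.
fn render_resolv_conf(nameservers: &[IpAddr], search: &[&str]) -> String {
    let mut out = String::new();
    if !search.is_empty() {
        out.push_str("search ");
        out.push_str(&search.join(" "));
        out.push('\n');
    }
    for ns in nameservers {
        out.push_str(&format!("nameserver {ns}\n"));
    }
    out
}
```

The result can then be written with `fs::write("/etc/resolv.conf", ...)` as in the hunk under review.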

Passt includes a DHCP server to allow guests to easily configure the
network without being aware passt is on the other side. But we _are_
aware of that, so we can take advantage of it.

Instead of using DHCP, read the configuration output from passt,
marshall it into some environment variables, pass it to the guest, and
have krun-guest read it and apply it using rtnetlink.

By doing that, we cut the startup time by more than half, from...

$ time krun /bin/false
(...)
real    0m4,301s

... to ...

$ time krun /bin/false
(...)
real    0m1,966s

In addition to reducing the boot time, this will potentially prevent
some of the DHCP issues we've seen in the past.

Signed-off-by: Sergio Lopez <[email protected]>
@sbrivio-rh
Contributor

sbrivio-rh commented Oct 22, 2024

Even though we're unlikely to break the output scraping you're adding here, I wonder if we could avoid that by carrying a minimal script for the ISC's dhclient. I guess dhcpcd won't cause any significant delay as it doesn't do all the checks that the default dhclient script on Fedora does.

In passt's test suite, we need to bring up guests fast, and at the same time check that DHCP works, so we use this script in the guest.

By the way, I wonder if you need to make this fast for IPv6 as well. There, you could disable neighbour solicitations in the guest, configure the address via DHCPv6 (or via NDP and SLAAC, bringing up the interface) with the nodad attribute, and then re-enable neighbour solicitations (full details at the bottom of this email).

Side note: I don't know if it works for you, but passt doesn't actually care about what address and route you're using. If the address is not the same you have on the host, it will just NAT things. You could happily use link-local addresses (even for IPv4, say, 169.254.1.1). But sure, you'll need to convey DNS information somehow.

Anyway, regardless of all this, the current patch looks fine to me.

@sbrivio-rh
Contributor

> But sure, you'll need to convey DNS information somehow.

Actually, even for that, you could hardcode things in a way similar to what Podman does with pasta(1): a single link-local address in /etc/resolv.conf in the container (here, guest) and tell passt to map that (--dns-forward) to whatever resolver you have on the host. On the other hand:

> You could happily use link-local addresses (even for IPv4, say, 169.254.1.1)

this way, you would lose network transparency (making the guest look like the host in terms of addresses and routes), which might be important for some applications. I wanted to try out the tricks I suggested above to make DHCP fast, but I'm currently hitting:

   Compiling devices v0.1.0 (/home/sbrivio/libkrun/src/devices)
error: couldn't read `src/devices/src/virtio/fs/linux/../../../../../../init/init`: No such file or directory (os error 2)
  --> src/devices/src/virtio/fs/linux/passthrough.rs:33:29
   |
33 | static INIT_BINARY: &[u8] = include_bytes!("../../../../../../init/init");
   |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: this error originates in the macro `include_bytes` (in Nightly builds, run with -Z macro-backtrace for more info)

error: could not compile `devices` (lib) due to 1 previous error

while trying to build libkrun. I might get back to it in a few days unless there's an obvious solution.

@sbrivio-rh
Contributor

From r/Showerthoughts: netlink over vsock

@slp
Collaborator Author

slp commented Oct 24, 2024

> this way, you would lose network transparency (making the guest look like the host in terms of addresses and routes), which might be important for some applications. I wanted to try out the tricks I suggested above to make DHCP fast, but I'm currently hitting:
>
>    Compiling devices v0.1.0 (/home/sbrivio/libkrun/src/devices)
> error: couldn't read `src/devices/src/virtio/fs/linux/../../../../../../init/init`: No such file or directory (os error 2)
>   --> src/devices/src/virtio/fs/linux/passthrough.rs:33:29
>    |
> 33 | static INIT_BINARY: &[u8] = include_bytes!("../../../../../../init/init");
>    |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    |
>    = note: this error originates in the macro `include_bytes` (in Nightly builds, run with -Z macro-backtrace for more info)
>
> error: could not compile `devices` (lib) due to 1 previous error
>
> while trying to build libkrun. I might get back to it in a few days unless there's an obvious solution.

You're probably running cargo build ... instead of make. Run make init/init once, and then you can rely solely on cargo.

@sbrivio-rh
Contributor

> You're probably running cargo build ... instead of make.

Oops, of course.

> Run make init/init once, and then you can rely solely on cargo.

Right, it works, thanks! I'll try things out in a bit.

@sbrivio-rh
Contributor

> Right, it works, thanks! I'll try things out in a bit.

...currently driving me mad (yes, I have libkrun and libkrunfw installed):

  = note: /usr/bin/ld: /home/sbrivio/muvm/target/debug/deps/muvm-dfc2b43586ae1702.73tss62c465d1mti0x3tsc1b1.rcgu.o: in function `muvm::add_ro_disk':
          /home/sbrivio/muvm/crates/muvm/src/bin/muvm.rs:51:(.text._ZN4muvm11add_ro_disk17hc95bc1821f59bb8eE+0x362): undefined reference to `krun_add_disk'
          collect2: error: ld returned 1 exit status
          
  = note: some `extern` functions couldn't be found; some native libraries may need to be installed or have their path specified
  = note: use the `-l` flag to specify native libraries to link
  = note: use the `cargo:rustc-link-lib` directive to specify the native libraries to link with Cargo (see https://doc.rust-lang.org/cargo/reference/build-scripts.html#rustc-link-lib)

error: could not compile `muvm` (bin "muvm") due to 1 previous error

@teohhanhui
Collaborator

teohhanhui commented Oct 25, 2024

@sbrivio-rh I think it's a cache problem. Try doing cargo clean and try again?

EDIT: Ohhhhh... #84 (comment)

@sbrivio-rh
Contributor

> EDIT: Ohhhhh... #84 (comment)

Thanks, I would have never found that!

@sbrivio-rh
Contributor

sbrivio-rh commented Oct 25, 2024

Almost there...

NET=1, of course...

```
$ strace -f ./muvm true

[...]

[pid 4013940] execve("/usr/local/bin/passt", ["passt", "-q", "-f", "-t", "3334:3334", "--fd", "6"], 0x7ffd61f5fc80 /* 23 vars */) = -1 ENOENT (No such file or directory)
[pid 4013940] execve("/usr/bin/passt", ["passt", "-q", "-f", "-t", "3334:3334", "--fd", "6"], 0x7ffd61f5fc80 /* 23 vars */ <unfinished ...>
[pid 4013939] <... clone3 resumed>) = 4013940
[pid 4013939] munmap(0x7f1bfb842000, 36864) = 0
[pid 4013939] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 4013939] fcntl(3, F_GETFD <unfinished ...>
[pid 4013940] <... execve resumed>) = 0
[pid 4013939] <... fcntl resumed>) = 0x1 (flags FD_CLOEXEC)
[pid 4013939] close(3) = 0
[pid 4013939] write(2, "Error: ", 7 <unfinished ...>
Error: [pid 4013940] brk(NULL <unfinished ...>
[pid 4013939] <... write resumed>) = 7
[pid 4013940] <... brk resumed>) = 0x563b6c492000
[pid 4013939] write(2, "Failed to configure net mode", 28Failed to configure net mode) = 28
[pid 4013939] write(2, "\n\nCaused by:", 12 <unfinished ...>

Caused by:[pid 4013940] mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 4013939] <... write resumed>) = 12
[pid 4013940] <... mmap resumed>) = 0x7f1605c01000
[pid 4013939] write(2, "\n", 1
) = 1
[pid 4013940] access("/etc/ld.so.preload", R_OK <unfinished ...>
[pid 4013939] write(2, "    ", 4 <unfinished ...>
[pid 4013940] <... access resumed>) = -1 ENOENT (No such file or directory)
[pid 4013939] <... write resumed>) = 4
[pid 4013939] write(2, "Operation not supported", 23 <unfinished ...>
Operation not supported[pid 4013940] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 4013939] <... write resumed>) = 23
[pid 4013939] write(2, " (os error ", 11 <unfinished ...>
(os error [pid 4013940] <... openat resumed>) = 3
[pid 4013939] <... write resumed>) = 11
[pid 4013940] fstat(3, <unfinished ...>
[pid 4013939] write(2, "95", 295 <unfinished ...>
```

...but I can't see anything returning -95 / `EOPNOTSUPP`...

@sbrivio-rh
Contributor

So, at least on my setup, this whole delay is actually caused by dhcpcd(8):

$ ./passt -f -p /tmp/muvm.pcap
$ time ./muvm --passt-socket=/tmp/passt_3.socket /bin/false 2>/dev/null

real	0m6.445s
user	0m0.916s
sys	0m0.549s
$ tshark -r /tmp/muvm.pcap 
    1   0.000000      0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x356a31d7
    2   0.000148 88.198.0.161 → 88.198.0.164 DHCP 342 DHCP Offer    - Transaction ID 0x356a31d7
    3   0.000513      0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Request  - Transaction ID 0x356a31d7
    4   0.000542 88.198.0.161 → 88.198.0.164 DHCP 342 DHCP ACK      - Transaction ID 0x356a31d7
    5   0.023915 5a:94:ef:e4:0c:ee → Broadcast    ARP 42 Who has 88.198.0.164? (ARP Probe)
    6   1.448584 5a:94:ef:e4:0c:ee → Broadcast    ARP 42 Who has 88.198.0.164? (ARP Probe)
    7   2.828025 5a:94:ef:e4:0c:ee → Broadcast    ARP 42 Who has 88.198.0.164? (ARP Probe)
    8   4.830970 5a:94:ef:e4:0c:ee → Broadcast    ARP 42 ARP Announcement for 88.198.0.164

because, by default (without --noarp), it ARP-probes the address we just assigned. If I do this:

diff --git a/crates/muvm/src/guest/net.rs b/crates/muvm/src/guest/net.rs
index b47f2e1..aa74ce0 100644
--- a/crates/muvm/src/guest/net.rs
+++ b/crates/muvm/src/guest/net.rs
@@ -41,7 +41,7 @@ pub fn configure_network() -> Result<()> {
     };
     if let Some(dhcpcd_path) = dhcpcd_path {
         let output = Command::new(dhcpcd_path)
-            .args(["-M", "--nodev", "eth0"])
+            .args(["-M", "--nodev", "eth0", "--noarp"])
             .output()
             .context("Failed to execute `dhcpcd` as child process")?;
         debug!(output:?; "dhcpcd output");

I get:

$ time ./muvm --passt-socket=/tmp/passt_3.socket /bin/false 2>/dev/null

real	0m1.404s
user	0m0.899s
sys	0m0.583s

For some reason I couldn't get IPv6 working in the guest so I didn't try that (at least, not yet).

@sbrivio-rh
Contributor

> For some reason I couldn't get IPv6 working in the guest so I didn't try that (at least, not yet).

Oh, it's simply disabled in the kernel configuration from libkrunfw (config-libkrunfw_x86_64):

# CONFIG_IPV6 is not set

and this is there starting from the very beginning, libkrunfw commit 443c03426a73. Is there a specific reason for it? Maybe because TSI didn't/doesn't support that? Or you don't need it at all in libkrun/muvm? Does it make sense that I spend a moment trying to "fix" that? @slp?

@slp
Collaborator Author

slp commented Oct 28, 2024

> and this is there starting from the very beginning, libkrunfw commit 443c03426a73. Is there a specific reason for it? Maybe because TSI didn't/doesn't support that? Or you don't need it at all in libkrun/muvm? Does it make sense that I spend a moment trying to "fix" that? @slp?

That was disabled as part of the work to slim down the guest kernel. I'm okay with enabling it if there's a user demand for the feature, but so far I haven't heard of anyone requesting it.

@teohhanhui
Collaborator

It just feels ethically wrong to be contributing to holding back IPv6 adoption. 🙈

@asahilina
Member

Oof, libkrunfw doesn't have CONFIG_IPV6? Yeah, we really should fix that... even if things work with IPv4 only for most people, it's not sustainable to not have any kind of v6 networking within the VM. Even if v4 only works, IPv4 infrastructure is increasingly going through CGNAT while IPv6 infrastructure is not, and that means IPv6 is often faster where it is available (that is the case for me most of the time at home).

@sbrivio-rh
Contributor

> That was disabled as part of the work to slim down the guest kernel. I'm okay with enabling it if there's a user demand for the feature, but so far I haven't heard of anyone requesting it.

There's one problem with that and passt: if passt doesn't find an IPv4 interface on the host, it disables IPv4, and only smoke signals remain at that point.

This is by default at least: you can still enable IPv4 even on an IPv6-only host, but then you need to give explicit addresses and routes.

I can try to enable CONFIG_IPV6 in libkrunfw and see what happens. I'll give it a try at some point this week.

@sbrivio-rh
Contributor

sbrivio-rh commented Nov 15, 2024

> I can try to enable CONFIG_IPV6 in libkrunfw and see what happens. I'll give it a try at some point this week.

I meant this week. IPv6 enabled in libkrunfw and everything, but at this point I would really need to get a shell in muvm to check addresses and routes, and I actually never got that far:

$ ./muvm --passt-socket=/tmp/passt_1.socket /bin/false
Using default interface naming scheme 'v255'.
/etc/udev/rules.d/60-scheduler.rules:1 The line has no effect, ignoring.
Error: Failed to set up user, bailing out

Caused by:
    0: Failed to read directory `/dev/dri`
    1: No such file or directory (os error 2)

I don't have DRI/DRM support on this system (I don't have a graphical environment or a display adapter either). How do I work around that? Never mind, I just had to comment out that part of setup_directories(), and I finally got a shell. Now looking into IPv6 support via passt.

@asahilina
Member

asahilina commented Nov 15, 2024

That's a bug... let me fix it.

Edit: Actually, no, I think that means you built libkrun incorrectly. Regardless of the GPU/audio support in the host, libkrun should be configured with GPU and audio support, which would create those devices. Either way, I opened #106 to make muvm robust against this.

@sbrivio-rh
Contributor

> Edit: Actually, no, I think that means you built libkrun incorrectly. Regardless of the GPU/audio support in the host, libkrun should be configured with GPU and audio support, which would create those devices. Either way, I opened #106 to make muvm robust against this.

Oh, I see, I built it with BLK=1 NET=1 make, without SND=1 GPU=1. On the other hand, I don't care about audio (no audio card) or video (no display adapter) for my "usage", so I guess yes, muvm should detect that or something.

@sbrivio-rh
Contributor

> Now looking into IPv6 support via passt.

Yeah, it works almost out of the box, with a simple CONFIG_IPV6=y in libkrunfw. I just need to bring up the link before dhcpcd starts. Now trying with a few combinations (also with dhclient) and checking timing...

@slp
Collaborator Author

slp commented Nov 15, 2024

FWIW, the latest libkrunfw (4.5.1) has CONFIG_IPV6 enabled.

@sbrivio-rh
Contributor

> FWIW, the latest libkrunfw (4.5.1) has CONFIG_IPV6 enabled.

Yes, I noticed, thanks. If I simply:

diff --git a/crates/muvm/src/guest/net.rs b/crates/muvm/src/guest/net.rs
index b47f2e1..96c4689 100644
--- a/crates/muvm/src/guest/net.rs
+++ b/crates/muvm/src/guest/net.rs
@@ -33,6 +33,8 @@ pub fn configure_network() -> Result<()> {
         sethostname(hostname.as_bytes()).context("Failed to set hostname")?;
     }
 
+    let output = Command::new("/sbin/ip").args(["link", "set", "dev", "eth0", "up"]).output();
+
     let dhcpcd_path = find_in_path("dhcpcd").context("Failed to check existence of `dhcpcd`")?;
     let dhcpcd_path = if let Some(dhcpcd_path) = dhcpcd_path {
         Some(dhcpcd_path)

...I get IPv6 addresses and routes now, but we need to disable DAD on the address (which means we need to disable neighbour advertisements on the interface, bring it up, set IFA_F_NODAD on the address, re-enable advertisements), otherwise we add two seconds there. That needs a few netlink messages, doing everything with ip(8) takes ages.

For IPv4, I'm trying out DHCP rapid commit (RFC 4039) instead.

This might take a bit, but I think we can have IPv4 and IPv6 connectivity set up fast (a couple of milliseconds) and relatively cleanly.
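
For readers unfamiliar with Rapid Commit, here is a minimal sketch of what such a DHCPDISCOVER looks like on the wire, built programmatically. The function name and the `xid`/`mac` parameters are assumptions for illustration. The layout follows the fixed BOOTP header (236 bytes), the DHCP magic cookie, then options 53 (message type, DISCOVER) and 80 (Rapid Commit, zero length), padded to 300 bytes.

```rust
/// Minimum BOOTP message size used here: 236-byte header plus 64 bytes of options.
const BOOTP_LEN: usize = 300;
/// DHCP magic cookie that precedes the options.
const MAGIC_COOKIE: [u8; 4] = [0x63, 0x82, 0x53, 0x63];

/// Build a broadcast DHCPDISCOVER carrying the Rapid Commit option.
/// `xid` and `mac` are caller-chosen values (hypothetical parameters).
fn build_discover(xid: u32, mac: [u8; 6]) -> [u8; BOOTP_LEN] {
    let mut pkt = [0u8; BOOTP_LEN];
    pkt[0] = 1; // op: BOOTREQUEST
    pkt[1] = 1; // htype: Ethernet
    pkt[2] = 6; // hlen: MAC address length
    // pkt[3] (hops) stays 0
    pkt[4..8].copy_from_slice(&xid.to_be_bytes()); // transaction ID
    // pkt[8..10] (secs) stays 0
    pkt[10] = 0x80; // flags: broadcast bit set
    // ciaddr, yiaddr, siaddr, giaddr (bytes 12..28) stay all-zero
    pkt[28..34].copy_from_slice(&mac); // chaddr, remaining 10 bytes zero
    // sname (64 bytes) and file (128 bytes) stay zero up to offset 236
    pkt[236..240].copy_from_slice(&MAGIC_COOKIE);
    let options: [u8; 6] = [
        53, 1, 1, // option 53, length 1: DHCPDISCOVER
        80, 0, // option 80, length 0: Rapid Commit
        255, // end of options
    ];
    pkt[240..240 + options.len()].copy_from_slice(&options);
    pkt
}
```

A server that supports Rapid Commit may answer such a DISCOVER directly with an ACK, which saves one round trip compared to the usual four-message exchange.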

@sbrivio-rh
Contributor

> This might take a bit, but I think we can have IPv4 and IPv6 connectivity set up fast (a couple of milliseconds) and relatively cleanly.

I guess I can claim success:

$ time ./target/debug/muvm --passt-socket=/tmp/passt_3.socket -- /bin/false
DHCP client took 340.27µs
Network configuration took 1.5503ms
Using default interface naming scheme 'v255'.
"/bin/false" process exited with status code: 1

real	0m1.196s
user	0m0.747s
sys	0m0.502s

with both IPv4 (DHCP) and IPv6 (SLAAC) up and running, with the net.rs below plus some changes in passt implementing option 80 and honouring the "broadcast" DHCP flag (not merged yet, but it shouldn't take long before we get it in packages).
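
As a hedged illustration of the reply handling this relies on, the sketch below walks the options area of a DHCP reply (the bytes after the magic cookie) and extracts the subnet mask, the first router, and whether Rapid Commit was echoed. The `Lease` struct and function names are assumptions for the example. Options are code/length/value triples, except Pad (0) and End (255), which are single bytes; truncated options end the walk instead of indexing out of bounds.

```rust
use std::net::Ipv4Addr;

/// Values of interest from a DHCP reply (hypothetical type for this sketch).
#[derive(Debug, Default, PartialEq)]
struct Lease {
    netmask: Option<Ipv4Addr>,
    router: Option<Ipv4Addr>,
    rapid_commit: bool,
}

/// Parse the options area of a DHCP message, starting right after the magic cookie.
fn parse_options(opts: &[u8]) -> Lease {
    let mut lease = Lease::default();
    let mut p = 0;
    while p < opts.len() {
        let code = opts[p];
        match code {
            0 => {
                p += 1; // Pad: single byte, no length field
                continue;
            }
            255 => break, // End of options
            _ => {}
        }
        // Stop on a truncated option instead of panicking
        let Some(&len) = opts.get(p + 1) else { break };
        let end = p + 2 + len as usize;
        let Some(val) = opts.get(p + 2..end) else { break };
        match (code, val) {
            (1, &[a, b, c, d]) => lease.netmask = Some(Ipv4Addr::new(a, b, c, d)),
            // Option 3 may list several routers: take the first one
            (3, &[a, b, c, d, ..]) => lease.router = Some(Ipv4Addr::new(a, b, c, d)),
            (80, &[]) => lease.rapid_commit = true,
            _ => {}
        }
        p = end;
    }
    lease
}

/// Prefix length of a contiguous netmask, e.g. 255.255.255.0 gives 24.
fn prefix_len(netmask: Ipv4Addr) -> u8 {
    u32::from(netmask).leading_ones() as u8
}
```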

A short summary of the proposed implementation:

  • External DHCP Clients Considered Harmful
  • instead of just triggering network configuration and then running away, wait until it completes, to guarantee that we can (correctly) run whatever application directly from the command line. Perhaps this is not so important for the current (main?) usage of muvm, but I have some secret plans about using this thing for kselftests which are not secret anymore
  • for IPv4-only: use DHCP, then poll on the IFA_F_TENTATIVE flag of the link-local IPv6 address (CONFIG_IPV6=y in libkrunfw is enough to get one) to see if we can expect a global unicast address coming from passt via SLAAC
  • for IPv6-only: send the DHCP request with a short (?) timeout (100ms) to keep that case reasonably fast
  • both IPv4 and IPv6 (passt's default if both are available on the host): start SLAAC, take care of DHCP, then wait until SLAAC completed
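
To make the "wait until SLAAC completed" idea from the list above concrete, here is a small sketch that swaps in a different mechanism from the one proposed: instead of netlink, it reads the text format of /proc/net/if_inet6 (address, interface index, prefix length, scope, flags, name, all but the name in hex) and reports which kinds of address on an interface are past duplicate address detection. The function and struct names are assumptions.

```rust
const IFA_F_TENTATIVE: u32 = 0x40; // address still undergoing DAD
const SCOPE_GLOBAL: u32 = 0x00;
const SCOPE_LINK: u32 = 0x20;

/// Which non-tentative IPv6 addresses exist on the interface.
#[derive(Debug, Default, PartialEq)]
struct SlaacState {
    ll_ready: bool,
    global_ready: bool,
}

/// Inspect the contents of /proc/net/if_inet6 for one interface.
fn slaac_state(if_inet6: &str, ifname: &str) -> SlaacState {
    let mut state = SlaacState::default();
    for line in if_inet6.lines() {
        // Fields: address, ifindex, prefix length, scope, flags, name
        let f: Vec<&str> = line.split_whitespace().collect();
        if f.len() != 6 || f[5] != ifname {
            continue;
        }
        let (Ok(scope), Ok(flags)) = (
            u32::from_str_radix(f[3], 16),
            u32::from_str_radix(f[4], 16),
        ) else {
            continue;
        };
        if flags & IFA_F_TENTATIVE != 0 {
            continue; // not usable yet
        }
        match scope {
            SCOPE_LINK => state.ll_ready = true,
            SCOPE_GLOBAL => state.global_ready = true,
            _ => {}
        }
    }
    state
}
```

A caller would re-read the file and call `slaac_state` in a loop until the state it expects is reached, which mirrors the polling described in the list, only without netlink messages.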
Proposed implementation (still with "profiling" prints):

use std::fs;
use std::io::Write;
use std::net::{UdpSocket, Ipv4Addr};
use std::time::Instant;
use std::time::Duration;

use anyhow::{Context, Result};
use rustix::system::sethostname;

use neli::{
    consts::{
        nl::NlmF,
        rtnl::{Arphrd, Ifa, IfaF, Iff, Rta, RtAddrFamily, Rtm, RtmF, Rtn,
               Rtprot, RtScope, RtTable},
        socket::NlFamily,
    },
    nl::{NlPayload, Nlmsghdr},
    router::synchronous::{NlRouter, NlRouterReceiverHandle},
    rtnl::{Ifaddrmsg, IfaddrmsgBuilder, Ifinfomsg, IfinfomsgBuilder,
           RtattrBuilder, Rtmsg, RtmsgBuilder},
    utils::Groups,
    types::RtBuffer,
};

/// Set interface flags for eth0 (interface index 2) with a given mask
fn flags_eth0(rtnl: &NlRouter, mask: Iff, set: Iff) -> Result<()> {
    let ifinfomsg = IfinfomsgBuilder::default()
        .ifi_family(RtAddrFamily::Unspecified)
        .ifi_type(Arphrd::Ether).ifi_index(2)
        .ifi_change(mask).ifi_flags(set)
        .build()?;

    let _: NlRouterReceiverHandle<Rtm, Ifinfomsg> =
        rtnl.send(Rtm::Newlink, NlmF::REQUEST, NlPayload::Payload(ifinfomsg))?;

    Ok(())
}

/// Add or delete IPv4 routes for eth0 (interface index 2)
fn route4_eth0(rtnl: &NlRouter, what: Rtm, gw: Ipv4Addr) -> Result<()> {
    let rtmsg = RtmsgBuilder::default()
        .rtm_family(RtAddrFamily::Inet)
        .rtm_dst_len(0).rtm_src_len(0).rtm_tos(0)
        .rtm_table(RtTable::Main).rtm_protocol(Rtprot::Boot)
        .rtm_scope(RtScope::Universe).rtm_type(Rtn::Unicast)
        .rtm_flags(RtmF::empty())
        .rtattrs(RtBuffer::from_iter([
            RtattrBuilder::default()
                .rta_type(Rta::Oif)
                .rta_payload(2)
                .build()?,
            RtattrBuilder::default()
                .rta_type(Rta::Dst)
                .rta_payload(Ipv4Addr::UNSPECIFIED.octets().to_vec())
                .build()?,
            RtattrBuilder::default()
                .rta_type(Rta::Gateway)
                .rta_payload(gw.octets().to_vec())
                .build()?
        ]))
        .build()?;

    let _: NlRouterReceiverHandle<Rtm, Rtmsg> =
        rtnl.send(what, NlmF::CREATE | NlmF::REQUEST,
                  NlPayload::Payload(rtmsg))?;

    Ok(())
}

/// Add or delete IPv4 addresses for eth0 (interface index 2)
fn addr4_eth0(rtnl: &NlRouter, what: Rtm, addr: Ipv4Addr, prefix_len: u8)
                       -> Result<()> {
    let ifaddrmsg = IfaddrmsgBuilder::default()
        .ifa_family(RtAddrFamily::Inet)
        .ifa_prefixlen(prefix_len)
        .ifa_scope(RtScope::Universe)
        .ifa_index(2)
        .rtattrs(RtBuffer::from_iter([
            RtattrBuilder::default()
                .rta_type(Ifa::Local)
                .rta_payload(addr.octets().to_vec())
                .build()?,
            RtattrBuilder::default()
                .rta_type(Ifa::Address)
                .rta_payload(addr.octets().to_vec())
                .build()?,
        ]))
        .build()?;

    let _: NlRouterReceiverHandle<Rtm, Ifaddrmsg> =
        rtnl.send(what, NlmF::CREATE | NlmF::REQUEST,
                  NlPayload::Payload(ifaddrmsg))?;

    Ok(())
}

pub fn configure_network() -> Result<()> {

    // Allow unprivileged users to use ping, as most distros do by default.
    {
        let mut file = fs::File::options()
            .write(true)
            .open("/proc/sys/net/ipv4/ping_group_range")
            .context("Failed to open ipv4/ping_group_range for writing")?;

        file.write_all(format!("{} {}", 0, 2147483647).as_bytes())
            .context("Failed to extend ping group range")?;
    }

    {
        let hostname =
            fs::read_to_string("/etc/hostname").unwrap_or("placeholder-hostname".to_string());
        let hostname = if let Some((hostname, _)) = hostname.split_once('\n') {
            hostname.to_owned()
        } else {
            hostname
        };
        sethostname(hostname.as_bytes()).context("Failed to set hostname")?;
    }

    let start = Instant::now();
    let (rtnl, _) = NlRouter::connect(NlFamily::Route, None, Groups::empty())?;
    rtnl.enable_strict_checking(true)?;

    // Disable neighbour solicitations (dodge DAD), bring up link to start SLAAC
    {
        // IFF_NOARP | IFF_UP in one shot delays router solicitations, avoid it
        flags_eth0(&rtnl, Iff::NOARP, Iff::NOARP)?;
        flags_eth0(&rtnl, Iff::UP, Iff::UP)?;
    }

    let start_dhcp = Instant::now();
    // Configure IPv4 using DHCP with Rapid Commit (DISCOVER -> ACK)
    {
        // Temporary link-local address and route avoid the need for raw sockets
        route4_eth0(&rtnl, Rtm::Newroute, Ipv4Addr::UNSPECIFIED)?;
        addr4_eth0(&rtnl, Rtm::Newaddr, Ipv4Addr::new(169, 254, 1, 1), 16)?;

        // Send request (DHCPDISCOVER)
        let socket = UdpSocket::bind("0.0.0.0:68")
            .context("Failed to bind DHCP client socket")?;
        let mut buf = [0; 576 /* RFC 2131, Section 2 */ ];

        const REQUEST: [u8; 300 /* From RFC 951: >= 60 B of options */ ] = [
            1 /* REQUEST */, 0x1 /* Ethernet */, 6 /* hlen */, 0 /* Hops */,
            1, 2, 3, 4 /* XID */, 0, 0 /* Seconds */, 0x80, 0x0 /* Flags */,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* All-zero (four) addresses */
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 16B HW address: who cares */
            /* 32 bytes per row: 64B 'sname', plus 128B 'file' (RFC 1531) */
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0x63, 0x82, 0x53, 0x63, /* DHCP (magic) cookie, then options: */
            53, 1, 1 /* DISCOVER */, 80, 0 /* Rapid commit */, 0xff, // Done
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 /* 54B paaaadding */
        ];

        socket.set_broadcast(true)?;
        socket.send_to(&REQUEST, "255.255.255.255:67")?;

        // Keep IPv6-only fast
        let _ = socket.set_read_timeout(Some(Duration::from_millis(100)));

        // Get and process response (DHCPACK) if any
        if let Ok((len, _)) = socket.recv_from(&mut buf) {
            let msg = &mut buf[..len];

            let addr = Ipv4Addr::new(msg[16], msg[17], msg[18], msg[19]);
            let mut netmask = Ipv4Addr::UNSPECIFIED;
            let mut router = Ipv4Addr::UNSPECIFIED;
            let mut p: usize = 240;

            while p < len {
                let o = msg[p];

                if o == 0xff {        // Option 255: End (of options)
                    break;
                } else if o == 0 {    // Option 0: Pad, single byte, no length
                    p += 1;
                    continue;
                }

                // Stop on a truncated option instead of indexing out of bounds
                if p + 1 >= len {
                    break;
                }
                let l = msg[p + 1] as usize;
                if p + 2 + l > len {
                    break;
                }

                if o == 1 && l >= 4 {        // Option 1: Subnet Mask
                    netmask = Ipv4Addr::new(msg[p + 2], msg[p + 3],
                                            msg[p + 4], msg[p + 5]);
                } else if o == 3 && l >= 4 { // Option 3: Router
                    router =  Ipv4Addr::new(msg[p + 2], msg[p + 3],
                                            msg[p + 4], msg[p + 5]);
                }

                // Length doesn't include code and length field itself
                p += l + 2;
            }

            let prefix_len: u8 = netmask.to_bits().leading_ones() as u8;

            // Drop temporary address and route, configure what we got instead
            route4_eth0(&rtnl, Rtm::Delroute, Ipv4Addr::UNSPECIFIED)?;
            addr4_eth0(&rtnl, Rtm::Deladdr, Ipv4Addr::new(169, 254, 1, 1), 16)?;

            addr4_eth0(&rtnl, Rtm::Newaddr, addr, prefix_len)?;
            route4_eth0(&rtnl, Rtm::Newroute, router)?;
        } else {
            // Clean up: we're clearly too cool for IPv4
            route4_eth0(&rtnl, Rtm::Delroute, Ipv4Addr::UNSPECIFIED)?;
            addr4_eth0(&rtnl, Rtm::Deladdr, Ipv4Addr::new(169, 254, 1, 1), 16)?;
        }
    }
    let elapsed = start_dhcp.elapsed();
    println!("DHCP client took {:?}", elapsed);

    // Wait for SLAAC to complete or fail: we're done only once network is ready
    {
        let mut global_seen = false;
        let mut global_wait = true;
        let mut ll_seen = false;

        // Busy-netlink-loop until we see a link-local address, and a global
        // unicast address as long as we might expect one (see below)
        while !ll_seen || (global_wait && !global_seen) {
            let ifaddrmsg = IfaddrmsgBuilder::default()
                .ifa_family(RtAddrFamily::Inet6)
                .ifa_prefixlen(0)
                .ifa_scope(RtScope::Universe)
                .ifa_index(2)
                .build()?;

            let recv = rtnl.send(Rtm::Getaddr, NlmF::ROOT,
                                 NlPayload::Payload(ifaddrmsg))?;

            for response in recv {
                let header: Nlmsghdr<Rtm, Ifaddrmsg> = response?;
                if let NlPayload::Payload(p) = header.nl_payload() {
                    if p.ifa_scope() == &RtScope::Link {
                        // A non-tentative link-local address implies we sent a
                        // router solicitation that didn't get any response
                        // (IPv4-only)? Stop waiting for the router in that case
                        if *p.ifa_flags() & IfaF::TENTATIVE != IfaF::TENTATIVE {
                            global_wait = false;
                        }

                        ll_seen = true;
                    } else if p.ifa_scope() == &RtScope::Universe {
                        global_seen = true;
                    }
                }
            }
        }
    }

    // Re-enable neighbour solicitations and ARP requests
    {
        flags_eth0(&rtnl, Iff::NOARP, Iff::empty())?;
    }

    let elapsed = start.elapsed();
    println!("Network configuration took {:?}", elapsed);

    Ok(())
}
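
For illustration only, the option parsing above can also be pulled out into a standalone function that is easy to test without a socket. This is a minimal sketch, not part of the patch: `Lease`, `ipv4_at`, `parse_lease` and `prefix_len` are hypothetical names, and it assumes the same layout the code above relies on (client address at offset 16, options starting at offset 240, after the magic cookie). Unlike the inline loop, it also skips Pad (option 0) bytes and rejects truncated options.

```rust
use std::net::Ipv4Addr;

/// First option byte: 236-byte fixed header plus 4-byte magic cookie.
const OPTIONS_OFFSET: usize = 240;

/// Hypothetical container for the fields the client above uses.
#[derive(Debug, PartialEq)]
struct Lease {
    addr: Ipv4Addr,
    netmask: Ipv4Addr,
    router: Ipv4Addr,
}

/// Read four bytes starting at `at` as an address, None if out of range.
fn ipv4_at(msg: &[u8], at: usize) -> Option<Ipv4Addr> {
    let b = msg.get(at..at + 4)?;
    Some(Ipv4Addr::new(b[0], b[1], b[2], b[3]))
}

/// Hypothetical helper: extract 'yiaddr' plus options 1 (Subnet Mask) and
/// 3 (Router). Returns None if the message or an option is truncated.
fn parse_lease(msg: &[u8]) -> Option<Lease> {
    let addr = ipv4_at(msg, 16)?;
    let mut netmask = Ipv4Addr::UNSPECIFIED;
    let mut router = Ipv4Addr::UNSPECIFIED;
    let mut p = OPTIONS_OFFSET;

    while p < msg.len() {
        let code = msg[p];
        if code == 0 {
            p += 1; // Pad: a single byte, no length field
            continue;
        }
        if code == 0xff {
            break; // End of options
        }

        let len = *msg.get(p + 1)? as usize;
        let value = msg.get(p + 2..p + 2 + len)?;

        if code == 1 && len >= 4 {
            netmask = Ipv4Addr::new(value[0], value[1], value[2], value[3]);
        } else if code == 3 && len >= 4 {
            // Option 3 may list several routers: take the first one
            router = Ipv4Addr::new(value[0], value[1], value[2], value[3]);
        }

        p += 2 + len; // Length doesn't include code and length bytes
    }

    Some(Lease { addr, netmask, router })
}

/// Prefix length of a contiguous netmask, e.g. 255.255.255.0 gives 24.
fn prefix_len(netmask: Ipv4Addr) -> u8 {
    u32::from(netmask).leading_ones() as u8
}
```

Returning `None` on a truncated option is a design choice made for this sketch; the caller could then treat it like the "no response" branch and drop the temporary IPv4 address and route.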
Both link-local and global connectivity work right away

$ time ./target/debug/muvm --passt-socket=/tmp/passt_3.socket -- ping -c1 2a01:4f8:222:904::2
DHCP client took 571.66µs
Network configuration took 1.83646ms
Using default interface naming scheme 'v255'.
PING 2a01:4f8:222:904::2 (2a01:4f8:222:904::2) 56 data bytes
64 bytes from 2a01:4f8:222:904::2: icmp_seq=1 ttl=255 time=0.256 ms

--- 2a01:4f8:222:904::2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.256/0.256/0.256/0.000 ms

real	0m1.191s
user	0m0.746s
sys	0m0.493s

$ time ./target/debug/muvm --passt-socket=/tmp/passt_3.socket -- ping -c1 fe80::1%eth0
DHCP client took 396.18µs
Network configuration took 1.79467ms
Using default interface naming scheme 'v255'.
PING fe80::1%eth0 (fe80::1%eth0) 56 data bytes
64 bytes from fe80::1%eth0: icmp_seq=1 ttl=255 time=0.234 ms

--- fe80::1%eth0 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.234/0.234/0.234/0.000 ms

real	0m1.199s
user	0m0.713s
sys	0m0.528s
IPv6-only operation (`passt -6`)

$ time ./target/debug/muvm --passt-socket=/tmp/passt_3.socket -- /bin/false
Using default interface naming scheme 'v255'.
/etc/udev/rules.d/60-scheduler.rules:1 The line has no effect, ignoring.
DHCP client took 101.189115ms
Network configuration took 102.811555ms
"/bin/false" process exited with status code: 1

real	0m1.288s
user	0m0.811s
sys	0m0.626s
IPv4-only operation (`passt -4`)

$ time ./target/debug/muvm --passt-socket=/tmp/passt_3.socket -- /bin/false
DHCP client took 377.87µs
Network configuration took 1.77258ms
"/bin/false" process exited with status code: 1

real	0m1.158s
user	0m0.689s
sys	0m0.508s

What do you think? Should I submit this as a separate change?

@slp
Collaborator Author

slp commented Nov 25, 2024

> What do you think? Should I submit this as a separate change?

This looks great, thank you @sbrivio-rh !

Yes please, submit this as a new PR that will supersede this one. I guess we need to wait until the new passt lands in Fedora?

@sbrivio-rh
Contributor

> Yes please, submit this as a new PR that will supersede this one.

Okay! I'll do that in a bit.

> I guess we need to wait until the new passt lands in Fedora?

Yes, and that we merge the series, too. But after that part, I'm usually fast.
