qemu: implement boot-time checkin (via ovirt-guest-agent protocol) #458

Open
cgwalters opened this issue Jul 14, 2020 · 12 comments

Comments

@cgwalters
Member

Currently if FCOS fails in very early boot (e.g. in the bootloader or kernel, before switching to the initramfs), it's...hard to detect consistently. For example, we have a test for Secure Boot, but it turns out that if the kernel fails to verify then...coreos-assembler today hangs for a really long time.

We can scrape the serial console in most cases, but we configure the Live ISO not to log to a serial console...so we'd end up having instead to do something like OpenQA and do image recognition on the graphics console 😢

Now we debated this somewhat in
coreos/ignition-dracut#170
and I argued strongly that the most important thing was to cover the "failure in initramfs" case, and we could support the "stream journal in general" by injecting Ignition.

In retrospect...I think I was wrong. It would be extremely useful for us to stream the journal starting from the initramfs at least by default on qemu.

In particular, what we really want is some sort of message from the VM that it has entered the initramfs, but before we start processing Ignition. If we're doing things like reprovisioning the rootfs, it becomes difficult to define a precise "timeout bound". But I think we can e.g. reliably time out after something quite low (like 10 seconds) if we haven't seen the "entered initramfs" message.
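The host-side timeout described here could be sketched roughly as follows. This is an illustrative Rust sketch, not coreos-assembler's actual code: a channel stands in for whatever feeds the host lines from the guest (console or agent channel), and the marker string is hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::time::{Duration, Instant};

/// Wait until a line containing `marker` arrives, or give up after `deadline`.
/// Models the host-side "fail fast if we never see 'entered initramfs'" logic.
fn wait_for_marker(rx: &Receiver<String>, marker: &str, deadline: Duration) -> bool {
    let start = Instant::now();
    loop {
        // Time remaining before we declare the boot failed.
        let remaining = match deadline.checked_sub(start.elapsed()) {
            Some(d) => d,
            None => return false,
        };
        match rx.recv_timeout(remaining) {
            Ok(line) if line.contains(marker) => return true,
            Ok(_) => continue,      // unrelated line, keep waiting
            Err(_) => return false, // timed out or the sender went away
        }
    }
}
```

With a low deadline like the 10 seconds suggested above, a guest that never reaches the initramfs fails the test quickly instead of hanging the harness.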

So here's my proposal:

@lucab
Contributor

lucab commented Jul 14, 2020

[Tackling only one bit out of the larger context]

> In particular, what we really want is some sort of message from the VM that it has entered the initramfs, but before we start processing Ignition.

This doesn't sound very different from the Azure boot check-in, which we currently perform in a non-homogeneous way (RHCOS does it in the initramfs, FCOS does it after boot-complete.target).

We could think about consolidating it in a "first-boot initramfs reached" check-in across various platforms in Afterburn, to be run before Ignition.
However:

  • this places a lower bound on the initramfs service ordering, somewhere after the network is up
  • the lowest common denominator across clouds (i.e. Azure, Packet, qemu) seems to be a one-shot signal (i.e. a marker, not a rich streaming log)

@cgwalters
Member Author

(forwarding some real time discussion here)

I think you're absolutely right - we should think of this as "add a first-boot checkin to our qemu model", since that model already exists on some clouds.

And then further discussion turned up https://wiki.qemu.org/Features/GuestAgent - so we could implement the minimum there in Afterburn, and have coreos-assembler time out if the guest doesn't reply to a sync pretty early on.
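For context, QGA's guest-sync is a host-initiated request such as `{"execute": "guest-sync", "arguments": {"id": 1234}}`, which the guest answers with `{"return": 1234}`. A minimal, dependency-free sketch of the guest-side reply, with naive string scanning standing in for real JSON parsing:

```rust
/// Build the reply to a QGA guest-sync request line, echoing back the id.
/// Real code would use a JSON parser; this sketch just scans for the
/// digits following the "id" key.
fn sync_reply(request: &str) -> Option<String> {
    // Only answer guest-sync requests.
    if !request.contains("guest-sync") {
        return None;
    }
    let idx = request.find("\"id\"")?;
    let digits: String = request[idx + 4..]
        .chars()
        .skip_while(|c| !c.is_ascii_digit())
        .take_while(|c| c.is_ascii_digit())
        .collect();
    if digits.is_empty() {
        return None;
    }
    Some(format!("{{\"return\": {}}}", digits))
}
```

The harness would send a guest-sync shortly after VM start and time out if no reply arrives.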

> this places a lower bound in the initramfs service ordering which is somewhere after the network is up

We can make the afterburn checkin not require networking on qemu, but I'm not very concerned about this TBH because qemu networking is quite fast.

@cgwalters cgwalters transferred this issue from coreos/ignition-dracut Jul 14, 2020
@cgwalters cgwalters changed the title streaming journal from the initramfs by default implement boot-time checkin by supporting the qemu guest api Jul 14, 2020
@cgwalters
Member Author

OK, moved this issue to afterburn.

I took a quick look at implementing this. I'd like to propose that we have afterburn run itself as a systemd generator early in startup, rather than shipping static units; this would give us a clean way to order our guest startup unit, e.g.:
After=/dev/virtio-ports/agent
(And we could also tweak the unit to not be After=network-online.target for the qemu case, etc.)
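A sketch of what such a generated unit might look like, assuming the device path above (systemd exposes /dev/virtio-ports/agent as the device unit dev-virtio\x2dports-agent.device; the description and the ExecStart flag here are hypothetical):

```ini
[Unit]
Description=Afterburn qemu boot check-in
DefaultDependencies=no
# /dev/virtio-ports/agent, escaped into a systemd device unit name
After=dev-virtio\x2dports-agent.device
Requires=dev-virtio\x2dports-agent.device

[Service]
Type=oneshot
# hypothetical flag, for illustration only
ExecStart=/usr/bin/afterburn --qemu-checkin
```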

@lucab
Contributor

lucab commented Jul 15, 2020

> And then further discussion turned up https://wiki.qemu.org/Features/GuestAgent - so we could implement the minimum there in Afterburn, and have coreos-assembler time out if the guest doesn't reply to a sync pretty early on.

That doesn't look like a great fit. In particular, the protocol is unidirectional (host->guest), which means we'd have to sit in the initramfs waiting to be polled (instead of actively signaling a guest->host event, as we do on Azure and Packet), and we can't really leave the initramfs until we have been polled.
The amount of time we should sit there is somewhat arbitrary. In the vast majority of cases there won't be anything polling, so we would now be delaying most instances' first boots on qemu.

Perhaps a more suitable protocol to target would be the ovirt-guest-agent one, which seems to support sending guest->host events. Conveniently, it already defines a system-startup event.
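As a sketch of what such a guest->host event could look like: the ovirt-guest-agent protocol frames each message as one JSON object per line carrying a "__name__" field. The helper below is illustrative only (the event name passed in would be whatever the protocol defines, e.g. the system-startup event mentioned above), and takes any Write sink in place of the real virtio-serial character device so it stays self-contained:

```rust
use std::io::Write;

/// Emit one line-delimited JSON event on the agent channel.
/// In a VM, `port` would be the virtio-serial device; here it is
/// generic over Write so the sketch is testable.
fn send_event<W: Write>(port: &mut W, name: &str) -> std::io::Result<()> {
    writeln!(port, "{{\"__name__\": \"{}\"}}", name)?;
    port.flush()
}
```

The guest fires this once on reaching the initramfs and moves on, with no polling required.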

@cgwalters
Member Author

You're right, once I started working on the code I noticed the inversion of control.

Discussing the oVirt protocol, though, gets into the much bigger topic of whether we want to implement more of the protocol for real as an agent on that platform, and how the platform would behave with what is likely to be a subset of the functionality.

I guess as a start we could just respond on the channel when ignition.platform.id=qemu and sidestep that, though.

@lucab
Contributor

lucab commented Jul 29, 2020

Additional note: on Azure the firstboot check-in also ejects the Virtual CD (paging @darkmuggle for confirmation), so I fear we cannot really check in before Ignition fetching is completed.

@cgwalters
Member Author

> Additional note: on Azure the firstboot check-in also ejects the Virtual CD (paging @darkmuggle for confirmation), so I fear we cannot really check in before Ignition fetching is completed.

Right, this comment proposes making our systemd units platform-dependent.

(Also, we discussed a while back changing Ignition on Azure to save the config to /boot or so.)

@bgilbert
Contributor

> Right, this comment proposes making our systemd units platform-dependent.

We don't need to do it as a generator, though, right? We can just ship some static units with ConditionKernelCommandLine=ignition.platform.id=X.
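A static per-platform unit along these lines would look roughly like this (the unit name and invocation are illustrative, not Afterburn's actual ones):

```ini
# afterburn-checkin-qemu.service (illustrative name)
[Unit]
Description=Afterburn first-boot check-in (qemu)
# Only runs when the platform ID on the kernel command line matches
ConditionKernelCommandLine=ignition.platform.id=qemu

[Service]
Type=oneshot
# hypothetical invocation, for illustration only
ExecStart=/usr/bin/afterburn --boot-checkin
```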

@lucab lucab changed the title implement boot-time checkin by supporting the qemu guest api qemu: implement boot-time checkin (via ovirt-guest-agent protocol) Aug 21, 2020
@cgwalters
Member Author

> We don't need to do it as a generator, though, right? We can just ship some static units with ConditionKernelCommandLine=ignition.platform.id=X.

Yeah; the duplication there might get ugly, but OTOH I guess we could generate the units statically, i.e. as part of the build process.

@AdamWill

AdamWill commented Aug 25, 2020

@cgwalters pointed me to this ticket, so for the record, as he knows, I've spent the last week working on running openQA tests on Fedora CoreOS. It's not very difficult, and we have it working already; the only question mark for a 'production' deployment would be when and on what to trigger the tests.

openQA definitely does do a fairly good job of letting you know if the artifact under test boots successfully in a VM.

For now the work lives on a branch of the openQA test repo and is only deployed on my pet openQA instance, which is not up all the time (it heats up my office... :>). We can do a production deployment quite easily once the triggering questions are sorted out.

@cgwalters
Member Author

> openQA definitely does do a fairly good job of letting you know if the artifact under test boots successfully in a VM.

To be clear, kola already covers this pretty well in general - we just have a few specific gaps, such as the case where a Secure Boot signature validation fails.

@AdamWill

AdamWill commented Nov 2, 2020

For the record once more: we did deploy the openQA CoreOS testing to production. The scheduling works by checking once an hour whether any of the named streams has been updated, and scheduling tests for the new build if so. Results show up at https://openqa.fedoraproject.org/group_overview/1?limit_builds=100 . We can write/run more tests if desired; requests can be filed at https://pagure.io/fedora-qa/os-autoinst-distri-fedora/issues .
