Bugfixes:
* Many manager configuration settings that are only applicable to user
manager or system manager can always be set. It would be better to reject
them when parsing config.
* Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
External:
* Fedora: add an rpmlint check that verifies that all unit files in the RPM are listed in %systemd_post macros.
* dbus:
- natively watch for dbus-*.service symlinks (PENDING)
- teach dbus to activate all services it finds in /etc/systemd/services/org-*.service
* kernel: add device_type = "fb", "fbcon" to class "graphics"
* /usr/bin/service should actually show the new command line
* fedora: suggest auto-restart on failure, but not on success and not on coredump. also, ask people to think about changing the start limit logic. Also point people to RestartPreventExitStatus=, SuccessExitStatus=
* neither pkexec nor sudo initialize environ[] from the PAM environment?
* fedora: update policy to declare access mode and ownership of unit files to root:root 0644, and add an rpmlint check for it
* register catalog database signature as file magic
* zsh shell completion:
- <command> <verb> -<TAB> should complete options, but currently does not
- systemctl add-wants,add-requires
- systemctl reboot --boot-loader-entry=
* systemctl status should know about 'systemd-analyze calendar ... --iterations='
* If timer has just OnInactiveSec=..., it should fire after a specified time
after being started.
* write blog stories about:
- hwdb: what belongs into it, lsusb
- enabling dbus services
- how to make changes to sysctl and sysfs attributes
- remote access
- how to pass throw-away units to systemd, or dynamically change properties of existing units
- testing with Harald's awesome test kit
- auto-restart
- how to develop against journal browsing APIs
- the journal HTTP iface
- non-cgroup resource management
- dynamic resource management with cgroups
- refreshed, longer mission statement
- calendar time events
- init=/bin/sh vs. "emergency" mode, vs. "rescue" mode, vs. "multi-user" mode, vs. "graphical" mode, and the debug shell
- how to create your own target
- instantiated apache, dovecot and so on
- hooking a script into various stages of shutdown/early boot
Regularly:
* look for close() vs. close_nointr() vs. close_nointr_nofail()
* check for strerror(r) instead of strerror(-r)
* pahole
* set_put(), hashmap_put() return values check. i.e. == 0 does not free()!
* use secure_getenv() instead of getenv() where appropriate
* link up selected blog stories from man pages and unit files Documentation= fields
Janitorial Clean-ups:
* rework mount.c and swap.c to follow proper state enumeration/deserialization
semantics, like we do for device.c now
* get rid of prefix_roota() and similar, only use chase() and related
calls instead.
* get rid of basename() and replace by path_extract_filename()
* Replace our fstype_is_network() with a call to libmount's mnt_fstype_is_netfs()?
Having two lists is not nice, but maybe it's now worth making a dependency on
libmount for something so trivial.
Deprecations and removals:
* Remove any support for booting without /usr pre-mounted in the initrd entirely.
Update INITRD_INTERFACE.md accordingly.
* 2019-10 – Remove POINTINGSTICK_CONST_ACCEL references from the hwdb, see #9573
* remove cgroupsv1 support EOY 2023. As per
https://lists.freedesktop.org/archives/systemd-devel/2022-July/048120.html
and then rework cgroupsv2 support around fds, i.e. keep one fd per active
unit around, and always operate on that, instead of cgroup fs paths.
* drop support for kernels that lack ambient capabilities support (i.e. make
4.3 new baseline). Then drop support for "!!" modifier for ExecStart= which
is only supported for such old kernels.
* drop support for kernels lacking memfd_create() (i.e. make 3.17 new
baseline), then drop all pipe() based fallbacks.
* drop support for getrandom()-less kernels. (GRND_INSECURE means once kernel
5.6 becomes our baseline). See
https://github.com/systemd/systemd/pull/24101#issuecomment-1193966468 for
details. Maybe before that: add taint-flags/warn about kernels that lack
getrandom()/environments where it is blocked.
* drop support for LOOP_CONFIGURE-less loopback block devices, once kernel
baseline is 5.8.
* drop fd_is_mount_point() fallback mess once we can rely on
STATX_ATTR_MOUNT_ROOT to exist i.e. kernel baseline 5.8
* rework our PID tracking in services and so on, to be strictly based on pidfd,
once kernel baseline is 5.13.
* H2 2023: remove support for unmerged-usr
* Remove /dev/mem ACPI FPDT parsing when /sys/firmware/acpi/fpdt is ubiquitous.
That requires distros to enable CONFIG_ACPI_FPDT, and have kernels v5.12 for
x86 and v6.2 for arm.
* Once baseline is 4.13, remove support for INTERFACE_OLD= checks in "udevadm
trigger"'s waiting logic, since we can then rely on uuid-tagged uevents
* remove remaining tpm1.2 support from sd-stub
Features:
* new "systemd-pcrlock" component for dealing with PCR4. Design idea:
1. define /{etc,usr,var/lib}/pcrlock.d/<component>/<version>.pcrlock
2. these files contain list of hashes that will be measured when component is
run, per PCR
3. each component involved in the boot that is deterministically measured can
place one or more of these files in those dirs (shim, sd-boot,
sd-stub/UKI, cryptsetup, pcrphase, pcrfs, …)
4. since each component has its own dir, with multiple files in them, packages
such as kernels (of which there can be multiple installed at the same
time) can be grouped together: only one of them is measured at a time.
5. whenever a new component is added or an old one removed, or the PCR lock
shall be relaxed or tightened the systemd-pcrlock tool is invoked.
6. tool iterates through all these files, orders them alphabetically by
component, then matches them up with current measurements (as per uefi
event log), identifying by hash, accepting that the "beginning" of the
measurements might not be recognizable.
7. Then calculates expected PCR values starting with the "unrecognized
head" from the event log, then continuing with all of components
defined via the .pcrlock files (but dropping out the "recognized tail"
from the uefi event log). (This might mean combinatorial explosion, if
there are multiple shims, multiple sd-boot, and so on.)
8. Generates a public/private key pair on the TPM
9. Generates a counter object in the TPM, with a policy that allows only
one-by-one increase with signature policy by the public/private key pair.
10. now signs policies of all expected PCR values with the generated keypair,
using all combinations of components defined in the .pcrlock files
restricting it to the counter + 1.
11. locks down the keypair with a signed policy with its own public key
12. generates JSON file of all these policies with their signatures, drops
them as singleton in ESP
13. increases the counter by one.
14. after boot sd-stub picks JSON up from ESP, passes it to userspace via
.extra
15. JSON contained policies can now be used to unlock disk as well as the
public/key itself for signing further policies, as well as increment for
the counter
16. whenever any of the components above is added/removed new JSON file with
signatures for counter + 1 is generated, dropped in ESP, then counter
increased. (i.e. this means the "recognized tail" of the event log is
deterministically swapped out)
17. when firmware update is expected, relaxed signed policy is generated for
next boot only valid if counter is increased (this means the
"unrecognized head" for the event log can change without losing access)
18. on every boot checks if relaxed policy is in effect, if so, new strict
policy is generated and counter increased.
Net result: Removes downgrade attack surface + Locks OS to firmware + Allows
downgrades within bounds
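
The expected-PCR calculation in steps 6 and 7 above can be sketched roughly as
follows. This is only an illustration, assuming a SHA-256 PCR bank and the
usual extend rule (new value = hash of old value concatenated with the measured
digest). The function names and the shape of the `components` mapping are
hypothetical and not part of any existing tool.

```python
import hashlib
from itertools import product


def pcr_extend(pcr: bytes, digest: bytes) -> bytes:
    """Extend rule for a SHA-256 bank: new = SHA256(old || digest)."""
    return hashlib.sha256(pcr + digest).digest()


def expected_pcr(unrecognized_head, component_digests) -> bytes:
    """Replay the unrecognized head of the event log, then the digests
    declared for the chosen components, starting from a zeroed PCR."""
    pcr = bytes(32)
    for digest in list(unrecognized_head) + list(component_digests):
        pcr = pcr_extend(pcr, digest)
    return pcr


def all_expected_pcrs(unrecognized_head, components) -> set:
    """components maps a component name to a list of variants, each variant
    being a list of digests (hypothetical layout, standing in for the
    per-component .pcrlock files). Components are ordered alphabetically and
    every combination of variants is evaluated, which is where the
    combinatorial explosion mentioned in step 7 comes from."""
    names = sorted(components)
    results = set()
    for combo in product(*(components[name] for name in names)):
        digests = [d for variant in combo for d in variant]
        results.add(expected_pcr(unrecognized_head, digests))
    return results
```

With two variants each for two components this already yields four candidate
PCR values, each of which would need its own signed policy.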
* add another PE section ".fname" or so that encodes the intended filename for
PE file, and validate that when loading add-ons and similar before using
it. This is particularly relevant when we load multiple add-ons and want to
sort them to apply them in a defined order. The order should not be under
control of the attacker.
* also include packaging metadata (á la
https://systemd.io/ELF_PACKAGE_METADATA/) in our UEFI PE binaries, using the
same JSON format.
* make "bootctl install" + "bootctl update" useful for installing shim too. For
that introduce new dir /usr/lib/systemd/efi/extra/ which we copy mostly 1:1
into the ESP at install time. Then make the logic smart enough so that we
don't overwrite bootx64.efi with our own if the extra tree already contains
one. Also, follow symlinks when copying, so that shim rpm can symlink their
stuff into our dir (which is safe since the target ESP is generally VFAT and
thus does not have symlinks anyway). Later, teach the update logic to look at
the ELF package metadata (which we also should include in all PE files, see
above) for version info in all *.EFI files, and use it to only update if
newer.
* in sd-stub: optionally add support for a new PE section .keyring or so that
contains additional certificates to include in the Mok keyring, extending
what shim might have placed there. why? let's say I use "ukify" to build +
sign my own fedora-based UKIs, and only enroll my personal lennart key via
shim. Then, I want to include the fedora keyring in it, so that kmods work.
But I might not want to enroll the fedora key in shim, because this would
also mean that the key would be in effect whenever I boot an archlinux UKI
built the same way, signed with the same lennart key.
* resolved: take possession of some IPv6 ULA address (let's say
fd00:5353:5353:5353:5353:5353:5353:5353), and listen on port 53 on it for the
local stubs, so that we can make the stub available via ipv6 too.
* introduce a .microcode PE section for sd-stub which we'll pass as first initrd
to the kernel which will then upload it to the CPU. This should be distinct
from .initrd to guarantee right ordering. also, and maybe more importantly
support .microcode in PE add-ons, so that a microcode update can be shipped
independently of any kernel.
* Maybe add SwitchRootEx() as new bus call that takes env vars to set for new
PID 1 as argument. When adding SwitchRootEx() we should maybe also add a
flags param that allows disabling and enabling whether serialization is
requested during switch root.
* introduce a .acpitable section for early ACPI table override
* add proper .osrel matching for PE addons. i.e. refuse applying an addon
intended for a different OS. Take inspiration from how confext/sysext are
matched against OS.
* use different sbat for sd-boot and sd-stub (so that people can revoke one
without the other)
* in ukify merge sbat info from kernel (if it has any, upstream kernels so far
don't), of sd-stub and data supplied by user. Then measure sbat too in
sd-stub, explicitly.
* figure out what to do about credentials sealed to PCRs in kexec + soft-reboot
scenarios. Maybe insist sealing is done additionally against some keypair in
the TPM to which access is updated on each boot, for the next, or so?
* logind: when logging in, always take an fd to the home dir, to keep the dir
busy, so that autofs release can never happen. (this is generally a good
idea, and specifically works around the fact that autofs ignores busy by mount
namespaces)
* refuse using the switch-root operation without /etc/initrd-release. Now
that we have a concept of userspace reboot, we can clearly say: switch-root
is for transitioning from initrd to host (or initrd to next initrd), while
userspace reboot is for switching host to next version of the host.
* mount most file systems with a restrictive uidmap. e.g. mount /usr/ with a
uidmap that blocks out anything outside 0…1000 (i.e. system users) and similar.
* mount the root fs with MS_NOSUID by default, and then mount /usr/ without
both so that suid executables can only be placed there. Do this already in
the initrd. If /usr/ is not split out create a bind mount automatically.
* rework journalctl -M to be based on a machined method that generates a mount
fd of the relevant journal dirs in the container with uidmapping applied to
allow the host to read it, while making everything read-only.
* fix our various hwdb lookup keys to end with ":" again. The original idea was
that hwdb patterns can match arbitrary fields with expressions like
"*:foobar:*", to wildcard match both the start and the end of the string.
This only works safely for later extensions of the string if the strings
always end in a colon. This requires updating our udev rules, as well as
checking if the various hwdb files are fine with that.
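
Why the trailing colon matters can be seen with plain glob matching. The
sketch below uses Python's `fnmatch` as a stand-in for the hwdb matcher, and
the lookup keys are made up for illustration.

```python
from fnmatch import fnmatchcase

# Pattern from the item above: wildcard both ends around one field.
PATTERN = "*:foobar:*"


def matches(lookup_key: str) -> bool:
    """Glob-match a lookup key against the pattern (case-sensitive)."""
    return fnmatchcase(lookup_key, PATTERN)


# Hypothetical lookup keys, not taken from any real udev rule.
key_without_colon = "example:v1234:foobar"
key_with_colon = "example:v1234:foobar:"
key_extended_later = "example:v1234:foobar:newfield:"
```

Here `matches(key_without_colon)` is false because the last field is not
followed by a colon, while the colon-terminated key matches both before and
after a new field is appended.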
* mount /tmp/ and /var/tmp with a uidmap applied that blocks out "nobody" user
among other things such as dynamic uid ranges for containers and so on. That
way no one can create files there with these uids and we enforce they are only
used transiently, never persistently.
* rework loopback support in fstab: when "loop" option is used, then
instantiate a new [email protected] for the source path, set the
lo_file_name field for it to something recognizable derived from the fstab
line, and then generate a mount unit for it using a udev generated symlink
based on lo_file_name.
* remove tomoyo support, it's obsolete and unmaintained apparently
* journald: add varlink service that allows subscribing to certain log events,
for example matching by message ID, or log level, and returns a list of journal
cursors as they happen.
* In .socket units, add ConnectStream=, ConnectDatagram=,
ConnectSequentialPacket= that create a socket, and then *connect to* rather than
listen on some socket. Then, add a new setting WriteData= that takes some
base64 data that systemd will write into the socket early on. This can then
be used to create connections to arbitrary services and issue requests into
them, as long as the data is static. This can then be combined with the
aforementioned journald subscription varlink service, to enable
activation-by-message id and similar.
* landlock: lock down RuntimeDirectory= via landlock, so that services lose
ability to write anywhere else below /run/. Similar for
StateDirectory=. Benefit would be clear delegation via unit files: services
get the directories they get, and nothing else even if they wanted to.
* landlock: for unprivileged systemd (i.e. systemd --user), use landlock to
implement ProtectSystem=, ProtectHome= and so on. Landlock does not require
privs, and we can implement pretty similar behaviour. Also, maybe add a mode
where ProtectSystem= combined with an explicit PrivateMounts=no could request
similar behaviour for system services, too.
* Add [email protected] which is instantiated for a block device and
invokes systemd-mount and exits. This is then useful to use in
ENV{SYSTEMD_WANTS} in udev rules, and a bit prettier than using RUN+=
* udevd: extend memory pressure logic: also kill any idle worker processes
* SIGRTMIN+18 and memory pressure handling should still be added to: hostnamed,
localed, oomd, timedated.
* journald: also collect CLOCK_BOOTTIME timestamps per log entry. Then, derive
"corrected" CLOCK_REALTIME information on display from that and the timestamp
info of the newest entry of the specific boot (as identified by the boot
ID). This way, if a system comes up without a valid clock but acquires a
better clock later, we can "fix" older entry timestamps on display, by
calculating backwards. We cannot use CLOCK_MONOTONIC for this, since it does
not account for suspend phases. This would then also enable us to correct the
kmsg timestamping we consume (where we erroneously assume the clock was in
CLOCK_MONOTONIC, but it actually is CLOCK_BOOTTIME as per kernel).
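
The backwards calculation described above amounts to simple arithmetic. A
minimal sketch, assuming every entry carries a CLOCK_BOOTTIME timestamp in
microseconds and that the newest entry of the boot has a trustworthy
CLOCK_REALTIME value. The function and parameter names are hypothetical.

```python
def corrected_realtime_usec(entry_boottime_usec: int,
                            newest_boottime_usec: int,
                            newest_realtime_usec: int) -> int:
    """Derive a wallclock time for an older entry of the same boot.

    The newest entry of the boot serves as the reference point: subtract the
    CLOCK_BOOTTIME distance between the two entries from the reference
    CLOCK_REALTIME value.
    """
    elapsed_since_entry = newest_boottime_usec - entry_boottime_usec
    return newest_realtime_usec - elapsed_since_entry
```

For example, an entry logged 5 s into the boot, compared against a reference
entry logged 65 s into the boot, is placed 60 s before the reference entry's
wallclock time, regardless of what the clock claimed when the entry was
written.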
* sd-journal puts a limit on parallel journal files to view at once. journald
should probably honour that same limit (JOURNAL_FILES_MAX) when vacuuming to
ensure we never generate more files than we can actually view.
* in order to make binding to PCR 4 realistic:
- generate one keypair "U" and store it in a tpm2 nvindex.
- Generate another keypair "P" and store it in a second tpm2 nvindex.
- allocate a persistent counter object "C" in the tpm2
- Enroll all user objects (i.e. luks volumes, creds, …) to a tpm2 policy
signed by U.
- Lock both U and P down with a tpm2 policy signed by P (yes, P can only be
used if a signature by P itself can be provided)
- For regular reboots generate a signature for a restrictive PCR4 + counter C
based policy with key P. Place signature in EFI var, so it can be found on
next boot
- For reboots where a firmware update is expected generate a signature with a
more open policy against just counter C. Place signature in same EFI var.
- Increase C whenever switching between these two signature types.
- During early boot, use the signature from the EFI var to unlock U and P.
Use it to generate a signature for unlocking user objects given the current
PCR 4 value, store that away into /run somewhere, for use during the whole
later boot.
- When booting up automatically update the mentioned efi var so that it
contains the restrictive signature. But also generate a signature ahead of
time that could be used in case during the current boot we later detect we might
need to reboot for a firmware update. Store that in /run somewhere, so that
it can be placed in the EFI var, if needed.
* repart/gpt-auto/DDIs: maybe introduce a concept of "extension" partitions,
that have a new type uuid and can "extend" earlier partitions, to work around
the fact that systemd-repart can only grow the last partition defined. During
activation we'd simply set up a dm-linear mapping to merge them again. A
partition that is to be extended would just set a bit in the partition flags
field to indicate that there's another extension partition to look for. The
identifying UUID of the extension partition would be hashed in counter mode
from the uuid of the original partition it extends. Inspiration for this is
the "dynamic partitions" concept of new Android. This would be a minimalistic
concept of a volume manager, with the extents it manages being exposed as GPT
partitions. If a partition is extended multiple times they should probably
grow exponentially in size to ensure O(log(n)) time for finding them on
access.
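
One possible reading of "hashed in counter mode" is sketched below. The item
does not specify a hash function or encoding, so the use of SHA-256, the
little-endian 64-bit counter and the function name are all assumptions made
for illustration.

```python
import hashlib
import uuid


def extension_partition_uuid(original: uuid.UUID, index: int) -> uuid.UUID:
    """Derive the identifying UUID of the index-th extension partition.

    Assumed scheme: SHA-256 over the original partition UUID followed by the
    extension index as a little-endian 64-bit counter, truncated to 128 bits
    and marked as a version 4 UUID.
    """
    counter = index.to_bytes(8, "little")
    digest = hashlib.sha256(original.bytes + counter).digest()
    return uuid.UUID(bytes=digest[:16], version=4)
```

Because the derivation is deterministic, activation code could walk indexes
0, 1, 2, … and look up each derived UUID in the partition table until one is
missing, without any extra metadata beyond the flag bit.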
* split out execute.c into new "systemd-executor" binary. Then make PID 1 fork
that off via vfork(), and then let that executor do the hard work. Ultimately
the executor then gets replaced by the real binary sooner or later. Reason:
currently the intermediary "stub" process is a CoW trap that doubles memory
usage of PID 1 on each service start. Also, strictly speaking we are not
allowed to do NSS from the stub process yet we do anyway. Next steps would
then be maybe use CLONE_INTO_CGROUP for the executor, given that we don't
need glibc anymore in the stub process then. Then, switch nspawn to just be a
frontend for this too, so that we have two ways into the executor: via unit
files/dbus/varlink through PID1 and via cmdline/OCI through nspawn.
* sd-stub: detect if we are running with uefi console output on serial, and if so
automatically add console= to kernel cmdline matching the same port.
* add a utility that can be used with the kernel's
CONFIG_STATIC_USERMODEHELPER_PATH and then handles them within pid1 so that
security, resource management and cgroup settings can be enforced properly
for all umh processes.
* systemd-shutdown: keep sending sd_notify() status updates immediately before
going down, in particular include the "reboot param" string.
* homed: when resizing an fs don't sync identity beforehand, as there might simply
not be enough disk space for that. try to be defensive and sync only after
resize.
* homed: if for some reason the partition ended up being much smaller than
whole disk, recover from that, and grow it again.
* in journald, write out a recognizable log record whenever the system clock is
changed ("stepped"), and in timesyncd whenever we acquire an NTP fix
("slewing"). Then, in journalctl for each boot time we come across, find
these records, and use the structured info they include to display
"corrected" wallclock time, as calculated from the monotonic timestamp in the
log record, adjusted by the delta declared in the structured log record.
* in journald: whenever we start a new journal file because the boot ID
changed, let's generate a recognizable log record containing info about old
and new ID. Then, when displaying log stream in journalctl look for these
records, to be able to order them.
* timesyncd: when saving/restoring clock try to take boot time into account.
Specifically, along with the saved clock, store the current boot ID. When
starting, check if the boot id matches. If so, don't do anything (we are on
the same boot and clock just kept running anyway). If not, then read
CLOCK_BOOTTIME (which started at boot), and add it to the saved clock
timestamp, to compensate for the time we spent booting. If EFI timestamps are
available, also include that in the calculation. With this we'll then only
miss the time spent during shutdown after timesync stopped and before the
system actually reset.
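
The restore logic described above can be sketched as a small pure function.
The names and the way the EFI (firmware/loader) time is passed in are
assumptions for illustration, not how timesyncd works today.

```python
from typing import Optional


def restored_clock_usec(saved_clock_usec: int,
                        saved_boot_id: str,
                        current_boot_id: str,
                        boottime_usec: int,
                        efi_usec: int = 0) -> Optional[int]:
    """Return the clock value to restore, or None if nothing should be done.

    Same boot ID: the clock kept running since the save, so leave it alone.
    Different boot ID: advance the saved clock by the time spent booting
    (CLOCK_BOOTTIME), plus the time spent in firmware and boot loader if EFI
    timestamps are available (efi_usec, 0 if unknown).
    """
    if saved_boot_id == current_boot_id:
        return None
    return saved_clock_usec + boottime_usec + efi_usec
```

The time between timesyncd stopping at shutdown and the actual reset is still
unaccounted for, as the item notes.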
* systemd-stub: maybe store a "boot counter" in the ESP, and pass it down to
userspace to allow ordering boots (for example in journalctl). The counter
would be monotonically increased on every boot.
* pam_systemd_home: add module parameter to control whether to accept
only password or only pkcs11/fido2 auth, and then use this to hook nicely
into two of the three PAM stacks gdm provides.
See discussion at https://github.com/authselect/authselect/pull/311
* sd-boot: make boot loader spec type #1 accept http urls in "linux"
lines. Then, do the uefi http dance to download kernels and boot them. This
is then useful for network boot, by embedding a cpio with type #1 snippets
in sd-boot, which reference remote kernels.
* maybe prohibit setuid() to the nobody user, to lock things down, via seccomp.
the nobody is not a user any code should run under, ever, as that user would
possibly get a lot of access to resources it really shouldn't be getting
access to due to the userns + nfs semantics of the user. Alternatively: use
the seccomp log action, and allow it.
* sd-boot: add a new PE section .bls or so that carries a cpio with additional
boot loader entries (both type1 and type2). Then when initializing, find this
section, iterate through it and populate menu with it. cpio is simple enough
to make a parser for this reasonably robust. use same path structures as in
the ESP. Similarly, add one for signature key drop-ins.
* sd-boot: also allow passing in the cpio as in the previous item via SMBIOS
* add a new EFI tool "sd-fetch" or so. It looks in a PE section ".url" for a
URL, then downloads the file from it using UEFI HTTP APIs, and executes it.
Usecase: provide a minimal ESP with sd-boot and a couple of these sd-fetch
binaries in place of UKIs, and download them on-the-fly.
* maybe: systemd-loop-generator that sets up loopback devices if requested via kernel
cmdline. usecase: include encrypted/verity root fs in UKI.
* systemd-gpt-auto-generator: add kernel cmdline option to override block
device to dissect. also support dissecting a regular file. usecase: include
encrypted/verity root fs in UKI.
* sd-stub: add ".bootcfg" section for kernel bootconfig data (as per
https://docs.kernel.org/admin-guide/bootconfig.html)
* tpm2: add (optional) support for generating a local signing key from PCR 15
state. use private key part to sign PCR 7+14 policies. stash signatures for
expected PCR7+14 policies in EFI var. use public key part in disk encryption.
generate new sigs whenever db/dbx/mok/mokx gets updated. that way we can
securely bind against SecureBoot/shim state, without having to reenroll
everything on each update (but we still have to generate one sig on each
update, but that should be robust/idempotent). needs rollback protection, as
usual.
* Lennart: big blog story about DDIs
* Lennart: big blog story about building initrds
* Lennart: big blog story about "why systemd-boot"
* bpf: see if we can use BPF to solve the syslog message cgroup source problem:
one idea would be to patch source sockaddr of all AF_UNIX/SOCK_DGRAM to
implicitly contain the source cgroup id. Another idea would be to patch
sendto()/connect()/sendmsg() sockaddr on-the-fly to use a different target
sockaddr.
* bpf: see if we can address opportunistic inode sharing of immutable fs images
with BPF. i.e. if bpf gives us power to hook into openat() and return a
different inode than the one requested, but one that has the same contents,
then we can use that to implement opportunistic inode sharing among DDIs:
make all DDIs ship xattr on all reg files with a SHA256 hash. Then, also
dictate that DDIs should come with a top-level subdir where all reg files are
linked into by their SHA256 sum. Then, whenever an inode is opened with the
xattr set, check bpf table to find dirs with hashes for other prior DDIs and
try to use inode from there.
* extend the verity signature partition to permit multiple signatures for the
same root hash, so that people can sign a single image with multiple keys.
* consider adding a new partition type, just for /opt/ for usage in system
extensions
* gpt-auto-discovery: also use the pkcs7 signature stuff, and pass signature to
kernel. So far we only did this for the various --image= switches, but not
for the root fs or /usr/.
* dissection policy should enforce that unlocking can only take place by
certain means, i.e. only via pw, only via tpm2, or only via fido, or a
combination thereof.
* make the systemd-repart "seed" value provisionable via credentials, so that
confidential computing environments can set it and deterministically
enforce the uuids for partitions created, so that they can calculate PCR 15
ahead of time.
* systemd-repart: also derive the volume key from the seed value, for the
aforementioned purpose.
* in the initrd: derive the default machine ID to pass to the host PID 1 via
$machine_id from the same seed credential.
* Add systemd-sysupdate-initrd.service or so that runs systemd-sysupdate in the
initrd to bootstrap the initrd to populate the initial partitions. Some things
to figure out:
- Should it run on firstboot or on every boot?
- If run on every boot, should it use the sysupdate config from the host on
subsequent boots?
* hook up journald with TPMs? measure new journal records to the TPM in regular
intervals, validate the journal against current TPM state with that. (taking
inspiration from IMA log)
* provide an API (probably IPC) to apps to encrypt/decrypt
credentials. usecase: allow bluez bluetooth daemon to pass pairings to initrd
that way, without shelling out to our tools.
* revisit default PCR bindings in cryptenroll and systemd-creds. Currently they
use PCR 7 which should contain secureboot state db/dbx. Which sounded like a
safe bet, given that it should change only on policy changes, and not
software updates. But that's wrong. Recent fwupd (rightfully) contains code
for updating the dbx denylist. This means even without any active policy
change PCR 7 might change. Hence, better idea might be in systemd-creds to
default to PCR 15 at least if sd-stub is used (i.e. bind to system identity),
and in cryptsetup simply the empty list? Also, PCR 14 almost certainly should
be included as much as PCR 7 (as it contains shim's policy, which is
certainly as relevant as PCR 7 on many systems)
* To mimic the new tpm2-measure-pcr= crypttab option add the same to veritytab
(measuring the root hash) and integritytab (measuring the HMAC key if one is
used)
* We should start measuring all services, containers, and system extensions we
activate. probably into PCR 13. i.e. add --tpm2-measure-pcr= or so to
systemd-nspawn, and MeasurePCR= to unit files. Should contain a measurement
of the activated configuration and the image that is being activated (in case
verity is used, hash of the root hash).
* whenever we measure something into a TPM PCR from userspace, write a record in
TCG's "Canonical Event Log" format to some file, so that we can reason about
how PCR values we manage came to
be. https://trustedcomputinggroup.org/resource/canonical-event-log-format/
* bootspec: permit graceful "update" from type #2 to type #1. If both a type #1
and a type #2 entry exist under otherwise the exact same name, then use the
type #1 entry, and ignore the type #2 entry. This way, people can "upgrade"
from the UKI with all parameters baked in to a Type #1 .conf file with manual
parametrization, if needed. This matches our usual rule that admin config
should win over vendor defaults.
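A minimal sketch of the proposed selection rule. The `BootEntry` type and the idea that entries are compared by name with the file suffix stripped are assumptions for illustration.

```python
from typing import List, NamedTuple


class BootEntry(NamedTuple):
    name: str  # entry name, assumed to be compared without .conf/.efi suffix
    kind: int  # 1 = Type #1 (.conf file), 2 = Type #2 (UKI)


def prefer_type1(entries: List[BootEntry]) -> List[BootEntry]:
    """Drop every Type #2 entry shadowed by a Type #1 entry of the same name."""
    type1_names = {e.name for e in entries if e.kind == 1}
    return [e for e in entries if e.kind == 1 or e.name not in type1_names]
```

Type #2 entries without a same-named Type #1 counterpart are kept untouched, so nothing changes for systems that only use UKIs.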
* write a "search path" spec, that documents the prefixes to search in
(i.e. the usual /etc/, /run/, /usr/lib/ dance, potentially /usr/etc/), how to
sort found entries, how masking works and overriding.
* automatic boot assessment: add one more default success check that just waits
for a bit after boot, and blesses the boot if the system stayed up that long.
* implement concept of "versioned" resources inside a dir, and write a spec for
it. Make all tools in systemd, in particular
RootImage=/RootDirectory=/--image=/--directory= implement this. Idea:
directories ending in ".v/" indicate a directory with versioned resources in
them. Versioned resources inside a .v dir are always named in the pattern
<prefix>_<version>[+<tries-left>[-<tries-done>]].<suffix>
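A minimal sketch of parsing names that follow the pattern above. It assumes the caller already knows the expected prefix and suffix (the item does not say where these come from), and that the version string never contains "+".

```python
import re
from typing import NamedTuple, Optional


class Versioned(NamedTuple):
    prefix: str
    version: str
    tries_left: Optional[int]
    tries_done: Optional[int]
    suffix: str


# <version>[+<tries-left>[-<tries-done>]]
_MIDDLE = re.compile(r"(?P<version>[^+]+)(?:\+(?P<left>\d+)(?:-(?P<done>\d+))?)?")


def parse_versioned(name: str, prefix: str, suffix: str) -> Optional[Versioned]:
    """Parse '<prefix>_<version>[+<left>[-<done>]].<suffix>', or return None."""
    head, tail = prefix + "_", "." + suffix
    if not name.startswith(head) or not name.endswith(tail):
        return None
    m = _MIDDLE.fullmatch(name[len(head):len(name) - len(tail)])
    if m is None:
        return None
    left, done = m.group("left"), m.group("done")
    return Versioned(
        prefix,
        m.group("version"),
        int(left) if left is not None else None,
        int(done) if done is not None else None,
        suffix,
    )
```

Passing prefix and suffix in explicitly sidesteps the ambiguity of versions and suffixes that both contain dots (e.g. "1.2.3" and "raw.xz").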
* add support for using this .v/ logic on the root fs itself: in the initrd,
after mounting the rootfs, look for root-<arch>.v/ in the root fs, and then
apply the logic, moving the switch root logic there.
* systemd-repart: add support for generating ISO9660 images
* systemd-repart: in addition to the existing "factory reset" mode (which
simply empties existing partitions marked for that), add a mode where
partitions marked for it are entirely removed. Usecase: remove secondary OS
copy, and redundant partitions entirely, and recreate them anew.
* systemd-boot: maybe add support for collapsing menu entries of the same OS
into one item that can be opened (like in a "tree view" UI element) or
collapsed. If only a single OS is installed, disable this mode, but if
multiple OSes are installed might make sense to default to it, so that user
is not immediately bombarded with a multitude of Linux kernel versions but
only one for each OS.
* systemd-repart: if the GPT *disk* UUID (i.e. the one global for the entire
disk) is set to all FFFFF then use this as trigger for factory reset, in
addition to the existing mechanisms via EFI variables and kernel command
line. Benefit: works also on non-EFI systems, and can be requested on one
boot, for the next.
* figure out a sane way when building UKIs how to extract SBAT data from inner
kernel, extend it with component info, and add to outer kernel.
* systemd-sysupdate: make transport pluggable, so people can plug casync or
similar behind it, instead of http.
* systemd-tmpfiles: add concept for conditionalizing lines on factory reset
boot, or on first boot.
* in UKIs: add way to define allowlist of additional words that can be added to
the kernel cmdline even in SecureBoot mode
* we probably need .pcrpkeyrd or so as additional PE section in UKIs,
which contains a separate public key for PCR values that only apply in the
initrd, i.e. in the boot phase "enter-initrd". Then, consumers in userspace
can easily bind resources to just the initrd. Similarly, maybe one more
(.pcrpkeybt or so) for "enter-initrd:leave-initrd" for resources that shall
be accessible only before unprivileged user code is allowed. (we only need
this for .pcrpkey, not for .pcrsig, since the latter is a list of signatures
anyway). With that, when you enroll a LUKS volume or similar, pick either the
.pcrpkey (for coverage through all phases of the boot, but excluding
shutdown), the .pcrpkeyrd (for coverage in the initrd only) or the .pcrpkeybt
(for coverage until users are allowed to log in).
* Once the root fs LUKS volume key is measured into PCR 15, default to binding
credentials to PCR 15 in "systemd-creds"
* add support for asymmetric LUKS2 TPM based encryption. i.e. allow preparing
an encrypted image on some host given a public key belonging to a specific
other host, so that only hosts possessing the private key in the TPM2 chip
can decrypt the volume key and activate the volume. Usecase: systemd-confext
for a central orchestrator to generate confext images securely that can only
be activated on one specific host (which can be used for installing a bunch
of creds in /etc/credstore/ for example). Extending on this: allow binding
LUKS2 TPM based encryption also to the TPM2 internal clock. Net result:
prepare a confext image that can only be activated on a specific host that
runs a specific software in a specific time window. confext would be
automatically invalidated outside of it.
* maybe add a "systemd-report" tool, that generates a TPM2-backed "report" of
current system state, i.e. a combination of PCR information, local system
time and TPM clock, running services, recent high-priority log
messages/coredumps, system load/PSI, signed by the local TPM chip, to form an
enhanced remote attestation quote. Usecase: a simple orchestrator could use
this: have the report tool upload these reports every 3min somewhere. Then
have the orchestrator collect these reports centrally over a 3min time
window, and use them to determine which node should now start/stop what,
and generate a small confext for each node, that uses Uphold= to pin services
on each node. The confext would be encrypted using the asymmetric encryption
proposed above, so that it can only be activated on the specific host, if the
software is in a good state, and within a specific time frame. Then run a
loop on each node that sends report to orchestrator and then sysupdate to
update confext. Orchestrator would be stateless, i.e. operate on desired
config and collected reports in the last 3min time window only, and thus can
be trivially scaled up since all instances of the orchestrator should come to
the same conclusions given the same inputs of reports/desired workload info.
Could also be used to deliver WireGuard secrets and suchlike to clients, thus
permitting zero-trust networking: secrets are rolled over via confext updates,
and via the time window TPM logic invalidated if node doesn't keep itself
updated, or becomes corrupted in some way.
* in the initrd, once the rootfs encryption key has been measured to PCR 15,
derive default machine ID to use from it, and pass it to host PID 1.
* tree-wide: convert as much as possible over to use sd_event_set_signal_exit(), instead
of manually hooking into SIGINT/SIGTERM
* tree-wide: convert as much as possible over to SD_EVENT_SIGNAL_PROCMASK
instead of manual blocking.
* sd-boot: for each installed OS, grey out older entries (i.e. all but the
newest), to indicate they are obsolete
* automatically propagate LUKS password credential into cryptsetup from host
(i.e. SMBIOS type #11, …), so that one can unlock LUKS via VM hypervisor
supplied password.
* add ability to path_is_valid() to classify paths that refer to a dir from
those which may refer to anything, and use that in various places to filter
early. i.e. stuff ending in "/", "/." and "/.." definitely refers to a
directory, and paths ending that way can be refused early in many contexts.
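A minimal sketch of the purely syntactic check described above. The function name is hypothetical, and treating a bare "." or ".." as a directory is an addition not mentioned in the item.

```python
def path_must_be_dir(path: str) -> bool:
    """Return True if, judging by syntax alone, the path can only name a directory."""
    # Assumption beyond the item above: a bare "." or ".." is a directory too.
    if path in (".", ".."):
        return True
    # The three cases named in the item: trailing "/", "/." and "/..".
    return path.endswith(("/", "/.", "/.."))
```

A caller that expects a regular file could refuse any path for which this returns True before touching the file system.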
* systemd-measure: allow operating with PEM certificates in addition to PEM
public keys when signing PCR values. SecureBoot and our Verity signatures
operate with certificates already, hence I guess we should, for convenience,
also just deal with certificates for the PCR stuff too.
* systemd-measure: add --pcrpkey-auto as an alternative to --pcrpkey=, where it
would just use the same public key specified with --public-key= (or the one
automatically derived from --private-key=).
* push people to use ".sysext.raw" as suffix for sysext DDIs (DDI =
discoverable disk images, i.e. the new name for gpt disk images following the
discoverable disk spec). [Also: just ".sysext/" for directory-based sysext]
* Add "purpose" flag to partition flags in discoverable partition spec that
indicates if partition is intended for sysext, for portable service, for
booting and so on. Then, when dissecting DDI allow specifying a purpose to
use as additional search condition. Usecase: images that combine a sysext
partition with a portable service partition in one.
* On boot, auto-generate an asymmetric key pair from the TPM,
and use it for validating DDIs and credentials. Maybe upload it to the kernel
keyring, so that the kernel does this validation for us for verity and kernel
modules
* for systemd-confext: add a tool that can generate suitable DDIs with verity +
sig using squashfs-tools-ng's library. Maybe just systemd-repart called under
a new name with a built-in config?
* lock down acceptable encrypted credentials at boot, via simple allowlist,
maybe on kernel command line:
systemd.import_encrypted_creds=foobar.waldo,tmpfiles.extra to protect locked
down kernels from credentials generated on the host with a weak kernel
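A minimal sketch of how such an allowlist could be applied. The option name is the one floated above and does not exist today; plain whitespace splitting and "last occurrence wins" are simplifying assumptions.

```python
from typing import Iterable, List, Optional, Set

# Option name as proposed in the item above; hypothetical.
OPTION = "systemd.import_encrypted_creds="


def allowed_creds(cmdline: str) -> Optional[Set[str]]:
    """Return the allowlisted credential names, or None if no allowlist is set."""
    allowed = None
    for word in cmdline.split():  # assumption: no quoting on the command line
        if word.startswith(OPTION):
            # Assumption: a later occurrence replaces an earlier one.
            allowed = {name for name in word[len(OPTION):].split(",") if name}
    return allowed


def filter_creds(names: Iterable[str], cmdline: str) -> List[str]:
    """Keep only the credentials permitted by the kernel command line."""
    allowed = allowed_creds(cmdline)
    if allowed is None:
        return list(names)  # no allowlist configured: accept everything
    return [name for name in names if name in allowed]
```

Note that an empty value (`systemd.import_encrypted_creds=`) yields an empty set here, i.e. it refuses all encrypted credentials rather than disabling the check.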
* Add support for extra verity configuration options to systemd-repart (FEC,
hash type, etc)
* chase(): take inspiration from path_extract_filename() and return
O_DIRECTORY if input path contains trailing slash.
* chase(): refuse resolution if trailing slash is specified on input,
but final node is not a directory
* document in boot loader spec that symlinks in XBOOTLDR/ESP are not OK even if
non-VFAT fs is used.
* measure credentials picked up from SMBIOS to some suitable PCR
* measure GPT and LUKS headers somewhere when we use them (i.e. in
systemd-gpt-auto-generator/systemd-repart and in systemd-cryptsetup?)
* pick up creds from EFI vars
* Add and pick up tpm2 metadata for creds structure.
* sd-boot: we probably should include all BootXY EFI variable defined boot
entries in our menu, and then suppress ourselves. Benefit: instant
compatibility with all other OSes which register things there, in particular
on other disks. Always boot into them via NextBoot EFI variable, to not
affect PCR values.
* systemd-measure tool:
- pre-calculate PCR 12 (command line) + PCR 13 (sysext) the same way we can precalculate PCR 11
* in sd-boot: load EFI drivers from a new PE section. That way, one can have a
"supercharged" sd-boot binary, that could carry ext4 drivers built-in.
* sd-bus: document that sd_bus_process() only returns messages that none of the
filters/handlers installed on the connection took possession of.
* sd-device: add an API for acquiring list of child devices, given a device
object (i.e. all child dirents that are dirs or symlinks to dirs)
* sd-device: maybe pin the sysfs dir with an fd, during the entire runtime of
an sd_device, then always work based on that.
* add small wrapper around qemu that implements sd_notify/AF_VSOCK + machined and
maybe some other stuff and boots it. Should implement command line roughly
equivalent to nspawn's. Could be called "systemd-vmspawn". Should imply good
settings, i.e. RNG + HyperV enlightenments. Should also result in swtpm
instance, plus virtiofsd instances. Translate credentials into smbios type
11 strings. Correctly translate SIGTERM into ACPI shutdown events.
Listen to logind suspend events and turn these into suspend key pressed +
ACPI resume events.
* maybe add new flags to gpt partition tables for rootfs and usrfs indicating
purpose, i.e. whether something is supposed to be bootable in a VM, on
baremetal, on an nspawn-style container, if it is a portable service image,
or a sysext for initrd, for host os, or for portable container. Then hook
portabled/… up to udev to watch block devices coming up with the flags set, and
use it.
* sd-boot should look for information what to boot in SMBIOS, too, so that VM
managers can tell sd-boot what to boot into and suchlike
* add "systemd-sysext identify" verb, that you can point at any file in /usr/
and that determines from which overlayfs layer it originates, which image, and with
what it was signed.
* journald: generate recognizable log events whenever we shut down journald
cleanly, and when we migrate run → var. This way tools can verify that a
previous boot terminated cleanly, because either of these two messages must
be safely written to disk, then.
* systemd-creds: extend encryption logic to support asymmetric
encryption/authentication. Idea: add new verb "systemd-creds public-key"
which generates a priv/pub key pair on the TPM2 and stores the priv key
locally in /var. It then outputs a certificate for the pub part to stdout.
This can then be copied/taken elsewhere, and can be used for encrypting creds
that only the host on its specific hw can decrypt. Then, support a drop-in
dir with certificates that can be used to authenticate credentials. Flow of
operations is then this: build image with owner certificate, then after
boot up issue "systemd-creds public-key" to acquire pubkey of the machine.
Then, when passing data to the machine, sign with privkey belonging to one of
the dropped-in certs, encrypt with the machine pubkey, and pass to machine.
Machine is then able to authenticate you, and confidentiality is guaranteed.
* building on top of the above, the pub/priv key pair generated on the TPM2
should probably also be one you can use to get a remote attestation quote.
* Process credentials in:
• networkd/udevd: add a way to define additional .link, .network, .netdev files
via the credentials logic.
• crypttab-generator: allow defining additional crypttab-like volumes via
credentials (similar: verity-generator, integrity-generator). Use
fstab-generator logic as inspiration.
• run-generator: allow defining additional commands to run via a credential
• resolved: allow defining additional /etc/hosts entries via a credential (it
might make sense to then synthesize a new combined /etc/hosts file in /run
and bind mount it on /etc/hosts for other clients that want to read it).
• repart: allow defining additional partitions via credential
• timesyncd: pick NTP server info from credential
• portabled: read a credential "portable.extra" or so, that takes a list of
file system paths to enable on start.
• make systemd-fstab-generator look for a system credential encoding root= or
usr=
• systemd-homed: when initializing, look for a credential
systemd.homed.register or so with JSON user records to automatically
register if not registered yet. Usecase: deploy a system, and add an
account one can directly log into.
• in gpt-auto-generator: check partition uuids against such uuids supplied via
sd-stub credentials. That way, we can support parallel OS installations with
pre-built kernels.
* define a JSON format for units, separating out unit definitions from unit
runtime state. Then, expose it:
1. Add Describe() method to Unit D-Bus object that returns a JSON object
about the unit.
2. Expose this natively via Varlink, in similar style
3. Use it when invoking binaries (i.e. make PID 1 fork off systemd-executor
binary which reads the JSON definition and runs it), to address the cow
trap issue and the fact that NSS is actually forbidden in
forked-but-not-exec'ed children
4. Add varlink API to run transient units based on provided JSON definitions
* Add SUPPORT_END_URL= field to os-release with more *actionable* information
what to do if support ended
* pam_systemd: on interactive logins, maybe show SUPPORT_END information at
login time, à la motd
* sd-boot: instead of unconditionally deriving the ESP to search boot loader
spec entries in from the paths of sd-boot binary, let's optionally allow it
to be configured on sd-boot cmdline + efi var. Usecase: embed sd-boot in the
UEFI firmware (for example, ovmf supports that via qemu cmdline option), and
use it to load stuff from the ESP.
* mount /var/ from initrd, so that we can apply sysext and stuff before the
initrd transition. Specifically:
1. There should be a var= kernel cmdline option, matching root= and usr=
2. systemd-gpt-auto-generator should auto-mount /var if it finds it on disk
3. mount.x-initrd mount option in fstab should be implied for /var
* implement varlink introspection
* make persistent restarts easier by adding a new setting OpenPersistentFile=
or so, which allows opening one or more files that are "persistent" across
service restarts, hot reboots, cold reboots (depending on configuration): the
files are created empty on first invocation, and on subsequent invocations
the files are reopened. The files would be backed by tmpfs, pmem or /var
depending on desired level of persistency.
* sd-event: add ability to "chain" event sources. Specifically, add a call
sd_event_source_chain(x, y), which will automatically enable event source y
in oneshot mode once x is triggered. Use case: in src/core/mount.c implement
the /proc/self/mountinfo rescan on SIGCHLD with this: whenever a SIGCHLD is
seen, trigger the rescan defer event source automatically, and allow it to be
dispatched *before* the SIGCHLD is handled (based on priorities). Benefit:
dispatch order is strictly controlled by priorities again. (next step: chain
event sources to the ratelimit being over)
* if we fork off a service with StandardOutput=journal, and it forks off a
subprocess that quickly dies, we might not be able to identify the cgroup it
comes from, but we can still derive that from the stdin socket its output
came from. We apparently don't do that right now.
* add ability to set hostname with suffix derived from machine id at boot
* add PR_SET_DUMPABLE service setting
* homed/userdb: maybe define a "companion" dir for home directories where apps
can safely put privileged stuff in. Would not be writable by the user, but
still conceptually belong to the user. Would be included in user's quota if
possible, even if files are not owned by UID of user. Usecase: container
images that are owned by arbitrary UIDs, and are managed by the users, but
do not directly belong to the user's UID. Goal: we shouldn't place more
privileged dirs inside of unprivileged dirs, and thus containers really
should not be placed inside of traditional UNIX home dirs (which are owned by
users themselves) but somewhere else, that is separate, but still close
by. Inform user code about path to this companion dir via env var, so that
container managers find it. the ~/.identity file is also a candidate for a
file to move there, since it is managed by privileged code (i.e. homed) and
not unprivileged code.
* given that /etc/ssh/ssh_config.d/ is a thing now, ship a drop-in for that
that hooks up userdbctl ssh-key stuff.
* maybe add support for binding and connecting AF_UNIX sockets in the file
system outside of the 108ch limit. When connecting, open O_PATH fd to socket
inode first, then connect to /proc/self/fd/XYZ. When binding, create symlink
to target dir in /tmp, and bind through it.
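A minimal, Linux-only sketch of the two tricks described above. The helper names are hypothetical; the code assumes /proc is mounted and that the final file name itself is short.

```python
import os
import socket
import tempfile


def bind_long(sock: socket.socket, path: str) -> None:
    """Bind an AF_UNIX socket to a path that may exceed the sun_path limit."""
    directory, name = os.path.split(os.path.abspath(path))
    with tempfile.TemporaryDirectory(dir="/tmp") as tmp:
        link = os.path.join(tmp, "d")
        os.symlink(directory, link)  # short alias for the long directory
        # The kernel resolves the symlink at bind time, so the socket inode
        # ends up in the real directory; the alias is removed afterwards.
        sock.bind(os.path.join(link, name))


def connect_long(sock: socket.socket, path: str) -> None:
    """Connect to an AF_UNIX socket whose path may exceed the sun_path limit."""
    fd = os.open(path, os.O_PATH)  # pin the socket inode without opening it
    try:
        sock.connect(f"/proc/self/fd/{fd}")
    finally:
        os.close(fd)
```

One side effect of this approach: getsockname() on the bound socket reports the temporary alias path rather than the real one, so callers that need the real path have to remember it themselves.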
* add a proper concept of a "developer" mode, i.e. where cryptographic
protections of the root OS are weakened after interactive confirmation, to
allow hackers to run their own stuff. idea: allow entering developer mode
only via explicit choice in boot menu: i.e. add explicit boot menu item for
it. When developer mode is entered, generate a key pair in the TPM2, and add
the public part of it automatically to keychain of valid code signature keys
on subsequent boots. Then provide a tool to sign code with the key in the
TPM2. Ensure that boot menu item is the only way to enter developer mode, by
binding it to locality/PCRs so that keys cannot be generated otherwise.
* services: add support for cryptographically unlocking per-service directories
via TPM2. Specifically, for StateDirectory= (and related dirs) use fscrypt to
set up the directory so that it can only be accessed if host and app are in
order.
* TPM2: extend unlock policy to protect against version downgrades in signed
policies: policy probably must take some nvram based generation counter into
account that can only monotonically increase and can be used to invalidate
old PCR signatures. Otherwise people could downgrade to old signed PCR sets
whenever they want.
* update HACKING.md to suggest developing systemd with the ideas from:
https://0pointer.net/blog/testing-my-system-code-in-usr-without-modifying-usr.html
https://0pointer.net/blog/running-an-container-off-the-host-usr.html
* sd-event: combat wd reuse in inotify code: keep a set of removed watch
descriptors, and clear this set piecemeal when we see the IN_IGNORED event
for it, or when read() returns EAGAIN or on IN_Q_OVERFLOW. Then, whenever we
see an inotify wd event check against this set, and if it is contained ignore
the event. (to be fully correct this would have to count the occurrences, in
case the same wd is reused multiple times before we start processing
IN_IGNORED again)
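A minimal sketch of the bookkeeping described above, including the per-wd counting from the parenthetical. The class and method names are hypothetical; wiring it into an actual inotify read loop is left out.

```python
from collections import Counter


class RemovedWatches:
    """Tracks removed watch descriptors whose IN_IGNORED has not been seen yet."""

    def __init__(self) -> None:
        # wd -> number of IN_IGNORED events still expected for it
        self._pending = Counter()

    def removed(self, wd: int) -> None:
        """Call right after the watch for wd was removed."""
        self._pending[wd] += 1

    def should_ignore(self, wd: int, is_ignored_event: bool) -> bool:
        """Call for each event read; True means it belongs to a removed watch."""
        if self._pending.get(wd, 0) == 0:
            return False
        if is_ignored_event:
            # One removal is now fully accounted for.
            self._pending[wd] -= 1
            if self._pending[wd] == 0:
                del self._pending[wd]
        return True

    def queue_drained(self) -> None:
        """Call when read() returns EAGAIN or on IN_Q_OVERFLOW."""
        self._pending.clear()
```

Because events are queued in order, anything seen for a wd before its outstanding IN_IGNORED events have all arrived must stem from a removed watch, which is why a plain counter suffices.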
* for vendor-built signed initrds:
- kernel-install should be able to install encrypted creds automatically for
machine id, root pw, rootfs uuid, resume partition uuid, and place next to
EFI kernel, for sd-stub to pick them up. These creds should be locked to
the TPM, and bind to the right PCR the kernel is measured to.
- kernel-install should be able to pick up initrd sysexts automatically and
place them next to EFI kernel, for sd-stub to pick them up.
- systemd-fstab-generator should look for rootfs device to mount in creds
- systemd-resume-generator should look for resume partition uuid in creds
- sd-stub: automatically pick up microcode from ESP (/loader/microcode/*)
and synthesize initrd from it, and measure it. Signing is not necessary, as
microcode does that on its own. Pass as first initrd to kernel.
* Maybe extend the service protocol to support handling of some specific SIGRT
signal for setting service log level, that carries the level via the
sigqueue() data parameter. Enable this via unit file setting.
* sd_notify/vsock: maybe support binding to AF_VSOCK in Type=notify services,
then passing $NOTIFY_SOCKET and $NOTIFY_GUESTCID with PID1's cid (typically
fixed to "2", i.e. the official host cid) and the expected guest cid, for the
two sides of the channel. The latter env var could then be used in an
appropriate qemu cmdline. That way qemu payloads could talk sd_notify()
directly to host service manager.
* sd-boot: add menu item for shutdown? or hotkey?
* sd-device has an API to create an sd_device object from a device id, but has
no api to query the device id
* sd-device should return the devnum type (i.e. 'b' or 'c') via some API for an
sd_device object, so that data passed into sd_device_new_from_devnum() can
also be queried.
* sd-event: optionally, if per-event source rate limit is hit, downgrade
priority, but leave enabled, and once ratelimit window is over, upgrade
priority again. That way we can combat event source starvation without
stopping processing events from one source entirely.
* sd-event: similar to existing inotify support add fanotify support (given
that apparently new features in this area are only going to be added to the
latter).
* sd-event: add 1st class event source for clock changes
* sd-event: add 1st class event source for timezone changes
* support uefi/http boots with sd-boot: instead of looking for dropin files in
/loader/entries/ dir, look for a file /loader/entries/SHA256SUMS and use that
as directory manifest. The file would be a standard directory listing as
generated by GNU sha256sum.
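A minimal sketch of reading such a manifest and checking a downloaded file against it. It only handles the plain "<digest>  <name>" and "<digest> *<name>" line forms; lines in any other form (such as escaped file names) are skipped, which is a simplification.

```python
import hashlib
import re
from typing import Dict

# 64 hex digits, a space, then " " (text mode) or "*" (binary mode), then the name.
_LINE = re.compile(r"^([0-9a-f]{64}) [ *](.+)$")


def parse_manifest(text: str) -> Dict[str, str]:
    """Map file name -> hex SHA-256 digest from sha256sum-style output."""
    entries = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            entries[m.group(2)] = m.group(1)
    return entries


def verify(name: str, data: bytes, manifest: Dict[str, str]) -> bool:
    """Check that data matches the digest the manifest lists for name."""
    return manifest.get(name) == hashlib.sha256(data).hexdigest()
```

The manifest doubles as the directory listing: its keys are the entry files to fetch, and each fetched file can be verified before it is parsed.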
* sd-boot: maybe add support for embedding the various auxiliary resources we
look for right in the sd-boot binary. i.e. take inspiration from sd-stub
logic: allow combining sd-boot via ukify with kernels to enumerate, .conf
files, drivers, keys to enroll and so on. Then, add whatever we find that way
to the menu. Usecase: allow building a single PE image you can boot into via
UEFI HTTP boot.