qmemman inflating a qube not in need, starving the system #9627

Open
ydirson opened this issue Dec 5, 2024 · 3 comments
Labels
affects-4.2 This issue affects Qubes OS 4.2. C: core diagnosed Technical diagnosis has been performed (see issue comments). P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. r4.2-host-cur-test T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments


ydirson commented Dec 5, 2024

Qubes OS release: 4.2

symptoms

After running for "a long time" (more than a week; in this case it happened after 13 days of uptime), it becomes impossible for me to launch any new VM. When launching from the GUI, the situation is a bit confusing:

journalctl shows:

internal error: libxenlight failed to create new domain

along with various other complaints about lack of memory. The following may be of interest:

Dec 05 22:59:53 dom0 libvirtd[47037]: internal error: libxenlight failed to create new domain 'disp6819'
Dec 05 22:59:53 dom0 qubesd[2378]: vm.disp6819: Start failed: internal error: libxenlight failed to create new domain 'disp6819'
Dec 05 22:59:55 dom0 qubesd[2378]: Removing appmenus for 'disp6819' in 'dom0'
Dec 05 22:59:55 dom0 qmemman.systemstate[2375]: Xen free = 515664944 too small for satisfy assignments! assigned_but_unused=520440604, domdict={'0': {'memory_current': 4240236544, 'memory_actual': 4294967296, 'memory_maximum': 4294967296, 'mem_used': 2638827520, 'id': '0', 'last_target': 4294967296, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': False}, '3': {'memory_current': 941056000, 'memory_actual': 1026011383, 'memory_maximum': 4194304000, 'mem_used': 331497472, 'id': '3', 'last_target': 1026011383, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': True}, '5': {'memory_current': 297840640, 'memory_actual': 297840640, 'memory_maximum': 314572800, 'mem_used': None, 'id': '5', 'last_target': 297795584, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': False}, '6': {'memory_current': 150994944, 'memory_actual': 150994944, 'memory_maximum': 150994944, 'mem_used': None, 'id': '6', 'last_target': 150994944, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': False}, '9': {'memory_current': 1379766272, 'memory_actual': 1467832325, 'memory_maximum': 4194304000, 'mem_used': 545517568, 'id': '9', 'last_target': 1467832325, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': True}, '11': {'memory_current': 2856976384, 'memory_actual': 2955095759, 'memory_maximum': 6291456000, 'mem_used': 1265954816, 'id': '11', 'last_target': 2955095759, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': True}, '12': {'memory_current': 4120576000, 'memory_actual': 4227287413, 'memory_maximum': 10485760000, 'mem_used': 1882210304, 'id': '12', 'last_target': 4227287413, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': True}, '51': {'memory_current': 402698240, 'memory_actual': 402698240, 'memory_maximum': 419430400, 'mem_used': None, 'id': '51', 'last_target': 402653184, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': False}, '52': {'memory_current': 150994944, 'memory_actual': 150994944, 'memory_maximum': 150994944, 'mem_used': None, 'id': '52', 'last_target': 150994944, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': False}, '53': {'memory_current': 948469760, 'memory_actual': 1036327388, 'memory_maximum': 4194304000, 'mem_used': 336494592, 'id': '53', 'last_target': 1036327388, 'use_hoplug': False, 'no_progress': False, 'slow_memset_react': False, 'use_hotplug': True}}
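
The log line above is qmemman refusing to hand out memory: Xen's free memory (515664944 bytes) is already smaller than what has been assigned to running domains but not yet consumed by them (assigned_but_unused=520440604), so nothing is left for a new domain. A minimal sketch of such an admission check, with illustrative names (this is not qmemman's actual code):

def can_satisfy_assignments(xen_free: int, domdict: dict) -> bool:
    """True if Xen's free memory covers what is promised to domains
    (last_target) but not yet consumed by them (memory_current)."""
    assigned_but_unused = sum(
        max(0, d['last_target'] - d['memory_current'])
        for d in domdict.values()
    )
    return xen_free >= assigned_but_unused

# With the values from the log: 515664944 < 520440604 -> False,
# so the request to populate 'disp6819' cannot be satisfied.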

From the CLI, however, the user gets more useful feedback, clearly pointing to a lack of memory:

[dom0 ~]$ qvm-run --dispvm=debian-dvm xterm
Running 'xterm' on $dispvm:debian-dvm
$dispvm:debian-dvm: Start failed: internal error: libxenlight failed to create new domain 'disp8732', see /var/log/libvirt/libxl/libxl-driver.log for details
Traceback (most recent call last):
  File "/usr/bin/qvm-run", line 5, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.11/site-packages/qubesadmin/tools/qvm_run.py", line 358, in main
    dispvm.cleanup()
  File "/usr/lib/python3.11/site-packages/qubesadmin/vm/__init__.py", line 427, in cleanup
    self.kill()
  File "/usr/lib/python3.11/site-packages/qubesadmin/vm/__init__.py", line 123, in kill
    self.qubesd_call(self._method_dest, 'admin.vm.Kill')
  File "/usr/lib/python3.11/site-packages/qubesadmin/base.py", line 76, in qubesd_call
    return self.app.qubesd_call(dest, method, arg, payload,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/qubesadmin/app.py", line 789, in qubesd_call
    return self._parse_qubesd_response(return_data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/qubesadmin/base.py", line 111, in _parse_qubesd_response
    raise exc_class(format_string, *args)
qubesadmin.exc.QubesVMNotFoundError: No such domain: 'disp8732'
2024-12-05 22:32:17.892+0000: libxl: libxl_create.c:770:libxl__domain_make: domain creation fail: Cannot allocate memory
2024-12-05 22:32:17.892+0000: libxl: libxl_create.c:1372:initiate_domain_create: cannot make domain: -3

I can't say for sure that this started with the upgrade to 4.2, but it happens quite systematically every time I leave the system running for that long.

Since I'm usually in a hurry to use the computer, a reboot was my only way to get things straight again.

investigation

What I see right now is one VM (an HVM in this case) holding a lot of free RAM, yet qmemman refuses to reclaim it:

[dom0 ~]$ xenstore-ls /local/domain/12/memory
static-max = "4130953"
target = "4114569"
videoram = "0"
meminfo = "1838096"
hotplug-max = "10240000"

guest$ free
               total        used        free      shared  buff/cache   available
Mem:         4051436     1393932     1792600       33256     1181824     2657504
Swap:        1048572      583520      465052

If I stop qmemman, I can direct the guest's balloon driver to inflate:

[dom0 ~]$ systemctl stop qubes-qmemman
[dom0 ~]$ xenstore-write /local/domain/12/memory/target 2000000
[dom0 ~]$ xenstore-ls /local/domain/12/memory
static-max = "4130953"
target = "2000000"
videoram = "0"
meminfo = "1796012"
hotplug-max = "10240000"

guest$ free
               total        used        free      shared  buff/cache   available
Mem:         1936868     1342260      235328       32972      683056      594608
Swap:        1048572      603432      445140

But then Qubes won't start a new VM while qmemman is not running, and as soon as I start it again it hands the freed memory right back to the guest, so launching a new VM fails again for lack of memory:

[dom0 ~]$ systemctl start qubes-qmemman

guest$ free
               total        used        free      shared  buff/cache   available
Mem:         4011536     1345732     2306120       32972      683288     2665804
Swap:        1048572      603432      445140

When qmemman is running, the xenstore key does keep my value for ~1 s after I change it, but that window does not appear to be enough to launch a VM before qmemman sets the target back.
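
For reference, the whole stop/write/observe experiment above can be scripted from dom0. A minimal sketch, assuming domid 12 and the target value used above, and that systemctl, xenstore-write, and xenstore-read are on PATH (illustrative only, not a recommended workaround):

import subprocess
import time

DOMID = "12"
TARGET_KIB = "2000000"  # ~2 GiB, as in the manual experiment above
KEY = f"/local/domain/{DOMID}/memory/target"

# Stop qmemman so it does not immediately overwrite our target.
subprocess.run(["systemctl", "stop", "qubes-qmemman"], check=True)
# Ask the guest's balloon driver to inflate down to TARGET_KIB.
subprocess.run(["xenstore-write", KEY, TARGET_KIB], check=True)
time.sleep(2)  # give the balloon driver a moment to react
out = subprocess.run(["xenstore-read", KEY],
                     capture_output=True, text=True, check=True)
print("target is now:", out.stdout.strip())
# Restarting qubes-qmemman at this point raises the target right
# back, reproducing the ~1 s window described above.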

ydirson added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Dec 5, 2024

marmarek commented Dec 5, 2024

Looks very similar to #9431, let me backport the fix.

andrewdavidwong added C: core diagnosed Technical diagnosis has been performed (see issue comments). affects-4.2 This issue affects Qubes OS 4.2. labels Dec 6, 2024
marmarek added a commit to QubesOS/qubes-core-admin that referenced this issue Dec 8, 2024
Any memory adjustments must be done while holding a lock, to not
interfere with client request handling. This is critical to prevent
memory just freed for a new VM being re-allocated elsewhere.
The domain_list_changed() function failed to do that - do_balance call
was done after releasing the lock.

It wasn't a problem for a long time because of Python's global interpreter
lock. But Python 3.13 is finally starting to support proper parallel
thread execution, and it revealed this bug.

Fixes QubesOS/qubes-issues#9431

(cherry picked from commit 2de9eb7)

Fixes QubesOS/qubes-issues#9627
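
In short, the fix moves the balancing step back inside the critical section. A simplified sketch of the buggy and fixed shapes (illustrative, not the actual qubes-core-admin code):

import threading

class SystemState:
    """Stand-in for qmemman's system state (illustrative)."""

    def __init__(self):
        self.lock = threading.Lock()

    def refresh_domain_list(self):
        """Re-read the list of running domains (details elided)."""

    def do_balance(self):
        """Redistribute memory among domains (details elided)."""

    def domain_list_changed_buggy(self):
        with self.lock:
            self.refresh_domain_list()
        # Lock already released: a concurrent client request can
        # re-allocate memory that was just freed for a new VM.
        self.do_balance()

    def domain_list_changed_fixed(self):
        with self.lock:
            self.refresh_domain_list()
            # Balance while still holding the lock, so client
            # request handling cannot interleave.
            self.do_balance()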

ydirson commented Dec 10, 2024

Thanks, testing this!


ydirson commented Dec 11, 2024

Right now, I'm getting "Not enough memory to start domain" while 2 domains have 1 GiB of free memory each.
