Description of problem:
Failed to migrate a VM with the error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported

Version-Release number of selected component (if applicable):
openstack-nova-compute-23.2.2-0.20220705171705.7074ac0.el9ost.noarch
libvirt-daemon-driver-qemu-8.0.0-8.1.el9_0.x86_64
qemu-kvm-6.2.0-11.el9_0.3.x86_64
kernel: 5.14.0-70.17.1.el9_0.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Installed OSP17.0 (RHEL9.0) with the job below; the env uses local storage.
custom-17.0_compact-director-rhel-9.0-virthost-3cont_2comp-ipv4-gre-lvm #35

2. Created the image, network and flavor
(overcloud) [stack@undercloud-0 ~]$ openstack image create r9-qcow2 --disk-format qcow2 --container-format bare --file RHEL-9.0.0-20220429.1-x86_64.qcow2
(overcloud) [stack@undercloud-0 ~]$ openstack image list| grep r9-qcow2
| de713510-def9-46b6-a8e6-0eecb434f644 | r9-qcow2 | active |
(overcloud) [stack@undercloud-0 ~]$ openstack network create private
(overcloud) [stack@undercloud-0 ~]$ openstack subnet create --network private private_subnet --allocation-pool start=192.168.32.2,end=192.168.32.245 --dhcp --gateway=192.168.32.1 --subnet-range 192.168.32.0/24
(overcloud) [stack@undercloud-0 ~]$ openstack network list| grep private
| b1333c24-e41c-46a4-98ed-d185d5df6d2f | private | 25793595-e580-487d-bc78-5295d2250033 |
(overcloud) [stack@undercloud-0 ~]$ openstack flavor create m1.small --ram 512 --disk 10 --vcpus 1

3. Created the VM from the image successfully; it was running on compute-1.
(overcloud) [stack@undercloud-0 ~]$ openstack server create --flavor m1.small --image r9-qcow2 --nic net-id=b1333c24-e41c-46a4-98ed-d185d5df6d2f vm-r9
(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+-------------+--------+-----------------------------------+----------+----------+
| ID                                   | Name        | Status | Networks                          | Image    | Flavor   |
+--------------------------------------+-------------+--------+-----------------------------------+----------+----------+
| b202094d-218b-4679-9850-1f072b582cd7 | vm-r9       | ACTIVE | private=192.168.32.72             | r9-qcow2 | m1.small |
(overcloud) [stack@undercloud-0 ~]$ openstack server show vm-r9
+-------------------------------------+------------------------+
| Field                               | Value                  |
+-------------------------------------+------------------------+
| OS-DCF:diskConfig                   | MANUAL                 |
| OS-EXT-AZ:availability_zone         | nova                   |
| OS-EXT-SRV-ATTR:host                | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:hostname            | vm-r9                  |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:instance_name       | instance-000001c1      |
......
| OS-EXT-STS:power_state              | Running                |

4. Tried to live migrate the VM; the command returned "Complete", but the VM was still on compute-1.
The virtqemud.log on compute-1 contains the error:
"error : virNetClientProgramDispatchError:172 : internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported".
More logs in files: live-migrate-source.log, live-migrate-target.log
(overcloud) [stack@undercloud-0 ~]$ openstack server migrate --live-migration vm-r9 --wait
The --disk-overcommit and --no-disk-overcommit options are only supported by --os-compute-api-version 2.24 or below; this will be an error in a future release
Complete

5. Tried to live block migrate the VM; the command returned "Complete", but the VM was still on compute-1.
The virtqemud.log on compute-1 contains the error:
"error : virNetClientProgramDispatchError:172 : internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported"
More logs in files: live-block-migrate-source.log, live-block-migrate-target.log
(overcloud) [stack@undercloud-0 ~]$ openstack server migrate --live-migration --block-migration vm-r9 --wait
The --disk-overcommit and --no-disk-overcommit options are only supported by --os-compute-api-version 2.24 or below; this will be an error in a future release
Complete

6. Checked the sysctl vm.unprivileged_userfaultfd setting on compute-1 and compute-0
[root@compute-1 /]# sysctl -a|grep vm.unprivileged_userfaultfd
vm.unprivileged_userfaultfd = 0
[root@compute-0 /]# sysctl -a|grep vm.unprivileged_userfaultfd
vm.unprivileged_userfaultfd = 0

7. Set vm.unprivileged_userfaultfd = 1 on compute-1 and compute-0

8. Live block migrated the VM; the command returned "Complete", and the VM was migrated from compute-1 to compute-0 successfully.

9. Postcopy requires trapping page faults from kernel code, so in RHEL9 vm.unprivileged_userfaultfd needs to be set to 1 during the postcopy phase. Libvirt enables unprivileged access to userfaultfd before starting post-copy migration; it sets the sysctl knob at runtime once post-copy migration is requested.
- Bug 1945420 - [RHEL9] Setup vm.unprivileged_userfaultfd for postcopy: since libvirt-8.0.0-0rc1.1.el9
- [libvirt PATCH] qemu: Enable unprivileged userfaultfd for post-copy migration

Actual results:
In step4 and step5, the error below appears in virtqemud.log and the VM is not migrated to the target compute node.
"error : virNetClientProgramDispatchError:172 : internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported"

Expected results:
In step4 and step5: No error in virtqemud.log
In step5: VM is migrated to target compute node

Additional info:
Log files on the source and target compute nodes, collected while running step4 and step5:
- live-migrate-source.log, live-migrate-target.log
- live-block-migrate-source.log, live-block-migrate-target.log
Hi Jiri,

I have a testing environment that you can use if needed. Could you please help check whether libvirt needs a code change or not? Many thanks!
> Libvirt enables unprivileged access to userfaultfd before starting post-copy
> migration; it sets the sysctl knob at runtime once post-copy migration is
> requested.

The first version of the libvirt patch was implemented this way, but the final patch, which was actually pushed and is part of RHEL-9, works differently. Libvirt just installs the /usr/lib/sysctl.d/60-qemu-postcopy-migration.conf file, which systemd is supposed to apply when the system boots. Can you check that the file exists and contains "vm.unprivileged_userfaultfd = 1"?

Also, the setting might be overridden by something else in /usr/lib/sysctl.d/, /run/sysctl.d/, or /etc/sysctl.d/. Can you check that vm.unprivileged_userfaultfd is not set there by anything but libvirt's conf file?

Also, did you reboot the hosts after installing libvirt? I believe sysctl files are only applied on boot.
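To make the override check above concrete, here is a rough Python sketch of how the effective value of a key could be worked out from the sysctl.d directories. It follows the ordering described in sysctl.d(5) as I understand it: a file in a higher-priority directory replaces a same-named file in a lower one, and the remaining files are then applied in lexicographic filename order, so the latest name wins. The function names and the 99-local.conf file used in the usage note are hypothetical, and /etc/sysctl.conf is ignored for simplicity.

```python
from pathlib import Path

# Directories in decreasing priority, per sysctl.d(5).
SYSCTL_DIRS = ["/etc/sysctl.d", "/run/sysctl.d",
               "/usr/local/lib/sysctl.d", "/usr/lib/sysctl.d"]


def parse_conf(text):
    """Parse "key = value" lines, skipping blanks and # or ; comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # A leading "-" only means "ignore errors for this key"; drop it.
        settings[key.strip().lstrip("-").strip()] = value.strip()
    return settings


def effective_setting(key, dirs):
    """Return (value, source_file) for key, or None if nothing sets it.

    dirs is a list of (dirname, {filename: text}) in decreasing priority.
    """
    chosen = {}
    # Walk from lowest to highest priority so that a same-named file in a
    # higher-priority directory replaces the lower-priority one.
    for dirname, files in reversed(dirs):
        for name, text in files.items():
            if name.endswith(".conf"):
                chosen[name] = (dirname, text)
    result = None
    # Files are applied in filename order; the last assignment wins.
    for name in sorted(chosen):
        dirname, text = chosen[name]
        settings = parse_conf(text)
        if key in settings:
            result = (settings[key], f"{dirname}/{name}")
    return result


def load_dirs(paths=SYSCTL_DIRS):
    """Read the *.conf files from the real directories on this host."""
    dirs = []
    for path in paths:
        files = {}
        directory = Path(path)
        if directory.is_dir():
            for conf in directory.glob("*.conf"):
                try:
                    files[conf.name] = conf.read_text()
                except OSError:
                    pass  # unreadable file: skip it in this sketch
        dirs.append((path, files))
    return dirs
```

Calling effective_setting("vm.unprivileged_userfaultfd", load_dirs()) on a compute host should name the file that wins. For example, a hypothetical /etc/sysctl.d/99-local.conf that sets the knob to 0 would win over 60-qemu-postcopy-migration.conf because its name sorts later.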
Oh, libvirt runs in a container here. I believe the sysctl knob should be set on the host itself rather than in a container. I guess libvirt (and the sysctl conf file) is only installed in the container, which means OpenStack would need to make sure the host is properly set up by itself.
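For checking the host side, a small Python sketch along these lines could be run on each compute node, outside the container. It reads the runtime value of the knob from /proc/sys/vm/unprivileged_userfaultfd (the procfs path behind the sysctl name) and looks for the error string from this report in a log passed in as text. The helper names are made up for illustration, and the location of virtqemud.log is not assumed here.

```python
from pathlib import Path

# Error signature quoted in this report.
POSTCOPY_ERROR = "'migrate-set-capabilities': Postcopy is not supported"
# procfs file behind the vm.unprivileged_userfaultfd sysctl.
KNOB_PATH = "/proc/sys/vm/unprivileged_userfaultfd"


def find_postcopy_errors(log_text):
    """Return the log lines that contain the postcopy error."""
    return [line for line in log_text.splitlines() if POSTCOPY_ERROR in line]


def read_knob(path=KNOB_PATH):
    """Return the runtime value as a string, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def diagnose(log_text, knob_value):
    """Combine the log check and the knob value into a short verdict."""
    if not find_postcopy_errors(log_text):
        return "no postcopy error in log"
    if knob_value == "0":
        return "postcopy error and vm.unprivileged_userfaultfd is 0 on this host"
    return "postcopy error but vm.unprivileged_userfaultfd is not 0; look elsewhere"
```

A typical call would be diagnose(log_text, read_knob()), with log_text holding the contents of virtqemud.log from the source compute node.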
Thanks Jiri!

Deployed a new env with the job below, using the latest OSP build: RHOS-17.0-RHEL-9-20220823.n.2
custom-17.0_compact-director-rhel-9.0-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph #35

Reran the steps in the Description; the error no longer occurs. This bug is fixed in the latest OSP build.

The openstack packages:
- openstack-tripleo-heat-templates-14.3.1-0.20220719171722.feca772.el9ost.noarch: no error; compute node has "vm.unprivileged_userfaultfd = 1"
- openstack-tripleo-heat-templates-14.3.1-0.20220719171711.feca772.el9ost.noarch: with the error; compute node has "vm.unprivileged_userfaultfd = 0"

Check the vm.unprivileged_userfaultfd on compute-0, outside of the nova_virtqemud container:
[heat-admin@compute-0 ~]$ sudo sysctl -a|grep vm.unprivileged_userfaultfd
vm.unprivileged_userfaultfd = 1

More details:
- Step 1-3: create the VM on compute-0
Check the vm.unprivileged_userfaultfd on compute-0, outside of the nova_virtqemud container:
[heat-admin@compute-0 ~]$ sudo sysctl -a|grep vm.unprivileged_userfaultfd
vm.unprivileged_userfaultfd = 1
[heat-admin@compute-0 ~]$ ls /usr/lib/sysctl.d/
10-default-yama-scope.conf  50-coredump.conf  50-default.conf  50-libkcapi-optmem_max.conf  50-pid-max.conf  50-redhat.conf  README

- Step4: Live migrated the VM successfully; the VM migrated to compute-1
(overcloud) [stack@undercloud-0 ~]$ openstack server migrate --live-migration vm-r9 --wait
The --disk-overcommit and --no-disk-overcommit options are only supported by --os-compute-api-version 2.24 or below; this will be an error in a future release
Complete
(overcloud) [stack@undercloud-0 ~]$ openstack server show vm-r9
+-------------------------------------+------------------------+
| Field                               | Value                  |
+-------------------------------------+------------------------+
| OS-DCF:diskConfig                   | MANUAL                 |
| OS-EXT-AZ:availability_zone         | nova                   |
| OS-EXT-SRV-ATTR:host                | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:hostname            | vm-r9                  |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.redhat.local |

- Step5: Live block migrated the VM; the VM is still running on the source compute node, and the expected error appears in "/var/log/containers/nova/nova-compute.log":
"default default] Exception during message handling: nova.exception.InvalidLocalStorage: compute-1.redhat.local is not on local storage: Block migration can not be used with shared storage."
(overcloud) [stack@undercloud-0 ~]$ openstack server migrate --live-migration --block-migration vm-r9 --wait
The --disk-overcommit and --no-disk-overcommit options are only supported by --os-compute-api-version 2.24 or below; this will be an error in a future release
Complete
(overcloud) [stack@undercloud-0 ~]$ openstack server show vm-r9
+-------------------------------------+------------------------+
| Field                               | Value                  |
+-------------------------------------+------------------------+
| OS-DCF:diskConfig                   | MANUAL                 |
| OS-EXT-AZ:availability_zone         | nova                   |
| OS-EXT-SRV-ATTR:host                | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:hostname            | vm-r9                  |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:instance_name       | instance-000001ac      |
| OS-EXT-SRV-ATTR:kernel_id           |                        |
| OS-EXT-SRV-ATTR:launch_index        | 0                      |
| OS-EXT-SRV-ATTR:ramdisk_id          |                        |
| OS-EXT-SRV-ATTR:reservation_id      | r-uzg0xobd             |
| OS-EXT-SRV-ATTR:root_device_name    | /dev/vda               |
| OS-EXT-SRV-ATTR:user_data           | None                   |
| OS-EXT-STS:power_state              | Running                |
| OS-EXT-STS:task_state               | None                   |
| OS-EXT-STS:vm_state                 | active                 |
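Since "openstack server migrate ... --wait" printed "Complete" both when the VM moved and when it stayed put, the result has to be confirmed from the OS-EXT-SRV-ATTR:host field. A minimal Python sketch of that comparison, working on the table text printed by "openstack server show" before and after the migration, might look like this. The function names are hypothetical and the parser only handles the two-column Field/Value layout shown above.

```python
def parse_show_table(text):
    """Turn the two-column Field/Value table into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border lines ("+---+") and anything else
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 2 and cells[0] and cells[0] != "Field":
            fields[cells[0]] = cells[1]
    return fields


def migration_moved_vm(before, after, key="OS-EXT-SRV-ATTR:host"):
    """True only if the host field is present in both outputs and changed."""
    source = parse_show_table(before).get(key)
    target = parse_show_table(after).get(key)
    return source is not None and target is not None and source != target
```

With the step 4 outputs above, the host changes from compute-0 to compute-1, so the check would report a move; with the step 5 outputs the host stays compute-1.redhat.local, so it would not.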
*** This bug has been marked as a duplicate of bug 2110556 ***