Bug 1755801
Summary: [Cinderlib] Managed Block Storage: Live Migration Problems with ceph
Product: [oVirt] vdsm
Component: General
Version: 4.30.34
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Reporter: Dan Poltawski <dan.poltawski>
Assignee: shani <sleviim>
QA Contact: sshmulev
CC: ahadas, bugs, bzlotnik, dan.poltawski, sfishbai, sleviim, tnisan
Flags: pm-rhel: ovirt-4.5?
Target Milestone: ovirt-4.5.0
Target Release: 4.50.0.9
Fixed In Version: vdsm-4.50.0.9
oVirt Team: Storage
Type: Bug
Last Closed: 2022-03-16 10:05:33 UTC
Description
Dan Poltawski
2019-09-26 08:47:15 UTC
Created attachment 1619420 [details]
/var/log/messages
Created attachment 1619421 [details]
vdsm.log
I've attached various logs, but I am afraid there is a lot of noise from other issues we're working through. Not sure if https://bugzilla.redhat.com/show_bug.cgi?id=1750417 might be related to this.

Benny, if I'm not mistaken LSM is not supported with MBS, isn't it?

(In reply to Tal Nisan from comment #5)
> Benny, if I'm not mistaken LSM is not supported with MBS, isn't it?

Yes. This bug is about VM live migration, which is supported; it can be resolved by blacklisting rbd devices from multipath, so I'm moving this to vdsm.

Just to mention that when I add this blacklist configuration with one specific guest (a fortinet appliance, which I think has an lvm-ish filesystem) we are unable to boot the VM.

We also have a fairly persistent problem with rbd devices continuing to be mounted after live migration, even with this blacklist configuration applied.

(In reply to Dan Poltawski from comment #7)
> Just to mention that when I add this blacklist configuration with one
> specific guest (a fortinet appliance, which I think has an lvm-ish
> filesystem) we are unable to boot the VM.
>
> We also have a fairly persistent problem with rbd devices continuing to be
> mounted after live migration, even with this blacklist configuration applied.

Can you share more details on what errors you see when trying to boot?
Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?

> Can you share more details on what errors you see when trying to boot?

It also causes migrations to fail, for the record. My suspicion from the logs is that it's related to an LVM scan which is somehow prevented by the multipath errors?

When it fails, in the Event log UI I get:

VM fmg00.tnp.net.uk was started by poltawski@internal-authz (Host: tails.ma1.tnp.infra).
User <UNKNOWN> got disconnected from VM fmg00.tnp.net.uk.
VM fmg00.tnp.net.uk is down with error. Exit message: 'path'.
Failed to run VM fmg00.tnp.net.uk on Host tails.ma1.tnp.infra.
Failed to run VM fmg00.tnp.net.uk (User: poltawski@internal-authz).

From vdsm.log:

2019-12-04 08:33:50,208+0000 ERROR (vm/efd6c1f5) [storage.TaskManager.Task] (Task='97393408-ac4e-46f9-b29d-b86a12803381') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in appropriateDevice
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3158, in appropriateDevice
    timeout=QEMU_READABLE_TIMEOUT)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 58, in retry
    return func()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileUtils.py", line 153, in validateQemuReadable
    raise OSError(errno.EACCES, os.strerror(errno.EACCES))
OSError: [Errno 13] Permission denied
2019-12-04 08:33:50,211+0000 ERROR (vm/efd6c1f5) [storage.Dispatcher] FINISH appropriateDevice error=[Errno 13] Permission denied (dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
OSError: [Errno 13] Permission denied
2019-12-04 08:33:50,211+0000 ERROR (vm/efd6c1f5) [virt.vm] (vmId='efd6c1f5-7aa2-4ade-8949-e42814262caa') The vm start process failed (vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in _make_devices
    disk_objs = self._perform_host_local_adjustment()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in _perform_host_local_adjustment
    self._preparePathsForDrives(disk_params)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in _preparePathsForDrives
    drive, self.id, path=path
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 473, in prepareVolumePath
    volPath = res['path']
KeyError: 'path'
2019-12-04 08:33:50,231+0000 WARN (jsonrpc/4) [virt.vm] (vmId='efd6c1f5-7aa2-4ade-8949-e42814262caa') trying to set state to Powering down when already Down (vm:625)
2019-12-04 08:33:50,233+0000 WARN (jsonrpc/4) [root] File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.ovirt-guest-agent.0 already removed (fileutils:54)
2019-12-04 08:33:50,234+0000 WARN (jsonrpc/4) [root] File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.org.qemu.guest_agent.0 already removed (fileutils:54)

In /var/log/messages I see the suspicious LVM scans:

[root@tails vdsm]# grep -v -E '(slice|Session)' /var/log/messages | tail -n 18
Dec 4 08:33:07 tails kernel: rbd0: p1
Dec 4 08:33:07 tails kernel: rbd: rbd0: capacity 2147483648 features 0x5
Dec 4 08:33:15 tails kernel: rbd: rbd1: capacity 85899345920 features 0x5
Dec 4 08:33:16 tails systemd: Starting LVM2 PV scan on device 252:16...
Dec 4 08:33:16 tails systemd: Started LVM2 PV scan on device 252:16.
Dec 4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec 4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec 4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec 4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec 4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec 4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec 4 08:33:50 tails vdsm[25931]: WARN File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.ovirt-guest-agent.0 already removed
Dec 4 08:33:50 tails vdsm[25931]: WARN File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.org.qemu.guest_agent.0 already removed
Dec 4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec 4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec 4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec 4 08:33:57 tails systemd: Stopping LVM2 PV scan on device 252:16...
Dec 4 08:33:57 tails systemd: Stopped LVM2 PV scan on device 252:16.
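For reference, the multipath blacklist being toggled in these tests is typically a stanza of the following shape. This is a sketch only: the exact file path and regex are assumptions, and since vdsm manages /etc/multipath.conf on oVirt hosts, a drop-in file is usually the safer place for local changes:

```
# /etc/multipath/conf.d/rbd.conf (illustrative; file location and
# supported syntax can vary with the multipath-tools version)
blacklist {
    devnode "^rbd[0-9]*"
}
```

After editing, multipathd needs to re-read its configuration (for example via `systemctl reload multipathd`) for the blacklist to take effect.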
In contrast, when I remove the blacklist configuration and it successfully boots, I still get the same udev errors, but it seems like I don't get the LVM scans:

Dec 4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec 4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec 4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec 4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec 4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec 4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec 4 08:40:31 tails multipathd: rbd1: add path (uevent)
Dec 4 08:40:31 tails multipathd: rbd1: spurious uevent, path already in pathvec
Dec 4 08:40:31 tails multipathd: rbd1: HDIO_GETGEO failed with 25
Dec 4 08:40:31 tails multipathd: rbd1: failed to get path uid
Dec 4 08:40:31 tails multipathd: uevent trigger error
Dec 4 08:40:32 tails multipathd: rbd0: add path (uevent)
Dec 4 08:40:32 tails multipathd: rbd0: spurious uevent, path already in pathvec
Dec 4 08:40:32 tails multipathd: rbd0: HDIO_GETGEO failed with 25
Dec 4 08:40:32 tails multipathd: rbd0: failed to get path uid
Dec 4 08:40:32 tails multipathd: uevent trigger error

> Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?
Yes: VDSM sonic.ma1.tnp.infra command DetachManagedBlockStorageVolumeVDS failed: Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=\'\' err=\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmp9BmATh/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 288261\\nTraceback (most recent call last):\\n File "/usr/libexec/vdsm/managedvolume-helper", line 154, in <module>\\n sys.exit(main(sys.argv[1:]))\\n File "/usr/libexec/vdsm/managedvolume-helper", line 77, in main\\n args.command(args)\\n File "/usr/libexec/vdsm/managedvolume-helper", line 149, in detach\\n ignore_errors=False)\\n File "/usr/lib/python2.7/site-packages/vdsm/storage/nos_brick.py", line 121, in disconnect_volume\\n run_as_root=True)\\n File "/usr/lib/python2.7/site-packages/os_brick/executor.py", line 52, in _execute\\n result = self.__execute(*args, **kwargs)\\n File "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute\\n return execute_root(*cmd, **kwargs)\\n File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 241, in _wrap\\n return self.channel.remote_call(name, args, kwargs)\\n File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 203, in remote_call\\n raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/rbd/volume-fda410ef-8d50-48ad-9243-415be3f69460 --conf /tmp/brickrbd_bbFJeM --id ovirt 
--mon_host 172.16.10.13:3300 --mon_host 172.16.10.14:3300 --mon_host 172.16.10.12:6789\\nExit code: 16\\nStdout: u\\\'\\\'\\nStderr: u\\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',)
12/4/19 8:27:39 AM

(In reply to Dan Poltawski from comment #9)
> [event log, vdsm.log traceback and /var/log/messages excerpts quoted in full above; snipped]

This is quite strange; it seems to fail chowning the rbd device. Does this only happen on the fortinet host? Is 252:16 the rbd device?
I am not familiar with fortinet at all, maybe adding an lvm filter will help?

> > Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?
>
> Yes:
> [DetachManagedBlockStorageVolumeVDS / "rbd: unmap failed: (16) Device or resource busy" error quoted in full above; snipped]

Can you share your configuration? (If there's anything special in the rbd driver parameters you use.) I'll try to reproduce this.

> I am not familiar with fortinet at all, maybe adding an lvm filter will help?

I'm not familiar with the fortinet guest either. But my thinking from an oVirt point of view: should there not be some sort of filter to prevent any lvm operations happening at all on any *guest* rbd devices?

> Can you share your configuration? (If there's anything special in the rbd driver parameters you use.)
> I'll try to reproduce this.

Not sure how to extract the config, but I have two managed block storage domains with two different rbd pools, which match almost exactly the example in https://www.ovirt.org/develop/release-management/features/storage/cinderlib-integration.html

I note that the 'use_multipath_for_image_xfer' option is set, which I blindly set following the example in the above link.
The ceph cluster is nautilus running on centos and nothing funky; ceph.conf just defines the mon hosts, near enough.

(In reply to Dan Poltawski from comment #11)
> I'm not familiar with the fortinet guest either. But my thinking from an
> oVirt point of view: should there not be some sort of filter to prevent any
> lvm operations happening at all on any *guest* rbd devices?

No, because oVirt cannot know what you are doing with the host. The only way that can work is to have an lvm filter that allows the host to access only the devices needed by the host. This filter will reject any other devices, like rbd devices.

oVirt provides a tool to configure a strict lvm filter, see:
https://blogs.ovirt.org/2017/12/lvm-configuration-the-easy-way/

The issue still reproduces. Following the steps in the description I couldn't reproduce it; I think the missing part in the steps was to make a filesystem on the ceph-client node and then to mount it. Only then do I see the error 'rbd: unmap failed: (16) Device or resource busy'.

Here are my steps to reproduce the issue:

Versions:
vdsm-4.50.0.9-1.el8ev
ovirt-engine-4.5.0-600.dbc0589fce78.233.el8ev
ceph-common-14.2.11-208.el8cp.x86_64

1. Enable ovirt with managed block storage on ceph.
2. Create a VM from template, attach the MBS disk to it and run it.
3. Create a file system on the ceph-client node which the VM is running on:

# mkfs.ext4 -m0 /dev/rbd/rhev_cinderlib/volume-b80f83ce-3f41-49b2-b029-10da608c669d
mke2fs 1.45.6 (20-Mar-2020)
Discarding device blocks: done
Creating filesystem with 1310720 4k blocks and 327680 inodes
Filesystem UUID: 937b0a76-af0a-4329-a45e-22346708abd0
Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736
Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

4. Mount the file system on the ceph-client node:

# mkdir /mnt/ceph-block-device
# mount /dev/rbd/rhev_cinderlib/volume-b80f83ce-3f41-49b2-b029-10da608c669d /mnt/ceph-block-device
# mount | grep rbd
/dev/rbd0 on /mnt/ceph-block-device type ext4 (rw,relatime,seclabel,stripe=16)

5. Live migrate the VM from one host to another.

Actual result:

# tail -F /var/log/vdsm/vdsm.log | grep ERROR
2022-03-15 13:46:21,449+0200 ERROR (jsonrpc/2) [api] FINISH detach_volume error=Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpq8b2_zhx/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 151874\\nTraceback (most recent call last):\\n File "/usr/libexec/vdsm/managedvolume-helper", line 154, in <module>\\n sys.exit(main(sys.argv[1:]))\\n File "/usr/libexec/vdsm/managedvolume-helper", line 77, in main\\n args.command(args)\\n File "/usr/libexec/vdsm/managedvolume-helper", line 149, in detach\\n ignore_errors=False)\\n File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 121, in disconnect_volume\\n run_as_root=True)\\n File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 52, in _execute\\n result = self.__execute(*args, **kwargs)\\n File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute\\n return execute_root(*cmd, **kwargs)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 245, in _wrap\\n return self.channel.remote_call(name, args, kwargs)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 224, in remote_call\\n raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/rhev_cinderlib/volume-b80f83ce-3f41-49b2-b029-10da608c669d --conf /tmp/brickrbd_1nz0wfg3 --id rhev-qe --mon_host 10.46.12.3:6789 --mon_host 10.46.12.4:6789 --mon_host 10.46.12.9:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',) (api:131)

Hi Dan,
I think we may be missing some steps to reproduce the issue in the bug description. Following steps 1-3 I can't reproduce it, although if I mount the rbd disk I get the same error 'rbd: unmap failed: (16) Device or resource busy'.
Can you recheck that, please?

Apologies, we long since moved away from ovirt due to these ceph issues, so I'm unable to reproduce this issue any longer.

Live migration back and forth works on the master branch (oVirt 4.5).
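A closing note on the recurring "rbd: unmap failed: (16) Device or resource busy": `rbd unmap` can only succeed once nothing holds the device, and in the reproducer above the holder is the filesystem mounted from the rbd device. A minimal sketch (not vdsm code; the function name and sample data are illustrative) of checking a /proc/mounts-style table for such holders before attempting a detach:

```python
def holders_in_mounts(device, mounts_text):
    """Return the mount points that reference `device` in a
    /proc/mounts-style listing (one "src dst fstype opts ..." per line)."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields[0] is the source device, fields[1] the mount point
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points

# Sample table mirroring the reproducer's "mount | grep rbd" output
sample = (
    "/dev/rbd0 /mnt/ceph-block-device ext4 rw,relatime,seclabel,stripe=16 0 0\n"
    "/dev/sda1 / xfs rw,attr2 0 0\n"
)
print(holders_in_mounts("/dev/rbd0", sample))  # -> ['/mnt/ceph-block-device']
```

In practice one would read the real table from /proc/mounts; a non-empty result means the unmap will fail with EBUSY until the filesystem is unmounted.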