Bug 1755801

Summary: [Cinderlib] Managed Block Storage: Live Migration Problems with ceph
Product: [oVirt] vdsm
Component: General
Version: 4.30.34
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Reporter: Dan Poltawski <dan.poltawski>
Assignee: shani <sleviim>
QA Contact: sshmulev
CC: ahadas, bugs, bzlotnik, dan.poltawski, sfishbai, sleviim, tnisan
Target Milestone: ovirt-4.5.0
Target Release: 4.50.0.9
Flags: pm-rhel: ovirt-4.5?
Fixed In Version: vdsm-4.50.0.9
oVirt Team: Storage
Type: Bug
Last Closed: 2022-03-16 10:05:33 UTC

Attachments: engine.log, /var/log/messages, vdsm.log

Description Dan Poltawski 2019-09-26 08:47:15 UTC
Created attachment 1619419 [details]
engine.log

Description of problem:

On oVirt 4.3.5 we are seeing various problems related to the rbd device staying mapped
after a guest has been live migrated. This causes problems migrating the guest back, as
well as rebooting the guest when it starts back up on the original host. The error
returned is 'rbd: unmap failed: (16) Device or resource busy'.

I posted on the mailing list about this issue here:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/PVGTQPXCTEQI4LUUSXDRLSIH3GXXQC2N/

The thread brought attention to:
https://tracker.ceph.com/issues/12763
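
For anyone triaging this: a busy unmap generally means something on the host still holds the block device open. A minimal way to check (sketch only; the rbd and dm device names below are hypothetical, not taken from the attached logs):

# list what still holds the mapped rbd device open
ls /sys/block/rbd0/holders/    # a dm-* entry means device-mapper (multipath/LVM) grabbed it
dmsetup info /dev/dm-7         # identify the device-mapper consumer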


Version-Release number of selected component (if applicable):

python2-cinderlib.noarch                 1:0.9.0-1.el7    centos-openstack-stein
vdsm.x86_64                              4.30.24-1.el7    @ovirt-4.3            
device-mapper-multipath.x86_64           0.4.9-127.el7    @base                 

How reproducible:
90% of the time

Steps to Reproduce:
1. Enable oVirt with managed block storage on ceph
2. Migrate VMs between one host and another
3. The VM successfully migrates, but an error about detach_volume is generated

Actual results:

The migrated rbd device remains mapped on the source host

Expected results:

The rbd device is mapped only on the destination host, where the VM is running


Additional info:
ceph version 14.2.4
CentOS 7

Comment 1 Dan Poltawski 2019-09-26 08:47:55 UTC
Created attachment 1619420 [details]
/var/log/messages

Comment 2 Dan Poltawski 2019-09-26 08:48:27 UTC
Created attachment 1619421 [details]
vdsm.log

Comment 3 Dan Poltawski 2019-09-26 08:49:16 UTC
I've attached various logs, but I'm afraid there is a lot of noise from other issues we're working through.

Comment 4 Dan Poltawski 2019-09-27 20:38:09 UTC
Not sure if https://bugzilla.redhat.com/show_bug.cgi?id=1750417 might be related to this.

Comment 5 Tal Nisan 2019-10-28 15:37:37 UTC
Benny, if I'm not mistaken LSM is not supported with MBS, is it?

Comment 6 Benny Zlotnik 2019-10-28 18:58:27 UTC
(In reply to Tal Nisan from comment #5)
> Benny, if I'm not mistaken LSM is not supported with MBS, is it?

Yes. The bug is about VM live migration, which is supported. It can be resolved by blacklisting rbd devices from multipath, so I'm moving this to vdsm.
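
(For reference, a minimal multipath blacklist of this kind would look like the following; the drop-in path is an assumption, since on oVirt hosts local changes usually go in a file under /etc/multipath/conf.d/ rather than in the vdsm-managed /etc/multipath.conf:)

# /etc/multipath/conf.d/rbd.conf -- illustrative sketch
# exclude kernel rbd devices from multipath path discovery
blacklist {
    devnode "^rbd[0-9]*"
}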

Comment 7 Dan Poltawski 2019-12-03 23:26:40 UTC
Just to mention that when I add this blacklist configuration with one specific guest (a fortinet appliance, which I think has an LVM-like filesystem) we are unable to boot the VM.

We also have a fairly persistent problem with rbd devices continuing to be mapped after live migration even with this blacklist configuration applied.

Comment 8 Benny Zlotnik 2019-12-04 07:07:00 UTC
(In reply to Dan Poltawski from comment #7)
> Just to mention that when I add this blacklist configuration with one
> specific guest (a fortinet appliance, which I think has an LVM-like
> filesystem) we are unable to boot the VM.
> 
> We also have a fairly persistent problem with rbd devices continuing to be
> mapped after live migration even with this blacklist configuration applied.

Can you share more details? What errors do you see when trying to boot?

Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?

Comment 9 Dan Poltawski 2019-12-04 08:53:12 UTC
> Can you share more details? What errors do you see when trying to boot?

It also causes migrations to fail, for the record. My suspicion from the logs is it's related to an LVM scan which is somehow prevented by the multipath errors?

When it fails, in the Event log UI I get:

VM fmg00.tnp.net.uk was started by poltawski@internal-authz (Host: tails.ma1.tnp.infra).
User <UNKNOWN> got disconnected from VM fmg00.tnp.net.uk.
VM fmg00.tnp.net.uk is down with error. Exit message: 'path'.
Failed to run VM fmg00.tnp.net.uk on Host tails.ma1.tnp.infra.
Failed to run VM fmg00.tnp.net.uk (User: poltawski@internal-authz).


From vdsm.log:

2019-12-04 08:33:50,208+0000 ERROR (vm/efd6c1f5) [storage.TaskManager.Task] (Task='97393408-ac4e-46f9-b29d-b86a12803381') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in appropriateDevice
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3158, in appropriateDevice
    timeout=QEMU_READABLE_TIMEOUT)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 58, in retry
    return func()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileUtils.py", line 153, in validateQemuReadable
    raise OSError(errno.EACCES, os.strerror(errno.EACCES))
OSError: [Errno 13] Permission denied
2019-12-04 08:33:50,211+0000 ERROR (vm/efd6c1f5) [storage.Dispatcher] FINISH appropriateDevice error=[Errno 13] Permission denied (dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
OSError: [Errno 13] Permission denied
2019-12-04 08:33:50,211+0000 ERROR (vm/efd6c1f5) [virt.vm] (vmId='efd6c1f5-7aa2-4ade-8949-e42814262caa') The vm start process failed (vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in _make_devices
    disk_objs = self._perform_host_local_adjustment()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in _perform_host_local_adjustment
    self._preparePathsForDrives(disk_params)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in _preparePathsForDrives
    drive, self.id, path=path
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 473, in prepareVolumePath
    volPath = res['path']
KeyError: 'path'
2019-12-04 08:33:50,231+0000 WARN  (jsonrpc/4) [virt.vm] (vmId='efd6c1f5-7aa2-4ade-8949-e42814262caa') trying to set state to Powering down when already Down (vm:625)
2019-12-04 08:33:50,233+0000 WARN  (jsonrpc/4) [root] File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.ovirt-guest-agent.0 already removed (fileutils:54)
2019-12-04 08:33:50,234+0000 WARN  (jsonrpc/4) [root] File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.org.qemu.guest_agent.0 already removed (fileutils:54)


In /var/log/messages I see the suspicious LVM scans:


[root@tails vdsm]# grep -v -E '(slice|Session)' /var/log/messages | tail -n 18
Dec  4 08:33:07 tails kernel: rbd0: p1
Dec  4 08:33:07 tails kernel: rbd: rbd0: capacity 2147483648 features 0x5
Dec  4 08:33:15 tails kernel: rbd: rbd1: capacity 85899345920 features 0x5
Dec  4 08:33:16 tails systemd: Starting LVM2 PV scan on device 252:16...
Dec  4 08:33:16 tails systemd: Started LVM2 PV scan on device 252:16.
Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:33:50 tails vdsm[25931]: WARN File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.ovirt-guest-agent.0 already removed
Dec  4 08:33:50 tails vdsm[25931]: WARN File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.org.qemu.guest_agent.0 already removed
Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:33:57 tails systemd: Stopping LVM2 PV scan on device 252:16...
Dec  4 08:33:57 tails systemd: Stopped LVM2 PV scan on device 252:16.



In contrast, when I remove the blacklist configuration and it successfully boots, I still get the same udev errors, but it seems I don't get the LVM scans:

Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:40:31 tails multipathd: rbd1: add path (uevent)
Dec  4 08:40:31 tails multipathd: rbd1: spurious uevent, path already in pathvec
Dec  4 08:40:31 tails multipathd: rbd1: HDIO_GETGEO failed with 25
Dec  4 08:40:31 tails multipathd: rbd1: failed to get path uid
Dec  4 08:40:31 tails multipathd: uevent trigger error
Dec  4 08:40:32 tails multipathd: rbd0: add path (uevent)
Dec  4 08:40:32 tails multipathd: rbd0: spurious uevent, path already in pathvec
Dec  4 08:40:32 tails multipathd: rbd0: HDIO_GETGEO failed with 25
Dec  4 08:40:32 tails multipathd: rbd0: failed to get path uid
Dec  4 08:40:32 tails multipathd: uevent trigger error

> Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?

Yes:

VDSM sonic.ma1.tnp.infra command DetachManagedBlockStorageVolumeVDS failed: Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=\'\' err=\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmp9BmATh/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 288261\\nTraceback (most recent call last):\\n File "/usr/libexec/vdsm/managedvolume-helper", line 154, in <module>\\n sys.exit(main(sys.argv[1:]))\\n File "/usr/libexec/vdsm/managedvolume-helper", line 77, in main\\n args.command(args)\\n File "/usr/libexec/vdsm/managedvolume-helper", line 149, in detach\\n ignore_errors=False)\\n File "/usr/lib/python2.7/site-packages/vdsm/storage/nos_brick.py", line 121, in disconnect_volume\\n run_as_root=True)\\n File "/usr/lib/python2.7/site-packages/os_brick/executor.py", line 52, in _execute\\n result = self.__execute(*args, **kwargs)\\n File "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute\\n return execute_root(*cmd, **kwargs)\\n File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 241, in _wrap\\n return self.channel.remote_call(name, args, kwargs)\\n File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 203, in remote_call\\n raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/rbd/volume-fda410ef-8d50-48ad-9243-415be3f69460 --conf /tmp/brickrbd_bbFJeM --id ovirt --mon_host 172.16.10.13:3300 --mon_host 172.16.10.14:3300 --mon_host 172.16.10.12:6789\\nExit code: 16\\nStdout: u\\\'\\\'\\nStderr: u\\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',)
12/4/19 8:27:39 AM

Comment 10 Benny Zlotnik 2019-12-04 20:44:04 UTC
(In reply to Dan Poltawski from comment #9)
> It also causes migrations to fail, for the record. My suspicion from the
> logs is it's related to an LVM scan which is somehow prevented by the
> multipath errors?
> 
> [event log messages, vdsm.log traceback, and /var/log/messages excerpts
> quoted from comment #9 trimmed]
This is quite strange; it seems to fail chowning the rbd device. Does this only happen with the fortinet guest?
Is 252:16 the rbd device?
I am not familiar with fortinet at all, maybe adding an LVM filter will help?

> > Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?
> 
> Yes:
> 
> [VDSM DetachManagedBlockStorageVolumeVDS / 'rbd: unmap failed: (16) Device
> or resource busy' error quoted from comment #9 trimmed]

Can you share your configuration? (In case there's anything special in the rbd driver parameters you use.)
I'll try to reproduce this.

Comment 11 Dan Poltawski 2019-12-10 12:21:10 UTC
> I am not familiar with fortinet at all, maybe adding an LVM filter will help?

I'm not familiar with the fortinet guest either. But my thinking, from an oVirt point of view: should there not be some sort of filter to prevent any LVM operations happening at all on any *guest* rbd devices?


> Can you share your configuration? (In case there's anything special in the rbd driver parameters you use.)
> I'll try to reproduce this.

Not sure how to extract the config, but I have two managed block storage domains with two different rbd pools, which match almost exactly the example in https://www.ovirt.org/develop/release-management/features/storage/cinderlib-integration.html

I note that the 'use_multipath_for_image_xfer' option is set, which I blindly set following the example in the above link.

The ceph cluster is Nautilus running on CentOS, nothing funky; ceph.conf just defines the mon hosts, near enough.
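
(For context, the driver options in that cinderlib integration example are key/value pairs along these lines; the pool and user values below are placeholders, not the actual configuration from this report:)

volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_pool = <pool-name>
rbd_user = <cephx-user>
use_multipath_for_image_xfer = true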

Comment 12 Nir Soffer 2019-12-10 18:22:30 UTC
(In reply to Dan Poltawski from comment #11)
> > I am not familiar with fortientat all, maybe adding an lvm filter will help?
> 
> I'm not familiar with the fortinet guest either. But my thinking from on
> ovirt point of view, should there not be some sort of filter to prevent any
> lvm operations happening at all on any *guest* rbd devices?

No, because oVirt cannot know what you are doing with the host. The only
way this can work is to have an LVM filter that allows the host to access
only the devices needed by the host. Such a filter will reject any other
devices, like rbd devices.

oVirt provides a tool to configure a strict LVM filter, see:
https://blogs.ovirt.org/2017/12/lvm-configuration-the-easy-way/
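
(As an illustration only: a strict filter in /etc/lvm/lvm.conf accepts the host's own PVs and rejects everything else, including guest rbd devices. The PV path below is a placeholder; on oVirt hosts the real filter is generated by the tool above, vdsm-tool config-lvm-filter:)

devices {
    # accept only the host's own PV; reject all other devices, incl. /dev/rbd*
    filter = [ "a|^/dev/disk/by-id/lvm-pv-uuid-PLACEHOLDER$|", "r|.*|" ]
}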

Comment 14 sshmulev 2022-03-15 12:01:36 UTC
The issue still reproduces.

Following the steps in the description I couldn't reproduce it; I think the missing part in the steps was to make a filesystem on the ceph-client node and then mount it. Only then do I see the error 'rbd: unmap failed: (16) Device or resource busy'.

Here are my steps to reproduce the issue:



Versions:
vdsm-4.50.0.9-1.el8ev
ovirt-engine-4.5.0-600.dbc0589fce78.233.el8ev
ceph-common-14.2.11-208.el8cp.x86_64

1. Enable oVirt with managed block storage on ceph.
2. Create a VM from a template, attach the MBS disk to it, and run it.
3. Create a file system on the ceph-client node which the VM is running on:
mkfs.ext4 -m0 /dev/rbd/rhev_cinderlib/volume-b80f83ce-3f41-49b2-b029-10da608c669d
mke2fs 1.45.6 (20-Mar-2020)
Discarding device blocks: done                            
Creating filesystem with 1310720 4k blocks and 327680 inodes
Filesystem UUID: 937b0a76-af0a-4329-a45e-22346708abd0
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

4. Mount the file system on the ceph-client node:
mkdir /mnt/ceph-block-device
mount /dev/rbd/rhev_cinderlib/volume-b80f83ce-3f41-49b2-b029-10da608c669d /mnt/ceph-block-device

# mount | grep rbd
/dev/rbd0 on /mnt/ceph-block-device type ext4 (rw,relatime,seclabel,stripe=16)

5. Live migrate the VM from one host to another.


Actual result:
# tail -F /var/log/vdsm/vdsm.log | grep ERROR
2022-03-15 13:46:21,449+0200 ERROR (jsonrpc/2) [api] FINISH detach_volume error=Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpq8b2_zhx/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 151874\\nTraceback (most recent call last):\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 154, in <module>\\n    sys.exit(main(sys.argv[1:]))\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 77, in main\\n    args.command(args)\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 149, in detach\\n    ignore_errors=False)\\n  File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 121, in disconnect_volume\\n    run_as_root=True)\\n  File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 52, in _execute\\n    result = self.__execute(*args, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute\\n    return execute_root(*cmd, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 245, in _wrap\\n    return self.channel.remote_call(name, args, kwargs)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 224, in remote_call\\n    raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/rhev_cinderlib/volume-b80f83ce-3f41-49b2-b029-10da608c669d --conf /tmp/brickrbd_1nz0wfg3 --id rhev-qe --mon_host 10.46.12.3:6789 --mon_host 10.46.12.4:6789 --mon_host 10.46.12.9:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',) (api:131)
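
(To confirm the leftover mapping on the source host after the failed detach, the standard ceph CLI can be used; sketch only, output omitted:)

# on the source host after the migration
rbd showmapped           # the migrated volume still appears in the list
grep rbd /proc/mounts    # the filesystem mounted in step 4 still holds the device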

Comment 16 sshmulev 2022-03-15 14:09:05 UTC
Hi Dan,
I think maybe some steps to reproduce the issue are missing from the bug description.
Following steps 1-3 I can't reproduce it, although if I mount the rbd disk I get the same error 'rbd: unmap failed: (16) Device or resource busy'.
Can you recheck that, please?

Comment 17 Dan Poltawski 2022-03-15 14:13:22 UTC
Apologies, we have long since moved away from oVirt due to these ceph issues, so I'm unable to reproduce this issue any longer.

Comment 18 Arik 2022-03-16 10:05:33 UTC
Live migration back and forth works on the master branch (oVirt 4.5).