Bug 1755801 - Managed Block Storage: Live Migration Problems with ceph
Summary: Managed Block Storage: Live Migration Problems with ceph
Keywords:
Status: NEW
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.30.34
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.1
Assignee: Nir Soffer
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-26 08:47 UTC by Dan Poltawski
Modified: 2020-01-13 09:12 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
sbonazzo: ovirt-4.4?


Attachments
engine.log (579.78 KB, application/gzip)
2019-09-26 08:47 UTC, Dan Poltawski
no flags Details
/var/log/messages (579.78 KB, application/gzip)
2019-09-26 08:47 UTC, Dan Poltawski
no flags Details
vdsm.log (900.00 KB, application/gzip)
2019-09-26 08:48 UTC, Dan Poltawski
no flags Details

Description Dan Poltawski 2019-09-26 08:47:15 UTC
Created attachment 1619419 [details]
engine.log

Description of problem:

On oVirt 4.3.5 we are seeing various problems related to the rbd device staying mapped
after a guest has been live migrated. This causes problems migrating the guest back, as
well as rebooting the guest when it starts back up on the original host. The error
returned is 'rbd: unmap failed: (16) Device or resource busy'.

I posted on the mailing list about this issue here:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/PVGTQPXCTEQI4LUUSXDRLSIH3GXXQC2N/

The thread brought attention to:
https://tracker.ceph.com/issues/12763


Version-Release number of selected component (if applicable):

python2-cinderlib.noarch                 1:0.9.0-1.el7    centos-openstack-stein
vdsm.x86_64                              4.30.24-1.el7    @ovirt-4.3            
device-mapper-multipath.x86_64           0.4.9-127.el7    @base                 

How reproducible:
90% of time

Steps to Reproduce:
1. Enable ovirt with managed block storage on ceph
2. Migrate vms between one host and another
3. VM successfully migrates, but an error about detach_volume is generated

Actual results:

The migrated rbd device remains mapped on the source host

Expected results:

The rbd device should only be mapped on the destination host, where the VM is running


Additional info:
ceph version 14.2.4
centos 7

Comment 1 Dan Poltawski 2019-09-26 08:47:55 UTC
Created attachment 1619420 [details]
/var/log/messages

Comment 2 Dan Poltawski 2019-09-26 08:48:27 UTC
Created attachment 1619421 [details]
vdsm.log

Comment 3 Dan Poltawski 2019-09-26 08:49:16 UTC
I've attached various logs, but I'm afraid there is a lot of noise from other issues we're working through.

Comment 4 Dan Poltawski 2019-09-27 20:38:09 UTC
Not sure if https://bugzilla.redhat.com/show_bug.cgi?id=1750417 might be related to this.

Comment 5 Tal Nisan 2019-10-28 15:37:37 UTC
Benny, if I'm not mistaken LSM is not supported with MBS, is it?

Comment 6 Benny Zlotnik 2019-10-28 18:58:27 UTC
(In reply to Tal Nisan from comment #5)
> Benny, if I'm not mistaken LSM is not supported with MBS, isn't it?

Yes, the bug is about VM live migration, which is supported. It can be resolved by blacklisting rbd devices from multipath, so I'm moving this to vdsm.
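
For reference, blacklisting rbd devices from multipath is typically done with a devnode rule in /etc/multipath.conf. This is a sketch, not the exact configuration used here; note that on oVirt hosts vdsm manages this file, and local changes generally need to be marked as private (e.g. with a "# VDSM PRIVATE" header — verify the exact marker against your vdsm version) or they may be overwritten:

```
# /etc/multipath.conf (sketch): keep multipathd away from kernel rbd devices
blacklist {
    devnode "^rbd[0-9]*"
}
```

After editing, reload the daemon (e.g. `systemctl reload multipathd`) so the new blacklist takes effect.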

Comment 7 Dan Poltawski 2019-12-03 23:26:40 UTC
Just to mention that when I add this blacklist configuration, with one specific guest (a Fortinet appliance, which I think has an LVM-like filesystem) we are unable to boot the VM.

We also have a fairly persistent problem with rbd devices continuing to be mapped after live migration, even with this blacklist configuration applied.

Comment 8 Benny Zlotnik 2019-12-04 07:07:00 UTC
(In reply to Dan Poltawski from comment #7)
> Just to mention that when I add this blacklist configuration with one
> specific guest (a fortinet appliance, which I think has a lvm-ish like
> filesystem) we are unable to boot the vm.
> 
> We also have a faily persisent problem with rbd devices continuining to be
> mounted after live migration even with this blacklist configuration applied.

Can you share more details? What errors do you see when trying to boot?

Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?

Comment 9 Dan Poltawski 2019-12-04 08:53:12 UTC
> Can you share more details what errors do you see when trying to boot?

It also causes migrations to fail, for the record. My suspicion from the logs is that it's related to an LVM scan which is somehow prevented by the multipath errors?

When it fails, in the Event log UI I get:

VM fmg00.tnp.net.uk was started by poltawski@internal-authz (Host: tails.ma1.tnp.infra).
User <UNKNOWN> got disconnected from VM fmg00.tnp.net.uk.
VM fmg00.tnp.net.uk is down with error. Exit message: 'path'.
Failed to run VM fmg00.tnp.net.uk on Host tails.ma1.tnp.infra.
Failed to run VM fmg00.tnp.net.uk (User: poltawski@internal-authz).


From vdsm.log:

2019-12-04 08:33:50,208+0000 ERROR (vm/efd6c1f5) [storage.TaskManager.Task] (Task='97393408-ac4e-46f9-b29d-b86a12803381') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in appropriateDevice
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3158, in appropriateDevice
    timeout=QEMU_READABLE_TIMEOUT)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 58, in retry
    return func()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileUtils.py", line 153, in validateQemuReadable
    raise OSError(errno.EACCES, os.strerror(errno.EACCES))
OSError: [Errno 13] Permission denied
2019-12-04 08:33:50,211+0000 ERROR (vm/efd6c1f5) [storage.Dispatcher] FINISH appropriateDevice error=[Errno 13] Permission denied (dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
OSError: [Errno 13] Permission denied
2019-12-04 08:33:50,211+0000 ERROR (vm/efd6c1f5) [virt.vm] (vmId='efd6c1f5-7aa2-4ade-8949-e42814262caa') The vm start process failed (vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in _make_devices
    disk_objs = self._perform_host_local_adjustment()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in _perform_host_local_adjustment
    self._preparePathsForDrives(disk_params)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in _preparePathsForDrives
    drive, self.id, path=path
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 473, in prepareVolumePath
    volPath = res['path']
KeyError: 'path'
2019-12-04 08:33:50,231+0000 WARN  (jsonrpc/4) [virt.vm] (vmId='efd6c1f5-7aa2-4ade-8949-e42814262caa') trying to set state to Powering down when already Down (vm:625)
2019-12-04 08:33:50,233+0000 WARN  (jsonrpc/4) [root] File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.ovirt-guest-agent.0 already removed (fileutils:54)
2019-12-04 08:33:50,234+0000 WARN  (jsonrpc/4) [root] File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.org.qemu.guest_agent.0 already removed (fileutils:54)


In /var/log/messages I see the suspicious LVM scans:


[root@tails vdsm]# grep -v -E '(slice|Session)' /var/log/messages | tail -n 18
Dec  4 08:33:07 tails kernel: rbd0: p1
Dec  4 08:33:07 tails kernel: rbd: rbd0: capacity 2147483648 features 0x5
Dec  4 08:33:15 tails kernel: rbd: rbd1: capacity 85899345920 features 0x5
Dec  4 08:33:16 tails systemd: Starting LVM2 PV scan on device 252:16...
Dec  4 08:33:16 tails systemd: Started LVM2 PV scan on device 252:16.
Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:33:50 tails vdsm[25931]: WARN File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.ovirt-guest-agent.0 already removed
Dec  4 08:33:50 tails vdsm[25931]: WARN File: /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.org.qemu.guest_agent.0 already removed
Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:33:57 tails systemd: Stopping LVM2 PV scan on device 252:16...
Dec  4 08:33:57 tails systemd: Stopped LVM2 PV scan on device 252:16.



In contrast, when I remove the blacklist configuration and it successfully boots, I still get the same udev errors, but it seems like I don't get the LVM scans:

Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character 26 (';')
Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character 29 (';')
Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character 25 (';')
Dec  4 08:40:31 tails multipathd: rbd1: add path (uevent)
Dec  4 08:40:31 tails multipathd: rbd1: spurious uevent, path already in pathvec
Dec  4 08:40:31 tails multipathd: rbd1: HDIO_GETGEO failed with 25
Dec  4 08:40:31 tails multipathd: rbd1: failed to get path uid
Dec  4 08:40:31 tails multipathd: uevent trigger error
Dec  4 08:40:32 tails multipathd: rbd0: add path (uevent)
Dec  4 08:40:32 tails multipathd: rbd0: spurious uevent, path already in pathvec
Dec  4 08:40:32 tails multipathd: rbd0: HDIO_GETGEO failed with 25
Dec  4 08:40:32 tails multipathd: rbd0: failed to get path uid
Dec  4 08:40:32 tails multipathd: uevent trigger error

> Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?

Yes:

VDSM sonic.ma1.tnp.infra command DetachManagedBlockStorageVolumeVDS failed: Managed Volume Helper failed.: Error executing helper: Command ['/usr/libexec/vdsm/managedvolume-helper', 'detach'] failed with rc=1 out='' err:

oslo.privsep.daemon: Running privsep helper: ['sudo', 'privsep-helper', '--privsep_context', 'os_brick.privileged.default', '--privsep_sock_path', '/tmp/tmp9BmATh/privsep.sock']
oslo.privsep.daemon: Spawned new privsep daemon via rootwrap
oslo.privsep.daemon: privsep daemon starting
oslo.privsep.daemon: privsep process running with uid/gid: 0/0
oslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
oslo.privsep.daemon: privsep daemon running as pid 288261
Traceback (most recent call last):
  File "/usr/libexec/vdsm/managedvolume-helper", line 154, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/libexec/vdsm/managedvolume-helper", line 77, in main
    args.command(args)
  File "/usr/libexec/vdsm/managedvolume-helper", line 149, in detach
    ignore_errors=False)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nos_brick.py", line 121, in disconnect_volume
    run_as_root=True)
  File "/usr/lib/python2.7/site-packages/os_brick/executor.py", line 52, in _execute
    result = self.__execute(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute
    return execute_root(*cmd, **kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 241, in _wrap
    return self.channel.remote_call(name, args, kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 203, in remote_call
    raise exc_type(*result[2])
oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: rbd unmap /dev/rbd/rbd/volume-fda410ef-8d50-48ad-9243-415be3f69460 --conf /tmp/brickrbd_bbFJeM --id ovirt --mon_host 172.16.10.13:3300 --mon_host 172.16.10.14:3300 --mon_host 172.16.10.12:6789
Exit code: 16
Stdout: u''
Stderr: u'rbd: sysfs write failed\nrbd: unmap failed: (16) Device or resource busy\n'

12/4/19 8:27:39 AM
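
For what it's worth, when `rbd unmap` fails with EBUSY like this, checking what still holds the device can narrow things down. A diagnostic sketch (device and map names are illustrative, not taken from these logs):

```
# List the current kernel rbd mappings on this host
rbd showmapped
# See whether another block layer (e.g. a device-mapper/multipath map) holds rbd0
ls /sys/block/rbd0/holders/
# If a multipath map claimed the device, flush that map first
# ($MAP is a placeholder for the map name from the holders listing)
multipath -f "$MAP"
# Then retry the unmap
rbd unmap /dev/rbd0
```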

Comment 10 Benny Zlotnik 2019-12-04 20:44:04 UTC
(In reply to Dan Poltawski from comment #9)
> > Can you share more details what errors do you see when trying to boot?
> 
> It also causes migrations to fail, for the record. My suspicion from the
> logs is its related to an LVM scan which is somehow prevented by the
> mulitpath errors?
> 
> When it fails, in the Event log UI I get:
> 
> VM fmg00.tnp.net.uk was started by poltawski@internal-authz (Host:
> tails.ma1.tnp.infra).
> User <UNKNOWN> got disconnected from VM fmg00.tnp.net.uk.
> VM fmg00.tnp.net.uk is down with error. Exit message: 'path'.
> Failed to run VM fmg00.tnp.net.uk on Host tails.ma1.tnp.infra.
> Failed to run VM fmg00.tnp.net.uk (User: poltawski@internal-authz).
> 
> 
> From vdsm.log:
> 
> 2019-12-04 08:33:50,208+0000 ERROR (vm/efd6c1f5) [storage.TaskManager.Task]
> (Task='97393408-ac4e-46f9-b29d-b86a12803381') Unexpected error (task:875)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in
> _run
>     return fn(*args, **kargs)
>   File "<string>", line 2, in appropriateDevice
>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in
> method
>     ret = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3158, in
> appropriateDevice
>     timeout=QEMU_READABLE_TIMEOUT)
>   File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 58,
> in retry
>     return func()
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/fileUtils.py", line
> 153, in validateQemuReadable
>     raise OSError(errno.EACCES, os.strerror(errno.EACCES))
> OSError: [Errno 13] Permission denied
> 2019-12-04 08:33:50,211+0000 ERROR (vm/efd6c1f5) [storage.Dispatcher] FINISH
> appropriateDevice error=[Errno 13] Permission denied (dispatcher:87)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line
> 74, in wrapper
>     result = ctask.prepare(func, *args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in
> wrapper
>     return m(self, *a, **kw)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189,
> in prepare
>     raise self.error
> OSError: [Errno 13] Permission denied
> 2019-12-04 08:33:50,211+0000 ERROR (vm/efd6c1f5) [virt.vm]
> (vmId='efd6c1f5-7aa2-4ade-8949-e42814262caa') The vm start process failed
> (vm:933)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in
> _startUnderlyingVm
>     self._run()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in _run
>     self._devices = self._make_devices()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in
> _make_devices
>     disk_objs = self._perform_host_local_adjustment()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in
> _perform_host_local_adjustment
>     self._preparePathsForDrives(disk_params)
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in
> _preparePathsForDrives
>     drive, self.id, path=path
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 473, in
> prepareVolumePath
>     volPath = res['path']
> KeyError: 'path'
> 2019-12-04 08:33:50,231+0000 WARN  (jsonrpc/4) [virt.vm]
> (vmId='efd6c1f5-7aa2-4ade-8949-e42814262caa') trying to set state to
> Powering down when already Down (vm:625)
> 2019-12-04 08:33:50,233+0000 WARN  (jsonrpc/4) [root] File:
> /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.ovirt-
> guest-agent.0 already removed (fileutils:54)
> 2019-12-04 08:33:50,234+0000 WARN  (jsonrpc/4) [root] File:
> /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.org.qemu.
> guest_agent.0 already removed (fileutils:54)
> 
> 
> In /var/log/messages I see the suscpicious lvm scans:
> 
> 
> [root@tails vdsm]# grep -v -E '(slice|Session)' /var/log/messages | tail -n
> 18
> Dec  4 08:33:07 tails kernel: rbd0: p1
> Dec  4 08:33:07 tails kernel: rbd: rbd0: capacity 2147483648 features 0x5
> Dec  4 08:33:15 tails kernel: rbd: rbd1: capacity 85899345920 features 0x5
> Dec  4 08:33:16 tails systemd: Starting LVM2 PV scan on device 252:16...
> Dec  4 08:33:16 tails systemd: Started LVM2 PV scan on device 252:16.
> Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character
> 26 (';')
> Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character
> 29 (';')
> Dec  4 08:33:19 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character
> 25 (';')
> Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character
> 26 (';')
> Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character
> 29 (';')
> Dec  4 08:33:28 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character
> 25 (';')
> Dec  4 08:33:50 tails vdsm[25931]: WARN File:
> /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.ovirt-
> guest-agent.0 already removed
> Dec  4 08:33:50 tails vdsm[25931]: WARN File:
> /var/lib/libvirt/qemu/channels/efd6c1f5-7aa2-4ade-8949-e42814262caa.org.qemu.
> guest_agent.0 already removed
> Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character
> 26 (';')
> Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character
> 29 (';')
> Dec  4 08:33:52 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character
> 25 (';')
> Dec  4 08:33:57 tails systemd: Stopping LVM2 PV scan on device 252:16...
> Dec  4 08:33:57 tails systemd: Stopped LVM2 PV scan on device 252:16.
> 
> 
> 
> In contrast, when I remove the blacklist configuration and it sucessfully
> boots, I still get the same udev errors, but it seems like I don't get the
> LVM scans:
> 
> Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character
> 26 (';')
> Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character
> 29 (';')
> Dec  4 08:40:24 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character
> 25 (';')
> Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 10, starting at character
> 26 (';')
> Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 11, starting at character
> 29 (';')
> Dec  4 08:40:30 tails systemd-udevd: invalid key/value pair in file
> /usr/lib/udev/rules.d/59-fc-wwpn-id.rules on line 12, starting at character
> 25 (';')
> Dec  4 08:40:31 tails multipathd: rbd1: add path (uevent)
> Dec  4 08:40:31 tails multipathd: rbd1: spurious uevent, path already in
> pathvec
> Dec  4 08:40:31 tails multipathd: rbd1: HDIO_GETGEO failed with 25
> Dec  4 08:40:31 tails multipathd: rbd1: failed to get path uid
> Dec  4 08:40:31 tails multipathd: uevent trigger error
> Dec  4 08:40:32 tails multipathd: rbd0: add path (uevent)
> Dec  4 08:40:32 tails multipathd: rbd0: spurious uevent, path already in
> pathvec
> Dec  4 08:40:32 tails multipathd: rbd0: HDIO_GETGEO failed with 25
> Dec  4 08:40:32 tails multipathd: rbd0: failed to get path uid
> Dec  4 08:40:32 tails multipathd: uevent trigger error
> 

This is quite strange; it seems to fail chowning the rbd device. Does this only happen on the fortinet host?
Is 252:16 the rbd device?
I am not familiar with fortinet at all; maybe adding an LVM filter will help?

> > Do you see errors about a failed unmap (supervdsm/vdsm log) when the rbd device remains?
> 
> Yes:
> 
> VDSM sonic.ma1.tnp.infra command DetachManagedBlockStorageVolumeVDS failed:
> Managed Volume Helper failed.: ('Error executing helper: Command
> [\'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1
> out=\'\' err=\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\',
> \\\'privsep-helper\\\', \\\'--privsep_context\\\',
> \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\',
> \\\'/tmp/tmp9BmATh/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new
> privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon
> starting\\noslo.privsep.daemon: privsep process running with uid/gid:
> 0/0\\noslo.privsep.daemon: privsep process running with capabilities
> (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon:
> privsep daemon running as pid 288261\\nTraceback (most recent call last):\\n
> File "/usr/libexec/vdsm/managedvolume-helper", line 154, in <module>\\n
> sys.exit(main(sys.argv[1:]))\\n File
> "/usr/libexec/vdsm/managedvolume-helper", line 77, in main\\n
> args.command(args)\\n File "/usr/libexec/vdsm/managedvolume-helper", line
> 149, in detach\\n ignore_errors=False)\\n File
> "/usr/lib/python2.7/site-packages/vdsm/storage/nos_brick.py", line 121, in
> disconnect_volume\\n run_as_root=True)\\n File
> "/usr/lib/python2.7/site-packages/os_brick/executor.py", line 52, in
> _execute\\n result = self.__execute(*args, **kwargs)\\n File
> "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line
> 169, in execute\\n return execute_root(*cmd, **kwargs)\\n File
> "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 241,
> in _wrap\\n return self.channel.remote_call(name, args, kwargs)\\n File
> "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 203, in
> remote_call\\n raise
> exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError:
> Unexpected error while running command.\\nCommand: rbd unmap
> /dev/rbd/rbd/volume-fda410ef-8d50-48ad-9243-415be3f69460 --conf
> /tmp/brickrbd_bbFJeM --id ovirt --mon_host 172.16.10.13:3300 --mon_host
> 172.16.10.14:3300 --mon_host 172.16.10.12:6789\\nExit code: 16\\nStdout:
> u\\\'\\\'\\nStderr: u\\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16)
> Device or resource busy\\\\n\\\'\\n\'',)
> 12/4/198:27:39 AM

Can you share your configuration? (If there's anything special in the rbd driver parameters you use.)
I'll try to reproduce this.

Comment 11 Dan Poltawski 2019-12-10 12:21:10 UTC
> I am not familiar with fortinet at all, maybe adding an lvm filter will help?

I'm not familiar with the fortinet guest either. But my thinking, from an oVirt point of view: should there not be some sort of filter to prevent any LVM operations happening at all on any *guest* rbd devices?


> Can you share you configuration? (If there's anything special in rbd driver parameters you use)
> I'll try to reproduce this

Not sure how to extract the config? But I have two managed block storage domains with two different rbd pools, which match almost exactly the example in  https://www.ovirt.org/develop/release-management/features/storage/cinderlib-integration.html 

I note that the 'use_multipath_for_image_xfer' option is set, which I blindly set following the example in the above link.

The ceph cluster is Nautilus running on CentOS, and nothing funky; ceph.conf just defines the mon hosts, near enough.

Comment 12 Nir Soffer 2019-12-10 18:22:30 UTC
(In reply to Dan Poltawski from comment #11)
> > I am not familiar with fortientat all, maybe adding an lvm filter will help?
> 
> I'm not familiar with the fortinet guest either. But my thinking from on
> ovirt point of view, should there not be some sort of filter to prevent any
> lvm operations happening at all on any *guest* rbd devices?

No, because oVirt cannot know what you are doing with the host. The only
way that can work is to have an LVM filter that allows the host to access
only the devices needed by the host. This filter will reject any other
devices, like rbd devices.

oVirt provides a tool to configure strict lvm filter, see:
https://blogs.ovirt.org/2017/12/lvm-configuration-the-easy-way/
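
On oVirt 4.2+ hosts the tool referenced above is `vdsm-tool config-lvm-filter`, which generates a filter matched to the host's own devices. The effect can be sketched by hand in /etc/lvm/lvm.conf (illustrative values only; the /dev/sda2 path is an assumption — substitute the devices actually backing your host's volume groups):

```
# /etc/lvm/lvm.conf (sketch): accept only the host's own PVs and reject
# everything else, including guest rbd devices, so systemd/lvm2 PV scans
# never touch them.
devices {
    global_filter = [ "a|^/dev/sda2$|", "r|.*|" ]
}
```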

