Bug 1881832 - [CinderLib] - remove managed block disks from storage domain which have a VM snapshot fails - EVENT_ID: USER_FINISHED_FAILED_REMOVE_DISK(2,015)
Summary: [CinderLib] - remove managed block disks from storage domain which have a VM ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.40.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.5.0
Target Release: 4.50.0.10
Assignee: shani
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-23 07:44 UTC by mandre38
Modified: 2022-04-28 09:26 UTC
CC List: 5 users

Fixed In Version: vdsm-4.50.0.10
Clone Of:
Environment:
Last Closed: 2022-04-28 09:26:34 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?
pm-rhel: devel_ack+


Attachments
ovirt engine log (3.05 KB, text/plain), 2020-09-24 08:12 UTC, mandre38
cinderlib log file (1.79 KB, text/plain), 2020-10-07 07:17 UTC, mandre38
ovirt engine log (889.88 KB, application/gzip), 2020-10-07 08:25 UTC, mandre38


Links
Github oVirt vdsm pull 90: "mpathconf.py: add a blacklist for rbd devices" (status: open, last updated 2022-03-03 19:06:47 UTC)

Description mandre38 2020-09-23 07:44:48 UTC
Description of problem:

Removing managed block disks from a Cinderlib (Managed Block Storage) domain, after taking snapshots of the VM, fails with EVENT_ID: USER_FINISHED_FAILED_REMOVE_DISK(2,015).

Version-Release number of selected component (if applicable):

oVirt Engine ver. 4.4.1, 4.4.2

2020-09-23 08:59:29,144+02 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (default task-9) [442dbd9c-3e72-4bc1-9964-675c7c836da4] Lock Acquired to object 'EngineLock:{exclusiveLocks='[63759b73-5198-415a-b81a-a5f881fdea02=DISK]', sharedLocks=''}'
2020-09-23 08:59:29,250+02 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engine-Thread-15297) [442dbd9c-3e72-4bc1-9964-675c7c836da4] Running command: RemoveDiskCommand internal: false. Entities affected :  ID: 63759b73-5198-415a-b81a-a5f881fdea02 Type: DiskAction group DELETE_DISK with role type USER
2020-09-23 08:59:29,314+02 INFO  [org.ovirt.engine.core.bll.storage.disk.managedblock.RemoveManagedBlockStorageDiskCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [442dbd9c-3e72-4bc1-9964-675c7c836da4] Running command: RemoveManagedBlockStorageDiskCommand internal: true.
2020-09-23 08:59:43,709+02 ERROR [org.ovirt.engine.core.common.utils.cinderlib.CinderlibExecutor] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [442dbd9c-3e72-4bc1-9964-675c7c836da4] cinderlib execution failed: 
2020-09-23 08:59:43,759+02 ERROR [org.ovirt.engine.core.bll.storage.disk.managedblock.RemoveManagedBlockStorageDiskCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [442dbd9c-3e72-4bc1-9964-675c7c836da4] Transaction rolled-back for command 'org.ovirt.engine.core.bll.storage.disk.managedblock.RemoveManagedBlockStorageDiskCommand'.
2020-09-23 08:59:43,870+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-15297) [442dbd9c-3e72-4bc1-9964-675c7c836da4] EVENT_ID: USER_FINISHED_FAILED_REMOVE_DISK(2,015), Failed to remove disk SimpleBackup_Disk1 from storage domain cinderlib1 (User: admin@internal-authz).
2020-09-23 08:59:43,890+02 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engine-Thread-15297) [442dbd9c-3e72-4bc1-9964-675c7c836da4] Lock freed to object 'EngineLock:{exclusiveLocks='[63759b73-5198-415a-b81a-a5f881fdea02=DISK]', sharedLocks=''}'


How reproducible:


Steps to Reproduce:
1. Take a snapshot of a VM that has a disk on a Managed Block Storage domain
2. Remove the snapshot of the VM
3. Remove the disk from the Managed Block Storage domain (a scripted sketch of these steps follows below)
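
For a scripted reproduction of the same flow, here is a rough sketch using the Python SDK (ovirtsdk4); the engine URL, credentials, and VM name are placeholders, and waiting and error handling are simplified:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details; adjust for the actual engine.
conn = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # lab only; use ca_file in production
)

vms_service = conn.system_service().vms_service()
vm = vms_service.list(search='name=TestVM_cinderlib')[0]
vm_service = vms_service.vm_service(vm.id)

# Step 1: take a snapshot of the VM and wait until it is no longer locked.
snaps_service = vm_service.snapshots_service()
snap = snaps_service.add(types.Snapshot(description='mbs-test'))
while (snaps_service.snapshot_service(snap.id).get().snapshot_status
        == types.SnapshotStatus.LOCKED):
    time.sleep(2)

# Step 2: remove the snapshot.
snaps_service.snapshot_service(snap.id).remove()

# Step 3: detach and remove the managed block disk; on the engine side this
# runs RemoveManagedBlockStorageDiskCommand, which is what fails here.
attachments_service = vm_service.disk_attachments_service()
for att in attachments_service.list():
    attachments_service.attachment_service(att.id).remove(detach_only=True)
    conn.system_service().disks_service().disk_service(att.disk.id).remove()

conn.close()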

Actual results:

Failed to remove disk SimpleBackup_Disk1 from storage domain cinderlib1 (User: admin@internal-authz).

Expected results:

Disk "Disk name" was successfully removed from domain cinderlib1 (User admin@internal-authz).

Additional info:

Comment 1 mandre38 2020-09-24 08:12:24 UTC
Created attachment 1716237 [details]
ovirt engine log

Comment 2 RHEL Program Management 2020-10-05 14:27:31 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 3 Benny Zlotnik 2020-10-06 12:53:50 UTC
Thanks for the report. Can you confirm that this happened on a VM that was previously running?
It looks like the issue occurs when detaching fails, so the volume still has a reference in the volume_attachments table.
Are you using Ceph?
Can you share /var/log/ovirt-engine/cinderlib/cinderlib.log?
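
For context, a minimal cinderlib sketch that lists leftover connection records; the backend parameters are assumptions for a Ceph setup like this one, and in an oVirt deployment cl.setup() would first have to point the persistence layer at the engine's cinderlib database (omitted here):

import cinderlib as cl

# Placeholder backend config, modeled on a Ceph RBD setup.
backend = cl.Backend(
    volume_backend_name='ceph',
    volume_driver='cinder.volume.drivers.rbd.RBDDriver',
    rbd_pool='rbdpool',
    rbd_user='admin',
    rbd_ceph_conf='/etc/ceph/ceph.conf',
)

# A volume whose detach failed keeps a connection record, and deleting it
# fails until that stale connection is gone.
for vol in backend.volumes:
    if vol.connections:
        print(vol.id, vol.status, len(vol.connections), 'connection(s) left')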

Comment 4 mandre38 2020-10-07 07:17:33 UTC
Created attachment 1719602 [details]
cinderlib log file

Cinderlib log captured while trying to remove the disk after taking a snapshot

Comment 5 mandre38 2020-10-07 07:21:14 UTC
Yes, it happened on a VM that was previously running, and yes, we are using Ceph 14.2.
I have added the log file from cinderlib.
WBR
Mike

Comment 6 Benny Zlotnik 2020-10-07 07:21:49 UTC
(In reply to mandre38 from comment #4)
> Created attachment 1719602 [details]
> cinderlib log file
> 
> Cinderlib log captured while trying to remove the disk after taking a snapshot
The issue seems different; it looks like the snapshot failed to be removed. Can you share the full engine and cinderlib logs? I can't see the snapshot removal failure in the attached logs.

Additionally, which cinderlib version are you using? Did you install it with the rpm from a recent OpenStack repo (Train or later) or with pip?

Comment 7 mandre38 2020-10-07 08:25:39 UTC
Created attachment 1719652 [details]
ovirt engine log

Hmmmm... I just tested it again (create a new managed block disk -> attach it to a VM -> take a snapshot -> remove the snapshot -> remove the disk via the oVirt web interface) and this time the disk removal succeeded. Before opening this bug, I tested it on 4 different VMs on 2 different hosts and the disk removal failed with the same error.
Including the full log anyway.
For info:
cinderlib was installed via pip, version 2.0.0

Comment 8 Benny Zlotnik 2020-10-07 08:56:14 UTC
(In reply to mandre38 from comment #7)
> Created attachment 1719652 [details]
> ovirt engine log
> 
> Hmmmm... I just tested it again (create a new managed block disk -> attach
> it to a VM -> take a snapshot -> remove the snapshot -> remove the disk via
> the oVirt web interface) and this time the disk removal succeeded. Before
> opening this bug, I tested it on 4 different VMs on 2 different hosts and
> the disk removal failed with the same error.
> Including the full log anyway.
> For info:
> cinderlib was installed via pip, version 2.0.0

Can you start the VM and power it off before removing the snapshot? I believe the issue is related to a failure to unmap the volume from the host.

Since you are using Ceph, you might be hitting bug 1755801, where multipath prevents the rbd device from being unmapped. It can be worked around by blacklisting rbd devices in multipath, like this:
# cat /etc/multipath.conf.d/rbd.conf
blacklist {
       devnode "^(rbd)[0-9]*"
}

But we need to confirm that this is indeed the issue.
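
As a side note, the devnode pattern above is a regular expression matched against kernel device names, so it covers rbd0, rbd12, and so on; a tiny illustrative check:

import re

# The same pattern as in the multipath blacklist above.
pattern = re.compile(r"^(rbd)[0-9]*")
for dev in ("rbd0", "rbd12", "sda"):
    print(dev, "blacklisted" if pattern.match(dev) else "kept")

Also note that multipathd has to re-read its configuration (for example via a reconfigure or a service restart) before the blacklist takes effect.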

Comment 9 Benny Zlotnik 2020-10-07 09:01:11 UTC
OK, now that I have actually looked at the logs you attached, they show this is indeed the issue:

2020-09-24 16:45:14,935+02 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.DetachManagedBlockStorageVolumeVDSCommand] (ForkJoinPool-1-worker-19) [7e6b1dc1] Unexpected return value: Status [code=926, message=Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpbz8pnmf7/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 9725\\nTraceback (most recent call last):\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 154, in <module>\\n    sys.exit(main(sys.argv[1:]))\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 77, in main\\n    args.command(args)\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 149, in detach\\n    ignore_errors=False)\\n  File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 121, in disconnect_volume\\n    run_as_root=True)\\n  File "/usr/local/lib/python3.6/site-packages/os_brick/executor.py", line 52, in _execute\\n    result = self.__execute(*args, **kwargs)\\n  File "/usr/local/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute\\n    return execute_root(*cmd, **kwargs)\\n  File "/usr/local/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 247, in _wrap\\n    return self.channel.remote_call(name, args, kwargs)\\n  File "/usr/local/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 224, in remote_call\\n    raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/rbdpool/volume-c3ccd5c8-ecd0-439b-8071-bb8c3277364a --conf /tmp/brickrbd_ja992li0 --id admin --mon_host 149.148.216.38:6789 --mon_host 149.148.216.39:6789 --mon_host 149.148.216.37:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',)]


We already have the bug mentioned in the previous comment to add the blacklist as part of vdsm's configuration. I will use this bug to improve the workflow: add better logging, and add an additional detach attempt when removing, so that retries become possible.
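
The extra detach attempt could look roughly like this; an illustrative sketch of the retry idea only, not the actual engine or vdsm code:

import logging
import time

def detach_with_retry(connection, attempts=3, delay=5):
    """Try to detach a volume connection, retrying transient failures
    such as 'Device or resource busy' from rbd unmap."""
    for i in range(1, attempts + 1):
        try:
            connection.detach()
            return
        except Exception:
            logging.exception("detach attempt %d/%d failed", i, attempts)
            if i == attempts:
                raise
            time.sleep(delay)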

Comment 10 mandre38 2020-10-07 09:48:00 UTC
I added the blacklist in /etc/multipath.conf.d/rbd.conf, started the VM, then powered it off, removed the snapshot, and finally removed the disk without any problem:

2020-10-07 11:41:24,281+02 INFO  [org.ovirt.engine.core.common.utils.cinderlib.CinderlibExecutor] (default task-69) [dbf360da-109a-41b3-a222-e956f73e42e5] cinderlib output: 
2020-10-07 11:41:24,353+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-69) [dbf360da-109a-41b3-a222-e956f73e42e5] EVENT_ID: USER_REMOVE_SNAPSHOT(342), Snapshot 'Snap5' deletion for VM 'TestVM_cinderlib' was initiated by admin@internal-authz.
2020-10-07 11:41:26,665+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [dbf360da-109a-41b3-a222-e956f73e42e5] Command 'RemoveSnapshot' id: 'd7835ce6-e670-4af8-92d0-f7695cc5dd9f' child commands '[6a495488-8093-46ca-82d1-5a428aa516ea]' executions were completed, status 'SUCCEEDED'
2020-10-07 11:41:27,685+02 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [dbf360da-109a-41b3-a222-e956f73e42e5] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand' successfully.
2020-10-07 11:41:27,710+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [dbf360da-109a-41b3-a222-e956f73e42e5] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_SUCCESS(356), Snapshot 'Snap5' deletion for VM 'TestVM_cinderlib' has been completed.
2020-10-07 11:41:52,008+02 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (default task-69) [8d8c12e3-2cf4-4990-8e60-6516497186ed] Lock Acquired to object 'EngineLock:{exclusiveLocks='[a3d65025-d2ae-4602-83fc-51cf50c2778a=DISK]', sharedLocks='[97f6a75f-7693-440c-b337-b3470a4fc94c=VM]'}'
2020-10-07 11:41:52,558+02 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engine-Thread-48149) [8d8c12e3-2cf4-4990-8e60-6516497186ed] Running command: RemoveDiskCommand internal: false. Entities affected :  ID: a3d65025-d2ae-4602-83fc-51cf50c2778a Type: DiskAction group DELETE_DISK with role type USER
2020-10-07 11:41:52,580+02 INFO  [org.ovirt.engine.core.bll.storage.disk.managedblock.RemoveManagedBlockStorageDiskCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-10) [] Running command: RemoveManagedBlockStorageDiskCommand internal: true.
2020-10-07 11:41:57,132+02 INFO  [org.ovirt.engine.core.common.utils.cinderlib.CinderlibExecutor] (EE-ManagedExecutorService-commandCoordinator-Thread-10) [] cinderlib output: 
2020-10-07 11:41:57,181+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-48149) [8d8c12e3-2cf4-4990-8e60-6516497186ed] EVENT_ID: USER_FINISHED_REMOVE_DISK_ATTACHED_TO_VMS(2,042), Disk TestVM_cinderlib_Disk22gb associated to the VMs TestVM_cinderlib was successfully removed from domain cinderlib1 (User admin@internal-authz).
2020-10-07 11:41:57,202+02 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engine-Thread-48149) [8d8c12e3-2cf4-4990-8e60-6516497186ed] Lock freed to object 'EngineLock:{exclusiveLocks='[a3d65025-d2ae-4602-83fc-51cf50c2778a=DISK]', sharedLocks='[97f6a75f-7693-440c-b337-b3470a4fc94c=VM]'}'
2020-10-07 11:41:58,754+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [8d8c12e3-2cf4-4990-8e60-6516497186ed] Command 'RemoveManagedBlockStorageDisk' id: '7f3829cb-83d1-479a-acad-62af7d6c64b3' child commands '[]' executions were completed, status 'SUCCEEDED'
2020-10-07 11:41:59,770+02 INFO  [org.ovirt.engine.core.bll.storage.disk.managedblock.RemoveManagedBlockStorageDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [8d8c12e3-2cf4-4990-8e60-6516497186ed] Ending command 'org.ovirt.engine.core.bll.storage.disk.managedblock.RemoveManagedBlockStorageDiskCommand' successfully.

Comment 18 Sandro Bonazzola 2022-04-28 09:26:34 UTC
This bugzilla is included in the oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

