Bug 2123008 - engine: qemu-nbd locks virtual disk even when the process fails
Summary: engine: qemu-nbd locks virtual disk even when the process fails
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backup-Restore.Engine
Version: 4.5.2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.3
Assignee: Benny Zlotnik
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-31 14:36 UTC by D. Ercolani
Modified: 2022-09-19 14:31 UTC
CC: 4 users

Fixed In Version: ovirt-engine-4.5.3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-19 14:31:02 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?




Links
System ID Private Priority Status Summary Last Updated
GitHub oVirt ovirt-engine pull 630 0 None open core: fix transferClientType deserialization 2022-09-01 13:52:38 UTC
Red Hat Issue Tracker RHV-47865 0 None None None 2022-08-31 14:43:53 UTC

Description D. Ercolani 2022-08-31 14:36:00 UTC
Engine Release is ovirt-engine-4.5.2.4-1.el8.noarch

The problem appears as a refusal to remove the snapshot created by the hybrid backup process.

The full description is in this thread on the oVirt users list:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/MNVW4FT3Y24ATI2KLXIW3KFMJBWJA2VX/#MNVW4FT3Y24ATI2KLXIW3KFMJBWJA2VX


The hint was given here:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/MNVW4FT3Y24ATI2KLXIW3KFMJBWJA2VX/
by Benny Zlotnik:
"I see that it happened after restarting, so it looks like it messed up
the cleanup sequence and did not close the nbd server."

Links to all the log files are available in the thread.

Comment 1 Casper (RHV QE bot) 2022-08-31 15:04:30 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align relevant severity, flags and keywords to raise PM_Score or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).

Comment 2 Benny Zlotnik 2022-08-31 16:31:57 UTC
Reproducer:
1. Start downloading a disk
2. During download stop ovirt-engine and the ovirt-imageio service
3. Start both again

The problem is in line[1]:
                if (getParameters().getTransferClientType().isBrowserTransfer()) {

where getTransferClientType() returns null (probably because the parameters were reloaded after the engine restart).
[1] https://github.com/oVirt/ovirt-engine/blob/d318da21f71653ea095cfa5b30552d6ea0c74787/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/TransferDiskImageCommand.java#L1357
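The failure mode and the obvious defensive fix can be sketched in a minimal standalone form. Note that the class and enum below are simplified stand-ins for illustration, not the actual engine classes; the real change is in the pull request linked above:

```java
// Hypothetical, simplified model of the NPE described in this bug: after an
// engine restart the deserialized parameters may carry a null client type,
// so calling a method on it directly throws a NullPointerException.
public class TransferClientTypeSketch {

    // Stand-in for the engine's transfer client type enum.
    enum TransferClientType {
        BROWSER, API;
        boolean isBrowserTransfer() { return this == BROWSER; }
    }

    // Stand-in for the command parameters; the field may be null after
    // the parameters are reloaded on engine restart.
    static class TransferParameters {
        private TransferClientType transferClientType;
        TransferClientType getTransferClientType() { return transferClientType; }
        void setTransferClientType(TransferClientType t) { transferClientType = t; }
    }

    // Buggy form: NPE when transferClientType was not restored.
    static boolean isBrowserTransferBuggy(TransferParameters p) {
        return p.getTransferClientType().isBrowserTransfer();
    }

    // Defensive form: treat a missing client type as a non-browser transfer.
    static boolean isBrowserTransferSafe(TransferParameters p) {
        TransferClientType type = p.getTransferClientType();
        return type != null && type.isBrowserTransfer();
    }

    public static void main(String[] args) {
        TransferParameters restored = new TransferParameters(); // type lost on restart
        boolean threwNpe = false;
        try {
            isBrowserTransferBuggy(restored);
        } catch (NullPointerException e) {
            threwNpe = true;
        }
        System.out.println("buggy form throws NPE: " + threwNpe);
        System.out.println("safe form result: " + isBrowserTransferSafe(restored));
    }
}
```

With the guard in place, an interrupted transfer whose client type was lost no longer aborts the cleanup path with an NPE, so the nbd server teardown and lock release can proceed.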

Comment 3 RHEL Program Management 2022-08-31 16:32:03 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 4 D. Ercolani 2022-09-06 22:06:42 UTC
(In reply to Benny Zlotnik from comment #2)
> Reproducer:
> 1. Start downloading a disk
> 2. During download stop ovirt-engine and the ovirt-imageio service
> 3. Start both again
> 
> The problem is in line[1]:
>                 if
> (getParameters().getTransferClientType().isBrowserTransfer()) {
> 
> Where getTransferClientType() is null (probably after reloading the
> parameters after engine restart)
> [1]
> https://github.com/oVirt/ovirt-engine/blob/
> d318da21f71653ea095cfa5b30552d6ea0c74787/backend/manager/modules/bll/src/
> main/java/org/ovirt/engine/core/bll/storage/disk/image/
> TransferDiskImageCommand.java#L1357

I think I understand how I got into this situation: I keep seeing "hangs" of ovirt-engine related to some lock in the Gluster implementation. In vdsm.log I have many:
2022-09-06 20:33:48,960+0000 ERROR (qgapoller/0) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7fb17130e4a8>> operation failed (periodic:204)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 202, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 493, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 814, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable
2022-09-06 20:33:51,358+0000 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata', 1, 'Read timeout')
2022-09-06 20:33:51,358+0000 INFO  (check/loop) [storage.monitor] Domain 3577c21e-f757-4405-97d1-0f827c9b4e22 became INVALID (monitor:482)

If this happens while I am backing up a VM, the hang can trigger this problem.

Comment 5 Casper (RHV QE bot) 2022-09-19 14:31:02 UTC
This bug has low overall severity and passed an automated regression suite, and is not going to be further verified by QE. If you believe special care is required, feel free to re-open to ON_QA status.

