Bug 1665020 - In case of a vdsm failure to perform prepareImage as part of transfer disk image (download image), TransferDiskImageCommand fails with NullPointerException
Summary: In case of a vdsm failure to perform prepareImage as part of transfer disk image (download image), TransferDiskImageCommand fails with NullPointerException
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.5
Target Release: ---
Assignee: Bella Khizgiyaev
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-10 09:55 UTC by Elad
Modified: 2020-12-30 10:15 UTC
CC List: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-12-30 10:15:17 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+
pm-rhel: ovirt-4.5?
aoconnor: blocker-


Attachments
engine and vdsm logs (1020.56 KB, application/gzip), 2019-01-10 09:55 UTC, Elad

Description Elad 2019-01-10 09:55:19 UTC
Created attachment 1519715 [details]
engine and vdsm logs

Description of problem:
Following a failure in vdsm to perform prepareImage, which is done as part of an image download, the engine's TransferDiskImageCommand fails with a NullPointerException (NPE).

Version-Release number of selected component (if applicable):
ovirt-engine-4.3.0-0.6.alpha2.el7.noarch
vdsm-4.30.4-1.el7ev.x86_64


How reproducible:
Encountered once

Steps to Reproduce:
1. Start a disk image download and have VDSM fail during the prepareImage call


Actual results:

Failure in vdsm:

2019-01-09 21:10:18,302+0200 INFO  (ioprocess/10854) [IOProcess] (Global) Starting ioprocess (__init__:434)
2019-01-09 21:10:18,305+0200 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='e7a55210-3fa7-49b7-9d4b-a0b4e174539d') e7a55210-3fa7-49b7-9d4b-a0b4e174539d: load from /rhev/data-center/b3f3778b-0a12-49fe-97e6-06bb5072f104/mastersd/master/tasks, ext '' (task:729)
2019-01-09 21:10:18,311+0200 ERROR (tasks/1) [storage.TaskManager.Task] Unexpected error (task:652)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 631, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 332, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 555, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 543, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
2019-01-09 21:10:18,311+0200 ERROR (tasks/1) [storage.TaskManager] taskManager: Skipping directory: e7a55210-3fa7-49b7-9d4b-a0b4e174539d (taskManager:222)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 214, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1162, in loadTask
    t._load(store, ext)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 737, in _load
    self._loadTaskMetaFile(taskDir)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 688, in _loadTaskMetaFile
    self._loadMetaFile(taskFile, self, Task.fields)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 653, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/b3f3778b-0a12-49fe-97e6-06bb5072f104/mastersd/master/tasks/e7a55210-3fa7-49b7-9d4b-a0b4e174539d/e7a55210-3fa7-49b7-9d4b-a0b4e174539d.task',)
2019-01-09 21:10:18,312+0200 DEBUG (tasks/1) [storage.ThreadPool] Enter - name: mailbox-spm, numThreads: 5, waitTimeout: 3, maxTasks: 500 (threadPool:36)


The reason for the failure is still not clear; it looks like a disconnection from the storage.


NullPointerException failure in engine for TransferDiskImageCommand:

2019-01-09 22:16:53,297+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-115) [14f33570-b7b4-4643-b129-3abc90cbf986] Failed to prepare image for transfer session: {}
: org.ovirt.engine.core.common.errors.EngineException: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:118) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2059) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.prepareImage(TransferDiskImageCommand.java:146) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.addImageTicketToDaemon(TransferDiskImageCommand.java:785) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.startImageTransferSession(TransferDiskImageCommand.java:757) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.handleImageIsReadyForTransfer(TransferDiskImageCommand.java:452) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.executeCommand(TransferDiskImageCommand.java:310) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1141) [bll.jar:]

.
.
.

2019-01-09 22:16:55,621+02 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [14f33570-b7b4-4643-b129-3abc90cbf986] Exception: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedConstructorAccessor421.newInstance(Unknown Source) [:1.8.0_191]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.8.0_191]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423) [rt.jar:1.8.0_191]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.instantiateInjectedCommand(ResourceManager.java:363) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.createCommand(ResourceManager.java:347) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:393) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor178.invoke(Unknown Source) [:1.8.0_191]


Expected results:
The engine should handle the vdsm failure gracefully instead of failing with a NullPointerException.
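
For illustration, a hedged sketch of the kind of defensive handling meant here (the types and names below are hypothetical stand-ins, not the actual oVirt API): check the outcome of the VDS call before dereferencing it, so a failed prepareImage surfaces as a meaningful engine error rather than an NPE.

public class PrepareImageHandling {
    // Hypothetical stand-in for the value returned by a VDS call.
    static class VdsResult {
        final boolean succeeded;
        final String ticketUrl;
        final String error;
        VdsResult(boolean succeeded, String ticketUrl, String error) {
            this.succeeded = succeeded;
            this.ticketUrl = ticketUrl;
            this.error = error;
        }
    }

    static String prepareImageForTransfer(VdsResult result) {
        // Guard against both a missing result and a reported failure,
        // instead of letting a null propagate into an NPE.
        if (result == null || !result.succeeded) {
            String reason = (result == null) ? "no response from host" : result.error;
            throw new IllegalStateException("prepareImage failed: " + reason);
        }
        return result.ticketUrl;
    }

    public static void main(String[] args) {
        try {
            prepareImageForTransfer(null); // simulate a failed VDSM prepareImage
        } catch (IllegalStateException e) {
            System.out.println("handled gracefully: " + e.getMessage());
        }
    }
}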

Additional info:

Comment 1 Raz Tamir 2019-01-11 11:14:54 UTC
Adding a regression flag based on https://bugzilla.redhat.com/show_bug.cgi?id=1493184

Comment 2 Red Hat Bugzilla Rules Engine 2019-01-14 09:17:18 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 RHEL Program Management 2020-03-20 10:27:19 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 4 Lukas Svaty 2020-03-20 12:01:20 UTC
Can we test this flow on el8?

Comment 5 Avihai 2020-03-21 18:40:12 UTC
(In reply to Lukas Svaty from comment #4)
> Can we test this flow on el8?

As proxy-imageio is not available, I'm not sure it's the same as the original bug, which definitely used it.
We can test download via WA without proxy-imageio, but I doubt it's the same flow.

Comment 6 Lukas Svaty 2020-03-24 12:36:18 UTC
Targeting 4.4.0 as the Regression flag is present. Let's retest this and close it if it is not reproducible.

Comment 9 Lukas Svaty 2020-04-06 15:08:41 UTC
Removing the Regression keyword due to low reproducibility.

Comment 10 Bella Khizgiyaev 2020-12-27 16:52:26 UTC
It seems that the original bug does not reproduce on 4.4 and is not relevant for el8, since the flow has changed a bit.
Also, from the logs, the bug looks related to connectivity issues in the environment.

Since I don't think there is anything that can be changed and the bug is not reproducible, I'm closing it.

Comment 11 Bella Khizgiyaev 2020-12-30 10:15:17 UTC
(In reply to Bella Khizgiyaev from comment #10)
> It seems that the original bug does not reproduce on 4.4 and is not
> relevant for el8, since the flow has changed a bit.

I've tried to raise the exception manually from VDSM to see the engine logs in that case. Errors regarding the issue were spotted, but no NPEs were reported as in the attached logs. It seems the issue was fixed and is no longer reproducing.

> 
> Since I don't think there is anything that can be changed and the bug
> is not reproducible, I'm closing it.

