Created attachment 1519715 [details]
engine and vdsm logs

Description of problem:
Following a failure in vdsm to do prepareImage, which is performed as part of a disk image download, the engine's TransferDiskImageCommand fails with an NPE.

Version-Release number of selected component (if applicable):
ovirt-engine-4.3.0-0.6.alpha2.el7.noarch
vdsm-4.30.4-1.el7ev.x86_64

How reproducible:
Encountered once

Steps to Reproduce:
1. VDSM should fail to download an image disk during prepareImage

Actual results:
Failure in vdsm:

2019-01-09 21:10:18,302+0200 INFO  (ioprocess/10854) [IOProcess] (Global) Starting ioprocess (__init__:434)
2019-01-09 21:10:18,305+0200 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='e7a55210-3fa7-49b7-9d4b-a0b4e174539d') e7a55210-3fa7-49b7-9d4b-a0b4e174539d: load from /rhev/data-center/b3f3778b-0a12-49fe-97e6-06bb5072f104/mastersd/master/tasks, ext '' (task:729)
2019-01-09 21:10:18,311+0200 ERROR (tasks/1) [storage.TaskManager.Task] Unexpected error (task:652)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 631, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 332, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 555, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 543, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
2019-01-09 21:10:18,311+0200 ERROR (tasks/1) [storage.TaskManager] taskManager: Skipping directory: e7a55210-3fa7-49b7-9d4b-a0b4e174539d (taskManager:222)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 214, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1162, in loadTask
    t._load(store, ext)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 737, in _load
    self._loadTaskMetaFile(taskDir)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 688, in _loadTaskMetaFile
    self._loadMetaFile(taskFile, self, Task.fields)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 653, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/b3f3778b-0a12-49fe-97e6-06bb5072f104/mastersd/master/tasks/e7a55210-3fa7-49b7-9d4b-a0b4e174539d/e7a55210-3fa7-49b7-9d4b-a0b4e174539d.task',)
2019-01-09 21:10:18,312+0200 DEBUG (tasks/1) [storage.ThreadPool] Enter - name: mailbox-spm, numThreads: 5, waitTimeout: 3, maxTasks: 500 (threadPool:36)

The reason for the vdsm failure is still unclear.
It looks like a disconnection from the storage.

NullPointerException failure in the engine for TransferDiskImageCommand:

2019-01-09 22:16:53,297+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-115) [14f33570-b7b4-4643-b129-3abc90cbf986] Failed to prepare image for transfer session: {} : org.ovirt.engine.core.common.errors.EngineException: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
    at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:118) [bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2059) [bll.jar:]
    at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.prepareImage(TransferDiskImageCommand.java:146) [bll.jar:]
    at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.addImageTicketToDaemon(TransferDiskImageCommand.java:785) [bll.jar:]
    at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.startImageTransferSession(TransferDiskImageCommand.java:757) [bll.jar:]
    at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.handleImageIsReadyForTransfer(TransferDiskImageCommand.java:452) [bll.jar:]
    at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.executeCommand(TransferDiskImageCommand.java:310) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1141) [bll.jar:]
    .
    .
    .

2019-01-09 22:16:55,621+02 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [14f33570-b7b4-4643-b129-3abc90cbf986] Exception: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor421.newInstance(Unknown Source) [:1.8.0_191]
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.8.0_191]
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423) [rt.jar:1.8.0_191]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.instantiateInjectedCommand(ResourceManager.java:363) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.createCommand(ResourceManager.java:347) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:393) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor178.invoke(Unknown Source) [:1.8.0_191]

Expected results:
The engine should handle the vdsm failure gracefully instead of failing with a NullPointerException.

Additional info:
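For illustration only, here is a minimal, hypothetical sketch of what "handling the vdsm failure correctly" on the engine side could look like: validating the VDSReturnValue coming back from the PrepareImage verb before it is consumed, so that a vdsm-side failure surfaces as a regular EngineException rather than an NPE. This is not the actual TransferDiskImageCommand code or a proposed patch; the class/method names (GuardedPrepareImage, prepareImageOrFail) and the use of VDSParametersBase for the parameters are assumptions, while runVdsCommand and the VDSReturnValue accessors are the ones visible in the stack trace above.

import org.ovirt.engine.core.common.errors.EngineError;
import org.ovirt.engine.core.common.errors.EngineException;
import org.ovirt.engine.core.common.vdscommands.VDSCommandType;
import org.ovirt.engine.core.common.vdscommands.VDSParametersBase;
import org.ovirt.engine.core.common.vdscommands.VDSReturnValue;

// Hypothetical sketch only -- not the actual TransferDiskImageCommand fix.
public abstract class GuardedPrepareImage {

    // Provided by CommandBase in the real command classes (see the stack trace above).
    protected abstract VDSReturnValue runVdsCommand(VDSCommandType type, VDSParametersBase params);

    protected Object prepareImageOrFail(VDSParametersBase prepareParams) {
        VDSReturnValue retVal = runVdsCommand(VDSCommandType.PrepareImage, prepareParams);

        // If vdsm failed (e.g. a storage disconnection during prepareImage), raise a
        // proper EngineException instead of dereferencing a missing return value and
        // hitting a NullPointerException later in the transfer flow.
        if (retVal == null || !retVal.getSucceeded() || retVal.getReturnValue() == null) {
            String cause = retVal == null ? "no response from host" : retVal.getExceptionString();
            throw new EngineException(EngineError.ENGINE, "prepareImage failed: " + cause);
        }
        return retVal.getReturnValue();
    }
}

The only point of the sketch is that the result of runVdsCommand is checked before it is used; whatever fix (if any) ends up in the engine may look different.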
Adding a regression flag based on https://bugzilla.redhat.com/show_bug.cgi?id=1493184
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Can we test this flow on el8?
(In reply to Lukas Svaty from comment #4)
> Can we test this flow on el8?

Since proxy-imageio is not available there, I'm not sure it would be the same flow as in the original bug, which definitely used it. We can test the download via the WA without proxy-imageio, but I doubt it's the same flow.
Targeting 4.4.0 as the Regression flag is present. Let's retest this and close it if it is not reproducible.
Removing the Regression keyword -> low reproducibility.
It seems that the original bug does not reproduce on 4.4 and is not relevant for el8, since the flow has changed a bit. Also, from the logs, the bug looks related to some connectivity issues in the environment. Since I don't think there is anything that can be changed and the bug is not reproducible, I'm closing it.
(In reply to Bella Khizgiyaev from comment #10)
> Seems that the original bug is not reproduced on 4.4 and not relevant for
> el8 since the flow changed a bit.

I've tried raising the exception manually from VDSM to see the engine logs in that case. Errors regarding the issue were spotted, but no NPEs were reported as in the attached logs. It seems that the issue was fixed and it is not reproducing anymore.

> Since I don't think there is anything that can be changed and that the bug
> is not reproducible I'm closing the bug.