Created attachment 1646278 [details] Host packages as of 2019/12/13
Created attachment 1646279 [details] Host packages as of 2019/12/15
vdsmd 4.4.0 throws an exception in asyncore.py while updating OVF data. The first exception occurred on December 14th and has recurred every hour since. An hour before the first exception, we updated the engine's and host's RPM packages. We did not run engine-backup, so we cannot restore the oVirt database if it was corrupted.

Currently, both the engine (RHEL 7.7) and the host (RHEL 8.1) have the most recent packages and were rebooted. Neither the updates nor the reboots resolved the issue.

In general, oVirt is working. We can run virtual machines with disks in the storage domain, and we can create new virtual machines. But some functionality is not available; for example, we cannot move disks between domains.

The storage domain in question is on NFS. After the issue appeared, we successfully created a GlusterFS domain, but its OVF update also failed. Interestingly, it was possible to move disks to it before the OVF failure.

Here is the vdsmd traceback for your convenience:

2019-12-17 16:36:58,393-0600 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:172.20.1.142', 38002, 0, 0) at 0x7fbda865ed30> (<class 'TypeError'>:object of type 'NoneType' has no len() [/usr/lib64/python3.6/asyncore.py|readwrite|108] [/usr/lib64/python3.6/asyncore.py|handle_read_event|423] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_read|71] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py|handle_read|115]) (betterAsyncore:179)

------ Current versions (that have the exception) ------

Engine
ovirt-engine-4.4.0-0.0.master.20191204120550.git04d5d05.el7.noarch

Host
vdsm-4.40.0-1363.gitf6a1ba0a0.el8.x86_64
vdsm-python-4.40.0-1363.gitf6a1ba0a0.el8.noarch
vdsm-yajsonrpc-4.40.0-1363.gitf6a1ba0a0.el8.noarch
python3-libs-3.6.8-15.1.el8.x86_64

------ December 13th versions (that did not have the exception) ------

Engine
ovirt-engine-4.4.0-0.0.master.20191204120550.git04d5d05.el7.noarch --- not sure, but probably the same as now

Host
vdsm-4.40.0-1360.git821afbbc2.el8.x86_64
vdsm-python-4.40.0-1360.git821afbbc2.el8.noarch
vdsm-yajsonrpc-4.40.0-1360.git821afbbc2.el8.noarch
python3-libs-3.6.8-15.1.el8.x86_64
Created attachment 1646280 [details] Difference in host packages
Created attachment 1646281 [details] Engine packages
Created attachment 1646283 [details] Engine and host logs (debug is enabled)

Here are the most recent errors.

Host vdsm.log:

2019-12-18 21:34:01,367-0600 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:172.16.0.142', 45878, 0, 0) at 0x7f70fc644e80> (<class 'TypeError'>:object of type 'NoneType' has no len() [/usr/lib64/python3.6/asyncore.py|readwrite|108] [/usr/lib64/python3.6/asyncore.py|handle_read_event|423] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_read|71] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py|handle_read|115]) (betterAsyncore:179)

Engine engine.log:

2019-12-18 21:34:01,368-06 DEBUG [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Exception: java.util.concurrent.ExecutionException: javax.net.ssl.SSLException: Connection reset
Caused by: javax.net.ssl.SSLException: Connection reset
Caused by: java.net.SocketException: Connection reset

2019-12-18 21:34:01,376-06 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Command 'UploadStreamVDSCommand(HostName = harrier, UploadStreamVDSCommandParameters:{hostId='3ccc2c46-378f-4106-b4d4-6656b79088cc'})' execution failed: java.net.SocketException: Connection reset

2019-12-18 21:34:01,376-06 DEBUG [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketException: Connection reset
Caused by: java.net.SocketException: Connection reset

2019-12-18 21:34:01,378-06 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketException: Connection reset (Failed with error VDS_NETWORK_ERROR and code 5022)
Not sure if I am looking at a Storage issue: engine.log indicates failing communication with a host called "harrier" by its IPv4 address:

2019-12-17 14:21:55,734-06 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to harrier.castle.aimk.com/172.16.0.106
2019-12-17 14:21:55,736-06 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [] Unable to RefreshCapabilities: ConnectException: Connection refused
...
2019-12-18 20:37:16,089-06 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to harrier.castle.aimk.com/172.16.0.106
...
2019-12-18 21:34:01,378-06 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketException: Connection reset (Failed with error VDS_NETWORK_ERROR and code 5022)

Yet according to vdsm.log, the host "harrier" sets up an IPv6 socket for handling multiprotocol messages:

2019-12-18 20:35:02,029-0600 INFO (MainThread) [vds] (PID: 7516) I am the actual vdsm 4.40.0.1363.gitf6a1ba0a0 harrier.castle.aimk.com (4.18.0-147.3.1.el8_1.x86_64) (vdsmd:152)
...
2019-12-18 20:35:02,516-0600 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Creating socket (host='::', port=54321, family=10, socketype=1, proto=6) (protocoldetector:225)
2019-12-18 20:35:02,519-0600 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:183)
2019-12-18 20:35:02,542-0600 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Adding detector <vdsm.rpc.http.HttpDetector object at 0x7f713014b438> (protocoldetector:210)
2019-12-18 20:35:02,550-0600 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Adding detector <yajsonrpc.stompserver.StompDetector object at 0x7f7120147f98> (protocoldetector:210)
...
2019-12-18 21:34:01,367-0600 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:172.16.0.142', 45878, 0, 0) at 0x7f70fc644e80> (<class 'TypeError'>:object of type 'NoneType' has no len() [/usr/lib64/python3.6/asyncore.py|readwrite|108] [/usr/lib64/python3.6/asyncore.py|handle_read_event|423] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_read|71] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py|handle_read|115]) (betterAsyncore:179)

The unexpected None return from sock.recv(...) at protocoldetector.py:115 looks like an anomaly that should be handled in problematic connectivity cases; there are prior examples of such handling in the handle_read() code in stomp.py.
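For illustration, here is a minimal, self-contained reproduction of the failure mode named in the traceback. All values are made up; only the len(None) pattern matches the real code:

```python
# A recv wrapper on an SSL-backed, non-blocking socket can yield None
# when no application data is ready yet; calling len() on that value
# raises exactly the error from the traceback above.
data = None          # stand-in for a "not ready yet" recv result
required_size = 11   # arbitrary example threshold

try:
    if len(data) < required_size:
        print("need more bytes")
except TypeError as e:
    print(e)  # -> object of type 'NoneType' has no len()
```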
> 2019-12-17 14:21:55,736-06 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [] Unable to RefreshCapabilities: ConnectException: Connection refused

The host was under maintenance and was brought back (including vdsmd.service and supervdsmd.service) around 2019-12-18 20:35:01, so messages with earlier timestamps are probably not relevant.

> Yet according to vdsm.log, the host "harrier" sets an IPv6 socket for handling multiprotocol messages:
> 2019-12-18 20:35:02,516-0600 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Creating socket (host='::', port=54321, family=10, socketype=1, proto=6) (protocoldetector:225)

It seems that vdsmd creates and listens on an AF_INET6 socket that can accept both IPv4 and IPv6 connections.

Just in case, I removed all global engine and host IPv6 addresses from the interfaces and DNS. The only IPv6 address the host has is ::1. I rebooted the engine and the host. It does not help. See the new log files for errors (this time the engine log is also in debug mode).

Host

2019-12-19 15:49:25,965-0600 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:172.16.0.142', 34880, 0, 0) at 0x7f40a0711a20> (<class 'TypeError'>:object of type 'NoneType' has no len() [/usr/lib64/python3.6/asyncore.py|readwrite|108] [/usr/lib64/python3.6/asyncore.py|handle_read_event|423] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_read|71] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py|handle_read|115]) (betterAsyncore:179)

Engine

2019-12-19 15:49:25,966-06 DEBUG [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-19) [21693ecf] Exception: java.util.concurrent.ExecutionException: javax.net.ssl.SSLException: Connection reset

# lsof -nPp 14971 | grep IP
vdsmd 14971 vdsm 13u IPv4 107255 0t0 UDP *:42578
vdsmd 14971 vdsm 20u IPv6 122070 0t0 TCP [::1]:54321->[::1]:35162 (ESTABLISHED)
vdsmd 14971 vdsm 30u IPv6 107354 0t0 TCP *:54321 (LISTEN)
vdsmd 14971 vdsm 45u IPv6 321565 0t0 TCP 172.16.0.106:54321->172.16.0.142:34094 (ESTABLISHED)
vdsmd 14971 vdsm 66u IPv6 397971 0t0 TCP 172.16.0.106:54321->172.16.0.142:34240 (ESTABLISHED)

As you can see, all established connections are plain TCPv4. tcpdump also does not show any TCPv6 connection attempts.

> ... like an anomaly which should be handled in problematic connectivity cases ...

What is problematic with the connectivity? It looks like the TCP connection is established, the exception happens on the host, and the host resets the connection. It means that the RPC request has attributes that are wrong, does it not? Is there any workaround for it?
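As an aside on the dual-stack observation above, here is a minimal standard-library sketch (not vdsm code) of an AF_INET6 listener that also accepts IPv4 clients, which then show up as IPv4-mapped addresses like the '::ffff:172.16.0.142' seen in the dispatcher errors:

```python
import socket

# family=10 (AF_INET6), socketype=1 (SOCK_STREAM), proto=6 (TCP) --
# the same triple vdsm logs when creating its acceptor socket.
sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)

# With IPV6_V6ONLY disabled (the Linux default), the IPv6 socket also
# accepts IPv4 connections, reported as IPv4-mapped IPv6 addresses.
sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
sock.bind(('::', 54321))
sock.listen(5)

conn, addr = sock.accept()  # blocks until a client connects
print(addr)                 # e.g. ('::ffff:172.16.0.142', 45878, 0, 0)
```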
Created attachment 1646704 [details] Engine and host logs as of 2019/12/19 with error at 15:49:25
Alexander, it looks like reading from the socket returns None when the socket is not ready. A few years ago we had code to handle this, but it was removed for some reason. I'm not sure why this happens on your setup, but this patch may fix the issue:

diff --git a/lib/vdsm/protocoldetector.py b/lib/vdsm/protocoldetector.py
index ee8f676e2..49cd2480f 100644
--- a/lib/vdsm/protocoldetector.py
+++ b/lib/vdsm/protocoldetector.py
@@ -112,7 +112,7 @@ class _ProtocolDetector(object):
             dispatcher.handle_error()
             return
 
-        if len(data) < self._required_size:
+        if data is None or len(data) < self._required_size:
             return
 
         for detector in self._detectors:

It would be useful if you could try this, since this is not reproducible in our tests.
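For context, a condensed, paraphrased sketch of the method being patched, pieced together from the diff hunk and the traceback (not a verbatim copy of protocoldetector.py; the actual recv call and error handling may differ):

```python
import socket

class _ProtocolDetectorSketch:
    """Paraphrase of vdsm's _ProtocolDetector.handle_read, reconstructed
    from the diff hunk above; names match the hunk, the rest is assumed."""

    def __init__(self, detectors, required_size):
        self._detectors = detectors
        self._required_size = required_size

    def handle_read(self, dispatcher):
        sock = dispatcher.socket
        try:
            data = sock.recv(self._required_size)  # real call may differ
        except socket.error:
            dispatcher.handle_error()
            return

        # An SSL-backed socket that is not ready yet may hand back None;
        # without this guard, len(None) raises the TypeError that
        # asyncore reports as an "uncaptured python exception".
        if data is None or len(data) < self._required_size:
            return

        for detector in self._detectors:
            # Probe the first bytes against each protocol detector
            # (HTTP, STOMP, ...) to decide how to handle the connection.
            ...
```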
Created attachment 1646851 [details] Engine and host logs as of 2019/12/20 with Task-Id error at 09:49:49
After the patch, the exception in vdsmd does not happen. But now there is an exception in the engine (see the new logs for details).

2019-12-20 09:49:49,406-06 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [5e3bd80e] Command 'UploadStreamVDSCommand(HostName = harrier, UploadStreamVDSCommandParameters:{hostId='3ccc2c46-378f-4106-b4d4-6656b79088cc'})' execution failed: VDSGenericException: VDSErrorException: response was missing the following header: Task-Id

On the plus side, there are no more "Host harrier is not responding" host events. But the storage domain still gets the error events:

Failed to update OVF disks fcc661df-b2e3-4625-be40-52b65033c6d7, OVF data isn't updated on those OVF stores (Data Center Default, Storage Domain vm3).
(In reply to Alexander Murashkin from comment #11)
> After the patch, the exception in vdsmd does not happen. But now there is an
> exception in the engine (see the new logs for details).
>
> 2019-12-20 09:49:49,406-06 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88)
> [5e3bd80e] Command 'UploadStreamVDSCommand(HostName = harrier,
> UploadStreamVDSCommandParameters:{hostId='3ccc2c46-378f-4106-b4d4-
> 6656b79088cc'})' execution failed:
> VDSGenericException: VDSErrorException: response was missing the
> following header: Task-Id

The Engine version you are currently using is missing a fix made for another OVF store update issue (BZ 1779085):

commit 692dadf69fbddcafbaa5ab8a637656c099a481a9
Author: Artur Socha <asocha>
Date: Tue Dec 17 17:33:56 2019 +0100

    core: missing Task-id header fix

    Change-Id: I2879bb11a1e56f319a73c54606ad253eedc5a45f
    Bug-Url: https://bugzilla.redhat.com/1779085
    Signed-off-by: Artur Socha <asocha>
These two fixes resolve the issue. Our lab has an NFS storage domain over Ethernet and a GlusterFS storage domain over InfiniBand. There are no OVF problems so far. Thank you for your help.
We have updated the host today. The patch is still not in the nightly snapshot repository. The current vdsm package is vdsm-python-4.40.0-1436.git6e7278919.el8.noarch
Per the analysis here and Nir's input on the proposed patch discussion [1], the issue is categorized as a jsonrpc infrastructure issue affecting the handling of OVF update requests under SSL.

Alexander, until we decide whether to merge [1], you can patch your vdsm locally per comment #9.

[1] https://gerrit.ovirt.org/#/c/105860/1/lib/vdsm/protocoldetector.py@108
Please provide verification steps for this BZ.
Unfortunately I can't think of a way to verify this besides trying the steps described by Alexander in [1]. This was a very low-level issue and there's no easy scenario one can try to reach the previously erroneous path. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1785061#c2
(In reply to Marcin Sobczyk from comment #19)
> Unfortunately I can't think of a way to verify this besides trying the steps
> described by Alexander in [1].
> This was a very low-level issue and there's no easy scenario one can try to
> reach the previously erroneous path.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1785061#c2

Looking at https://bugzilla.redhat.com/show_bug.cgi?id=1785061#c2 I see no steps or clear scenarios. We still require a clear scenario to verify this bug. Can you please specify a flow to verify it? We can force an OVF update, or wait for the scheduled one (each hour), and see if the issue appears. Is this enough?
Then please try to:
- prepare an env ready for the OVF update,
- make a snapshot of the env,
- revert the change made to vdsm [1] (trivial),
- try doing the OVF update and see if it fails (a scripted way to trigger the update is sketched below),
- revert to the snapshot,
- try doing the OVF update and see if it succeeds.

[1] https://gerrit.ovirt.org/105860
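For the "try doing the OVF update" steps, you can trigger the update on demand instead of waiting for the hourly scheduler. An untested sketch using ovirt-engine-sdk4; the URL, credentials, and domain name are placeholders, and update_ovf_store() (the call behind the UI's "Update OVFs" button) should be verified against your SDK version:

```python
import ovirtsdk4 as sdk

# Connection details below are placeholders for your engine.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='redacted',
    insecure=True,  # lab only; use ca_file in production
)
try:
    sds_service = connection.system_service().storage_domains_service()
    # 'vm3' is the affected domain name from the events in this bug.
    sd = sds_service.list(search='name=vm3')[0]
    # Force an OVF store update on that domain.
    sds_service.storage_domain_service(sd.id).update_ovf_store()
finally:
    connection.close()
```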
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
*** Bug 2090638 has been marked as a duplicate of this bug. ***