Bug 1785061 - vdsmd 4.4.0 throws an exception in asyncore.py while updating OVF data
Summary: vdsmd 4.4.0 throws an exception in asyncore.py while updating OVF data
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.40.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: Marcin Sobczyk
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Duplicates: 2090638 (view as bug list)
Depends On:
Blocks: 2072100
 
Reported: 2019-12-19 02:47 UTC by Alexander Murashkin
Modified: 2022-06-09 08:50 UTC (History)
CC List: 8 users

Fixed In Version: rhv-4.4.0-29
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2072100 (view as bug list)
Environment:
Last Closed: 2020-05-20 20:00:19 UTC
oVirt Team: Infra
Embargoed:
sbonazzo: ovirt-4.4?


Attachments
Host packages as of 2019/12/13 (66.72 KB, text/plain) - 2019-12-19 02:47 UTC, Alexander Murashkin
Host packages as of 2019/12/15 (67.19 KB, text/plain) - 2019-12-19 02:49 UTC, Alexander Murashkin
Difference in host packages (8.90 KB, text/plain) - 2019-12-19 02:53 UTC, Alexander Murashkin
Engine packages (57.71 KB, text/plain) - 2019-12-19 02:53 UTC, Alexander Murashkin
Engine and host logs (debug is enabled) (1.30 MB, application/gzip) - 2019-12-19 03:47 UTC, Alexander Murashkin
Engine and host logs as of 2019/12/19 with error at 15:49:25 (2.25 MB, application/gzip) - 2019-12-20 00:35 UTC, Alexander Murashkin
Engine and host logs as of 2019/12/20 with Task-Id error at 09:49:49 (3.69 MB, application/gzip) - 2019-12-20 16:17 UTC, Alexander Murashkin


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44396 0 None None None 2021-12-22 06:40:14 UTC
oVirt gerrit 105860 0 master MERGED protocoldetector: Handle None return value from sock.recv() 2021-02-09 19:49:40 UTC

Description Alexander Murashkin 2019-12-19 02:47:10 UTC
Created attachment 1646278 [details]
Host packages as of 2019/12/13

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Alexander Murashkin 2019-12-19 02:49:46 UTC
Created attachment 1646279 [details]
Host packages as of 2019/12/15

Comment 2 Alexander Murashkin 2019-12-19 02:51:33 UTC
vdsmd 4.4.0 throws an exception in asyncore.py while updating OVF data. The first exception occurred on December 14th, and it has happened every hour since then.

An hour before the first exception, we updated the engine's and the host's RPM packages. We did not run engine-backup, so we cannot restore the oVirt database if it got corrupted.

Currently, both the engine (RHEL 7.7) and the host (RHEL 8.1) have the most recent packages (and were rebooted). Neither the updates nor the reboots resolved the issue.

In general, oVirt is working. We can run virtual machines with disks in the storage domain. We can create new virtual machines. But some functionality is not available; for example, we cannot move disks between domains.

The storage domain in question is on NFS. After the issue appeared, we successfully created a GlusterFS domain, but its OVF update also failed. Interestingly, it was possible to move disks to it before the OVF failure.

Here is the vdsmd traceback for your convenience:

2019-12-17 16:36:58,393-0600 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel 
<yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:172.20.1.142', 38002, 0, 0) at 0x7fbda865ed30> (
  <class 'TypeError'>:object of type 'NoneType' has no len() 
  [/usr/lib64/python3.6/asyncore.py|readwrite|108]
  [/usr/lib64/python3.6/asyncore.py|handle_read_event|423] 
  [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_read|71] 
  [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] 
  [/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py|handle_read|115]
) 
(betterAsyncore:179)
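
The last frame shows the immediate cause: protocoldetector's handle_read() calls len() on whatever the socket read returned, and here that value was None. Below is a minimal, self-contained Python sketch (an editor's illustration, not vdsm code) of that failure mode; it assumes the wrapped SSL read can return None when no decrypted payload is available yet, and the required_size value is made up:

    def recv_wrapper():
        # Stand-in for the dispatcher's recv(); per later analysis in this
        # bug, the SSL read path can return None when the socket is not ready.
        return None

    def handle_read(required_size=11):
        data = recv_wrapper()
        # Unpatched logic: len() is applied unconditionally, so a None read
        # raises exactly the TypeError shown in the traceback above.
        if len(data) < required_size:
            return

    try:
        handle_read()
    except TypeError as e:
        print(e)  # object of type 'NoneType' has no len()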

------ Current Versions (that have the exception) ------

Engine
	ovirt-engine-4.4.0-0.0.master.20191204120550.git04d5d05.el7.noarch

Host
	vdsm-4.40.0-1363.gitf6a1ba0a0.el8.x86_64
	vdsm-python-4.40.0-1363.gitf6a1ba0a0.el8.noarch
	vdsm-yajsonrpc-4.40.0-1363.gitf6a1ba0a0.el8.noarch
	python3-libs-3.6.8-15.1.el8.x86_64

------ December 13th Versions (that did not have the exception) ------

Engine 
	ovirt-engine-4.4.0-0.0.master.20191204120550.git04d5d05.el7.noarch  --- not sure, but probably the same as now

Host
	vdsm-4.40.0-1360.git821afbbc2.el8.x86_64
	vdsm-python-4.40.0-1360.git821afbbc2.el8.noarch
	vdsm-yajsonrpc-4.40.0-1360.git821afbbc2.el8.noarch
	python3-libs-3.6.8-15.1.el8.x86_64

Comment 3 Alexander Murashkin 2019-12-19 02:53:08 UTC
Created attachment 1646280 [details]
Difference in host packages

Comment 4 Alexander Murashkin 2019-12-19 02:53:56 UTC
Created attachment 1646281 [details]
Engine packages

Comment 5 Alexander Murashkin 2019-12-19 03:47:04 UTC
Created attachment 1646283 [details]
Engine and host logs (debug is enabled)

Here are the most recent errors:

Host vdsm.log

2019-12-18 21:34:01,367-0600 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:172.16.0.142', 45878, 0, 0) at 0x7f70fc644e80> (<class 'TypeError'>:object of type 'NoneType' has no len() [/usr/lib64/python3.6/asyncore.py|readwrite|108] [/usr/lib64/python3.6/asyncore.py|handle_read_event|423] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_read|71] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py|handle_read|115]) (betterAsyncore:179)

Engine engine.log

2019-12-18 21:34:01,368-06 DEBUG [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Exception: java.util.concurrent.ExecutionException: javax.net.ssl.SSLException: Connection reset
Caused by: javax.net.ssl.SSLException: Connection reset
Caused by: java.net.SocketException: Connection reset
2019-12-18 21:34:01,376-06 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Command 'UploadStreamVDSCommand(HostName = harrier, UploadStreamVDSCommandParameters:{hostId='3ccc2c46-378f-4106-b4d4-6656b79088cc'})' execution failed: java.net.SocketException: Connection reset
2019-12-18 21:34:01,376-06 DEBUG [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketException: Connection reset
Caused by: java.net.SocketException: Connection reset
2019-12-18 21:34:01,378-06 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketException: Connection reset (Failed with error VDS_NETWORK_ERROR and code 5022)

Comment 6 Amit Bawer 2019-12-19 17:26:09 UTC
Not sure if I am looking at a Storage issue:

engine.log indicates failing communication with a host called "harrier" via its IPv4 address:

2019-12-17 14:21:55,734-06 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to harrier.castle.aimk.com/172.16.0.106
2019-12-17 14:21:55,736-06 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [] Unable to RefreshCapabilities: ConnectException: Connection refused
...

2019-12-18 20:37:16,089-06 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to harrier.castle.aimk.com/172.16.0.106
...
2019-12-18 21:34:01,378-06 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [4ee193fe] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketException: Connection reset (Failed with error VDS_NETWORK_ERROR and code 5022)


Yet according to vdsm.log, the host "harrier" sets up an IPv6 socket for handling multiprotocol messages:

2019-12-18 20:35:02,029-0600 INFO  (MainThread) [vds] (PID: 7516) I am the actual vdsm 4.40.0.1363.gitf6a1ba0a0 harrier.castle.aimk.com (4.18.0-147.3.1.el8_1.x86_64) (vdsmd:152)
...
2019-12-18 20:35:02,516-0600 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Creating socket (host='::', port=54321, family=10, socketype=1, proto=6) (protocoldetector:225)
2019-12-18 20:35:02,519-0600 INFO  (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:183)
2019-12-18 20:35:02,542-0600 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Adding detector <vdsm.rpc.http.HttpDetector object at 0x7f713014b438> (protocoldetector:210)
2019-12-18 20:35:02,550-0600 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Adding detector <yajsonrpc.stompserver.StompDetector object at 0x7f7120147f98> (protocoldetector:210)
...
2019-12-18 21:34:01,367-0600 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:172.16.0.142', 45878, 0, 0) at 0x7f70fc644e80> (<class 'TypeError'>:object of type 'NoneType' has no len() [/usr/lib64/python3.6/asyncore.py|readwrite|108] [/usr/lib64/python3.6/asyncore.py|handle_read_event|423] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_read|71] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py|handle_read|115]) (betterAsyncore:179)

The unexpected None return value from sock.recv(...) at protocoldetector.py:115 seems like an anomaly which should be handled in problematic connectivity cases; there are prior examples of handling it in the stomp.py code for handle_read().
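
For illustration, here is a minimal sketch of that defensive pattern (an editor's sketch, not the actual stomp.py or protocoldetector.py code; the dispatcher.recv signature and the required_size value are assumptions): treat a None read as "no data yet" and return, so the length check is never applied to None.

    def handle_read(dispatcher, required_size=11):
        data = dispatcher.recv(required_size)
        if data is None:
            # Socket not ready yet (e.g. the TLS layer needs more bytes);
            # keep the connection open and wait for the next read event.
            return
        if len(data) < required_size:
            # Not enough bytes yet to run protocol detection.
            return
        # ... continue with protocol detection on `data` ...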

Comment 7 Alexander Murashkin 2019-12-20 00:33:01 UTC
> 2019-12-17 14:21:55,736-06 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [] Unable to RefreshCapabilities: ConnectException: Connection refused

The host was under maintenance and was brought back (including vdsmd.service and supervdsmd.service) around 2019-12-18 20:35:01, so messages with earlier timestamps are probably not relevant.

> Yet according to vdsm.log, the host "harrier" sets an IPv6 socket for handling multiprotocol messages:
> 2019-12-18 20:35:02,516-0600 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Creating socket (host='::', port=54321, family=10, socketype=1, proto=6) (protocoldetector:225)

It seems that vdsmd creates and listens on an AF_INET6 socket that can accept both IPv4 and IPv6 connections.
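
This is standard dual-stack behaviour; family=10 is AF_INET6 and proto=6 is TCP in the log line above. A minimal Python sketch (an editor's illustration, not vdsm code) of such a listener, showing why IPv4 clients appear as IPv4-mapped addresses like '::ffff:172.16.0.142' in the dispatcher errors:

    import socket

    # AF_INET6 socket bound to '::', as in the MultiProtocolAcceptor log.
    # With IPV6_V6ONLY disabled, the same socket also accepts IPv4 clients,
    # which show up as IPv4-mapped IPv6 addresses.
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s.bind(('::', 54321))  # use a free port if 54321 is already taken
    s.listen(5)
    conn, addr = s.accept()  # waits for a client to connect
    print(addr)  # e.g. ('::ffff:172.16.0.142', 45878, 0, 0) for an IPv4 client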

Just in case, I removed all global IPv6 addresses of the engine and the host from the interfaces and DNS. The only IPv6 address the host has now is ::1. I rebooted the engine and the host. It did not help.

See the new log files for errors (this time the engine log is also in debug mode).

Host

2019-12-19 15:49:25,965-0600 ERROR (Reactor thread) [vds.dispatcher] uncaptured python exception, closing channel <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:172.16.0.142', 34880, 0, 0) at 0x7f40a0711a20> (<class 'TypeError'>:object of type 'NoneType' has no len() [/usr/lib64/python3.6/asyncore.py|readwrite|108] [/usr/lib64/python3.6/asyncore.py|handle_read_event|423] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|handle_read|71] [/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py|_delegate_call|168] [/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py|handle_read|115]) (betterAsyncore:179)

Engine

2019-12-19 15:49:25,966-06 DEBUG [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-19) [21693ecf] Exception: java.util.concurrent.ExecutionException: javax.net.ssl.SSLException: Connection reset

# lsof -nPp 14971  | grep IP
vdsmd   14971 vdsm   13u     IPv4             107255      0t0      UDP *:42578 
vdsmd   14971 vdsm   20u     IPv6             122070      0t0      TCP [::1]:54321->[::1]:35162 (ESTABLISHED)
vdsmd   14971 vdsm   30u     IPv6             107354      0t0      TCP *:54321 (LISTEN)
vdsmd   14971 vdsm   45u     IPv6             321565      0t0      TCP 172.16.0.106:54321->172.16.0.142:34094 (ESTABLISHED)
vdsmd   14971 vdsm   66u     IPv6             397971      0t0      TCP 172.16.0.106:54321->172.16.0.142:34240 (ESTABLISHED)

As you can see, all connections are TCPv4. tcpdump also does not show any TCPv6 connection attempts.

> ... like an anomaly which should be handled in problematic connectivity cases ...

What is problematic with the connectivity? It looks like the TCP connection is established, the exception happens on the host, and the host resets the connection. 

It means that the RPC request has attributes that are wrong, does it not?

Is there any workaround for it?

Comment 8 Alexander Murashkin 2019-12-20 00:35:28 UTC
Created attachment 1646704 [details]
Engine and host logs as of 2019/12/19 with error at 15:49:25

Comment 9 Nir Soffer 2019-12-20 02:42:32 UTC
Alexander, it looks like reading from the socket returns None when
the socket is not ready. A few years ago we had code to handle this,
but it was removed for some reason. I'm not sure why this happens
on your setup, but this patch may fix the issue:

diff --git a/lib/vdsm/protocoldetector.py b/lib/vdsm/protocoldetector.py
index ee8f676e2..49cd2480f 100644
--- a/lib/vdsm/protocoldetector.py
+++ b/lib/vdsm/protocoldetector.py
@@ -112,7 +112,7 @@ class _ProtocolDetector(object):
             dispatcher.handle_error()
             return
 
-        if len(data) < self._required_size:
+        if data is None or len(data) < self._required_size:
             return
 
         for detector in self._detectors:

It would be useful if you could try this, since this is not reproducible in
our tests.
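
One note on the proposed condition: Python's `or` short-circuits left to right, so with the `data is None` check first, len(data) is never evaluated for a None read. A trivial, self-contained demonstration (illustrative values only, not vdsm code):

    required_size = 11  # illustrative, not vdsm's actual value

    for data in (None, b"GET", b"PUT / HTTP/1.1\r\n"):
        if data is None or len(data) < required_size:
            # None or too little data: wait for the next read event.
            print("not enough data yet:", data)
        else:
            print("enough data to run protocol detection:", data)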

Comment 10 Alexander Murashkin 2019-12-20 16:17:52 UTC
Created attachment 1646851 [details]
Engine and host logs as of 2019/12/20 with Task-Id error at 09:49:49

Comment 11 Alexander Murashkin 2019-12-20 16:26:26 UTC
After the patch, the exception in vdsmd no longer happens. But now there is an exception in the engine (see the new logs for details).

    2019-12-20 09:49:49,406-06 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) 
    [5e3bd80e] Command 'UploadStreamVDSCommand(HostName = harrier, UploadStreamVDSCommandParameters:{hostId='3ccc2c46-378f-4106-b4d4-6656b79088cc'})' execution failed: 
    VDSGenericException: VDSErrorException: response was missing the following header: Task-Id

On the plus side, there are no more "Host harrier is not responding" host events. But the storage domain still gets these error events:

    Failed to update OVF disks fcc661df-b2e3-4625-be40-52b65033c6d7, OVF data isn't updated on those OVF stores (Data Center Default, Storage Domain vm3).

Comment 12 Amit Bawer 2019-12-20 20:11:10 UTC
(In reply to Alexander Murashkin from comment #11)
> After the patch, the exception in vdsmd does not happen. But now there is an
> exception in the engine (see the new logs for details).
> 
>     2019-12-20 09:49:49,406-06 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) 
>     [5e3bd80e] Command 'UploadStreamVDSCommand(HostName = harrier,
> UploadStreamVDSCommandParameters:{hostId='3ccc2c46-378f-4106-b4d4-
> 6656b79088cc'})' execution failed: 
>     VDSGenericException: VDSErrorException: response was missing the
> following header: Task-Id

The Engine version you are currently using is missing a fix made for another OVF store update issue (BZ 1779085):

commit 692dadf69fbddcafbaa5ab8a637656c099a481a9
Author: Artur Socha <asocha>
Date:   Tue Dec 17 17:33:56 2019 +0100

    core: missing Task-id header fix
    
    Change-Id: I2879bb11a1e56f319a73c54606ad253eedc5a45f
    Bug-Url: https://bugzilla.redhat.com/1779085
    Signed-off-by: Artur Socha <asocha>

Comment 13 Alexander Murashkin 2019-12-22 18:39:19 UTC
These two fixes resolve the issue. Our lab has an NFS storage domain over Ethernet and a GlusterFS storage domain over InfiniBand. There are no OVF problems so far.

Thank you for your help.

Comment 14 Alexander Murashkin 2019-12-23 18:01:35 UTC
We have updated the host today. The patch is still not in the nightly snapshot repository.

The current vdsm package is vdsm-python-4.40.0-1436.git6e7278919.el8.noarch

Comment 15 Amit Bawer 2019-12-24 10:42:21 UTC
Per the analysis here and Nir's input on the proposed patch discussion [1], the issue is categorized as a jsonrpc infra issue affecting the handling of OVF update requests under SSL.

Alexander, until we decide whether to merge [1], you can patch your vdsm locally per comment #9.

[1] https://gerrit.ovirt.org/#/c/105860/1/lib/vdsm/protocoldetector.py@108

Comment 18 Ilan Zuckerman 2020-04-02 07:44:05 UTC
Please provide verification steps for this BZ.

Comment 19 Marcin Sobczyk 2020-04-06 09:17:15 UTC
Unfortunately I can't think of a way to verify this besides trying the steps described by Alexander in [1].
This was a very low-level issue and there's no easy scenario one can try to reach the previously erroneous path. 

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1785061#c2

Comment 20 Avihai 2020-04-06 10:04:24 UTC
(In reply to Marcin Sobczyk from comment #19)
> Unfortunately I can't think of a way to verify this besides trying the steps
> described by Alexander in [1].
> This was a very low-level issue and there's no easy scenario one can try to
> reach the previously erroneous path. 
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1785061#c2

Looking at https://bugzilla.redhat.com/show_bug.cgi?id=1785061#c2 I see no steps or clear scenarios.
We still require a clear scenario to verify this bug.

Can you please specify a flow to verify this?
We could force an OVF update, or wait for the scheduled one (it runs each hour), and see if the issue appears.
Is this enough?
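
For reference, the OVF update can also be triggered on demand. A hedged sketch using the oVirt Python SDK (assuming ovirt-engine-sdk-python 4.x with the update_ovf_store storage-domain action available; the URL, credentials and the domain name 'vm3' are placeholders):

    import ovirtsdk4 as sdk

    # Connect to the engine API (placeholder URL/credentials).
    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='password',
        ca_file='ca.pem',
    )
    try:
        sds_service = connection.system_service().storage_domains_service()
        sd = sds_service.list(search='name=vm3')[0]
        # Ask the engine to regenerate and upload the OVF store for this
        # domain, then watch vdsm.log on the host for the dispatcher error.
        sds_service.storage_domain_service(sd.id).update_ovf_store()
    finally:
        connection.close()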

Comment 21 Marcin Sobczyk 2020-04-06 10:13:24 UTC
Then please try to:
- prepare an env ready for the OVF update
- make a snapshot of the env
- revert the change made to vdsm [1] (trivial)
- try doing the OVF update and see if it fails
- revert to snapshot
- try doing the OVF update and see if it succeeds

[1] https://gerrit.ovirt.org/105860

Comment 24 Sandro Bonazzola 2020-05-20 20:00:19 UTC
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th, 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 25 Kobi Hakimi 2022-06-01 10:14:44 UTC
*** Bug 2090638 has been marked as a duplicate of this bug. ***

Comment 26 Michal Skrivanek 2022-06-09 08:50:39 UTC
*** Bug 2090638 has been marked as a duplicate of this bug. ***

