Bug 1115044 - Host stuck in "Unassigned" state when using jsonrpc and disconnection from pool failed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Piotr Kliczewski
QA Contact: Antonin Pagac
URL:
Whiteboard: infra
Depends On:
Blocks: 1073943
 
Reported: 2014-07-01 13:16 UTC by Nir Soffer
Modified: 2016-02-10 19:34 UTC
CC: 7 users

Fixed In Version: ovirt-3.5.0_rc1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-17 12:45:07 UTC
oVirt Team: Infra


Attachments
engine log (380.68 KB, application/x-xz), 2014-07-01 13:17 UTC, Nir Soffer
vdsm.log (418.60 KB, application/x-xz), 2014-07-01 13:18 UTC, Nir Soffer
screenshot on engine hosts tab (14.29 KB, image/png), 2014-07-01 13:19 UTC, Nir Soffer
vdsm log showing how first message from engine fails (55.84 KB, application/x-xz), 2014-07-01 13:55 UTC, Nir Soffer


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 29651 master ABANDONED jsonrpc: Disconnecting from vdsm for maintanace Never
oVirt gerrit 29652 master MERGED jsonrpc: Accept all stomp commands Never
oVirt gerrit 29653 master MERGED Close channel and keep stomp client open Never
oVirt gerrit 29881 ovirt-3.5 MERGED jsonrpc: Accept all stomp commands Never
oVirt gerrit 30741 ovirt-3.5 MERGED stomp: Server side heartbeat implementation Never
oVirt gerrit 30990 ovirt-engine-3.5 MERGED jsonrpc: client side heartbeat implementation Never

Description Nir Soffer 2014-07-01 13:16:26 UTC
Description of problem:

After vdsm fails to disconnect from the pool (see bug 1115033), the host gets stuck in the "Unassigned" state.

In the vdsm log, we can see that the host no longer receives any requests from the engine - it looks like the engine cannot handle this failure on the vdsm side.

The other hosts on this cluster are running normally, using xmlrpc.

Version-Release number of selected component (if applicable):
3.5.0-0.0.master.20140630172346.git994996b.fc19

How reproducible:
Always.

To reproduce:

1. Add host and enable jsonrpc in General>Advanced Settings
2. Activate host
3. Move host to maintenance

Comment 1 Nir Soffer 2014-07-01 13:17:29 UTC
Created attachment 913762 [details]
engine log

Comment 2 Nir Soffer 2014-07-01 13:18:24 UTC
Created attachment 913763 [details]
vdsm.log

Comment 3 Nir Soffer 2014-07-01 13:19:15 UTC
Created attachment 913764 [details]
screenshot on engine hosts tab

Comment 4 Nir Soffer 2014-07-01 13:32:49 UTC
After restarting engine, problem disappeared - lowering severity.

Comment 5 Nir Soffer 2014-07-01 13:54:58 UTC
Problem came back. It seems that the correct way to reproduce it is:

1. Put host to maintenance (fails on the vdsm side because of bug 1115033)
2. Restart vdsm
3. Activate host
 
The first message sent from the engine fails, and the engine does not try to send any more messages:

Detector thread::DEBUG::2014-07-01 16:40:54,424::protocoldetector::184::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 10.35.1.31:35724
Detector thread::DEBUG::2014-07-01 16:40:54,425::protocoldetector::195::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 10.35.1.31:35724
Detector thread::WARNING::2014-07-01 16:40:54,425::protocoldetector::216::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'UNSUBSC'
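
The "Unrecognized protocol" warning comes from vdsm's protocol detector, which peeks at the first bytes of a new connection and matches them against known handshakes. A minimal sketch of that matching (the recognized openings and function shape here are assumptions for illustration, not vdsm's actual code):

```python
# Hypothetical sketch: a detector that only knows the STOMP CONNECT handshake
# (plus xmlrpc's XML preamble) rejects a connection whose first frame is
# UNSUBSCRIBE, which is exactly what a reconnecting client with a pending
# unsubscription would send first.
KNOWN_OPENINGS = [b"CONNECT", b"<?xml"]  # assumed list, for illustration only

def detect_protocol(first_bytes):
    for opening in KNOWN_OPENINGS:
        if first_bytes.startswith(opening):
            return opening.decode()
    raise ValueError("Unrecognized protocol: %r" % first_bytes[:7].decode())
```

With this shape, the first frame after reconnect never reaches the stomp layer: the detector drops the connection before any JSON-RPC dispatch happens.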

See the second vdsm log, which shows this issue twice.

This is 100% reproducible, increasing severity.

Comment 6 Nir Soffer 2014-07-01 13:55:49 UTC
Created attachment 913770 [details]
vdsm log showing how first message from engine fails

Comment 7 Nir Soffer 2014-07-01 14:18:08 UTC
This explains why vdsm fails when disconnecting from the pool - the engine does not send all the data required by the schema:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 480, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 252, in _dynamicMethod
    raise InvalidCall(fn, methodArgs, e)
InvalidCall: Attempt to call function <bound method StoragePool.disconnect of <API.StoragePool object at 0x7ff6405a9c50>> with arguments (1, u'2440ff3d-275f-42e6-b204-7d055b26b174'): disconnect() takes exactly 4 arguments (3 given)
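
The traceback shows the bridge dispatching the JSON-RPC params straight into the bound API method, so a missing required argument surfaces as a TypeError that gets wrapped as InvalidCall. A sketch of that failure mode (class and parameter names are assumptions, not vdsm's real signature):

```python
# Hypothetical sketch of the failure in the traceback above: calling a bound
# method with **params that omit a required argument raises TypeError, which
# the bridge re-raises as InvalidCall.
class InvalidCall(Exception):
    pass

class StoragePool(object):
    # parameter names are illustrative assumptions
    def disconnect(self, hostID, scsiKey, remove):
        return {"status": "OK"}

def dynamic_method(method, params):
    try:
        return method(**params)
    except TypeError as e:
        raise InvalidCall("Attempt to call %r with %r: %s" % (method, params, e))

pool = StoragePool()
# the engine sent only two of the three required arguments
try:
    dynamic_method(pool.disconnect, {"hostID": 1, "scsiKey": "2440ff3d"})
except InvalidCall as e:
    print("wrapped:", e)
```

(The exact "takes exactly 4 arguments (3 given)" wording is Python 2's; Python 3 reports the missing argument by name, but the failure mode is the same.)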

Comment 8 Nir Soffer 2014-07-01 15:19:36 UTC
Fixing bug 1115033 eliminates this error, but we should prevent the unrecoverable state in the engine.

Comment 9 Piotr Kliczewski 2014-07-07 07:43:19 UTC
I am not able to find "Unrecognized protocol: 'UNSUBSC'" in the attached vdsm log, but looking at the description, here is what is happening.

When the host was put into maintenance mode, a method failure triggered the engine to retry. Vdsm was restarted, but the engine was not aware that the ssl socket channel was stale and kept retrying. After vdsm started again, the engine reconnected to vdsm and sent the pending message (UNSUBSCRIBE). This command type is not recognized by protocol detection.

We need to make sure that protocol detection is able to recognize all stomp commands, and the engine retry logic will make sure that a host is behaving properly.
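
The first part of the fix ("jsonrpc: Accept all stomp commands") amounts to teaching protocol detection the full STOMP 1.2 client command set rather than just the initial handshake. A sketch of that check (the function is an illustration; the command list itself is the standard STOMP 1.2 set):

```python
# Sketch of the fix direction: recognize any STOMP 1.2 client frame, not just
# the initial CONNECT handshake, so a reconnecting client's pending frame
# (e.g. UNSUBSCRIBE) is routed to the stomp reactor instead of being dropped.
STOMP_COMMANDS = (b"CONNECT", b"STOMP", b"SEND", b"SUBSCRIBE", b"UNSUBSCRIBE",
                  b"ACK", b"NACK", b"BEGIN", b"COMMIT", b"ABORT", b"DISCONNECT")

def is_stomp_frame(first_bytes):
    return any(first_bytes.startswith(cmd) for cmd in STOMP_COMMANDS)
```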

Comment 10 Piotr Kliczewski 2014-07-08 14:16:45 UTC
There is an initial timeout of 180 seconds, so the user would wait that long for the host to recover after losing the connection. We reduced the client timeout, so during verification of the fix we need to make sure that the shorter timeout does not break any long-running requests.
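
The later patches add server- and client-side heartbeats so a dead channel is noticed instead of being retried forever. The basic bookkeeping can be sketched as follows (class shape and API are assumptions, not the actual engine/vdsm implementation):

```python
import time

# Hypothetical sketch of heartbeat bookkeeping: each side records when it last
# saw traffic from its peer and declares the connection stale once the
# negotiated interval elapses, so the caller can reconnect promptly.
class HeartbeatMonitor(object):
    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last_seen = time.monotonic()

    def touch(self):
        # call on every received frame or heartbeat byte
        self.last_seen = time.monotonic()

    def is_stale(self):
        return time.monotonic() - self.last_seen > self.interval
```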

Comment 11 Antonin Pagac 2014-09-23 13:00:22 UTC
(In reply to Nir Soffer from comment #5)
> Problem came back. Seem that the correct way to reproduce it is:
> 
> 1. Put host to maintenance - (fails in the vdsm side because of bug 1115033)
> 2. Restart vdsm
> 3. Activate host
>  

Verified. Host is almost immediately in the 'Up' state.

RHEVM Version: 3.5.0-0.11.beta.el6ev.


(In reply to Piotr Kliczewski from comment #10)
> There is initial timeout set to 180 seconds so the user would wait this time
> for host to recover after loosing connection. We reduced client timeout so
> during verification of the fix we need to make sure that shorter timeout do
> not break any long lasting requests.

Hi Piotr,

can you please provide an example of such a long-running request? Thank you.

Comment 12 Piotr Kliczewski 2014-09-23 13:10:44 UTC
Creating an SD takes a bit of time (not sure exactly how long). When I reduced the timeout to 20 seconds, I was unable to configure the system to create a VM.

Comment 13 Sandro Bonazzola 2014-10-17 12:45:07 UTC
oVirt 3.5 has been released and should include the fix for this issue.

