Bug 1115044

Summary: Host stuck in "Unassigned" state when using jsonrpc and disconnection from pool failed
Product: [Retired] oVirt Reporter: Nir Soffer <nsoffer>
Component: ovirt-engine-core    Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE QA Contact: Antonin Pagac <apagac>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.5    CC: amureini, gklein, iheim, oourfali, pkliczew, rbalakri, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: ovirt-3.5.0_rc1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-17 12:45:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1073943    
Attachments:
Description Flags
engine log
none
vdsm.log
none
screenshot on engine hosts tab
none
vdsm log showing how first message from engine fails
none

Description Nir Soffer 2014-07-01 13:16:26 UTC
Description of problem:

After vdsm fails to disconnect from the pool (see bug 1115033), the host gets stuck in the "Unassigned" state.

In the vdsm log, we can see that the host does not get any more requests from the engine - it looks like the engine cannot handle this failure on the vdsm side.

The other hosts in this cluster are running normally, using xmlrpc.

Version-Release number of selected component (if applicable):
3.5.0-0.0.master.20140630172346.git994996b.fc19

How reproducible:
Always.

To reproduce:

1. Add host and enable jsonrpc in General>Advanced Settings
2. Activate host
3. Move host to maintenance

Comment 1 Nir Soffer 2014-07-01 13:17:29 UTC
Created attachment 913762 [details]
engine log

Comment 2 Nir Soffer 2014-07-01 13:18:24 UTC
Created attachment 913763 [details]
vdsm.log

Comment 3 Nir Soffer 2014-07-01 13:19:15 UTC
Created attachment 913764 [details]
screenshot on engine hosts tab

Comment 4 Nir Soffer 2014-07-01 13:32:49 UTC
After restarting engine, problem disappeared - lowering severity.

Comment 5 Nir Soffer 2014-07-01 13:54:58 UTC
The problem came back. It seems that the correct way to reproduce it is:

1. Put the host into maintenance (fails on the vdsm side because of bug 1115033)
2. Restart vdsm
3. Activate host
 
The first message sent from the engine fails, and the engine does not try to send more messages:

Detector thread::DEBUG::2014-07-01 16:40:54,424::protocoldetector::184::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 10.35.1.31:35724
Detector thread::DEBUG::2014-07-01 16:40:54,425::protocoldetector::195::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 10.35.1.31:35724
Detector thread::WARNING::2014-07-01 16:40:54,425::protocoldetector::216::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'UNSUBSC'

See second vdsm log showing this issue twice.

This is 100% reproducible, increasing severity.

Comment 6 Nir Soffer 2014-07-01 13:55:49 UTC
Created attachment 913770 [details]
vdsm log showing how first message from engine fails

Comment 7 Nir Soffer 2014-07-01 14:18:08 UTC
This explains why vdsm fails when disconnecting from the pool - the engine does not send all the required data according to the schema:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 480, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 252, in _dynamicMethod
    raise InvalidCall(fn, methodArgs, e)
InvalidCall: Attempt to call function <bound method StoragePool.disconnect of <API.StoragePool object at 0x7ff6405a9c50>> with arguments (1, u'2440ff3d-275f-42e6-b204-7d055b26b174'): disconnect() takes exactly 4 arguments (3 given)
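This failure mode is easy to reproduce in isolation: the bridge dispatches the JSON-RPC params dict into the bound API method with `**params`, so a missing required parameter raises a TypeError that gets wrapped in InvalidCall before the method body ever runs. A minimal sketch (the class bodies and parameter names here are hypothetical, not the actual vdsm API):

```python
# Minimal sketch of how a JSON-RPC bridge call with a missing parameter
# surfaces as an InvalidCall-style error. Names are illustrative only,
# not the real vdsm Bridge/API code.

class InvalidCall(Exception):
    def __init__(self, fn, args, reason):
        super(InvalidCall, self).__init__(
            "Attempt to call function %r with arguments %r: %s"
            % (fn, args, reason))

class StoragePool(object):
    # Counting 'self', this takes exactly 4 arguments
    def disconnect(self, hostID, scsiKey, remove):
        return True

def dynamic_method(obj, name, params):
    fn = getattr(obj, name)
    try:
        # The params dict from the JSON-RPC request is expanded here;
        # a missing key means the call fails before the method runs
        return fn(**params)
    except TypeError as e:
        raise InvalidCall(fn, params, e)

pool = StoragePool()
try:
    # Only 2 of the 3 required parameters are sent, as in the traceback
    dynamic_method(pool, "disconnect", {"hostID": 1, "scsiKey": "2440ff3d"})
except InvalidCall as e:
    print(e)
```

In the traceback above, the engine sent only two arguments, so with `self` counted the bound method saw 3 of the 4 it requires.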

Comment 8 Nir Soffer 2014-07-01 15:19:36 UTC
Fixing bug 1115033 eliminates this error, but we should prevent the unrecoverable state in the engine.

Comment 9 Piotr Kliczewski 2014-07-07 07:43:19 UTC
I am not able to find "Unrecognized protocol: 'UNSUBSC'" in the attached vdsm log, but looking at the description, here is what is happening.

When the host was put into maintenance mode, there was a method failure which triggered the engine to retry. Vdsm is restarted, but the engine is not aware that the SSL socket channel is stale and keeps retrying. After vdsm starts again, the engine reconnects to vdsm and sends the pending message (UNSUBSCRIBE). This command type is not recognized by protocol detection.

We need to make sure that protocol detection is able to recognize all STOMP commands; the engine retry logic will make sure that the host recovers properly.
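The fix direction described above can be sketched as a detector that matches the incoming buffer against the full set of STOMP client commands, accepting partial reads such as the truncated 'UNSUBSC' seen in the log. This is an illustrative sketch, not the actual vdsm protocoldetector code:

```python
# Illustrative sketch of STOMP protocol detection (not the real vdsm
# protocoldetector implementation). A detector that only knows a subset
# of commands would reject a pending UNSUBSCRIBE frame as unrecognized.

# Client frame commands from the STOMP 1.2 specification
STOMP_COMMANDS = (
    b"CONNECT", b"STOMP", b"SEND", b"SUBSCRIBE", b"UNSUBSCRIBE",
    b"BEGIN", b"COMMIT", b"ABORT", b"ACK", b"NACK", b"DISCONNECT",
)

def detect_stomp(buf):
    """Return True if 'buf' starts with a STOMP command.

    'buf' may be a partial read (e.g. only b'UNSUBSC' of a longer
    frame), so a buffer that is a prefix of a command also matches.
    """
    for cmd in STOMP_COMMANDS:
        if buf.startswith(cmd) or cmd.startswith(buf):
            return True
    return False

# The truncated first read seen in the vdsm log above:
print(detect_stomp(b"UNSUBSC"))         # True: prefix of UNSUBSCRIBE
print(detect_stomp(b"GET / HTTP/1.1"))  # False: not a STOMP frame
```

The prefix check matters because the detector classifies the connection from the first few bytes on the wire, which may not contain the complete command yet.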

Comment 10 Piotr Kliczewski 2014-07-08 14:16:45 UTC
There is an initial timeout set to 180 seconds, so the user would wait that long for the host to recover after losing the connection. We reduced the client timeout, so during verification of the fix we need to make sure that the shorter timeout does not break any long-lasting requests.
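The retry behavior described here can be sketched roughly as follows; the timeout values and function names are illustrative, not actual engine configuration:

```python
# Rough sketch of client-side retry-until-timeout behavior. The values
# and names are illustrative only, not real engine settings.
import time

def call_with_retry(send, total_timeout=180.0, retry_interval=3.0):
    """Keep retrying 'send' until it succeeds or 'total_timeout'
    seconds elapse; with a 180s timeout the user could wait up to
    three minutes for the host to recover after a lost connection."""
    deadline = time.monotonic() + total_timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            return send()
        except IOError as e:  # stale or broken connection
            last_error = e
            time.sleep(retry_interval)
    raise last_error
```

The trade-off noted above: lowering the total timeout makes a dead host show up faster, but risks aborting legitimately long operations such as storage domain creation.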

Comment 11 Antonin Pagac 2014-09-23 13:00:22 UTC
(In reply to Nir Soffer from comment #5)
> The problem came back. It seems that the correct way to reproduce it is:
> 
> 1. Put the host into maintenance (fails on the vdsm side because of bug 1115033)
> 2. Restart vdsm
> 3. Activate host
>  

Verified. Host is almost immediately in the 'Up' state.

RHEVM Version: 3.5.0-0.11.beta.el6ev.


(In reply to Piotr Kliczewski from comment #10)
> There is an initial timeout set to 180 seconds, so the user would wait that
> long for the host to recover after losing the connection. We reduced the
> client timeout, so during verification of the fix we need to make sure that
> the shorter timeout does not break any long-lasting requests.

Hi Piotr,

can you please provide an example of such a long-lasting request? Thank you.

Comment 12 Piotr Kliczewski 2014-09-23 13:10:44 UTC
Creating an SD takes a bit of time (not sure exactly how long). When I reduced the timeout to 20 seconds, I was unable to configure the system to create a VM.

Comment 13 Sandro Bonazzola 2014-10-17 12:45:07 UTC
oVirt 3.5 has been released and should include the fix for this issue.