Description of problem:
After vdsm fails to disconnect from the pool (see bug 1115033), the host gets stuck in the "Unassigned" state.
In the vdsm log, we can see that the host does not get any more requests from the engine - it looks like the engine cannot handle this failure on the vdsm side.
The other hosts in this cluster are running normally, using xmlrpc.
Version-Release number of selected component (if applicable):

Steps to Reproduce:
1. Add a host and enable jsonrpc in General > Advanced Settings
2. Activate the host
3. Move the host to maintenance
Created attachment 913762 [details]
Created attachment 913763 [details]
Created attachment 913764 [details]
Screenshot of the engine Hosts tab.
After restarting the engine, the problem disappeared - lowering severity.
The problem came back. It seems that the correct way to reproduce it is:
1. Put the host into maintenance (this fails on the vdsm side because of bug 1115033)
2. Restart vdsm
3. Activate the host
The first message sent from the engine fails, and the engine does not try to send more messages:
Detector thread::DEBUG::2014-07-01 16:40:54,424::protocoldetector::184::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 10.35.1.31:35724
Detector thread::DEBUG::2014-07-01 16:40:54,425::protocoldetector::195::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 10.35.1.31:35724
Detector thread::WARNING::2014-07-01 16:40:54,425::protocoldetector::216::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'UNSUBSC'
See the second vdsm log, which shows this issue twice.
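To make the log above concrete, here is a minimal sketch (not the actual vdsm protocoldetector code; the handler names and prefix table are illustrative) of first-bytes protocol detection, showing why a replayed STOMP frame gets dropped:

# Sketch: the detector peeks at the first bytes of a new connection
# and only recognizes the handshakes it was taught.
KNOWN_HANDSHAKES = {
    b"CONNECT": "stomp",   # jsonrpc-over-stomp session setup
    b"PUT /": "http",      # illustrative placeholder for another protocol
}

def detect_protocol(first_bytes):
    for prefix, name in KNOWN_HANDSHAKES.items():
        if first_bytes.startswith(prefix):
            return name
    # An engine that reconnects and replays a queued UNSUBSCRIBE frame
    # lands here: b"UNSUBSC" matches no known handshake, so the
    # connection is removed, as in the log above.
    raise ValueError("Unrecognized protocol: %r" % first_bytes)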
This is 100% reproducible, increasing severity.
Created attachment 913770 [details]
vdsm log showing how the first message from the engine fails
This explains why vdsm fails when disconnecting the pool: the engine does not send all the required data according to the schema:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 480, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 252, in _dynamicMethod
    raise InvalidCall(fn, methodArgs, e)
InvalidCall: Attempt to call function <bound method StoragePool.disconnect of <API.StoragePool object at 0x7ff6405a9c50>> with arguments (1, u'2440ff3d-275f-42e6-b204-7d055b26b174'): disconnect() takes exactly 4 arguments (3 given)
Fixing bug 1115033 eliminates this error, but we should also prevent the unrecoverable state in the engine.
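For clarity, the mismatch in the traceback can be reproduced with a few lines of standalone Python 2 (parameter names here are illustrative, not the actual API.StoragePool signature):

class StoragePool(object):
    def disconnect(self, hostID, scsiKey, remove):
        # self + 3 parameters = "exactly 4 arguments"
        pass

pool = StoragePool()
# The engine supplied only two values (plus self = "3 given"):
pool.disconnect(1, u'2440ff3d-275f-42e6-b204-7d055b26b174')
# TypeError: disconnect() takes exactly 4 arguments (3 given)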
I am not able to find "Unrecognized protocol: 'UNSUBSC'" in the attached vdsm log, but from the description, here is what is happening.
When the host was put into maintenance mode, a method failure triggered the engine to retry. Vdsm was restarted, but the engine was not aware that the SSL socket channel was stale, so it kept retrying. After vdsm started again, the engine reconnected to vdsm and sent the pending message (UNSUBSCRIBE). This command type is not recognized by protocol detection.
We need to make sure that protocol detection is able to recognize all STOMP commands; the engine retry logic will then make sure that the host is behaving properly.
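One possible shape of such a fix, as a minimal sketch (the command list comes from the STOMP 1.2 specification; the function name is illustrative):

# Treat a connection as STOMP if its first frame starts with any
# STOMP 1.2 client command, not only the CONNECT/STOMP handshake.
STOMP_COMMANDS = (
    b"CONNECT", b"STOMP", b"SEND", b"SUBSCRIBE", b"UNSUBSCRIBE",
    b"BEGIN", b"COMMIT", b"ABORT", b"ACK", b"NACK", b"DISCONNECT",
)

def is_stomp(first_bytes):
    # Match in both directions so a short read such as b"UNSUBSC"
    # (7 bytes, as in the log above) is still recognized.
    return bool(first_bytes) and any(
        cmd.startswith(first_bytes) or first_bytes.startswith(cmd)
        for cmd in STOMP_COMMANDS)

# is_stomp(b"UNSUBSC") -> True, so the pending frame would be handed
# to the STOMP handler instead of the connection being dropped.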
There is an initial timeout set to 180 seconds, so the user would wait that long for the host to recover after losing the connection. We reduced the client timeout, so during verification of the fix we need to make sure that the shorter timeout does not break any long-lasting requests.
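To illustrate the concern, here is a minimal sketch (a hypothetical helper, not the engine's actual client code) of how a reduced per-call timeout interacts with a slow verb:

import socket

def call_with_timeout(sock, request, timeout):
    # Send a request and wait for the reply, failing after `timeout`
    # seconds (e.g. the reduced client timeout).
    sock.settimeout(timeout)
    sock.sendall(request)
    try:
        return sock.recv(4096)
    except socket.timeout:
        # A long-lasting verb (e.g. storage domain creation) may
        # legitimately take longer than a short timeout; the call then
        # fails on the client side while vdsm is still working.
        raise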
(In reply to Nir Soffer from comment #5)
> The problem came back. It seems that the correct way to reproduce it is:
> 1. Put the host into maintenance (this fails on the vdsm side because of bug 1115033)
> 2. Restart vdsm
> 3. Activate the host
Verified. The host reaches the 'Up' state almost immediately.
RHEVM Version: 3.5.0-0.11.beta.el6ev.
(In reply to Piotr Kliczewski from comment #10)
> There is an initial timeout set to 180 seconds, so the user would wait that
> long for the host to recover after losing the connection. We reduced the
> client timeout, so during verification of the fix we need to make sure that
> the shorter timeout does not break any long-lasting requests.
Can you please provide an example of such a long-lasting request? Thank you.
Creating a storage domain takes a bit of time (I am not sure exactly how long). When I reduced the timeout to 20 seconds, I was unable to configure the system far enough to create a VM.
oVirt 3.5 has been released and should include the fix for this issue.