Description of problem:
After vdsm fails to disconnect from the pool (see bug 1115033), the host gets stuck in the "Unassigned" state.
In the vdsm log, we can see that the host does not get any more requests from the engine - it looks like the engine cannot handle this failure on the vdsm side.
The other hosts in this cluster are running normally, using xmlrpc.
Version-Release number of selected component (if applicable):

Steps to Reproduce:
1. Add a host and enable jsonrpc in General > Advanced Settings
2. Activate the host
3. Move the host to maintenance
Created attachment 913762 [details]
Created attachment 913763 [details]
Created attachment 913764 [details]
Screenshot of the engine Hosts tab.
After restarting the engine, the problem disappeared - lowering severity.
The problem came back. It seems that the correct way to reproduce it is:
1. Put the host into maintenance (this fails on the vdsm side because of bug 1115033)
2. Restart vdsm
3. Activate the host
The first message sent from the engine fails, and the engine does not try to send more messages:
Detector thread::DEBUG::2014-07-01 16:40:54,424::protocoldetector::184::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 10.35.1.31:35724
Detector thread::DEBUG::2014-07-01 16:40:54,425::protocoldetector::195::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 10.35.1.31:35724
Detector thread::WARNING::2014-07-01 16:40:54,425::protocoldetector::216::vds.MultiProtocolAcceptor::(_handle_connection_read) Unrecognized protocol: 'UNSUBSC'
See the second vdsm log, which shows this issue twice.
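To make the log above concrete, here is a minimal sketch (not the actual vdsm protocoldetector code; the handler names and prefix table are illustrative) of first-bytes protocol detection, showing why a replayed STOMP frame gets dropped:

# Sketch: the detector peeks at the first bytes of a new connection
# and only recognizes the handshakes it was taught.
KNOWN_HANDSHAKES = {
    b"CONNECT": "stomp",   # jsonrpc-over-stomp session setup
    b"PUT /": "http",      # illustrative placeholder for another protocol
}

def detect_protocol(first_bytes):
    for prefix, name in KNOWN_HANDSHAKES.items():
        if first_bytes.startswith(prefix):
            return name
    # An engine that reconnects and replays a queued UNSUBSCRIBE frame
    # lands here: b"UNSUBSC" matches no known handshake, so the
    # connection is removed, as in the log above.
    raise ValueError("Unrecognized protocol: %r" % first_bytes)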
This is 100% reproducible, increasing severity.
Created attachment 913770 [details]
vdsm log showing how the first message from the engine fails
This explains why vdsm fails when disconnecting the pool: the engine does not send all the required data according to the schema:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 480, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 252, in _dynamicMethod
    raise InvalidCall(fn, methodArgs, e)
InvalidCall: Attempt to call function <bound method StoragePool.disconnect of <API.StoragePool object at 0x7ff6405a9c50>> with arguments (1, u'2440ff3d-275f-42e6-b204-7d055b26b174'): disconnect() takes exactly 4 arguments (3 given)
Fixing bug 1115033 eliminates this error, but we should also prevent the unrecoverable state in the engine.
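For clarity, the mismatch in the traceback can be reproduced with a few lines of standalone Python 2 (parameter names here are illustrative, not the actual API.StoragePool signature):

class StoragePool(object):
    def disconnect(self, hostID, scsiKey, remove):
        # self + 3 parameters = "exactly 4 arguments"
        pass

pool = StoragePool()
# The engine supplied only two values (plus self = "3 given"):
pool.disconnect(1, u'2440ff3d-275f-42e6-b204-7d055b26b174')
# TypeError: disconnect() takes exactly 4 arguments (3 given)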
I am not able to find "Unrecognized protocol: 'UNSUBSC'" in the attached vdsm log, but from the description, here is what is happening.
When the host was put into maintenance mode, a method failure triggered the engine to retry. Vdsm was restarted, but the engine was not aware that the SSL socket channel was stale, so it kept retrying. After vdsm started again, the engine reconnected to vdsm and sent the pending message (UNSUBSCRIBE). This command type is not recognized by protocol detection.
We need to make sure that protocol detection is able to recognize all STOMP commands; the engine retry logic will then make sure that the host is behaving properly.
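One possible shape of such a fix, as a minimal sketch (the command list comes from the STOMP 1.2 specification; the function name is illustrative):

# Treat a connection as STOMP if its first frame starts with any
# STOMP 1.2 client command, not only the CONNECT/STOMP handshake.
STOMP_COMMANDS = (
    b"CONNECT", b"STOMP", b"SEND", b"SUBSCRIBE", b"UNSUBSCRIBE",
    b"BEGIN", b"COMMIT", b"ABORT", b"ACK", b"NACK", b"DISCONNECT",
)

def is_stomp(first_bytes):
    # Match in both directions so a short read such as b"UNSUBSC"
    # (7 bytes, as in the log above) is still recognized.
    return bool(first_bytes) and any(
        cmd.startswith(first_bytes) or first_bytes.startswith(cmd)
        for cmd in STOMP_COMMANDS)

# is_stomp(b"UNSUBSC") -> True, so the pending frame would be handed
# to the STOMP handler instead of the connection being dropped.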
There is an initial timeout set to 180 seconds, so the user would wait that long for the host to recover after losing the connection. We reduced the client timeout, so during verification of the fix we need to make sure that the shorter timeout does not break any long-lasting requests.
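To illustrate the concern, here is a minimal sketch (a hypothetical helper, not the engine's actual client code) of how a reduced per-call timeout interacts with a slow verb:

import socket

def call_with_timeout(sock, request, timeout):
    # Send a request and wait for the reply, failing after `timeout`
    # seconds (e.g. the reduced client timeout).
    sock.settimeout(timeout)
    sock.sendall(request)
    try:
        return sock.recv(4096)
    except socket.timeout:
        # A long-lasting verb (e.g. storage domain creation) may
        # legitimately take longer than a short timeout; the call then
        # fails on the client side while vdsm is still working.
        raise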
(In reply to Nir Soffer from comment #5)
> The problem came back. It seems that the correct way to reproduce it is:
> 1. Put the host into maintenance (this fails on the vdsm side because of bug 1115033)
> 2. Restart vdsm
> 3. Activate the host
Verified. The host reaches the 'Up' state almost immediately.
RHEVM Version: 3.5.0-0.11.beta.el6ev.
(In reply to Piotr Kliczewski from comment #10)
> There is an initial timeout set to 180 seconds, so the user would wait that
> long for the host to recover after losing the connection. We reduced the
> client timeout, so during verification of the fix we need to make sure that
> the shorter timeout does not break any long-lasting requests.
Can you please provide an example of such a long-lasting request? Thank you.
Creating a storage domain takes a bit of time (I am not sure exactly how long). When I reduced the timeout to 20 seconds, I was unable to configure the system far enough to create a VM.
oVirt 3.5 has been released and should include the fix for this issue.