Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1289868

Summary: Host cannot be modified because of XML protocol not supported error
Product: [oVirt] ovirt-engine
Reporter: Petr Matyáš <pmatyas>
Component: Frontend.WebAdmin
Assignee: Moti Asayag <masayag>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.1
CC: bugs, gklein, masayag, mgoldboi, oourfali, pkliczew, pmatyas, pstehlik, sbonazzo
Target Milestone: ovirt-3.6.3
Flags: rule-engine: ovirt-3.6.z+
       rule-engine: exception+
       mgoldboi: planning_ack+
       oourfali: devel_ack+
       pstehlik: testing_ack+
Target Release: 3.6.3.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-11 07:24:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  screenshot (flags: none)
  engine, vdsm and supervdsm logs (flags: none)
  engine, vdsm, supervdsm logs; screenshot (flags: none)
  engine, vdsm, supervdsm logs; screenshot (flags: none)
  engine, vdsm, supervdsm logs (flags: none)

Description Petr Matyáš 2015-12-09 08:58:21 UTC
Created attachment 1103811 [details]
screenshot

Description of problem:
When I try to edit a host and change even only the name, an error appears after clicking the OK button saying 'XML protocol not supported by cluster 3.6 or higher'. The host is working correctly and was installed on a clean RHEL 7.2.

Version-Release number of selected component (if applicable):
rhevm-3.6.1.1-0.1.el6.noarch
vdsm-4.17.11-0.el7ev

How reproducible:
always

Steps to Reproduce:
1. Have a correctly installed and working host in a 3.6 cluster
2. Try to edit it

Actual results:
An error is displayed and the host is not updated.

Expected results:
host is correctly edited

Additional info:
2015-12-09 09:42:44,482 WARN  [org.ovirt.engine.core.bll.hostdeploy.UpdateVdsCommand] (ajp-/127.0.0.1:8702-3) [324eed5e] CanDoAction of action 'UpdateVds' failed for user admin@internal. Reasons: VAR__ACTION__UPDATE,VAR__TYPE__HOST,NOT_SUPPORTED_PROTOCOL_FOR_CLUSTER_VERSION

Comment 1 Oved Ourfali 2015-12-10 07:08:00 UTC
In 3.6 you shouldn't get into a situation where you have an existing host in a 3.6 cluster, working with XMLRPC.

Please provide steps to reproduce that explain what you did with the host before editing it, so that we can understand how you got into this situation.

As I see it, you can only get there in one of the following ways:
Option 1: Add a host to a 3.6 cluster - it will work with jsonrpc.
Option 2: Add a 3.6 host to a 3.5 cluster, then update the cluster to 3.6 - this should fail if there are hosts working with XMLRPC.
Option 3: Update the host's cluster to another cluster which is 3.6, while the host works with XMLRPC - this should fail.

Also, we need to know if that's a clean 3.6.1 environment, or a clean 3.5 environment upgraded to 3.6.1, as otherwise you might have gotten into this situation due to previous bugs.

Comment 2 Oved Ourfali 2015-12-10 07:22:09 UTC
Also, please provide logs.

Comment 3 Moti Asayag 2015-12-10 08:35:55 UTC
Petr, please also provide the output of the following query:

  select protocol from vds_static where vds_name = 'pmatyas-host04';

or access the host resource via the API and paste its 'protocol' element value.
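
For reference, a minimal sketch of the API approach (the engine URL, credentials and host name below are placeholders; only the 'protocol' element itself is taken from the request above):

import requests
import xml.etree.ElementTree as ET

# Placeholder engine URL and credentials -- adjust to the actual environment.
ENGINE = 'https://engine.example.com/ovirt-engine/api'
AUTH = ('admin@internal', 'password')

# Look the host up by name and print the value of its 'protocol' element.
resp = requests.get(ENGINE + '/hosts?search=name%3Dpmatyas-host04',
                    auth=AUTH, verify=False)
root = ET.fromstring(resp.content)
for host in root.findall('host'):
    print(host.findtext('name'), host.findtext('protocol'))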

Comment 4 Piotr Kliczewski 2015-12-10 10:38:06 UTC
There is fallback logic defined for older vdsms which do not support jsonrpc. At the time of connection we do not know whether the vdsm we connect to supports jsonrpc. We try to connect to vdsm using jsonrpc twice, with a timeout in between, and if that fails we assume that we are connecting to a vdsm which does not support it and we switch to xmlrpc. As a result, when the engine finds out the version of the vdsm, it can declare it non-operational depending on the cluster level.

We need to understand why the two attempts to connect to vdsm using jsonrpc failed. Please provide engine and vdsm logs so we can understand the reason.
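
To make the fallback concrete, here is a rough sketch of the behaviour described above (the attempt count matches the comment; the timeout value and function names are illustrative, not the actual engine code):

import time

JSONRPC_ATTEMPTS = 2        # two jsonrpc attempts, as described above
RETRY_TIMEOUT_SECONDS = 10  # illustrative timeout between attempts

def detect_protocol(connect_jsonrpc, connect_xmlrpc):
    """Try jsonrpc first; assume an older vdsm and fall back to xmlrpc
    if both attempts fail."""
    for _ in range(JSONRPC_ATTEMPTS):
        try:
            return connect_jsonrpc()
        except ConnectionError:
            time.sleep(RETRY_TIMEOUT_SECONDS)
    # Both jsonrpc attempts failed: assume a vdsm without jsonrpc support.
    # In a 3.6 cluster this stored xmlrpc choice later trips the
    # NOT_SUPPORTED_PROTOCOL_FOR_CLUSTER_VERSION validation.
    return connect_xmlrpc()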

Comment 5 Petr Matyáš 2015-12-10 10:45:40 UTC
I'm already working with Moti on this. This is also happening in our selenium tests in Jenkins.

Comment 6 Oved Ourfali 2015-12-10 12:30:23 UTC
Discussed offline.
As this seems to be a corner case (even if it is related to what Piotr described in comment #4), we'll address it in 3.6.2.

Comment 7 Oved Ourfali 2015-12-10 13:41:01 UTC
Petr - a clear reproducer, if you have one, would be great.

Comment 8 Petr Matyáš 2015-12-10 14:00:40 UTC
I don't have one; it's happening just for one of my hosts and for every host in our Jenkins. I'll try an upgrade scenario now.

Comment 9 Sandro Bonazzola 2015-12-23 15:08:37 UTC
This bug has target milestone 3.6.2 and is in MODIFIED without a target release.
This may be perfectly correct, but please check whether the patch fixing this bug is included in ovirt-engine-3.6.2. If it is, please set the target release to 3.6.2 and move the bug to ON_QA. Thanks.

Comment 10 Sandro Bonazzola 2016-01-14 08:31:53 UTC
Sorry for the noise with assignee, something went wrong while setting target release.

Comment 11 Petr Matyáš 2016-01-21 10:05:20 UTC
Created attachment 1116876 [details]
engine, vdsm and supervdsm logs

This still happens for one of our test suites (https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.6/view/UI/job/3.6-git-rhevmCore-selenium_webadmin-sanity/)

Tested on 3.6.2-9

Comment 12 Moti Asayag 2016-01-21 11:12:34 UTC
The fix didn't get into 3.6.2. Promoting the target release to 3.6.3, where the fix is already merged.

Comment 13 Petr Matyáš 2016-01-29 12:06:04 UTC
Created attachment 1119407 [details]
engine, vdsm, supervdsm logs; screenshot

Still not fixed on 3.6.3-1

Comment 14 Red Hat Bugzilla Rules Engine 2016-01-29 12:06:26 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 15 Oved Ourfali 2016-01-29 17:11:31 UTC
Can you specify what isn't working?

Comment 16 Moti Asayag 2016-01-31 12:13:27 UTC
Piotr, 
the vdsm.log (attached) contains the following error when attempting to detect the protocol to use:

ioprocess communication (29826)::ERROR::2016-01-20 20:22:24,745::__init__::174::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 129, in _communicate
    raise Exception("FD closed")
Exception: FD closed

and also later on:

JsonRpc (StompReactor)::INFO::2016-01-20 20:29:50,286::stompreactor::153::Broker.StompAdapter::(_cmd_unsubscribe) Unsubscribe command received
JsonRpc (StompReactor)::ERROR::2016-01-20 20:29:50,299::betterAsyncore::124::vds.dispatcher::(recv) SSL error during reading data: unexpected eof
MainThread::DEBUG::2016-01-20 20:29:59,297::vdsm::71::vds::(sigtermHandler) Received signal 15

Could you advise?

Comment 17 Piotr Kliczewski 2016-02-01 11:12:16 UTC
1. The issue with ioprocess has been known for some time. I asked the maintainer of that package to take a look at it, but I am not sure whether a BZ was opened.

2. It looks like we unsubscribed and closed the connection. The SSL error was raised due to the closed connection, which was expected at that point.

Comment 18 Piotr Kliczewski 2016-02-02 12:47:07 UTC
Please provide both engine and vdsm logs from when the issue occurs. The provided logs are from different days, so it is impossible to correlate them.

Comment 19 Petr Matyáš 2016-02-03 11:52:12 UTC
Created attachment 1120722 [details]
engine, vdsm, supervdsm logs; screenshot

Sorry about last time, I must have packed the wrong logs.

Comment 20 Piotr Kliczewski 2016-02-03 12:13:06 UTC
From the provided logs I can see that the engine connected to the host using xmlrpc only.

Reactor thread::DEBUG::2016-02-03 13:16:28,646::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('10.35.161.74', 60385)

On the engine side I can only see:

2016-02-03 13:16:31,938 ERROR [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (DefaultQuartzScheduler_Worker-92) [7e3fff7d] Can not run fence action on host 'host-10.35.160.31', no suitable proxy host was found.

Later I can see that the host was put into maintenance, and CanDoAction for UpdateVds failed with:

2016-02-03 13:37:46,536 WARN  [org.ovirt.engine.core.bll.hostdeploy.UpdateVdsCommand] (ajp-/127.0.0.1:8702-2) [30c5480f] CanDoAction of action 'UpdateVds' failed for user admin@internal. Reasons: VAR__ACTION__UPDATE,VAR__TYPE__HOST,NOT_SUPPORTED_PROTOCOL_FOR_CLUSTER_VERSION

The maintenance triggered a storage disconnect which failed on vdsm:

Thread-99::ERROR::2016-02-03 13:35:54,361::hsm::2557::Storage.HSM::(disconnectStorageServer) Could not disconnect from storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2553, in disconnectStorageServer
    conObj.disconnect()
  File "/usr/share/vdsm/storage/storageServer.py", line 447, in disconnect
    return self._mountCon.disconnect()
  File "/usr/share/vdsm/storage/storageServer.py", line 256, in disconnect
    self._mount.umount(True, True)
  File "/usr/share/vdsm/storage/mount.py", line 256, in umount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 241, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';umount: /rhev/data-center/mnt/vserver-nas02.qa.lab.tlv.redhat.com:_nas02_storage__bqLZjD__nfs__2016__02__03__13__14__55__15384: mountpoint not found\n')

but I do not see that the communication protocol was changed.

It looks like a validation issue on the engine side.
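
To illustrate the suspected engine-side check, here is a sketch of a validation that would produce this failure (the function name, the version threshold and the stored-protocol assumption are mine, not the actual ovirt-engine code):

# Illustrative only -- not the actual UpdateVds CanDoAction implementation.
MIN_JSONRPC_CLUSTER_VERSION = (3, 6)  # assumption: 3.6+ clusters require jsonrpc

def validate_update_vds(stored_protocol, cluster_version):
    """Reject editing a host whose stored protocol is 'xmlrpc' in a 3.6+
    cluster, regardless of how the host actually communicates."""
    if cluster_version >= MIN_JSONRPC_CLUSTER_VERSION and stored_protocol == 'xmlrpc':
        return ['VAR__ACTION__UPDATE', 'VAR__TYPE__HOST',
                'NOT_SUPPORTED_PROTOCOL_FOR_CLUSTER_VERSION']
    return []

# Presumably the host's stored protocol is still 'xmlrpc', which would explain
# why the edit is rejected even though communication with the host works.
print(validate_update_vds('xmlrpc', (3, 6)))  # -> the failure reasons above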

Comment 21 Petr Matyáš 2016-02-18 09:17:28 UTC
Created attachment 1128168 [details]
engine, vdsm, supervdsm logs

Comment 22 Petr Matyáš 2016-02-18 09:18:39 UTC
This issue is still not fixed in 3.6.3-3

Comment 23 Red Hat Bugzilla Rules Engine 2016-02-18 09:18:45 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 24 Moti Asayag 2016-02-21 22:13:43 UTC
Found the exact reproducer for this bug:

1. Add a host to ovirt-engine.
2. In the 'Edit Host' dialog, move the host to 3.4 cluster level.
3. In the 'Edit Host' dialog, move the host to 3.6 cluster level.
4. Try to edit the host once again.

The new patch handles this problem.

Comment 25 Petr Matyáš 2016-02-25 15:27:34 UTC
Verified on 3.6.3-4