Bug 1200474 - Unexpected error received for RHEVH hypervisor status
Summary: Unexpected error received for RHEVH hypervisor status
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-node-plugin-hosted-engine
Version: 3.5.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Fabian Deutsch
QA Contact: Virtualization Bugs
URL:
Whiteboard: node
Depends On:
Blocks: 1059435 1198639 1250199
 
Reported: 2015-03-10 16:00 UTC by Nikolai Sednev
Modified: 2016-02-10 20:10 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-30 06:50:51 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:


Attachments
Error status (120.23 KB, image/png), 2015-03-10 16:00 UTC, Nikolai Sednev
engine logs (1.00 MB, application/x-gzip), 2015-03-12 14:47 UTC, Nikolai Sednev
alma03logs (935.54 KB, application/x-gzip), 2015-03-12 15:00 UTC, Nikolai Sednev
alma03logs (8.26 MB, application/x-gzip), 2015-03-12 15:24 UTC, Nikolai Sednev
alma04logs (8.96 MB, application/x-gzip), 2015-03-12 15:25 UTC, Nikolai Sednev
alma03deploymentlogs latest (61.78 KB, application/x-gzip), 2015-03-12 17:02 UTC, Nikolai Sednev
answers-20150323062421.conf (2.54 KB, text/plain), 2015-03-23 06:35 UTC, Nikolai Sednev

Description Nikolai Sednev 2015-03-10 16:00:53 UTC
Created attachment 1000033 [details]
Error status

Description of problem:
Unexpected error received for RHEVH hypervisor status

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Log in to the GA 3.5 engine and add the latest RHEVH hypervisor to it.
2. wget rhev-hypervisor6-6.6-20150304.0.el6ev.noarch.rpm to the engine.
3. yum localinstall rhev-hypervisor6-6.6-20150304.0.el6ev.noarch.rpm on the engine.
4. Set the RHEVH to maintenance via the engine WebUI and upgrade it to rhev-hypervisor6-6.6-20150304.0.el6ev.
5. Add a second RHEVH and repeat steps 3 to 4 for it.
6. Set both RHEVHs to maintenance and then remove them from the setup.
7. Log in to the first RHEVH and start the HE deployment by running "screen hosted-engine --deploy" (see the command sketch after these steps).
8. Follow the installation steps and complete the HE installation on the first RHEVH.
9. Add the second RHEVH hypervisor to the setup by running the deployment on it and taking the answer file from the first hypervisor.
10. Log in to the HE and see that the status shows an error for the first RHEVH.
11. Migrate the VM from the first RHEVH to the second RHEVH by pressing Migrate in the VM section.
12. Check that the HE VM migrated; now the second RHEVH's status is also reported as ERROR, and both RHEVH hypervisors appear as shown in the attached picture.
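
For reference, a minimal command sketch of the upgrade and deployment steps above (the RPM name is taken from these steps; the download URL is a placeholder and the exact deploy prompts may differ):

# On the engine machine (steps 2-3): fetch and install the RHEV-H image RPM
wget http://<download-url>/rhev-hypervisor6-6.6-20150304.0.el6ev.noarch.rpm
yum localinstall rhev-hypervisor6-6.6-20150304.0.el6ev.noarch.rpm

# On the first RHEV-H host, after removal from the old setup (step 7): start the HE deployment in a screen session
screen hosted-engine --deploy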

Actual results:
Both RHEVHs appear with their status in a weird ERROR state, as depicted in the attachment.

Expected results:
Both RHEVHs should appear as active and without errors.

Additional info:

Comment 1 Doron Fediuck 2015-03-11 07:43:01 UTC
(In reply to Nikolai Sednev from comment #0)
> Created attachment 1000033 [details]
> Error status
> 
> Description of problem:
> Unexpected error received for RHEVH hypervisor status
> 
> Version-Release number of selected component (if applicable):
> 
> 
> How reproducible:
> 100%
> 
> Steps to Reproduce:
> 1.Log in in to GA 3.5 engine and add to it latest RHEVH hypervisor.
> 2.wget rhev-hypervisor6-6.6-20150304.0.el6ev.noarch.rpm to engine.
> 3.yum localinstall rhev-hypervisor6-6.6-20150304.0.el6ev.noarch.rpm on
> engine.
> 4.Set to maintenance RHEVH via WEBUI on engine and upgrade it to
> rhev-hypervisor6-6.6-20150304.0.el6ev.
> 5.Add second RHEVH and do the same steps for it from 3 to 4.
At this point you have 2 rhev-h machines associated with your initial engine.
This means vdsm is working and uses certificates created by your original engine.

> 6.Set both RHEVHs to maintenance and then remove them from the setup.
At this point your hosts are still using their original credentials, even if removed from the original engine.

> 7.Log in to first RHEVH and start HE deployment by running | screen
> hosted-engine --deploy"
> 8.Follow the steps of installation and finally install the HE on first RHEVH.
> 9.Add second RHEVH hypervisor to your setup by running deployment on it and
> getting the answer file from the first hypervisor.
> 10.Log in to HE and see that status has an error for your first RHEVH.
> 11.Perform migration of VM from first RHEVH to second RHEVH, by pressing
> migrate at the VM section.
> 12.Check that HE VM migrated, but now also second RHEVH's status being
> reported as ERROR and you'll see both RHEVH hypervisors a appears within the
> attached picture.
> 

See comments inline.
This is not a valid scenario. You should have a clean setup to deploy the hosted engine, and your hosts had running VDSM instances configured to work with a different engine.

Comment 2 Nikolai Sednev 2015-03-11 09:11:57 UTC
(In reply to Doron Fediuck from comment #1)
> (In reply to Nikolai Sednev from comment #0)
> > Created attachment 1000033 [details]
> > Error status
> > 
> > Description of problem:
> > Unexpected error received for RHEVH hypervisor status
> > 
> > Version-Release number of selected component (if applicable):
> > 
> > 
> > How reproducible:
> > 100%
> > 
> > Steps to Reproduce:
> > 1.Log in in to GA 3.5 engine and add to it latest RHEVH hypervisor.
> > 2.wget rhev-hypervisor6-6.6-20150304.0.el6ev.noarch.rpm to engine.
> > 3.yum localinstall rhev-hypervisor6-6.6-20150304.0.el6ev.noarch.rpm on
> > engine.
> > 4.Set to maintenance RHEVH via WEBUI on engine and upgrade it to
> > rhev-hypervisor6-6.6-20150304.0.el6ev.
> > 5.Add second RHEVH and do the same steps for it from 3 to 4.
> At this point you have 2 rhev-h machines associated with your initial engine.
> This means vdsm is working and uses certificates created by your original
> engine.
> 
> > 6.Set both RHEVHs to maintenance and then remove them from the setup.
> At this point your hosts are still using their original credentials, even if
> removed from the original engine.
> 
> > 7.Log in to first RHEVH and start HE deployment by running | screen
> > hosted-engine --deploy"
> > 8.Follow the steps of installation and finally install the HE on first RHEVH.
> > 9.Add second RHEVH hypervisor to your setup by running deployment on it and
> > getting the answer file from the first hypervisor.
> > 10.Log in to HE and see that status has an error for your first RHEVH.
> > 11.Perform migration of VM from first RHEVH to second RHEVH, by pressing
> > migrate at the VM section.
> > 12.Check that HE VM migrated, but now also second RHEVH's status being
> > reported as ERROR and you'll see both RHEVH hypervisors a appears within the
> > attached picture.
> > 
> 
> See comments inline.
> This is not a valid scenario. You should have a clean setup to deploy
> hosted engine, and your hosts had running vdsm's configured to work with
> a different engine.


You can remove RHEL hosts from one setup and add them to another with no problem, so why should RHEVH be different? The behavior should be the same for both.

Comment 3 Doron Fediuck 2015-03-11 12:13:17 UTC
(In reply to Nikolai Sednev from comment #2)
> (In reply to Doron Fediuck from comment #1)
> > (In reply to Nikolai Sednev from comment #0)
> 
> You can remove RHELs from one setup and add them to another, no problem, why
> should be the difference for RHEVH? Behavior have to be the same for both.

It's not about the host type. It's about the hosted engine installation you are doing. As I explained before:

> > This is not a valid scenario. You should have a clean setup to deploy
> > hosted engine, and your hosts had running vdsm's configured to work with
> > a different engine.

So the approach should be to set up hosted engine on a fresh, clean host, or to migrate an existing engine from a physical host to HE as described here:
http://www.ovirt.org/Migrate_to_Hosted_Engine

Comment 4 Nikolai Sednev 2015-03-12 08:35:13 UTC
I'm reopening the bug as the same system behavior occurred on a clean installation of two hypervisors.

On both hypervisors:
ovirt-hosted-engine-setup-1.2.2-1.el6ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-9.0.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.5-1.el6ev.noarch
ovirt-node-3.2.1-9.el6.noarch
sanlock-2.8-1.el6.x86_64
libvirt-client-0.10.2-46.el6_6.3.x86_64
libvirt-python-0.10.2-46.el6_6.3.x86_64
libvirt-lock-sanlock-0.10.2-46.el6_6.3.x86_64
vdsm-4.16.8.1-7.el6ev.x86_64
libvirt-0.10.2-46.el6_6.3.x86_64
libvirt-cim-0.6.1-12.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
Red Hat Enterprise Virtualization Hypervisor 6.6 (20150304.0.el6ev)
Linux version 2.6.32-504.12.2.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC) ) #1 SMP Sun Feb 1 12:14:02 EST 2015


On engine:
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-3.5.1-0.1.el6ev.noarch
Red Hat Enterprise Linux Server release 6.6 (Santiago)
Linux version 2.6.32-504.8.1.el6.x86_64 (mockbuild.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC) ) #1 SMP Fri Dec 19 12:09:25 EST 2014

Comment 5 Nikolai Sednev 2015-03-12 12:02:18 UTC
The Error status on both hosts makes it impossible to add a storage domain, as there are no available hosts to select; the hosts drop-down menu is simply empty.

Comment 6 Fabian Deutsch 2015-03-12 14:05:36 UTC
Can you please provide the logs from RHEV-M and the RHEV-H hosts?

Comment 7 Doron Fediuck 2015-03-12 14:15:38 UTC
(In reply to Nikolai Sednev from comment #5)
> Error status for both hosts causing for storage domain not being possible to
> be added, as there is no available hosts to be selected, the drop down menu
> of hosts is simply empty.

Please provide the reproduction steps based on a clean, freshly installed RHEV-H.

The hosted engine installation should be:

1. Install a new RHEV-H machine.
2. Use TUI or relevant setup for hosted engine on the new RHEV-H machine.

At which point do you see the error?

Comment 9 Nikolai Sednev 2015-03-12 14:35:30 UTC
(In reply to Doron Fediuck from comment #7)
> (In reply to Nikolai Sednev from comment #5)
> > Error status for both hosts causing for storage domain not being possible to
> > be added, as there is no available hosts to be selected, the drop down menu
> > of hosts is simply empty.
> 
> Please provide the reproduction steps based on clean freshly installed
> RHEV-H. 
> 
> The hosted engine installation should be-
> 
> 1. Install a new RHEV-H machine.
> 2. Use TUI or relevant setup for hosted engine on the new RHEV-H machine.
> 
> At which point to you see the error?


Sure, here goes:
1. Go to Foreman and choose RHEVH_3.5 6.6 as the OS.
2. Power-cycle or reboot your host.
3. After the reprovisioning is complete, run "screen hosted-engine --deploy".
4. Use PXE to install RHEL 6.6 for the engine.
5. Use an NFSv3 share as your storage solution.
6. On completion of the first host's deployment, add the second reprovisioned host and take the deployment parameters from the answer file contained on the first host.
7. Enter the WebUI of the engine and see that your first host is already in "Error" status.
8. Once the second host is added, enter the WebUI of the engine and press the "Migrate" button for the engine's VM; the VM migrates, but both hosts are then marked in "Error" status.

I see the error right from the start, when entering the WebUI of the engine with at least one host there.
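
When a host shows this kind of Error triangle right after a hosted-engine deployment, the HA state on that host can be checked with something like the following (a minimal sketch; the log paths are the usual RHEV-H 6.6 locations and may differ on other builds):

# On the affected RHEV-H host: show the hosted engine HA view of all hosts
hosted-engine --vm-status

# Check the HA agent and broker logs around the time the status changed
less /var/log/ovirt-hosted-engine-ha/agent.log
less /var/log/ovirt-hosted-engine-ha/broker.log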

Comment 10 Nikolai Sednev 2015-03-12 14:41:13 UTC
(In reply to Fabian Deutsch from comment #6)
> Can you please provide the logs from RHEV-M and the RHEV-H hosts

Which exactly logs do you need?

Comment 11 Nikolai Sednev 2015-03-12 14:47:28 UTC
Created attachment 1001007 [details]
engine logs

Comment 12 Nikolai Sednev 2015-03-12 15:00:39 UTC
Created attachment 1001010 [details]
alma03logs

Comment 13 Doron Fediuck 2015-03-12 15:08:29 UTC
(In reply to Nikolai Sednev from comment #9)
> (In reply to Doron Fediuck from comment #7)
> > (In reply to Nikolai Sednev from comment #5)
> > > Error status for both hosts causing for storage domain not being possible to
> > > be added, as there is no available hosts to be selected, the drop down menu
> > > of hosts is simply empty.
> > 
> > Please provide the reproduction steps based on clean freshly installed
> > RHEV-H. 
> > 
> > The hosted engine installation should be-
> > 
> > 1. Install a new RHEV-H machine.
> > 2. Use TUI or relevant setup for hosted engine on the new RHEV-H machine.
> > 
> > At which point to you see the error?
> 
> 
> Sure, here goes:
> 1.Go to Foreman and choose RHEVH_3.5 6.6 for OS.
> 2.Power-cycle or reboot your host.
> 3.After reprovision is complete, run "screen hosted-engine --deploy"
> 4.Use PXE for RHEL6.6 for engine.
> 5.Use NFS3 share as your storage solution.
> 6.On completion of your first host deployment, add second reprovisioned host
> and take parameters for the deployment from answer-file containted at first
> host.
> 7.Enter in to WEBUI of the engine and see that you already have your first
> host in "Error" status.
> 8.Once second host added, enter in to the WEBUI of the engine and press
> "migrate" button for the engine's VM, you'll get VM migrated and both hosts
> marked in "Error" status.
> 
> I see error right from the start, when enter in to WEBUI of the engine with
> at least one host there.

I'm not sure why you are working with 2 hosts when the first one did not install correctly, or is that wrong?
i.e., did the installation end successfully on the first host?
What did you see in the UI after the installation of the first host and before starting the 2nd one?

Note that step 6 is invalid: you should only provide a host ID and the details of the first host. The installer should copy the file from the first host. If this is not the case, it means the installation failed, since the installer does not identify the hosted engine storage domain.

So please keep things clean and organized; make sure you complete the installation before continuing to the next host, otherwise this is useless and confusing.
Can you verify that the installation failed and provide the failure? If not, can you verify the first host was set to 'up'?

Comment 14 Nikolai Sednev 2015-03-12 15:24:03 UTC
Created attachment 1001029 [details]
alma03logs

Comment 15 Nikolai Sednev 2015-03-12 15:25:18 UTC
Created attachment 1001031 [details]
alma04logs

Comment 16 Doron Fediuck 2015-03-12 15:26:17 UTC
It seems that you had several installations of hosted engine, and at least one
failed on bridge creation:

Failed to execute stage 'Misc configuration': Failed to setup networks {'rhevm': {'nic': 'p**FILTERED**p**FILTERED**', 'bootproto': 'dhcp', 'blockingdhcp': True}}. Error code: "**FILTERED**6" message: "Unexpected exception"

As asked for in comment 7, this is not a fresh installation based on your logs, so currently this is not a bug but an environmental issue.

Please run a setup with a single host and verify it succeeds first.
If it fails, please provide the failure details, including logs, based on a single clean host.

If the installation succeeds and the host is in 'up' status, you can proceed to the next host, providing an ID for the 2nd host and the first host credentials so settings will be handled automatically. If there are problems here, please report with the relevant log files.
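
As a rough command-line sketch of that additional-host flow (a sketch only: the answer file location and the --config-append option are assumptions about this setup version, not confirmed here):

# On the second RHEV-H host, once the first host is 'up' in the engine:
screen hosted-engine --deploy
# the setup should detect the existing hosted engine storage domain and ask for a new host ID

# If this version accepts an explicit answer file copied from the first host:
scp root@<first-host>:/etc/ovirt-hosted-engine/answers.conf /root/he-answers.conf
screen hosted-engine --deploy --config-append=/root/he-answers.conf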

Comment 17 Nikolai Sednev 2015-03-12 15:31:01 UTC
(In reply to Doron Fediuck from comment #13)
> (In reply to Nikolai Sednev from comment #9)
> > (In reply to Doron Fediuck from comment #7)
> > > (In reply to Nikolai Sednev from comment #5)
> > > > Error status for both hosts causing for storage domain not being possible to
> > > > be added, as there is no available hosts to be selected, the drop down menu
> > > > of hosts is simply empty.
> > > 
> > > Please provide the reproduction steps based on clean freshly installed
> > > RHEV-H. 
> > > 
> > > The hosted engine installation should be-
> > > 
> > > 1. Install a new RHEV-H machine.
> > > 2. Use TUI or relevant setup for hosted engine on the new RHEV-H machine.
> > > 
> > > At which point to you see the error?
> > 
> > 
> > Sure, here goes:
> > 1.Go to Foreman and choose RHEVH_3.5 6.6 for OS.
> > 2.Power-cycle or reboot your host.
> > 3.After reprovision is complete, run "screen hosted-engine --deploy"
> > 4.Use PXE for RHEL6.6 for engine.
> > 5.Use NFS3 share as your storage solution.
> > 6.On completion of your first host deployment, add second reprovisioned host
> > and take parameters for the deployment from answer-file containted at first
> > host.
> > 7.Enter in to WEBUI of the engine and see that you already have your first
> > host in "Error" status.
> > 8.Once second host added, enter in to the WEBUI of the engine and press
> > "migrate" button for the engine's VM, you'll get VM migrated and both hosts
> > marked in "Error" status.
> > 
> > I see error right from the start, when enter in to WEBUI of the engine with
> > at least one host there.
> 
> I'm not sure why are you working with 2 hosts when the first one did not
> install correctly,
> or is it wrong? 
> ie- Did the installation ended successfully on the first host?
> What did you see in the UI after the installation of the first host and
> before starting the
> 2nd one?
> 
> Note that step 6 is invalid- you should only provide a host ID and the
> details of the first host. The installer should be copying the file from the
> first host. If this is not the case it means the installation failed since
> the installer does not identify the hosted engine storage domain.
> 
> So please keeps things clean an organized; Make sure you complete the
> installation before continuing to the next host otherwise this is useless
> and confusing. 
> Can you verify that the installation failed and provide the failure? If not,
> can you verify the first host was set to 'up'?

Via CLI the installation completed successfully, hence I proceeded to the second host deployment as a regular flow. In the WebUI I saw a strange yellow triangle with "!" and "Error" in the host's status right from the start, while only one host was running.

Regarding step 6 of my installation, you simply rephrased my actions; I used the answer file from the first host and pointed to its FQDN.

Again, the installation did not fail.

Comment 19 Doron Fediuck 2015-03-12 15:45:15 UTC
As I explained, your setup is not clean and you should clarify the network issue first, as this is currently not a bug. We are not going to chase phantoms.

In a clean environment, if the first host did not get to 'up' status, the installation failed, and we need to understand why.

So if you can reproduce the problem on a clean setup, we will need the engine log, the installation log of hosted engine and the hosted engine HA agent log file of the first (and only) host before doing any additional actions.
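
For reference, these logs normally live in the following default locations on RHEV 3.5 / RHEV-H 6.6 (standard paths; adjust if your build places them elsewhere):

# On the engine VM:
/var/log/ovirt-engine/engine.log

# On the (first) RHEV-H host:
/var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-*.log   # hosted engine installation logs
/var/log/ovirt-hosted-engine-ha/agent.log                            # HA agent log
/var/log/vdsm/vdsm.log                                               # vdsm log for the same time window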

Comment 20 Nikolai Sednev 2015-03-12 15:53:32 UTC
(In reply to Doron Fediuck from comment #19)
> As I explained your setup is not clean and you should clarify the network
> issue first, as this is currently not a bug. We are not going to chase
> phantoms.
> 
> In a clean environment if the first host did not get to up status
> installation failed, and we need to understand why. 
> 
> So if you can reproduce the problem on a clean setup, we will need the
> engine log, the installation log of hosted engine and the hosted engine HA
> agent log file of the first (and only) host before doing any additional
> actions.

It's as clean as it can be; what's not clean about it? It was installed from Foreman, what more is needed? Which network issue?

Comment 24 Nikolai Sednev 2015-03-12 16:58:00 UTC
If I re-run the same deployment once more, the deployment succeeds, but I get the host with a yellow triangle with "!" inside and an "Error" status in the UI.

Comment 25 Nikolai Sednev 2015-03-12 17:01:46 UTC
logs attached

Comment 26 Nikolai Sednev 2015-03-12 17:02:21 UTC
Created attachment 1001135 [details]
alma03deploymentlogs latest

Comment 27 Nikolai Sednev 2015-03-15 09:17:31 UTC
AFAIK the yellow triangle with "!" Error status on the host is caused by bug https://bugzilla.redhat.com/show_bug.cgi?id=1198680, which should be fixed in 13.12 and above; it appears the fix has not yet been back-ported into the current vt14.

Comment 28 Nikolai Sednev 2015-03-23 06:34:46 UTC
Now only the configuration of the virtual bridge fails during the first iteration of the HE deployment:
[ INFO  ] Configuring the management bridge
[ ERROR ] Failed to execute stage 'Misc configuration': Failed to setup networks {'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}. Error code: "16" message: "Unexpected exception"
[ INFO  ] Stage: Clean up


and it succeeds on the second deployment iteration:

[ INFO  ] Starting vdsmd
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Configuring the management bridge
[ INFO  ] Creating Storage Domain
[ INFO  ] Creating Storage Pool
[ INFO  ] Connecting Storage Pool
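
When the first iteration dies on "Configuring the management bridge" like this, the vdsm side of the failed setupNetworks call is the usual place to look. A minimal check on the host could be (standard vdsm log paths; the bridge name 'rhevm' is taken from the error above):

# On the RHEV-H host, after the failed iteration:
grep -i setupNetworks /var/log/vdsm/vdsm.log | tail
less /var/log/vdsm/supervdsm.log   # the actual network configuration is done by supervdsmd
brctl show                         # check whether the rhevm bridge was partially created
ip addr show rhevm                 # and whether it got the DHCP lease it was waiting for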


In the end, on the latest vt14.1 (rhevm-3.5.1-0.2.el6ev.noarch), both hosts are seen normally now.

During the addition of the second host to the environment, it also suffered the same issue with the rhevm bridge, which caused a failure the first time I ran hosted-engine --deploy on it, but it succeeded on the second run.
[ INFO  ] Stage: Closing up                                                                                                                                
[ ERROR ] Failed to execute stage 'Closing up': <urlopen error [Errno 111] Connection refused>                                                             
[ INFO  ] Stage: Clean up         

Please see the latest answers-20150323062421.conf attached from the second host.


Components used:

rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-3.5.1-0.2.el6ev.noarch

sanlock-2.8-1.el6.x86_64
mom-0.4.1-4.el6ev.noarch
vdsm-4.16.8.1-7.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.5-1.el6ev.noarch
libvirt-0.10.2-46.el6_6.3.x86_64
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
ovirt-hosted-engine-setup-1.2.2-1.el6ev.noarch

Comment 29 Nikolai Sednev 2015-03-23 06:35:37 UTC
Created attachment 1005202 [details]
answers-20150323062421.conf

Comment 30 Doron Fediuck 2015-03-30 06:50:51 UTC
"Unexpected error received for RHEVH hypervisor status"
This issue describes an invalid flow of installing a 2nd host after the installation failed. For this reason there is a devel NAK on this issue.

If there's an installation failure please open a new BZ on the installer
with a reproducer based on a clean setup using the latest RHEV-H.

Comment 31 Nikolai Sednev 2015-03-30 15:40:05 UTC
(In reply to Doron Fediuck from comment #30)
> "Unexpected error received for RHEVH hypervisor status"
> This issue uses an invalid flow of installing a 2nd host after installation
> failed. For this reason there's a devel NAK on this issue.
> 
> If there's an installation failure please open a new BZ on the installer
> with a reproducer based on a clean setup using the latest RHEV-H.

This bug happened and was also reproduced with a single host on a clean, single installation. The same "yellow triangle with a ! symbol" was shown for the host within the HE.
This weird behavior was fixed in another bug and verified by me and Elad; please change the status of this bug from 'not a bug' to 'duplicate' or at least 'fixed', as it was verified by me and is now working fine.
There is no connection to the second host at all; it's the difference between vt14.0 and vt14.1, and the latter fixes the issue with vdsm on the hosts.

