Bug 1178535
| Field | Value |
|---|---|
| Summary | migration to additional host fails before restarting HA agent |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | Artyom <alukiano> |
| Component | ovirt-hosted-engine-setup |
| Assignee | Yedidyah Bar David <didi> |
| Status | CLOSED ERRATA |
| QA Contact | Artyom <alukiano> |
| Severity | urgent |
| Priority | unspecified |
| Version | 3.5.0 |
| CC | alukiano, didi, gklein, juwu, lsurette, mavital, sbonazzo, sherold, ykaul |
| Target Milestone | ovirt-3.6.0-rc |
| Keywords | Triaged, ZStream |
| Target Release | 3.6.0 |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | Bug Fix |
| Story Points | --- |
|  | 1184129 (view as bug list) |
| Last Closed | 2016-03-09 19:07:33 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | Integration |
| Cloudforms Team | --- |
| Bug Depends On | 1208458, 1213307, 1213878, 1215663, 1215967, 1227466, 1271272 |
| Bug Blocks | 1164308, 1164311, 1184129 |
| Attachments | logs (attachment 976002) |
Update: it happened to me also after a clean install of a 3.5 HE environment on clean hosts. I also see something strange in hosted-engine --vm-status:

```
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.109.15
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 56262
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=56262 (Mon Jan 5 09:52:26 2015)
    host-id=1
    score=2400
    maintenance=False
    state=EngineUp

--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : master-vds10.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 56052
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=56052 (Mon Jan 5 09:52:18 2015)
    host-id=2
    score=2400
    maintenance=False
    state=EngineDown
```

One host reports its Hostname as an IP address, the other as a hostname.

Package details:
ovirt-hosted-engine-ha-1.2.4-5.el6ev.noarch
ovirt-hosted-engine-setup-1.2.1-8.el6ev.noarch
vdsm-4.16.8.1-4.el6ev.x86_64

We had a similar bug which ended up being an environmental issue. Since according to comment 1 this is supposed to happen on every install, and nobody else has seen it so far, please try to reproduce on a new environment (fresh OS install) and let us know how we can access that setup.

So far it seems that shortly after setup, migration enforced by setting local maintenance mode on the active host fails, and starts working around an hour later. I had a look at the code and am still not sure why this happens. Verified that 3.4 isn't affected.

Seems related to comment 2: somehow the hostname kept by HA is an FQDN at first, while the certs are for IP addresses. Somehow later HA changes that to IP addresses and it starts working.

Now verified that running 'service ovirt-ha-agent restart' on the additional host is enough as a workaround. It's not about timing.

The bug was that now (current code, before the fix) we wait for the host to be added to the engine only on first-host setup, and not on additional ones. In 3.4 we always waited, thus it was not affected.

We should probably also use the FQDN and not the host IP address - ask the user for the FQDN of the host and use that. This will wait for 3.6, I think.

(In reply to Yedidyah Bar David from comment #7)
> Now verified that running 'service ovirt-ha-agent restart' on the additional
> host is enough as a workaround. It's not about timing.
>
> The bug was that now (current code, before the fix) we wait for the host to
> be added to the engine only on first-host setup, and not on additional ones.
> In 3.4 we always waited, thus it was not affected.

Seems like this change was intentional, see bug 1086032.

> We should probably also use the FQDN and not the host IP address - ask the
> user for the FQDN of the host and use that. This will wait for 3.6, I think.

Seems like this will be the solution for now.
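For reference, a quick way to check which CN a host's certificate currently carries and compare it with what the host calls itself. This is only a sketch: the cert path below is the usual vdsm location and the bridge name is the standard ovirtmgmt one, so verify both on your host.

```
# Show the subject (CN) of the local vdsm certificate.
# /etc/pki/vdsm/certs/vdsmcert.pem is the usual location - verify locally.
openssl x509 -in /etc/pki/vdsm/certs/vdsmcert.pem -noout -subject

# Compare against the host's own name and the management bridge address
# (the bridge address is what gets passed to the engine).
hostname --fqdn
ip addr show ovirtmgmt

# If the CN does not match what the other hosts expect, restarting the HA
# agent (the workaround mentioned above) makes it pick up the current cert:
service ovirt-ha-agent restart
```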
Summary so far:

The fix, already merged to master, makes --deploy create certs with the CN (common name) being the hostname of the host, instead of its IP address as was done until today. These certs are used (also) by the HA daemons and related components, and cause changes visible to the user, including:

1. In 'hosted-engine --vm-status', the field 'Hostname' will show the hostname and not the IP address.
2. In the web admin, the column 'Hostname/IP' will show the hostname and not the IP address.
3. Most important: things (including migration) will not work if the hostname of the host is not resolvable by the other hosts and/or the engine. If the hostname is not resolvable locally, deploy will abort; if it is not resolvable by dig - still checked only locally, against the configured name servers of the host - a warning is emitted, such as (a quick way to check this is sketched further below):

[WARNING] Failed to resolve rhel6-he2.tlv.redhat.com using DNS, it can be resolved only locally

There is no fundamental problem with having some hosts use their IP address and some their hostname. E.g. a host deployed in 3.4 will have its IP address, and if another one is deployed in 3.5, it will have its hostname.

To change a host to use its hostname, you can redeploy it:
1. Move it to maintenance in the web admin.
2. Run 'hosted-engine --vm-status', make sure it is in maintenance, and note its 'Host ID'.
3. Remove it from the engine using the web admin.
4. Clean up. Since we do not have a tool for this, reinstall the OS and hosted-engine. It is probably enough to run 'yum remove vdsm; rm -rf /etc/pki/vdsm'.
5. Deploy again, supplying the same host id you had before (not really mandatory, I think, but nicer).

Having the hostname and not the IP address has all of the normal advantages and disadvantages of using names: you can change the IP address without touching anything else, but you rely on name resolution working, etc.

A note about the root cause. The vdsm cert is created twice during deploy:
1. Quite early, we directly run vdsm's vdsm-gencerts.sh, which creates certs using the machine's hostname as the CN.
2. Much later, almost at the end, we add the host to the engine. While doing that, the engine runs host-deploy on it, which recreates the cert using as CN the IP address of the ovirtmgmt bridge on the machine (because that is the address we pass to the engine).

After (2), we (re)start the HA services. These populate the shared storage, if needed, with whatever CN is found in the cert. In <=3.4, we waited until the engine finished adding the host, and only then continued to start HA. In 3.5 we also do that, but only on the first host and not on additional ones. This was done due to bug 1086032. So without the fix, HA starts the first time with the CN being the hostname, and when we restart it again it has the IP. Until it is restarted, there is a conflict between the CN in the cert and what the other hosts see as the host name, which causes this bug.

We considered changing the scheduling, but this proved to be more complex than expected.

Note - eventually we decided we do need to change the behavior and return to the 3.4 one - [1] makes the script wait until the host is added to the engine, prompting the users as needed per bug 1086032 (required networks missing).

[1] http://gerrit.ovirt.org/36624
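As referenced above, a minimal sketch of the "resolvable locally vs. resolvable via DNS" check that the deploy warning is about. The FQDN is the example one from the warning; getent and dig are standard tools.

```
# Does the name resolve at all on this host (hosts file, NSS, DNS)?
getent hosts rhel6-he2.tlv.redhat.com

# Does it resolve through the configured DNS name servers specifically?
dig +short rhel6-he2.tlv.redhat.com

# If getent succeeds but dig returns nothing, the name resolves only
# locally - exactly the case the deploy warning points out.
```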
Verified on ovirt-hosted-engine-setup-1.3.0-0.4.beta.git42eb801.el7ev.noarch.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0375.html

Created attachment 976002 [details]
logs

Description of problem:
Had a 3.4 HE environment with one host (host_1), upgraded the environment to 3.5 (engine and host) and deployed an additional host (host_2). After I finished the deployment I checked that both hosts show the same 'hosted-engine --vm-status' information. My HE vm was running on host_1; I put host_1 into local maintenance and waited for the vm to migrate to host_2, but the migration failed. vdsm.log:

libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+tls://cyan-vdsf.qa.lab.tlv.redhat.com/system

Version-Release number of selected component (if applicable):
==3.5==
ovirt-hosted-engine-ha-1.2.4-5.el6ev.noarch
vdsm-4.16.8.1-4.el6ev.x86_64
vdsm-xmlrpc-4.16.8.1-4.el6ev.noarch
==3.4==
Don't have the exact version, only build av14

How reproducible:
Always

Steps to Reproduce:
1. Have a 3.4 HE environment with one host
2. Upgrade the HE environment to 3.5 and deploy an additional 3.5 host
3. Try to migrate the vm between the hosts

Actual results:
Migration fails with the above error

Expected results:
Migration succeeds without any errors

Additional info:
Seems like the problem is in the certificates, because:

HOST DETAILS
hostname: cyan-vdsf.qa.lab.tlv.redhat.com
ip: 10.35.109.15

'virsh -c qemu+tls://cyan-vdsf.qa.lab.tlv.redhat.com/system' fails with the same error, while 'virsh -c qemu+tls://10.35.109.15/system' succeeds without any errors.
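A minimal sketch for confirming this kind of CN mismatch from another host, assuming the default libvirtd TLS port (16514) and using the hostname from this report; adjust for your environment.

```
# Print the subject (CN) of the certificate the remote libvirtd presents
# over TLS. If the CN is the host's IP address while clients connect by
# FQDN (or the other way around), hostname verification fails with the
# "Failed to connect to remote libvirt URI qemu+tls://..." error above.
echo | openssl s_client -connect cyan-vdsf.qa.lab.tlv.redhat.com:16514 2>/dev/null \
  | openssl x509 -noout -subject
```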