Bug 1178535
| Field | Value |
|---|---|
| Summary | migration to additional host fails before restarting HA agent |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | Artyom <alukiano> |
| Component | ovirt-hosted-engine-setup |
| Assignee | Yedidyah Bar David <didi> |
| Status | CLOSED ERRATA |
| QA Contact | Artyom <alukiano> |
| Severity | urgent |
| Priority | unspecified |
| Version | 3.5.0 |
| CC | alukiano, didi, gklein, juwu, lsurette, mavital, sbonazzo, sherold, ykaul |
| Target Milestone | ovirt-3.6.0-rc |
| Keywords | Triaged, ZStream |
| Target Release | 3.6.0 |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | Bug Fix |
| Story Points | --- |
|  | 1184129 (view as bug list) |
| Last Closed | 2016-03-09 19:07:33 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | Integration |
| Cloudforms Team | --- |
| Bug Depends On | 1208458, 1213307, 1213878, 1215663, 1215967, 1227466, 1271272 |
| Bug Blocks | 1164308, 1164311, 1184129 |
| Attachments | logs (attachment 976002) |
Update: it happened to me also after a clean install of a 3.5 HE environment on clean hosts. I also see something strange in hosted-engine --vm-status:

```
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.109.15
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 56262
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=56262 (Mon Jan 5 09:52:26 2015)
    host-id=1
    score=2400
    maintenance=False
    state=EngineUp

--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : master-vds10.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 56052
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=56052 (Mon Jan 5 09:52:18 2015)
    host-id=2
    score=2400
    maintenance=False
    state=EngineDown
```

One host reports its Hostname as an IP address, the other as a hostname.

Package details:
ovirt-hosted-engine-ha-1.2.4-5.el6ev.noarch
ovirt-hosted-engine-setup-1.2.1-8.el6ev.noarch
vdsm-4.16.8.1-4.el6ev.x86_64

We had a similar bug which ended up being an environmental issue. Since according to comment 1 this is supposed to happen on every install, and nobody else has seen it so far, please try to reproduce on a new environment (fresh OS install) and let us know how we can access that setup.

So far it seems that shortly after setup, migration enforced by setting local maintenance mode on the active host fails, and starts working around an hour later. I had a look at the code and am still not sure why this happens. Verified that 3.4 isn't affected.

Seems related to comment 2: somehow the hostname kept by HA is an FQDN at first, while the certs are for IP addresses. Somehow later HA changes that to IP addresses and it starts working.

Now verified that running 'service ovirt-ha-agent restart' on the additional host is enough as a workaround. It's not about timing.

The bug was that now (current code, before the fix) we wait for the host to be added to the engine only on first-host setup, and not on additional ones. In 3.4 we always waited, thus it was not affected.

We should probably also use the FQDN and not the host IP address - ask the user for the FQDN of the host and use that. This will wait for 3.6, I think.

(In reply to Yedidyah Bar David from comment #7)
> Now verified that running 'service ovirt-ha-agent restart' on the additional
> host is enough as a workaround. It's not about timing.
>
> The bug was that now (current code, before the fix) we wait for the host to
> be added to the engine only on first-host setup, and not on additional ones.
> In 3.4 we always waited, thus it was not affected.

Seems like this change was intentional, see bug 1086032.

> We should probably also use the FQDN and not the host IP address - ask the
> user for the FQDN of the host and use that. This will wait for 3.6, I think.

Seems like this will be the solution for now.
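For reference, a quick way to check which CN a host's certificate currently carries and compare it with what the host calls itself. This is only a sketch: the cert path below is the usual vdsm location and the bridge name is the standard ovirtmgmt one, so verify both on your host.

```
# Show the subject (CN) of the local vdsm certificate.
# /etc/pki/vdsm/certs/vdsmcert.pem is the usual location - verify locally.
openssl x509 -in /etc/pki/vdsm/certs/vdsmcert.pem -noout -subject

# Compare against the host's own name and the management bridge address
# (the bridge address is what gets passed to the engine).
hostname --fqdn
ip addr show ovirtmgmt

# If the CN does not match what the other hosts expect, restarting the HA
# agent (the workaround mentioned above) makes it pick up the current cert:
service ovirt-ha-agent restart
```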
Summary so far:

The fix, already merged to master, makes --deploy create certs with the CN (common name) being the hostname of the host, instead of its IP address as was done until today. These certs are used (also) by the HA daemons and related components, and cause changes visible to the user, including:

1. In 'hosted-engine --vm-status', the field 'Hostname' will show the hostname and not the IP address.
2. In the web admin, the column 'Hostname/IP' will show the hostname and not the IP address.
3. Most important: things (including migration) will not work if the hostname of the host is not resolvable by the other hosts and/or the engine. If the hostname is not resolvable locally, deploy will abort; if it is not resolvable by dig - still checked only locally, against the configured name servers of the host - a warning is emitted, such as (a quick way to check this is sketched further below):

[WARNING] Failed to resolve rhel6-he2.tlv.redhat.com using DNS, it can be resolved only locally

There is no fundamental problem with having some hosts use their IP address and some their hostname. E.g. a host deployed in 3.4 will have its IP address, and if another one is deployed in 3.5, it will have its hostname.

To change a host to use its hostname, you can redeploy it:
1. Move it to maintenance in the web admin.
2. Run 'hosted-engine --vm-status', make sure it is in maintenance, and note its 'Host ID'.
3. Remove it from the engine using the web admin.
4. Clean up. Since we do not have a tool for this, reinstall the OS and hosted-engine. It is probably enough to run 'yum remove vdsm; rm -rf /etc/pki/vdsm'.
5. Deploy again, supplying the same host id you had before (not really mandatory, I think, but nicer).

Having the hostname and not the IP address has all of the normal advantages and disadvantages of using names: you can change the IP address without touching anything else, but you rely on name resolution working, etc.

A note about the root cause. The vdsm cert is created twice during deploy:
1. Quite early, we directly run vdsm's vdsm-gencerts.sh, which creates certs using the machine's hostname as the CN.
2. Much later, almost at the end, we add the host to the engine. While doing that, the engine runs host-deploy on it, which recreates the cert using as CN the IP address of the ovirtmgmt bridge on the machine (because that is the address we pass to the engine).

After (2), we (re)start the HA services. These populate the shared storage, if needed, with whatever CN is found in the cert. In <=3.4, we waited until the engine finished adding the host, and only then continued to start HA. In 3.5 we also do that, but only on the first host and not on additional ones. This was done due to bug 1086032. So without the fix, HA starts the first time with the CN being the hostname, and when we restart it again it has the IP. Until it is restarted, there is a conflict between the CN in the cert and what the other hosts see as the host name, which causes this bug.

We considered changing the scheduling, but this proved to be more complex than expected.

Note - eventually we decided we do need to change the behavior and return to the 3.4 one - [1] makes the script wait until the host is added to the engine, prompting the users as needed per bug 1086032 (required networks missing).

[1] http://gerrit.ovirt.org/36624
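As referenced above, a minimal sketch of the "resolvable locally vs. resolvable via DNS" check that the deploy warning is about. The FQDN is the example one from the warning; getent and dig are standard tools.

```
# Does the name resolve at all on this host (hosts file, NSS, DNS)?
getent hosts rhel6-he2.tlv.redhat.com

# Does it resolve through the configured DNS name servers specifically?
dig +short rhel6-he2.tlv.redhat.com

# If getent succeeds but dig returns nothing, the name resolves only
# locally - exactly the case the deploy warning points out.
```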
Verified on ovirt-hosted-engine-setup-1.3.0-0.4.beta.git42eb801.el7ev.noarch.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0375.html

Created attachment 976002 [details]
logs

Description of problem:
Had a 3.4 HE environment with one host (host_1), upgraded the environment to 3.5 (engine and host) and deployed an additional host (host_2). After I finished the deployment I checked that both hosts show the same 'hosted-engine --vm-status' information. My HE vm was running on host_1; I put host_1 into local maintenance and waited for the vm to migrate to host_2, but the migration failed. vdsm.log:

libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+tls://cyan-vdsf.qa.lab.tlv.redhat.com/system

Version-Release number of selected component (if applicable):
==3.5==
ovirt-hosted-engine-ha-1.2.4-5.el6ev.noarch
vdsm-4.16.8.1-4.el6ev.x86_64
vdsm-xmlrpc-4.16.8.1-4.el6ev.noarch
==3.4==
Don't have the exact version, only build av14

How reproducible:
Always

Steps to Reproduce:
1. Have a 3.4 HE environment with one host
2. Upgrade the HE environment to 3.5 and deploy an additional 3.5 host
3. Try to migrate the vm between the hosts

Actual results:
Migration fails with the above error

Expected results:
Migration succeeds without any errors

Additional info:
Seems like the problem is in the certificates, because:

HOST DETAILS
hostname: cyan-vdsf.qa.lab.tlv.redhat.com
ip: 10.35.109.15

'virsh -c qemu+tls://cyan-vdsf.qa.lab.tlv.redhat.com/system' fails with the same error, while 'virsh -c qemu+tls://10.35.109.15/system' succeeds without any errors.
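A minimal sketch for confirming this kind of CN mismatch from another host, assuming the default libvirtd TLS port (16514) and using the hostname from this report; adjust for your environment.

```
# Print the subject (CN) of the certificate the remote libvirtd presents
# over TLS. If the CN is the host's IP address while clients connect by
# FQDN (or the other way around), hostname verification fails with the
# "Failed to connect to remote libvirt URI qemu+tls://..." error above.
echo | openssl s_client -connect cyan-vdsf.qa.lab.tlv.redhat.com:16514 2>/dev/null \
  | openssl x509 -noout -subject
```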