Description of problem:
1. After a new hosted-engine deployment, if the hypervisor is rebooted because of a power outage or for some other reason, vm.conf does not get imported into the OVF store.
2. This forces a complete redeployment of the hosted-engine setup, as there is no way to get a vm.conf with which to start the HEVM after the host reboot.
3. In the RHEV 3.6 HE setup, vm.conf is imported automatically only when at least one master storage domain has been added and the datacenter is in "UP" status, which is not possible in the scenario above.

Version-Release number of selected component (if applicable):
RHEV-3.6

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE setup
2. Reboot the host before vm.conf has been imported into the OVF storage.

Actual results:
HEVM failed to start after the host reboot with the following errors:

# hosted-engine --vm-start
Unable to read vm.conf, please check ovirt-ha-agent logs

agent.log:
MainThread::WARNING::2016-08-11 12:13:55,616::ovf_store::104::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Unable to find OVF_STORE
MainThread::ERROR::2016-08-11 12:13:55,617::config::235::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
MainThread::ERROR::2016-08-11 12:13:55,652::heconflib::111::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(validateConfImage) 'version' is not stored in the HE configuration image
MainThread::ERROR::2016-08-11 12:13:55,666::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/run/ovirt-hosted-engine-ha/vm.conf, key=memSize'' - trying to restart agent

# systemctl status ovirt-ha-agent
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: INFO:ovirt_hosted_engine_ha.lib.upgrade.StorageServer:Host configuration is already up-to-date
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Reloading vm.conf from the shared storage domain
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Trying to get a fresher copy of vm configurat... OVF_STORE
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: WARNING:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Unable to find OVF_STORE
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config ERROR Unable to get vm.conf from OV...al vm.conf
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Unable to get vm.conf from OVF_STORE, fallin...al vm.conf
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config ERROR 'version' is not stored in th...tion image
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:'version' is not stored in the HE configuration image
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: ''Configuration value not found: file=/var/r...tart agent
Aug 11 12:37:41 dhcp210-150.gsslab.pnq.redhat.com ovirt-ha-agent[6473]: ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Error: ''Configuration value not found: file=/var/run/ovirt-hosted...tart agent

Expected results:
There should be a way to start / recover a HEVM instead of redeploying the HE setup.

Additional info:
Need to find any possible workaround for this issue. Alternatively, we could keep a copy of vm.conf in its original location after the initial HE setup, i.e. /etc/ovirt-hosted-engine/vm.conf, until it has been imported successfully into the OVF_STORE, and only then unlink or remove vm.conf from the old location. This would allow us to start the HEVM with the available vm.conf and avoid redeploying the HE setup.
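To illustrate the alternative proposed in the description, here is a minimal sketch of how a caller might pick the first usable vm.conf from a list of candidate locations. It is not taken from ovirt-hosted-engine-ha: the function names are hypothetical, the "key=value" file layout is an assumption, and memSize is used as the required key only because it is the one named in the agent error.

```python
import os

# Assumption: vm.conf is a plain "key=value" file, one entry per line.
# memSize is the key the agent reported as missing in the error above.
REQUIRED_KEYS = ("memSize",)


def parse_vm_conf(path):
    """Parse a key=value file into a dict, skipping blanks and comments."""
    conf = {}
    with open(path) as conf_file:
        for line in conf_file:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            conf[key.strip()] = value.strip()
    return conf


def pick_usable_vm_conf(candidates, required=REQUIRED_KEYS):
    """Return the first path that exists and has every required key, else None."""
    for path in candidates:
        if not os.path.isfile(path):
            continue
        conf = parse_vm_conf(path)
        if all(key in conf for key in required):
            return path
    return None
```

With the two paths mentioned in this report, a call could look like pick_usable_vm_conf(["/var/run/ovirt-hosted-engine-ha/vm.conf", "/etc/ovirt-hosted-engine/vm.conf"]); a None result would mean no candidate is usable.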
The engine VM should already restart from the initial vm.conf until we get a valid OVF_STORE; this sequence can be repeated as many times as we want. I suspect that the issue was somewhere else.
OK, in this case the issue was here:

12:13:55,652::heconflib::111::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(validateConfImage) 'version' is not stored in the HE configuration image

We got this because the setup failed before completing. In this case the setup failed because:

2016-09-06 19:57:24 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._closeup:614 Cannot add the host to cluster Default
Traceback (most recent call last):
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/engine/add_host.py", line 604, in _closeup
    otopicons.NetEnv.IPTABLES_ENABLE
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/brokers.py", line 18305, in add
    headers={"Correlation-Id":correlation_id, "Expect":expect}
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 79, in add
    return self.request('POST', url, body, headers, cls=cls)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 122, in request
    persistent_auth=self.__persistent_auth
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 79, in do_request
    persistent_auth)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 156, in __do_request
    raise errors.RequestError(response_code, response_reason, response_body)
RequestError:
status: 400
reason: Bad Request
detail: Host address must be a FQDN or a valid IP address
2016-09-06 19:57:24 ERROR otopi.plugins.gr_he_setup.engine.add_host add_host._closeup:622 Cannot automatically add the host to cluster Default: Host address must be a FQDN or a valid IP address

and at the end:

2016-09-06 19:59:31 DEBUG otopi.context context._executeMethod:128 Stage terminate METHOD otopi.plugins.gr_he_common.core.misc.Plugin._terminate
2016-09-06 19:59:31 ERROR otopi.plugins.gr_he_common.core.misc misc._terminate:180 Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy

The root cause is that:

2016-09-06 19:59:31 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_NETWORK/host_name=str:'rhv_prod_h01.!!!MASKED!!!'

is not a valid FQDN, since it contains underscores, which are not allowed characters in a hostname, and so the agent is correctly refusing to deploy that host.
Please try again with a valid hostname.

On the other hand, cockpit and ovirt-hosted-engine-setup should fail earlier with a clear error.
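For reference, the hostname rule behind this failure can be sketched as follows. This is only an illustration of the RFC 952 / RFC 1123 label rules (letters, digits and hyphens only), not the check that the engine or ovirt-hosted-engine-setup actually performs; is_valid_fqdn is a hypothetical name, and requiring at least two labels is an assumption about what "fully qualified" should mean here.

```python
import re

# One DNS label per RFC 952 / RFC 1123: letters, digits and hyphens only,
# 1 to 63 characters, neither starting nor ending with a hyphen.
_LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_fqdn(name):
    """Return True if name is a syntactically valid fully qualified hostname."""
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing root dot
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    if len(labels) < 2:
        # Assumption: a bare short name is not considered fully qualified.
        return False
    return all(_LABEL_RE.fullmatch(label) for label in labels)
```

Under these rules is_valid_fqdn("rhv_prod_h01.example.com") is rejected because of the underscores, while is_valid_fqdn("rhv-prod-h01.example.com") is accepted; example.com is a placeholder here, since the real domain is masked in the log.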
Works for me on these components on host:
rhvm-appliance-4.1.20170126.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.1-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-node-ng-nodectl-4.1.0-0.20170104.1.el7.noarch
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
vdsm-4.19.4-1.el7ev.x86_64
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.3

If an incorrect FQDN of the form a_b_c.some.domain.com is used, Cockpit asks the customer again for a correct FQDN until one is provided, and the deployment then continues as expected.