Created attachment 1092641 [details]
sosreport

Description of problem:
After installing and configuring the RHEV-M VM successfully, the RHEV-M VM still did not start up after 20 minutes. We need to manually restart the ovirt-ha-agent and ovirt-ha-broker services to start the RHEV-M VM.

Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.2-20151104.0.iso
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
rhevm-appliance-20151110.0-1.x86_64.rhevm.ova

How reproducible:
50% - we tried 4 times, and the VM failed to start up automatically 2 times.

Steps to Reproduce:
1. PXE install rhev-hypervisor7-7.2-20151104.0.iso
2. Configure HE successfully, power off the RHEV-M VM, and wait for the monitoring service to start the VM.

Actual results:
After step 2, even after waiting more than 20 minutes, the RHEV-M VM still did not start up automatically.

Expected results:
After step 2, the RHEV-M VM should start up within several minutes.

Additional info:
There were no errors during the HE installation and configuration. If you manually restart the ovirt-ha-agent and ovirt-ha-broker services, the RHEV-M VM can start up properly.
Simone, can you tell by the logs what might be happening?
Chaofeng, did you also check the status of those services before restarting?
From the log it seems that ovirt-ha-agent started only once, at:

MainThread::INFO::2015-11-11 07:00:23,528::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.3.2.1 started

No attempt before.
And it was started by ovirt-hosted-engine-setup at that time:

2015-11-11 07:00:19 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:828 execute: ('/bin/systemctl', 'start', 'ovirt-ha-agent.service'), executable='None', cwd='None', env=None
2015-11-11 07:00:23 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:878 execute-result: ('/bin/systemctl', 'start', 'ovirt-ha-agent.service'), rc=0

No attempt before or after in these logs.

The ovirt-ha-broker has been started two times in the past, but the last one was:

MainThread::INFO::2015-11-11 07:00:21,496::broker::57::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 1.3.2.1 started

from ovirt-hosted-engine-setup. So I don't see any manual restart here.

ovirt-ha-agent got stopped at 07:00:26, not sure why:

Nov 11 07:00:26 localhost systemd-ovirt-ha-agent: Stopping ovirt-ha-agent: [ OK ]

And nobody restarted it anymore. ovirt-hosted-engine-setup seems to have stalled or crashed abruptly:

2015-11-11 07:00:23 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.misc misc._persist_files_start:131 Saving persisting file configuration
Please provide the complete /var/log directory.
This might be the same as bug 1280310
Looks like I have reproduced the report, using the below scenario:

# cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151104.0.el7ev)
# rpm -qa | grep -i hosted-engine
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-2.el7ev.noarch

Steps to reproduce:
-------------------------
#1 Configure HE successfully
#2 ssh to the machine and execute the halt command
#3 Wait for the monitoring service to start the VM (no answer)

I have noticed that even with the halt command, I can see the qemu process still up on the node, and probably that's why HE keeps waiting for the VM to go down.

# ps aux | grep -i qemu
qemu 20676 20.4 28.9 4841600 3696356 ? Sl 20:18 10:18 /usr/libexec/qemu-kvm -name HostedEngine -S -machine rhel6.5.0,accel=kvm,usb=off -cpu SandyBridge -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 9eace8ed-4410-4b60-858a-4ab8291776a5 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.2-20151104.0.el7ev,serial=A2FFC0AA-EC96-4A0A-8BCE-4169E6B72637,uuid=9eace8ed-4410-4b60-858a-4ab8291776a5 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-HostedEngine/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2015-11-11T20:19:00,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-reboot -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/run/vdsm/storage/9bdb425a-2260-461f-b9f2-83140de3dddf/fb927e75-a578-4fee-bb4c-c6f1fa0ef20a/75d05ee5-fb0b-442a-a497-be9adc50dbaa,if=none,id=drive-virtio-disk0,format=raw,serial=fb927e75-a578-4fee-bb4c-c6f1fa0ef20a,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:16:3e:4a:25:9a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/9eace8ed-4410-4b60-858a-4ab8291776a5.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/9eace8ed-4410-4b60-858a-4ab8291776a5.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev socket,id=charchannel2,path=/var/lib/libvirt/qemu/channels/9eace8ed-4410-4b60-858a-4ab8291776a5.org.ovirt.hosted-engine-setup.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=org.ovirt.hosted-engine-setup.0 -chardev socket,id=charconsole0,path=/var/run/ovirt-vmconsole-console/9eace8ed-4410-4b60-858a-4ab8291776a5.sock,server,nowait -device virtconsole,chardev=charconsole0,id=console0 -vnc 0:0,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -msg timestamp=on
Created attachment 1093036 [details] sosreport
Created attachment 1093037 [details] /var/log
hosted-engine-setup waits for the VM to get halted, and it reaches that status:

2015-11-12 07:04:33 DEBUG otopi.ovirt_hosted_engine_setup.tasks tasks.wait:54 Waiting for VM down
2015-11-12 07:04:33 DEBUG otopi.ovirt_hosted_engine_setup.tasks tasks.wait:60 Done
2015-11-12 07:04:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.ha.ha_services ha_services._closeup:121 {'status': {'message': 'Machine destroyed', 'code': 0}}

Then it configures and enables the HA services and tries to persist the configuration, but it seems that hosted-engine-setup got stuck persisting its configuration. Indeed, in /var/log/ovirt-node.log (which wasn't included in the sosreport archive, and this is an issue by itself) we have:

2015-11-12 07:04:45,429 ERROR hosted-engine failed to deploy!
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt/node/setup/hostedengine/hosted_engine_page.py", line 301, in open_console
    self.__persist_configs()
  File "/usr/lib/python2.7/site-packages/ovirt/node/setup/hostedengine/hosted_engine_page.py", line 376, in __persist_configs
    [Config(d).persist(d) for d in dirs]
TypeError: __init__() takes exactly 1 argument (2 given)
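For reference, the TypeError above can be reproduced in isolation. The sketch below uses a hypothetical stand-in for the ovirt-node Config class (the real one lives in ovirt.node.config); the directory paths and the shape of the fix are illustrative assumptions, not the actual patch.

```python
# Minimal sketch reproducing the signature mismatch from the traceback above.
# Config here is a hypothetical stand-in, not the real ovirt-node class.

class Config(object):
    # Assumption: the constructor takes no path argument, which is what
    # makes Config(d) raise "TypeError: __init__() takes exactly 1 argument
    # (2 given)" (Python 2 wording).
    def __init__(self):
        self.persisted = []

    def persist(self, path):
        self.persisted.append(path)


dirs = ["/etc/example-dir-a", "/etc/example-dir-b"]  # illustrative paths

# Failing call site from the log: [Config(d).persist(d) for d in dirs]
try:
    [Config(d).persist(d) for d in dirs]
except TypeError as exc:
    print("reproduced:", exc)

# One plausible shape of a fix: construct without the argument, then persist.
cfg = Config()
for d in dirs:
    cfg.persist(d)
print(cfg.persisted)
```

The point is only that the call site passes an argument the constructor does not accept, so the list comprehension raises before any directory is persisted.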
Thanks. It looks like this bug was just fixed yesterday by Douglas.
*** Bug 1278548 has been marked as a duplicate of this bug. ***
It still appears in RHEVH (Beta) release 7.2 (20151113.123.el7ev), while deployed over iSCSI/NFS:

# service ovirt-ha-agent status
Redirecting to /bin/systemctl status ovirt-ha-agent.service
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Sun 2015-11-15 13:13:51 UTC; 47min ago
  Process: 12786 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop (code=exited, status=0/SUCCESS)
 Main PID: 12782 (code=exited, status=0/SUCCESS)

Nov 15 13:13:44 black-vdsb.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Monitoring Agent...
Nov 15 13:13:48 black-vdsb.qa.lab.tlv.redhat.com systemd-ovirt-ha-agent[12756]: Starting ovirt-ha-agent: [ OK ]
Nov 15 13:13:48 black-vdsb.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Nov 15 13:13:51 black-vdsb.qa.lab.tlv.redhat.com systemd-ovirt-ha-agent[12786]: Stopping ovirt-ha-agent: [ OK ]

# service ovirt-ha-broker status
Redirecting to /bin/systemctl status ovirt-ha-broker.service
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2015-11-15 13:13:52 UTC; 47min ago
  Process: 12828 ExecStart=/usr/lib/systemd/systemd-ovirt-ha-broker start (code=exited, status=0/SUCCESS)
 Main PID: 12872 (ovirt-ha-broker)
   CGroup: /system.slice/ovirt-ha-broker.service
           └─12872 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Nov 15 13:13:48 black-vdsb.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
Nov 15 13:13:52 black-vdsb.qa.lab.tlv.redhat.com systemd-ovirt-ha-broker[12828]: Starting ovirt-ha-broker: [ OK ]
Nov 15 13:13:52 black-vdsb.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Created attachment 1094795 [details] sosreport from iSCSI deployment
Created attachment 1094810 [details] sosreport from NFS deployment
Please consider reopening as it appears in RHEVH (Beta) release 7.2 (20151113.123.el7ev).
Hi Fabian, can you please add the exact RHEV-H version in which this bug was fixed?
Nikolai, the build RHEVH (Beta) release 7.2 (20151113.123.el7ev) was never given to Virt QE for testing. The last test build for Virt QE is rhev-hypervisor7-7.2-20151112.1.el7ev; if you still encounter the issue there, please move the bug to ASSIGNED. Virt QE is testing this bug as well.
Please test this bug with RHEV-H 7.2 rhev-hypervisor7-7.2-20151112.1
Nikolai, why is this bug blocking bug 1172511?
Removing the dependency, as the final deployment does not actually matter for bug 1172511, since it is cancelled during deployment.
(In reply to Ying Cui from comment #20)
> Nikolai, the build RHEVH (Beta) release 7.2 (20151113.123.el7ev) was never
> gave Virt QE for testing.
> And the last test build is rhev-hypervisor7-7.2-20151112.1.el7ev for Virt
> QE, if you still encounter the issue, please make the bug to ASSIGNED. And
> Virt QE are testing this bug as well.

Further to my conversation with Anatoly Litovsky: it is most likely (99.9%, according to Anatoly) that the bug still exists, since the version I checked (20151113.123.el7ev) is newer than the one you currently have. For more information please refer to Anatoly Litovsky.
Nikolai, please check this bug with the build given in comment 21.
Dropping the NEEDINFO for now. We see that the bug should be fixed in ovirt-node-plugin-hosted-engine-0.3.0-3. It needs to be clarified whether 20151113.123.el7ev contains this ovirt-node-plugin-hosted-engine build or not. If RHEV-H 20151113.123.el7ev does not contain the ovirt-node-plugin-hosted-engine-0.3.0-3 build, then the bug will not be fixed there. However, we know that the beta 1 build rhev-hypervisor7-7.2-20151112.1.el7ev does contain this new ovirt-node-plugin-hosted-engine build. Tolik or Nikolai, can you please check which ovirt-node-plugin-hosted-engine build is contained in RHEV-H 20151113.123.el7ev?
I just went through the whole process with the latest builds, and both the agent and the engine come up fine. I guess I need to see your setup. Ping me when you are ready.
As a note: there is a chance that the vdsm certs are not persisted correctly (or are empty?), which can cause the "vdsm ProtocolDetector.SSLHandshakeDispatcher ERROR Error during handshake: unexpected eof" error.
Created attachment 1098103 [details] /var/log
It took 18 minutes for engine recovery, and all this on local VMs with no network delay. I am afraid this may take longer than 20 minutes on real deployments. Please set us up with an environment where the bug is happening.
Created attachment 1099274 [details] he setup log
It seems that he-setup fails but reports success:

2015-11-26 10:29:22 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._closeup:653 Adding the host to the cluster
2015-11-26 10:29:23 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._closeup:683 Cannot add the host to cluster Default
Traceback (most recent call last):
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/engine/add_host.py", line 673, in _closeup
    otopicons.NetEnv.IPTABLES_ENABLE
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/brokers.py", line 17454, in add
    headers={"Correlation-Id":correlation_id, "Expect":expect}
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 75, in add
    return self.request('POST', url, body, headers)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 115, in request
    persistent_auth=self.__persistent_auth
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 79, in do_request
    persistent_auth)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 155, in __do_request
    raise errors.RequestError(response_code, response_reason, response_body)
RequestError: status: 409 reason: Conflict detail: Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.
2015-11-26 10:29:23 ERROR otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._closeup:691 Cannot automatically add the host to cluster Default: Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.

The _closeup function:

def _closeup(self):
    # shutdown the vm if this is first host.
    if not self.environment[ohostedcons.CoreEnv.IS_ADDITIONAL_HOST]:
        if self.environment[
            ohostedcons.VMEnv.AUTOMATE_VM_SHUTDOWN
        ]:
            self.logger.info(_('Shutting down the engine VM'))
            cli = self.environment[ohostedcons.VDSMEnv.VDS_CLI]
            res = cli.shutdown(self.environment[ohostedcons.VMEnv.VM_UUID])
            self.logger.debug(res)
        else:
            self.dialog.note(
                _(
                    'Please shutdown the VM allowing the system '
                    'to launch it as a monitored service.\n'
                    'The system will wait until the VM is down.'
                )
            )
        waiter = tasks.VMDownWaiter(self.environment)
        if not waiter.wait():
            # The VM is down but not destroyed
            status = self.environment[
                ohostedcons.VDSMEnv.VDS_CLI
            ].destroy(
                self.environment[ohostedcons.VMEnv.VM_UUID]
            )
            self.logger.debug(status)
            if status['status']['code'] != 0:
                self.logger.error(
                    _(
                        'Cannot destroy the Hosted Engine VM: ' +
                        status['status']['message']
                    )
                )
                raise RuntimeError(status['status']['message'])
    self.logger.info(_('Enabling and starting HA services'))
    for service in (
        ohostedcons.Const.HA_AGENT_SERVICE,
        ohostedcons.Const.HA_BROCKER_SERVICE,
    ):
        self.services.startup(
            name=service,
            state=True,
        )
        self.services.state(
            name=service,
            state=True,
        )
Created attachment 1099299 [details] sosreport with add host failure
Need to check why the exit code is still 0 here.
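One way an exit code can stay 0 despite the failure above is if the error is caught and only logged, never re-raised. The sketch below is a hypothetical illustration of that pattern, not the real ovirt-hosted-engine-setup code; all names are stand-ins.

```python
# Hypothetical sketch: a caught-and-logged failure still yields exit code 0.


class RequestError(Exception):
    """Stand-in for the SDK's RequestError seen in the traceback above."""


def add_host_to_cluster():
    # Simulate the SDK call failing (HTTP 409 in the real log).
    raise RequestError("Cannot add Host. Connecting to host via SSH has failed")


def closeup():
    try:
        add_host_to_cluster()
    except RequestError as exc:
        # The error is logged but not re-raised, so the failure never
        # propagates to the top-level exit status.
        print("ERROR Cannot automatically add the host to cluster Default:", exc)
    return 0  # setup continues and later exits "successfully"


rc = closeup()
print("setup exit code:", rc)  # prints 0 even though adding the host failed
```

If the real code follows this shape, fixing the reporting would mean propagating the failure (or recording it) instead of returning unconditionally after logging.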
Created attachment 1099302 [details] hosted engine VM sosreport
Adding the host failed because the engine wasn't able to connect to the host:

2015-11-26 05:27:37,405 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (ajp-/127.0.0.1:8702-2) [ec6f118] Failed to establish session with host 'hosted_engine_1': null
2015-11-26 05:27:37,405 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (ajp-/127.0.0.1:8702-2) [ec6f118] CanDoAction of action 'AddVds' failed for user admin@internal. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server dhcp-test-11-227.nay.redhat.com,VDS_CANNOT_CONNECT_TO_SERVER
2015-11-26 05:27:37,611 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-2) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.]

Then ovirt-hosted-engine-setup showed an alert about the issue and a menu to let the user retry, with also an option to power off the engine VM, which wasn't working. It has been fixed here https://gerrit.ovirt.org/#/c/47504/ for https://bugzilla.redhat.com/show_bug.cgi?id=1271511 The fix is available since https://brewweb.devel.redhat.com/buildinfo?buildID=466825
I wasn't able to reproduce with:

[root@rhevh20151130 admin]# cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151112.1.el7ev)
[root@rhevh20151130 admin]# rpm -qa | grep -i hosted-engine
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-3.el7ev.noarch
(In reply to Simone Tiraboschi from comment #47)
> I wasn't able to reproduce with:

Do you need QE's reproduction? I don't see a NEEDINFO.
Yes, thanks. Chaofeng, are you able to provide a stable reproducer?
The problem described in comment 12 is fixed. It seems that hosted-engine is getting deployed:

[root@cshaoh1 log]# grep -rni "failed to depl"
[root@cshaoh1 log]#

The remaining problem seems to be that ovirt-ha-agent is brought down by some entity. It is started by he-setup, but then goes down a few seconds later without any obvious reason:

Dec 08 03:30:15 cshaoh1.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
-- Subject: Unit ovirt-ha-agent.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ovirt-ha-agent.service has finished starting up.
--
-- The start-up result is done.
Dec 08 03:30:19 cshaoh1.redhat.com systemd-ovirt-ha-agent[28511]: Stopping ovirt-ha-agent: [ OK ]

I don't see anything particular in /var/log. Leaving for Simone for further investigation.
For reference: in the end I manually started ovirt-ha-agent to see if it would come up, and if it would go down again. It came up and did not go down within the next few seconds.
I also see that

2015-12-08 03:30:16 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.misc misc._persist_files_start:131 Saving persisting file configuration

is the last line in the hosted-engine-setup logfile. Is this the normal last line, or should there be more logs after this?
For reference, the builds for the comments since comment 55:

# rpm -qa | grep ovirt
libgovirt-0.3.3-1.el7.x86_64
ovirt-node-plugin-hosted-engine-0.3.0-4.el7ev.noarch
ovirt-node-plugin-snmp-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-node-lib-legacy-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-lib-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-hosted-engine-setup-1.3.1.1-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-4.el7ev.noarch
ovirt-node-plugin-cim-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-node-lib-config-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-selinux-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.1-1.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-rhn-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-branding-rhev-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
(In reply to Fabian Deutsch from comment #57)
> I also see that
>
> 2015-12-08 03:30:16 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.misc
> misc._persist_files_start:131 Saving persisting file configuration
>
> is the last line in the hosted-engine-setup logfile.
>
> Is this the normal last line? or should there be more logs after this?

No, it's not.
Something weird happened here.
(In reply to Simone Tiraboschi from comment #61)
> (In reply to Fabian Deutsch from comment #57)
> > I also see that
> >
> > 2015-12-08 03:30:16 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.misc
> > misc._persist_files_start:131 Saving persisting file configuration
> >
> > is the last line in the hosted-engine-setup logfile.
> >
> > Is this the normal last line? or should there be more logs after this?
>
> No, it's not.
> Something weird happened here.

I tried to reproduce on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151201.2.el7ev), deploying on NFS, and it successfully worked: the agent stayed up and it successfully restarted the engine VM. The hosted-engine-setup log effectively terminated just with 'Saving persisting file configuration'. After Fabian restarted the HA agent on the reported host, everything was fine there as well. Huijuan, are you able to provide an additional reproducer?
Simone, see comment 42 and comment 52; the env at 10.66.11.227 is still available. Is this env OK for you to investigate? I will also reproduce it on an additional machine later today.
(In reply to Huijuan Zhao from comment #64)
> Simone, see comment 42 and comment 52, the 10.66.11.227 is still available,
> is the env. Is this env ok for you to investigate?
> I will also reproduce it in an additional machine later today.

On 10.66.11.227 the HA agent simply gave up because it was not able to connect to the storage server, and so it entered the failed state:

Nov 27 14:57:43 dhcp-test-11-227.nay.redhat.com ovirt-ha-agent[15820]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Nov 27 15:06:17 dhcp-test-11-227.nay.redhat.com ovirt-ha-agent[15820]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Nov 27 15:06:22 dhcp-test-11-227.nay.redhat.com ovirt-ha-agent[15820]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Too many errors occurred, giving up. Please review the log and consider filing a bug.
Nov 27 15:06:22 dhcp-test-11-227.nay.redhat.com systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=157/n/a

And from the vdsm logs:

Thread-345::ERROR::2015-11-27 15:06:17,759::hsm::2465::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2462, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 419, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 232, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 224, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/share/vdsm/storage/mount.py", line 225, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 241, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Connection timed out\n')
Thread-345::DEBUG::2015-11-27 15:06:17,759::hsm::2489::Storage.HSM::(connectStorageServer) knownSDs: {}
Thread-345::INFO::2015-11-27 15:06:17,760::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 477, 'id': '69bcc4d0-81e3-4a88-9e19-5c2d7523a7ac'}]}
Thread-345::DEBUG::2015-11-27 15:06:17,760::task::1191::Storage.TaskManager.Task::(prepare) Task=`751ab5b9-a410-4da6-8fb5-db2e01f8540d`::finished: {'statuslist': [{'status': 477, 'id': '69bcc4d0-81e3-4a88-9e19-5c2d7523a7ac'}]}

But this is correct and expected if the storage server is not available. We need to focus just on this issue: the HA agent not starting or failing after a successful run of hosted-engine-setup.
Huijuan, how many times did you reproduce this issue out of how many trials? This is to get an idea of the likelihood of this problem.
Simone, I did not complete the HE setup today; I will update after I reproduce this issue. Fabian, the QE team tested 5 times and reproduced it 2 times.
The ovirt-ha-agent logs in standard mode seem clean. We need to reproduce with the agent in debug mode, since the standard log here doesn't contain enough information. To do that, it is enough to modify /etc/ovirt-hosted-engine-ha/agent-log.conf, changing

[logger_root]
level=INFO

to

[logger_root]
level=DEBUG

and restart the agent if it was running. The issue is that currently restarting the agent solves it. Huijuan, can you please try to reproduce from scratch with the agent in debug mode?
Simone, I reproduced this issue with the agent in debug mode (according to comment 70). The env is 10.66.73.132, admin/redhat. You can access the system to debug now. This env will be kept for 2 days for you; is that OK?
ha-agent was still using a Sys V init script, and the ha-agent systemd unit was simply calling that Sys V init script. The Sys V init script was forking the daemon into the background, and since systemd saw the script return quite fast, it could kill the service, thinking it had finished before the daemon was really detached. Being a race condition, it was not systematic. Properly moving to systemd should fix it.
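The shape of that race, and of the fix, can be sketched as a unit-file change. This is an illustrative fragment, not the unit that actually shipped: the wrapper path matches the `systemd-ovirt-ha-agent` lines in the status output earlier in this bug, while the agent executable path and the `Type=simple` choice are assumptions (the path mirrors the broker's `/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker` seen above).

```ini
# Racy setup (roughly what the old unit did): systemd only tracks the
# wrapper, which forks the daemon and exits quickly, so systemd may treat
# the service as stopped and clean it up before the daemon is detached.
#   ExecStart=/usr/lib/systemd/systemd-ovirt-ha-agent start
#   ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop

# Native sketch: let systemd supervise the daemon process directly, so
# there is no window between the wrapper exiting and the daemon detaching.
[Unit]
Description=oVirt Hosted Engine High Availability Monitoring Agent

[Service]
Type=simple
ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent

[Install]
WantedBy=multi-user.target
```

With `Type=simple`, the `ExecStart` process itself is the main process, so the "did the wrapper finish before the daemon detached?" question never arises.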
Verified on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151221.1.el7ev):

# rpm -qa | grep ovirt-hosted
ovirt-hosted-engine-setup-1.3.1.3-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch
rhevm-appliance-20151216.0-1

I checked deployment 3 times, and all 3 times deployment succeeded, all services came up, and the engine VM ran successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0422.html