When the hosted engine agent fails to start, no error message is logged.

In master, this is the message (in agent.py):

    self._log.error("Error: '{0}' - trying to restart agent"
                    .format(str(e)))

In the 1.2 branch, I see this instead:

    self._log.error("")

Not quite as useful :) I guess it was an error while backporting the patch... anyway, the message is a good diagnostic to have if we're building from that branch.
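For anyone comparing the two branches, here is a minimal sketch of the logging pattern the master message implies. Only the message string is taken from the report; `run_with_restart_logging`, `failing_start`, and the `demo.agent` logger name are hypothetical stand-ins and not the actual agent.py code.

```python
import io
import logging


def run_with_restart_logging(start_agent, log):
    """Call start_agent(); on failure, log the exception text and report it.

    Hypothetical helper for illustration; not the real agent.py logic.
    """
    try:
        start_agent()
    except Exception as e:
        # Message format quoted from master in the report above.
        log.error("Error: '{0}' - trying to restart agent".format(str(e)))
        return False
    return True


def failing_start():
    # Stand-in for an agent start that cannot reach the broker.
    raise RuntimeError("Failed to connect to broker")


# Capture log output in memory so the logged text can be inspected.
stream = io.StringIO()
demo_log = logging.getLogger("demo.agent")
demo_log.addHandler(logging.StreamHandler(stream))
demo_log.setLevel(logging.ERROR)
demo_log.propagate = False

result = run_with_restart_logging(failing_start, demo_log)
logged = stream.getvalue()
```

With the empty string from the 1.2 branch in place of the formatted message, `logged` would carry no reason for the failure at all, which is the problem described here.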
Hi Greg,
1. I need the exact steps to reproduce this bug.
2. Please specify which file to check when HE fails to start.
(In reply to Nikolai Sednev from comment #1)
> Hi Greg,
> 1.I need the exact steps for this bug reproduction.
> 2.Please specify the file to be checked during failure in HE start.

Hi Nikolai,

Here's a way to reproduce that I just confirmed:

1. Stop the ovirt-ha-agent and ovirt-ha-broker services

2. Start the agent manually:
   /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
   (For this step, avoid using 'service' or 'systemctl' to start the agent because they will also start the broker, which for our test we are trying to avoid.)

3. Wait and check the log:
   /var/log/ovirt-hosted-engine-ha/agent.log
   for a message like the following:
   ("Error: '<...>' - trying to restart agent")
   If this message appears, the code is good.

In my case, I see this, which took a few minutes to show up in the log:

MainThread::ERROR::2014-08-13 09:38:21,104::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to connect to broker, the number of errors has exceeded the limit (10)' - trying to restart agent

After the test is over, just kill the running agent, or if you wait long enough, it will stop itself.
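The check in step 3 can also be scripted. The following is a rough sketch, assuming the message format quoted above; `find_restart_errors` and `RESTART_RE` are hypothetical names, and the sample line is the one from this comment.

```python
import re

# Matches the diagnostic expected in agent.log when the fix is present.
RESTART_RE = re.compile(r"Error: '(?P<reason>.*)' - trying to restart agent")


def find_restart_errors(lines):
    """Return the failure reason from every matching log line."""
    reasons = []
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            reasons.append(match.group("reason"))
    return reasons


sample = [
    "MainThread::ERROR::2014-08-13 09:38:21,104::agent::172::"
    "ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) "
    "Error: 'Failed to connect to broker, the number of errors has "
    "exceeded the limit (10)' - trying to restart agent",
    "MainThread::INFO::some unrelated line",
]
reasons = find_restart_errors(sample)

# To scan the real log instead of the sample:
#   with open("/var/log/ovirt-hosted-engine-ha/agent.log") as f:
#       reasons = find_restart_errors(f)
```

An empty result after the agent has given up on the broker would suggest the build still carries the blank message.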
(In reply to Greg Padgett from comment #2)
Seems like we have a fix then. I'm receiving these on both of my hosts, and the engine works well:

MainThread::ERROR::2014-08-13 18:55:23,459::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to connect to broker, the number of errors has exceeded the limit (10)' - trying to restart agent
MainThread::WARNING::2014-08-13 18:55:28,465::agent::175::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '3'
MainThread::INFO::2014-08-13 18:55:28,487::hosted_engine::222::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: 10.35.64.85
MainThread::INFO::2014-08-13 18:55:28,803::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2014-08-13 18:55:28,803::brokerlink::67::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
MainThread::INFO::2014-08-13 18:55:28,803::brokerlink::69::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Retrying broker connection in '5' seconds
MainThread::INFO::2014-08-13 18:55:33,808::brokerlink::67::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
MainThread::INFO::2014-08-13 18:55:33,809::brokerlink::69::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Retrying broker connection in '5' seconds

Both hosts show in the web UI that their HE HA is [N/A], as the broker was stopped manually beforehand.
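If it helps when sifting through longer logs, the lines pasted above appear to follow a fixed `thread::LEVEL::timestamp::module::line::logger::(function) message` layout. The sketch below splits a line on that assumption; the layout is inferred only from the lines in this thread, and `LogRecord`, `LINE_RE`, and `parse_line` are hypothetical names.

```python
import re
from collections import namedtuple

LogRecord = namedtuple(
    "LogRecord",
    "thread level timestamp module lineno logger func message",
)

# Layout inferred from the pasted agent.log lines; an assumption, not a spec.
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>\w+)::(?P<lineno>\d+)::(?P<logger>[\w.]+)::"
    r"\((?P<func>\w+)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a LogRecord for a matching line, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    fields["lineno"] = int(fields["lineno"])
    return LogRecord(**fields)


record = parse_line(
    "MainThread::WARNING::2014-08-13 18:55:28,465::agent::175::"
    "ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) "
    "Restarting agent, attempt '3'"
)
```

Filtering the parsed records on `level == "ERROR"` and `func == "_run_agent"` is one way to pick out the restart diagnostics from the broker retry noise.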
Components used:
Linux version 2.6.32-431.23.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
ovirt-engine-setup-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
ovirt-engine-setup-base-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
libvirt-0.10.2-29.el6_5.10.x86_64
sanlock-2.8-1.el6.x86_64
vdsm-4.16.1-6.gita4a4614.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64
oVirt 3.5 has been released and should include the fix for this issue.