Bug 1124624

Summary: No error logged when agent restarts (ovirt-hosted-engine-ha-1.2 branch only)
Product: [Retired] oVirt Reporter: Greg Padgett <gpadgett>
Component: ovirt-hosted-engine-ha Assignee: Greg Padgett <gpadgett>
Status: CLOSED CURRENTRELEASE QA Contact: Nikolai Sednev <nsednev>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.5 CC: amureini, ecohen, gklein, gpadgett, iheim, rbalakri, sbonazzo, yeylon
Target Milestone: --- Keywords: Triaged
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: sla
Fixed In Version: ovirt-3.5.0_rc1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-17 12:33:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1123006    

Description Greg Padgett 2014-07-29 23:37:58 UTC
When the hosted engine agent fails to start, no error message is logged.

In master, this is the message (in agent.py):
                self._log.error("Error: '{0}' - trying to restart agent"
                                .format(str(e)))

In the 1.2 branch, I see this instead:
                self._log.error("")

Not quite as useful :)  I guess it was an error while backporting the patch... anyway, the message is a good diagnostic to have if we're building from that branch.
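
For illustration, here is a minimal, self-contained sketch of the kind of restart loop that message belongs to. The error string and the "Restarting agent, attempt" warning are the ones the agent logs; the class layout, the max_attempts and delay parameters, and the action callable are hypothetical and are not taken from agent.py.

```python
import logging
import time


class Agent:
    """Hypothetical stand-in for the real agent class; layout is illustrative."""

    def __init__(self, max_attempts=3, delay=0.0):
        self._log = logging.getLogger("Agent")
        self._max_attempts = max_attempts  # assumed name, not from agent.py
        self._delay = delay                # assumed name, not from agent.py

    def _run_agent(self, action):
        """Run `action`; on failure, log the cause and retry.

        Returns the action's result, or False once all attempts have failed.
        """
        for attempt in range(1, self._max_attempts + 1):
            try:
                return action()
            except Exception as e:
                # The cause is part of the message, unlike self._log.error("").
                self._log.error("Error: '{0}' - trying to restart agent"
                                .format(str(e)))
                self._log.warning("Restarting agent, attempt '%d'", attempt)
                time.sleep(self._delay)
        return False
```

With self._log.error("") in place of the formatted call, the same loop would still retry, but the log would carry no hint of why.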

Comment 1 Nikolai Sednev 2014-08-12 15:13:32 UTC
Hi Greg,
1. I need the exact steps to reproduce this bug.
2. Please specify which file to check when HE fails to start.

Comment 2 Greg Padgett 2014-08-13 13:54:00 UTC
(In reply to Nikolai Sednev from comment #1)
> Hi Greg,
> 1.I need the exact steps for this bug reproduction.
> 2.Please specify the file to be checked during failure in HE start.

Hi Nikolai,

Here's a way to reproduce that I just confirmed:

1. Stop the ovirt-ha-agent and ovirt-ha-broker services

2. Start the agent manually:
/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
(For this step, avoid using 'service' or 'systemctl' to start the agent because they will also start the broker, which for our test we are trying to avoid.)

3. Wait and check the log:
/var/log/ovirt-hosted-engine-ha/agent.log
For a message like the following:
("Error: '<...>' - trying to restart agent")
If this message appears, the code is good.

In my case, I see this, which took a few minutes to show up in the log:
MainThread::ERROR::2014-08-13 09:38:21,104::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to connect to broker, the number of errors has exceeded the limit (10)' - trying to restart agent

After the test is over, just kill the running agent, or if you wait long enough, it will stop itself.
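
The check in step 3 could be scripted roughly as follows. This is only a sketch: the log path and the message text come from the steps above, while the function names and the regular expression are my own and are not part of ovirt-hosted-engine-ha.

```python
import re

# Matches the message from step 3; the text between the quotes is the cause.
RESTART_RE = re.compile(r"Error: '(?P<cause>.*)' - trying to restart agent")


def find_restart_errors(lines):
    """Return the cause string from every restart-error line in `lines`."""
    causes = []
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            causes.append(match.group("cause"))
    return causes


def check_agent_log(path="/var/log/ovirt-hosted-engine-ha/agent.log"):
    """Return True if the log holds a restart error with a non-empty cause."""
    with open(path, errors="replace") as log_file:
        return any(find_restart_errors(log_file))
```

On a build with the empty self._log.error("") call, no line matches the pattern, so check_agent_log() returns False.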

Comment 3 Nikolai Sednev 2014-08-13 16:02:32 UTC
(In reply to Greg Padgett from comment #2)

Seems like we have a fix then. I'm receiving these messages on both of my hosts, and the engine works well:
MainThread::ERROR::2014-08-13 18:55:23,459::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to connect to broker, the number of errors has exceeded the limit (10)' - trying to restart agent
MainThread::WARNING::2014-08-13 18:55:28,465::agent::175::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '3'
MainThread::INFO::2014-08-13 18:55:28,487::hosted_engine::222::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: 10.35.64.85
MainThread::INFO::2014-08-13 18:55:28,803::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2014-08-13 18:55:28,803::brokerlink::67::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
MainThread::INFO::2014-08-13 18:55:28,803::brokerlink::69::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Retrying broker connection in '5' seconds
MainThread::INFO::2014-08-13 18:55:33,808::brokerlink::67::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
MainThread::INFO::2014-08-13 18:55:33,809::brokerlink::69::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Retrying broker connection in '5' seconds
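
For anyone filtering these logs, the lines above appear to follow a fixed "::"-separated layout. The sketch below splits one line into fields. The field names (thread, level, timestamp, module, lineno, logger, func, message) are my reading of the sample lines, not a documented format, and parse_agent_log_line is a hypothetical helper.

```python
import re
from collections import namedtuple

# Field names are inferred from the sample lines; they are assumptions.
LogRecord = namedtuple(
    "LogRecord",
    "thread level timestamp module lineno logger func message")

# The last field looks like "(function_name) message text".
_FUNC_RE = re.compile(r"\((?P<func>[^)]*)\)\s?(?P<message>.*)", re.DOTALL)


def parse_agent_log_line(line):
    """Split one agent.log line into a LogRecord, or return None if it does
    not match the assumed layout (for example, a wrapped continuation line)."""
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) != 7:
        return None
    thread, level, timestamp, module, lineno, logger, rest = parts
    match = _FUNC_RE.match(rest)
    if match is None or not lineno.isdigit():
        return None
    return LogRecord(thread, level, timestamp, module, int(lineno), logger,
                     match.group("func"), match.group("message"))
```

Filtering on level == "ERROR" and func == "_run_agent" would then pick out just the restart errors from the rest of the broker-connection noise.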


Both hosts show HE HA as [N/A] in the web UI, because the broker was stopped manually beforehand.

Components used:
Linux version 2.6.32-431.23.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014

ovirt-engine-setup-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
ovirt-engine-setup-base-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
libvirt-0.10.2-29.el6_5.10.x86_64
sanlock-2.8-1.el6.x86_64
vdsm-4.16.1-6.gita4a4614.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64

Comment 4 Sandro Bonazzola 2014-10-17 12:33:12 UTC
oVirt 3.5 has been released and should include the fix for this issue.