Bug 1124624 - No error logged when agent restarts (ovirt-hosted-engine-ha-1.2 branch only)
Summary: No error logged when agent restarts (ovirt-hosted-engine-ha-1.2 branch only)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-hosted-engine-ha
Version: 3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.5.0
Assignee: Greg Padgett
QA Contact: Nikolai Sednev
URL:
Whiteboard: sla
Depends On:
Blocks: 1123006
Reported: 2014-07-29 23:37 UTC by Greg Padgett
Modified: 2016-06-12 23:16 UTC (History)

Fixed In Version: ovirt-3.5.0_rc1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-17 12:33:12 UTC
oVirt Team: SLA
Embargoed:

Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 30814 0 ovirt-hosted-engine-ha-1.2 MERGED agent: Log error when restarting Never

Description Greg Padgett 2014-07-29 23:37:58 UTC
When the hosted engine agent fails to start, no error message is logged.

In master, this is the message (in agent.py):
                self._log.error("Error: '{0}' - trying to restart agent"
                                .format(str(e)))

In the 1.2 branch, I see this instead:
                self._log.error("")

Not quite as useful :)  I guess it was an error while backporting the patch... anyway, the message is a good diagnostic to have if we're building from that branch.
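
To illustrate why the message matters, here is a minimal sketch of a restart loop that logs the caught exception using the format quoted from master. This is not the actual agent.py code: run_with_restarts, start_agent, and max_restarts are hypothetical names chosen for the example.

```python
import logging


def run_with_restarts(start_agent, max_restarts=3, log=None):
    """Call start_agent(), logging each failure before retrying.

    Hypothetical helper for illustration only; returns the number of
    failed attempts.
    """
    if log is None:
        log = logging.getLogger("agent")
    failures = 0
    while failures < max_restarts:
        try:
            start_agent()
            return failures
        except Exception as e:
            failures += 1
            # Message format quoted from master. Logging an empty string
            # here (as in the 1.2 branch) hides the cause of the restart.
            log.error("Error: '{0}' - trying to restart agent"
                      .format(str(e)))
    return failures
```

With an empty format string, the same loop would still emit an ERROR record on each failure, but the record would carry no hint about what went wrong.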

Comment 1 Nikolai Sednev 2014-08-12 15:13:32 UTC
Hi Greg,
1. I need the exact steps to reproduce this bug.
2. Please specify which file to check when HE fails to start.

Comment 2 Greg Padgett 2014-08-13 13:54:00 UTC
(In reply to Nikolai Sednev from comment #1)
> Hi Greg,
> 1.I need the exact steps for this bug reproduction.
> 2.Please specify the file to be checked during failure in HE start.

Hi Nikolai,

Here's a way to reproduce that I just confirmed:

1. Stop the ovirt-ha-agent and ovirt-ha-broker services

2. Start the agent manually:
/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
(For this step, avoid using 'service' or 'systemctl' to start the agent because they will also start the broker, which for our test we are trying to avoid.)

3. Wait and check the log:
/var/log/ovirt-hosted-engine-ha/agent.log
For a message like the following:
("Error: '<...>' - trying to restart agent")
If this message appears, the code is good.

In my case, I see this, which took a few minutes to show up in the log:
MainThread::ERROR::2014-08-13 09:38:21,104::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to connect to broker, the number of errors has exceeded the limit (10)' - trying to restart agent

After the test is over, just kill the running agent, or if you wait long enough, it will stop itself.
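
The check in step 3 can be scripted. The following is a rough sketch, not part of ovirt-hosted-engine-ha: find_restart_errors is a hypothetical helper, and the regular expression only assumes the message text quoted above.

```python
import re

# Matches the message text from step 3 and captures the quoted cause.
RESTART_ERROR = re.compile(
    r"Error: '(?P<cause>.*)' - trying to restart agent"
)


def find_restart_errors(lines):
    """Return the cause string of every restart error found in lines.

    Hypothetical helper; lines can be any iterable of strings, such as
    an open log file.
    """
    causes = []
    for line in lines:
        match = RESTART_ERROR.search(line)
        if match:
            causes.append(match.group("cause"))
    return causes
```

Passing it the opened /var/log/ovirt-hosted-engine-ha/agent.log should return a non-empty list on a build with the fix. On an affected 1.2 build the list would stay empty, because the logged message is blank.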

Comment 3 Nikolai Sednev 2014-08-13 16:02:32 UTC
(In reply to Greg Padgett from comment #2)

Seems like we have a fix then. I'm seeing these messages on both of my hosts, and the engine works well:
MainThread::ERROR::2014-08-13 18:55:23,459::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to connect to broker, the number of errors has exceeded the limit (10)' - trying to restart agent
MainThread::WARNING::2014-08-13 18:55:28,465::agent::175::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '3'
MainThread::INFO::2014-08-13 18:55:28,487::hosted_engine::222::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: 10.35.64.85
MainThread::INFO::2014-08-13 18:55:28,803::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2014-08-13 18:55:28,803::brokerlink::67::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
MainThread::INFO::2014-08-13 18:55:28,803::brokerlink::69::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Retrying broker connection in '5' seconds
MainThread::INFO::2014-08-13 18:55:33,808::brokerlink::67::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
MainThread::INFO::2014-08-13 18:55:33,809::brokerlink::69::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Retrying broker connection in '5' seconds
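
For filtering output like this by level or function, a small parser can split each record into fields. This is only a sketch: the field layout (thread, level, timestamp, module, line number, logger, function, message) is inferred from the lines above rather than taken from the agent's logging configuration, it assumes each record sits on a single line, and LogRecord and parse_line are hypothetical names.

```python
import re
from collections import namedtuple

LogRecord = namedtuple(
    "LogRecord",
    "thread level timestamp module lineno logger function message",
)

# Field layout inferred from the sample lines; fields are "::"-separated.
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)"
    r"::(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r"::(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)"
    r"::\((?P<function>[^)]+)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a LogRecord for one log line, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    fields["lineno"] = int(fields["lineno"])
    return LogRecord(**fields)
```

Keeping only records whose level is ERROR and whose function is _run_agent would isolate the restart messages this bug is about.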


Both hosts show HE HA as [N/A] in the web UI, because the broker was stopped manually beforehand.

Components used:
Linux version 2.6.32-431.23.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014

ovirt-engine-setup-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
ovirt-engine-setup-base-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
libvirt-0.10.2-29.el6_5.10.x86_64
sanlock-2.8-1.el6.x86_64
vdsm-4.16.1-6.gita4a4614.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64

Comment 4 Sandro Bonazzola 2014-10-17 12:33:12 UTC
oVirt 3.5 has been released and should include the fix for this issue.
