Bug 1277013

Summary: ovirt-ha-agent gets killed after some time
Product: [oVirt] ovirt-hosted-engine-ha Reporter: Ramesh N <rnachimu>
Component: Agent Assignee: Martin Sivák <msivak>
Status: CLOSED DUPLICATE QA Contact: Ilanit Stein <istein>
Severity: high Docs Contact:
Priority: low    
Version: 1.3.1 CC: bugs, dfediuck, rnachimu, stirabos
Target Milestone: --- Flags: rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: integration
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-02 08:31:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
agent log none

Description Ramesh N 2015-11-02 04:56:37 UTC
Created attachment 1088458
agent log

Description of problem:
 
ovirt-ha-agent gets killed after some time with the error "Too many errors occurred, giving up. Please review the log and consider filing a bug."

Version-Release number of selected component (if applicable):

ovirt-hosted-engine-ha-1.3.1

How reproducible:

Always

Steps to Reproduce:
1. Set up hosted engine with a gluster volume using "hosted-engine --deploy" on the first host
2. Set up hosted engine with a gluster volume using "hosted-engine --deploy" on the second host
3. Check "service ovirt-ha-agent status"

Actual results:

The ovirt-ha-agent service is in a failed state.

Expected results:
 
ovirt-ha-agent service should be up and running. 

Additional info:

The same issue is seen on the third host as well.

Note: "hosted-engine --deploy" failed on the second and third hosts and was fixed with the workaround mentioned in bz#1277010

Comment 1 Doron Fediuck 2015-11-02 07:19:11 UTC
The agent is designed to quit after several retries, as you can see in the
message: "Too many errors occurred, giving up."
Looking at the log file, this seems to be a setup issue, unrelated to
the agent. So you first need to have a working environment, and only then would
this become an issue. Can you reproduce this issue on a working non-gluster setup?
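
To illustrate the "quit after several retries" behaviour described above, here is a minimal sketch in Python. It is not the agent's actual code: the names run_with_retries and TooManyErrors, the max_errors limit of 3, and the delay are all assumptions made for illustration. Only the two message strings are taken from this report.

```python
import time


class TooManyErrors(Exception):
    """Hypothetical exception raised once the error budget is spent."""


def run_with_retries(step, max_errors=3, delay=0.0):
    """Call step() until it succeeds or has failed max_errors times.

    max_errors and delay are illustrative values, not the agent's
    real configuration.
    """
    errors = 0
    while True:
        try:
            return step()
        except Exception as exc:
            errors += 1
            # Mirrors the wording of the log line quoted in this bug.
            print(f"Error: '{exc}' - trying to restart agent")
            if errors >= max_errors:
                raise TooManyErrors(
                    "Too many errors occurred, giving up. "
                    "Please review the log and consider filing a bug."
                ) from exc
            time.sleep(delay)
```

With a step that fails every time, for example because the underlying storage setup is broken, the loop gives up after max_errors attempts instead of restarting forever, which matches the state the reporter sees.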

Comment 2 Simone Tiraboschi 2015-11-02 08:31:35 UTC
The real error is this one:

MainThread::ERROR::2015-10-29 15:20:31,256::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'list index out of range' - trying to restart agent

And it's not an agent error: VDSM raises an exception on getImagesList if it is called on an unattached storage domain. Please see:
https://bugzilla.redhat.com/show_bug.cgi?id=1274622

We also have a workaround for it in case we are not able to fix VDSM in time. Please see:
https://bugzilla.redhat.com/show_bug.cgi?id=1276650
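
For anyone who wants to pull lines like the ERROR entry quoted above out of an agent log, the following is a rough sketch. The field layout (thread, level, timestamp, module, line number, logger, function, message, separated by "::") is inferred from that single log line and is an assumption; the helper names parse_agent_log_line and error_records are hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the one log line quoted in this bug; other
# lines in a real agent.log may not follow it exactly.
LOG_LINE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]+)\) (?P<message>.*)$"
)


def parse_agent_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    record["lineno"] = int(record["lineno"])
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return record


def error_records(lines):
    """Yield parsed records whose level is ERROR, skipping other lines."""
    for line in lines:
        record = parse_agent_log_line(line)
        if record is not None and record["level"] == "ERROR":
            yield record
```

Passing an open log file to error_records() would yield one record per ERROR line, so the underlying message (here 'list index out of range') can be read directly rather than only the final "Too many errors occurred" summary.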

*** This bug has been marked as a duplicate of bug 1276650 ***