Bug 1129381

Summary: Fence flow is stuck in a loop of fence STATUS commands
Product: [Retired] oVirt Reporter: sefi litmanovich <slitmano>
Component: ovirt-engine-coreAssignee: Eli Mesika <emesika>
Status: CLOSED CURRENTRELEASE QA Contact: sefi litmanovich <slitmano>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.5CC: amureini, bugs, ecohen, emesika, gklein, iheim, oourfali, rbalakri, slitmano, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-17 12:27:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1139643    
Bug Blocks: 1090511    
Attachments:
Description Flags
engine and both proxy hosts vdsm logs
none
engine and vdsm logs - working none

Description sefi litmanovich 2014-08-12 15:28:08 UTC
Created attachment 926148 [details]
engine and both proxy hosts vdsm logs

Description of problem:

Steps:

1. Tested fence robustness by configuring 3 hosts in a DC.
   cluster1 - > host_with_pm, host_without_1.
   cluster2 - > host_without_2.

2. fence policy for the host_with_pm was default (cluster-dc).

3. blocked connection between the host_without_1 to the fence agent using iptables.

4. issue a fence RESTART action.

Actual results:

fence flow started as normal, first STATUS command failed with host_without_1 then issued with the seconds host.
then STOP command was attempted and failed with host_without_1.
From that point there was a 15 minutes loop where STATUS commands are attempted with host_without_1 -> fails -> succeeds with host_without_2 -> and so on.

after a while events showed messages for failure of power management attempt and stop/restart commands, but host_with_pm moved back up.
 

Expected results:

restart flow is: status->stop->status->start.
I would expect one of two behaviours:

1. at the first status call, if call fails using host without_1 as proxy, then it should be blacklisted for the rest of the flow, and from that point status-stop-status-start should all be issued by host_without_2 (which in this case will work)

2. each command is attempted first with host_without_1, fails and then attempted with host_without_2 and goes on to next stage upon success.
this seems like a more time consuming behaviour but would work.

Additional info:

Comment 1 Eli Mesika 2014-08-24 15:23:52 UTC
engine.log is corrupted please attach engine.log again 

attached file is actually a gzip file :


[(emesika) [/tmp/log] {}]$ ls
engine_vdsm_log.tar
[(emesika) [/tmp/log] {}]$ file engine_vdsm_log.tar
engine_vdsm_log.tar: gzip compressed data, last modified: Tue Aug 12 16:38:14 2014, from Unix
[(emesika) [/tmp/log] {}]$ mv engine_vdsm_log.tar engine_vdsm_log.tar.gz
[(emesika) [/tmp/log] {}]$ gzip -d engine_vdsm_log.tar.gz
[(emesika) [/tmp/log] {}]$ ls
engine_vdsm_log.tar
[(emesika) [/tmp/log] {}]$ file engine_vdsm_log.tar
engine_vdsm_log.tar: POSIX tar archive (GNU)
[(emesika) [/tmp/log] {}]$ tar xvf engine_vdsm_log.tar
./engine_log_12_08_14.log
./monique01_vdsm_12_08_14.log
./monique05_vdsm_12_08_14.log
[(emesika) [/tmp/log] {}]$ file *
engine_log_12_08_14.log:     gzip compressed data, last modified: Tue Aug 12 16:37:43 2014, from Unix
engine_vdsm_log.tar:         POSIX tar archive (GNU)
monique01_vdsm_12_08_14.log: ASCII text
monique05_vdsm_12_08_14.log: ASCII text, with very long lines
[(emesika) [/tmp/log] {}]$ mv engine_log_12_08_14.log engine_log_12_08_14.log.gz
[(emesika) [/tmp/log] {}]$ gzip -d engine_log_12_08_14.log.gz
[(emesika) [/tmp/log] {}]$ file *
engine_log_12_08_14.log:     POSIX tar archive (GNU)
engine_vdsm_log.tar:         POSIX tar archive (GNU)
monique01_vdsm_12_08_14.log: ASCII text
monique05_vdsm_12_08_14.log: ASCII text, with very long lines
[(emesika) [/tmp/log] {}]$ mv engine_log_12_08_14.log engine_log_12_08_14.log.tar
[(emesika) [/tmp/log] {}]$ tar xvf engine_log_12_08_14.log.tar
./monique01_vdsm_12_08_14.log
./monique05_vdsm_12_08_14.log
[(emesika) [/tmp/log] {}]$ file *
engine_log_12_08_14.log.tar: POSIX tar archive (GNU)
engine_vdsm_log.tar:         POSIX tar archive (GNU)
monique01_vdsm_12_08_14.log: ASCII text
monique05_vdsm_12_08_14.log: ASCII text, with very long lines

So, where is the engine_log

I gave up at this point (A gzip that contains a tar that contains a gzip that contains a tar .... and so on... and finally the actual file is not there )

Comment 2 sefi litmanovich 2014-08-27 11:58:15 UTC
Created attachment 931432 [details]
engine and vdsm logs - working

I reproduced the bz.
Please find the logs for engine and 2 hosts that are used as proxies.
I verified this tar was working and extracted the logs from it successfully.

Comment 3 Eli Mesika 2014-08-28 11:07:08 UTC
Can not reproduce 

Please note that in the log we have :

2014-08-21 11:37:12,778 ERROR [org.ovirt.engine.core.bll.StopVdsCommand] (ajp--127.0.0.1-8702-7) [4a55168e] Failed to verify host rose07.qa.lab.tlv.redhat.com stop status. Have retried 18 times with delay of 10 seconds between each retry.

This means that the host did not turn to 'off' within 3 min 

Also in the log we have

2014-08-21 11:37:12,714 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp--127.0.0.1-8702-7) [4a55168e] FINISH, FenceVdsVDSCommand, return: Test Failed, [Getting status of IPMI:blabla.blibli.com...Chassis power = Unknown, Failed, ipmilan: Failed to connect after 20 seconds], log id: 28013745
2014-08-21 11:37:12,716 WARN  [org.ovirt.engine.core.bll.StopVdsCommand] (ajp--127.0.0.1-8702-7) [4a55168e] Host rose07.qa.lab.tlv.redhat.com Primary PM Agent definitions are corrupted, Waiting for Host to Stop aborted.

The 20 seconds delay is probably a result of using DROP instead of REJECT when blocking connection in iptables 

Please retry scenario with REJECT that returns immediately

Comment 4 sefi litmanovich 2014-09-03 09:55:35 UTC
Retried as mentioned in comment 3.

Reproduced several times with my environment.

My host is available for reproduction and debugging in case this doesn't reproduce with different hosts.

A reminder from out session - the stop command issued by the blocked proxy host was reported as successful although we verified host's connection to fence agent was in fact blocked.

Comment 5 Oved Ourfali 2014-09-10 12:50:19 UTC
Moving to POST. Requires VDSM patch as part of Bug 1139643

Comment 6 sefi litmanovich 2014-10-14 13:01:22 UTC
bz- https://bugzilla.redhat.com/show_bug.cgi?id=1139643 - was verified, seems to me that it's a clone of this one, maybe we can change to on_qa->verified as well?

Comment 7 Sandro Bonazzola 2014-10-17 12:27:46 UTC
oVirt 3.5 has been released and should include the fix for this issue.

Comment 8 Eli Mesika 2014-10-18 22:13:19 UTC
(In reply to sefi litmanovich from comment #6)
> bz- https://bugzilla.redhat.com/show_bug.cgi?id=1139643 - was verified,
> seems to me that it's a clone of this one, maybe we can change to
> on_qa->verified as well?

Yes