Bug 1129381
| Field | Value |
|---|---|
| Summary | Fence flow is stuck in a loop of fence STATUS commands |
| Product | [Retired] oVirt |
| Component | ovirt-engine-core |
| Version | 3.5 |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | unspecified |
| Reporter | sefi litmanovich <slitmano> |
| Assignee | Eli Mesika <emesika> |
| QA Contact | sefi litmanovich <slitmano> |
| Docs Contact | |
| CC | amureini, bugs, ecohen, emesika, gklein, iheim, oourfali, rbalakri, slitmano, yeylon |
| Target Milestone | --- |
| Target Release | 3.5.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | infra |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2014-10-17 12:27:46 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | Infra |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | 1139643 |
| Bug Blocks | 1090511 |
engine.log is corrupted, please attach engine.log again. The attached file is actually a gzip file:

```
[(emesika) [/tmp/log] {}]$ ls
engine_vdsm_log.tar
[(emesika) [/tmp/log] {}]$ file engine_vdsm_log.tar
engine_vdsm_log.tar: gzip compressed data, last modified: Tue Aug 12 16:38:14 2014, from Unix
[(emesika) [/tmp/log] {}]$ mv engine_vdsm_log.tar engine_vdsm_log.tar.gz
[(emesika) [/tmp/log] {}]$ gzip -d engine_vdsm_log.tar.gz
[(emesika) [/tmp/log] {}]$ ls
engine_vdsm_log.tar
[(emesika) [/tmp/log] {}]$ file engine_vdsm_log.tar
engine_vdsm_log.tar: POSIX tar archive (GNU)
[(emesika) [/tmp/log] {}]$ tar xvf engine_vdsm_log.tar
./engine_log_12_08_14.log
./monique01_vdsm_12_08_14.log
./monique05_vdsm_12_08_14.log
[(emesika) [/tmp/log] {}]$ file *
engine_log_12_08_14.log: gzip compressed data, last modified: Tue Aug 12 16:37:43 2014, from Unix
engine_vdsm_log.tar: POSIX tar archive (GNU)
monique01_vdsm_12_08_14.log: ASCII text
monique05_vdsm_12_08_14.log: ASCII text, with very long lines
[(emesika) [/tmp/log] {}]$ mv engine_log_12_08_14.log engine_log_12_08_14.log.gz
[(emesika) [/tmp/log] {}]$ gzip -d engine_log_12_08_14.log.gz
[(emesika) [/tmp/log] {}]$ file *
engine_log_12_08_14.log: POSIX tar archive (GNU)
engine_vdsm_log.tar: POSIX tar archive (GNU)
monique01_vdsm_12_08_14.log: ASCII text
monique05_vdsm_12_08_14.log: ASCII text, with very long lines
[(emesika) [/tmp/log] {}]$ mv engine_log_12_08_14.log engine_log_12_08_14.log.tar
[(emesika) [/tmp/log] {}]$ tar xvf engine_log_12_08_14.log.tar
./monique01_vdsm_12_08_14.log
./monique05_vdsm_12_08_14.log
[(emesika) [/tmp/log] {}]$ file *
engine_log_12_08_14.log.tar: POSIX tar archive (GNU)
engine_vdsm_log.tar: POSIX tar archive (GNU)
monique01_vdsm_12_08_14.log: ASCII text
monique05_vdsm_12_08_14.log: ASCII text, with very long lines
```

So, where is the engine log? I gave up at this point (a gzip that contains a tar that contains a gzip that contains a tar, and so on, and finally the actual file is not there).

Created attachment 931432 [details]
engine and vdsm logs - working

I reproduced the bz. Please find the logs for the engine and the 2 hosts that are used as proxies. I verified this tar was working and extracted the logs from it successfully.
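As an aside, the nested-archive guesswork in the earlier transcript can be done mechanically: keep asking `file` what the blob is and unwrap one layer at a time until a plain file remains. A minimal sketch, assuming GNU `file` wording ("gzip compressed data", "POSIX tar archive") and archive member names without spaces; the `unwrap` helper is hypothetical and not part of the original report:

```shell
# Sketch only: unwrap a blob that may be a gzip inside a tar inside a
# gzip, and so on, by asking `file` what it is at each step.
unwrap() {
    f=$1
    while :; do
        case "$(file -b "$f")" in
            gzip*)
                mv "$f" "$f.gz" && gzip -d "$f.gz" ;;  # gzip -d restores "$f"
            "POSIX tar"*)
                members=$(tar -tf "$f")
                tar -xf "$f" && rm -f "$f"
                for m in $members; do
                    unwrap "$m"                        # recurse into each member
                done
                return ;;
            *)
                return ;;                              # a real file at last
        esac
    done
}
```

Running it on a gzip(tar(gzip(tar(log)))) blob like the one attached would leave the inner log files in the current directory.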
Cannot reproduce.

Please note that in the log we have:

```
2014-08-21 11:37:12,778 ERROR [org.ovirt.engine.core.bll.StopVdsCommand] (ajp--127.0.0.1-8702-7) [4a55168e] Failed to verify host rose07.qa.lab.tlv.redhat.com stop status. Have retried 18 times with delay of 10 seconds between each retry.
```

This means that the host did not turn to 'off' within 3 minutes. Also in the log we have:

```
2014-08-21 11:37:12,714 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp--127.0.0.1-8702-7) [4a55168e] FINISH, FenceVdsVDSCommand, return: Test Failed, [Getting status of IPMI:blabla.blibli.com...Chassis power = Unknown, Failed, ipmilan: Failed to connect after 20 seconds], log id: 28013745
2014-08-21 11:37:12,716 WARN [org.ovirt.engine.core.bll.StopVdsCommand] (ajp--127.0.0.1-8702-7) [4a55168e] Host rose07.qa.lab.tlv.redhat.com Primary PM Agent definitions are corrupted, Waiting for Host to Stop aborted.
```

The 20-second delay is probably a result of using DROP instead of REJECT when blocking the connection in iptables. Please retry the scenario with REJECT, which returns immediately.

Retried as mentioned in comment 3. Reproduced several times with my environment. My host is available for reproduction and debugging in case this doesn't reproduce with different hosts. A reminder from our session: the stop command issued by the blocked proxy host was reported as successful, although we verified that the host's connection to the fence agent was in fact blocked.

Moving to POST. Requires VDSM patch as part of Bug 1139643.

bz https://bugzilla.redhat.com/show_bug.cgi?id=1139643 was verified; it seems to me that it's a clone of this one, maybe we can change this one to on_qa->verified as well?

oVirt 3.5 has been released and should include the fix for this issue.

(In reply to sefi litmanovich from comment #6)
> bz- https://bugzilla.redhat.com/show_bug.cgi?id=1139643 - was verified,
> seems to me that it's a clone of this one, maybe we can change to
> on_qa->verified as well?

Yes
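The DROP-versus-REJECT distinction above matters because DROP silently discards packets, so the fence client waits out its full connect timeout (the 20 seconds in the log), whereas REJECT sends an error back and the attempt fails immediately. A hedged sketch of the two rules; the agent address is a placeholder and 623/udp (standard IPMI RMCP) is an assumption, so adjust both to the actual agent:

```shell
# Simulate the outage with DROP: packets vanish, the client hangs until timeout.
iptables -A OUTPUT -d <agent-ip> -p udp --dport 623 -j DROP

# Simulate it with REJECT instead: the client gets an immediate ICMP error,
# so the fence command fails fast rather than waiting 20 seconds.
iptables -A OUTPUT -d <agent-ip> -p udp --dport 623 -j REJECT --reject-with icmp-port-unreachable
```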
Created attachment 926148 [details]
engine and both proxy hosts vdsm logs

Description of problem:

Steps to Reproduce:
1. Tested fence robustness by configuring 3 hosts in a DC: cluster1 -> host_with_pm, host_without_1; cluster2 -> host_without_2.
2. The fence policy for host_with_pm was the default (cluster-dc).
3. Blocked the connection between host_without_1 and the fence agent using iptables.
4. Issued a fence RESTART action.

Actual results:
The fence flow started as normal: the first STATUS command failed with host_without_1 and was then issued with the second host. Then a STOP command was attempted and failed with host_without_1. From that point there was a 15-minute loop in which STATUS commands were attempted with host_without_1 -> failed -> succeeded with host_without_2 -> and so on. After a while, events showed messages about the failure of the power management attempt and the stop/restart commands, but host_with_pm moved back up.

Expected results:
The restart flow is: status -> stop -> status -> start. I would expect one of two behaviours:
1. At the first STATUS call, if the call fails using host_without_1 as proxy, that proxy should be blacklisted for the rest of the flow, and from that point status-stop-status-start should all be issued by host_without_2 (which in this case would work).
2. Each command is attempted first with host_without_1, fails, and is then attempted with host_without_2, moving to the next stage upon success. This seems like a more time-consuming behaviour, but it would work.

Additional info:
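The first expected behaviour can be sketched as a loop that drops a proxy from the candidate list after its first failure. Illustrative shell only; `try_fence` is a hypothetical stand-in for the engine's per-proxy fence call, hard-coded here so that host_without_1 always fails:

```shell
# Sketch of expected behaviour 1: once a proxy fails a fence command,
# blacklist it for the remainder of the status->stop->status->start flow
# instead of retrying it at every stage.
try_fence() {    # hypothetical stand-in: host_without_1 is "blocked"
    [ "$1" != "host_without_1" ]
}

blacklist=""
for command in status stop status start; do
    stage_ok=no
    for proxy in host_without_1 host_without_2; do
        # Skip any proxy that already failed earlier in this flow.
        case " $blacklist " in *" $proxy "*) continue ;; esac
        if try_fence "$proxy" "$command"; then
            echo "$command succeeded via $proxy"
            stage_ok=yes
            break
        fi
        blacklist="$blacklist $proxy"    # never retried in this flow
    done
    [ "$stage_ok" = yes ] || { echo "fence flow failed at $command"; exit 1; }
done
```

With host_without_1 failing its first STATUS, all four stages print `... succeeded via host_without_2`, and host_without_1 is attempted exactly once rather than being retried in a loop.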