Created attachment 942291 [details]
engine log

Description of problem:
Tried to test soft fencing and experienced strange behaviour. Running engine with 3 hosts (2 RHEL 6.5, 1 RHEL 7).

1. ssh to one of the RHEL 6.5 hosts and stop the vdsmd service.
2. After the wait time, both that host and the second RHEL 6.5 host move to Connecting state with the message:
   Host {host_name} is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
3. After 3 minutes the hosts are still in Connecting state and the same message is issued once more.
4. Only after another 3 minutes does the engine try to soft fence BOTH hosts, and it fails. The hosts become Non Responsive.
5. One of the hosts has power management (pm) configured, but the fence flow isn't invoked; only STATUS fence commands are being issued.
6. The message regarding host status and the grace period for fencing reappears, and both hosts' statuses are Connecting again.
7. The third host, with RHEL 7, is SPM and is up the whole time.

Version-Release number of selected component (if applicable):
rhevm-3.5.0-0.13.beta.el6ev.noarch

How reproducible:
Always (tried it out 3 times)

Actual results:
This goes on and on. The only way I get the hosts back up is restarting vdsmd on the host on which it was down, and restarting the engine. After the engine restart, both hosts are up again.

Expected results:
Only the host with vdsmd down becomes Non Responsive. After the grace period, host soft fencing should restart vdsmd and the host should go back up.

Additional info:
Please see engine.log from:

2014-09-29 14:01:29,352 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-53) Command GetStatsVDSCommand(HostName = {host_name}, HostId = f10f0801-7d59-421d-9d4a-a5fa3b874b13, vds=Host[{host_name},f10f0801-7d59-421d-9d4a-a5fa3b874b13]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'

Please notice that, according to the log, it is the host which was untouched that lost connection first, and only then the host where vdsmd is down.
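
To check the ordering noted above against the attached engine.log, a small script along these lines could list which host's GetStatsVDSCommand failed first. This is only a sketch: the regular expression is derived from the single log line quoted above and assumes each log entry sits on one line, and the function names first_failures and loss_order are made up for illustration, not part of the engine or vdsm.

```python
import re
from datetime import datetime

# Pattern modelled on the one engine.log line quoted in this report (assumption:
# other failures of the same command share this shape and sit on a single line).
FAILED_STATS = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR .*"
    r"Command GetStatsVDSCommand\(HostName = (?P<host>[^,]+), "
    r"HostId = (?P<host_id>[0-9a-f-]+),.*execution failed"
)


def first_failures(lines):
    """Map each HostId to (timestamp, host name) of its earliest failed GetStats."""
    first = {}
    for line in lines:
        match = FAILED_STATS.search(line)
        if match is None:
            continue  # not a failed GetStatsVDSCommand entry
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        host_id = match.group("host_id")
        if host_id not in first or ts < first[host_id][0]:
            first[host_id] = (ts, match.group("host"))
    return first


def loss_order(lines):
    """Return HostIds sorted by the time each host first failed to respond."""
    first = first_failures(lines)
    return sorted(first, key=lambda host_id: first[host_id][0])
```

Feeding it the file, for example `loss_order(open("engine.log"))`, returns the HostIds in the order they first lost connection, which can then be compared with the host on which vdsmd was actually stopped.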
This bug's status was moved to MODIFIED before engine vt5 was built, hence moving to ON_QA. If this was a mistake and the fix isn't in, please contact rhev-integ.
Verified with rhevm-3.5.0-0.14.beta.el6ev.noarch according to the description.
rhev 3.5.0 was released. closing.