Bug 1147487 - [Fencing] stop vdsmd on one host is followed by engine connection problem to the host and to a second host, fencing flow isn't invoked
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Piotr Kliczewski
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On:
Blocks: rhev35betablocker 1149832 rhev35rcblocker rhev35gablocker
 
Reported: 2014-09-29 11:32 UTC by sefi litmanovich
Modified: 2016-02-10 19:18 UTC (History)
CC List: 12 users

Fixed In Version: org.ovirt.engine-root-3.5.0-14
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-17 17:09:18 UTC
oVirt Team: Infra


Attachments
engine log (1.19 MB, text/plain)
2014-09-29 11:32 UTC, sefi litmanovich


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 33340 master MERGED core: connection timeout overriden by client timeout Never
oVirt gerrit 33643 ovirt-3.5 MERGED ssl: ssl_accept blocks after reboot Never
oVirt gerrit 33666 ovirt-engine-3.5 MERGED core: connection timeout overriden by client timeout Never
oVirt gerrit 33668 ovirt-engine-3.5.0 MERGED core: connection timeout overriden by client timeout Never
oVirt gerrit 33670 ovirt-3.5 MERGED ssl: adding ssl threading initialization Never
oVirt gerrit 33671 ovirt-3.5 MERGED sslutils: Document M2Crypto threading initialization Never
oVirt gerrit 33675 ovirt-3.5 MERGED Revert "ssl: adding ssl threading initialization" Never
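
As background on the recurring patch title "core: connection timeout overriden by client timeout", the following is a minimal, generic Python sketch of that kind of pitfall; the host name, port and timeout values are placeholders, and this is not the actual ovirt-engine or vdsm code. The point illustrated: a single client-side timeout applied to the whole socket silently replaces the short connection timeout that host monitoring relies on.

    import socket

    # Placeholders for illustration only; not taken from this bug.
    VDSM_HOST = "hypothetical-host.example.com"
    VDSM_PORT = 54321

    CONNECT_TIMEOUT = 2        # seconds: fail fast when the host is down
    CLIENT_READ_TIMEOUT = 180  # seconds: generous timeout for slow RPC replies


    def connect_buggy():
        # Pitfall: one timeout is applied to the whole socket, so the long
        # read timeout also governs connect(), and a dead host keeps the
        # caller waiting minutes instead of seconds.
        sock = socket.socket()
        sock.settimeout(CLIENT_READ_TIMEOUT)
        sock.connect((VDSM_HOST, VDSM_PORT))
        return sock


    def connect_fixed():
        # Use the short timeout for the connection attempt only, then switch
        # to the longer timeout for subsequent reads and writes.
        sock = socket.create_connection((VDSM_HOST, VDSM_PORT),
                                        timeout=CONNECT_TIMEOUT)
        sock.settimeout(CLIENT_READ_TIMEOUT)
        return sock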

Description sefi litmanovich 2014-09-29 11:32:22 UTC
Created attachment 942291 [details]
engine log

Description of problem:

Tried to test soft fencing and observed strange behaviour.

Running the engine with 3 hosts (2 RHEL 6.5, 1 RHEL 7).

1. SSH to one of the RHEL 6.5 hosts and stop the vdsmd service (see the sketch after this list).
2. After the wait time, both that host and the second RHEL 6.5 host move to Connecting state with the message:

Host {host_name} is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

3. After 3 minutes the hosts are still in Connecting state and the same message is issued once more.

4. Only after another 3 minutes does the engine try to soft-fence BOTH hosts, and it fails; the hosts become Non Responsive.

5. One of the hosts has power management configured, but the fence flow isn't invoked; only STATUS fence commands are issued.

6. The message regarding host status and the grace period for fencing reappears, and both hosts' statuses are Connecting again.

7. The third host, running RHEL 7, is the SPM and stays up the whole time.
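
For reference, a minimal reproduction sketch in Python, assuming root SSH access; the host name is a placeholder and the SysV "service" commands match the RHEL 6.5 hosts in this setup.

    import subprocess
    import time

    HOST = "rhel65-host-1.example.com"  # placeholder host name


    def stop_vdsmd(host):
        # RHEL 6.5 uses SysV init; on RHEL 7 this would be "systemctl stop vdsmd".
        subprocess.check_call(["ssh", "root@" + host, "service vdsmd stop"])


    def start_vdsmd(host):
        subprocess.check_call(["ssh", "root@" + host, "service vdsmd start"])


    if __name__ == "__main__":
        stop_vdsmd(HOST)
        # Expected: only this host goes non-responsive and soft fencing
        # restarts vdsmd after the 60-second grace period.
        # Observed: a second host also enters Connecting state and no real
        # fence flow is invoked (only STATUS fence commands).
        time.sleep(600)    # watch engine.log / webadmin during this window
        start_vdsmd(HOST)  # manual recovery, as described under Actual results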

 

Version-Release number of selected component (if applicable):

rhevm-3.5.0-0.13.beta.el6ev.noarch

How reproducible:

always (tried it out 3 times)

Actual results:

This goes on and on; the only way I got the hosts back up was restarting vdsmd on the host where it was down, and restarting the engine.
After the engine restart, both hosts are up again.

Expected results:

Only the host with vdsmd down should become Non Responsive.
After the grace period, soft fencing should restart vdsmd and the host should come back up.
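
In other words, the expected engine-side flow is: one grace period, then SSH soft fencing, and only on failure a real power fence. A minimal sketch of that expectation follows; the callables are hypothetical stand-ins, not the engine's actual API.

    import time

    GRACE_PERIOD = 60  # seconds, per the engine's Connecting-state message


    def handle_non_responsive(host, is_responsive, ssh_soft_fence, power_fence):
        # Sketch of the expected flow: wait out one grace period, try SSH
        # soft fencing, and only if that fails fall back to real power
        # fencing (when power management is configured).
        time.sleep(GRACE_PERIOD)
        if is_responsive(host):
            return "up"
        if ssh_soft_fence(host) and is_responsive(host):
            return "up"            # vdsmd restarted, host recovers
        if power_fence(host):
            return "rebooting"     # real fence flow, not just STATUS commands
        return "non_responsive"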

Additional info:

Please see engine.log from:

2014-09-29 14:01:29,352 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-53) Command GetStatsVDSCommand(HostName = {host_name}, HostId = f10f0801-7d59-421d-9d4a-a5fa3b874b13, vds=Host[{host_name},f10f0801-7d59-421d-9d4a-a5fa3b874b13]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'

Please notice that, according to the log, it is the untouched host that lost connection first, and only afterwards the host where vdsmd was stopped.
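
To check that ordering in a log this size, here is a small sketch that prints the first "Message timeout" VDSNetworkException per host; the log path and the assumption that each error sits on one line follow the excerpt above.

    import re

    # Assumed pattern based on the excerpt above: the line contains
    # "HostName = <name>," and the timeout exception text.
    PATTERN = re.compile(r"HostName = (\S+?),.*VDSNetworkException.*Message timeout")


    def first_timeout_per_host(log_path="engine.log"):
        # Return {host_name: timestamp} for the first timeout seen per host,
        # which shows which host the engine lost contact with first.
        first_seen = {}
        with open(log_path) as log:
            for line in log:
                match = PATTERN.search(line)
                if match and match.group(1) not in first_seen:
                    # engine.log lines start with "<date> <time> LEVEL ..."
                    first_seen[match.group(1)] = " ".join(line.split()[:2])
        return first_seen


    if __name__ == "__main__":
        for host, when in sorted(first_timeout_per_host().items(),
                                 key=lambda kv: kv[1]):
            print(when, host)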

Comment 2 Eyal Edri 2014-10-07 07:13:08 UTC
This bug's status was moved to MODIFIED before engine vt5 was built, hence moving it to ON_QA. If this was a mistake and the fix isn't in, please contact rhev-integ@redhat.com.

Comment 3 sefi litmanovich 2014-10-14 14:31:03 UTC
Verified with rhevm-3.5.0-0.14.beta.el6ev.noarch according to the description.

Comment 4 Eyal Edri 2015-02-17 17:09:18 UTC
RHEV 3.5.0 was released. Closing.

