Bug 1415740

Summary: Soft fencing causes a "host is rebooting" message that might be misleading
Product: [oVirt] ovirt-engine
Reporter: Michael Burman <mburman>
Component: BLL.Infra
Assignee: Miroslava Voglova <mvoglova>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.1.0.2
CC: bugs, mburman, mperina, oourfali, pmatyas, ybronhei
Target Milestone: ovirt-4.2.0
Flags: rule-engine: ovirt-4.2+
Target Release: 4.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-20 10:45:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  messages log (flags: none)
  engine log (flags: none)

Description Michael Burman 2017-01-23 15:28:05 UTC
Created attachment 1243632 [details]
messages log

Description of problem:
mom-vdsm brings vdsmd up on its own.

It seems that mom-vdsm brings vdsmd up even though I stopped vdsmd with systemctl.

Jan 23 16:28:44 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopping MOM instance configured for VDSM purposes...
...
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopped MOM instance configured for VDSM purposes.
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Unit mom-vdsm.service entered failed state.
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: mom-vdsm.service failed. (expected, we should probably increase the timeout)
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopping Virtual Desktop Server Manager...
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopped Virtual Desktop Server Manager.

Jan 23 16:30:03 orchid-vds1.qa.lab.tlv.redhat.com sshd[15886]: Accepted publickey for root from 10.35.163.149 port 56318 ssh2: RSA 2c:d8:62:

The host resolves to: mburman-4-upgrade-env.scl.lab.tlv.redhat.com

Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd-logind[694]: New session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Started Session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Starting Session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com sshd[15886]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Starting Virtual Desktop Server Manager

Version-Release number of selected component (if applicable):
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stop vdsmd on host using systemctl command

Actual results:
After a minute the host comes up and vdsmd is running.
The engine reports in the UI event log:

Host 10.35.129.22 is non responsive.
Status of host 10.35.129.22 was set to Up.
Host 10.35.129.22 is rebooting. (why rebooting?)

Expected results:
If vdsmd is stopped with systemctl, it should remain down.

Comment 1 Yaniv Bronhaim 2017-01-23 15:33:48 UTC
Watching the engine log you can see that the engine triggers soft fencing when you stop the vdsm service and it gets a network exception. Soft fencing SSHes to the host and restarts vdsmd. If that works, the host goes back to Up; if not, an actual fence is done.

In any case, this has been the behavior since 4.0 and it is not related to mom.
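
To illustrate the flow being described here, a rough Python sketch only, not the actual ovirt-engine code; the host address, the SSH setup and the power-fence callback are placeholders:

import subprocess

def ssh_soft_fence(host):
    """Try to restart vdsmd over SSH; returns True on success.
    Assumes key-based root SSH access from the engine, as soft fencing uses."""
    result = subprocess.run(
        ["ssh", "root@" + host, "systemctl restart vdsmd"],
        capture_output=True,
        timeout=60,
    )
    return result.returncode == 0

def handle_network_exception(host, power_fence):
    # 1st attempt: SSH Soft Fencing - restart vdsmd on the host.
    if ssh_soft_fence(host):
        return "host goes back to Up"
    # 2nd attempt: actual fence via Power Management (callback supplied by caller).
    power_fence(host)
    return "host restarted via power management"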

Comment 2 Martin Sivák 2017-01-23 15:37:40 UTC
The ssh connection from mburman-4-upgrade-env.scl.lab.tlv.redhat.com is bringing vdsm up. Mom has nothing to do with it according to this log (it is only started after the vdsm service).

Comment 3 Michael Burman 2017-01-23 15:44:21 UTC
OK, but this behavior is still wrong.
It's also new; it didn't happen on 4.0.
And it happens only on some hosts, not on all of them.
What if I need to stop vdsmd on the host?

Comment 4 Yaniv Bronhaim 2017-01-23 17:31:47 UTC
You probably need to update the bug component as well. But look, this is the soft fencing logic: you need to disable fencing in the cluster edit tab. I'm quite sure it will do the trick and nothing will disturb your vdsmd downtime.
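
For reference, the cluster fencing policy can also be disabled through the oVirt Python SDK; a minimal sketch, assuming the engine URL, credentials and cluster name below are placeholders for your environment (check the attribute names against your ovirtsdk4 version):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine REST API (values below are placeholders).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # lab setup without a trusted CA
)

clusters_service = connection.system_service().clusters_service()
cluster = clusters_service.list(search='name=Default')[0]

# Turn off fencing for the cluster so a manually stopped vdsmd stays down.
clusters_service.cluster_service(cluster.id).update(
    types.Cluster(fencing_policy=types.FencingPolicy(enabled=False)),
)
connection.close()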

Comment 5 Oved Ourfali 2017-01-24 07:38:01 UTC
Michael - can you attach the engine.log?
I want to check where the "rebooting" comes from.

Comment 6 Martin Perina 2017-01-24 07:45:53 UTC
If you don't put the host into Maintenance and stop VDSM, the host becomes NonResponsive and the engine will try to fence it (first by SSH Soft Fencing and then by Power Management). If your host was in Maintenance before you stopped VDSM manually, then please provide engine logs.
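
Put differently, the non-responsive treatment only runs for hosts that are not in Maintenance, and it escalates in the order above; a small illustrative sketch (the fencing executor is a placeholder, not engine code):

def on_heartbeat_lost(host, attempt_fence):
    """attempt_fence(host, method) -> bool stands in for the engine's real
    fencing executors; 'host' only needs a 'status' attribute here."""
    if host.status == "Maintenance":
        # The admin took the host down on purpose - nothing is fenced.
        return "host left alone"
    host.status = "NonResponsive"
    for method in ("ssh_soft_fencing", "power_management_restart"):
        if attempt_fence(host, method):
            return "host recovered by " + method
    return "host stays NonResponsive"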

Comment 7 Michael Burman 2017-01-24 07:49:45 UTC
Created attachment 1243847 [details]
engine log

Comment 8 Oved Ourfali 2017-01-24 08:19:04 UTC
So it seems like this is a generic message returned when the non-responsive treatment succeeds.
The soft fencing is part of that treatment.

Martin - thoughts?
I think we can rephrase, or add a specific audit log message to distinguish the cases.
Also, what happens in case it is skipped?

I've updated the title and set the severity to low.

Comment 9 Martin Perina 2017-01-24 09:29:03 UTC
Right, we don't distinguish between SSH Soft Fencing success and PM restart success when returning the result of VdsNotRespondingTreatment (the code works fine, but the events displayed to the user may be a bit confusing). For now moving to 4.2; when a patch is ready we can retarget.
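
For illustration, the kind of distinction being discussed could look like the sketch below; the wording and function names are invented for this example and are not the actual AuditLogType messages in ovirt-engine:

def report_treatment_result(host_name, recovered_by, emit_event):
    # Hypothetical messages - today the engine reports a single
    # "Host ... is rebooting" event for both recovery paths.
    if recovered_by == "ssh_soft_fencing":
        emit_event("Host %s was recovered by restarting VDSM over SSH." % host_name)
    elif recovered_by == "power_management_restart":
        emit_event("Host %s is rebooting (restarted via power management)." % host_name)
    else:
        emit_event("Non responsive treatment of host %s failed." % host_name)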

Comment 10 Petr Matyáš 2017-11-14 15:58:43 UTC
Verified on ovirt-engine-4.2.0-0.0.master.20171113223918.git25568c3.el7.centos.noarch

Comment 11 Sandro Bonazzola 2017-12-20 10:45:21 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.