Bug 1415740 - Soft fencing causes a "host is rebooting" message that might be misleading
Summary: Soft fencing causes a "host is rebooting" message that might be misleading
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.1.0.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Miroslava Voglova
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-23 15:28 UTC by Michael Burman
Modified: 2017-12-20 10:45 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-20 10:45:21 UTC
oVirt Team: Infra
rule-engine: ovirt-4.2+


Attachments
messages log (160.33 KB, application/x-gzip), 2017-01-23 15:28 UTC, Michael Burman
engine log (432.61 KB, application/x-gzip), 2017-01-24 07:49 UTC, Michael Burman


Links
oVirt gerrit 83653 (master, MERGED): core: changed soft fencing message, last updated 2017-11-08 13:12:38 UTC

Description Michael Burman 2017-01-23 15:28:05 UTC
Created attachment 1243632 [details]
messages log

Description of problem:
mom-vdsm brings vdsmd up on its own.

It seems that mom-vdsm is bringing vdsmd up although I stopped vdsmd with systemctl.

Jan 23 16:28:44 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopping MOM instance configured for VDSM purposes...
...
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopped MOM instance configured for VDSM purposes.
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Unit mom-vdsm.service entered failed state.
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: mom-vdsm.service failed. (expected, we should probably increase the timeout)
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopping Virtual Desktop Server Manager...
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopped Virtual Desktop Server Manager.

Jan 23 16:30:03 orchid-vds1.qa.lab.tlv.redhat.com sshd[15886]: Accepted publickey for root from 10.35.163.149 port 56318 ssh2: RSA 2c:d8:62:

The connecting host (10.35.163.149) resolves to: mburman-4-upgrade-env.scl.lab.tlv.redhat.com

Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd-logind[694]: New session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Started Session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Starting Session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com sshd[15886]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Starting Virtual Desktop Server Manager

Version-Release number of selected component (if applicable):
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stop vdsmd on host using systemctl command

Actual results:
After a minute the host comes up and vdsmd is running.
The engine reports in the UI event log:

Host 10.35.129.22 is non responsive.
Status of host 10.35.129.22 was set to Up.
Host 10.35.129.22 is rebooting. (why rebooting?)

Expected results:
If vdsmd is stopped with systemctl, it should remain down.

Comment 1 Yaniv Bronhaim 2017-01-23 15:33:48 UTC
Watching the engine log you can see that the engine triggers soft fencing when you stop the vdsm service and it gets a network exception. Soft fencing SSHes to the host and restarts vdsmd; if that works, the host gets back to Up, and if not, an actual fence is done.

In any case, this has happened since 4.0 and it's not related to mom.
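
To make the flow above concrete, here is a minimal Python sketch of the described treatment. It is illustrative only: the real logic lives in ovirt-engine's Java code, and the host address, the SSH access, and the power_management_fence() stub below are assumptions, not the actual implementation.

import subprocess

def ssh_soft_fence(host: str) -> bool:
    """Try to restart vdsmd over SSH, as SSH Soft Fencing does."""
    result = subprocess.run(
        ["ssh", "root@" + host, "systemctl", "restart", "vdsmd"],
        capture_output=True, timeout=60,
    )
    return result.returncode == 0

def power_management_fence(host: str) -> bool:
    """Hypothetical stub; a real implementation would restart the host
    through its power management (IPMI/iLO) device."""
    return False

def non_responsive_treatment(host: str) -> str:
    # Step 1: SSH Soft Fencing - restart vdsmd over SSH.
    if ssh_soft_fence(host):
        return "host recovered, vdsmd restarted via SSH Soft Fencing"
    # Step 2: only if that fails, fall back to an actual power-management fence.
    if power_management_fence(host):
        return "host restarted via Power Management"
    return "fencing failed, manual intervention needed"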

Comment 2 Martin Sivák 2017-01-23 15:37:40 UTC
The ssh connection from mburman-4-upgrade-env.scl.lab.tlv.redhat.com is bringing vdsm up. Mom has nothing to do with it according to this log (it is only started after the vdsm service).

Comment 3 Michael Burman 2017-01-23 15:44:21 UTC
OK, but this behavior is still wrong.
And it's new; it didn't happen on 4.0.
And it happens only on some hosts, not on all of them.
What if I need to stop vdsmd on the host?

Comment 4 Yaniv Bronhaim 2017-01-23 17:31:47 UTC
You probably need to update the bug component as well. But look, this is the soft fencing logic; you need to disable fencing in the cluster edit tab. I'm quite sure that will do the trick and nothing will disturb your vdsmd downtime.
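
The same setting can also be changed programmatically. Below is a small sketch using the oVirt Python SDK (ovirt-engine-sdk-python); the engine URL, credentials and cluster name are placeholders, and the exact FencingPolicy fields should be checked against the SDK version you run.

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,  # use ca_file=... in a real deployment
)
try:
    clusters_service = connection.system_service().clusters_service()
    cluster = clusters_service.list(search="name=Default")[0]
    # Disable fencing for the whole cluster, equivalent to unchecking
    # "Enable fencing" in the cluster edit dialog.
    clusters_service.cluster_service(cluster.id).update(
        types.Cluster(fencing_policy=types.FencingPolicy(enabled=False))
    )
finally:
    connection.close()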

Comment 5 Oved Ourfali 2017-01-24 07:38:01 UTC
Michael - can you attach the engine.log?
I want to check where the "rebooting" comes from.

Comment 6 Martin Perina 2017-01-24 07:45:53 UTC
If you don't put the host into Maintenance and you stop VDSM, the host becomes NonResponsive and the engine will try to fence it (first by SSH Soft Fencing and then by Power Management). If your host was in Maintenance before you stopped VDSM manually, then please provide engine logs.
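
For completeness, a small sketch of putting the host into Maintenance through the oVirt Python SDK before stopping vdsmd manually, so the engine does not start the non-responsive treatment. The connection details and host name are placeholders.

import ovirtsdk4 as sdk

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,
)
try:
    hosts_service = connection.system_service().hosts_service()
    host = hosts_service.list(search="name=myhost")[0]  # placeholder host name
    hosts_service.host_service(host.id).deactivate()    # move to Maintenance
    # It is now safe to run "systemctl stop vdsmd" on the host itself.
finally:
    connection.close()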

Comment 7 Michael Burman 2017-01-24 07:49:45 UTC
Created attachment 1243847 [details]
engine log

Comment 8 Oved Ourfali 2017-01-24 08:19:04 UTC
So it seems like this is a general message returned when the non-responsive treatment succeeds.
The soft fencing is part of that treatment.

Martin - thoughts?
I think we can rephrase, or add a specific audit log message to distinguish the cases.
Also, what happens in case it is skipped?

I've updated the title and set the severity to low.

Comment 9 Martin Perina 2017-01-24 09:29:03 UTC
Right, we don't distinguish between SSH Soft Fencing success and PM Restart success when returning the result of VdsNotRespondingTreatment (the code works fine, but the events displayed to the user may be a bit confusing). For now moving to 4.2; when a patch is ready we can retarget.
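
The eventual fix (gerrit 83653) changes the soft fencing message in ovirt-engine's Java code. The Python sketch below only illustrates the idea of reporting which step of the treatment succeeded instead of a single "host is rebooting" event; the message wording here is illustrative, not the actual strings used by the patch.

from enum import Enum

class FenceMethod(Enum):
    SSH_SOFT_FENCE = "SSH Soft Fencing"
    POWER_MANAGEMENT = "Power Management"

def audit_message(host: str, method: FenceMethod) -> str:
    if method is FenceMethod.SSH_SOFT_FENCE:
        # vdsmd was restarted over SSH; the host itself never rebooted.
        return "Host %s was restarted via SSH Soft Fencing." % host
    return "Host %s is rebooting (Power Management restart)." % host

print(audit_message("10.35.129.22", FenceMethod.SSH_SOFT_FENCE))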

Comment 10 Petr Matyáš 2017-11-14 15:58:43 UTC
Verified on ovirt-engine-4.2.0-0.0.master.20171113223918.git25568c3.el7.centos.noarch

Comment 11 Sandro Bonazzola 2017-12-20 10:45:21 UTC
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

