Bug 1415740

Summary: Soft fencing causes a "host is rebooting" message that might be misleading
Product: [oVirt] ovirt-engine
Reporter: Michael Burman <mburman>
Component: BLL.Infra
Assignee: Miroslava Voglova <mvoglova>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.1.0.2
CC: bugs, mburman, mperina, oourfali, pmatyas, ybronhei
Target Milestone: ovirt-4.2.0
Flags: rule-engine: ovirt-4.2+
Target Release: 4.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-20 10:45:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  messages log (flags: none)
  engine log (flags: none)

Description Michael Burman 2017-01-23 15:28:05 UTC
Created attachment 1243632 [details]
messages log

Description of problem:
mom-vdsm brings vdsmd up on its own.

It seems that mom-vdsm brings vdsmd up even though I stopped vdsmd with systemctl.

Jan 23 16:28:44 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopping MOM instance configured for VDSM purposes...
...
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopped MOM instance configured for VDSM purposes.
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Unit mom-vdsm.service entered failed state.
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: mom-vdsm.service failed. (expected, we should probably increase the timeout)
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopping Virtual Desktop Server Manager...
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Stopped Virtual Desktop Server Manager.

Jan 23 16:30:03 orchid-vds1.qa.lab.tlv.redhat.com sshd[15886]: Accepted publickey for root from 10.35.163.149 port 56318 ssh2: RSA 2c:d8:62:

The host resolves to: mburman-4-upgrade-env.scl.lab.tlv.redhat.com

Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd-logind[694]: New session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Started Session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Starting Session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com sshd[15886]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd[1]: Starting Virtual Desktop Server Manager

Version-Release number of selected component (if applicable):
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stop vdsmd on host using systemctl command

Actual results:
After a minute the host comes up and vdsmd is running.
The engine reports in the UI event log:

Host 10.35.129.22 is non responsive.
Status of host 10.35.129.22 was set to Up.
Host 10.35.129.22 is rebooting. (why rebooting?)

Expected results:
If vdsmd is stopped with systemctl, it should remain down.

Comment 1 Yaniv Bronhaim 2017-01-23 15:33:48 UTC
Watching the engine log you can see that the engine triggers soft fencing when you stop the vdsm service and it gets a network exception. Soft fencing SSHes to the host and restarts vdsmd. If that works, the host goes back to Up; if not, an actual fence is done.

In any case, this has been the behavior since 4.0 and it is not related to mom.
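
To illustrate the flow being described here, a rough Python sketch only, not the actual ovirt-engine code; the host address, the SSH setup and the power-fence callback are placeholders:

import subprocess

def ssh_soft_fence(host):
    """Try to restart vdsmd over SSH; returns True on success.
    Assumes key-based root SSH access from the engine, as soft fencing uses."""
    result = subprocess.run(
        ["ssh", "root@" + host, "systemctl restart vdsmd"],
        capture_output=True,
        timeout=60,
    )
    return result.returncode == 0

def handle_network_exception(host, power_fence):
    # 1st attempt: SSH Soft Fencing - restart vdsmd on the host.
    if ssh_soft_fence(host):
        return "host goes back to Up"
    # 2nd attempt: actual fence via Power Management (callback supplied by caller).
    power_fence(host)
    return "host restarted via power management"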

Comment 2 Martin Sivák 2017-01-23 15:37:40 UTC
The ssh connection from mburman-4-upgrade-env.scl.lab.tlv.redhat.com is bringing vdsm up. Mom has nothing to do with it according to this log (it is only started after the vdsm service).

Comment 3 Michael Burman 2017-01-23 15:44:21 UTC
OK, but this behavior is still wrong.
It's also new; it didn't happen on 4.0.
And it happens only on some hosts, not on all of them.
What if I need to stop vdsmd on the host?

Comment 4 Yaniv Bronhaim 2017-01-23 17:31:47 UTC
You probably need to update the bug component as well. But look, this is the soft fencing logic: you need to disable fencing in the cluster edit tab. I'm quite sure it will do the trick and nothing will disturb your vdsmd downtime.
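
For reference, the cluster fencing policy can also be disabled through the oVirt Python SDK; a minimal sketch, assuming the engine URL, credentials and cluster name below are placeholders for your environment (check the attribute names against your ovirtsdk4 version):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine REST API (values below are placeholders).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # lab setup without a trusted CA
)

clusters_service = connection.system_service().clusters_service()
cluster = clusters_service.list(search='name=Default')[0]

# Turn off fencing for the cluster so a manually stopped vdsmd stays down.
clusters_service.cluster_service(cluster.id).update(
    types.Cluster(fencing_policy=types.FencingPolicy(enabled=False)),
)
connection.close()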

Comment 5 Oved Ourfali 2017-01-24 07:38:01 UTC
Michael - can you attach the engine.log?
I want to check where the "rebooting" comes from.

Comment 6 Martin Perina 2017-01-24 07:45:53 UTC
If you don't put the host into Maintenance and stop VDSM, the host becomes NonResponsive and the engine will try to fence it (first by SSH Soft Fencing and then by Power Management). If your host was in Maintenance before you stopped VDSM manually, then please provide engine logs.
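
Put differently, the non-responsive treatment only runs for hosts that are not in Maintenance, and it escalates in the order above; a small illustrative sketch (the fencing executor is a placeholder, not engine code):

def on_heartbeat_lost(host, attempt_fence):
    """attempt_fence(host, method) -> bool stands in for the engine's real
    fencing executors; 'host' only needs a 'status' attribute here."""
    if host.status == "Maintenance":
        # The admin took the host down on purpose - nothing is fenced.
        return "host left alone"
    host.status = "NonResponsive"
    for method in ("ssh_soft_fencing", "power_management_restart"):
        if attempt_fence(host, method):
            return "host recovered by " + method
    return "host stays NonResponsive"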

Comment 7 Michael Burman 2017-01-24 07:49:45 UTC
Created attachment 1243847 [details]
engine log

Comment 8 Oved Ourfali 2017-01-24 08:19:04 UTC
So it seems like this is a generic message returned when the non-responsive treatment succeeds.
The soft fencing is part of that treatment.

Martin - thoughts?
I think we can rephrase, or add a specific audit log message to distinguish the cases.
Also, what happens in case it is skipped?

I've updated the title and set the severity to low.

Comment 9 Martin Perina 2017-01-24 09:29:03 UTC
Right, we don't distinguish between SSH Soft Fencing success and PM restart success when returning the result of VdsNotRespondingTreatment (the code works fine, but the events displayed to the user may be a bit confusing). For now moving to 4.2; when a patch is ready we can retarget.
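
For illustration, the kind of distinction being discussed could look like the sketch below; the wording and function names are invented for this example and are not the actual AuditLogType messages in ovirt-engine:

def report_treatment_result(host_name, recovered_by, emit_event):
    # Hypothetical messages - today the engine reports a single
    # "Host ... is rebooting" event for both recovery paths.
    if recovered_by == "ssh_soft_fencing":
        emit_event("Host %s was recovered by restarting VDSM over SSH." % host_name)
    elif recovered_by == "power_management_restart":
        emit_event("Host %s is rebooting (restarted via power management)." % host_name)
    else:
        emit_event("Non responsive treatment of host %s failed." % host_name)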

Comment 10 Petr Matyáš 2017-11-14 15:58:43 UTC
Verified on ovirt-engine-4.2.0-0.0.master.20171113223918.git25568c3.el7.centos.noarch

Comment 11 Sandro Bonazzola 2017-12-20 10:45:21 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.