Created attachment 1243632
Description of problem:
mom-vdsm brings vdsmd up on its own.
It seems that mom-vdsm is bringing vdsmd up although I stopped vdsmd with systemctl.
Jan 23 16:28:44 orchid-vds1.qa.lab.tlv.redhat.com systemd: Stopping MOM instance configured for VDSM purposes...
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd: Stopped MOM instance configured for VDSM purposes.
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd: Unit mom-vdsm.service entered failed state.
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd: mom-vdsm.service failed. (expected, we should probably increase the timeout)
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd: Stopping Virtual Desktop Server Manager...
Jan 23 16:28:54 orchid-vds1.qa.lab.tlv.redhat.com systemd: Stopped Virtual Desktop Server Manager.
Jan 23 16:30:03 orchid-vds1.qa.lab.tlv.redhat.com sshd: Accepted publickey for root from 10.35.163.149 port 56318 ssh2: RSA 2c:d8:62:
(The host at 10.35.163.149 resolves to: mburman-4-upgrade-env.scl.lab.tlv.redhat.com)
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd-logind: New session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd: Started Session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd: Starting Session 21 of user root.
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com sshd: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 23 16:30:04 orchid-vds1.qa.lab.tlv.redhat.com systemd: Starting Virtual Desktop Server Manager
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Stop vdsmd on the host using the systemctl command.
After a minute the host comes back up and vdsmd is running.
The engine reports in the UI event log:
Host 10.35.129.22 is non responsive.
Status of host 10.35.129.22 was set to Up.
Host 10.35.129.22 is rebooting. (why rebooting?)
If vdsmd is stopped with systemctl, it should remain down.
Watching the engine log, you can see that the engine triggers soft fencing when you stop the vdsm service and it gets a network exception. Soft fencing SSHes to the host and restarts vdsmd: if that works, the host goes back to Up; if not, an actual fence is done.
In any case, this should have been happening since 4.0, and it's not related to MOM.
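For illustration, here is roughly what the SSH Soft Fencing step boils down to. A minimal sketch assuming a plain ssh subprocess and root key access; the helper name is invented and the engine's real implementation is Java:

import subprocess

def ssh_soft_fence(host, user="root", timeout=60):
    # Invented helper: SSH to the non-responsive host and restart vdsmd,
    # which is what produces the "Starting Virtual Desktop Server Manager"
    # line in the log above.
    result = subprocess.run(
        ["ssh", "%s@%s" % (user, host), "systemctl", "restart", "vdsmd"],
        capture_output=True,
        timeout=timeout,
    )
    # Success -> the engine moves the host back to Up; failure -> it
    # escalates to a real (power management) fence.
    return result.returncode == 0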
The SSH connection from mburman-4-upgrade-env.scl.lab.tlv.redhat.com is what brings vdsm up. MOM has nothing to do with it according to this log (it is only started after the vdsm service).
OK, but this behavior is still wrong.
And it's new; it didn't happen on 4.0.
And it happens only on some hosts, not on all of them.
What if I need to stop vdsmd on the host?
You probably need to update the bug component as well. But look, this is the soft fencing logic: you need to disable fencing in the cluster edit tab. I'm quite sure that will do the trick and nothing will disturb your vdsmd downtime.
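If you would rather script that than click through the UI, something like the following should work with the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, and cluster name below are placeholders:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details for the example.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,
)
clusters_service = connection.system_service().clusters_service()
# Assumed cluster name; use the cluster your host belongs to.
cluster = clusters_service.list(search="name=Default")[0]
# Same effect as unchecking "Enable fencing" in the cluster edit tab.
clusters_service.cluster_service(cluster.id).update(
    types.Cluster(fencing_policy=types.FencingPolicy(enabled=False))
)
connection.close()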
Michael - can you attach the engine.log?
I want to check where the "rebooting" comes from.
If you don't put the host into Maintenance and stop VDSM, the host becomes NonResponsive and the engine will try to fence it (first by SSH Soft Fencing and then by Power Management). If your host was in Maintenance before you stopped VDSM manually, then please provide engine logs.
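Schematically, the treatment described above escalates like this; a sketch with invented names and stubbed-out fencing steps, not the engine's actual Java code:

from dataclasses import dataclass

@dataclass
class Host:
    address: str
    in_maintenance: bool = False
    status: str = "Up"

def ssh_soft_fence(address):
    # Stub standing in for "SSH in and restart vdsmd" (see the earlier sketch).
    return False

def power_management_restart(address):
    # Stub standing in for a real power management (IPMI, iLO, ...) restart.
    return False

def vds_not_responding_treatment(host):
    if host.in_maintenance:
        return  # hosts in Maintenance are never fenced
    host.status = "NonResponsive"
    if ssh_soft_fence(host.address):              # 1st: SSH Soft Fencing
        host.status = "Up"
    elif power_management_restart(host.address):  # 2nd: Power Management
        host.status = "Reboot"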
Created attachment 1243847
So it seems like this is a generic message returned when the non-responsive treatment succeeds.
Soft fencing is part of that treatment.
Martin - thoughts?
I think we can rephrase, or add a specific audit log message to distinguish the cases.
Also, what happens in case it is skipped?
I've updated the title and set the severity to low.
Right, we don't distinguish between SSH Soft Fencing success and PM Restart success when returning the result of VdsNotRespondingTreatment (the code works fine, but the events displayed to the user may be a bit confusing). For now, moving to 4.2; when a patch is ready we can retarget.
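To illustrate the kind of patch being discussed, distinct audit events for the two recovery paths might look like this; a schematic Python sketch with invented event names, not the engine's actual Java audit log code:

from enum import Enum

class RecoveryAuditEvent(Enum):
    # Invented names: one event per recovery path instead of a single
    # generic "Host ... is rebooting" message.
    SSH_SOFT_FENCE_SUCCEEDED = "Host {host} was recovered by restarting VDSM over SSH."
    PM_RESTART_SUCCEEDED = "Host {host} is rebooting (restarted via Power Management)."

def audit_treatment_result(host, recovered_by_soft_fence):
    # Pick the event matching the path that actually succeeded.
    event = (RecoveryAuditEvent.SSH_SOFT_FENCE_SUCCEEDED
             if recovered_by_soft_fence
             else RecoveryAuditEvent.PM_RESTART_SUCCEEDED)
    print(event.value.format(host=host))

With that split, the "why rebooting?" question above answers itself: the generic message says "rebooting" even when only vdsmd was restarted over SSH.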
Verified on ovirt-engine-4.2.0-0.0.master.20171113223918.git25568c3.el7.centos.noarch
This bug is included in the oVirt 4.2.0 release, published on Dec 20th 2017.
Since the problem described in this bug report should be resolved in the oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.