Created attachment 1308337
engine log

Description of problem:
The host enters a power management restart loop, I think because of a NullPointerException in VdsNotRespondingTreatmentCommand:

2017-08-01 13:05:31,410+03 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-2) [] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand.executeCommand(VdsNotRespondingTreatmentCommand.java:174) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055) [bll.jar:]
        ...

Version-Release number of selected component (if applicable):
rhevm-4.1.4.2-0.1.el7.noarch

How reproducible:
25%

Steps to Reproduce:
1. Configure power management on the host
2. Set ovirtmgmt down on the host
   # ip link set down ovirtmgmt

Actual results:
The engine restarts the host via power management, but after only two minutes it restarts the host via power management again. This puts the host into a power management loop, so it never comes UP.

2017-08-01 13:02:36,695+03 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (org.ovirt.thread.pool-6-thread-5) [934039e] Power-Management: STOP of host 'host_mixed_1' initiated.
2017-08-01 13:03:09,587+03 INFO [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-5) [] Power-Management: START of host 'host_mixed_1' initiated.
...
2017-08-01 13:03:18,797+03 INFO [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-10) [] Waiting 300 seconds, for server to finish reboot process.
...
2017-08-01 13:04:54,074+03 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (org.ovirt.thread.pool-6-thread-2) [466bccca] Power-Management: STOP of host 'host_mixed_1' initiated.
2017-08-01 13:05:21,121+03 INFO [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-2) [] Power-Management: START of host 'host_mixed_1' initiated.

Expected results:
I believe the real interval (300 seconds) must be applied between power management operations.

Additional info:
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
So here's the issue: when the host is properly started during NotRespondingTreatment, we change its status to Reboot and execute sleepOnReboot, which waits in a different thread. In the meantime, if the server is really slow to boot and doesn't finish rebooting within 2 minutes, monitoring tries to monitor the host even in Reboot status, and when the host doesn't respond to communication, it is fenced again.

Eli, please take a look and try to find out why monitoring tries to monitor a host in Reboot status. If there is a reason for it, we will need to execute sleepOnReboot synchronously in StartVdsCommand so that it does not end before this timeout.
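To make the race described above concrete, here is a minimal sketch of the check that monitoring would need in order to leave a rebooting host alone. This is not the engine's code: the class RebootGraceSketch, the method shouldFence and the simplified HostStatus enum are hypothetical names invented for illustration, and the 300 second grace period is taken from the "Waiting 300 seconds" log line in the description.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch, not the actual oVirt engine classes.
public class RebootGraceSketch {

    // Simplified stand-in for the engine's host statuses (assumption).
    public enum HostStatus { UP, NON_RESPONSIVE, REBOOT }

    /**
     * Decides whether monitoring may start fencing a host.
     * A host that does not respond is normally fenced, but a host in
     * REBOOT status is left alone until the grace period has elapsed
     * since the power management START was issued.
     */
    public static boolean shouldFence(HostStatus status,
                                      boolean responding,
                                      Instant rebootStartedAt,
                                      Instant now,
                                      Duration grace) {
        if (responding) {
            return false; // a responding host never needs fencing
        }
        if (status == HostStatus.REBOOT && rebootStartedAt != null) {
            Duration elapsed = Duration.between(rebootStartedAt, now);
            // Fence only once the full grace period is over.
            return elapsed.compareTo(grace) >= 0;
        }
        return true; // not responding and not inside a reboot window
    }

    public static void main(String[] args) {
        Instant start = Instant.ofEpochSecond(0);
        Duration grace = Duration.ofSeconds(300);
        // 96 seconds after the wait began, as in the log excerpt
        // (13:03:18 to 13:04:54): still inside the grace period.
        System.out.println(shouldFence(HostStatus.REBOOT, false,
                start, start.plusSeconds(96), grace));
        // After the full 300 seconds the host may be fenced again.
        System.out.println(shouldFence(HostStatus.REBOOT, false,
                start, start.plusSeconds(300), grace));
    }
}
```

With such a check, the second STOP in the log would have been suppressed. Running sleepOnReboot synchronously in StartVdsCommand, as proposed above, is an alternative way to get the same effect.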
There is no script, it's just clean RHEL 7.4 host
(In reply to Petr Matyáš from comment #15)
> There is no script, it's just clean RHEL 7.4 host

Please provide a fresh engine log for the scenario. The attached engine log contains the NPE problem that was resolved in the 1st QE round; I would like to get a clean engine log without the NPE in order to investigate further.
Since the fix was merged on 9.8.2017 and the retest was executed on 23.8.2017 with 4.1.5-4, which was built on 15.8.2017, I thought the fix was already there. Was it not? Also, please put the needinfo on me (QA contact) instead of the reporter.
We are still not able to reproduce the issue, so we are retargeting to 4.1.7 for now. We are continuing to investigate and can release the fix as a 4.1.6 async if it is ready soon.
Verified on ovirt-engine-4.1.7.1-0.1.el7.noarch