Bug 1477700 - Host enters to power management restart loop
Summary: Host enters to power management restart loop
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.1.4.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.7
Target Release: 4.1.7.1
Assignee: Eli Mesika
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks: 1487981
Reported: 2017-08-02 16:15 UTC by Artyom
Modified: 2017-11-13 12:29 UTC

Fixed In Version:
Clone Of:
: 1487981 (view as bug list)
Environment:
Last Closed: 2017-11-13 12:29:51 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: exception+
lsvaty: testing_ack+


Attachments
engine log (382.18 KB, text/plain)
2017-08-02 16:15 UTC, Artyom


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 80345 0 master MERGED core: fixing NPE in fencing flow 2017-08-08 13:38:13 UTC
oVirt gerrit 80401 0 ovirt-engine-4.1 MERGED core: fixing NPE in fencing flow 2017-08-09 07:23:40 UTC
oVirt gerrit 81931 0 master MERGED core: disable host fencing in reboot 2017-09-19 13:27:35 UTC
oVirt gerrit 81932 0 ovirt-engine-4.1 MERGED core: disable host fencing in reboot 2017-09-19 21:52:49 UTC

Description Artyom 2017-08-02 16:15:08 UTC
Created attachment 1308337 [details]
engine log

Description of problem:
The host enters a power management restart loop, I think because of a NullPointerException in VdsNotRespondingTreatmentCommand:
2017-08-01 13:05:31,410+03 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-2) [] Exception: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand.executeCommand(VdsNotRespondingTreatmentCommand.java:174) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055) [bll.jar:]
...

Version-Release number of selected component (if applicable):
rhevm-4.1.4.2-0.1.el7.noarch

How reproducible:
25%

Steps to Reproduce:
1. Configure power management on host
2. Bring down the ovirtmgmt interface on the host: # ip link set down ovirtmgmt
3.

Actual results:
I can see that the engine restarts the host via power management, but only about two minutes later it restarts the host via power management again. This puts the host into a power management loop, so it never comes UP.
2017-08-01 13:02:36,695+03 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (org.ovirt.thread.pool-6-thread-5) [934039e] Power-Management: STOP of host 'host_mixed_1' initiated.
2017-08-01 13:03:09,587+03 INFO  [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-5) [] Power-Management: START of host 'host_mixed_1' initiated.
...
2017-08-01 13:03:18,797+03 INFO  [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-10) [] Waiting 300 seconds, for server to finish reboot process.
...
2017-08-01 13:04:54,074+03 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (org.ovirt.thread.pool-6-thread-2) [466bccca] Power-Management: STOP of host 'host_mixed_1' initiated.
2017-08-01 13:05:21,121+03 INFO  [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-2) [] Power-Management: START of host 'host_mixed_1' initiated.
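
For reference, the gap between the two power management cycles can be checked directly from the timestamps in the excerpt above. The following is a minimal Java sketch, assuming only the timestamp prefix of each log line is parsed (the +03 offset is dropped because every line shares it). The class and method names are illustrative and are not part of ovirt-engine.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class FenceInterval {
    // Matches the timestamp prefix of the engine log lines, without the UTC offset.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Elapsed time between two log timestamps taken from the same log file.
    public static Duration between(String first, String second) {
        return Duration.between(LocalDateTime.parse(first, TS),
                                LocalDateTime.parse(second, TS));
    }

    public static void main(String[] args) {
        String firstStop = "2017-08-01 13:02:36,695";   // first STOP initiated
        String waitBegins = "2017-08-01 13:03:18,797";  // "Waiting 300 seconds" logged
        String secondStop = "2017-08-01 13:04:54,074";  // second STOP initiated

        // Time between the two STOP operations.
        System.out.println(between(firstStop, secondStop).toMillis());   // prints 137379
        // Time from the start of the 300-second wait to the second STOP.
        System.out.println(between(waitBegins, secondStop).toMillis());  // prints 95277
    }
}
```

The two STOP operations are about 137 seconds apart, and the second one arrives about 95 seconds into the logged 300-second wait.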

Expected results:
I believe the full configured interval (300 seconds) must be applied between power management operations.

Additional info:

Comment 2 Red Hat Bugzilla Rules Engine 2017-08-14 14:21:48 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 13 Martin Perina 2017-08-23 13:48:51 UTC
So here's the issue:

When the host is properly started during NotRespondingTreatment, we change its status to Reboot and execute sleepOnReboot, which waits in a different thread. In the meantime, if the server is really slow to boot and doesn't finish rebooting within 2 minutes, monitoring tries to monitor the host even in Reboot status, and when the host doesn't respond to communication, it is fenced again. Eli, please take a look and try to find out why monitoring tries to monitor a host in Reboot status. If there is a reason for it, we will need to execute sleepOnReboot synchronously in StartVdsCommand so that the command does not end before this timeout.
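
One way to picture the condition described above is a guard that refuses to fence a host while it is in Reboot status and the reboot wait has not yet elapsed. The sketch below is purely illustrative: the class, enum, and method names are hypothetical, the 300-second value is taken from the log message in the description, and this is not the code of the merged fix.

```java
import java.time.Duration;
import java.time.Instant;

public class RebootGrace {
    // Hypothetical, reduced set of host states for this illustration only.
    public enum HostStatus { UP, NON_RESPONSIVE, REBOOT }

    // Wait period reported in the log ("Waiting 300 seconds, for server to finish reboot process").
    public static final Duration REBOOT_WAIT = Duration.ofSeconds(300);

    // Returns true only if a non-responding host may be fenced at time 'now'.
    public static boolean mayFence(HostStatus status, Instant rebootStartedAt, Instant now) {
        if (status == HostStatus.REBOOT && rebootStartedAt != null) {
            // Still inside the reboot wait: do not fence again.
            return !now.isBefore(rebootStartedAt.plus(REBOOT_WAIT));
        }
        return status == HostStatus.NON_RESPONSIVE;
    }

    public static void main(String[] args) {
        Instant started = Instant.parse("2017-08-01T10:03:18Z");
        // About 95 seconds into the wait, as in the reported log: fencing is refused.
        System.out.println(mayFence(HostStatus.REBOOT, started, started.plusSeconds(95)));
        // After the full wait, a host that is still unreachable may be fenced.
        System.out.println(mayFence(HostStatus.REBOOT, started, started.plusSeconds(300)));
    }
}
```

With a check of this kind, the second STOP seen in the log would have been skipped, because it fell well inside the 300-second wait.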

Comment 15 Petr Matyáš 2017-08-28 16:45:56 UTC
There is no script, it's just a clean RHEL 7.4 host

Comment 16 Eli Mesika 2017-08-29 06:40:25 UTC
(In reply to Petr Matyáš from comment #15)
> There is no script, it's just a clean RHEL 7.4 host

Please provide a fresh engine log for the scenario. The attached engine log contains the NPE problem that was resolved in the 1st QE round; I would like to get a clean engine log without the NPE in order to investigate further.

Comment 17 Petr Matyáš 2017-08-29 07:42:29 UTC
Since the fix was merged on 9.8.2017 and the retest was executed on 23.8.2017 with 4.1.5-4, which was built on 15.8.2017, I thought the fix was already there. Was it not?

Also, please put the needinfo on me (the QA contact) instead of the reporter.

Comment 18 Martin Perina 2017-09-06 07:42:33 UTC
We are still not able to reproduce the issue, so we are retargeting to 4.1.7 for now. We are continuing the investigation, and we can release the fix as a 4.1.6 async if it is ready soon.

Comment 22 Petr Matyáš 2017-09-22 13:43:09 UTC
Verified on ovirt-engine-4.1.7.1-0.1.el7.noarch

