Summary: refreshVdsRunTimeInfo not being called against the SPM results in no change in state after a power outage.

Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.2.0
Target Milestone: ---
Target Release: 3.3.3
Hardware: x86_64
OS: Linux
Whiteboard: infra
oVirt Team: Infra
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Keywords: Reopened
Type: Bug
Doc Type: Bug Fix
Reporter: Lee Yarwood <lyarwood>
Assignee: Martin Perina <mperina>
QA Contact: Jiri Belka <jbelka>
CC: abisogia, acathrow, agkesos, ahoness, bazulay, emesika, flo_bugzilla, iheim, lpeer, lyarwood, mperina, pstehlik, pzhukov, Rhev-m-bugs, rhodain, sputhenp, tdosek, vkaigoro, yeylon
Bug Blocks: 1044088
Attachments: engine logs (attachment 858667)
Last Closed: 2014-04-27 11:48:29 UTC
Description (Lee Yarwood, 2013-12-17 19:30:47 UTC)

Comment #3 (Martin Perina)

Hi Lee,

I was unable to reproduce the problem using your steps on 3.2.3 or 3.2.5. I tested these scenarios:

I.
1. Powered off the SPM host1 via IPMI and at the same time blocked access to host1's IPMI from the other hosts in the cluster.
2. host1's status changed to Connecting and then to Non Responsive, but host1 stayed SPM.
3. After I powered host1 up and unblocked its IPMI, it changed status to Up and the DC became Up again (host1 stayed SPM).

II.
1. Powered off the SPM host1 via IPMI and at the same time blocked access to host1's IPMI from the other hosts in the cluster.
2. host1's status changed to Connecting and then to Non Responsive, but host1 stayed SPM.
3. I executed 'Confirm host has been rebooted' on host1 and, after a new election, another host from the cluster became SPM.
4. After I powered host1 up it became Up.

In any case, refreshVdsRunTimeInfo started to execute on host1 once it became accessible over the network.

Comment #4 (Lee Yarwood)

(In reply to Martin Perina from comment #3)
> I was unable to reproduce the problem using your steps on 3.2.3 or 3.2.5.
> [..]
> In any case refreshVdsRunTimeInfo started to execute on host1 once it
> became accessible by network.

Hey Martin,

As covered in step 1 of my suggested reproducer (I should add that I've not reproduced this yet) and also in private comment #2, AFAICT refreshVdsRunTimeInfo was not being called against the SPM prior to the outage.

If you agree that this was the case and want to manually reproduce this, we could use something like byteman to inject code into the VdsManager class that stops refreshVdsRunTimeInfo from being called against the current SPM.

Lee

Comment #5 (Barak)

(In reply to Lee Yarwood from comment #4)
> [..]
> AFAICT refreshVdsRunTimeInfo was not being called against the SPM prior
> to the outage.

So this is the problem and not the $subject. Can we have a reproducer for that?

Comment #6 (Lee Yarwood)

(In reply to Barak from comment #5)
> So this is the problem and not the $subject.
> Can we have a reproducer for that?

Sorry Barak, but $subject does state that, and again, no, we don't have a complete reproducer as yet.

IMHO there are two issues in this bug:

1. The scheduled refreshVdsRunTimeInfo job not being called against the host. I couldn't find any pointers as to why this happened in engine.log and would be grateful for a review from devel.

2. engine not monitoring or correcting the fact that this job wasn't being called.
We can artificially reproduce this issue with byteman by killing the scheduled job:

- Ensure the JDK is installed and used by engine:

    root@rhevm # yum install java-1.7.0-openjdk-devel -y
    root@rhevm # service ovirt-engine restart

- Change to the ovirt user and configure byteman:

    root@rhevm # su ovirt -s /bin/sh
    ovirt@rhevm # wget http://downloads.jboss.org/byteman/2.1.2/byteman-download-2.1.2-bin.zip
    ovirt@rhevm # unzip byteman-download-2.1.2-bin.zip
    ovirt@rhevm # export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/
    ovirt@rhevm # export BYTEMAN_HOME=/var/lib/ovirt-engine/byteman-download-2.1.2
    ovirt@rhevm # export PATH=$JAVA_HOME/bin:$BYTEMAN_HOME/bin:$PATH

- Attach byteman to engine:

    ovirt@rhevm # jps
    16962 Jps
    12936 jboss-modules.jar <-- this is the JBoss process to use.
    20826 HistoryETL
    ovirt@rhevm # bminstall.sh -b -Dorg.jboss.byteman.transform.all 12936

- Define the VDS UUID:

    ovirt@rhevm # VDS_UUID="ea50ab33-361c-40d0-85e0-0d0773b96644"

- Create the killjob script:

    ovirt@rhevm # cat > killjob.btm << EOF
    RULE VdsManager.OnTimer entry
    CLASS org.ovirt.engine.core.vdsbroker.VdsManager
    METHOD OnTimer
    AT ENTRY
    BIND vdsManager:org.ovirt.engine.core.vdsbroker.VdsManager = \$this
    IF vdsManager.getVdsId().toString() == "${VDS_UUID}"
    DO traceln("job " + vdsManager.onTimerJobId + " killed for VDS " + vdsManager.getVdsId().toString());
       org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl.getInstance().deleteJob(vdsManager.onTimerJobId)
    ENDRULE
    EOF

- Submit the rule and tail the server log to see the job being killed:

    ovirt@rhevm # bmsubmit.sh killjob.btm ; tailf /var/log/ovirt-engine/server.log
    2014-01-28 16:28:46,984 INFO [stdout] (Thread-81) CFG.carryForward: unexpected! invalid open enters count for block L1232225467 in method org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer()V
    2014-01-28 16:28:48,755 INFO [stdout] (DefaultQuartzScheduler_Worker-95) job org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer#-9223372036854775783 killed for VDS ea50ab33-361c-40d0-85e0-0d0773b96644

- Shut down the host (or otherwise make it unreachable) and its state will not change from Up.

Comment #7 (Barak)

(In reply to Lee Yarwood from comment #6)
> We can artificially reproduce this issue with byteman by killing the
> scheduled job:
> [..]

The above is not a valid reproducer; we need a real one. Killing the quartz job is simply creating the problem described in the $subject of the bug, since the host status change is derived from this quartz job.

So unless we have a clear reproducer we can not proceed here.

Created attachment 858667 [details]
engine logs

I tried to reproduce based on comment #3.

I.
- IPMI power off host1; connection to IPMI from host2 to host1 was not working
- host1 states: Connecting -> Reboot
- IPMI power on host1; connection to IPMI from host2 to host1 was allowed
- host1 states: Reboot, Non Responsive, Active

II.
- IPMI power off host1; connection to IPMI from host2 to host1 was not working
- host1 states: Connecting -> Reboot
- Executed 'Confirm the host has been rebooted' on host1 in the Admin Portal
- host1 states: Reboot (still)
- IPMI power on host1; connection to IPMI from host2 to host1 was allowed
- host1 states: Reboot, Non Responsive, Active

(In reply to Barak from comment #7)
> The above is not a valid reproducer; we need a real one.
> [..]
> So unless we have a clear reproducer we can not proceed here.

I don't have a non-artificial reproducer that results in $subject. Again, can devel at least review the attached log for anything that I might've missed in this regard?

IMHO the artificial reproducer does highlight an issue with the engine's own monitoring (or lack thereof) of the quartz jobs. Shouldn't we at least ensure that this is fixed?
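For reference, the killjob.btm generation step from the artificial reproducer above can be wrapped in a small script, so the rule can be templated from the host UUID and reviewed before `bmsubmit.sh` is run. This is only a sketch of the step already described in the reproducer; the UUID is the example value from this report and must be replaced with the id of the SPM host in the target environment:

```shell
#!/bin/sh
# Sketch: generate the byteman rule from the reproducer above for a given
# VDS UUID, so it can be inspected before submission with bmsubmit.sh.
# The UUID below is the example value quoted in this bug; replace it.
VDS_UUID="ea50ab33-361c-40d0-85e0-0d0773b96644"

# Unquoted heredoc delimiter: ${VDS_UUID} expands now, while \$this is
# escaped so the literal $this reaches the byteman rule file.
cat > killjob.btm << EOF
RULE VdsManager.OnTimer entry
CLASS org.ovirt.engine.core.vdsbroker.VdsManager
METHOD OnTimer
AT ENTRY
BIND vdsManager:org.ovirt.engine.core.vdsbroker.VdsManager = \$this
IF vdsManager.getVdsId().toString() == "${VDS_UUID}"
DO traceln("job " + vdsManager.onTimerJobId + " killed for VDS " + vdsManager.getVdsId().toString());
   org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl.getInstance().deleteJob(vdsManager.onTimerJobId)
ENDRULE
EOF

# Show the generated rule; submit it with: bmsubmit.sh killjob.btm
cat killjob.btm
```

The only difference from the manual steps is that the UUID is substituted once at the top, which avoids editing the rule body by hand when testing against a different host.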
> I don't have a non-artificial reproducer that results in $subject. Again,
> can devel at least review the attached log for anything that I might've
> missed in this regard?

Lee, I investigated the log several times, but I didn't find anything that led to the cause of the issue. I did several tests with a two-host cluster and tried to shut down the SPM host (details in comment 3).

Last week I did more tests with a two-host cluster: I tried to shut down both hosts at once (or with a minimal interval). Every time all hosts got to Non Responsive state (it takes a few minutes, and in the meantime the state can also be Up before changing to Connecting, or Reboot before changing to Non Responsive). When I powered the hosts up again, every time all hosts and the DC got back to Up state without any manual intervention.

Clearing needinfo.
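A rough way to spot the second issue raised in comment #6 (the per-host refresh job silently disappearing, with nothing in the engine watching for that) would be to count recent refreshVdsRunTimeInfo activity per host in engine.log. The snippet below is only a sketch: the log lines it checks are fabricated samples, and the real engine.log format differs between versions, so the grep pattern would need adjusting before use against a live setup:

```shell
#!/bin/sh
# Sketch: detect a host whose refresh job has gone quiet by counting
# recent refreshVdsRunTimeInfo entries per host. The sample log below is
# fabricated for illustration; real engine.log lines will look different.
cat > sample-engine.log << 'EOF'
2013-12-17 10:00:01 DEBUG refreshVdsRunTimeInfo vds=host1
2013-12-17 10:00:01 DEBUG refreshVdsRunTimeInfo vds=host2
2013-12-17 10:00:04 DEBUG refreshVdsRunTimeInfo vds=host2
EOF

# Flag any host with fewer than 2 refresh entries in the sampled window.
for host in host1 host2; do
    count=$(grep -c "refreshVdsRunTimeInfo vds=${host}" sample-engine.log)
    if [ "$count" -lt 2 ]; then
        echo "WARNING: only ${count} recent refresh entries for ${host}"
    else
        echo "OK: ${count} recent refresh entries for ${host}"
    fi
done
```

Against a real deployment one would tail the live engine.log instead of a sample file, but the idea is the same: a host that stops appearing in the refresh log while still marked Up is exactly the symptom the artificial byteman reproducer creates.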