Bug 1382583
| Summary: | Periodic functions/monitor start before VM is run. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | vdsm | Assignee: | Francesco Romani <fromani> |
| Status: | CLOSED ERRATA | QA Contact: | Nisim Simsolo <nsimsolo> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.6.8 | CC: | bazulay, eedri, fromani, gklein, gveitmic, lsurette, mgoldboi, michal.skrivanek, mlehrer, nsimsolo, srevivo, trichard, ycui, ykaul |
| Target Milestone: | ovirt-4.1.0-alpha | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, if a VM shutdown was too slow, the state of the VM could be misreported as unresponsive, even though the VM was operating correctly, albeit too slowly. This was caused by a too-aggressive check on startup and shutdown. This patch takes into account slowdowns in startup and shutdown, avoiding false positive reports. | Story Points: | --- |
| Clone Of: | | | |
| | 1398415 (view as bug list) | Environment: | |
| Last Closed: | 2017-04-25 00:55:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1398415 | | |
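The Doc Text above describes the fix only in general terms: the responsiveness check was too aggressive while a VM was starting up or shutting down. The following is a minimal illustrative sketch, not VDSM code, with assumed state names and thresholds, of a check that grants extra allowance in those lifecycle phases so that a slow host does not produce a false "not responding" report.

```python
# Minimal illustration (assumed names and thresholds, not actual VDSM code):
# an unresponsiveness check whose threshold depends on the VM lifecycle state.
import time

RESPONSE_TIMEOUT = 60    # seconds of monitor silence tolerated while the VM is up (assumed)
STARTUP_GRACE = 300      # extra seconds allowed while powering up (assumed)
SHUTDOWN_GRACE = 300     # extra seconds allowed while powering down (assumed)


def is_unresponsive(last_response_ts, lifecycle_state, now=None):
    """Return True only if the monitor has been silent longer than the
    timeout plus any grace applying to the current lifecycle state."""
    now = time.monotonic() if now is None else now
    age = now - last_response_ts

    allowance = RESPONSE_TIMEOUT
    if lifecycle_state == "powering_up":
        allowance += STARTUP_GRACE
    elif lifecycle_state == "powering_down":
        allowance += SHUTDOWN_GRACE

    return age > allowance
```

The point is only the shape of the check: the threshold is no longer a single constant but depends on whether the VM is starting, stopping, or running.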
Description (Germano Veit Michel, 2016-10-07 05:41:41 UTC)
Comment (Francesco Romani)

(In reply to Germano Veit Michel from comment #0)
> Description of problem:
>
> This is the opposite of BZ 1382578.
>
> Apparently when a VM is starting periodic tasks/monitor may run before the
> qemu process is actually up (or libvirt able to respond). VM might be seen
> by the engine as Not responding during powering up.

The root cause is the same as in BZ 1382578: slow operation on the host (too slow with respect to the expectations) combined with a small change of behaviour in Vdsm. This changes little from the user perspective, but it is important for understanding the cause of the issue. Both the issue and the fix are on the Vdsm side; Engine relies on this information.

> Steps to Reproduce:
> 1. Start VM from Administration Portal
>
> Actual results:
> 'Powering Up' -> 'Not responding' -> 'Up'
>
> Expected results:
> 'Powering Up' -> 'Up'

Again, for the sake of clarity: the problem is that this report is a false positive, not that Vdsm reports the VM as "not responding" at all: it is entirely possible for a VM to go rogue already in the "powering up" phase. Fixing this issue isn't hard; fixing it in a way that doesn't hide that scenario has some challenges.

> jsonrpc.Executor/5::WARNING::2016-10-07
> 03:41:43,773::vm::5177::virt.vm::(_setUnresponsiveIfTimeout)
> vmId=`4759032f-680f-4a5f-9f26-604a8fac2808`::monitor become unresponsive
> (command timeout, age=4296317.33)

This strangely high value is caused by the default value of the timestamp (zero), but the real issue here is that the VM was last seen too far in the past.

> So is the periodic function watermark:
>
> periodic/6::WARNING::2016-10-07
> 03:41:44,317::periodic::268::virt.periodic.VmDispatcher::(__call__) could
> not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> [u'4759032f-680f-4a5f-9f26-604a8fac2808']

This is related but not part of this issue; if you want this behaviour to change, it should be tracked and discussed separately.
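The remark about the default timestamp can be made concrete with a short standalone sketch (made-up names, not the actual VDSM internals). With a default "last response" value of zero measured against a monotonic clock, the computed age equals the current clock reading, which on a long-running host is in the millions of seconds; that is the kind of number that shows up as age=4296317.33 in the log quoted above.

```python
# Standalone sketch (made-up names): why a zero default timestamp yields a
# huge "age" until the monitor has answered at least once.
import time


class MonitorState:
    def __init__(self):
        self.last_response = 0.0          # default: "never seen yet"

    def record_response(self):
        self.last_response = time.monotonic()

    def age(self):
        return time.monotonic() - self.last_response


state = MonitorState()
# Before the first response, age() is roughly the monotonic clock's value
# (on Linux, time since boot), e.g. ~4.3 million seconds on a ~50-day uptime.
print("age before first response: %.2f s" % state.age())
state.record_response()
print("age after first response:  %.2f s" % state.age())
```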
Comment (Germano Veit Michel)

(In reply to Francesco Romani from comment #2)
> > periodic/6::WARNING::2016-10-07
> > 03:41:44,317::periodic::268::virt.periodic.VmDispatcher::(__call__) could
> > not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> > [u'4759032f-680f-4a5f-9f26-604a8fac2808']
>
> This is related but not part of this issue; if you want this behaviour to
> change, it should be tracked and discussed separately.

Hi Francesco. Thank you for owning this BZ as well. AFAIK, the watermark monitor running before the VM is harmless, so I am not sure it's worth spending resources on fixing/hiding it. If you think it can cause a problem I'll gladly open a new BZ for it, but for now I just see it as a harmless warning. Thanks!

Comment

This is not yet POST. Patches have been posted against the master branch only. They are backport-friendly, but still not there.

Comment

No POST yet - still ASSIGNED.

Comment (Francesco Romani)

The patches against the master branch are in the final stages of verification and should be merged soon. The fixes are simple and backportable, and for master they look good. I'd like to do additional testing for the stable branches. I've verified locally the sequence of events that triggers the bug, and I'm confident the patches should help. Any chance of testing them, perhaps in a bigger QE environment? I can provide RPMs/scratch builds for 4.0.z or even 3.6.z.

Comment (Germano Veit Michel)

Hi Francesco,

The logs I used to open both BZs are from my own env, which is really small (and slow). It reproduces this VERY easily. It's currently on 4.0.4 + RHEL 7.2. If you provide me the patches or the RPMs/scratch builds, I can run some tests for you.

Cheers!

Comment

Will be fixed in 4.1.0 and 4.0.6. The backport to 3.6.10 is still pending (patches 66011, 66012, 66013).

Comment

Note: this bug is missing acks for a proper downstream clone.

Comment

*** Bug 1395916 has been marked as a duplicate of this bug. ***

Comment

Verification builds:
ovirt-engine-4.1.0.3-0.1.el7
libvirt-client-2.0.0-10.el7_3.4.x86_64
vdsm-4.19.4-1.el7ev.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
sanlock-3.4.0-1.el7.x86_64

Verification scenario (a helper sketch follows this list):
1. Overload the host CPU.
2. On the overloaded host, create a VM pool of 10 VMs with 10 prestarted VMs.
3. Verify the VMs do not transition to the "not responding" state before going Up.
4. Power off all VMs at the same time.
5. Verify the VMs do not transition to the "not responding" state before going Down.
6. Repeat steps 2-5 a few times.
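For the verification scenario above, a small watcher can help with steps 3 and 5. This is only a sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, polling interval, and pool-name search pattern are placeholders to adapt to the actual environment.

```python
# Sketch of a status watcher for the verification runs: poll the pool VMs via
# the oVirt Python SDK and report any VM seen in the NOT_RESPONDING state.
# URL, credentials and the search pattern below are placeholders.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,                        # lab use only; prefer ca_file in real setups
)
vms_service = connection.system_service().vms_service()

try:
    for _ in range(120):                  # ~10 minutes at 5-second intervals
        for vm in vms_service.list(search='name=pool-vm-*'):
            if vm.status == types.VmStatus.NOT_RESPONDING:
                print('possible false positive: %s reported NOT_RESPONDING' % vm.name)
        time.sleep(5)
finally:
    connection.close()
```

Any VM flagged here while the pool is merely starting or powering off on the overloaded host would indicate the bug is still reproducible.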