Description of problem:

This is the opposite of BZ 1382578.

Apparently, when a VM is starting, periodic tasks/monitoring may run before the qemu process is actually up (or before libvirt is able to respond). The VM may then be seen by the Engine as "Not responding" while powering up.

In the Engine, instead of 'Powering Up' -> 'Up', the sequence is seen as:
'Powering Up' -> 'Not responding' -> 'Up'

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160920.1.el7ev)
rhevm-3.6.8.1-0.1.el6.noarch

How reproducible:
Sometimes.

Steps to Reproduce:
1. Start VM from Administration Portal

Actual results:
'Powering Up' -> 'Not responding' -> 'Up'

Expected results:
'Powering Up' -> 'Up'

Additional info:

2016-10-07 13:41:45.74+10 | VM ntptest was started by admin@internal (Host: Germano-Test).
2016-10-07 13:41:46.439+10 | VM ntptest is not responding.
2016-10-07 13:42:26.36+10 | VM ntptest started on Host Germano-Test

jsonrpc.Executor/3::DEBUG::2016-10-07 03:41:42,264::__init__::557::jsonrpc.JsonRpcServer::(_handle_request) Return 'VM.create' .....

Thread-89::DEBUG::2016-10-07 03:41:42,266::vm::697::virt.vm::(_startUnderlyingVm) vmId=`4759032f-680f-4a5f-9f26-604a8fac2808`::Start
Thread-89::INFO::2016-10-07 03:41:42,266::vm::1850::virt.vm::(_run) vmId=`4759032f-680f-4a5f-9f26-604a8fac2808`::VM wrapper has started

Monitor already running...

jsonrpc.Executor/5::WARNING::2016-10-07 03:41:43,773::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`4759032f-680f-4a5f-9f26-604a8fac2808`::monitor become unresponsive (command timeout, age=4296317.33)

So is the periodic watermark function:

periodic/6::WARNING::2016-10-07 03:41:44,317::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'4759032f-680f-4a5f-9f26-604a8fac2808']

Thread-89::INFO::2016-10-07 03:41:45,722::vm::1944::virt.vm::(_run) vmId=`4759032f-680f-4a5f-9f26-604a8fac2808`::<?xml version="1.0" encoding="utf-8"?>
<domain type="kvm" xmlns:ovirt="http://ovirt.org/vm/tune/1.0">
  <name>ntptest</name>
  <uuid>4759032f-680f-4a5f-9f26-604a8fac2808</uuid>
  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <maxMemory slots="16">4294967296</maxMemory>

periodic/4::WARNING::2016-10-07 03:41:46,318::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'4759032f-680f-4a5f-9f26-604a8fac2808']

libvirtEventLoop::DEBUG::2016-10-07 03:41:46,352::vm::4648::virt.vm::(onLibvirtLifecycleEvent) vmId=`4759032f-680f-4a5f-9f26-604a8fac2808`::event Resumed detail 0 opaque None
libvirtEventLoop::INFO::2016-10-07 03:41:46,353::vm::5169::virt.vm::(_logGuestCpuStatus) vmId=`4759032f-680f-4a5f-9f26-604a8fac2808`::CPU running: onResume

VM Channels Listener::DEBUG::2016-10-07 03:42:23,017::__init__::207::jsonrpc.Notification::(emit) Sending event {"params": {"notify_time": 4296356580, "4759032f-680f-4a5f-9f26-604a8fac2808": {"status": "Up" .....

This comment was originally posted by gveitmic
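For illustration only (this is not Vdsm's actual code): a minimal sketch of how an unresponsive-timeout check can produce the false positive above, assuming the "last monitor response" timestamp defaults to zero until libvirt answers the first stats call. All names and the threshold value are hypothetical.

import time

MONITOR_TIMEOUT = 60.0  # seconds; hypothetical threshold


class VmMonitor(object):
    def __init__(self):
        # Until libvirt answers the first stats call there is no real
        # "last seen" time; with a default of 0 the age computed below is
        # the whole monotonic clock value (compare age=4296317.33 above).
        self.last_response = 0.0
        self.responsive = True

    def record_response(self):
        # Called whenever a libvirt stats call succeeds.
        self.last_response = time.monotonic()

    def set_unresponsive_if_timeout(self):
        age = time.monotonic() - self.last_response
        if age > MONITOR_TIMEOUT:
            # Trips immediately for a VM that is still powering up.
            self.responsive = False
        return self.responsive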
(In reply to Germano Veit Michel from comment #0)
> Description of problem:
>
> This is the opposite of BZ 1382578.
>
> Apparently, when a VM is starting, periodic tasks/monitoring may run before
> the qemu process is actually up (or before libvirt is able to respond). The
> VM may then be seen by the Engine as "Not responding" while powering up.

The root cause is the same as in BZ 1382578: slow operation on the host (slower than Vdsm expects) combined with a small change of behaviour in Vdsm. This changes little from the user perspective, but it is important for understanding the cause of the issue. Both the issue and the fix are on the Vdsm side; the Engine merely relies on the information Vdsm reports.

> Steps to Reproduce:
> 1. Start VM from Administration Portal
>
> Actual results:
> 'Powering Up' -> 'Not responding' -> 'Up'
>
> Expected results:
> 'Powering Up' -> 'Up'

Again, for the sake of clarity: the problem is that this report is a false positive, not that Vdsm reports the VM as "not responding": it is entirely possible for a VM to go rogue already in the "powering up" phase. Fixing this issue isn't hard. Fixing it in a way that doesn't hide that scenario has some challenges.

> jsonrpc.Executor/5::WARNING::2016-10-07
> 03:41:43,773::vm::5177::virt.vm::(_setUnresponsiveIfTimeout)
> vmId=`4759032f-680f-4a5f-9f26-604a8fac2808`::monitor become unresponsive
> (command timeout, age=4296317.33)

This strangely high value is caused by the default value of the timestamp (zero), but the real issue here is that the VM was last seen too far in the past.

> So is the periodic watermark function:
>
> periodic/6::WARNING::2016-10-07
> 03:41:44,317::periodic::268::virt.periodic.VmDispatcher::(__call__) could
> not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> [u'4759032f-680f-4a5f-9f26-604a8fac2808']

This is related but not part of this issue; if you want this behaviour to change, it should be tracked and discussed separately.

This comment was originally posted by fromani
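A sketch of one possible fix direction along the lines described above (not necessarily what the merged patches do): stamp the "last seen" time when Vdsm starts handling the VM, so the age measures real silence for this VM rather than time since host boot, while a genuinely stuck VM is still reported once the timeout truly elapses. Names are hypothetical and follow the sketch in comment #0.

import time


class VmMonitor(object):
    def __init__(self):
        # Instead of 0.0, stamp "now" when the VM wrapper starts: the age
        # then measures how long libvirt has really stayed silent for this
        # VM, not the time since host boot.  A VM that is genuinely stuck
        # while powering up still gets flagged once the timeout elapses.
        self.last_response = time.monotonic()
        self.responsive = True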
(In reply to Francesco Romani from comment #2)
> (In reply to Germano Veit Michel from comment #0)
> > periodic/6::WARNING::2016-10-07
> > 03:41:44,317::periodic::268::virt.periodic.VmDispatcher::(__call__) could
> > not run <class 'virt.periodic.DriveWatermarkMonitor'> on
> > [u'4759032f-680f-4a5f-9f26-604a8fac2808']
>
> This is related but not part of this issue; if you want this behaviour to
> change, it should be tracked and discussed separately.

Hi Francesco. Thank you for owning this BZ as well.

AFAIK, the watermark monitor running before the VM is up is harmless, so I am not sure it's worth spending resources on fixing/hiding it. If you think it can cause a problem I'll gladly open a new BZ for it, but for now I just see it as a harmless warning.

Thanks!

This comment was originally posted by gveitmic
This is not yet POST. Patches have been posted against the master branch only. They are backport-friendly, but not there yet. This comment was originally posted by fromani
No POST yet - still ASSIGNED. This comment was originally posted by fromani
The patches against the master branch are in the final stages of verification and should be merged soon. The fixes are simple and backportable, and for master they look good. I'd like to do additional testing for the stable branches. I've verified locally the sequence of events that triggers the bug, and I'm confident the patches should help. Any chance of testing them in a bigger QE environment? I can provide RPMs/scratch builds for 4.0.z or even 3.6.z. This comment was originally posted by fromani
Please check https://bugzilla.redhat.com/show_bug.cgi?id=1382583#c6 This comment was originally posted by fromani
Hi Francesco, the logs I used to open both BZs are from my own environment, which is really small (and slow). It reproduces this VERY easily. It's currently on 4.0.4 + RHEL 7.2. If you provide me the patches or the RPMs/scratch builds I can run some tests for you. Cheers! This comment was originally posted by gveitmic
Will be fixed in 4.1.0 and 4.0.6. Backport to 3.6.10 is still pending (patches 66011, 66012, 66013). This comment was originally posted by fromani
Note this bug is missing acks for a proper downstream clone. This comment was originally posted by michal.skrivanek
qa_ack please @eedri, I believe ON_QA without acks is more appropriate if it is already included in a build (it is). I understand it won't help with errata, but still... This comment was originally posted by michal.skrivanek
qa_ack please
Patches merged on ovirt-3.6 -> MODIFIED
Verification build:
rhevm-3.6.10.1-0.1.el6
vdsm-4.17.37-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.2.x86_64
qemu-kvm-rhev-2.6.0-27.el7.x86_64
sanlock-3.4.0-1.el7.x86_64

Verification scenario:
1. Overload host CPU.
2. On the overloaded host, create a VM pool of 10 VMs with 10 prestarted VMs.
3. Verify the VMs do not transition to the "not responding" state before going to the Up state.
4. Power off all VMs at the same time.
5. Verify the VMs do not transition to the "not responding" state before going to the Down state.
6. Repeat steps 2-5 a few times.
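A trivial helper of the kind one might use during verification (not part of the product), scanning an exported engine event log for unwanted "not responding" events; the event wording is taken from comment #0, while the file path and format are assumptions:

import re
import sys

# Matches engine events such as:
#   2016-10-07 13:41:46.439+10 | VM ntptest is not responding.
NOT_RESPONDING = re.compile(r"VM\s+(\S+)\s+is not responding")


def find_false_positives(path):
    hits = []
    with open(path) as events:
        for line in events:
            match = NOT_RESPONDING.search(line)
            if match:
                hits.append((match.group(1), line.strip()))
    return hits


if __name__ == "__main__":
    for vm, line in find_false_positives(sys.argv[1]):
        print("unexpected 'not responding' for %s: %s" % (vm, line))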
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0109.html