Bug 1154389
| Field | Value |
|---|---|
| Summary | [scale] VM shutdown causes errors under load: exception in acpiShutdown() |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Status | CLOSED CURRENTRELEASE |
| Severity | low |
| Priority | low |
| Version | 3.5.0 |
| Target Milestone | ovirt-3.6.0-rc |
| Target Release | 3.6.0 |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | 4.17.4 |
| Doc Type | Bug Fix |
| Reporter | Yuri Obshansky <yobshans> |
| Assignee | Francesco Romani <fromani> |
| QA Contact | Eldad Marciano <emarcian> |
| CC | bazulay, fromani, gklein, istein, lpeer, lsurette, michal.skrivanek, rbalakri, Rhev-m-bugs, srevivo, ykaul, yobshans |
| Last Closed | 2016-04-20 01:29:06 UTC |
| Type | Bug |
| oVirt Team | Virt |
| Regression | --- |
| Story Points | --- |
Description

Yuri Obshansky, 2014-10-19 12:17:08 UTC
please attach vdsm logs

Sorry, I don't have vdsm.log. The test was performed on 15.10, and as of 17.10 vdsm had already deleted all log files created earlier.

Created attachment 948946 [details]
vdsm.log

Created attachment 948947 [details]
engine.log
Could you send the libvirt logs too, please?

Unfortunately, I don't have a libvirt.log file created under /var/log/libvirt. I looked at the /etc/libvirt/libvirtd.conf file and found that all log configuration is commented out. Is that OK?

```
#################################################################
#
# Logging controls
#
# Logging level: 4 errors, 3 warnings, 2 information, 1 debug
# basically 1 will log everything possible
#log_level = 3

# Logging filters:
# A filter allows to select a different logging level for a given category
# of logs
# The format for a filter is one of:
#    x:name
#    x:+name
# where name is a string which is matched against source file name,
# e.g., "remote", "qemu", or "util/json", the optional "+" prefix
# tells libvirt to log stack trace for each message matching name,
# and x is the minimal level where matching messages should be logged:
#    1: DEBUG
#    2: INFO
#    3: WARNING
#    4: ERROR
#
# Multiple filters can be defined in a single @filters, they just need to be
# separated by spaces.
#
# e.g. to only get warning or errors from the remote layer and only errors
# from the event layer:
#log_filters="3:remote 4:event"

# Logging outputs:
# An output is one of the places to save logging information
# The format for an output can be:
#    x:stderr           # output goes to stderr
#    x:syslog:name      # use syslog for the output and use the given name as the ident
#    x:file:file_path   # output to a file, with the given filepath
# In all cases the x prefix is the minimal level, acting as a filter
#    1: DEBUG
#    2: INFO
#    3: WARNING
#    4: ERROR
#
# Multiple outputs can be defined, they just need to be separated by spaces.
# e.g. to log all warnings and errors to syslog under the libvirtd ident:
#log_outputs="3:syslog:libvirtd"

# Log debug buffer size: default 64
# The daemon keeps an internal debug log buffer which will be dumped in case
# of crash or upon receiving a SIGUSR2 signal. This setting allows to override
# the default buffer size in kilobytes.
# If value is 0 or less the debug log buffer is deactivated
#log_buffer_size = 64
```

I can rerun the test. Please suggest which log settings to uncomment and what level you prefer.

Created attachment 949744 [details]
libvirt.log 1
Created attachment 949746 [details]
libvirt.log 2
Created attachment 949747 [details]
libvirt.log 3
The logs point to QEMU dying earlier than expected, but no reason is given for it. This looks like an issue with QEMU rather than with the engine or VDSM.

Duplicate of bug 1143968?

Part of this problem (the "dictionary changed size during iteration" error) has to do with the unsafe handling of Vm.conf that is mentioned in that BZ; another part (the "'NoneType' object has no attribute 'shutdownFlags'" error) is related to QEMU dying too quickly after the shutdown command is issued, which is a different problem.

(In reply to Vitor de Lima from comment #12)
> Part of this problem (the "dictionary changed size during iteration" error)
> has to do with the unsafe handling of Vm.conf that is mentioned in that BZ,
> another part (the "'NoneType' object has no attribute 'shutdownFlags'"
> error) is related to QEMU dying too quickly after the shutdown command is
> issued, which is a different problem.

Correct. Unfortunately, lots of flows are potentially affected, so VDSM logs will definitely help shed some light here and let us prioritize. We know QEMU death and shutdown/destroy are racy and potentially buggy, even though the specific error reported should do little harm. A refactoring of those flows was also planned but slipped because of other, more urgent issues; maybe it is time to bump it up. BZ1143968 is related but tackles a different subset of the bigger problem, one for which we have evidence and a quite clean reproducer. I don't think device handling is the root cause of this issue, but to be sure we need the full backtrace.

Patch http://gerrit.ovirt.org/#/c/34879/ should help with this issue.

The "dictionary changed size during iteration" error is tracked in bug 1143968. Changing this one to cover the shutdown race described in comment #14. The bug is harmless.
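The two errors discussed above can be illustrated in isolation. The following sketch is not VDSM code; the class and attribute names (other than `shutdownFlags`, which appears in the traceback) are hypothetical, and it only shows the general shape of the two failure modes and the usual defensive patterns against them:

```python
# 1) "dictionary changed size during iteration": mutating a dict while
#    iterating over it raises RuntimeError in Python.
conf = {"devices": 1, "memory": 2}

def unsafe_walk(d):
    for key in d:              # iterating the live dict
        if key == "devices":
            d["balloon"] = 3   # mutation during iteration -> RuntimeError

try:
    unsafe_walk(dict(conf))
except RuntimeError as e:
    print("caught:", e)        # dictionary changed size during iteration

# Safe pattern: snapshot the keys before iterating.
for key in list(conf):
    if key == "devices":
        conf["balloon"] = 3    # fine: we iterate over a copy of the keys

# 2) "'NoneType' object has no attribute 'shutdownFlags'": if QEMU dies
#    right after shutdown is requested, the object tracking the in-flight
#    shutdown may already have been cleared by another thread. Guarding
#    before dereferencing avoids the AttributeError.
class Vm:
    def __init__(self):
        self._shutdown_task = None   # hypothetical attribute name

    def on_qemu_death(self):
        task = self._shutdown_task
        if task is None:             # QEMU died with no shutdown in flight
            return "no pending shutdown"
        return task.shutdownFlags
```

The snapshot-the-keys pattern is the standard fix when a dict can be mutated concurrently or by the loop body itself; the None guard is the standard fix for the race between QEMU death notification and shutdown bookkeeping.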
The bug was verified on the following versions:

- rhevm-setup-plugin-ovirt-engine-common-3.6.3-0.1.el6.noarch
- ovirt-engine-extension-aaa-jdbc-1.0.5-1.el6ev.noarch
- rhevm-setup-plugin-ovirt-engine-3.6.3-0.1.el6.noarch
- vdsm-python-4.17.19-0.el7ev.noarch
- vdsm-hook-vmfex-dev-4.17.19-0.el7ev.noarch
- vdsm-jsonrpc-4.17.19-0.el7ev.noarch
- vdsm-yajsonrpc-4.17.19-0.el7ev.noarch
- vdsm-xmlrpc-4.17.19-0.el7ev.noarch
- vdsm-cli-4.17.19-0.el7ev.noarch
- vdsm-4.17.19-0.el7ev.noarch
- vdsm-infra-4.17.19-0.el7ev.noarch

The same load test was performed with 50 concurrent threads. No exceptions were raised during the 2 hours of testing (see the attached engine.log, timeframe 2016-02-04 11:35:00 - 2016-02-04 13:35:00). The 90% line of the REST API StopVM call response time was 5385 msec, which is worse than in 3.4 but better than in 3.5 (see the attached AggregateTable.csv for details and the StopVM-response-time.png graph). The bug can be closed.

Created attachment 1121083 [details]
Bug verification engine.log
Created attachment 1121084 [details]
Bug verification API response time table
Created attachment 1121085 [details]
Bug verification API response time graph