Bug 1154389 - [scale] VM shutdown causes errors under load: exception in acpiShutdown()
Summary: [scale] VM shutdown causes errors under load: exception in acpiShutdown()
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Linux
low
low
Target Milestone: ovirt-3.6.0-rc
: 3.6.0
Assignee: Francesco Romani
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-10-19 12:17 UTC by Yuri Obshansky
Modified: 2016-04-20 01:29 UTC (History)
12 users (show)

Fixed In Version: 4.17.4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-20 01:29:06 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
vdsm.log (2.28 MB, application/x-gzip)
2014-10-21 12:54 UTC, Yuri Obshansky
no flags Details
engine.log (1.06 MB, application/x-gzip)
2014-10-21 12:54 UTC, Yuri Obshansky
no flags Details
libvirt.log 1 (2.46 MB, application/x-xz)
2014-10-23 08:42 UTC, Yuri Obshansky
no flags Details
libvirt.log 2 (2.46 MB, application/x-xz)
2014-10-23 08:44 UTC, Yuri Obshansky
no flags Details
libvirt.log 3 (161.72 KB, application/x-gzip)
2014-10-23 08:45 UTC, Yuri Obshansky
no flags Details
Bug verification engine.log (1.25 MB, application/x-gzip)
2016-02-04 11:53 UTC, Yuri Obshansky
no flags Details
Bug verification API response time table (955 bytes, text/plain)
2016-02-04 11:54 UTC, Yuri Obshansky
no flags Details
Bug verification API response time graph (295.76 KB, image/png)
2016-02-04 11:54 UTC, Yuri Obshansky
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 34879 0 master MERGED vm: make acpiShutdown handle NotConnectedError Never
oVirt gerrit 45377 0 ovirt-3.6 MERGED vm: Improve error handling when Vm._dom is None Never
oVirt gerrit 45378 0 ovirt-3.6 MERGED virt: introduce virdomain module Never
oVirt gerrit 45379 0 ovirt-3.6 MERGED vm: make acpiShutdown handle NotConnectedError Never

Description Yuri Obshansky 2014-10-19 12:17:08 UTC
Description of problem:
During REST API User Portal performance test of 50 concurrent threads on 50 VMs detected errors >> "VM scale-50 is down with error. Exit message: dictionary changed size during iteration" in spite of that VM shut-downed successfully.
As results performance degradation.
VM shutdowm response time 90% line in RHEV-M 3.4 = 2249 ms compare to RHEV-M 3.5 = 147839 ms.

Engine.log errors:
2014-10-15 11:39:23,050 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-79) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM scale-1 is down with error. Exit message: dictionary changed size during iteration.Vdsm.log error:
Thread-259072::ERROR::2014-10-15 12:06:52,842::utils::1193::utils.Callback::(__call__) acpiCallback failed
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 1191, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/share/vdsm/virt/vmpowerdown.py", line 91, in acpiCallback
    self.vm.acpiShutdown()
  File "/usr/share/vdsm/virt/vm.py", line 4942, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
AttributeError: 'NoneType' object has no attribute 'shutdownFlags'

Version-Release number of selected component (if applicable):
RHEV-M: 3.5.0-0.14.beta
RHEL: 6Server - 6.5.0.1.el6
Kernel Version: 2.6.32 - 431.29.2.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.10
LIBVIRT Version: libvirt-0.10.2-29.el6_5.11
VDSM Version: vdsm-4.16.6-1.el6ev

How reproducible:
Run automatic test which shutdowm/start VMs using user portal REST API with load of 50 concurrent threads.

Steps to Reproduce:
1.
2.
3.

Actual results:
Errors
Performance degradation

Expected results:
No errors
No performance degradation

Additional info:

Comment 1 Omer Frenkel 2014-10-20 10:09:03 UTC
please attach vdsm logs

Comment 2 Yuri Obshansky 2014-10-20 10:11:39 UTC
Sorry, I don't have vdsm.log
Test was performed on 15.10 and vdsm has already deleted all log files created early from 17.10

Comment 3 Yuri Obshansky 2014-10-21 12:54:12 UTC
Created attachment 948946 [details]
vdsm.log

Comment 4 Yuri Obshansky 2014-10-21 12:54:40 UTC
Created attachment 948947 [details]
engine.log

Comment 5 Vitor de Lima 2014-10-21 18:58:36 UTC
Could you send the libvirt logs too, please?

Comment 6 Yuri Obshansky 2014-10-22 07:30:34 UTC
Unfortunately, I don't have libvirt.log file created under /var/log/libvirt
I looked at /etc/libvirt/libvirtd.conf file and found that all log configurations are commented out. Is it OK?
#################################################################
#
# Logging controls
#

# Logging level: 4 errors, 3 warnings, 2 information, 1 debug
# basically 1 will log everything possible
#log_level = 3

# Logging filters:
# A filter allows to select a different logging level for a given category
# of logs
# The format for a filter is one of:
#    x:name
#    x:+name
#      where name is a string which is matched against source file name,
#      e.g., "remote", "qemu", or "util/json", the optional "+" prefix
#      tells libvirt to log stack trace for each message matching name,
#      and x is the minimal level where matching messages should be logged:
#    1: DEBUG
#    2: INFO
#    3: WARNING
#    4: ERROR
#
# Multiple filter can be defined in a single @filters, they just need to be
# separated by spaces.
#
# e.g. to only get warning or errors from the remote layer and only errors
# from the event layer:
#log_filters="3:remote 4:event"

# Logging outputs:
# An output is one of the places to save logging information
# The format for an output can be:
#    x:stderr
#      output goes to stderr
#    x:syslog:name
#      use syslog for the output and use the given name as the ident
#    x:file:file_path
#      output to a file, with the given filepath
# In all case the x prefix is the minimal level, acting as a filter
#    1: DEBUG
#    2: INFO
#    3: WARNING
#    4: ERROR
#
# Multiple output can be defined, they just need to be separated by spaces.
# e.g. to log all warnings and errors to syslog under the libvirtd ident:
#log_outputs="3:syslog:libvirtd"
#

# Log debug buffer size: default 64
# The daemon keeps an internal debug log buffer which will be dumped in case
# of crash or upon receiving a SIGUSR2 signal. This setting allows to override
# the default buffer size in kilobytes.
# If value is 0 or less the debug log buffer is deactivated
#log_buffer_size = 64

I can rerun test again. Please, suggest what log configuration uncomment and what level do you prefer.

Comment 7 Yuri Obshansky 2014-10-23 08:42:50 UTC
Created attachment 949744 [details]
libvirt.log 1

Comment 8 Yuri Obshansky 2014-10-23 08:44:03 UTC
Created attachment 949746 [details]
libvirt.log 2

Comment 9 Yuri Obshansky 2014-10-23 08:45:06 UTC
Created attachment 949747 [details]
libvirt.log 3

Comment 10 Vitor de Lima 2014-10-28 16:03:29 UTC
The logs point out to QEMU dying earlier than expected, but no reason is given for it. Looks like an issue with QEMU rather than with the engine or VDSM.

Comment 11 Omer Frenkel 2014-10-30 15:25:47 UTC
duplicate of bug 1143968 ?

Comment 12 Vitor de Lima 2014-10-30 15:38:47 UTC
Part of this problem (the "dictionary changed size during iteration" error) has to do with the unsafe handling of Vm.conf that is mentioned in that BZ, another part (the "'NoneType' object has no attribute 'shutdownFlags'" error) is related to QEMU dying too quickly after the shutdown command is issued, which is a different problem.

Comment 13 Francesco Romani 2014-10-30 22:43:30 UTC
(In reply to Vitor de Lima from comment #12)
> Part of this problem (the "dictionary changed size during iteration" error)
> has to do with the unsafe handling of Vm.conf that is mentioned in that BZ,
> another part (the "'NoneType' object has no attribute 'shutdownFlags'"
> error) is related to QEMU dying too quickly after the shutdown command is
> issued, which is a different problem.

Correct. Unfortunately lots of flow are potentially affected, so definitely
VDSM logs will help shed some light here, in order to prioritize.

We know QEMU death and Shutdown/destroy are racy and potentially buggy, even though the specific error reported should be little harm.

refactoring of those flows was also planned but slipped through because of other more urgent issues, maybe is time to bump it up.

BZ1143968 is related but is tackling a different subset of the bigger problem, one of which we have evidence and a quite clean reproducer.
I don't think device handling is the root cause of this issue, but to be sure we need the full backtrace.

Comment 14 Francesco Romani 2014-11-06 13:23:34 UTC
patch http://gerrit.ovirt.org/#/c/34879/ should help for this issue.

Comment 15 Michal Skrivanek 2014-11-11 09:05:20 UTC
"dictionary changed size during iteration" tracked in bug 1143968.
Changing this one to the shutdown race in comment #14.
bug is harmless.

Comment 19 Yuri Obshansky 2016-02-04 11:53:12 UTC
The bug was verified on versions:
rhevm-setup-plugin-ovirt-engine-common-3.6.3-0.1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.5-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.3-0.1.el6.noarch
vdsm-python-4.17.19-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.17.19-0.el7ev.noarch
vdsm-jsonrpc-4.17.19-0.el7ev.noarch
vdsm-yajsonrpc-4.17.19-0.el7ev.noarch
vdsm-xmlrpc-4.17.19-0.el7ev.noarch
vdsm-cli-4.17.19-0.el7ev.noarch
vdsm-4.17.19-0.el7ev.noarch
vdsm-infra-4.17.19-0.el7ev.noarch 

There was performed the same load test with 50 concurrent threads.
No exceptions raised during 2 hours of test.
(see attached engine.log file
timeframe: 2016-02-04 11:35:00 - 2016-02-04 13:35:00)
REST API call StopVM response time 90% line - 5385 msec
which is worse than in 3.4 but better than 3.5
(see attached AggregateTable.csv for details
and graph StopVM-response-time.png)

The bug can be closed.

Comment 20 Yuri Obshansky 2016-02-04 11:53:51 UTC
Created attachment 1121083 [details]
Bug verification engine.log

Comment 21 Yuri Obshansky 2016-02-04 11:54:22 UTC
Created attachment 1121084 [details]
Bug verification API response time table

Comment 22 Yuri Obshansky 2016-02-04 11:54:56 UTC
Created attachment 1121085 [details]
Bug verification API response time graph


Note You need to log in before you can comment on or make changes to this bug.