Description of problem:
During a REST API User Portal performance test with 50 concurrent threads on 50 VMs, errors were detected: "VM scale-50 is down with error. Exit message: dictionary changed size during iteration", even though the VM actually shut down successfully. As a result, performance degraded: the VM shutdown response time 90% line is 2249 ms in RHEV-M 3.4 compared to 147839 ms in RHEV-M 3.5.

Engine.log errors:
2014-10-15 11:39:23,050 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-79) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM scale-1 is down with error. Exit message: dictionary changed size during iteration.

Vdsm.log error:
Thread-259072::ERROR::2014-10-15 12:06:52,842::utils::1193::utils.Callback::(__call__) acpiCallback failed
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 1191, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/share/vdsm/virt/vmpowerdown.py", line 91, in acpiCallback
    self.vm.acpiShutdown()
  File "/usr/share/vdsm/virt/vm.py", line 4942, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
AttributeError: 'NoneType' object has no attribute 'shutdownFlags'

Version-Release number of selected component (if applicable):
RHEV-M: 3.5.0-0.14.beta
RHEL: 6Server - 6.5.0.1.el6
Kernel Version: 2.6.32 - 431.29.2.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.10
LIBVIRT Version: libvirt-0.10.2-29.el6_5.11
VDSM Version: vdsm-4.16.6-1.el6ev

How reproducible:
Run an automatic test which shuts down/starts VMs via the User Portal REST API with a load of 50 concurrent threads.

Steps to Reproduce:
1.
2.
3.

Actual results:
Errors
Performance degradation

Expected results:
No errors
No performance degradation

Additional info:
Please attach vdsm logs.
Sorry, I don't have vdsm.log. The test was performed on 15.10, and vdsm has already deleted all log files created earlier than 17.10.
Created attachment 948946 [details] vdsm.log
Created attachment 948947 [details] engine.log
Could you send the libvirt logs too, please?
Unfortunately, I don't have a libvirt.log file under /var/log/libvirt. I looked at /etc/libvirt/libvirtd.conf and found that all logging settings are commented out. Is that OK?

#################################################################
#
# Logging controls
#

# Logging level: 4 errors, 3 warnings, 2 information, 1 debug
# basically 1 will log everything possible
#log_level = 3

# Logging filters:
# A filter allows to select a different logging level for a given category
# of logs
# The format for a filter is one of:
#    x:name
#    x:+name
# where name is a string which is matched against source file name,
# e.g., "remote", "qemu", or "util/json", the optional "+" prefix
# tells libvirt to log stack trace for each message matching name,
# and x is the minimal level where matching messages should be logged:
#    1: DEBUG
#    2: INFO
#    3: WARNING
#    4: ERROR
#
# Multiple filter can be defined in a single @filters, they just need to be
# separated by spaces.
#
# e.g. to only get warning or errors from the remote layer and only errors
# from the event layer:
#log_filters="3:remote 4:event"

# Logging outputs:
# An output is one of the places to save logging information
# The format for an output can be:
#    x:stderr
#      output goes to stderr
#    x:syslog:name
#      use syslog for the output and use the given name as the ident
#    x:file:file_path
#      output to a file, with the given filepath
# In all case the x prefix is the minimal level, acting as a filter
#    1: DEBUG
#    2: INFO
#    3: WARNING
#    4: ERROR
#
# Multiple output can be defined, they just need to be separated by spaces.
# e.g. to log all warnings and errors to syslog under the libvirtd ident:
#log_outputs="3:syslog:libvirtd"
#
# Log debug buffer size: default 64
# The daemon keeps an internal debug log buffer which will be dumped in case
# of crash or upon receiving a SIGUSR2 signal. This setting allows to override
# the default buffer size in kilobytes.
# If value is 0 or less the debug log buffer is deactivated
#log_buffer_size = 64

I can rerun the test. Please suggest which log settings to uncomment and which logging level you prefer.
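For example, based on the comments above I could enable debug logging to a file with something like the two lines below (level 1 and the output path are my guess; please confirm the level and output you want):

log_level = 1
log_outputs="1:file:/var/log/libvirt/libvirtd.log"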
Created attachment 949744 [details] libvirt.log 1
Created attachment 949746 [details] libvirt.log 2
Created attachment 949747 [details] libvirt.log 3
The logs point to QEMU dying earlier than expected, but no reason is given for it. This looks like an issue with QEMU rather than with the engine or VDSM.
Duplicate of bug 1143968?
Part of this problem (the "dictionary changed size during iteration" error) has to do with the unsafe handling of Vm.conf mentioned in that BZ; another part (the "'NoneType' object has no attribute 'shutdownFlags'" error) is related to QEMU dying too quickly after the shutdown command is issued, which is a different problem.
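For illustration, the first error is the generic Python pitfall of mutating a dict while iterating over it; a minimal sketch follows (the dict and its keys here are made up, not VDSM's actual Vm.conf contents):

# Minimal sketch of the pitfall; 'conf' and its keys are illustrative.
conf = {'memSize': 2048, 'smp': 2, 'display': 'vnc'}

# Unsafe: deleting while iterating raises
# "RuntimeError: dictionary changed size during iteration"
# (in VDSM the mutation happens concurrently, from another thread):
#
#   for key in conf:
#       if key == 'display':
#           del conf[key]

# Safe: iterate over a snapshot of the keys instead.
for key in list(conf):
    if key == 'display':
        del conf[key]

print(conf)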
(In reply to Vitor de Lima from comment #12)
> Part of this problem (the "dictionary changed size during iteration"
> error) has to do with the unsafe handling of Vm.conf mentioned in that
> BZ; another part (the "'NoneType' object has no attribute
> 'shutdownFlags'" error) is related to QEMU dying too quickly after the
> shutdown command is issued, which is a different problem.

Correct. Unfortunately, many flows are potentially affected, so VDSM logs will definitely help shed some light here and let us prioritize. We know the QEMU death and shutdown/destroy flows are racy and potentially buggy, even though the specific error reported should cause little harm. A refactoring of those flows was also planned, but slipped because of other, more urgent issues; maybe it is time to bump it up. BZ1143968 is related, but it tackles a different subset of the bigger problem, one for which we have evidence and a quite clean reproducer. I don't think device handling is the root cause of this issue, but to be sure we need the full backtrace.
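To make the race concrete, here is a minimal sketch of the kind of defensive guard that avoids the AttributeError; the names mirror the traceback (self._dom, acpiShutdown), but this is an illustration of the problem shape, not VDSM's actual code or the eventual patch:

# Illustrative sketch only; not VDSM's actual implementation.
import libvirt

class Vm(object):
    def __init__(self, dom):
        # self._dom is cleared (set to None) once the domain is noticed
        # as dead; the race is that QEMU can die between the shutdown
        # request and the ACPI callback running.
        self._dom = dom

    def acpiShutdown(self):
        dom = self._dom
        if dom is None:
            # QEMU already died; avoid the AttributeError from the log.
            return
        try:
            dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
        except libvirt.libvirtError:
            # The domain can still disappear between the check and the
            # call; treat it the same as an already-dead VM.
            pass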
Patch http://gerrit.ovirt.org/#/c/34879/ should help with this issue.
"dictionary changed size during iteration" tracked in bug 1143968. Changing this one to the shutdown race in comment #14. bug is harmless.
The bug was verified on versions:
rhevm-setup-plugin-ovirt-engine-common-3.6.3-0.1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.5-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.3-0.1.el6.noarch
vdsm-python-4.17.19-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.17.19-0.el7ev.noarch
vdsm-jsonrpc-4.17.19-0.el7ev.noarch
vdsm-yajsonrpc-4.17.19-0.el7ev.noarch
vdsm-xmlrpc-4.17.19-0.el7ev.noarch
vdsm-cli-4.17.19-0.el7ev.noarch
vdsm-4.17.19-0.el7ev.noarch
vdsm-infra-4.17.19-0.el7ev.noarch

The same load test with 50 concurrent threads was performed. No exceptions were raised during the 2 hours of the test (see the attached engine.log, timeframe 2016-02-04 11:35:00 - 2016-02-04 13:35:00).

The REST API StopVM call response time 90% line is 5385 msec, which is worse than in 3.4 but better than in 3.5 (see the attached AggregateTable.csv for details and the StopVM-response-time.png graph).

The bug can be closed.
Created attachment 1121083 [details] Bug verification engine.log
Created attachment 1121084 [details] Bug verification API response time table
Created attachment 1121085 [details] Bug verification API response time graph