Bug 1226911
| Summary: | vmchannel thread consumes 100% of CPU | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Pavel Zhukov <pzhukov> |
| Component: | vdsm | Assignee: | Vinzenz Feenstra [evilissimo] <vfeenstr> |
| Status: | CLOSED ERRATA | QA Contact: | Israel Pinto <ipinto> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.5.1 | CC: | bazulay, erik-fedora, gklein, gveitmic, kyulee, lpeer, lsurette, mavital, mgoldboi, michal.skrivanek, ofrenkel, pdwyer, pzhukov, sbonazzo, s.kieske, yeylon, ykaul, ylavi |
| Target Milestone: | ovirt-3.6.3 | Keywords: | Reopened, ZStream |
| Target Release: | 3.6.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | In some cases, after a migration failed, many error messages flooded the VDSM log, which caused one of the VDSM threads to consume 100% CPU. This was caused by incorrect usage of epoll, which has been fixed in this update. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1297414 (view as bug list) | Environment: | |
| Last Closed: | 2016-03-09 19:40:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1297414 | | |
| Attachments: | | | |
Description
Pavel Zhukov
2015-06-01 12:23:35 UTC
we need the rest of the vdsm.log, would you add it please

Created attachment 1035078 [details]
vdsm log

The archive was not completed because of system overloading, but it can be extracted.

this happened at the end of migration (from src point of view)

Can you add libvirt and qemu logs from both sides? Is it reproducible?

Seems that the migration was not properly cancelled, as I see the src VM got suspended and then the qemu process went crazy

will add defensive code

NACKed by upstream, will try one last push for a solution

Since we don't know the cause of the CPU consumption and are not happy with the defensive code of restarting vdsm (it could create additional issues), closing this one till we can get a reproducer.

After some investigation we have found the root cause of this and can now handle it appropriately. The issue comes from a wrong assumption made in some earlier changes. Originally it was assumed that we can rely on epoll to automatically unregister fds once they have been closed. However, when forking a process (which is the case when starting child processes), the application's open file handles are shared with the child, and unless both child and parent close them, epoll won't consider those fds closed and won't automatically remove them from the watched set. Usually this problem is solved by passing the SOCK_CLOEXEC flag to the socket so that, in case of a fork, those handles get automatically closed in the child process. However, this flag only exists in Python 3.2 and higher. So we should at least set the flag, even though we know it is racy, and we should handle errors on the fd by unregistering it from epoll. And in case we're watching that socket and have a handle to it, also close it.

verify with:
RHEVM: 3.6.3-0.1.el6
vdsm: vdsm-4.17.19-0.el7ev
libvirt: libvirt-1.2.17-13.el7_2.2

Steps to Reproduce:
1. Two-host environment. Put one host into maintenance mode (the second host doesn't have enough resources to run all the VMs, so the migration(s) fail)
2. Wait until the migration has failed
3. Power the VM(s) off
4. Activate the host

Actual results:
The VM is down, the host is up and running. No 'vmchannels' prints in the vdsm log.
Check PASS

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html
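
For illustration, a minimal Python sketch of the fd handling described in the root-cause comment above. This is not the actual vdsm vmchannels code: the names `set_cloexec` and `poll_channels` are hypothetical, and it assumes a plain `select.epoll` loop over already-connected sockets. It shows the two mitigations the comment mentions: marking each fd close-on-exec (via `fcntl`, since `socket.SOCK_CLOEXEC` is only available in Python 3.2+), and explicitly unregistering and closing an fd on EPOLLHUP/EPOLLERR instead of relying on epoll's automatic removal.

```python
import fcntl
import select


def set_cloexec(fd):
    # Mark the fd close-on-exec so forked children do not keep a copy open.
    # This is racy (a fork can happen between socket creation and this call),
    # which is why the explicit error handling below is still required.
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def poll_channels(channels, timeout=2.0):
    # channels: dict mapping fd -> connected socket object (hypothetical).
    ep = select.epoll()
    for fd in channels:
        set_cloexec(fd)
        ep.register(fd, select.EPOLLIN)

    while channels:
        for fd, events in ep.poll(timeout):
            if events & (select.EPOLLHUP | select.EPOLLERR):
                # Do not rely on epoll dropping the fd automatically; a
                # forked child may still hold it open, so unregister and
                # close our handle explicitly.
                ep.unregister(fd)
                channels.pop(fd).close()
            elif events & select.EPOLLIN:
                data = channels[fd].recv(4096)
                if not data:
                    # Peer closed the connection.
                    ep.unregister(fd)
                    channels.pop(fd).close()
    ep.close()
```

In the fix described above, the close-on-exec flag would be set where the channel socket is created; the explicit unregister-and-close on error is what keeps a stale, child-inherited fd from spinning the epoll loop at 100% CPU.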