Bug 1279048
Summary: | All compute nodes get stuck in futex | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Mark Wagner <mwagner> |
Component: | ceph | Assignee: | Sébastien Han <shan> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Warren <wusui> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.0 (Kilo) | CC: | berrange, dasmith, eglynn, flucifre, jdurgin, jtaleric, kchamart, lhh, nlevine, sbauza, seb, sferdjao, sgordon, srevivo, vromanso |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | 10.0 (Newton) | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-07-20 08:42:04 UTC | Type: | Bug |
Description
Mark Wagner
2015-11-07 13:22:07 UTC
Can we exclude a Ceph issue being the root cause here? Can you elaborate on the state of the system at the time of the hang? How is memory use? What are the ulimit settings? This sort of hang can occur if ulimit is too low and you run out of file descriptors. Are there any abnormal messages in dmesg? Is Ceph being used for nova disks?

I was using Ceph for Glance and ephemeral storage, with 60 compute nodes.

After further investigation: the issue is clearly caused by nova-compute waiting on the futex with the Ceph servers. I have observed that the Ceph servers start to have an issue and then all the computes get stuck in the futex. Clearing the issues on the Ceph servers clears the issues on the nova-computes. They all come back to life without any action required on the compute systems. So it seems we probably need some type of timer to make sure that Ceph can't block nova-compute indefinitely.

Can we have more debugging info for this one? It looks like you fixed the issue, but I don't see how. If you have a fix we should consider, please post it and we will do our best to implement it. Thanks.

I have no fix, just the observation that clearing the Ceph issues (by restarting the Ceph machines) unblocked the futex on nova. I no longer work on OpenStack, so I cannot help any further.

Thanks for your reply, Mark. This issue is still a bit obscure to me, so if we can't reproduce it or get more results, I suggest we close this. Thanks!
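For reference, the "timer" idea raised in the thread could look roughly like the sketch below. It is only an illustration, not how nova-compute or the Ceph client libraries handle this: `call_with_timeout`, `CallTimedOut`, and `slow_storage_call` are hypothetical names, and the storage request is simulated with a sleep. The blocking call runs in a daemon thread and the caller waits for it only up to a deadline.

```python
import threading
import time


class CallTimedOut(Exception):
    """Raised when the wrapped call does not finish before the deadline."""


def call_with_timeout(func, timeout, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread, waiting at most `timeout` seconds.

    The worker thread is not killed on timeout; the caller simply stops
    waiting for it, so the calling loop can carry on and report the failure.
    """
    result = {}

    def worker():
        try:
            result["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the failure back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        raise CallTimedOut(f"call did not finish within {timeout} s")
    if "error" in result:
        raise result["error"]
    return result["value"]


def slow_storage_call(delay):
    """Hypothetical stand-in for a blocking storage request."""
    time.sleep(delay)
    return "ok"


if __name__ == "__main__":
    # Fast backend: the call returns before the deadline.
    print(call_with_timeout(slow_storage_call, 1.0, 0.01))
    # Slow backend: the caller gives up instead of blocking indefinitely.
    try:
        call_with_timeout(slow_storage_call, 0.05, 0.5)
    except CallTimedOut as exc:
        print("timed out:", exc)
```

A wrapper like this only bounds how long the caller waits. The stuck request keeps occupying its thread until the backend recovers, so it limits the symptom on the compute side rather than fixing the storage-side problem.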