Red Hat Bugzilla – Bug 1279048
All compute nodes get stuck in futex
Last modified: 2016-07-20 04:42:04 EDT
Description of problem:
During scale testing we are blocked because all of the compute nodes get stuck in a futex. We cannot add or delete VMs. Nova even stops logging to the compute log, and the guru meditation report (GMR) does not work.
Version-Release number of selected component (if applicable):
Every scale run ends this way.
Steps to Reproduce:
1. Try to ramp to 26000 instances
We stall somewhere between 12000 and 16000 instances. Debugging shows this:
[root@overcloud-compute-0 ~]# strace -p 24133
Process 24133 attached
futex(0x7ffd6633da1c, FUTEX_WAIT_PRIVATE, 1, NULL
It should work.
The issue is only cleared by rebooting the compute and Ceph nodes.
Can we exclude a ceph issue being the root cause here?
Can you elaborate on the state of the system at the time of the hang?
How is memory use?
What are the ulimit settings? This sort of hang can occur if ulimit is too low
and you run out of file descriptors.
Are there any abnormal messages in dmesg?
Is ceph being used for nova disks?
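As a quick way to answer the file-descriptor question above, the open-descriptor count can be compared against the ulimit from Python (a generic sketch, not part of the original report; it uses the Linux-specific /proc interface and inspects the current process):

```python
import os
import resource

# Soft/hard limits on open file descriptors for this process
# (what "ulimit -n" reports for the shell).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Count the descriptors currently open (Linux /proc interface).
open_fds = len(os.listdir("/proc/self/fd"))

print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

To inspect a running nova-compute instead, list /proc/&lt;pid&gt;/fd for its PID (as root); a count at or near the soft limit would point to fd exhaustion as the cause of the hang.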
I was using Ceph for Glance and ephemeral storage, with 60 compute nodes.
After further investigation:
The issue is clearly caused by nova-compute waiting on a futex while talking to the Ceph servers. I have observed that the Ceph servers start to have an issue, and then all the computes get stuck in the futex.
Clearing the issues on the ceph servers clears the issues on the nova-computes.
They all come back to life without any action required on the compute systems.
So it seems like we probably need some type of timeout to make sure that Ceph can't block nova-compute indefinitely.
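As an illustration of the kind of timeout suggested above (a hedged sketch, not Nova's actual mechanism; `call_with_timeout` and the 30-second value are made up for the example), a potentially blocking storage call can be run in a worker thread and abandoned if it does not return in time:

```python
import concurrent.futures

def call_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn in a worker thread; raise TimeoutError if it blocks too long.

    Note: this only *detects* the hang -- a thread stuck in a futex inside
    a C extension (e.g. librbd) cannot be cancelled, so the stuck worker is
    left behind and the caller must decide how to degrade.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    finally:
        # Don't join the (possibly stuck) worker thread here.
        pool.shutdown(wait=False)
```

For example, a wrapper like `call_with_timeout(some_rbd_call, 30)` would let nova-compute report the storage backend as unresponsive instead of hanging indefinitely, as described in this bug.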
Can we have more debugging info for this one?
It looks like you fixed the issue, but I don't see how.
If you have a fix we should consider, please post it and we will do our best to implement it.
I have no fix, just the observation that clearing the Ceph issues (by restarting the Ceph machines) unblocked the futex on nova.
I no longer work on OpenStack, so I cannot help any further.
Thanks for your reply Mark.
This issue is still a bit obscure to me, so if we can't reproduce it or get more results, I suggest we close it.