Bug 1279048 - All compute nodes get stuck in futex
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: ceph
Version: 7.0 (Kilo)
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: leseb
QA Contact: Warren
Keywords: ZStream
Reported: 2015-11-07 08:22 EST by Mark Wagner
Modified: 2016-07-20 04:42 EDT
Doc Type: Bug Fix
Last Closed: 2016-07-20 04:42:04 EDT
Type: Bug


Attachments: None

Description Mark Wagner 2015-11-07 08:22:07 EST
Description of problem:
During scale testing we are blocked because all of the compute nodes get stuck in a futex. We cannot add or delete VMs. Nova even stops logging to the compute log, and the GMR (Guru Meditation Report) does not work.



Version-Release number of selected component (if applicable):


How reproducible:
Every scale run ends this way.

Steps to Reproduce:
1. Try to ramp to 26000 instances

Actual results:
The ramp stalls somewhere between 12000 and 16000 instances. Attaching strace to the stuck process shows this:

[root@overcloud-compute-0 ~]# strace -p 24133
Process 24133 attached
futex(0x7ffd6633da1c, FUTEX_WAIT_PRIVATE, 1, NULL
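
A single blocked futex call is hard to interpret on its own: `strace -p` attaches only to the one thread whose ID is given (unless `-f` is added), and an idle thread in a multi-threaded process commonly sits in FUTEX_WAIT_PRIVATE. The following is a minimal sketch, assuming a Linux /proc filesystem, for listing the state of every thread in the process. `thread_states` is a hypothetical helper name, not part of nova or Ceph.

```python
import os
from pathlib import Path


def thread_states(pid):
    """Return {tid: (state, wchan)} for every thread of pid, read from /proc."""
    result = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            state = "?"
            for line in (task / "status").read_text().splitlines():
                if line.startswith("State:"):
                    # e.g. "S (sleeping)" or "D (disk sleep)"
                    state = line.split(":", 1)[1].strip()
                    break
            try:
                # Kernel function the thread is sleeping in, if the kernel exposes it.
                wchan = (task / "wchan").read_text().strip()
            except OSError:
                wchan = "?"
        except (FileNotFoundError, ProcessLookupError):
            # The thread exited while we were reading it.
            continue
        result[int(task.name)] = (state, wchan)
    return result
```

Calling `thread_states(24133)` as root on the compute node would show whether every thread is sleeping or whether some are in uninterruptible sleep, which helps separate an idle main thread from a process that is genuinely blocked.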

Expected results:
It should work.

Additional info:

The issue is only cleared by rebooting the compute and Ceph nodes.
Comment 2 Eoghan Glynn 2015-11-24 09:34:53 EST
Can we exclude a ceph issue being the root cause here?
Comment 3 Josh Durgin 2015-11-25 00:12:13 EST
Can you elaborate on the state of the system at the time of the hang?

How is memory use?
What are the ulimit settings? This sort of hang can occur if ulimit is too low 
and you run out of file descriptors.
Are there any abnormal messages in dmesg?

Is ceph being used for nova disks?
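
The file descriptor question above can be checked directly on a compute node. The following is a minimal sketch, assuming a Linux /proc filesystem and sufficient privileges to read the target process's entries. `fd_usage` is a hypothetical helper name, not an existing nova or Ceph tool.

```python
import os


def fd_usage(pid):
    """Return (open_fds, soft_limit) for pid; soft_limit is None if unlimited."""
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units.
                soft = line.split()[3]
                soft_limit = None if soft == "unlimited" else int(soft)
                break
    return open_fds, soft_limit
```

If `open_fds` is at or close to `soft_limit` for the nova-compute PID at the time of the hang, descriptor exhaustion becomes a plausible cause; if it is far below, that explanation can be ruled out.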
Comment 4 Mark Wagner 2015-11-25 08:18:11 EST
I was using Ceph for Glance and ephemeral storage. There were 60 compute nodes.

After further investigation:

The issue is clearly caused by nova-compute waiting on a futex while it is blocked on the Ceph servers. I have observed that the Ceph servers start to have an issue, and then all the compute nodes get stuck in the futex.

Clearing the issues on the Ceph servers clears the issues on the nova-compute services.
They all come back to life without any action required on the compute systems.

So it seems we probably need some type of timer to make sure that Ceph can't block nova-compute indefinitely.
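
The kind of timer suggested above could look roughly like the following. This is only a sketch of the general technique (bounding a blocking call by running it in a worker thread), not nova's or Ceph's actual code. `call_with_timeout` and `slow_storage_call` are hypothetical names, and the latter merely stands in for a storage request that stalls.

```python
import concurrent.futures
import time


def slow_storage_call(delay_s):
    """Hypothetical stand-in for a storage request that blocks for delay_s seconds."""
    time.sleep(delay_s)
    return "done"


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread; raise concurrent.futures.TimeoutError
    if it has not finished within timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not wait for a stuck worker: the caller is released,
        # but the blocked call itself keeps running in the background.
        executor.shutdown(wait=False)
```

Note the limitation: the timeout frees the caller so it can log an error and carry on, but it does not cancel the underlying request, so a thread stays tied up for as long as the storage backend remains unresponsive.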
Comment 5 seb 2016-07-19 09:51:29 EDT
Can we have more debugging info for this one?
It looks like you fixed the issue, but I don't see how.

If you have a fix we should consider, please post it and we will do our best to implement it.
Thanks.
Comment 6 Mark Wagner 2016-07-19 16:43:55 EDT
I have no fix, just the observation that clearing the Ceph issues (by restarting the Ceph machines) unblocked the futex on nova.

I no longer work on OpenStack, so I cannot help any further.
Comment 7 seb 2016-07-20 04:37:37 EDT
Thanks for your reply Mark.
This issue is still a bit obscure to me, so if we can't reproduce it or get more results, I suggest we close this.

Thanks!
