1279048 – All compute nodes get stuck in futex

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	ceph
Sub Component:
Version:	7.0 (Kilo)
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	10.0 (Newton)
Assignee:	Sébastien Han
QA Contact:	Warren
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-11-07 13:22 UTC by Mark Wagner
Modified:	2019-09-09 15:43 UTC
CC List:	15 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-07-20 08:42:04 UTC
Target Upstream Version:
Embargoed:

Description Mark Wagner 2015-11-07 13:22:07 UTC

Description of problem:
During scale testing we are blocked because all of the compute nodes get stuck in a futex. We cannot add or delete VMs. Nova even stops logging to the compute log, and the GMR (Guru Meditation Report) does not work.



Version-Release number of selected component (if applicable):


How reproducible:
Every scale run ends this way.

Steps to Reproduce:
1. Try to ramp to 26000 instances

Actual results:
The ramp stalls somewhere between 12000 and 16000 instances. Attaching strace to a stuck process shows it blocked in a futex wait:

[root@overcloud-compute-0 ~]# strace -p 24133
Process 24133 attached
futex(0x7ffd6633da1c, FUTEX_WAIT_PRIVATE, 1, NULL

Expected results:
It should work.

Additional info:

The issue is only cleared by rebooting the compute and Ceph nodes.

Comment 2 Eoghan Glynn 2015-11-24 14:34:53 UTC

Can we exclude a ceph issue being the root cause here?

Comment 3 Josh Durgin 2015-11-25 05:12:13 UTC

Can you elaborate on the state of the system at the time of the hang?

How is memory use?
What are the ulimit settings? This sort of hang can occur if ulimit is too low and you run out of file descriptors.
Are there any abnormal messages in dmesg?

Is ceph being used for nova disks?
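
For reference, the file descriptor question above could be answered with a small script like the one below. This is only a sketch, assuming a Linux host with /proc mounted; the helper name `fd_usage` is made up for illustration and is not part of nova or Ceph. It counts the descriptors a process currently holds and reads its soft "Max open files" limit.

```python
import os


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a Linux process, read from /proc.

    soft_limit is None when the limit is reported as "unlimited".
    Hypothetical helper for illustration; needs permission to read /proc/<pid>.
    """
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))

    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Fields after the three-word name: soft limit, hard limit, units.
                soft = line.split()[3]
                soft_limit = None if soft == "unlimited" else int(soft)
                break
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_usage(os.getpid())
    print(f"open fds: {used}, soft limit: {limit}")
```

Running it as root with the PID of the stuck nova-compute process instead of `os.getpid()` would show whether the process is close to its limit.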

Comment 4 Mark Wagner 2015-11-25 13:18:11 UTC

I was using Ceph for Glance and ephemeral storage, with 60 compute nodes.

After further investigation:

The issue is clearly caused by nova-compute waiting in a futex on the Ceph servers. I have observed that the Ceph servers start to have an issue, and then all the computes get stuck in the futex.

Clearing the issues on the ceph servers clears the issues on the nova-computes.
They all come back to life without any action required on the compute systems. 

So it seems we probably need some type of timer to make sure that Ceph can't block nova-compute indefinitely.
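
To illustrate the kind of timer meant here, a generic sketch follows. It is not how nova or librados handle this; `call_with_timeout` and `StorageTimeout` are hypothetical names, and the approach (run the blocking call in a daemon thread and stop waiting after a deadline) is just one possible way to bound the wait. Note that the worker thread itself stays blocked after a timeout; only the caller is released.

```python
import threading


class StorageTimeout(Exception):
    """Raised when the wrapped call does not finish in time (hypothetical name)."""


def call_with_timeout(func, timeout, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread and wait at most `timeout` seconds.

    Returns the call's result, re-raises its exception, or raises
    StorageTimeout if it is still running when the deadline passes.
    """
    result = {}

    def worker():
        try:
            result["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the failure back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)

    if thread.is_alive():
        # The call is still blocked; give up waiting instead of hanging forever.
        raise StorageTimeout(f"call did not finish within {timeout}s")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

A caller would wrap each potentially blocking storage operation, for example `call_with_timeout(some_blocking_call, 30)`, and treat `StorageTimeout` as a backend failure that gets logged rather than a silent hang.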

Comment 5 seb 2016-07-19 13:51:29 UTC

Can we have more debugging info for this one?
It looks like you fixed the issue, but I don't see how?

If you have a fix we should consider, please post it and we will do our best to implement it.
Thanks.

Comment 6 Mark Wagner 2016-07-19 20:43:55 UTC

I have no fix, just the observation that clearing the Ceph issues (by restarting the Ceph machines) unblocked the futex on nova.

I no longer work on OpenStack, so I cannot help any further.

Comment 7 seb 2016-07-20 08:37:37 UTC

Thanks for your reply Mark.
This issue is still a bit obscure to me, so if we can't reproduce it or get more results, I suggest we close this.

Thanks!
