Bug 1279048

Summary: All compute nodes get stuck in futex
Product: Red Hat OpenStack
Reporter: Mark Wagner <mwagner>
Component: ceph
Assignee: Sébastien Han <shan>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Warren <wusui>
Severity: urgent
Priority: urgent
Version: 7.0 (Kilo)
CC: berrange, dasmith, eglynn, flucifre, jdurgin, jtaleric, kchamart, lhh, nlevine, sbauza, seb, sferdjao, sgordon, srevivo, vromanso
Target Milestone: ---
Keywords: ZStream
Target Release: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-07-20 08:42:04 UTC

Description Mark Wagner 2015-11-07 13:22:07 UTC
Description of problem:
During scale testing we are blocked because all of the compute nodes get stuck in a futex. We cannot add or delete VMs, Nova even stops logging to the compute log, and the guru meditation report (GMR) does not work.



Version-Release number of selected component (if applicable):


How reproducible:
Every scale run ends this way.

Steps to Reproduce:
1. Try to ramp to 26000 instances
2.
3.

Actual results:
We stop somewhere between 12000 and 16000 instances. Debugging shows the following:

[root@overcloud-compute-0 ~]# strace -p 24133
Process 24133 attached
futex(0x7ffd6633da1c, FUTEX_WAIT_PRIVATE, 1, NULL
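
For future runs, a per-thread backtrace of the stuck process would show which library the futex wait originates in (for example, whether the blocked thread is inside librbd/librados rather than a Python-level lock). A minimal sketch, assuming gdb and the relevant debuginfo packages are installed on the compute node, and reusing the PID from the strace above:

[root@overcloud-compute-0 ~]# gdb -p 24133 -batch -ex "thread apply all bt"
# Look for frames from librados/librbd in the blocked threads to confirm the
# wait is inside a Ceph client call.

This is only a diagnostic sketch; no backtrace was captured from the affected system.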

Expected results:
The ramp to 26000 instances should complete without the compute nodes hanging.

Additional info:

The issue is only cleared by rebooting the compute and Ceph nodes.

Comment 2 Eoghan Glynn 2015-11-24 14:34:53 UTC
Can we exclude a Ceph issue as the root cause here?

Comment 3 Josh Durgin 2015-11-25 05:12:13 UTC
Can you elaborate on the state of the system at the time of the hang?

How is memory usage?
What are the ulimit settings? This sort of hang can occur if ulimit is too low
and the process runs out of file descriptors.
Are there any abnormal messages in dmesg?

Is Ceph being used for Nova disks?
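
For reference, a quick way to answer the ulimit/file-descriptor and dmesg questions on a hung compute node might look like this (reusing the nova-compute PID from the strace above; commands are illustrative):

[root@overcloud-compute-0 ~]# grep "open files" /proc/24133/limits
[root@overcloud-compute-0 ~]# ls /proc/24133/fd | wc -l    # compare against the "Max open files" soft limit
[root@overcloud-compute-0 ~]# dmesg -T | tail -n 50        # check for OOM-killer, hung-task, or network errors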

Comment 4 Mark Wagner 2015-11-25 13:18:11 UTC
I was using Ceph for Glance and ephemeral storage, with 60 compute nodes.

After further investigation:

The issue is clearly caused by nova-compute waiting on the futex for the Ceph servers. I have observed that the Ceph servers start to have an issue, and then all of the compute nodes get stuck in the futex.

Clearing the issues on the Ceph servers clears the issues on the nova-compute services; they all come back to life without any action required on the compute systems.

So it seems we probably need some type of timeout to ensure that Ceph can't block nova-compute indefinitely.
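
For what it's worth, Ceph does expose client-side operation timeouts that could act as that kind of timer. A minimal sketch of the [client] section of the ceph.conf used by the compute nodes, with illustrative values that would need validation at this scale:

[client]
# 0 (the default) means block forever; a nonzero value makes librados return
# an error instead of hanging when OSDs or monitors stop responding.
rados osd op timeout = 30
rados mon op timeout = 30
client mount timeout = 30

Whether nova-compute handles the resulting errors gracefully under a 12000+ instance load would still need to be verified.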

Comment 5 seb 2016-07-19 13:51:29 UTC
Can we have more debugging info for this one?
It looks like you fixed the issue, but I don't see how.

If you have a fix we should consider, please post it and we will do our best to implement it.
Thanks.

Comment 6 Mark Wagner 2016-07-19 20:43:55 UTC
I have no fix, just the observation that clearing the Ceph issues (by restarting the Ceph machines) unblocked the futex on Nova.

I no longer work on OpenStack, so I cannot help any further.

Comment 7 seb 2016-07-20 08:37:37 UTC
Thanks for your reply Mark.
This issue is still a bit obscure to me, so if we can't reproduce it or get more results, I suggest we close this.

Thanks!