| Summary: | Get "signal Segmentation fault" error in log and return 500 when accessing python-2.7/2.6 app with medium gear & python-2.6 app with large gear | ||
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | chunchen <chunchen> |
| Component: | Containers | Assignee: | mfisher |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | libra bugs <libra-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 2.x | CC: | mfojtik, wsun, xtian |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2014-01-24 03:25:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
chunchen
2013-10-18 11:19:16 UTC
Could be reproduced with a large gear python-2.6 app as well:

```
[Mon Oct 21 05:58:05 2013] [notice] Apache/2.2.15 (Unix) mod_wsgi/3.2 Python/2.6.6 configured -- resuming normal operations
[Mon Oct 21 05:59:29 2013] [error] [client 127.1.246.1] Premature end of script headers: application
[Mon Oct 21 05:59:29 2013] [notice] child pid 14415 exit signal Segmentation fault (11)
```

This is due to the stack-size setting. I can reliably reproduce it: removing the setting from performance.conf.erb makes the failure disappear, and putting it back makes it return. Looking through a few core dumps, it appears the setting is causing writes off the end of the stack, which is what triggers the segfault.

The docs on WSGIDaemonProcess say the value for stack-size is in bytes. We appear to be making the following settings:

| Gear size | Memory | stack-size |
|---|---|---|
| small | 512MB | 8388 bytes |
| medium | 1024MB | 16777 bytes |
| large | 2048MB | 33554 bytes |

The system limits allow up to 10485760 bytes, and the default may well be that value. The manpage for pthread_attr_setstack says it fails if you try to set a stack size below 16384 bytes. I'll bet the only reason this setting "works" on small gears is that 8388 is low enough that setstack fails and the default ends up getting used. The Python docs claim the interpreter needs 32k stacks just to run: http://docs.python.org/2/library/thread.html

Also, the embedded formula produces values that are not aligned to 4k boundaries:

```
stack-size=<%= (((ENV['OPENSHIFT_GEAR_MEMORY_MB'].to_i * 0.8)/25) * 1024).to_i/2 %>
```

The value given is likely being rounded down to the nearest 4k page boundary.

Tweaking the stack size seems to risk blowing things up, and it doesn't matter much anyway: most of what the script itself does lives on the heap, which we have no control over. Perhaps a better alternative would be to tune down the number of threads we create instead. We currently set 25 for every gear size.
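The arithmetic above can be checked with a short sketch: a Python translation of the ERB formula (the helper name `wsgi_stack_size` is illustrative, not from the cartridge code), plus a demonstration that CPython itself refuses thread stacks this small (the Python docs state a 32 KiB minimum for `threading.stack_size`):

```python
import threading

def wsgi_stack_size(gear_memory_mb):
    """Python translation of the ERB stack-size formula in performance.conf.erb."""
    return int(((gear_memory_mb * 0.8) / 25) * 1024) // 2

for name, mem in [("small", 512), ("medium", 1024), ("large", 2048)]:
    size = wsgi_stack_size(mem)
    # size % 4096 is nonzero for every gear size: none are 4k-aligned
    print(name, size, "bytes, remainder mod 4k:", size % 4096)

# CPython enforces a 32 KiB minimum thread stack, so a value like the
# medium gear's 16777 bytes is rejected outright by the threading module:
try:
    threading.stack_size(16384)
except ValueError as exc:
    print("rejected:", exc)
```

Running it reproduces the table values (8388, 16777, 33554) and shows each is below or barely above the minimums the stack must satisfy.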
How about something like "10 threads per 1024MB" instead, which would result in the following table:

| Gear size | Memory | threads |
|---|---|---|
| small | 512MB | 5 |
| medium | 1024MB | 10 |
| large | 2048MB | 20 |

Hi Rob, thanks a **lot** for this investigation. Yeah, I agree that the better way would be to set the number of threads or the number of processes. I'll work on this today and open a PR.

Commit pushed to master at https://github.com/openshift/origin-server:

https://github.com/openshift/origin-server/commit/3cfc6954453fad0952cf53910bc69ea8e0d7abb7
Bug 1020841 - Tune python cartridge by increasing number of threads instead of stack-size

It's fixed; verified on devenv_3932.
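The proposed sizing rule reduces to a one-liner. A minimal sketch (the function name and the `max(1, ...)` floor are illustrative assumptions, not taken from the committed cartridge code):

```python
def wsgi_threads(gear_memory_mb, per_gb=10):
    """Proposed sizing: 10 mod_wsgi daemon threads per 1024MB of gear memory.

    The max(1, ...) floor is an illustrative guard so a hypothetical
    sub-102MB gear would still get one thread.
    """
    return max(1, gear_memory_mb * per_gb // 1024)

for name, mem in [("small", 512), ("medium", 1024), ("large", 2048)]:
    print(name, wsgi_threads(mem))
```

This reproduces the table above: 5 threads for small, 10 for medium, 20 for large.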