Bug 1445540

Summary: mod_wsgi segfault if loaded during a restart
Product: Red Hat Enterprise Linux 7 Reporter: Ian Wienand <iwienand>
Component: mod_wsgiAssignee: Luboš Uhliarik <luhliari>
Status: CLOSED ERRATA QA Contact: Jan Houska <jhouska>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.3CC: bnater, gandavar, gdumplet, hpham, iwienand, jkejda, jorton, luhliari, spoore, tbarron
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: mod_wsgi-3.4-14.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-30 11:19:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1549616, 1562205    
Attachments:
Description Flags
core file generated from gdb
none
full bt none

Description Ian Wienand 2017-04-25 22:18:26 UTC
Created attachment 1274035 [details]
core file generated from gdb

We are seeing mod_wsgi get into a bad state during devstack runs in OpenStack CI.  Apache will just sit in a tight loop of segfaulting mod_wsgi children until it is restarted.  I am not currently sure how it gets into this situation, but have some info from a debugging session

---
Program received signal SIGSEGV, Segmentation fault.
0x00007fb6886cf15b in wsgi_python_init (p=p@entry=0x7fb69909e0f8) at mod_wsgi.c:5952
5952	        if (wsgi_server_config->py3k_warning_flag == 1)

(gdb) bt
#0  0x00007fb6886cf15b in wsgi_python_init (p=p@entry=0x7fb69909e0f8) at mod_wsgi.c:5952
#1  0x00007fb6886d8b40 in wsgi_hook_child_init (p=0x7fb69909e0f8, s=<optimized out>) at mod_wsgi.c:13778
#2  0x00007fb697c6855c in ap_run_child_init (pchild=0x7fb69909e0f8, s=0x7fb698fbb480) at config.c:165
#3  0x00007fb68d86644b in child_main (child_num_arg=child_num_arg@entry=0) at prefork.c:546
#4  0x00007fb68d866a55 in make_child (s=0x7fb698fbb480, slot=0) at prefork.c:810
#5  0x00007fb68d8676ee in perform_idle_server_maintenance (p=<optimized out>) at prefork.c:912
#6  prefork_run (_pconf=<optimized out>, plog=<optimized out>, s=<optimized out>) at prefork.c:1100
#7  0x00007fb697c4cf6e in ap_run_mpm (pconf=pconf@entry=0x7fb698f86158, plog=0x7fb698fb3378, s=0x7fb698fbb480) at mpm_common.c:96
#8  0x00007fb697c45d76 in main (argc=2, argv=0x7ffd4f772aa8) at main.c:777


(gdb) print wsgi_server_config
$5 = (WSGIServerConfig *) 0x0

(gdb) frame
#1  0x00007fb6886d8b40 in wsgi_hook_child_init (p=0x7fb69909e0f8, s=<optimized out>) at mod_wsgi.c:13778
13778	            wsgi_python_init(p);

(gdb) print wsgi_server
$8 = (server_rec *) 0x0
---

I have attached a core and a full backtrace.

When we restart apache, this starts working.  The issue is this is happening in a CI situation where clearly some of our setup is causing problems, so finding the root cause so we can ensure we don't go down that path will be helpful

Comment 2 Ian Wienand 2017-04-25 22:19:41 UTC
Created attachment 1274036 [details]
full bt

Comment 3 Ian Wienand 2017-04-25 23:39:05 UTC
As found by @tbarron, [1] seems like a very similar issue.

In short, 

---
wsgi_hook_child_init is supposed to be called after wsgi_hook_init and uses the information in wsgi_server_config. Problem is that when graceful is triggered, the mod_wsgi.so is unloaded and reloaded BUT wsgi_hook_init is called just once! As documented in the apache wiki [1] this smart hack fails miserably for our case
---

in summary maybe, graceful restarts aren't so graceful?  we are not calling "apachectl" directly, I don't think ... only via systemd.  but it might be entering the same path

[1] https://bugs.launchpad.net/fuel/+bug/1493372/comments/35

Comment 4 Ian Wienand 2017-04-26 00:01:08 UTC
ok, reload timeline is making sense;  our CI bits are installing mod_wsgi at 19:56:18

---
2017-04-25 19:56:18.244 | + functions-common:sudo_with_proxies:2446  :   sudo http_proxy= https_proxy= no_proxy= yum install -y httpd mod_wsgi
---

We then do some configuration of the uwsgi proxy, and issue a "reload" at 19:56:21

---
2017-04-25 19:56:21.862 | + lib/apache:write_uwsgi_config:270        :   reload_apache_server
2017-04-25 19:56:21.864 | + lib/apache:reload_apache_server:223      :   reload_service httpd
2017-04-25 19:56:21.866 | + functions-common:reload_service:2416     :   '[' -x /bin/systemctl ']'
2017-04-25 19:56:21.868 | + functions-common:reload_service:2417     :   sudo /bin/systemctl reload-or-restart httpd
---

this is where the graceful restart comes in

---
-bash-4.2# cat /usr/lib/systemd/system/httpd.service | grep Reload
ExecReload=/usr/sbin/httpd $OPTIONS -k graceful
---

it's right after this that the segfaults start

---
[Tue Apr 25 19:56:21.942157 2017] [mpm_prefork:notice] [pid 17792] AH00171: Graceful restart requested, doing restart
AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1. Set the 'ServerName' directive globally to suppress this message
[Tue Apr 25 19:56:21.968518 2017] [auth_digest:notice] [pid 17792] AH01757: generating secret for digest authentication ...
[Tue Apr 25 19:56:21.969794 2017] [lbmethod_heartbeat:notice] [pid 17792] AH02282: No slotmem from mod_heartmonitor
[Tue Apr 25 19:56:21.970704 2017] [mpm_prefork:notice] [pid 17792] AH00163: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips configured -- resuming normal operations
[Tue Apr 25 19:56:21.970714 2017] [core:notice] [pid 17792] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Tue Apr 25 19:56:21.974122 2017] [core:notice] [pid 17792] AH00052: child pid 26554 exit signal Segmentation fault (11)
---

Depending on various things after that, apache may get "restarted" (actually stopped and started) or not; those that do restart before trying to do anything with apache would work.  That would explain why some jobs work and others fail.

(http://logs.openstack.org/06/352606/13/check/gate-manila-tempest-minimal-dsvm-lvm-centos-7-nv/0ca39b2/logs)

Comment 5 Joe Orton 2017-04-26 07:21:30 UTC
Can we have some config details, package versions etc?

Comment 6 Ian Wienand 2017-04-27 00:13:10 UTC
Sorry about that, the details are all in the run logs but I forget not everyone reads them all day long :)

The packages in question:
 - httpd 2.4.6-45
 - mod_wsgi 3.4-12

With a bit of playing around just now, I believe the replication is to install httpd, start it, install mod_wsgi and then *reload* httpd.

---
 [centos@test ~]$ sudo yum install httpd
 ---> Package httpd.x86_64 0:2.4.6-45.el7.centos.4 will be installed
 [ ... ]

 [centos@test ~]$ sudo service httpd start
 Redirecting to /bin/systemctl start  httpd.service

 [centos@test ~]$ sudo yum install mod_wsgi
 ---> Package mod_wsgi.x86_64 0:3.4-12.el7_0 will be installed
 [ ... ]

 [centos@test ~]$ sudo service httpd reload
 Redirecting to /bin/systemctl reload  httpd.service

 [root@test centos]# tail /var/log/httpd/error_log 
 [Thu Apr 27 00:02:43.045172 2017] [core:notice] [pid 1107] AH00052: child pid 1206 exit signal Segmentation fault (11)
 [Thu Apr 27 00:02:43.045174 2017] [core:error] [pid 1107] AH00546: no record of generation 0 of exiting child 1206
 ...
---

Although this seems obscure, I guess that other heat bug shows we're not the first people to hit it.

Comment 7 Graham Dumpleton 2017-04-27 00:40:12 UTC
From memory this problem arises due to differences between how Apache 2.2 and 2.4 handles initialisation of modules. It was addressed in mod_wsgi version 4.4.15. See (2) in:

* http://modwsgi.readthedocs.io/en/develop/release-notes/version-4.4.15.html

Details of patch available from:

* https://github.com/GrahamDumpleton/mod_wsgi/commit/3aef8bf4fb713fc94403bfa45b13d8a7df7dfca7

Comment 8 Joe Orton 2017-04-27 06:55:39 UTC
Ah, great, thanks a lot Graham.

Comment 12 Joe Orton 2017-07-13 16:34:28 UTC
*** Bug 1163516 has been marked as a duplicate of this bug. ***

Comment 14 Joe Orton 2018-02-26 16:17:42 UTC
*** Bug 1548540 has been marked as a duplicate of this bug. ***

Comment 24 errata-xmlrpc 2018-10-30 11:19:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3210