Bug 1612792 - httpd dies and cannot restart because of lack of semaphores
Summary: httpd dies and cannot restart because of lack of semaphores
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.10.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: GA
: 5.10.0
Assignee: Joe Rafaniello
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-08-06 09:38 UTC by Ievgen Zapolskyi
Modified: 2020-01-29 07:40 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-16 14:25:10 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
appliance logs (18.89 MB, application/x-bzip)
2018-08-06 09:38 UTC, Ievgen Zapolskyi
no flags Details
some evidence from appliance (7.97 KB, text/plain)
2018-08-06 09:41 UTC, Ievgen Zapolskyi
no flags Details

Description Ievgen Zapolskyi 2018-08-06 09:38:17 UTC
Created attachment 1473553 [details]
appliance logs

Description of problem:
Some 5.10 appliances die during auto tests.
When I logged into such appliance, I found that apache was dead and its log contained the following errors

[Sat Aug 04 04:45:05.588887 2018] [core:notice] [pid 65345] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Sat Aug 04 04:49:52.518235 2018] [mpm_prefork:notice] [pid 65345] AH00170: caught SIGWINCH, shutting down gracefully
[Sat Aug 04 04:51:22.778845 2018] [core:notice] [pid 2255] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0
[Sat Aug 04 04:51:22.780170 2018] [core:emerg] [pid 2255] (28)No space left on device: AH00023: Couldn't create the rewrite-map mutex
AH00016: Configuration Failed
[Sat Aug 04 04:52:44.621688 2018] [core:notice] [pid 2740] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0
[Sat Aug 04 04:52:44.623020 2018] [core:emerg] [pid 2740] (28)No space left on device: AH00023: Couldn't create the rewrite-map mutex
AH00016: Configuration Failed

There was enough disk space everywhere in appliance.
When I checked ipc, I found that semaphore arrays max value -> 128
And all of them are used by apache and postgre

[root@dhcp-8-198-209 apache]# ipcs -l

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 32
semaphore max value = 32767


[root@dhcp-8-198-209 apache]# ipcs -s | wc -l
132

So, httpd couldn't start because system ran out of sem arrays and httpd didn't clean up its sem arrays on exit/error.


Version-Release number of selected component (if applicable):
5.10.0.8

How reproducible:
20%, it happens frequently in our auto tests. I cannot provide exact steps atm but I'm still figuring out this.
I can provide appliance with such issue.

Steps to Reproduce:
1. in progress

Actual results:
appliance isn't available via https.
all semaphore arrays are used.
so, it seems appliance either has too small sem array value or httpd has a leak of semaphores.

Expected results:
no such issue

Additional info:
I cured this issue by increasing number of sem arrays. However, number of sem arrays didn't decrease after some time. So, I assume apache doesn't take care of them.

Logs and some collected info are attached.

Comment 2 Ievgen Zapolskyi 2018-08-06 09:41:34 UTC
Created attachment 1473555 [details]
some evidence from appliance

Comment 4 Joe Rafaniello 2018-08-06 19:25:41 UTC
Hi Ievgen,

Interesting... we've seen this problem in the past where we were accidentally doing kill -9 on httpd.  See [1].  

I'm not sure how to troubleshoot this but the first thing to try is look at the journalctl for httpd, and see if the process is being abnormally terminated.   I guess you could also setup a cron job to write the count of semaphores every minute or 5 minutes and see if those line up with anything that happens in journalctl, /var/log/message, or our evm.log. 

Let me know what you find.  Thanks!

[1] https://github.com/ManageIQ/manageiq/pull/9236

Comment 6 Joe Rafaniello 2018-08-08 15:10:43 UTC
Hi Ievgen, it looks like httpd was asked to stop and after 90 seconds, systemd kill -9 it.  Perhaps the httpd logs would help let us know what's happening for 90 seconds to know why it's not exiting in time.

Comment 7 Ievgen Zapolskyi 2018-08-08 16:16:57 UTC
Hi Joe, 

yes, I also think so. I looked at httpd logs in both /var/log and vmdb/log/apache but those didn't shed a light on the reason for such latency.
You can also check logs. I attached those above when created BZ.

Comment 8 Joe Rafaniello 2018-08-08 18:38:46 UTC
in bug #1434916, they had similar symptoms where a shutdown of httpd was timing out.  There was a bug in the mod_http2 module so they disabled that module to troubleshoot it.  I guess you could try disabling modules one at a time just to see if you can shut down faster.

Another much older bug #906321, had similar symptoms but is a much older version.

I looked at your logs and like you said, there isn't a clear reason why it's failing to shutdown within the timeout.

Comment 9 Ievgen Zapolskyi 2018-08-21 11:00:45 UTC
Hi Joe,

Now we have this issue in 5.9.4.4 as well. 
Can I see what has been merged since 5.9.4.3 ?

Btw, this issue appears only in tests which setup MiqLDAP and restart httpd.

P.S. Couldn't we just make httpd unit remove its stale semathores in ExecStartPre ?

Comment 10 Joe Rafaniello 2018-08-21 14:20:55 UTC
Is this container or appliance, or both?

What exactly do you do to make this happen?

do you have an ipadress of a machine I can ssh into to recreate it and investigate it?

thanks!

Comment 11 Ievgen Zapolskyi 2018-08-21 16:13:28 UTC
1. We see this issue only in regular appliance.

2. It happens only when we run tests in jenkins. in addition, we encounter it only during MiqLdap auth tests, which restart httpd after changing auth settings.
I cannot reproduce it locally. so, I suppose it is somehow related to performance.

3. I cannot give you ip address right now because appliances used by tests have already been killed. I can provide you access to next such appliance when we encounter such issue again. 
I have to confess, there wasn't much info in appliance. all logs and etc were attached to BZ. In addition, I couldn't reproduce the issue locally using the same set of tests and the same appliance.

Comment 12 Joe Rafaniello 2018-08-21 17:35:32 UTC
I know we've already covered much of this, but 

https://access.redhat.com/solutions/55914

says, this happens because:
1) KILL signal... 

2) a module is using too many semaphores

From your logs, it looks like it's 1, and coming from systemd.  For whatever reason, httpd stopping doesn't finish in 90 seconds and gets killed.  Maybe you can turn up httpd logging and experiment with disabling some of the apache modules?  Perhaps some of the auth/pam/saml modules are causing it to take to long to shut down?  It's interesting you mention the MiqLdap auth tests.

Try LogLevel debug, recreate it as is.  Collect the logs, then try disabling some of the modules.

Can you try to do a subset of tests?  Can you do only the MiqLdap and a small number of other tests?  Maybe we can eliminate tests until we get a smaller recreation step.

Could you try older versions of 5.9?  Maybe we can bisect this down to comparing package lists using rpm -qa |sort ?

Comment 13 Joe Rafaniello 2018-08-21 18:02:55 UTC
Ievgen, I had a great suggestion by JoeV.  Basically, all MiqLdap tests should not need any apache auth modules.  So there's two things to try:

1) bisect which cfme versions have this problem... run the whole suite and figure out which versions of 5.9 were ok (if any) and when did it start failing... perhaps we can get a rpm list to compare and see what might have introduced the problem

2) Try to limit the test suite to narrow the problem scope.  For example, are you doing external auth tests and miq ldap tests?  Are they happening in the same sequence each time?  Is miqldap always first?  Is external auth always second?  Are you using the same test setup for miqldap as you're using for external auth?  
Does it fail when you only run miqldap?  Or only run external auth tests?

JoeV said you should be able to run the miqldap tests alone without any of the auth modules for apache... so if you can recreate it on just the miqldap tests, try eliminating apache modules one at a time and retry.

Comment 14 Ievgen Zapolskyi 2018-08-21 20:37:49 UTC
1. I already tried to use a bunch of appliances with httpd LogLevel = trace6 two times but didn't manage to catch that issue. I'll try once more tomorrow.
2. 5.9.4.3 didn't have such issue. It appeared only in 5.9.4.4. So, as I asked above, could you help to find out what was changed since 5.9.4.3 ?
3. I already narrowed out which tests caused such issue but I still cannot reproduce this issue locally using those tests.

I'll try to patch appliance templates which are used in automated testing to provide trace logs for httpd.

Comment 15 Joe Rafaniello 2018-08-21 21:11:23 UTC
Ievgen, here's the package differences from 5.9.4.3 to 5.9.4.4.  I had to manually deploy two ovas, boot them, and run rpm -qa | sort and diff them locally:

diff --git a/5.9.4.3.txt b/5.9.4.4.txt
index 50823af8d7..5af4a142ff 100644
--- a/5.9.4.3.txt
+++ b/5.9.4.4.txt
@@ -9,9 +9,9 @@ apr-1.4.8-3.el7_4.1.x86_64
 apr-util-1.5.2-6.el7.x86_64
 at-3.1.13-23.el7.x86_64
 attr-2.4.46-13.el7.x86_64
-audit-2.8.1-3.el7.x86_64
-audit-libs-2.8.1-3.el7.x86_64
-audit-libs-python-2.8.1-3.el7.x86_64
+audit-2.8.1-3.el7_5.1.x86_64
+audit-libs-2.8.1-3.el7_5.1.x86_64
+audit-libs-python-2.8.1-3.el7_5.1.x86_64
 authconfig-6.2.8-30.el7.x86_64
 autofs-5.0.7-83.el7.x86_64
 autogen-libopts-5.18-5.el7.x86_64
@@ -32,11 +32,11 @@ bzip2-libs-1.0.6-13.el7.x86_64
 ca-certificates-2018.2.22-70.0.el7_5.noarch
 c-ares-1.10.0-3.el7.x86_64
 certmonger-0.78.4-3.el7_5.1.x86_64
-cfme-5.9.4.3-1.el7cf.x86_64
-cfme-appliance-5.9.4.3-1.el7cf.x86_64
-cfme-appliance-common-5.9.4.3-1.el7cf.x86_64
-cfme-appliance-tools-5.9.4.3-1.el7cf.x86_64
-cfme-gemset-5.9.4.3-1.el7cf.x86_64
+cfme-5.9.4.4-1.el7cf.x86_64
+cfme-appliance-5.9.4.4-1.el7cf.x86_64
+cfme-appliance-common-5.9.4.4-1.el7cf.x86_64
+cfme-appliance-tools-5.9.4.4-1.el7cf.x86_64
+cfme-gemset-5.9.4.4-1.el7cf.x86_64
 checkpolicy-2.5-6.el7.x86_64
 chkconfig-1.7.4-1.el7.x86_64
 chrony-3.2-2.el7.x86_64
@@ -44,7 +44,7 @@ cifs-utils-6.2-10.el7.x86_64
 cjkuni-ukai-fonts-0.2.20080216.1-51.el7.noarch
 cjkuni-uming-fonts-0.2.20080216.1-53.el7.noarch
 cloud-init-0.7.9-24.el7_5.1.x86_64
-cockpit-dashboard-169-1.el7.x86_64
+cockpit-dashboard-172-2.el7.x86_64
 cockpit-ws-154-3.el7.x86_64
 coreutils-8.22-21.el7.x86_64
 cpio-2.11-27.el7.x86_64
@@ -76,9 +76,9 @@ dhcp-common-4.2.5-68.el7_5.1.x86_64
 dhcp-libs-4.2.5-68.el7_5.1.x86_64
 diffutils-3.3-4.el7.x86_64
 dmidecode-3.0-5.el7.x86_64
-dracut-033-535.el7.x86_64
-dracut-config-rescue-033-535.el7.x86_64
-dracut-network-033-535.el7.x86_64
+dracut-033-535.el7_5.1.x86_64
+dracut-config-rescue-033-535.el7_5.1.x86_64
+dracut-network-033-535.el7_5.1.x86_64
 e2fsprogs-1.42.9-12.el7_5.x86_64
 e2fsprogs-libs-1.42.9-12.el7_5.x86_64
 ebtables-2.0.10-16.el7.x86_64
@@ -143,7 +143,7 @@ httpd-tools-2.4.6-80.el7_5.1.x86_64
 http-parser-2.7.1-5.el7_4.x86_64
 hwdata-0.252-8.8.el7.x86_64
 info-5.1-5.el7.x86_64
-initscripts-9.49.41-1.el7.x86_64
+initscripts-9.49.41-1.el7_5.1.x86_64
 ipa-client-4.5.4-10.el7_5.3.x86_64
 ipa-client-common-4.5.4-10.el7_5.3.noarch
 ipa-common-4.5.4-10.el7_5.3.noarch
@@ -161,15 +161,15 @@ json-glib-1.2.6-1.el7.x86_64
 kbd-1.15.5-13.el7.x86_64
 kbd-legacy-1.15.5-13.el7.noarch
 kbd-misc-1.15.5-13.el7.noarch
-kernel-3.10.0-862.9.1.el7.x86_64
-kernel-tools-3.10.0-862.9.1.el7.x86_64
-kernel-tools-libs-3.10.0-862.9.1.el7.x86_64
-kexec-tools-2.0.15-13.el7.x86_64
+kernel-3.10.0-862.11.6.el7.x86_64
+kernel-tools-3.10.0-862.11.6.el7.x86_64
+kernel-tools-libs-3.10.0-862.11.6.el7.x86_64
+kexec-tools-2.0.15-13.el7_5.1.x86_64
 keyutils-1.5.8-3.el7.x86_64
 keyutils-libs-1.5.8-3.el7.x86_64
 kmod-20-21.el7.x86_64
 kmod-libs-20-21.el7.x86_64
-kpartx-0.4.9-119.el7.x86_64
+kpartx-0.4.9-119.el7_5.1.x86_64
 krb5-libs-1.15.1-19.el7.x86_64
 krb5-workstation-1.15.1-19.el7.x86_64
 lasso-2.5.1-2.el7.x86_64
@@ -179,7 +179,7 @@ libaio-0.3.109-13.el7.x86_64
 libassuan-2.1.0-3.el7.x86_64
 libattr-2.4.46-13.el7.x86_64
 libbasicobjects-0.1.1-29.el7.x86_64
-libblkid-2.23.2-52.el7.x86_64
+libblkid-2.23.2-52.el7_5.1.x86_64
 libcap-2.22-9.el7.x86_64
 libcap-ng-0.7.5-4.el7.x86_64
 libcgroup-0.41-15.el7.x86_64
@@ -207,7 +207,7 @@ libgcrypt-1.5.3-14.el7.x86_64
 libgnome-keyring-3.12.0-1.el7.x86_64
 libgomp-4.8.5-28.el7_5.1.x86_64
 libgpg-error-1.12-3.el7.x86_64
-libgudev1-219-57.el7.x86_64
+libgudev1-219-57.el7_5.1.x86_64
 libicu-50.1.2-15.el7.x86_64
 libidn-1.28-4.el7.x86_64
 libini_config-1.3.1-29.el7.x86_64
@@ -217,7 +217,7 @@ libkadm5-1.15.1-19.el7.x86_64
 libldb-1.2.2-1.el7.x86_64
 libmnl-1.0.3-7.el7.x86_64
 libmodman-2.0.1-8.el7.x86_64
-libmount-2.23.2-52.el7.x86_64
+libmount-2.23.2-52.el7_5.1.x86_64
 libmspack-0.5-0.5.alpha.el7.x86_64
 libndp-1.2-7.el7.x86_64
 libnetfilter_conntrack-1.0.6-1.el7_3.x86_64
@@ -242,7 +242,7 @@ libselinux-utils-2.5-12.el7.x86_64
 libsemanage-2.5-11.el7.x86_64
 libsemanage-python-2.5-11.el7.x86_64
 libsepol-2.5-8.1.el7.x86_64
-libsmbclient-4.7.1-6.el7.x86_64
+libsmbclient-4.7.1-9.el7_5.x86_64
 libss-1.42.9-12.el7_5.x86_64
 libssh-0.7.1-3.el7.x86_64
 libssh2-1.4.3-10.el7_2.1.x86_64
@@ -264,10 +264,10 @@ libunistring-0.9.3-9.el7.x86_64
 libusbx-1.0.21-1.el7.x86_64
 libuser-0.60-9.el7.x86_64
 libutempter-1.1.6-4.el7.x86_64
-libuuid-2.23.2-52.el7.x86_64
+libuuid-2.23.2-52.el7_5.1.x86_64
 libverto-0.2.5-4.el7.x86_64
 libverto-tevent-0.2.5-4.el7.x86_64
-libwbclient-4.7.1-6.el7.x86_64
+libwbclient-4.7.1-9.el7_5.x86_64
 libX11-1.6.5-1.el7.x86_64
 libX11-common-1.6.5-1.el7.noarch
 libXau-1.0.8-2.1.el7.x86_64
@@ -292,7 +292,7 @@ m2crypto-0.21.1-17.el7.x86_64
 mailcap-2.1.41-2.el7.noarch
 make-3.82-23.el7.x86_64
 man-db-2.6.3-9.el7.x86_64
-mariadb-libs-5.5.56-2.el7.x86_64
+mariadb-libs-5.5.60-1.el7_5.x86_64
 memcached-1.4.15-10.el7_3.1.x86_64
 mercurial-2.6.2-8.el7_4.x86_64
 microcode_ctl-2.1-29.10.el7_5.x86_64
@@ -474,7 +474,7 @@ python-netifaces-0.10.4-3.el7.x86_64
 python-nss-0.16.0-3.el7.x86_64
 python-paramiko-2.1.1-4.el7.noarch
 python-passlib-1.6.5-2.el7.noarch
-python-perf-3.10.0-862.9.1.el7.x86_64
+python-perf-3.10.0-862.11.6.el7.x86_64
 python-ply-3.4-11.el7.noarch
 python-prettytable-0.7.2-3.el7.noarch
 python-pycparser-2.14-1.el7.noarch
@@ -496,7 +496,7 @@ python-yubico-1.2.3-1.el7.noarch
 pyusb-1.0.0-0.11.b1.el7.noarch
 pyxattr-0.5.1-5.el7.x86_64
 PyYAML-3.10-11.el7.x86_64
-qemu-guest-agent-2.8.0-2.el7.x86_64
+qemu-guest-agent-2.8.0-2.el7_5.1.x86_64
 qpid-proton-c-0.19.0-1.el7cf.x86_64
 qrencode-libs-3.4.1-3.el7.x86_64
 quota-4.01-17.el7.x86_64
@@ -572,17 +572,17 @@ rubygem-rdoc-4.0.0-33.el7_4.noarch
 rubygems-2.0.14.1-33.el7_4.noarch
 ruby-irb-2.0.0.648-33.el7_4.noarch
 ruby-libs-2.0.0.648-33.el7_4.x86_64
-samba-client-libs-4.7.1-6.el7.x86_64
-samba-common-4.7.1-6.el7.noarch
-samba-common-libs-4.7.1-6.el7.x86_64
-samba-common-tools-4.7.1-6.el7.x86_64
-samba-libs-4.7.1-6.el7.x86_64
+samba-client-libs-4.7.1-9.el7_5.x86_64
+samba-common-4.7.1-9.el7_5.noarch
+samba-common-libs-4.7.1-9.el7_5.x86_64
+samba-common-tools-4.7.1-9.el7_5.x86_64
+samba-libs-4.7.1-9.el7_5.x86_64
 scap-security-guide-0.1.36-9.el7_5.noarch
 scl-utils-20130529-18.el7_4.x86_64
 screen-4.1.0-0.25.20120314git3c2946.el7.x86_64
 sed-4.2.2-5.el7.x86_64
-selinux-policy-3.13.1-192.el7_5.4.noarch
-selinux-policy-targeted-3.13.1-192.el7_5.4.noarch
+selinux-policy-3.13.1-192.el7_5.6.noarch
+selinux-policy-targeted-3.13.1-192.el7_5.6.noarch
 setools-libs-3.3.8-2.el7.x86_64
 setserial-2.17-33.el7.x86_64
 setup-2.8.71-9.el7.noarch
@@ -615,9 +615,9 @@ subversion-1.7.14-14.el7.x86_64
 subversion-libs-1.7.14-14.el7.x86_64
 sudo-1.8.19p2-14.el7_5.x86_64
 supervisor-3.1.4-1.el7.noarch
-systemd-219-57.el7.x86_64
-systemd-libs-219-57.el7.x86_64
-systemd-sysv-219-57.el7.x86_64
+systemd-219-57.el7_5.1.x86_64
+systemd-libs-219-57.el7_5.1.x86_64
+systemd-sysv-219-57.el7_5.1.x86_64
 sysvinit-tools-2.88-14.dsf.el7.x86_64
 tar-1.26-34.el7.x86_64
 tcp_wrappers-7.6-77.el7.x86_64
@@ -626,14 +626,14 @@ teamd-1.27-4.el7.x86_64
 traceroute-2.0.22-2.el7.x86_64
 tree-1.6.0-10.el7.x86_64
 trousers-0.3.14-2.el7.x86_64
-tuned-2.9.0-1.el7.noarch
+tuned-2.9.0-1.el7_5.1.noarch
 tzdata-2018e-3.el7.noarch
 unzip-6.0-19.el7.x86_64
 usb_modeswitch-2.5.1-1.el7.x86_64
 usb_modeswitch-data-20170806-1.el7.noarch
 usermode-1.111-5.el7.x86_64
 ustr-1.0.4-16.el7.x86_64
-util-linux-2.23.2-52.el7.x86_64
+util-linux-2.23.2-52.el7_5.1.x86_64
 vim-common-7.4.160-4.el7.x86_64
 vim-enhanced-7.4.160-4.el7.x86_64
 vim-filesystem-7.4.160-4.el7.x86_64

Comment 16 Joe Rafaniello 2018-08-23 18:05:39 UTC
Hi Ievgen, I put the rpm changes between 5.9.4.3 and 5.9.4.4 in comment #15. 

If you can easily redeploy and confirm 5.9.4.3 works now and 5.9.4.4 fails, maybe you can try three different tests using 3 copies of the working 5.9.4.3 appliances:
1) yum update 5.9.4.3 for all non-"cfme*" packages
2) yum update 5.9.4.3 for only "cfme*" packages
3) yum update everything

It's interesting that systemd series of packages were changed.  I'm not sure how to figure out what might be causing this problem other than the tests above.

Comment 18 Joe Rafaniello 2018-10-10 19:36:33 UTC
Hi Ievgen,

Did you try the tests described in comment #16?

Thanks,
joe

Comment 20 Joe Rafaniello 2018-10-16 14:25:10 UTC
Hi Ievgen,

I think it's possible there's a bug in httpd or one of the modules we're using in those test but unfortunately, it doesn't look to be very easy to reproduce.  Frequently restarting apache is not something we do in production a lot so I don't there's much risk is closing this until we hit it somewhere else.

Thank you for trying to reproduce this.
Joe


Note You need to log in before you can comment on or make changes to this bug.