Created attachment 1473553 [details] appliance logs Description of problem: Some 5.10 appliances die during auto tests. When I logged into such appliance, I found that apache was dead and its log contained the following errors [Sat Aug 04 04:45:05.588887 2018] [core:notice] [pid 65345] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND' [Sat Aug 04 04:49:52.518235 2018] [mpm_prefork:notice] [pid 65345] AH00170: caught SIGWINCH, shutting down gracefully [Sat Aug 04 04:51:22.778845 2018] [core:notice] [pid 2255] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0 [Sat Aug 04 04:51:22.780170 2018] [core:emerg] [pid 2255] (28)No space left on device: AH00023: Couldn't create the rewrite-map mutex AH00016: Configuration Failed [Sat Aug 04 04:52:44.621688 2018] [core:notice] [pid 2740] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0 [Sat Aug 04 04:52:44.623020 2018] [core:emerg] [pid 2740] (28)No space left on device: AH00023: Couldn't create the rewrite-map mutex AH00016: Configuration Failed There was enough disk space everywhere in appliance. When I checked ipc, I found that semaphore arrays max value -> 128 And all of them are used by apache and postgre [root@dhcp-8-198-209 apache]# ipcs -l ------ Semaphore Limits -------- max number of arrays = 128 max semaphores per array = 250 max semaphores system wide = 32000 max ops per semop call = 32 semaphore max value = 32767 [root@dhcp-8-198-209 apache]# ipcs -s | wc -l 132 So, httpd couldn't start because system ran out of sem arrays and httpd didn't clean up its sem arrays on exit/error. Version-Release number of selected component (if applicable): 5.10.0.8 How reproducible: 20%, it happens frequently in our auto tests. I cannot provide exact steps atm but I'm still figuring out this. I can provide appliance with such issue. Steps to Reproduce: 1. in progress Actual results: appliance isn't available via https. all semaphore arrays are used. so, it seems appliance either has too small sem array value or httpd has a leak of semaphores. Expected results: no such issue Additional info: I cured this issue by increasing number of sem arrays. However, number of sem arrays didn't decrease after some time. So, I assume apache doesn't take care of them. Logs and some collected info are attached.
Created attachment 1473555 [details] some evidence from appliance
Hi Ievgen, Interesting... we've seen this problem in the past where we were accidentally doing kill -9 on httpd. See [1]. I'm not sure how to troubleshoot this but the first thing to try is look at the journalctl for httpd, and see if the process is being abnormally terminated. I guess you could also setup a cron job to write the count of semaphores every minute or 5 minutes and see if those line up with anything that happens in journalctl, /var/log/message, or our evm.log. Let me know what you find. Thanks! [1] https://github.com/ManageIQ/manageiq/pull/9236
Hi Ievgen, it looks like httpd was asked to stop and after 90 seconds, systemd kill -9 it. Perhaps the httpd logs would help let us know what's happening for 90 seconds to know why it's not exiting in time.
Hi Joe, yes, I also think so. I looked at httpd logs in both /var/log and vmdb/log/apache but those didn't shed a light on the reason for such latency. You can also check logs. I attached those above when created BZ.
in bug #1434916, they had similar symptoms where a shutdown of httpd was timing out. There was a bug in the mod_http2 module so they disabled that module to troubleshoot it. I guess you could try disabling modules one at a time just to see if you can shut down faster. Another much older bug #906321, had similar symptoms but is a much older version. I looked at your logs and like you said, there isn't a clear reason why it's failing to shutdown within the timeout.
Hi Joe, Now we have this issue in 5.9.4.4 as well. Can I see what has been merged since 5.9.4.3 ? Btw, this issue appears only in tests which setup MiqLDAP and restart httpd. P.S. Couldn't we just make httpd unit remove its stale semathores in ExecStartPre ?
Is this container or appliance, or both? What exactly do you do to make this happen? do you have an ipadress of a machine I can ssh into to recreate it and investigate it? thanks!
1. We see this issue only in regular appliance. 2. It happens only when we run tests in jenkins. in addition, we encounter it only during MiqLdap auth tests, which restart httpd after changing auth settings. I cannot reproduce it locally. so, I suppose it is somehow related to performance. 3. I cannot give you ip address right now because appliances used by tests have already been killed. I can provide you access to next such appliance when we encounter such issue again. I have to confess, there wasn't much info in appliance. all logs and etc were attached to BZ. In addition, I couldn't reproduce the issue locally using the same set of tests and the same appliance.
I know we've already covered much of this, but https://access.redhat.com/solutions/55914 says, this happens because: 1) KILL signal... 2) a module is using too many semaphores From your logs, it looks like it's 1, and coming from systemd. For whatever reason, httpd stopping doesn't finish in 90 seconds and gets killed. Maybe you can turn up httpd logging and experiment with disabling some of the apache modules? Perhaps some of the auth/pam/saml modules are causing it to take to long to shut down? It's interesting you mention the MiqLdap auth tests. Try LogLevel debug, recreate it as is. Collect the logs, then try disabling some of the modules. Can you try to do a subset of tests? Can you do only the MiqLdap and a small number of other tests? Maybe we can eliminate tests until we get a smaller recreation step. Could you try older versions of 5.9? Maybe we can bisect this down to comparing package lists using rpm -qa |sort ?
Ievgen, I had a great suggestion by JoeV. Basically, all MiqLdap tests should not need any apache auth modules. So there's two things to try: 1) bisect which cfme versions have this problem... run the whole suite and figure out which versions of 5.9 were ok (if any) and when did it start failing... perhaps we can get a rpm list to compare and see what might have introduced the problem 2) Try to limit the test suite to narrow the problem scope. For example, are you doing external auth tests and miq ldap tests? Are they happening in the same sequence each time? Is miqldap always first? Is external auth always second? Are you using the same test setup for miqldap as you're using for external auth? Does it fail when you only run miqldap? Or only run external auth tests? JoeV said you should be able to run the miqldap tests alone without any of the auth modules for apache... so if you can recreate it on just the miqldap tests, try eliminating apache modules one at a time and retry.
1. I already tried to use a bunch of appliances with httpd LogLevel = trace6 two times but didn't manage to catch that issue. I'll try once more tomorrow. 2. 5.9.4.3 didn't have such issue. It appeared only in 5.9.4.4. So, as I asked above, could you help to find out what was changed since 5.9.4.3 ? 3. I already narrowed out which tests caused such issue but I still cannot reproduce this issue locally using those tests. I'll try to patch appliance templates which are used in automated testing to provide trace logs for httpd.
Ievgen, here's the package differences from 5.9.4.3 to 5.9.4.4. I had to manually deploy two ovas, boot them, and run rpm -qa | sort and diff them locally: diff --git a/5.9.4.3.txt b/5.9.4.4.txt index 50823af8d7..5af4a142ff 100644 --- a/5.9.4.3.txt +++ b/5.9.4.4.txt @@ -9,9 +9,9 @@ apr-1.4.8-3.el7_4.1.x86_64 apr-util-1.5.2-6.el7.x86_64 at-3.1.13-23.el7.x86_64 attr-2.4.46-13.el7.x86_64 -audit-2.8.1-3.el7.x86_64 -audit-libs-2.8.1-3.el7.x86_64 -audit-libs-python-2.8.1-3.el7.x86_64 +audit-2.8.1-3.el7_5.1.x86_64 +audit-libs-2.8.1-3.el7_5.1.x86_64 +audit-libs-python-2.8.1-3.el7_5.1.x86_64 authconfig-6.2.8-30.el7.x86_64 autofs-5.0.7-83.el7.x86_64 autogen-libopts-5.18-5.el7.x86_64 @@ -32,11 +32,11 @@ bzip2-libs-1.0.6-13.el7.x86_64 ca-certificates-2018.2.22-70.0.el7_5.noarch c-ares-1.10.0-3.el7.x86_64 certmonger-0.78.4-3.el7_5.1.x86_64 -cfme-5.9.4.3-1.el7cf.x86_64 -cfme-appliance-5.9.4.3-1.el7cf.x86_64 -cfme-appliance-common-5.9.4.3-1.el7cf.x86_64 -cfme-appliance-tools-5.9.4.3-1.el7cf.x86_64 -cfme-gemset-5.9.4.3-1.el7cf.x86_64 +cfme-5.9.4.4-1.el7cf.x86_64 +cfme-appliance-5.9.4.4-1.el7cf.x86_64 +cfme-appliance-common-5.9.4.4-1.el7cf.x86_64 +cfme-appliance-tools-5.9.4.4-1.el7cf.x86_64 +cfme-gemset-5.9.4.4-1.el7cf.x86_64 checkpolicy-2.5-6.el7.x86_64 chkconfig-1.7.4-1.el7.x86_64 chrony-3.2-2.el7.x86_64 @@ -44,7 +44,7 @@ cifs-utils-6.2-10.el7.x86_64 cjkuni-ukai-fonts-0.2.20080216.1-51.el7.noarch cjkuni-uming-fonts-0.2.20080216.1-53.el7.noarch cloud-init-0.7.9-24.el7_5.1.x86_64 -cockpit-dashboard-169-1.el7.x86_64 +cockpit-dashboard-172-2.el7.x86_64 cockpit-ws-154-3.el7.x86_64 coreutils-8.22-21.el7.x86_64 cpio-2.11-27.el7.x86_64 @@ -76,9 +76,9 @@ dhcp-common-4.2.5-68.el7_5.1.x86_64 dhcp-libs-4.2.5-68.el7_5.1.x86_64 diffutils-3.3-4.el7.x86_64 dmidecode-3.0-5.el7.x86_64 -dracut-033-535.el7.x86_64 -dracut-config-rescue-033-535.el7.x86_64 -dracut-network-033-535.el7.x86_64 +dracut-033-535.el7_5.1.x86_64 +dracut-config-rescue-033-535.el7_5.1.x86_64 +dracut-network-033-535.el7_5.1.x86_64 e2fsprogs-1.42.9-12.el7_5.x86_64 e2fsprogs-libs-1.42.9-12.el7_5.x86_64 ebtables-2.0.10-16.el7.x86_64 @@ -143,7 +143,7 @@ httpd-tools-2.4.6-80.el7_5.1.x86_64 http-parser-2.7.1-5.el7_4.x86_64 hwdata-0.252-8.8.el7.x86_64 info-5.1-5.el7.x86_64 -initscripts-9.49.41-1.el7.x86_64 +initscripts-9.49.41-1.el7_5.1.x86_64 ipa-client-4.5.4-10.el7_5.3.x86_64 ipa-client-common-4.5.4-10.el7_5.3.noarch ipa-common-4.5.4-10.el7_5.3.noarch @@ -161,15 +161,15 @@ json-glib-1.2.6-1.el7.x86_64 kbd-1.15.5-13.el7.x86_64 kbd-legacy-1.15.5-13.el7.noarch kbd-misc-1.15.5-13.el7.noarch -kernel-3.10.0-862.9.1.el7.x86_64 -kernel-tools-3.10.0-862.9.1.el7.x86_64 -kernel-tools-libs-3.10.0-862.9.1.el7.x86_64 -kexec-tools-2.0.15-13.el7.x86_64 +kernel-3.10.0-862.11.6.el7.x86_64 +kernel-tools-3.10.0-862.11.6.el7.x86_64 +kernel-tools-libs-3.10.0-862.11.6.el7.x86_64 +kexec-tools-2.0.15-13.el7_5.1.x86_64 keyutils-1.5.8-3.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 kmod-20-21.el7.x86_64 kmod-libs-20-21.el7.x86_64 -kpartx-0.4.9-119.el7.x86_64 +kpartx-0.4.9-119.el7_5.1.x86_64 krb5-libs-1.15.1-19.el7.x86_64 krb5-workstation-1.15.1-19.el7.x86_64 lasso-2.5.1-2.el7.x86_64 @@ -179,7 +179,7 @@ libaio-0.3.109-13.el7.x86_64 libassuan-2.1.0-3.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libbasicobjects-0.1.1-29.el7.x86_64 -libblkid-2.23.2-52.el7.x86_64 +libblkid-2.23.2-52.el7_5.1.x86_64 libcap-2.22-9.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcgroup-0.41-15.el7.x86_64 @@ -207,7 +207,7 @@ libgcrypt-1.5.3-14.el7.x86_64 libgnome-keyring-3.12.0-1.el7.x86_64 libgomp-4.8.5-28.el7_5.1.x86_64 libgpg-error-1.12-3.el7.x86_64 -libgudev1-219-57.el7.x86_64 +libgudev1-219-57.el7_5.1.x86_64 libicu-50.1.2-15.el7.x86_64 libidn-1.28-4.el7.x86_64 libini_config-1.3.1-29.el7.x86_64 @@ -217,7 +217,7 @@ libkadm5-1.15.1-19.el7.x86_64 libldb-1.2.2-1.el7.x86_64 libmnl-1.0.3-7.el7.x86_64 libmodman-2.0.1-8.el7.x86_64 -libmount-2.23.2-52.el7.x86_64 +libmount-2.23.2-52.el7_5.1.x86_64 libmspack-0.5-0.5.alpha.el7.x86_64 libndp-1.2-7.el7.x86_64 libnetfilter_conntrack-1.0.6-1.el7_3.x86_64 @@ -242,7 +242,7 @@ libselinux-utils-2.5-12.el7.x86_64 libsemanage-2.5-11.el7.x86_64 libsemanage-python-2.5-11.el7.x86_64 libsepol-2.5-8.1.el7.x86_64 -libsmbclient-4.7.1-6.el7.x86_64 +libsmbclient-4.7.1-9.el7_5.x86_64 libss-1.42.9-12.el7_5.x86_64 libssh-0.7.1-3.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 @@ -264,10 +264,10 @@ libunistring-0.9.3-9.el7.x86_64 libusbx-1.0.21-1.el7.x86_64 libuser-0.60-9.el7.x86_64 libutempter-1.1.6-4.el7.x86_64 -libuuid-2.23.2-52.el7.x86_64 +libuuid-2.23.2-52.el7_5.1.x86_64 libverto-0.2.5-4.el7.x86_64 libverto-tevent-0.2.5-4.el7.x86_64 -libwbclient-4.7.1-6.el7.x86_64 +libwbclient-4.7.1-9.el7_5.x86_64 libX11-1.6.5-1.el7.x86_64 libX11-common-1.6.5-1.el7.noarch libXau-1.0.8-2.1.el7.x86_64 @@ -292,7 +292,7 @@ m2crypto-0.21.1-17.el7.x86_64 mailcap-2.1.41-2.el7.noarch make-3.82-23.el7.x86_64 man-db-2.6.3-9.el7.x86_64 -mariadb-libs-5.5.56-2.el7.x86_64 +mariadb-libs-5.5.60-1.el7_5.x86_64 memcached-1.4.15-10.el7_3.1.x86_64 mercurial-2.6.2-8.el7_4.x86_64 microcode_ctl-2.1-29.10.el7_5.x86_64 @@ -474,7 +474,7 @@ python-netifaces-0.10.4-3.el7.x86_64 python-nss-0.16.0-3.el7.x86_64 python-paramiko-2.1.1-4.el7.noarch python-passlib-1.6.5-2.el7.noarch -python-perf-3.10.0-862.9.1.el7.x86_64 +python-perf-3.10.0-862.11.6.el7.x86_64 python-ply-3.4-11.el7.noarch python-prettytable-0.7.2-3.el7.noarch python-pycparser-2.14-1.el7.noarch @@ -496,7 +496,7 @@ python-yubico-1.2.3-1.el7.noarch pyusb-1.0.0-0.11.b1.el7.noarch pyxattr-0.5.1-5.el7.x86_64 PyYAML-3.10-11.el7.x86_64 -qemu-guest-agent-2.8.0-2.el7.x86_64 +qemu-guest-agent-2.8.0-2.el7_5.1.x86_64 qpid-proton-c-0.19.0-1.el7cf.x86_64 qrencode-libs-3.4.1-3.el7.x86_64 quota-4.01-17.el7.x86_64 @@ -572,17 +572,17 @@ rubygem-rdoc-4.0.0-33.el7_4.noarch rubygems-2.0.14.1-33.el7_4.noarch ruby-irb-2.0.0.648-33.el7_4.noarch ruby-libs-2.0.0.648-33.el7_4.x86_64 -samba-client-libs-4.7.1-6.el7.x86_64 -samba-common-4.7.1-6.el7.noarch -samba-common-libs-4.7.1-6.el7.x86_64 -samba-common-tools-4.7.1-6.el7.x86_64 -samba-libs-4.7.1-6.el7.x86_64 +samba-client-libs-4.7.1-9.el7_5.x86_64 +samba-common-4.7.1-9.el7_5.noarch +samba-common-libs-4.7.1-9.el7_5.x86_64 +samba-common-tools-4.7.1-9.el7_5.x86_64 +samba-libs-4.7.1-9.el7_5.x86_64 scap-security-guide-0.1.36-9.el7_5.noarch scl-utils-20130529-18.el7_4.x86_64 screen-4.1.0-0.25.20120314git3c2946.el7.x86_64 sed-4.2.2-5.el7.x86_64 -selinux-policy-3.13.1-192.el7_5.4.noarch -selinux-policy-targeted-3.13.1-192.el7_5.4.noarch +selinux-policy-3.13.1-192.el7_5.6.noarch +selinux-policy-targeted-3.13.1-192.el7_5.6.noarch setools-libs-3.3.8-2.el7.x86_64 setserial-2.17-33.el7.x86_64 setup-2.8.71-9.el7.noarch @@ -615,9 +615,9 @@ subversion-1.7.14-14.el7.x86_64 subversion-libs-1.7.14-14.el7.x86_64 sudo-1.8.19p2-14.el7_5.x86_64 supervisor-3.1.4-1.el7.noarch -systemd-219-57.el7.x86_64 -systemd-libs-219-57.el7.x86_64 -systemd-sysv-219-57.el7.x86_64 +systemd-219-57.el7_5.1.x86_64 +systemd-libs-219-57.el7_5.1.x86_64 +systemd-sysv-219-57.el7_5.1.x86_64 sysvinit-tools-2.88-14.dsf.el7.x86_64 tar-1.26-34.el7.x86_64 tcp_wrappers-7.6-77.el7.x86_64 @@ -626,14 +626,14 @@ teamd-1.27-4.el7.x86_64 traceroute-2.0.22-2.el7.x86_64 tree-1.6.0-10.el7.x86_64 trousers-0.3.14-2.el7.x86_64 -tuned-2.9.0-1.el7.noarch +tuned-2.9.0-1.el7_5.1.noarch tzdata-2018e-3.el7.noarch unzip-6.0-19.el7.x86_64 usb_modeswitch-2.5.1-1.el7.x86_64 usb_modeswitch-data-20170806-1.el7.noarch usermode-1.111-5.el7.x86_64 ustr-1.0.4-16.el7.x86_64 -util-linux-2.23.2-52.el7.x86_64 +util-linux-2.23.2-52.el7_5.1.x86_64 vim-common-7.4.160-4.el7.x86_64 vim-enhanced-7.4.160-4.el7.x86_64 vim-filesystem-7.4.160-4.el7.x86_64
Hi Ievgen, I put the rpm changes between 5.9.4.3 and 5.9.4.4 in comment #15. If you can easily redeploy and confirm 5.9.4.3 works now and 5.9.4.4 fails, maybe you can try three different tests using 3 copies of the working 5.9.4.3 appliances: 1) yum update 5.9.4.3 for all non-"cfme*" packages 2) yum update 5.9.4.3 for only "cfme*" packages 3) yum update everything It's interesting that systemd series of packages were changed. I'm not sure how to figure out what might be causing this problem other than the tests above.
Hi Ievgen, Did you try the tests described in comment #16? Thanks, joe
Hi Ievgen, I think it's possible there's a bug in httpd or one of the modules we're using in those test but unfortunately, it doesn't look to be very easy to reproduce. Frequently restarting apache is not something we do in production a lot so I don't there's much risk is closing this until we hit it somewhere else. Thank you for trying to reproduce this. Joe