Red Hat Bugzilla – Bug 1337123
EVM systemd restart/start fails in certain conditions
Last modified: 2016-07-12 11:17:41 EDT
Description of problem:
EVM systemd restart fails in certain conditions

Version-Release number of selected component (if applicable):
5.6.0.7

How reproducible:
100%

Steps to Reproduce:
1. killall -9 ruby (simulating some kind of catastrophic failure)
2. service evmserverd restart OR service evmserverd start (as start invokes stop)

Actual results:
evmserverd never starts up

Expected results:
evmserverd should start up

Additional info:
Here's the breakdown. I noticed that during the restart, the stop portion was hanging, so I tried replicating this manually. Sure enough, rake evm:stop yielded a hanging process, so I ran strace and this is what I found:

[root@host-192-168-100-170 ~]# strace -p 17340 -f
Process 17340 attached with 2 threads
[pid 17340] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 17341] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 17340] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 55, {5010, 72913660}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 57, {5020, 77724526}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] brk(0) = 0xc033000
[pid 17340] brk(0xc054000) = 0xc054000
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 59, {5030, 81315516}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])

There is polling happening on fd 21 from pid 17340. Looking at /proc/<pid>/fd we get this:

[root@host-192-168-100-170 fd]# ls -la
total 0
dr-x------. 2 root root  0 May 18 05:57 .
dr-xr-xr-x. 9 root root  0 May 18 05:56 ..
lrwx------. 1 root root 64 May 18 06:01 0 -> /dev/pts/1
lrwx------. 1 root root 64 May 18 06:01 1 -> /dev/pts/1
l-wx------. 1 root root 64 May 18 06:01 10 -> /var/www/miq/vmdb/log/fog.log
l-wx------. 1 root root 64 May 18 06:01 11 -> /var/www/miq/vmdb/log/policy.log
l-wx------. 1 root root 64 May 18 06:01 12 -> /var/www/miq/vmdb/log/vim.log
l-wx------. 1 root root 64 May 18 06:01 13 -> /var/www/miq/vmdb/log/rhevm.log
l-wx------. 1 root root 64 May 18 06:01 14 -> /var/www/miq/vmdb/log/aws.log
l-wx------. 1 root root 64 May 18 06:01 15 -> /var/www/miq/vmdb/log/kubernetes.log
l-wx------. 1 root root 64 May 18 06:01 16 -> /var/www/miq/vmdb/log/middleware.log
l-wx------. 1 root root 64 May 18 06:01 17 -> /var/www/miq/vmdb/log/scvmm.log
l-wx------. 1 root root 64 May 18 06:01 18 -> /var/www/miq/vmdb/log/api.log
l-wx------. 1 root root 64 May 18 06:01 19 -> /var/www/miq/vmdb/log/websocket.log
lrwx------. 1 root root 64 May 18 05:57 2 -> /dev/pts/1
l-wx------. 1 root root 64 May 18 06:01 20 -> /var/www/miq/vmdb/log/automation.log
lrwx------. 1 root root 64 May 18 06:01 21 -> socket:[58500]
l-wx------. 1 root root 64 May 18 06:01 22 -> /dev/null
lrwx------. 1 root root 64 May 18 06:01 23 -> socket:[51839]
lr-x------. 1 root root 64 May 18 06:01 3 -> pipe:[58488]
l-wx------. 1 root root 64 May 18 06:01 4 -> pipe:[58488]
lr-x------. 1 root root 64 May 18 06:01 5 -> pipe:[58489]
l-wx------. 1 root root 64 May 18 06:01 6 -> pipe:[58489]
l-wx------. 1 root root 64 May 18 06:01 7 -> /var/www/miq/vmdb/log/evm.log
l-wx------. 1 root root 64 May 18 06:01 8 -> /var/www/miq/vmdb/log/production.log
l-wx------. 1 root root 64 May 18 06:01 9 -> /var/www/miq/vmdb/log/audit.log

So one end of the 58500 socket is this process; the other end is something else:

[root@host-192-168-100-170 fd]# lsof | grep 5850
ruby      17340        root     21u unix 0xffff88023ae4cf00 0t0 58500 socket
ruby-time 17340 17341  root     21u unix 0xffff88023ae4cf00 0t0 58500 socket
postgres  17350        postgres 11u unix 0xffff88023ae4d680 0t0 58501 /var/run/postgresql/.s.PGSQL.5432

This confirms our ruby process at one end, and at the other end (since the inode numbers of a connected socket pair are generally allocated consecutively, here 58500/58501) we find postgres. As soon as I shut down postgres, the whole process kicked off fine. The problem, however, is that we never initiated the shutdown of postgres during the stop, possibly because we assume it is already going down since evmserverd is in a stopped state.
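(For anyone scripting this diagnosis: a small Ruby sketch of the fd-to-socket-inode lookup done by hand above with ls -la. The pid/fd values are the ones from this report; the snippet is illustrative only, not ManageIQ code.)

# Illustrative only: resolve a process's fd to its socket inode,
# the same lookup done above via /proc/<pid>/fd.
pid, fd = 17340, 21
link = File.readlink("/proc/#{pid}/fd/#{fd}")   # e.g. "socket:[58500]"
if (inode = link[/\Asocket:\[(\d+)\]\z/, 1])
  puts "fd #{fd} of pid #{pid} is socket inode #{inode}"
end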
As noted in the bug title, but not in the bug description, this affects start as well as restart.
Pete, thanks for digging into this. Can you confirm all of the workers have exited after killall -9 ruby and before you call start/restart? I ask because the web-server-based workers (UI, web services, and web socket) all use puma, and I'm not sure the killall -9 ruby would kill them. It should, but who knows. I also ask because we handle non-SIGKILL interrupts (SIGTERM/SIGINT/etc.) in our non-puma server and worker processes, but the puma workers use puma's interrupt handler. Can you get a system into this state (hanging during 'stop') for us to look at? Thanks!
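(For context on the interrupt-handling point above, a minimal Ruby sketch, not the actual ManageIQ or puma code, of the kind of TERM/INT trap a graceful shutdown relies on. SIGKILL cannot be trapped at all, so killall -9 bypasses any such handler and leaves artifacts like PID files behind.)

# Minimal sketch, not ManageIQ source: cooperative shutdown on TERM/INT.
shutdown = false
%w[TERM INT].each do |sig|
  Signal.trap(sig) { shutdown = true }  # trap handlers stay tiny and async-safe
end
sleep 1 until shutdown   # stand-in for the worker's real run loop
# cleanup (e.g. removing the PID file) happens here -- but never after SIGKILL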
From here it looks like this:

● evmserverd.service - EVM server daemon
   Loaded: loaded (/usr/lib/systemd/system/evmserverd.service; enabled; vendor preset: disabled)
   Active: deactivating (stop) since Wed 2016-05-18 06:50:42 EDT; 3h 54min ago
  Process: 31316 ExecStart=/bin/sh -c /bin/evmserver.sh start (code=exited, status=0/SUCCESS)
 Main PID: 20133 (code=killed, signal=KILL); Control: 31352 (evmserver.sh)
   CGroup: /system.slice/evmserverd.service
           └─control
             ├─31352 /bin/bash /bin/evmserver.sh stop
             └─31360 ruby /opt/rh/cfme-gemset/bin/rake evm:stop

All server processes seem stopped.
No puma processes
So I think the main issue here is that systemd should be starting the process back up if it exits uncleanly (i.e. not through `systemctl stop evmserverd`).

It looks like systemd is also trying to run stop when start fails, which it does because the PID file is still present after the kill. That part is expected, but because of a bug here (https://github.com/ManageIQ/manageiq/blob/master/lib/pid_file.rb#L37) we do not recognize that the process is no longer running, and we refuse to start the server.

Changing that line to check for .blank? rather than .nil? will allow systemd to start the process successfully. Will have a PR shortly.
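To make the failure mode concrete, here is a hedged sketch of the check. The method bodies below are assumptions for illustration; only the names PidFile#running? and MiqProcess.command_line come from the code linked above.

require "active_support/core_ext/object/blank"

# Stand-in for ManageIQ's MiqProcess.command_line, which (per the fix's
# commit message, since #6305) returns "" rather than nil for a dead pid.
module MiqProcess
  def self.command_line(pid)
    File.read("/proc/#{pid}/cmdline").tr("\0", " ").strip
  rescue Errno::ENOENT
    ""   # no such process: empty string, NOT nil
  end
end

class PidFile
  def initialize(path)
    @path = path
  end

  def pid
    File.read(@path).to_i if File.exist?(@path)
  end

  def running?
    return false if pid.nil?
    command_line = MiqProcess.command_line(pid)
    # Buggy check: `!command_line.nil?` is true even for "", so a stale
    # PID file made a dead server look alive and blocked startup.
    # Fixed check: blank? is true for both nil and "".
    !command_line.blank?
  end
end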
https://github.com/ManageIQ/manageiq/pull/8793
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/10c7117924120fcba2ffda903410550ad1bc25f9

commit 10c7117924120fcba2ffda903410550ad1bc25f9
Author:     Nick Carboni <ncarboni@redhat.com>
AuthorDate: Wed May 18 13:33:47 2016 -0400
Commit:     Nick Carboni <ncarboni@redhat.com>
CommitDate: Wed May 18 13:33:47 2016 -0400

    Check command_line return value for blank? rather than nil?

    MiqProcess.command_line was changed to return an empty string rather
    than nil in #6305. This means that PidFile.running? would return true
    even if a process with the given pid was not found. This was
    preventing us from starting up the server after an unclean shutdown
    because the pid file would still exist.

    https://bugzilla.redhat.com/show_bug.cgi?id=1337123

 lib/pid_file.rb           | 2 +-
 spec/lib/pid_file_spec.rb | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)
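(For reference, the predicate semantics that make this one-line fix sufficient; blank? comes from ActiveSupport, which ManageIQ already loads via Rails:)

require "active_support/core_ext/object/blank"

"".nil?     # => false  (why the old check still saw a "running" process)
"".blank?   # => true   (why the new check detects the dead pid)
nil.blank?  # => true   (the fix also covers the old nil return value)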
Verified in 5.6.0.8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348