Red Hat Bugzilla – Bug 1337123
EVM systemd restart/start fails in certain conditions
Last modified: 2016-07-12 11:17:41 EDT
Description of problem:
EVM systemd restart fails in certain conditions

Version-Release number of selected component (if applicable):
5.6.0.7

How reproducible:
100%

Steps to Reproduce:
1. killall -9 ruby (simulating some kind of catastrophic failure)
2. service evmserverd restart OR service evmserverd start (as start invokes stop)

Actual results:
evmserverd never starts up

Expected results:
evmserverd should start up

Additional info:
Here's the breakdown. I noticed that during the restart, the stop portion was hanging, so I tried replicating this manually. Sure enough, rake evm:stop yielded a hanging process, so I ran strace and this is what I found:

[root@host-192-168-100-170 ~]# strace -p 17340 -f
Process 17340 attached with 2 threads
[pid 17340] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 17341] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 17340] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 55, {5010, 72913660}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 57, {5020, 77724526}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] brk(0) = 0xc033000
[pid 17340] brk(0xc054000) = 0xc054000
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 59, {5030, 81315516}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])

There is polling happening on fd 21 from pid 17340. Looking at /proc/<pid>/fd we get this:

[root@host-192-168-100-170 fd]# ls -la
total 0
dr-x------. 2 root root  0 May 18 05:57 .
dr-xr-xr-x. 9 root root  0 May 18 05:56 ..
lrwx------. 1 root root 64 May 18 06:01 0 -> /dev/pts/1
lrwx------. 1 root root 64 May 18 06:01 1 -> /dev/pts/1
l-wx------. 1 root root 64 May 18 06:01 10 -> /var/www/miq/vmdb/log/fog.log
l-wx------. 1 root root 64 May 18 06:01 11 -> /var/www/miq/vmdb/log/policy.log
l-wx------. 1 root root 64 May 18 06:01 12 -> /var/www/miq/vmdb/log/vim.log
l-wx------. 1 root root 64 May 18 06:01 13 -> /var/www/miq/vmdb/log/rhevm.log
l-wx------. 1 root root 64 May 18 06:01 14 -> /var/www/miq/vmdb/log/aws.log
l-wx------. 1 root root 64 May 18 06:01 15 -> /var/www/miq/vmdb/log/kubernetes.log
l-wx------. 1 root root 64 May 18 06:01 16 -> /var/www/miq/vmdb/log/middleware.log
l-wx------. 1 root root 64 May 18 06:01 17 -> /var/www/miq/vmdb/log/scvmm.log
l-wx------. 1 root root 64 May 18 06:01 18 -> /var/www/miq/vmdb/log/api.log
l-wx------. 1 root root 64 May 18 06:01 19 -> /var/www/miq/vmdb/log/websocket.log
lrwx------. 1 root root 64 May 18 05:57 2 -> /dev/pts/1
l-wx------. 1 root root 64 May 18 06:01 20 -> /var/www/miq/vmdb/log/automation.log
lrwx------. 1 root root 64 May 18 06:01 21 -> socket:[58500]
l-wx------. 1 root root 64 May 18 06:01 22 -> /dev/null
lrwx------. 1 root root 64 May 18 06:01 23 -> socket:[51839]
lr-x------. 1 root root 64 May 18 06:01 3 -> pipe:[58488]
l-wx------. 1 root root 64 May 18 06:01 4 -> pipe:[58488]
lr-x------. 1 root root 64 May 18 06:01 5 -> pipe:[58489]
l-wx------. 1 root root 64 May 18 06:01 6 -> pipe:[58489]
l-wx------. 1 root root 64 May 18 06:01 7 -> /var/www/miq/vmdb/log/evm.log
l-wx------. 1 root root 64 May 18 06:01 8 -> /var/www/miq/vmdb/log/production.log
l-wx------. 1 root root 64 May 18 06:01 9 -> /var/www/miq/vmdb/log/audit.log

So one end of the 58500 socket is this process; the other end is something else:

[root@host-192-168-100-170 fd]# lsof | grep 5850
ruby      17340        root     21u unix 0xffff88023ae4cf00 0t0 58500 socket
ruby-time 17340 17341  root     21u unix 0xffff88023ae4cf00 0t0 58500 socket
postgres  17350        postgres 11u unix 0xffff88023ae4d680 0t0 58501 /var/run/postgresql/.s.PGSQL.5432

This confirms our ruby process at one end, and at the other end (since the inode numbers of a connected socket pair are generally allocated consecutively, here 58500/58501) we find postgres. As soon as I shut down postgres, the whole process kicked off fine. The problem, however, is that we never initiated the shutdown of postgres during the stop, possibly because we assume it is already going down since evmserverd is in a stopped state.
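(For anyone scripting this diagnosis: a small Ruby sketch of the fd-to-socket-inode lookup done by hand above with ls -la. The pid/fd values are the ones from this report; the snippet is illustrative only, not ManageIQ code.)

# Illustrative only: resolve a process's fd to its socket inode,
# the same lookup done above via /proc/<pid>/fd.
pid, fd = 17340, 21
link = File.readlink("/proc/#{pid}/fd/#{fd}")   # e.g. "socket:[58500]"
if (inode = link[/\Asocket:\[(\d+)\]\z/, 1])
  puts "fd #{fd} of pid #{pid} is socket inode #{inode}"
end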
As noted in the bug title, but not in the bug description, this affects start as well as restart.
Pete, thanks for digging into this. Can you confirm all of the workers have exited after killall -9 ruby and before you call start/restart? I ask because the web-server-based workers (UI, web services, and web socket) all use puma, and I'm not sure the killall -9 ruby would kill them. It should, but who knows. I also ask because we handle non-SIGKILL interrupts (SIGTERM/SIGINT/etc.) in our non-puma server and worker processes, but the puma workers use puma's interrupt handler. Can you get a system into this state (hanging during 'stop') for us to look at? Thanks!
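(For context on the interrupt-handling point above, a minimal Ruby sketch, not the actual ManageIQ or puma code, of the kind of TERM/INT trap a graceful shutdown relies on. SIGKILL cannot be trapped at all, so killall -9 bypasses any such handler and leaves artifacts like PID files behind.)

# Minimal sketch, not ManageIQ source: cooperative shutdown on TERM/INT.
shutdown = false
%w[TERM INT].each do |sig|
  Signal.trap(sig) { shutdown = true }  # trap handlers stay tiny and async-safe
end
sleep 1 until shutdown   # stand-in for the worker's real run loop
# cleanup (e.g. removing the PID file) happens here -- but never after SIGKILL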
From here it looks like this:

● evmserverd.service - EVM server daemon
   Loaded: loaded (/usr/lib/systemd/system/evmserverd.service; enabled; vendor preset: disabled)
   Active: deactivating (stop) since Wed 2016-05-18 06:50:42 EDT; 3h 54min ago
  Process: 31316 ExecStart=/bin/sh -c /bin/evmserver.sh start (code=exited, status=0/SUCCESS)
 Main PID: 20133 (code=killed, signal=KILL); Control: 31352 (evmserver.sh)
   CGroup: /system.slice/evmserverd.service
           └─control
             ├─31352 /bin/bash /bin/evmserver.sh stop
             └─31360 ruby /opt/rh/cfme-gemset/bin/rake evm:stop

All server processes seem stopped.
No puma processes
So I think the main issue here is that systemd should be starting the process back up if it exits uncleanly (i.e. not through `systemctl stop evmserverd`).

It looks like systemd is also trying to run stop when start fails, which it does because the PID file is still present after the kill. That part is expected, but because of a bug here (https://github.com/ManageIQ/manageiq/blob/master/lib/pid_file.rb#L37) we do not recognize that the process is no longer running, and we refuse to start the server.

Changing that line to check for .blank? rather than .nil? will allow systemd to start the process successfully. Will have a PR shortly.
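To make the failure mode concrete, here is a hedged sketch of the check. The method bodies below are assumptions for illustration; only the names PidFile#running? and MiqProcess.command_line come from the code linked above.

require "active_support/core_ext/object/blank"

# Stand-in for ManageIQ's MiqProcess.command_line, which (per the fix's
# commit message, since #6305) returns "" rather than nil for a dead pid.
module MiqProcess
  def self.command_line(pid)
    File.read("/proc/#{pid}/cmdline").tr("\0", " ").strip
  rescue Errno::ENOENT
    ""   # no such process: empty string, NOT nil
  end
end

class PidFile
  def initialize(path)
    @path = path
  end

  def pid
    File.read(@path).to_i if File.exist?(@path)
  end

  def running?
    return false if pid.nil?
    command_line = MiqProcess.command_line(pid)
    # Buggy check: `!command_line.nil?` is true even for "", so a stale
    # PID file made a dead server look alive and blocked startup.
    # Fixed check: blank? is true for both nil and "".
    !command_line.blank?
  end
end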
https://github.com/ManageIQ/manageiq/pull/8793
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/10c7117924120fcba2ffda903410550ad1bc25f9

commit 10c7117924120fcba2ffda903410550ad1bc25f9
Author:     Nick Carboni <ncarboni@redhat.com>
AuthorDate: Wed May 18 13:33:47 2016 -0400
Commit:     Nick Carboni <ncarboni@redhat.com>
CommitDate: Wed May 18 13:33:47 2016 -0400

    Check command_line return value for blank? rather than nil?

    MiqProcess.command_line was changed to return an empty string rather
    than nil in #6305. This means that PidFile.running? would return true
    even if a process with the given pid was not found. This was
    preventing us from starting up the server after an unclean shutdown
    because the pid file would still exist.

    https://bugzilla.redhat.com/show_bug.cgi?id=1337123

 lib/pid_file.rb           | 2 +-
 spec/lib/pid_file_spec.rb | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)
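(For reference, the predicate semantics that make this one-line fix sufficient; blank? comes from ActiveSupport, which ManageIQ already loads via Rails:)

require "active_support/core_ext/object/blank"

"".nil?     # => false  (why the old check still saw a "running" process)
"".blank?   # => true   (why the new check detects the dead pid)
nil.blank?  # => true   (the fix also covers the old nil return value)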
Verified in 5.6.0.8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348