Bug 1337123 - EVM systemd restart/start fails in certain conditions
Status: CLOSED ERRATA
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.6.0
Assigned To: Joe Rafaniello
QA Contact: luke couzens
Whiteboard: appliance
Reported: 2016-05-18 06:51 EDT by Pete Savage
Modified: 2016-07-12 11:17 EDT
CC List: 8 users

Fixed In Version: 5.6.0.8
Doc Type: If docs needed, set a value
Last Closed: 2016-06-29 12:03:53 EDT
Type: Bug


External Trackers
Tracker ID: Red Hat Product Errata RHBA-2016:1348
Priority: normal
Status: SHIPPED_LIVE
Summary: CFME 5.6.0 bug fixes and enhancement update
Last Updated: 2016-06-29 14:50:04 EDT

Description Pete Savage 2016-05-18 06:51:28 EDT
Description of problem: EVM systemd restart fails in certain conditions


Version-Release number of selected component (if applicable): 5.6.0.7


How reproducible: 100%


Steps to Reproduce:
1. killall -9 ruby (simulating some kind of catastrophic failure)
2. service evmserverd restart OR service evmserverd start (as start invokes stop)

Actual results: evmserverd never starts up


Expected results: evmserverd should start up


Additional info:
Here's the breakdown:

I noticed that during the restart, the stop portion was hanging, so I tried replicating it manually. Sure enough,

rake evm:stop

yielded a hanging process, so I ran strace, and this is what I found:

[root@host-192-168-100-170 ~]# strace -p 17340 -f
Process 17340 attached with 2 threads
[pid 17340] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 17341] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 17340] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 55, {5010, 72913660}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 57, {5020, 77724526}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] brk(0)                      = 0xc033000
[pid 17340] brk(0xc054000)              = 0xc054000
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])
[pid 17340] recvfrom(21, "2\0\0\0\4T\0\0\4w\0&id\0\0\0F\324\0\1\0\0\0\24\0\10\377\377\377\377\0"..., 16384, 0, NULL, NULL) = 1635
[pid 17340] futex(0x11726a4, FUTEX_WAIT_BITSET_PRIVATE, 59, {5030, 81315516}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 17340] futex(0x1172718, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17340] sendto(21, "B\0\0\0\36\0a4\0\0\2\0\0\0\0\0\2\0\0\0\0011\0\0\0\0011\0\1\0\0D"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
[pid 17340] poll([{fd=21, events=POLLIN|POLLERR}], 1, 4294967295) = 1 ([{fd=21, revents=POLLIN}])

There is polling happening on fd 21 from pid 17340.

Looking at /proc/<pid>/fd, we get this:

[root@host-192-168-100-170 fd]# ls -la
total 0
dr-x------. 2 root root  0 May 18 05:57 .
dr-xr-xr-x. 9 root root  0 May 18 05:56 ..
lrwx------. 1 root root 64 May 18 06:01 0 -> /dev/pts/1
lrwx------. 1 root root 64 May 18 06:01 1 -> /dev/pts/1
l-wx------. 1 root root 64 May 18 06:01 10 -> /var/www/miq/vmdb/log/fog.log
l-wx------. 1 root root 64 May 18 06:01 11 -> /var/www/miq/vmdb/log/policy.log
l-wx------. 1 root root 64 May 18 06:01 12 -> /var/www/miq/vmdb/log/vim.log
l-wx------. 1 root root 64 May 18 06:01 13 -> /var/www/miq/vmdb/log/rhevm.log
l-wx------. 1 root root 64 May 18 06:01 14 -> /var/www/miq/vmdb/log/aws.log
l-wx------. 1 root root 64 May 18 06:01 15 -> /var/www/miq/vmdb/log/kubernetes.log
l-wx------. 1 root root 64 May 18 06:01 16 -> /var/www/miq/vmdb/log/middleware.log
l-wx------. 1 root root 64 May 18 06:01 17 -> /var/www/miq/vmdb/log/scvmm.log
l-wx------. 1 root root 64 May 18 06:01 18 -> /var/www/miq/vmdb/log/api.log
l-wx------. 1 root root 64 May 18 06:01 19 -> /var/www/miq/vmdb/log/websocket.log
lrwx------. 1 root root 64 May 18 05:57 2 -> /dev/pts/1
l-wx------. 1 root root 64 May 18 06:01 20 -> /var/www/miq/vmdb/log/automation.log
lrwx------. 1 root root 64 May 18 06:01 21 -> socket:[58500]
l-wx------. 1 root root 64 May 18 06:01 22 -> /dev/null
lrwx------. 1 root root 64 May 18 06:01 23 -> socket:[51839]
lr-x------. 1 root root 64 May 18 06:01 3 -> pipe:[58488]
l-wx------. 1 root root 64 May 18 06:01 4 -> pipe:[58488]
lr-x------. 1 root root 64 May 18 06:01 5 -> pipe:[58489]
l-wx------. 1 root root 64 May 18 06:01 6 -> pipe:[58489]
l-wx------. 1 root root 64 May 18 06:01 7 -> /var/www/miq/vmdb/log/evm.log
l-wx------. 1 root root 64 May 18 06:01 8 -> /var/www/miq/vmdb/log/production.log
l-wx------. 1 root root 64 May 18 06:01 9 -> /var/www/miq/vmdb/log/audit.log

So one end of the 58500 socket is this process; the other end is something else:

[root@host-192-168-100-170 fd]# lsof | grep 5850
ruby      17340            root   21u     unix 0xffff88023ae4cf00       0t0      58500 socket
ruby-time 17340 17341      root   21u     unix 0xffff88023ae4cf00       0t0      58500 socket
postgres  17350        postgres   11u     unix 0xffff88023ae4d680       0t0      58501 /var/run/postgresql/.s.PGSQL.5432


This confirmed our ruby process at one end, and at the other end (as socket inode numbers are generally incremented by one) we find postgres.
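
(As an aside, a quicker way to map a unix socket to its peer, assuming the ss utility from iproute2 is available on the appliance:

  ss -xp | grep 58500

This lists both endpoints of the socket with their owning processes, without relying on the inode numbering.)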

As soon as I shut down postgres, the whole process kicked off fine. However, the problem is that we didn't initiate the shutdown of postgres during the stop, possibly because we think it's already going down since evmserverd is in a stopped state.
Comment 2 Pete Savage 2016-05-18 06:53:04 EDT
As noted in the bug title, but not in the bug description, this affects start as well as restart.
Comment 3 Joe Rafaniello 2016-05-18 09:52:58 EDT
Pete, thanks for digging into this.  Can you confirm all of the workers have exited after killall -9 ruby and before you call start/restart?  I ask because the web-server-based workers (UI, web services, and web socket) all use puma, and I'm not sure the killall -9 ruby would kill them.  They should, but who knows.
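
(One way to confirm, assuming pgrep is available on the appliance:

  pgrep -fl 'puma|ruby'

An empty result would mean no ruby or puma workers survived the killall.)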

I also ask because we handle non-SIGKILL interrupts (SIGTERM/SIGINT/etc.) in our non-puma server and worker processes, but these puma workers use puma's interrupt handler.

Can you get a system in this state (hanging doing 'stop') for us to look at?

Thanks!
Comment 4 Pete Savage 2016-05-18 10:44:08 EDT
From here it looks like this:

● evmserverd.service - EVM server daemon
   Loaded: loaded (/usr/lib/systemd/system/evmserverd.service; enabled; vendor preset: disabled)
   Active: deactivating (stop) since Wed 2016-05-18 06:50:42 EDT; 3h 54min ago
  Process: 31316 ExecStart=/bin/sh -c /bin/evmserver.sh start (code=exited, status=0/SUCCESS)
 Main PID: 20133 (code=killed, signal=KILL)
  Control: 31352 (evmserver.sh)
   CGroup: /system.slice/evmserverd.service
           └─control
             ├─31352 /bin/bash /bin/evmserver.sh stop
             └─31360 ruby /opt/rh/cfme-gemset/bin/rake evm:stop

All server processes seem stopped
Comment 5 Pete Savage 2016-05-18 10:44:38 EDT
No puma processes
Comment 6 Nick Carboni 2016-05-18 13:17:22 EDT
So I think the main issue here is that systemd should be starting the process back up if it exits uncleanly (i.e., not through `systemctl stop evmserverd`).
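
(Illustrative only, not the shipped unit file: restart-on-unclean-exit behaviour is controlled by systemd's Restart= directive, for example

  [Service]
  Restart=on-failure

which makes systemd restart the service whenever it exits with a non-zero status or is killed by a signal.)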

It looks like systemd is also trying to run stop when start fails, which it is doing because the PID file is still present after the kill.

This is expected, but because of a bug here (https://github.com/ManageIQ/manageiq/blob/master/lib/pid_file.rb#L37) we are not recognizing that the process is no longer running and are refusing to start the server.

Changing that line to check for .blank? rather than .nil? will allow systemd to start the process successfully.
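
(A minimal sketch of the check being described; the method and variable names here are assumptions, not copied from the repo:

  # Hypothetical reconstruction of the PidFile#running? check.
  # MiqProcess.command_line now returns "" (not nil) when the pid
  # is gone, so an empty command line must count as "not running":
  def running?
    command_line = MiqProcess.command_line(pid)
    return false if command_line.blank?   # was: command_line.nil?
    true
  end

The actual one-line fix is in the PR below.)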

Will have a PR shortly.
Comment 7 Nick Carboni 2016-05-18 13:39:28 EDT
https://github.com/ManageIQ/manageiq/pull/8793
Comment 8 CFME Bot 2016-05-18 16:36:34 EDT
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/10c7117924120fcba2ffda903410550ad1bc25f9

commit 10c7117924120fcba2ffda903410550ad1bc25f9
Author:     Nick Carboni <ncarboni@redhat.com>
AuthorDate: Wed May 18 13:33:47 2016 -0400
Commit:     Nick Carboni <ncarboni@redhat.com>
CommitDate: Wed May 18 13:33:47 2016 -0400

    Check command_line return value for blank? rather than nil?
    
    MiqProcess.command_line was changed to return an empty string
    rather than nil in #6305
    
    This means that PidFile.running? would return true even if a
    process with the given pid was not found. This was preventing
    us from starting up the server after an unclean shutdown because
    the pid file would still exist.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1337123

 lib/pid_file.rb           | 2 +-
 spec/lib/pid_file_spec.rb | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)
Comment 9 luke couzens 2016-06-02 06:07:01 EDT
Verified in 5.6.0.8
Comment 11 errata-xmlrpc 2016-06-29 12:03:53 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348
