Bug 1394617

Summary: ovirt-engine does not shut down cleanly
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 4.0.4
Hardware: All
OS: Linux
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Marcus West <mwest>
Assignee: Juan Hernández <juan.hernandez>
QA Contact: Radim Hrazdil <rhrazdil>
CC: jaeshin, jmoon, juan.hernandez, lsurette, lsvaty, lveyde, mperina, msekleta, mwest, mzheng, rbalakri, Rhev-m-bugs, sabose, srevivo, systemd-maint-list, ylavi
Keywords: ZStream
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
oVirt Team: Infra
Doc Type: No Doc Update
Type: Bug
Bug Blocks: 1445733
Last Closed: 2018-05-15 17:38:43 UTC
Attachments:
  - sending SIGKILL after SIGTERM
  - exit success
  - console log for 4.1.0.4-0.1

Description Marcus West 2016-11-14 04:17:05 UTC
## Description of problem:

ovirt-engine does not shut down cleanly

## Version-Release number of selected component (if applicable):

ovirt-engine-4.0.4.4-0.1.el7ev.noarch
rhevm-4.0.4.4-0.1.el7ev.noarch

## How reproducible:

Reproducible in all the environments that I have tried so far

## Steps to Reproduce:
1. run 'systemctl stop ovirt-engine'

## Actual results:

systemctl status shows:

  Active: failed (Result: exit-code) since Mon 2016-11-14 14:10:44 AEST; 3s ago

and

  Nov 14 14:10:44 <hostname> ovirt-engine.py[4987]: 2016-11-14 14:10:44,618 ovirt-engine: ERROR run:532 Error: Had to kill process 5018


## Expected results:

Expect to see something along the lines of:

  Active: inactive

No reports of processes being killed abruptly

## Additional info:

This does not appear to be causing any other issues at the moment, but I am concerned that it has the potential to cause problems. For example, what would happen if ovirt-engine were abruptly killed in the middle of a lengthy transaction?

Comment 10 Jay Shin 2016-11-28 23:49:21 UTC
Created attachment 1225534 [details]
sending SIGKILL after SIGTERM

Comment 11 Jay Shin 2016-11-28 23:50:00 UTC
Created attachment 1225535 [details]
exit success

Comment 16 Sandro Bonazzola 2017-02-06 10:11:23 UTC
Lev, can you please update on this?

Comment 17 Lev Veyde 2017-02-06 16:19:37 UTC
(In reply to Sandro Bonazzola from comment #16)
> Lev, can you please update on this?

Yes, the issue is with the Java/JBoss ovirt-engine service, which doesn't handle SIGTERM properly and thus doesn't shut down in an orderly way.

What happens then is that the process is killed with SIGKILL after the default timeout of 10 seconds, hence the error message.

Moving the issue to infra.

Comment 18 Oved Ourfali 2017-02-06 16:36:34 UTC
Juan, can you take a look?

Comment 19 Oved Ourfali 2017-02-06 16:46:53 UTC
Marcus, does it reproduce for you in the latest 4.0? In 4.1?

Comment 23 Juan Hernández 2017-02-06 17:52:06 UTC
Our stop process first sends the SIGTERM signal to the application server. Then, if it doesn't finish within 10 seconds, it sends a SIGKILL. The initial SIGTERM should have triggered the execution of the shutdown hooks within the Java virtual machine, which should in turn run the application server shutdown logic. But I guess that the thread that runs that logic is blocked by something, probably by something that failed to do its own shutdown cleanly, like a scheduled task that failed because it can no longer connect to the database.
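
A minimal standalone sketch of that failure mode (illustrative only, not the actual ovirt-engine code): a shutdown hook that blocks on a monitor held by another thread keeps the JVM alive after SIGTERM, until something sends SIGKILL.

---8<---
public class ShutdownHookDemo {

    private static final Object LOCK = new Object();

    public static void main(String[] args) throws Exception {
        // A background task grabs a lock and never releases it, standing in
        // for a scheduled job stuck because its database connection is gone.
        Thread stuck = new Thread(() -> {
            synchronized (LOCK) {
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) {
                }
            }
        }, "stuck-task");
        stuck.setDaemon(true);
        stuck.start();
        Thread.sleep(200); // let the task acquire the lock first

        // The shutdown hook stands in for the application server's shutdown
        // logic: on SIGTERM it blocks trying to enter the same monitor, so
        // the JVM never exits on its own and systemd eventually sends SIGKILL.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            synchronized (LOCK) {
                System.out.println("clean shutdown");
            }
        }, "SIGTERM handler"));

        Thread.sleep(Long.MAX_VALUE);
    }
}
--->8---

Running this and sending it SIGTERM reproduces the pattern: the hook thread (here named "SIGTERM handler") stays blocked on the monitor and the process never exits on its own.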

To help diagnose this, it would be helpful to have a thread stack trace dump of the Java virtual machine, after sending the SIGTERM. To get it, please do the following:

1. Find the PID of the Java virtual machine:

  # pidof ovirt-engine

2. Manually send the SIGTERM signal to that process:

  # kill -SIGTERM the_pid

3. The process will most probably stay alive. Check that, then generate a thread stack dump by sending the SIGQUIT signal to the process:

  # kill -SIGQUIT the_pid

4. The output will be written to the /var/log/ovirt-engine/console.log file. It will most probably contain something like this:

  "SIGTERM handler" ...
    java.lang.Thread.State: WAITING (on object monitor)
    ...

5. Repeat the SIGQUIT command a couple of times, after a few seconds, to verify that the SIGTERM handler thread stays blocked. Then share the console.log file with us.
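
For reference, the same per-thread information that the SIGQUIT dump writes to console.log can also be captured from inside the JVM; this small sketch (illustrative only, not something that needs to be run for this bug) prints every live thread with its state and stack, which is what to look for when checking whether the "SIGTERM handler" thread stays stuck:

---8<---
import java.util.Map;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        // Thread.getAllStackTraces() returns every live thread and its stack,
        // roughly the same content a SIGQUIT thread dump produces.
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            System.out.printf("\"%s\" state=%s%n", t.getName(), t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
--->8---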

Comment 27 Marcus West 2017-02-07 15:16:08 UTC
Hi Juan,

I've followed c#23 and uploaded console.log files - one for version 4.0.6.3-0.1 and one for 4.1.0.4-0.1 for comparison (rhevbeta-rhevm-4.1.0.4-0.1.console.log, rhevmhe1-rhevm-4.0.6.3-0.1.console.log). I ran SIGQUIT about 10 times against each and kept getting thread stack trace dumps each time.

Comment 29 Marcus West 2017-02-07 15:18:22 UTC
Created attachment 1248428 [details]
console log for 4.1.0.4-0.1

Comment 30 Juan Hernández 2017-02-24 17:54:11 UTC
I found two potential issues that interfere with the normal shutdown process of the application server.

1. The Infinispan caches that we create are automatically started by the application server, but never stopped. This results in an exception during the shutdown process, because the application server detects that the cache is in use, and refuses to destroy it:

---8<---
2017-02-24 18:13:37,010+01 ERROR [stderr] (ServerService Thread Pool -- 60) Exception in thread "ServerService Thread Pool -- 60" java.lang.IllegalStateException: ISPN000371: Cannot remove cache configuration 'timeout-base' because it is in use
2017-02-24 18:13:37,010+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.infinispan.manager.DefaultCacheManager.undefineConfiguration(DefaultCacheManager.java:402)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.jboss.as.clustering.infinispan.DefaultCacheContainer.undefineConfiguration(DefaultCacheContainer.java:88)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.wildfly.clustering.infinispan.spi.service.ConfigurationBuilder.stop(ConfigurationBuilder.java:84)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.wildfly.clustering.service.AsynchronousServiceBuilder$2.run(AsynchronousServiceBuilder.java:130)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.lang.Thread.run(Thread.java:745)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.jboss.threads.JBossThread.run(JBossThread.java:320)
--->8---

Note that this is just written to 'stderr', so it doesn't really block the shutdown process. It is probably a bug in WildFly, as in theory the application server should stop the cache once all the objects that were using it are destroyed.
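
A minimal sketch of the direction a fix for issue 1 could take (the bean, the cache container lookup and the names are assumptions for illustration, not the actual ovirt-engine code): stop the cache explicitly on undeploy, so that the server can later remove its configuration without hitting the IllegalStateException above.

---8<---
import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;
import javax.annotation.Resource;
import javax.ejb.Singleton;
import javax.ejb.Startup;

import org.infinispan.Cache;
import org.infinispan.manager.EmbeddedCacheManager;

@Singleton
@Startup
public class TimeoutCacheLifecycle {

    // Hypothetical JNDI name of the injected cache container.
    @Resource(lookup = "java:jboss/infinispan/container/ovirt-engine")
    private EmbeddedCacheManager cacheManager;

    private Cache<String, Object> timeoutCache;

    @PostConstruct
    public void start() {
        // Starts (or retrieves) the cache named in the error message above.
        timeoutCache = cacheManager.getCache("timeout-base");
    }

    @PreDestroy
    public void stop() {
        // Stop the cache before the container undefines its configuration,
        // so undefineConfiguration() no longer sees it as "in use".
        if (timeoutCache != null) {
            timeoutCache.stop();
        }
    }
}
--->8---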

2. The 'DBSchedulerUtilQuartzImpl' scheduler is never really stopped. During the shutdown process the database connection pool that it uses is destroyed, so the pending jobs fail and are retried, apparently forever. This change fixes it:

---8<---
@@ -369,7 +369,7 @@ public abstract class SchedulerUtilBaseImpl implements SchedulerUtil {
     public void shutDown() {
         try {
             if (sched != null) {
-                sched.shutdown(true);
+                sched.shutdown(false);
             }
         } catch (SchedulerException se) {
             log.error("failed to shut down the scheduler: {}", se.getMessage());
--->8---

Not waiting for the tasks to finish is probably acceptable: if we wait, the engine will be killed anyhow, so it is just a matter of time before the jobs are aborted.
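
For reference, the Quartz call involved is Scheduler.shutdown(boolean waitForJobsToComplete); a minimal sketch (not the engine code) of the non-waiting variant used in the change above:

---8<---
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzShutdownSketch {
    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        // ... jobs would be scheduled here ...

        // shutdown(false) returns without waiting for running jobs, while
        // shutdown(true) blocks until every executing job completes -- which
        // is what kept the engine alive once the jobs lost their database
        // connection pool.
        scheduler.shutdown(false);
    }
}
--->8---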

Sahina, is it acceptable for you to not wait for the scheduled jobs to finish?

Comment 31 Sahina Bose 2017-03-07 09:09:06 UTC
Juan, yes - I think if we are shutting down the engine, we can abort the jobs.

Comment 32 rhev-integ 2017-04-26 10:50:44 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[FOUND CLONE FLAGS: ['rhevm-4.1.z', 'rhevm-4.2-ga'], ]

For more info please contact: rhv-devops

Comment 35 Radim Hrazdil 2017-11-16 11:54:27 UTC
Verified by running systemctl status/stop/status; no errors reported.
Version 4.2.0-0.0.master.20171108151837.gita7de53e.el7.centos

Comment 38 errata-xmlrpc 2018-05-15 17:38:43 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 39 Franta Kust 2019-05-16 13:08:40 UTC
BZ<2>Jira Resync