Bug 1394617 - ovirt-engine does not shut down cleanly
Summary: ovirt-engine does not shut down cleanly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.0.4
Hardware: All
OS: Linux
Severity: high
Priority: high
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Juan Hernández
QA Contact: Radim Hrazdil
URL:
Whiteboard:
Depends On:
Blocks: 1445733
 
Reported: 2016-11-14 04:17 UTC by Marcus West
Modified: 2020-07-16 09:00 UTC (History)
16 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1445733 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:38:43 UTC
oVirt Team: Infra
Target Upstream Version:


Attachments (Terms of Use)
sending SIGKILL after SIGTERM (2.08 MB, text/plain), 2016-11-28 23:49 UTC, Jay Shin
exit success (2.02 MB, text/plain), 2016-11-28 23:50 UTC, Jay Shin
console log for 4.1.0.4-0.1 (930.09 KB, text/plain), 2017-02-07 15:18 UTC, Marcus West


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2765391 0 None None None 2016-11-14 04:46:23 UTC
Red Hat Product Errata RHEA-2018:1488 0 None None None 2018-05-15 17:40:57 UTC
oVirt gerrit 73661 0 master MERGED core: Don't wait for jobs to complete during shutdown 2021-02-19 12:39:54 UTC
oVirt gerrit 73743 0 ovirt-engine-4.1 MERGED core: Don't wait for jobs to complete during shutdown 2021-02-19 12:39:54 UTC

Description Marcus West 2016-11-14 04:17:05 UTC
## Description of problem:

ovirt-engine does not shut down cleanly

## Version-Release number of selected component (if applicable):

ovirt-engine-4.0.4.4-0.1.el7ev.noarch
rhevm-4.0.4.4-0.1.el7ev.noarch

## How reproducible:

All the environments that I have tried so far

## Steps to Reproduce:
1. run 'systemctl stop ovirt-engine'

## Actual results:

systemctl status shows:

  Active: failed (Result: exit-code) since Mon 2016-11-14 14:10:44 AEST; 3s ago

and

  Nov 14 14:10:44 <hostname> ovirt-engine.py[4987]: 2016-11-14 14:10:44,618 ovirt-engine: ERROR run:532 Error: Had to kill process 5018


## Expected results:

Expect to see something along the lines of:

  Active: inactive

No reports of processes being killed abruptly.

## Additional info:

This does not appear to be causing any other issues at the moment, but I am concerned that it could. For example, what would happen if ovirt-engine were abruptly killed in the middle of a lengthy transaction?

Comment 10 Jay Shin 2016-11-28 23:49:21 UTC
Created attachment 1225534 [details]
sending SIGKILL after SIGTERM

Comment 11 Jay Shin 2016-11-28 23:50:00 UTC
Created attachment 1225535 [details]
exit success

Comment 16 Sandro Bonazzola 2017-02-06 10:11:23 UTC
Lev, can you please update on this?

Comment 17 Lev Veyde 2017-02-06 16:19:37 UTC
(In reply to Sandro Bonazzola from comment #16)
> Lev, can you please update on this?

Yes, the issue is with the Java/JBoss ovirt-engine service, which doesn't handle SIGTERM properly and thus doesn't shut down in an orderly fashion.

What happens then is that the process is killed with SIGKILL after the default timeout of 10 seconds, hence the error message.
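The escalation described here (SIGTERM, a bounded wait, then SIGKILL) can be sketched as a small shell function. This is only an illustration of the behavior, not the actual ovirt-engine.py launcher code, and the function name is made up:

```shell
# Hypothetical sketch: stop a process with SIGTERM, escalating to SIGKILL
# if it is still alive after the timeout (the launcher's default is 10s).
stop_with_timeout() {
    pid=$1
    timeout=${2:-10}
    kill -TERM "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt "$timeout" ]; do
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "process $pid exited cleanly"
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null
    echo "Had to kill process $pid"
    return 1
}
```

A well-behaved JVM would exit during the wait loop; the bug here is that the engine's SIGTERM handler thread stays blocked, so the timeout expires and the launcher logs the "Had to kill process" error seen above.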

Moving the issue to infra.

Comment 18 Oved Ourfali 2017-02-06 16:36:34 UTC
Juan, can you take a look?

Comment 19 Oved Ourfali 2017-02-06 16:46:53 UTC
Marcus, does it reproduce for you in the latest 4.0? In 4.1?

Comment 23 Juan Hernández 2017-02-06 17:52:06 UTC
Our stop process first sends the SIGTERM signal to the application server. Then, if it doesn't finish within 10 seconds, it sends a SIGKILL. The initial SIGTERM should have triggered the execution of the shutdown hooks within the Java virtual machine, which should in turn run the application server shutdown logic. But I guess that the thread that runs that logic is blocked by something, probably by something that failed to shut down cleanly, like a scheduled task that failed because it can no longer connect to the database.

To help diagnose this, it would be helpful to have a thread stack trace dump of the Java virtual machine, after sending the SIGTERM. To get it, please do the following:

1. Find the PID of the Java virtual machine:

  # pidof ovirt-engine

2. Manually send the SIGTERM signal to that process:

  # kill -SIGTERM the_pid

3. The process will most probably stay alive; check that. Then generate a thread stack dump by sending the SIGQUIT signal to the process:

  # kill -SIGQUIT the_pid

4. The output will be written to the /var/log/ovirt-engine/console.log file. It will most probably contain something like this:

  "SIGTERM handler" ...
    java.lang.Thread.State: WAITING (on object monitor)
    ...

5. Repeat the SIGQUIT command a couple of times, a few seconds apart, to verify that the SIGTERM handler thread stays blocked. Then share the console.log file with us.

Comment 27 Marcus West 2017-02-07 15:16:08 UTC
Hi Juan,

I've followed c#23 and uploaded console.log files - one for version 4.0.6.3-0.1 and one for 4.1.0.4-0.1 for comparison (rhevbeta-rhevm-4.1.0.4-0.1.console.log, rhevmhe1-rhevm-4.0.6.3-0.1.console.log). I sent SIGQUIT about 10 times to each and got a thread stack trace dump every time.

Comment 29 Marcus West 2017-02-07 15:18:22 UTC
Created attachment 1248428 [details]
console log for 4.1.0.4-0.1

Comment 30 Juan Hernández 2017-02-24 17:54:11 UTC
I found two potential issues that interfere with the normal shutdown process of the application server.

1. The Infinispan caches that we create are automatically started by the application server, but never stopped. This results in an exception during the shutdown process, because the application server detects that the cache is in use, and refuses to destroy it:

---8<---
2017-02-24 18:13:37,010+01 ERROR [stderr] (ServerService Thread Pool -- 60) Exception in thread "ServerService Thread Pool -- 60" java.lang.IllegalStateException: ISPN000371: Cannot remove cache configuration 'timeout-base' because it is in use
2017-02-24 18:13:37,010+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.infinispan.manager.DefaultCacheManager.undefineConfiguration(DefaultCacheManager.java:402)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.jboss.as.clustering.infinispan.DefaultCacheContainer.undefineConfiguration(DefaultCacheContainer.java:88)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.wildfly.clustering.infinispan.spi.service.ConfigurationBuilder.stop(ConfigurationBuilder.java:84)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.wildfly.clustering.service.AsynchronousServiceBuilder$2.run(AsynchronousServiceBuilder.java:130)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.lang.Thread.run(Thread.java:745)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.jboss.threads.JBossThread.run(JBossThread.java:320)
--->8---

Note that this is just written to 'stderr', so it doesn't really block the shutdown process. It is probably a bug in WildFly, as in theory the application server should stop the cache once all the objects that were using it are destroyed.

2. The 'DBSchedulerUtilQuartzImpl' scheduler is never actually stopped. During the shutdown process the database connection pool that it uses is destroyed; because of this the pending jobs fail and are retried, apparently forever. This change fixes it:

---8<---
@@ -369,7 +369,7 @@ public abstract class SchedulerUtilBaseImpl implements SchedulerUtil {
     public void shutDown() {
         try {
             if (sched != null) {
-                sched.shutdown(true);
+                sched.shutdown(false);
             }
         } catch (SchedulerException se) {
             log.error("failed to shut down the scheduler: {}", se.getMessage());
--->8---

Not waiting for the tasks to finish is probably acceptable: if we wait, the engine will be killed anyhow, so it is only a matter of time before the jobs are aborted.

Sahina, is it acceptable for you to not wait for the scheduled jobs to finish?

Comment 31 Sahina Bose 2017-03-07 09:09:06 UTC
Juan, yes - I think if we are shutting down the engine, we can abort the jobs.

Comment 32 rhev-integ 2017-04-26 10:50:44 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[FOUND CLONE FLAGS: ['rhevm-4.1.z', 'rhevm-4.2-ga'], ]

For more info please contact: rhv-devops@redhat.com

Comment 35 Radim Hrazdil 2017-11-16 11:54:27 UTC
Verified by running systemctl status/stop/status, no errors reported. 
Version 4.2.0-0.0.master.20171108151837.gita7de53e.el7.centos

Comment 38 errata-xmlrpc 2018-05-15 17:38:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 39 Franta Kust 2019-05-16 13:08:40 UTC
BZ<2>Jira Resync

