1445733 – [downstream clone - 4.1.2] ovirt-engine does not shut down cleanly

Bug 1445733 - [downstream clone - 4.1.2] ovirt-engine does not shut down cleanly

Summary: [downstream clone - 4.1.2] ovirt-engine does not shut down cleanly

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.0.4
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.1.2
Target Release:	---
Assignee:	Juan Hernández
QA Contact:	Lukas Svaty
Docs Contact:
URL:
Whiteboard:
Depends On:	1394617
Blocks:
TreeView+	depends on / blocked

Reported:	2017-04-26 12:07 UTC by rhev-integ
Modified:	2020-07-16 09:29 UTC (History)
CC List:	18 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1394617
Environment:
Last Closed:	2017-05-24 11:23:06 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	2765391	None	None	None	2017-04-26 12:12:12 UTC
Red Hat Product Errata	RHEA-2017:1280	normal	SHIPPED_LIVE	Red Hat Virtualization Manager (ovirt-engine) 4.1.2	2017-05-24 15:18:48 UTC
oVirt gerrit	73661	master	MERGED	core: Don't wait for jobs to complete during shutdown	2021-02-19 12:40:36 UTC
oVirt gerrit	73743	ovirt-engine-4.1	MERGED	core: Don't wait for jobs to complete during shutdown	2021-02-19 12:40:37 UTC

Description rhev-integ 2017-04-26 12:07:53 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1394617 +++
======================================================================

## Description of problem:

ovirt-engine does not shut down clean

## Version-Release number of selected component (if applicable):

ovirt-engine-4.0.4.4-0.1.el7ev.noarch
rhevm-4.0.4.4-0.1.el7ev.noarch

## How reproducible:

All the environments that I have tried so far

## Steps to Reproduce:
1. run 'systemctl stop ovirt-engine'
2.
3.

## Actual results:

systemctl status shows:

  Active: failed (Result: exit-code) since Mon 2016-11-14 14:10:44 AEST; 3s ago

and

  Nov 14 14:10:44 <hostname> ovirt-engine.py[4987]: 2016-11-14 14:10:44,618 ovirt-engine: ERROR run:532 Error: Had to kill process 5018


## Expected results:

Expect to see something along the lines of:

  Active: inactive

No no reports of processes being killed abruptly

## Additional info:

This does not appear to be causing any other issues at the moment, but concerned it could have the potential to cause issues.  For example, what would happen if ovirt-engine was abruptly killed in the middle of a lengthy transaction?

(Originally by Marcus West)

Comment 12 rhev-integ 2017-04-26 12:09:18 UTC

Created attachment 1225534 [details]
sending SIGKILL after SIGTERM

(Originally by Jay Shin)

Comment 13 rhev-integ 2017-04-26 12:09:27 UTC

Created attachment 1225535 [details]
exit success

(Originally by Jay Shin)

Comment 18 rhev-integ 2017-04-26 12:10:01 UTC

Lev, can you please update on this?

(Originally by Sandro Bonazzola)

Comment 19 rhev-integ 2017-04-26 12:10:10 UTC

(In reply to Sandro Bonazzola from comment #16)
> Lev, can you please update on this?

Yes, the issue is with the Java/JBoss ovirt-engine service that doesn't handle the SIGTERM properly, and thus doesn't shut down orderly.

What then happens is that the process is killed with SIGKILL after waiting for the default timeout of 10 seconds, thus the error message.

Moving the issue to infra.

(Originally by Lev Veyde)

Comment 20 rhev-integ 2017-04-26 12:10:17 UTC

Juan, can you take a look?

(Originally by Oved Ourfali)

Comment 21 rhev-integ 2017-04-26 12:10:25 UTC

Marcus, does it reproduce to you in latest 4.0? In 4.1?

(Originally by Oved Ourfali)

Comment 25 rhev-integ 2017-04-26 12:10:53 UTC

Our stop process first send the SIGTERM signal to the application server. Then, if it doesn't finish in 10 seconds, it sends a SIGKILL. The initial SIGTERM should have triggered the execution of the shutdown hooks within the Java virtual machine, which should in run run the application server shutdown logic. But I guess that the thread that runs that logic is blocked by something, probably by something that failed to do its own shutdown cleanly, like an scheduled task that failed because it can no longer connect to the database.

To help diagnose this, it would be helpful to have a thread stack trace dump of the Java virtual machine, after sending the SIGTERM. To get it, please do the following:

1. Find the PID of the Java virtual machine:

  # pidof ovirt-engine

2. Manually send the SIGTERM signal to that process:

  # kill -SIGTERM the_pid

3. The process will most probably stay alive. Check that. And generate a thread stack dump sending the SIGQUIT signal to the process:

  # kill -SIGQUIT the_pid

4. The output will be written to the /var/log/ovirt-engine/console.log file. It will most probably contain something like this:

  "SIGTERM handler" ...
    java.lang.Thread.State: WAITING (on object monitor)
    ...

5. Repeat the SIQUIT command a couple of times, after a few seconds, to verify that the SIGTERM handler thread stays locked. Then share the console.log file with us.

(Originally by juan.hernandez)

Comment 29 rhev-integ 2017-04-26 12:11:22 UTC

Hi Juan,

I've followed c#23 and uploaded console.log - one for version 4.0.6.3-0.1, and one 4.1.0.4-0.1 for comparison (rhevbeta-rhevm-4.1.0.4-0.1.console.log  rhevmhe1-rhevm-4.0.6.3-0.1.console.log).  I ran each -SIGQUIT about 10 times each, but kept getting thread stack trace dumps each time.

(Originally by Marcus West)

Comment 31 rhev-integ 2017-04-26 12:11:36 UTC

Created attachment 1248428 [details]
console log for 4.1.0.4-0.1

(Originally by Marcus West)

Comment 32 rhev-integ 2017-04-26 12:11:43 UTC

I found two potential issues that interfere with the normal shutdown process of the application server.

1. The Infinispan caches that we create are automatically started by the application server, but never stopped. This results in an exception during the shutdown process, because the application server detects that the cache is in use, and refuses to destroy it:

---8<---
2017-02-24 18:13:37,010+01 ERROR [stderr] (ServerService Thread Pool -- 60) Exception in thread "ServerService Thread Pool -- 60" java.lang.IllegalStateException: ISPN000371: Cannot remove cache configuration 'timeout-base' because it is in use
2017-02-24 18:13:37,010+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.infinispan.manager.DefaultCacheManager.undefineConfiguration(DefaultCacheManager.java:402)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.jboss.as.clustering.infinispan.DefaultCacheContainer.undefineConfiguration(DefaultCacheContainer.java:88)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.wildfly.clustering.infinispan.spi.service.ConfigurationBuilder.stop(ConfigurationBuilder.java:84)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.wildfly.clustering.service.AsynchronousServiceBuilder$2.run(AsynchronousServiceBuilder.java:130)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.lang.Thread.run(Thread.java:745)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.jboss.threads.JBossThread.run(JBossThread.java:320)
--->8---

Note that this is just written to 'stderr', so it doesn't really block the shutdown process. It is probably a bug in WildFly, as in theory the application server should stop the cache once all the objects that were using it are destroyed.

2. The 'DBSchedulerUtilQuartzImpl' scheduler is never really stopped. During the shutdown process the database connection pool that it uses is destroyed, and because of this the jobs that are pending fail, and are retried, apparently for ever. This change fixes it:

---8<---
@@ -369,7 +369,7 @@ public abstract class SchedulerUtilBaseImpl implements SchedulerUtil {
     public void shutDown() {
         try {
             if (sched != null) {
-                sched.shutdown(true);
+                sched.shutdown(false);
             }
         } catch (SchedulerException se) {
             log.error("failed to shut down the scheduler: {}", se.getMessage());
--->8---

This, not waiting for the tasks to finish, is probably acceptable. If we wait the engine will be killed anyhow, so it is just a matter of time that the jobs will be aborted.

Sahina, is it acceptable for you to not wait for the scheduled jobs to finish?

(Originally by juan.hernandez)

Comment 33 rhev-integ 2017-04-26 12:11:51 UTC

Juan, yes - I think if are shutting down engine, we can abort the jobs.

(Originally by Sahina Bose)

Comment 34 rhev-integ 2017-04-26 12:11:58 UTC

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[FOUND CLONE FLAGS: ['rhevm-4.1.z', 'rhevm-4.2-ga'], ]

For more info please contact: rhv-devops

(Originally by rhev-integ)

Comment 36 Lukas Svaty 2017-05-10 13:49:06 UTC

[root@ls-engine1 ~]# systemctl status ovirt-engine
● ovirt-engine.service - oVirt Engine
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2017-05-10 15:47:59 CEST; 16s ago
 Main PID: 17317 (code=exited, status=0/SUCCESS)

May 10 08:15:29 ls-engine1.rhev.lab.eng.brq.redhat.com systemd[1]: Starting oVirt Engine...
May 10 08:15:29 ls-engine1.rhev.lab.eng.brq.redhat.com ovirt-engine.py[17317]: 2017-05-10 08:15:29,381+0200 ovirt-engine: INFO _detectJBossVersion:187 Detecting JBoss version. Running: /usr/lib/jvm...600000', '-
May 10 08:15:30 ls-engine1.rhev.lab.eng.brq.redhat.com ovirt-engine.py[17317]: 2017-05-10 08:15:30,333+0200 ovirt-engine: INFO _detectJBossVersion:207 Return code: 0,  | stdout: '[u'JBoss EAP 7.0.5...tderr: '[]'
May 10 08:15:30 ls-engine1.rhev.lab.eng.brq.redhat.com systemd[1]: Started oVirt Engine.
May 10 15:47:58 ls-engine1.rhev.lab.eng.brq.redhat.com systemd[1]: Stopping oVirt Engine...
May 10 15:47:59 ls-engine1.rhev.lab.eng.brq.redhat.com systemd[1]: Stopped oVirt Engine.
Hint: Some lines were ellipsized, use -l to show in full.

verified in ovirt-engine-4.1.2.1-0.1.el7.noarch

Comment 38 errata-xmlrpc 2017-05-24 11:23:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1280

Note You need to log in before you can comment on or make changes to this bug.