+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1394617 +++
======================================================================

## Description of problem:

ovirt-engine does not shut down cleanly.

## Version-Release number of selected component (if applicable):

ovirt-engine-4.0.4.4-0.1.el7ev.noarch
rhevm-4.0.4.4-0.1.el7ev.noarch

## How reproducible:

All the environments that I have tried so far.

## Steps to Reproduce:

1. Run 'systemctl stop ovirt-engine'

## Actual results:

'systemctl status' shows:

    Active: failed (Result: exit-code) since Mon 2016-11-14 14:10:44 AEST; 3s ago

and:

    Nov 14 14:10:44 <hostname> ovirt-engine.py[4987]: 2016-11-14 14:10:44,618 ovirt-engine: ERROR run:532 Error: Had to kill process 5018

## Expected results:

Something along the lines of:

    Active: inactive

and no reports of processes being killed abruptly.

## Additional info:

This does not appear to be causing any other issues at the moment, but I am concerned it could have the potential to cause issues. For example, what would happen if ovirt-engine was abruptly killed in the middle of a lengthy transaction?

(Originally by Marcus West)
Created attachment 1225534 [details] sending SIGKILL after SIGTERM (Originally by Jay Shin)
Created attachment 1225535 [details] exit success (Originally by Jay Shin)
Lev, can you please update on this? (Originally by Sandro Bonazzola)
(In reply to Sandro Bonazzola from comment #16)
> Lev, can you please update on this?

Yes, the issue is with the Java/JBoss ovirt-engine service, which doesn't handle SIGTERM properly and thus doesn't shut down in an orderly fashion. What then happens is that the process is killed with SIGKILL after the default timeout of 10 seconds, hence the error message.

Moving the issue to infra.

(Originally by Lev Veyde)
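The stop sequence described above (SIGTERM, a 10-second grace period, then SIGKILL) maps directly onto the JDK's Process API; this is a minimal sketch, where the class name, the helper, and the `sleep 60` child process are made up for illustration and are not part of the actual ovirt-engine service code:

```java
import java.util.concurrent.TimeUnit;

public class GracefulStop {
    // Illustrative helper: mirror the stop logic described in the comment.
    // destroy() sends SIGTERM on POSIX; destroyForcibly() sends SIGKILL.
    static int stop(Process p) throws InterruptedException {
        p.destroy();                                 // SIGTERM
        if (!p.waitFor(10, TimeUnit.SECONDS)) {      // grace period
            p.destroyForcibly();                     // SIGKILL fallback
            p.waitFor();
            return 1;                                // "Had to kill process"
        }
        return 0;                                    // orderly shutdown
    }

    public static void main(String[] args) throws Exception {
        // 'sleep' exits promptly on SIGTERM, so this takes the clean path.
        Process sleeper = new ProcessBuilder("sleep", "60").start();
        System.out.println(stop(sleeper) == 0 ? "clean" : "killed");
    }
}
```

A process that ignores or blocks on SIGTERM would instead exhaust the grace period and take the SIGKILL path, which is exactly the symptom reported here.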
Juan, can you take a look? (Originally by Oved Ourfali)
Marcus, does it reproduce for you in latest 4.0? In 4.1? (Originally by Oved Ourfali)
Our stop process first sends the SIGTERM signal to the application server. Then, if it doesn't finish within 10 seconds, it sends a SIGKILL.

The initial SIGTERM should have triggered the execution of the shutdown hooks within the Java virtual machine, which should in turn run the application server shutdown logic. But I guess that the thread that runs that logic is blocked by something, probably by something that failed to do its own shutdown cleanly, like a scheduled task that failed because it can no longer connect to the database.

To help diagnose this, it would be helpful to have a thread stack dump of the Java virtual machine taken after sending the SIGTERM. To get it, please do the following:

1. Find the PID of the Java virtual machine:

   # pidof ovirt-engine

2. Manually send the SIGTERM signal to that process:

   # kill -SIGTERM the_pid

3. The process will most probably stay alive. Check that. Then generate a thread stack dump by sending the SIGQUIT signal to the process:

   # kill -SIGQUIT the_pid

4. The output will be written to the /var/log/ovirt-engine/console.log file. It will most probably contain something like this:

   "SIGTERM handler" ...
      java.lang.Thread.State: WAITING (on object monitor)
   ...

5. Repeat the SIGQUIT command a couple of times, after a few seconds, to verify that the SIGTERM handler thread stays locked.

Then share the console.log file with us.

(Originally by juan.hernandez)
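The blocked "SIGTERM handler" state described above can be reproduced in miniature with a JVM shutdown hook that waits on a monitor. This is an illustrative sketch only (the class name is invented, and the wait is bounded at 200 ms so the demo terminates; in the failing case the wait never ends and the hook blocks shutdown until SIGKILL):

```java
public class BlockedShutdownHook {
    public static void main(String[] args) {
        final Object lock = new Object();
        // SIGTERM makes the JVM run its shutdown hooks; a hook stuck in
        // Object.wait() shows up in a SIGQUIT thread dump as
        // java.lang.Thread.State: WAITING (on object monitor).
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait(200); // unbounded in the failing case
                } catch (InterruptedException ignored) {
                }
            }
            System.out.println("hook finished");
        }, "SIGTERM handler"));
        System.out.println("main exiting");
    }
}
```

While the hook is waiting, the process stays alive despite SIGTERM, which matches the behavior the diagnostic steps above are meant to confirm.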
Hi Juan, I've followed c#23 and uploaded console.log files - one for version 4.0.6.3-0.1 and one for 4.1.0.4-0.1 for comparison (rhevbeta-rhevm-4.1.0.4-0.1.console.log, rhevmhe1-rhevm-4.0.6.3-0.1.console.log). I ran -SIGQUIT about 10 times for each, and got a thread stack trace dump each time. (Originally by Marcus West)
Created attachment 1248428 [details] console log for 4.1.0.4-0.1 (Originally by Marcus West)
I found two potential issues that interfere with the normal shutdown process of the application server.

1. The Infinispan caches that we create are automatically started by the application server, but never stopped. This results in an exception during the shutdown process, because the application server detects that the cache is in use and refuses to destroy it:

---8<---
2017-02-24 18:13:37,010+01 ERROR [stderr] (ServerService Thread Pool -- 60) Exception in thread "ServerService Thread Pool -- 60" java.lang.IllegalStateException: ISPN000371: Cannot remove cache configuration 'timeout-base' because it is in use
2017-02-24 18:13:37,010+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.infinispan.manager.DefaultCacheManager.undefineConfiguration(DefaultCacheManager.java:402)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.jboss.as.clustering.infinispan.DefaultCacheContainer.undefineConfiguration(DefaultCacheContainer.java:88)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.wildfly.clustering.infinispan.spi.service.ConfigurationBuilder.stop(ConfigurationBuilder.java:84)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.wildfly.clustering.service.AsynchronousServiceBuilder$2.run(AsynchronousServiceBuilder.java:130)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at java.lang.Thread.run(Thread.java:745)
2017-02-24 18:13:37,011+01 ERROR [stderr] (ServerService Thread Pool -- 60)     at org.jboss.threads.JBossThread.run(JBossThread.java:320)
--->8---

Note that this is just written to 'stderr', so it doesn't really block the shutdown process. It is probably a bug in WildFly, as in theory the application server should stop the cache once all the objects that were using it are destroyed.

2. The 'DBSchedulerUtilQuartzImpl' scheduler is never really stopped. During the shutdown process the database connection pool that it uses is destroyed, and because of this the pending jobs fail and are retried, apparently forever. This change fixes it:

---8<---
@@ -369,7 +369,7 @@ public abstract class SchedulerUtilBaseImpl implements SchedulerUtil {
     public void shutDown() {
         try {
             if (sched != null) {
-                sched.shutdown(true);
+                sched.shutdown(false);
             }
         } catch (SchedulerException se) {
             log.error("failed to shut down the scheduler: {}", se.getMessage());
--->8---

Not waiting for the tasks to finish is probably acceptable: if we wait, the engine will be killed anyhow, so it is just a matter of time before the jobs are aborted.

Sahina, is it acceptable for you to not wait for the scheduled jobs to finish?

(Originally by juan.hernandez)
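The semantics of the one-line Quartz change above (`sched.shutdown(true)` waits for running jobs; `sched.shutdown(false)` returns without waiting) can be sketched with the analogous java.util.concurrent API; this is an illustrative stand-in, not the actual ovirt-engine code, and the class name and timings are invented:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SchedulerShutdownDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for a long-running scheduled job that would otherwise
        // keep the scheduler (and the whole shutdown) waiting.
        pool.submit(() -> {
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                // shutdownNow() interrupts the job instead of waiting it out,
                // like sched.shutdown(false) not waiting for jobs to complete.
            }
        });
        long start = System.nanoTime();
        pool.shutdownNow();                       // don't wait for the job
        pool.awaitTermination(5, TimeUnit.SECONDS);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(millis < 1000 ? "fast" : "slow");
    }
}
```

With the waiting variant (plain `shutdown()` plus a full `awaitTermination`), the stop would instead block on the 10-second job and the service manager's SIGKILL would fire first, which is the behavior the patch avoids.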
Juan, yes - I think if we are shutting down the engine, we can abort the jobs. (Originally by Sahina Bose)
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [FOUND CLONE FLAGS: ['rhevm-4.1.z', 'rhevm-4.2-ga'], ] For more info please contact: rhv-devops (Originally by rhev-integ)
[root@ls-engine1 ~]# systemctl status ovirt-engine
● ovirt-engine.service - oVirt Engine
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2017-05-10 15:47:59 CEST; 16s ago
 Main PID: 17317 (code=exited, status=0/SUCCESS)

May 10 08:15:29 ls-engine1.rhev.lab.eng.brq.redhat.com systemd[1]: Starting oVirt Engine...
May 10 08:15:29 ls-engine1.rhev.lab.eng.brq.redhat.com ovirt-engine.py[17317]: 2017-05-10 08:15:29,381+0200 ovirt-engine: INFO _detectJBossVersion:187 Detecting JBoss version. Running: /usr/lib/jvm...600000', '-
May 10 08:15:30 ls-engine1.rhev.lab.eng.brq.redhat.com ovirt-engine.py[17317]: 2017-05-10 08:15:30,333+0200 ovirt-engine: INFO _detectJBossVersion:207 Return code: 0, | stdout: '[u'JBoss EAP 7.0.5...tderr: '[]'
May 10 08:15:30 ls-engine1.rhev.lab.eng.brq.redhat.com systemd[1]: Started oVirt Engine.
May 10 15:47:58 ls-engine1.rhev.lab.eng.brq.redhat.com systemd[1]: Stopping oVirt Engine...
May 10 15:47:59 ls-engine1.rhev.lab.eng.brq.redhat.com systemd[1]: Stopped oVirt Engine.
Hint: Some lines were ellipsized, use -l to show in full.

Verified in ovirt-engine-4.1.2.1-0.1.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1280