Created attachment 1408493 [details]
engine.log, console.log from engine, vdsm.log from host

Description of problem:

After vdsm was restarted on a host, the host is stuck in the "connecting" state forever.

engine.log contains almost no messages; we see:

2018-03-09 03:41:01,130+02 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'engineScheduled' is using 0 threads out of 100 and 100 tasks are waiting in the queue.
2018-03-09 03:41:01,130+02 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'hostUpdatesChecker' is using 0 threads out of 5 and 5 tasks are waiting in the queue.

Having zero threads is probably not very effective :-)

The host is accessible:

# ping voodoo6.tlv.redhat.com
PING voodoo6.tlv.redhat.com (10.35.0.110) 56(84) bytes of data.
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=1 ttl=64 time=0.519 ms
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=2 ttl=64 time=0.514 ms
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=3 ttl=64 time=0.545 ms
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=4 ttl=64 time=0.542 ms
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=5 ttl=64 time=0.542 ms

And in the vdsm logs we don't see any connection attempt from the engine.

Version-Release number of selected component (if applicable):
Version 4.2.2.1-1.el7.centos (recent ovirt-4.2 release)

How reproducible:
Happened once

Steps to Reproduce:
Not clear what triggers this, but the last things I did were:
1. Stop vdsm on the host
2. Host moved to the "connecting" state
3. Start vdsm on the host
4. Engine stuck in the "connecting" state forever

Actual results:
System is not operational

Expected results:
Host connection succeeds
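For reference, a quick way to look for incoming engine connections on the host side (a sketch only; /var/log/vdsm/vdsm.log is the standard vdsm log path, but the exact text vdsm logs when it accepts a connection is an assumption here, so adjust the pattern as needed):

# Look for incoming connections from the engine around the vdsm restart
# (the message text matched below is an assumption, not the exact vdsm wording)
grep -iE 'connection from|accept' /var/log/vdsm/vdsm.log | tail -n 20

which is consistent with the observation above that no connection attempt from the engine appears in the vdsm logs.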
Note: I did several thread dumps, see console.log.
Restarting engine "fixes" the issue.
I am unable to reproduce this on either master or 4.2.2. The host consistently comes back up in a reasonable amount of time (2 minutes). The behavior is the same in 4.1.

My setup: one host with an NFS storage domain.

1. Stop vdsm, wait for the host status to go to "connecting"
2. Stop nfs-server
3. Start vdsm

After the completion of InitVdsOnUpCommand the status is "Up". It takes 2 minutes for InitVdsOnUpCommand to execute and fail to connect the host to the storage domain; then the status is updated in the database and the host status moves to Up.
Added a patch to update host status in the database before invoking InitVdsOnUp
Lowering severity to high as host status is always changed after some time and this issue exists for a long time.
(In reply to Martin Perina from comment #5)
> Lowering severity to high as host status is always changed after some time
> and this issue exists for a long time.

I don't know what "some time" means, but in my case the engine was stuck for at least 30 minutes, and a restart was required.
Also, how do you explain "using 0 threads out of 5 and 5 tasks are waiting in the queue"?

Will the posted patch fix the issue of having zero threads in the executor?
Martin, we have a second report from rjones about this issue. This time the engine stopped extending the imageio ticket, causing the image transfer to fail after 30 minutes. See
https://www.redhat.com/archives/libguestfs/2018-March/msg00207.html

In the engine logs we see:

1. The ticket was extended

2018-03-29 12:54:25,236+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [189a3bcd-8a40-4f81-8a56-d2efcecc65ea] FINISH, ExtendImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: fc99636
2018-03-29 12:54:25,237+01 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [189a3bcd-8a40-4f81-8a56-d2efcecc65ea] Transfer session with ticket id 4e8567f6-939d-471f-8ea8-e64d172c0a26 extended, timeout 300 seconds

The next extend should happen 4 minutes later, but we don't see any other extend.

2. We start to see messages about thread pools with *0* threads and tasks in the queue:

2018-03-29 12:54:35,738+01 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'default' is using 0 threads out of 1 and 5 tasks are waiting in the queue.
2018-03-29 12:54:35,738+01 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'engineScheduled' is using 0 threads out of 100 and 100 tasks are waiting in the queue.
2018-03-29 12:54:35,738+01 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'hostUpdatesChecker' is using 0 threads out of 5 and 2 tasks are waiting in the queue.

3. These messages repeat many times:

$ xzgrep "Thread pool 'default' is using 0 threads out of 1 and 5 tasks are waiting in the queue" engine.log.xz | wc -l
295

$ xzgrep "Thread pool 'engineScheduled' is using 0 threads out of 100 and 100 tasks are waiting in the queue" engine.log.xz | wc -l
294

$ xzgrep "Thread pool 'hostUpdatesChecker' is using 0 threads out of 5 and 2 tasks are waiting in the queue" engine.log.xz | wc -l
125

It looks like these thread pools will never make any progress. This blocks virt-v2v integration for 4.2.

I think the host stuck in the connecting state is just a symptom and not the real issue. The issue seems to be the thread pools with no threads and with pending tasks that will never be executed.
FYI I'm running this (both engine and 1 node) on baremetal, on newly installed CentOS 7.4 servers. ovirt-engine-4.2.2.5-1.el7.centos.noarch
(In reply to Richard W.M. Jones from comment #9)
> FYI I'm running this (both engine and 1 node) on baremetal, on
> newly installed CentOS 7.4 servers.
> ovirt-engine-4.2.2.5-1.el7.centos.noarch

Richard, can you attach a thread dump from engine when this issue is reproduced? "kill -QUIT engine-pid" should write a thread dump to /var/log/ovirt-engine/console.log.
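The full sequence looks roughly like this (a sketch; the pgrep pattern is an assumption based on how the process shows up in ps, so adjust it to your installation):

# Find the engine Java process and ask the JVM for a thread dump
pgrep -f 'ovirt-engine -server'
kill -QUIT <engine-pid>
# The JVM writes the dump to its stdout, which ends up in console.log
tail -n 200 /var/log/ovirt-engine/console.log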
Created attachment 1415095 [details]
engine log sent by Richard

Adding the engine log sent by Richard.
Ravi, could you please take a look?
To be clear here, "engine-pid" refers to the Java process:

22186 ?  Sl  22:45 ovirt-engine -server -XX:+TieredCompilation [...]

?
(In reply to Nir Soffer from comment #7)
> Also, how do you explain "using 0 threads out of 5 and 5 tasks are waiting
> in the queue"?
>
> Will the posted patch fix the issue of having zero threads in the executor?

We haven't found any issue with the thread pool. The posted patch just changes the host status to Up earlier, before the InitVdsOnUp services are executed, so the host should move directly from Up to NonOperational without being seen in the Connecting status. But this is only an optimization, as Ravi hasn't been able to reproduce the issue on master, 4.2 or 4.1: the host was always switched to NonOperational within 2 minutes and never got stuck in Connecting. So we are probably missing something here.
(In reply to Richard W.M. Jones from comment #13)
> To be clear here, "engine-pid" refers to the Java process:
>
> 22186 ?  Sl  22:45 ovirt-engine -server -XX:+TieredCompilation [...]

Yes.
Looking at the engine code that logs the "Thread pool 'XXX' is using 0 threads" messages, it turns out that the log messages are confusing. When we see this message:

Thread pool 'Foo' is using 0 threads out of 5 and 2 tasks are waiting in the queue.

it actually means:
- 0 threads are in state NEW, RUNNABLE, or BLOCKED
- 2 threads are in state TIMED_WAITING or WAITING
- the number of tasks in the queue is unknown

So this does not show any issue with the thread pools. I posted https://gerrit.ovirt.org/#/c/89648 to fix the log message text.

We will open another bug for tickets expiring during upload; it should be checked first by the storage team.
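As a sanity check, this can also be seen in a thread dump already captured in console.log (a sketch, assuming the default HotSpot thread dump format where each thread has a "java.lang.Thread.State: ..." line; adjust the path if your installation differs):

# Count engine threads by state in the captured thread dump(s)
grep 'java.lang.Thread.State' /var/log/ovirt-engine/console.log | awk '{print $2}' | sort | uniq -c

The pool threads reported as "using 0 threads" should show up here as TIMED_WAITING/WAITING, i.e. idle and parked waiting for work, not missing.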
*** Bug 1566072 has been marked as a duplicate of this bug. ***
The thread pool log message is now much less confusing:

2018-04-19 16:23:40,734+02 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'engineScheduled' is using 0 threads out of 100, 100 threads waiting for tasks.

# yum list ovirt-engine
Loaded plugins: product-id, search-disabled-repos, versionlock
Installed Packages
ovirt-engine.noarch    4.2.3.2-0.1.el7
This bugzilla is included in the oVirt 4.2.3 release, published on May 4th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.