Bug 1556971 - Host stays in "connecting" state for a longer time than necessary
Summary: Host stays in "connecting" state for a longer time than necessary
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.2.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.3
Target Release: 4.2.3
Assignee: Ravi Nori
QA Contact: Pavol Brilla
URL:
Whiteboard:
Duplicates: 1566072
Depends On:
Blocks: 1560666
 
Reported: 2018-03-15 16:13 UTC by Nir Soffer
Modified: 2018-05-10 06:31 UTC
CC List: 7 users

Fixed In Version: ovirt-engine-4.2.3
Clone Of:
Environment:
Last Closed: 2018-05-10 06:31:37 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+
lsvaty: testing_ack+


Attachments
engine.log, console.log from engine, vdsm.log from host (599.94 KB, application/x-gzip)
2018-03-15 16:13 UTC, Nir Soffer
engine log sent by Richard (158.03 KB, application/x-xz)
2018-03-30 11:29 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 89205 0 master MERGED engine : Host stuck in "connecting" state 2020-05-27 10:34:37 UTC
oVirt gerrit 89515 0 ovirt-engine-4.2 MERGED engine : Host stuck in "connecting" state 2020-05-27 10:34:37 UTC
oVirt gerrit 89648 0 master MERGED core: Fix thread pool monitoring logs 2020-05-27 10:34:37 UTC
oVirt gerrit 89741 0 ovirt-engine-4.2 MERGED core: Fix thread pool monitoring logs 2020-05-27 10:34:37 UTC

Description Nir Soffer 2018-03-15 16:13:10 UTC
Created attachment 1408493 [details]
engine.log, console.log from engine, vdsm.log from host

Description of problem:

After vdsm was restarted on a host, the host is stuck in "connecting" state forever.

engine.log contains almost no messages; we see:

2018-03-09 03:41:01,130+02 INFO  [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'engineScheduled' is using 0 threads out of 100 and 100 tasks are waiting in the queue.

2018-03-09 03:41:01,130+02 INFO  [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'hostUpdatesChecker' is using 0 threads out of 5 and 5 tasks are waiting in the queue.

Having zero threads is probably not very effective :-)

The host is accessible:
# ping voodoo6.tlv.redhat.com
PING voodoo6.tlv.redhat.com (10.35.0.110) 56(84) bytes of data.
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=1 ttl=64 time=0.519 ms
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=2 ttl=64 time=0.514 ms
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=3 ttl=64 time=0.545 ms
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=4 ttl=64 time=0.542 ms
64 bytes from voodoo6.tlv.redhat.com (10.35.0.110): icmp_seq=5 ttl=64 time=0.542 ms

And in vdsm logs we don't see any connection attempt from engine.

Version-Release number of selected component (if applicable):
Version 4.2.2.1-1.el7.centos (recent ovirt-4.2 release)

How reproducible:
Happened once

Steps to Reproduce:
Not clear what triggers this, but the last things I did were:
1. Stop vdsm on the host
2. Host moved to "connecting" state
3. Start vdsm on the host
4. Engine stuck in "connecting" state forever

Actual results:
System is not operational

Expected results:
Host connection succeeds

Comment 1 Nir Soffer 2018-03-15 16:14:26 UTC
Note: I did several thread dumps, see console.log.

Comment 2 Nir Soffer 2018-03-15 16:27:13 UTC
Restarting engine "fixes" the issue.

Comment 3 Ravi Nori 2018-03-19 15:47:16 UTC
I am unable to reproduce this on either master or 4.2.2

The host consistently comes back up in a reasonable amount of time (2 minutes). 
The behavior is the same in 4.1

My setup

One host with nfs storage domain

1. stop vdsm, wait for host status to go to connecting
2. stop nfs-server
3. start vdsm

After InitVdsOnUpCommand completes, the status is "Up". It takes 2 minutes for InitVdsOnUpCommand to execute and fail to connect the host to the storage domain; then the status is updated in the database and the host status moves to Up.

Comment 4 Ravi Nori 2018-03-19 17:16:46 UTC
Added a patch to update host status in the database before invoking InitVdsOnUp
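
For illustration, a minimal sketch of the ordering change described above, with hypothetical class and method names (this is not the actual code from the posted patch):

    // Hypothetical sketch only, not the code from the posted patch: persist the
    // Up status before the potentially slow InitVdsOnUp flow runs, so the host
    // is not left reported as "Connecting" in the meantime.
    public class HostActivationSketch {

        enum HostStatus { CONNECTING, UP, NON_OPERATIONAL }

        interface HostDao { void updateStatus(String hostId, HostStatus status); }

        interface InitVdsOnUpRunner { boolean run(String hostId); }

        private final HostDao hostDao;              // assumed persistence abstraction
        private final InitVdsOnUpRunner initRunner; // assumed wrapper around InitVdsOnUpCommand

        public HostActivationSketch(HostDao hostDao, InitVdsOnUpRunner initRunner) {
            this.hostDao = hostDao;
            this.initRunner = initRunner;
        }

        public void onHostRespondedAfterRestart(String hostId) {
            // Before the patch: the status stayed "Connecting" until
            // InitVdsOnUpCommand finished, which can take minutes.
            // After the patch: update the database first...
            hostDao.updateStatus(hostId, HostStatus.UP);

            // ...then run the (possibly slow) initialization; if connecting the
            // host to storage fails, move it to NonOperational afterwards.
            if (!initRunner.run(hostId)) {
                hostDao.updateStatus(hostId, HostStatus.NON_OPERATIONAL);
            }
        }
    }

The point of the change is only the ordering: the status write no longer waits for the InitVdsOnUp flow to finish.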

Comment 5 Martin Perina 2018-03-20 14:54:41 UTC
Lowering severity to high, as the host status always changes after some time and this issue has existed for a long time.

Comment 6 Nir Soffer 2018-03-20 17:34:31 UTC
(In reply to Martin Perina from comment #5)
> Lowering severity to high as host status is always changed after some time
> and this issue exists for a long time.

I don't know what "some time" is, but in my case the engine was stuck for at least
30 minutes, and a restart was required.

Comment 7 Nir Soffer 2018-03-20 17:36:22 UTC
Also, how do you explain "using 0 threads out of 5 and 5 tasks are waiting in the
queue"?

Will the posted patch fix the issue of having zero threads in the executor?

Comment 8 Nir Soffer 2018-03-30 11:19:43 UTC
Martin, we have a second report from rjones about this issue.

This time the engine stopped extending the imageio ticket, causing the image
transfer to fail after 30 minutes.
See https://www.redhat.com/archives/libguestfs/2018-March/msg00207.html

In engine logs we see:

1. Ticket was extended

2018-03-29 12:54:25,236+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [189a3bcd-8a40-4f81-8a56-d2efcecc65ea] FINISH, ExtendImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: fc99636
2018-03-29 12:54:25,237+01 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [189a3bcd-8a40-4f81-8a56-d2efcecc65ea] Transfer session with ticket id 4e8567f6-939d-471f-8ea8-e64d172c0a26 extended, timeout 300 seconds

The next extend should happen 4 minutes later, but we don't see any other extend.

2. We start to see messages about thread pools with *0* threads and tasks in
the queue:

2018-03-29 12:54:35,738+01 INFO  [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'default' is using 0 threads out of 1 and 5 tasks are waiting in the queue.

2018-03-29 12:54:35,738+01 INFO  [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'engineScheduled' is using 0 threads out of 100 and 100 tasks are waiting in the queue.

2018-03-29 12:54:35,738+01 INFO  [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'hostUpdatesChecker' is using 0 threads out of 5 and 2 tasks are waiting in the queue.

3. These messages repeat many times:

$ xzgrep "Thread pool 'default' is using 0 threads out of 1 and 5 tasks are waiting in the queue" engine.log.xz | wc -l
295

$ xzgrep "Thread pool 'engineScheduled' is using 0 threads out of 100 and 100 tasks are waiting in the queue" engine.log.xz | wc -l
294

$ xzgrep "Thread pool 'hostUpdatesChecker' is using 0 threads out of 5 and 2 tasks are waiting in the queue" engine.log.xz | wc -l
125

It looks like these thread pools will never make any progress.

This blocks virt-v2v integration for 4.2.

I think the host being stuck in the connecting state is just a symptom and not
the real issue. The issue seems to be the thread pools with no threads and
pending tasks that will never be executed.

Comment 9 Richard W.M. Jones 2018-03-30 11:22:44 UTC
FYI I'm running this (both engine and 1 node) on baremetal, on
newly installed CentOS 7.4 servers.
ovirt-engine-4.2.2.5-1.el7.centos.noarch

Comment 10 Nir Soffer 2018-03-30 11:28:04 UTC
(In reply to Richard W.M. Jones from comment #9)
> FYI I'm running this (both engine and 1 node) on baremetal, on
> newly installed CentOS 7.4 servers.
> ovirt-engine-4.2.2.5-1.el7.centos.noarch

Richard, can you attach a thread dump from engine when this issue is reproduced?

"kill -QUIT engine-pid" should write a thread dump to
/var/log/ovirt-engine/console.log.

Comment 11 Nir Soffer 2018-03-30 11:29:21 UTC
Created attachment 1415095 [details]
engine log sent by Richard

Adding the engine log sent by Richard.

Comment 12 Martin Perina 2018-03-30 11:31:02 UTC
Ravi, could you please take a look?

Comment 13 Richard W.M. Jones 2018-03-30 11:51:36 UTC
To be clear here, "engine-pid" refers to the Java process:

22186 ?        Sl    22:45 ovirt-engine -server -XX:+TieredCompilation [...]

?

Comment 14 Martin Perina 2018-03-30 12:15:20 UTC
(In reply to Nir Soffer from comment #7)
> Also, how do explain "using 0 threads out of 5 and 5 tasks are waiting in the
> queue"?
> 
> The posted patch will fix the issue of having zero threads in the executor?

We haven't found any issue with the thread pool. The posted patch just changes the host status to Up sooner, before the InitVdsOnUp services are executed, so the host should move directly from Up to NonOperational without being seen in the Connecting status. But this is only an optimization, as Ravi has not been able to reproduce the issue on master, 4.2, or 4.1; the host was always switched to NonOperational within 2 minutes and never got stuck in Connecting. So we are probably missing something here.

Comment 15 Nir Soffer 2018-03-30 23:26:50 UTC
(In reply to Richard W.M. Jones from comment #13)
> To be clear here, "engine-pid" refers to the Java process:
> 
> 22186 ?        Sl    22:45 ovirt-engine -server -XX:+TieredCompilation [...]

Yes.

Comment 16 Nir Soffer 2018-04-01 09:51:25 UTC
Looking at the engine code that logs the "Thread pool 'XXX' is using 0 threads" messages,
it turns out that the log messages are confusing.

When we see this message:

    Thread pool 'Foo' is using 0 threads out of 5 and 2 tasks are waiting in the queue.

It means that:

- 0 threads in state NEW, RUNNABLE, or BLOCKED
- 2 threads in state TIMED_WAITING or WAITING
- number of tasks in queue is unknown

So this does not show any issue with the thread pools.

I posted https://gerrit.ovirt.org/#/c/89648 to fix the log message text.
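
To make the confusion concrete, here is a minimal, self-contained Java sketch (not the actual ThreadPoolMonitoringService code; the pool name is just borrowed from the log above) showing how counting idle workers as "tasks waiting in the queue" produces exactly this kind of message:

    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.Collectors;

    // Minimal sketch, not the real monitoring code: classify the threads of a
    // named pool by Thread.State and print both the old, misleading wording and
    // the clearer wording the fix moves towards.
    public class PoolStateSketch {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(5,
                    r -> new Thread(r, "hostUpdatesChecker-" + System.nanoTime()));

            // Force all five workers to be created, then leave them idle.
            for (int i = 0; i < 5; i++) pool.submit(() -> {});
            Thread.sleep(200);

            Map<Thread.State, Long> byState = Thread.getAllStackTraces().keySet().stream()
                    .filter(t -> t.getName().startsWith("hostUpdatesChecker-"))
                    .collect(Collectors.groupingBy(Thread::getState, Collectors.counting()));

            long active = byState.getOrDefault(Thread.State.NEW, 0L)
                    + byState.getOrDefault(Thread.State.RUNNABLE, 0L)
                    + byState.getOrDefault(Thread.State.BLOCKED, 0L);
            long idle = byState.getOrDefault(Thread.State.WAITING, 0L)
                    + byState.getOrDefault(Thread.State.TIMED_WAITING, 0L);

            // Old wording: idle workers reported as queued tasks.
            System.out.printf("using %d threads out of 5 and %d tasks are waiting in the queue%n",
                    active, idle);
            // Clearer wording: idle workers reported as threads waiting for tasks.
            System.out.printf("using %d threads out of 5, %d threads waiting for tasks%n",
                    active, idle);

            pool.shutdown();
        }
    }

Run against an idle pool, this prints "using 0 threads out of 5 and 5 tasks are waiting in the queue" even though the work queue is actually empty, which is exactly why the message looked alarming.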

We will open another bug for tickets expiring during upload; it should be checked
first by the storage team.

Comment 17 Simone Tiraboschi 2018-04-11 14:05:15 UTC
*** Bug 1566072 has been marked as a duplicate of this bug. ***

Comment 18 Pavol Brilla 2018-04-19 14:46:21 UTC
The thread pool log message is much less confusing now:

2018-04-19 16:23:40,734+02 INFO  [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'engineScheduled' is using 0 threads out of 100, 100 threads waiting for tasks.

# yum list ovirt-engine
Loaded plugins: product-id, search-disabled-repos, versionlock
Installed Packages
ovirt-engine.noarch                             4.2.3.2-0.1.el7

Comment 19 Sandro Bonazzola 2018-05-10 06:31:37 UTC
This bug is included in the oVirt 4.2.3 release, published on May 4th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

