Bug 1684554 - Host activation task gets stuck after many retries
Summary: Host activation task gets stuck after many retries
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.3.3
Target Release: ---
Assignee: Ravi Nori
QA Contact: Liran Rotenberg
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-01 14:24 UTC by Netbulae
Modified: 2019-04-16 13:58 UTC
CC: 4 users

Fixed In Version: ovirt-engine-4.3.3.1
Clone Of:
Environment:
Last Closed: 2019-04-16 13:58:18 UTC
oVirt Team: Infra
Embargoed:
pm-rhel: ovirt-4.3+


Attachments (Terms of Use)
screenshot of tasks (49.28 KB, image/png)
2019-03-01 14:24 UTC, Netbulae
piece of engine.log (33.98 KB, text/plain)
2019-03-01 14:45 UTC, Netbulae


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 98633 0 master MERGED engine : InitVdsOnUpCommand should handle errors when connecting to storage domain 2020-04-08 16:02:25 UTC
oVirt gerrit 98674 0 ovirt-engine-4.3 MERGED engine : InitVdsOnUpCommand should handle errors when connecting to storage domain 2020-04-08 16:02:25 UTC

Description Netbulae 2019-03-01 14:24:57 UTC
Created attachment 1539826 [details]
screenshot of tasks

Description of problem:

We have a host that cannot connect to the storage domain. When we try to activate it, activation fails, and then the engine tries again and again.

After a number of retries the activation task gets stuck and some other new tasks fail.

I also cannot put the host into maintenance until the engine is restarted.

Version-Release number of selected component (if applicable):

4.3.0

How reproducible:
Always

Steps to Reproduce:
1. Add host that has storage issues
2. wait


On a related note, can we define a maximum number of activation attempts somewhere? If activation fails that many times, the host should go into maintenance and stop retrying.

Comment 1 Netbulae 2019-03-01 14:44:54 UTC
I don't know if this is the relevant log entry; I'll attach a piece of the engine log below.

2019-03-01 10:00:30,391+01 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [6a99b660-5991-4f5b-9a7b-112965033613] Command CopyData id: 'a8126c50-0bcb-4d97-848b-c70f5a864430': couldn't get the status of job '35270925-e550-472e-a2e1-743b87e41e73' on host 'node10' (id: 'd301c1d7-d94e-45ba-90c9-dff70afb774c'), assuming it's still running

Comment 2 Netbulae 2019-03-01 14:45:27 UTC
Created attachment 1539841 [details]
piece of engine.log

Comment 3 Liran Rotenberg 2019-04-04 07:41:35 UTC
Verified on:
ovirt-engine-4.3.3.1-0.1.el7.noarch

Steps:
1. Block the host to the storage.
For example run on the host:
# iptables --insert INPUT 1 --source <storage> --jump DROP --protocol all
2. Add the host to the environment.

Results:
At the end of the host installation, the engine tries to activate the host.
The task appears in the engine as "stuck", but after a few minutes it ends, setting the new host to non-operational.
The engine then tries to activate the host again as part of its normal behavior.
Setting the host into maintenance is possible, but only while the host is non-operational, so you need to catch it between activation attempts.

Comment 4 Liran Rotenberg 2019-04-04 11:03:56 UTC
Ravi, if the intention was to stop the auto-activation after the host is non-operational, we will need to move this bug back to ASSIGNED.

Comment 5 Ravi Nori 2019-04-04 13:58:55 UTC
The intention is not to stop auto-activation; if we stopped it, the host would not automatically come back up once the storage domain problem is fixed.

HostMonitoring should see that the host can be reached and try to connect to the storage domain in the next cycle.
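
The merged gerrit change is summarized as "InitVdsOnUpCommand should handle errors when connecting to storage domain". As a rough illustration of that direction only (every class, method, and enum name below is hypothetical, not actual oVirt engine code), failing the activation cleanly instead of letting the task hang might look like:

```java
// Hypothetical sketch: handle storage-connection failures during host
// activation instead of leaving the activation task stuck.
// Names are illustrative, NOT real oVirt engine classes.
public class HostActivationSketch {

    enum HostStatus { UP, NON_OPERATIONAL }

    interface StoragePool {
        void connect(String hostId) throws Exception;
    }

    /**
     * Try to connect the host to storage. On failure, demote the host to
     * non-operational so HostMonitoring can retry on the next cycle,
     * rather than letting the activation task hang indefinitely.
     */
    static HostStatus activate(String hostId, StoragePool pool) {
        try {
            pool.connect(hostId);
            return HostStatus.UP;
        } catch (Exception e) {
            // Previously an unhandled error here could leave the task stuck;
            // with the error handled, the failure is recorded and the host
            // is set non-operational instead.
            return HostStatus.NON_OPERATIONAL;
        }
    }

    public static void main(String[] args) {
        StoragePool broken = hostId -> { throw new Exception("storage unreachable"); };
        StoragePool healthy = hostId -> { };
        System.out.println(activate("node10", broken));   // prints NON_OPERATIONAL
        System.out.println(activate("node10", healthy));  // prints UP
    }
}
```

This matches the verified behavior in comment 3: the task now ends after a few minutes with the host non-operational, and monitoring keeps retrying.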

Comment 6 Netbulae 2019-04-04 14:16:49 UTC
I would like an optional max-retries parameter.

When you have racks full of nodes fencing and activating, it generates additional load in power usage and oVirt tasks, which can lead to cascading effects through your infrastructure.

It's also not good for the nodes to power-cycle all the time. Someone should be noticing this through monitoring etc., but I've seen things go wrong too many times in my career ;-)

Comment 7 Ravi Nori 2019-04-04 14:18:49 UTC
Please open an RFE to add a max-retries setting that can be configured per cluster. We will address it there.
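
As a rough sketch of what such a per-cluster max-retries guard could look like (purely illustrative; none of these names are oVirt APIs, and a real implementation would read the cap from cluster configuration):

```java
// Hypothetical sketch of the requested RFE: cap automatic activation
// retries; once the cap is reached, leave the host in maintenance.
// All names and the configuration mechanism are illustrative.
import java.util.HashMap;
import java.util.Map;

public class ActivationRetryGuard {
    private final int maxRetries;                       // per-cluster cap (assumed)
    private final Map<String, Integer> attempts = new HashMap<>();

    public ActivationRetryGuard(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Returns true if another automatic activation attempt is allowed. */
    public boolean tryActivate(String hostId) {
        int used = attempts.getOrDefault(hostId, 0);
        if (used >= maxRetries) {
            return false; // cap reached: stop retrying, keep host in maintenance
        }
        attempts.put(hostId, used + 1);
        return true;
    }

    /** Reset the counter, e.g. after a successful activation or manual action. */
    public void reset(String hostId) {
        attempts.remove(hostId);
    }

    public static void main(String[] args) {
        ActivationRetryGuard guard = new ActivationRetryGuard(3);
        for (int i = 0; i < 5; i++) {
            System.out.println("attempt " + (i + 1) + ": " + guard.tryActivate("node10"));
        }
        // prints true, true, true, false, false
    }
}
```

Note the trade-off raised in comment 5: with a hard cap, the host would not come back automatically once storage is fixed unless the counter is reset, which is why this was deferred to an RFE rather than folded into this fix.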

Comment 8 Sandro Bonazzola 2019-04-16 13:58:18 UTC
This bug is included in the oVirt 4.3.3 release, published on April 16th 2019.

Since the problem described in this bug report should be
resolved in the oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

