Bug 1684554 - Host activation task gets stuck after many retries
Summary: Host activation task gets stuck after many retries
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.3.3
Target Release: ---
Assignee: Ravi Nori
QA Contact: Liran Rotenberg
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-01 14:24 UTC by Netbulae
Modified: 2019-04-16 13:58 UTC
CC: 4 users

Fixed In Version: ovirt-engine-4.3.3.1
Clone Of:
Environment:
Last Closed: 2019-04-16 13:58:18 UTC
oVirt Team: Infra
Embargoed:
pm-rhel: ovirt-4.3+


Attachments (Terms of Use)
screenshot of tasks (49.28 KB, image/png)
2019-03-01 14:24 UTC, Netbulae
piece of engine.log (33.98 KB, text/plain)
2019-03-01 14:45 UTC, Netbulae


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 98633 0 master MERGED engine : InitVdsOnUpCommand should handle errors when connecting to storage domain 2020-04-08 16:02:25 UTC
oVirt gerrit 98674 0 ovirt-engine-4.3 MERGED engine : InitVdsOnUpCommand should handle errors when connecting to storage domain 2020-04-08 16:02:25 UTC

Description Netbulae 2019-03-01 14:24:57 UTC
Created attachment 1539826 [details]
screenshot of tasks

Description of problem:

We have a host that cannot connect to the storage domain. When we try to activate it, activation fails, and then the engine tries again and again.

After a number of retries the activation task gets stuck and some other new tasks fail.

I also cannot put the host into maintenance until the engine is restarted.

Version-Release number of selected component (if applicable):

4.3.0

How reproducible:
Always

Steps to Reproduce:
1. Add host that has storage issues
2. wait


On a related note, can we define a maximum number of activation attempts somewhere? If activation fails that many times, the host should go into maintenance and stop retrying.

Comment 1 Netbulae 2019-03-01 14:44:54 UTC
I don't know if this is the relevant log entry; I'll attach a piece of the engine log below.

2019-03-01 10:00:30,391+01 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [6a99b660-5991-4f5b-9a7b-112965033613] Command CopyData id: 'a8126c50-0bcb-4d97-848b-c70f5a864430': couldn't get the status of job '35270925-e550-472e-a2e1-743b87e41e73' on host 'node10' (id: 'd301c1d7-d94e-45ba-90c9-dff70afb774c'), assuming it's still running

Comment 2 Netbulae 2019-03-01 14:45:27 UTC
Created attachment 1539841 [details]
piece of engine.log

Comment 3 Liran Rotenberg 2019-04-04 07:41:35 UTC
Verified on:
ovirt-engine-4.3.3.1-0.1.el7.noarch

Steps:
1. Block the host to the storage.
For example run on the host:
# iptables --insert INPUT 1 --source <storage> --jump DROP --protocol all
2. Add the host to the environment.

Results:
At the end of the host installation, the engine tries to activate the host.
The task appears in the engine as "stuck", but after a few minutes it ends, setting the new host to non-operational.
The engine then tries to activate the host again as part of its normal behavior.
Setting the host into maintenance is possible, but only while the host is non-operational, so you need to catch it between activation attempts.

Comment 4 Liran Rotenberg 2019-04-04 11:03:56 UTC
Ravi, if the intention was to stop the auto-activation after the host is non-operational, we will need to move this bug back to ASSIGNED.

Comment 5 Ravi Nori 2019-04-04 13:58:55 UTC
The intention is not to stop auto-activation; if we stopped it, the host would not automatically come back up once the storage domain problem is fixed.

HostMonitoring should see that the host can be reached and try to connect to the storage domain in the next cycle.
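
The merged gerrit change is summarized as "InitVdsOnUpCommand should handle errors when connecting to storage domain". As a rough illustration of that direction only (every class, method, and enum name below is hypothetical, not actual oVirt engine code), failing the activation cleanly instead of letting the task hang might look like:

```java
// Hypothetical sketch: handle storage-connection failures during host
// activation instead of leaving the activation task stuck.
// Names are illustrative, NOT real oVirt engine classes.
public class HostActivationSketch {

    enum HostStatus { UP, NON_OPERATIONAL }

    interface StoragePool {
        void connect(String hostId) throws Exception;
    }

    /**
     * Try to connect the host to storage. On failure, demote the host to
     * non-operational so HostMonitoring can retry on the next cycle,
     * rather than letting the activation task hang indefinitely.
     */
    static HostStatus activate(String hostId, StoragePool pool) {
        try {
            pool.connect(hostId);
            return HostStatus.UP;
        } catch (Exception e) {
            // Previously an unhandled error here could leave the task stuck;
            // with the error handled, the failure is recorded and the host
            // is set non-operational instead.
            return HostStatus.NON_OPERATIONAL;
        }
    }

    public static void main(String[] args) {
        StoragePool broken = hostId -> { throw new Exception("storage unreachable"); };
        StoragePool healthy = hostId -> { };
        System.out.println(activate("node10", broken));   // prints NON_OPERATIONAL
        System.out.println(activate("node10", healthy));  // prints UP
    }
}
```

This matches the verified behavior in comment 3: the task now ends after a few minutes with the host non-operational, and monitoring keeps retrying.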

Comment 6 Netbulae 2019-04-04 14:16:49 UTC
I would like an optional max-retries parameter.

When you have racks full of nodes fencing and activating, it generates additional load in power usage and oVirt tasks, which can lead to cascading effects through your infrastructure.

It's also not good for the nodes to power-cycle all the time. Someone should be noticing this through monitoring etc., but I've seen things go wrong too many times in my career ;-)

Comment 7 Ravi Nori 2019-04-04 14:18:49 UTC
Please open an RFE to add a max-retries setting that can be configured per cluster. We will address it there.
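
As a rough sketch of what such a per-cluster max-retries guard could look like (purely illustrative; none of these names are oVirt APIs, and a real implementation would read the cap from cluster configuration):

```java
// Hypothetical sketch of the requested RFE: cap automatic activation
// retries; once the cap is reached, leave the host in maintenance.
// All names and the configuration mechanism are illustrative.
import java.util.HashMap;
import java.util.Map;

public class ActivationRetryGuard {
    private final int maxRetries;                       // per-cluster cap (assumed)
    private final Map<String, Integer> attempts = new HashMap<>();

    public ActivationRetryGuard(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Returns true if another automatic activation attempt is allowed. */
    public boolean tryActivate(String hostId) {
        int used = attempts.getOrDefault(hostId, 0);
        if (used >= maxRetries) {
            return false; // cap reached: stop retrying, keep host in maintenance
        }
        attempts.put(hostId, used + 1);
        return true;
    }

    /** Reset the counter, e.g. after a successful activation or manual action. */
    public void reset(String hostId) {
        attempts.remove(hostId);
    }

    public static void main(String[] args) {
        ActivationRetryGuard guard = new ActivationRetryGuard(3);
        for (int i = 0; i < 5; i++) {
            System.out.println("attempt " + (i + 1) + ": " + guard.tryActivate("node10"));
        }
        // prints true, true, true, false, false
    }
}
```

Note the trade-off raised in comment 5: with a hard cap, the host would not come back automatically once storage is fixed unless the counter is reset, which is why this was deferred to an RFE rather than folded into this fix.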

Comment 8 Sandro Bonazzola 2019-04-16 13:58:18 UTC
This bug is included in the oVirt 4.3.3 release, published on April 16th 2019.

Since the problem described in this bug report should be
resolved in the oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

