Bug 1684554

Summary:

Host activation task gets stuck after many retries

Product:

[oVirt] ovirt-engine

Reporter:

Netbulae <info>

Component:

Backend.Core

Assignee:

Ravi Nori <rnori>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Liran Rotenberg <lrotenbe>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

4.3.0

CC:

bugs, michal.skrivanek, mperina, rnori

Target Milestone:

ovirt-4.3.3

Flags:

pm-rhel: ovirt-4.3+

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

ovirt-engine-4.3.3.1

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-04-16 13:58:18 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Infra

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
screenshot of tasks	none
piece of engine.log	none

Description Netbulae 2019-03-01 14:24:57 UTC

Created attachment 1539826 [details]
screenshot of tasks

Description of problem:

We have a host that cannot connect to the storage domain, so activating it fails. The engine then retries the activation again and again.

After a number of retries, the activation task gets stuck and some other new tasks fail.

Also, I cannot put the host into maintenance until the engine is restarted.

Version-Release number of selected component (if applicable):

4.3.0

How reproducible:
Always

Steps to Reproduce:
1. Add a host that has storage issues
2. Wait


On a related note, can we define a maximum number of activation attempts somewhere? If activation fails x times, the host should go into maintenance and stop retrying.

Comment 1 Netbulae 2019-03-01 14:44:54 UTC

I don't know whether this log entry is related to the problem. I'll attach a piece of the engine log after this comment.

2019-03-01 10:00:30,391+01 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [6a99b660-5991-4f5b-9a7b-112965033613] Command CopyData id: 'a8126c50-0bcb-4d97-848b-c70f5a864430': couldn't get the status of job '35270925-e550-472e-a2e1-743b87e41e73' on host 'node10' (id: 'd301c1d7-d94e-45ba-90c9-dff70afb774c'), assuming it's still running

Comment 2 Netbulae 2019-03-01 14:45:27 UTC

Created attachment 1539841 [details]
piece of engine.log

Comment 3 Liran Rotenberg 2019-04-04 07:41:35 UTC

Verified on:
ovirt-engine-4.3.3.1-0.1.el7.noarch

Steps:
1. Block the host's access to the storage.
For example, run on the host:
# iptables --insert INPUT 1 --source <storage> --jump DROP --protocol all
2. Add the host to the environment.

Results:
At the end of the host installation, the engine tries to activate the host.
The task appears as "stuck" in the engine. After a few minutes the task ends and the new host is set to non-operational.
The engine then tries to activate the host again, as part of its normal behavior.
Setting the host to maintenance is possible, but only while the host is in non-operational mode, so it has to be done between activation attempts.

Comment 4 Liran Rotenberg 2019-04-04 11:03:56 UTC

Ravi, if the intention was to stop the auto-activation once the host is non-operational, we will need to move this bug back to assigned.

Comment 5 Ravi Nori 2019-04-04 13:58:55 UTC

The intention is not to stop auto-activation. If we stop auto-activation, the host will not come back up automatically when the storage domain problem is fixed.

HostMonitoring should see that the host can be reached and try to connect to the storage domain in the next cycle.

Comment 6 Netbulae 2019-04-04 14:16:49 UTC

I would like an optional max-retries parameter.

When you have racks full of nodes being fenced and activated, it generates additional power load and additional oVirt tasks. That can lead to cascading effects through your infrastructure.

Also, it's not good for the nodes to power-cycle all the time. Someone should be monitoring and noticing this, but I've seen things go wrong too many times in my career ;-)

Comment 7 Ravi Nori 2019-04-04 14:18:49 UTC

Please open an RFE to add max-retries that can be configured per cluster. We will address it.
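
For illustration only, the requested behavior could look roughly like the sketch below. It is not oVirt engine code: the names ActivationPolicy, HostActivator, monitoring_cycle and the max_retries setting are all hypothetical, and the try_activate callback merely stands in for whatever check the engine performs. The idea is that each monitoring cycle retries activation, and once a host has failed max_retries times in a row it is moved to maintenance and no longer retried.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class ActivationPolicy:
    """Hypothetical per-cluster setting; not an existing oVirt option."""
    max_retries: int = 3


@dataclass
class HostActivator:
    policy: ActivationPolicy
    # Hypothetical callback: returns True when the host reaches its storage.
    try_activate: Callable[[str], bool]
    failures: Dict[str, int] = field(default_factory=dict)
    maintenance: Set[str] = field(default_factory=set)

    def monitoring_cycle(self, host: str) -> str:
        """Run one activation attempt and return the resulting host state."""
        if host in self.maintenance:
            # Retry limit was reached earlier: do not try again.
            return "maintenance"
        if self.try_activate(host):
            # Success resets the consecutive-failure counter.
            self.failures.pop(host, None)
            return "up"
        count = self.failures.get(host, 0) + 1
        self.failures[host] = count
        if count >= self.policy.max_retries:
            self.maintenance.add(host)
            return "maintenance"
        return "non_operational"


# A host whose storage is never reachable stops being retried after 3 failures.
activator = HostActivator(ActivationPolicy(max_retries=3),
                          try_activate=lambda host: False)
states = [activator.monitoring_cycle("node10") for _ in range(5)]
print(states)
```

With a limit like this, an administrator would have to activate the host manually after fixing the storage problem, which is the trade-off against the automatic recovery described in comment 5.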

Comment 8 Sandro Bonazzola 2019-04-16 13:58:18 UTC

This bug is included in the oVirt 4.3.3 release, published on April 16th, 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.