Bug 1684554
| Summary: | Host activation task gets stuck after many retries | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Netbulae <info> |
| Component: | Backend.Core | Assignee: | Ravi Nori <rnori> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Liran Rotenberg <lrotenbe> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | bugs, michal.skrivanek, mperina, rnori |
| Target Milestone: | ovirt-4.3.3 | Flags: | pm-rhel: ovirt-4.3+ |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.3.3.1 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-16 13:58:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

I don't know if this is the log of the problem; I'll attach a piece of the engine log hereafter.

2019-03-01 10:00:30,391+01 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [6a99b660-5991-4f5b-9a7b-112965033613] Command CopyData id: 'a8126c50-0bcb-4d97-848b-c70f5a864430': couldn't get the status of job '35270925-e550-472e-a2e1-743b87e41e73' on host 'node10' (id: 'd301c1d7-d94e-45ba-90c9-dff70afb774c'), assuming it's still running

Created attachment 1539841: piece of engine.log

Verified on: ovirt-engine-4.3.3.1-0.1.el7.noarch

Steps:
1. Block the host's access to the storage. For example, run on the host:
   # iptables --insert INPUT 1 --source <storage> --jump DROP --protocol all
2. Add the host to the environment.

Results:
At the end of the host installation, the engine tries to activate the host. The task appears in the engine as "stuck". After a few minutes the task ends and the new host is set to non-operational. The engine then tries to activate the host again, as part of its normal behavior. Setting the host to maintenance is possible, but only while the host is non-operational, so you need to catch it between activation attempts.

Ravi, if the intention was to stop the auto-activation after the host becomes non-operational, we will need to move this bug back to ASSIGNED.

The intention is not to stop auto-activation. If we stop auto-activation, the host will not automatically come back up when the storage domain problem is fixed. HostMonitoring should see that the host can be reached and try to connect to the storage domain in the next cycle.

For me, I would like an optional max-retries parameter. When you have racks full of nodes, fencing and activating them generates additional load in terms of power usage and oVirt tasks. That can lead to cascade effects through your infrastructure. Also, it's not good for the nodes to power cycle all the time. There should be someone noticing this, and monitoring etc., but I've seen things go wrong too many times in my career ;-)

Please open an RFE to add max-retries that can be configured per cluster. We will address it.

This bugzilla is included in oVirt 4.3.3 release, published on April 16th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.

Created attachment 1539826: screenshot of tasks

Description of problem:
We have a host that cannot connect to the storage domain. When we try to activate it, activation fails. The engine then tries again and again. After x retries the activation task gets stuck and some other new tasks fail. Also, I cannot put the host into maintenance until the engine is restarted.

Version-Release number of selected component (if applicable):
4.3.0

How reproducible:
Always

Steps to Reproduce:
1. Add a host that has storage issues
2. Wait

On a related note, can we define a maximum number of activation attempts somewhere? If activation fails x times, the host should go into maintenance and not try anymore.
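
To make the max-attempts request concrete, here is a minimal sketch of the behaviour being asked for. It is not ovirt-engine code: `ActivationPolicy`, `HostStatus`, `max_attempts` and the `try_activate` callback are all hypothetical names invented for illustration, and the real engine and HostMonitoring logic are not modelled. The default of `None` is assumed to keep the existing behaviour of retrying on every cycle.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Optional


class HostStatus(Enum):
    UP = "up"
    NON_OPERATIONAL = "non_operational"
    MAINTENANCE = "maintenance"


@dataclass
class ActivationPolicy:
    """Hypothetical per-cluster policy; not an existing oVirt setting."""

    # None means "no limit": keep retrying on every monitoring cycle.
    max_attempts: Optional[int] = None
    # Consecutive failed activation attempts, keyed by host name.
    failures: Dict[str, int] = field(default_factory=dict)

    def on_monitoring_cycle(
        self, host: str, try_activate: Callable[[str], bool]
    ) -> HostStatus:
        """Run at most one activation attempt for `host`; return its new status."""
        limit = self.max_attempts
        if limit is not None and self.failures.get(host, 0) >= limit:
            # Limit already reached: stay in maintenance, do not retry.
            return HostStatus.MAINTENANCE
        if try_activate(host):
            # A successful activation resets the failure counter.
            self.failures.pop(host, None)
            return HostStatus.UP
        self.failures[host] = self.failures.get(host, 0) + 1
        if limit is not None and self.failures[host] >= limit:
            # This failure exhausted the budget: move to maintenance.
            return HostStatus.MAINTENANCE
        return HostStatus.NON_OPERATIONAL
```

With `max_attempts=3` and a host whose storage stays unreachable, the first two cycles leave the host non-operational, the third moves it to maintenance, and later cycles make no further attempts. The trade-off raised in the thread still applies: once the limit is hit, the host will not come back on its own after the storage problem is fixed.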