Bug 2184056
| Summary: | Stonith device left in stopped state but overcloud deployment considered successful | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eric Nothen <enothen> |
| Component: | documentation | Assignee: | Ian Frangs <ifrangs> |
| Status: | CLOSED COMPLETED | QA Contact: | RHOS Documentation Team <rhos-docs> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 17.1 (Wallaby) | CC: | jjoyce, jschluet, lmiccini, mariel, mburns, nobody, rhos-maint, slinaber, tvignaud |
| Target Milestone: | --- | Keywords: | Documentation, Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2184014 | Environment: | |
| Last Closed: | 2024-12-16 14:01:09 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2184014 | | |
| Bug Blocks: | | | |
Description
Eric Nothen
2023-04-03 14:19:15 UTC
Luca Miccini (comment #1)
The expected results assumption is not 100% correct. It is of course desirable to have stonith resources showing up as "running" but this doesn't prevent the cluster from using them even if they are in stopped state whenever a cluster node needs to be fenced. We'll try and have a look to see if there is anything we can do to maybe return a warning in the overcloud deploy output, but we don't feel that having a stonith resource in stopped state should make the overall overcloud deployment fail.

Eric Nothen (comment #3)
(In reply to Luca Miccini from comment #1)
I'm not sure I follow the reasoning. If the stonith device is in stopped state because the console password is wrong, how is this device usable?
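For reference, one way to spot stonith resources left in stopped state after a deployment is to scan the cluster status text for them. The following is only a minimal sketch: the function name, the regular expression, and the sample text are illustrative assumptions, and real `pcs status` output differs between pcs versions, so the pattern may need adjusting.

```python
import re

# Hypothetical sample text (an assumption for illustration only);
# real `pcs status` output varies between pcs versions.
SAMPLE_STATUS = """\
  * stonith-fence_ipmilan-525400aa0001 (stonith:fence_ipmilan): Started controller-1
  * stonith-fence_ipmilan-525400aa0002 (stonith:fence_ipmilan): Stopped
"""

# Matches lines shaped like "* <name> (stonith:<agent>): <state> [node]".
# The leading "*" is optional because older output omits it.
STONITH_LINE = re.compile(
    r"^\s*\*?\s*(?P<name>\S+)\s+\(stonith:(?P<agent>[^)]+)\):\s+(?P<state>\S+)"
)


def find_stopped_stonith(status_text):
    """Return the names of stonith resources whose state is not 'Started'."""
    stopped = []
    for line in status_text.splitlines():
        match = STONITH_LINE.match(line)
        if match and match.group("state") != "Started":
            stopped.append(match.group("name"))
    return stopped
```

On a controller, the input text could be captured from `pcs status` after the deployment finishes. A non-empty result would be a reason to review the fencing parameters by hand, since the deployment itself still reports success.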
Luca Miccini (comment #4)
(In reply to Eric Nothen from comment #3)
From a Puppet perspective we currently have no way of knowing if the resource is stopped because of a bogus password, a missing firewall rule, wrong credentials, an IPMI timeout that is too short, a temporary network glitch, etc. Puppet simply hands control over to pcs/pacemaker to make sure the resource definition is syntactically correct. The current (naive) assumption is that if you can register a node in Ironic and deploy it, this is enough guarantee that the IPMI credentials are valid and functional.

The only way to deal with this failure scenario during the deployment would be to introspect the cluster logs, or to actively use those fencing credentials via ipmitool or similar, and/or to implement a wrapper around Puppet so that this introspection can take place. That would require a complete rewrite of the way stonith is configured, so it is unlikely to happen, especially in 16.1. Another possible approach could be to use the validation framework to double-check every credential before the deployment.

Eric Nothen
(In reply to Luca Miccini from comment #4)
Thank you. I understand the complexity, the fact that Puppet just runs a command to configure the device and does not actually check functionality, and that it's very likely this issue won't be fixed before TripleO is dropped. In this case, I suggest we put a note in the documentation saying that TripleO will do "best effort" to complete fencing configuration, because in this case it is still true that the stonith device is misconfigured and unusable, and there's no way to fence the affected controller.
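The idea from comment #4 of actively exercising the fencing credentials before deployment could be approached along the following lines. This is a hedged sketch, not what TripleO or the validation framework does: the function names, the `runner` parameter, and the 30-second timeout are assumptions made for illustration. It relies on `ipmitool` being installed and on its `-E` option, which reads the password from the `IPMI_PASSWORD` environment variable.

```python
import os
import subprocess


def build_ipmi_check_command(host, user):
    """Build an ipmitool command that queries the chassis power status.

    -E makes ipmitool read the password from the IPMI_PASSWORD
    environment variable, so it is not passed on the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "chassis", "power", "status"]


def check_ipmi_credentials(host, user, password, timeout=30,
                           runner=subprocess.run):
    """Return (ok, detail) for one BMC.

    `runner` is a hypothetical hook that defaults to subprocess.run;
    it exists only so the check can be exercised without a real BMC.
    """
    env = dict(os.environ, IPMI_PASSWORD=password)
    try:
        result = runner(build_ipmi_check_command(host, user), env=env,
                        capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # ipmitool missing, not executable, or the BMC did not answer in time.
        return False, str(exc)
    detail = (result.stdout or result.stderr).strip()
    return result.returncode == 0, detail
```

A passing check only shows that the BMC accepted these credentials over `lanplus` at that moment. The stonith resource may use different parameters (for example privilege level or timeouts), so it can still end up stopped for the other reasons listed in comment #4.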