Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2184056

Summary: Stonith device left in stopped state but overcloud deployment considered successful
Product: Red Hat OpenStack Reporter: Eric Nothen <enothen>
Component: documentation Assignee: Ian Frangs <ifrangs>
Status: CLOSED COMPLETED QA Contact: RHOS Documentation Team <rhos-docs>
Severity: medium Docs Contact:
Priority: low    
Version: 17.1 (Wallaby) CC: jjoyce, jschluet, lmiccini, mariel, mburns, nobody, rhos-maint, slinaber, tvignaud
Target Milestone: --- Keywords: Documentation, Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2184014 Environment:
Last Closed: 2024-12-16 14:01:09 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2184014    
Bug Blocks:    

Description Eric Nothen 2023-04-03 14:19:15 UTC
+++ This bug was initially created as a clone of Bug #2184014 +++

Description of problem:

The overcloud deployment completes and is reported as successful, but at least one fence device is left in "Stopped" state. This happens as a result of BZ#2184014, and also whenever fencing.yaml contains an incorrect password.


Version-Release number of selected component (if applicable):
Tested in 16.1

How reproducible:
Always reproducible when the password of at least one console is incorrect in fencing.yaml, or when the password contains a $ character.

Steps to Reproduce:
1. Create nodes.yaml so that the password of at least one remote console includes $:

~~~
(undercloud) [stack.lab ~]$ egrep "name|pm_type|pm_user|pm_password" nodes.yaml |head -4
  - name: "overcloud-controller-0.keller1618.lab"
    pm_type: "ipmi"
    pm_user: "admin"
    pm_password: "some$tring"
~~~
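
The report does not say why a `$` in the password breaks the fence device (that is tracked in BZ#2184014), so the following is only a minimal sketch of a pre-deployment check that flags such passwords. It assumes nodes.yaml has the flat `name` / `pm_password` layout shown above. The helper name `find_dollar_passwords` and the second node in the sample are hypothetical and are not part of TripleO.

```python
import re

# Hypothetical helper for illustration; not part of TripleO or any Red Hat tool.
# Assumes one "name:" line precedes each "pm_password:" line, as in the excerpt.
NAME_RE = re.compile(r'^\s*-?\s*name:\s*"?([^"\n]+?)"?\s*$')
PASSWORD_RE = re.compile(r'^\s*-?\s*pm_password:\s*"?([^"\n]*?)"?\s*$')


def find_dollar_passwords(nodes_yaml_text):
    """Return the names of nodes whose pm_password contains a '$' character."""
    flagged = []
    current_name = None
    for line in nodes_yaml_text.splitlines():
        name_match = NAME_RE.match(line)
        if name_match:
            current_name = name_match.group(1)
            continue
        password_match = PASSWORD_RE.match(line)
        if password_match and "$" in password_match.group(1):
            flagged.append(current_name)
    return flagged


# First node copied from the report; the second node is invented sample data.
sample = '''
  - name: "overcloud-controller-0.keller1618.lab"
    pm_type: "ipmi"
    pm_user: "admin"
    pm_password: "some$tring"
  - name: "overcloud-controller-1.keller1618.lab"
    pm_type: "ipmi"
    pm_user: "admin"
    pm_password: "plainpassword"
'''
print(find_dollar_passwords(sample))
```

A line-based regular expression is used here only to keep the sketch free of third-party dependencies; a real check would more likely load the file with a YAML parser.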

2. Generate fencing.yaml:

~~~
(undercloud) [stack.lab ~]$ openstack overcloud generate fencing --output /home/stack/templates/fencing.yaml /home/stack/nodes.yaml
(undercloud) [stack.lab ~]$ head -13 /home/stack/templates/fencing.yaml
parameter_defaults:
  EnableFencing: true
  FencingConfig:
    devices:
    - agent: fence_ipmilan
      host_mac: 52:54:00:db:a1:22
      params:
        ipaddr: 192.168.122.1
        ipport: '16021'
        lanplus: true
        login: admin
        passwd: some$tring
        pcmk_host_list: overcloud-controller-0
~~~

3. Add nodes, tag profiles, introspect, etc.

4. Deploy overcloud

Actual results:

Deployment completes successfully. However, the stonith device corresponding to overcloud-controller-0 is in Stopped state:

~~~
Monday 03 April 2023  16:15:13 +0200 (0:00:00.064)       0:38:13.719 ********** 
=============================================================================== 
Wait for containers to start for step 2 using paunch ------------------ 583.78s
Pre-fetch all the containers ------------------------------------------ 228.22s
Wait for containers to start for step 3 using paunch ------------------ 175.19s
Wait for containers to start for step 4 using paunch ------------------ 161.06s
Wait for container-puppet tasks (generate config) to finish ----------- 158.91s
Pre-fetch all the containers ------------------------------------------ 110.19s
Wait for containers to start for step 5 using paunch ------------------- 74.16s
Wait for puppet host configuration to finish --------------------------- 62.40s
Wait for puppet host configuration to finish --------------------------- 42.24s
Run puppet on the host to apply IPtables rules ------------------------- 31.66s
Wait for puppet host configuration to finish --------------------------- 31.49s
tripleo-container-tag : Pull satellite.keller.lab:443/keller-library-osp_16_1_cv-rhosp16_containers-cinder-volume:16.1.8 image -- 31.35s
Wait for puppet host configuration to finish --------------------------- 31.26s
Wait for container-puppet tasks (bootstrap tasks) for step 4 to finish -- 21.08s
tripleo-container-tag : Pull satellite.keller.lab:443/keller-library-osp_16_1_cv-rhosp16_containers-mariadb:16.1.8 image -- 21.06s
Wait for puppet host configuration to finish --------------------------- 21.04s
Wait for containers to start for step 1 using paunch ------------------- 20.99s
Wait for container-puppet tasks (bootstrap tasks) for step 5 to finish -- 11.11s
Wait for container-puppet tasks (bootstrap tasks) for step 3 to finish -- 10.97s
Wait for container-puppet tasks (bootstrap tasks) for step 2 to finish -- 10.83s

Ansible passed.
Overcloud configuration completed.
Overcloud Endpoint: http://192.168.24.14:5000
Overcloud Horizon Dashboard URL: http://192.168.24.14:80/dashboard
Overcloud rc file: /home/stack/overcloudrc
Overcloud Deployed without error
(undercloud) [stack.lab ~]$
(undercloud) [stack.lab ~]$ ssh heat-admin@overcloud-controller-0 sudo pcs stonith
The authenticity of host 'overcloud-controller-0 (192.168.24.13)' can't be established.
ECDSA key fingerprint is SHA256:zlVasRXb0tPpOqrFP5xGAarepv84B2g+hy+SPQGhT6Q.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'overcloud-controller-0,192.168.24.13' (ECDSA) to the list of known hosts.
  * stonith-fence_ipmilan-525400cd455b	(stonith:fence_ipmilan):	Started overcloud-controller-0
  * stonith-fence_ipmilan-525400210a64	(stonith:fence_ipmilan):	Started overcloud-controller-2
  * stonith-fence_ipmilan-525400dba122	(stonith:fence_ipmilan):	Stopped
 Target: overcloud-controller-0
   Level 1 - stonith-fence_ipmilan-525400dba122
 Target: overcloud-controller-1
   Level 1 - stonith-fence_ipmilan-525400210a64
 Target: overcloud-controller-2
   Level 1 - stonith-fence_ipmilan-525400cd455b
(undercloud) [stack.lab ~]$ 
~~~
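
Because the deployment output gives no hint of the problem, the stopped device only shows up when `pcs stonith` is checked by hand. As a rough sketch, the output above could be scanned for devices that are not in `Started` state. This assumes the line format shown in this report, which may differ between pcs versions; `stopped_stonith_devices` is a hypothetical helper name.

```python
import re

# Matches resource lines such as:
#   * stonith-fence_ipmilan-525400dba122 (stonith:fence_ipmilan): Stopped
# The trailing node name is optional because stopped resources have none.
RESOURCE_RE = re.compile(
    r'^\s*\*\s+(\S+)\s+\(stonith:([^)\s]+)\):\s+(\S+)(?:\s+(\S+))?\s*$'
)


def stopped_stonith_devices(pcs_output):
    """Return names of stonith resources whose state is not 'Started'."""
    stopped = []
    for line in pcs_output.splitlines():
        match = RESOURCE_RE.match(line)
        if match and match.group(3) != "Started":
            stopped.append(match.group(1))
    return stopped


# Output copied from the report (tabs replaced by spaces).
pcs_output = '''
  * stonith-fence_ipmilan-525400cd455b (stonith:fence_ipmilan): Started overcloud-controller-0
  * stonith-fence_ipmilan-525400210a64 (stonith:fence_ipmilan): Started overcloud-controller-2
  * stonith-fence_ipmilan-525400dba122 (stonith:fence_ipmilan): Stopped
 Target: overcloud-controller-0
   Level 1 - stonith-fence_ipmilan-525400dba122
'''
print(stopped_stonith_devices(pcs_output))
```

A non-empty result could then be turned into a warning or a non-zero exit code in a post-deployment check.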


Expected results:

If fencing is defined in the environment files, the desired end state is presumably that the stonith devices are correctly configured and fully functional. When this does not happen, the overcloud deployment should not be reported as successful.


Additional info:

Comment 1 Luca Miccini 2023-04-04 13:37:25 UTC
The expected results assumption is not 100% correct. 
It is of course desirable to have stonith resources showing up as "running" but this doesn't prevent the cluster from using them even if they are in stopped state whenever a cluster node needs to be fenced.
We'll try and have a look to see if there is anything we can do to maybe return a warning in the overcloud deploy output, but we don't feel that having a stonith resource in stopped state should make the overall overcloud deployment fail.

Comment 3 Eric Nothen 2023-04-04 13:44:12 UTC
(In reply to Luca Miccini from comment #1)
> The expected results assumption is not 100% correct. 
> It is of course desirable to have stonith resources showing up as "running"
> but this doesn't prevent the cluster from using them even if they are in
> stopped state whenever a cluster node needs to be fenced.
> We'll try and have a look to see if there is anything we can do to maybe
> return a warning in the overcloud deploy output, but we don't feel that
> having a stonith resource in stopped state should make the overall overcloud
> deployment fail.

I'm not sure I follow the reasoning. If the stonith device is in stopped state because the console password is wrong, how is this device usable?

Comment 4 Luca Miccini 2023-04-04 14:19:59 UTC
(In reply to Eric Nothen from comment #3)
> (In reply to Luca Miccini from comment #1)
> > The expected results assumption is not 100% correct. 
> > It is of course desirable to have stonith resources showing up as "running"
> > but this doesn't prevent the cluster from using them even if they are in
> > stopped state whenever a cluster node needs to be fenced.
> > We'll try and have a look to see if there is anything we can do to maybe
> > return a warning in the overcloud deploy output, but we don't feel that
> > having a stonith resource in stopped state should make the overall overcloud
> > deployment fail.
> 
> I'm not sure I follow the reasoning. If the stonith device is in stopped
> state because the console password is wrong, how is this device usable?

From a puppet perspective we currently have no way of knowing if the resource is stopped because of a bogus password, a missing firewall rule, wrong credentials, ipmi timeout too short, temporary network glitch, etc. Puppet simply hands control over to pcs/pacemaker to make sure the resource definition is syntactically correct.
The current (naive) assumption is that if you can register a node in ironic and deploy it this is enough guarantee that the ipmi credentials are valid and functional.
The only way to deal with this failure scenario during the deployment would be to introspect the cluster logs or actively use those fencing credentials via ipmitool or similar and/or to implement a wrapper around puppet so that this introspection can take place, but this is something that would require a complete rewrite of the way stonith is configured so it is unlikely to happen, especially in 16.1.
Another possible approach could be to use the validation framework to double-check every credential before the deployment.
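
As an illustration of the "actively use those fencing credentials" idea, here is a minimal sketch that builds an `ipmitool` power-status command from one `params` entry of fencing.yaml. It is not an existing TripleO validation; the function name `ipmi_status_argv` is hypothetical, and the mapping of fencing.yaml keys to ipmitool options is an assumption based on the fields shown in this report.

```python
def ipmi_status_argv(params):
    """Build an ipmitool argv list from one fencing.yaml 'params' mapping.

    Hypothetical helper: the key names (ipaddr, ipport, lanplus, login,
    passwd) are taken from the fencing.yaml excerpt in this report.
    """
    interface = "lanplus" if params.get("lanplus") else "lan"
    return [
        "ipmitool",
        "-I", interface,
        "-H", params["ipaddr"],
        "-p", str(params.get("ipport", "623")),  # 623 is the standard IPMI port
        "-U", params["login"],
        "-P", params["passwd"],
        "chassis", "power", "status",
    ]


# Values copied from the fencing.yaml excerpt in the description.
device_params = {
    "ipaddr": "192.168.122.1",
    "ipport": "16021",
    "lanplus": True,
    "login": "admin",
    "passwd": "some$tring",
}
argv = ipmi_status_argv(device_params)
print(argv)

# To actually probe the console (needs ipmitool and network access):
#   import subprocess
#   result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
#   credentials_ok = result.returncode == 0
```

Passing the command as an argument list rather than a shell string means characters such as `$` reach ipmitool unchanged. Note that a password given with `-P` is visible in the process list, so a real validation would probably supply it another way.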

Comment 5 Eric Nothen 2023-04-04 14:49:26 UTC
(In reply to Luca Miccini from comment #4)
> From a puppet perspective we currently have no way of knowing if the
> resource is stopped because of a bogus password, a missing firewall rule,
> wrong credentials, ipmi timeout too short, temporary network glitch, etc.
> Puppet simply hands control over to pcs/pacemaker to make sure the resource
> definition is syntactically correct.
> The current (naive) assumption is that if you can register a node in ironic
> and deploy it this is enough guarantee that the ipmi credentials are valid
> and functional.
> The only way to deal with this failure scenario during the deployment would
> be to introspect the cluster logs or actively use those fencing credentials
> via ipmitool or similar and/or to implement a wrapper around puppet so that
> this introspection can take place, but this is something that would require
> a complete rewrite of the way stonith is configured so it is unlikely to
> happen, especially in 16.1.
> Another possible approach could be to use the validation framework to
> double-check every credential before the deployment.

Thank you. I understand the complexity, the fact that puppet just runs a command to configure the device and does not actually check functionality, and that it's very likely this issue won't be fixed before TripleO is dropped. In that case, I suggest we put a note in the documentation saying that TripleO makes only a "best effort" attempt to complete the fencing configuration, because in this scenario the stonith device is still misconfigured and unusable, and there's no way to fence the affected controller.