Description of problem:

A customer reported an issue with Instance HA. In their case, some compute nodes were not rebooted after a network failure, but all of the instances on these nodes were evacuated. Because the reboot of the compute nodes was never triggered, the instances were still running on the source compute node with all volume connections lost.

Looking at the pacemaker resources in the deployment, I found that these compute nodes did not have fence resources (fence_ipmilan resources). Even in this situation, these compute nodes "can be fenced" by fence_compute, and pacemaker completes fencing by executing only the fence_compute action, which triggers evacuation.

It later turned out that the customer had manually updated FencingConfig with wrong MAC addresses, which is why the fence_ipmilan resources were not created properly. IMO we should not let the deployment succeed with such an incomplete and risky configuration; it would be better to implement a validation that ensures fence resources capable of rebooting ComputeInstanceHA nodes are created.

Version-Release number of selected component (if applicable):
The issue was initially found in a z16 deployment

How reproducible:
Always

Steps to Reproduce:
1. Create a fencing.yaml file without records for compute nodes
2. Deploy overcloud nodes with instance ha enabled.

Actual results:
Deployment succeeds with incomplete fence resources in pacemaker

Expected results:
Deployment fails because of missing fence resources for ComputeInstanceHA nodes

Additional info:
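For reference, a minimal fencing.yaml of the kind used in step 1 could look like the following (all addresses and credentials are illustrative; the point is that FencingConfig contains no device entry matching any compute node's MAC address, so no fence_ipmilan resource is created for the ComputeInstanceHA nodes):

~~~
parameter_defaults:
  EnableFencing: true
  FencingConfig:
    devices:
      # Only a controller is listed; the compute nodes have no entry.
      - agent: fence_ipmilan
        host_mac: "52:54:00:11:22:33"
        params:
          ipaddr: "172.16.0.10"
          login: admin
          passwd: secret
          lanplus: true
~~~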
as a quick fix/test we can try something like:

~~~
[root@compute-0 ~]# diff /etc/puppet/modules/tripleo/manifests/fencing.pp /usr/share/openstack-puppet/modules/tripleo/manifests/fencing.pp
145,153d144
<     # let's store the number of stonith devices to be created for this server
<     # this will be used to detect if there is at least one (when instance_ha is configured)
<     $data_num = [
<       length($xvm_devices), length($ironic_devices), length($redfish_devices),
<       length($ipmilan_devices), length($kdump_devices), length($rhev_devices)
<     ]
<
<     $sum = $data_num.reduce |$memo, $value| { $memo + $value }
<
175,177d165
<       }
<       elsif $sum == 0 and ($enable_instanceha and $is_compute_instanceha_node) {
<         fail('Instance HA requires at least one valid stonith device')
~~~

~~~
...
$all_levels.each |$index, $levelx_devices| {
  $level = regsubst($index, 'level', '', 'G')
  $all_devices = $levelx_devices
  if $::deployment_type != 'containers' {
    $xvm_devices = local_fence_devices('fence_xvm', $all_devices)
    create_resources('pacemaker::stonith::fence_xvm', $xvm_devices, $common_params)
  }
  $ironic_devices = local_fence_devices('fence_ironic', $all_devices)
  create_resources('pacemaker::stonith::fence_ironic', $ironic_devices, $common_params)
  $redfish_devices = local_fence_devices('fence_redfish', $all_devices)
  create_resources('pacemaker::stonith::fence_redfish', $redfish_devices, $common_params)
  $ipmilan_devices = local_fence_devices('fence_ipmilan', $all_devices)
  create_resources('pacemaker::stonith::fence_ipmilan', $ipmilan_devices, $common_params)
  $kdump_devices = local_fence_devices('fence_kdump', $all_devices)
  create_resources('pacemaker::stonith::fence_kdump', $kdump_devices, $common_params)
  $rhev_devices = local_fence_devices('fence_rhevm', $all_devices)
  create_resources('pacemaker::stonith::fence_rhevm', $rhev_devices, $common_params)

  $data = {
    'xvm'     => $xvm_devices,
    'ironic'  => $ironic_devices,
    'redfish' => $redfish_devices,
    'ipmilan' => $ipmilan_devices,
    'kdump'   => $kdump_devices,
    'rhevm'   => $rhev_devices,
  }

  # let's store the number of stonith devices to be created for this server
  # this will be used to detect if there is at least one (when instance_ha is configured)
  $data_num = [
    length($xvm_devices), length($ironic_devices), length($redfish_devices),
    length($ipmilan_devices), length($kdump_devices), length($rhev_devices)
  ]

  $sum = $data_num.reduce |$memo, $value| { $memo + $value }

  $data.each |$items| {
    $driver = $items[0]
    $driver_devices = $items[1]
    if $driver_devices and length($driver_devices) == 1 {
      $mac = keys($driver_devices)[0]
      $safe_mac = regsubst($mac, ':', '', 'G')
      if ($enable_instanceha and $is_compute_instanceha_node) {
        $stonith_resources = ["stonith-fence_${driver}-${safe_mac}", 'stonith-fence_compute-fence-nova']
      } else {
        $stonith_resources = ["stonith-fence_${driver}-${safe_mac}"]
      }
      pacemaker::stonith::level { "stonith-${level}-${safe_mac}":
        level             => $level,
        target            => '$(/usr/sbin/crm_node -n)',
        stonith_resources => $stonith_resources,
        tries             => $tries,
        try_sleep         => $try_sleep,
      }
      Pcmk_stonith<||> -> Pcmk_stonith_level<||>
    }
    elsif $sum == 0 and ($enable_instanceha and $is_compute_instanceha_node) {
      fail('Instance HA requires at least one valid stonith device')
    }
  }
}
~~~

- We store the number of devices to be defined for each 'kind' (xvm, ipmi, etc.) and sum them.
- If the sum is zero, this means no stonith devices will be created for the specific server puppet is running on.
- At this point we check whether three conditions are met at the same time:
  1. sum == 0 (aka no stonith devices)
  2. $enable_instanceha = true (aka instanceha is enabled in this overcloud)
  3. $is_compute_instanceha_node = true (aka the server where puppet is running is a compute node that should be configured with iha)

If all three conditions are met, we raise a message and fail.

thoughts?
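The three-way check described above can be modeled in isolation. The following Python sketch is hypothetical (not part of the patch) and just mirrors the proposed Puppet logic, with `devices_by_driver` standing in for the `$data` hash:

```python
def instanceha_stonith_missing(devices_by_driver,
                               enable_instanceha,
                               is_compute_instanceha_node):
    """Mirror of the proposed Puppet check: True when the node should be
    protected by Instance HA but no stonith device would be created for it."""
    # Equivalent of $sum: total devices across all drivers (xvm, ipmilan, ...)
    total = sum(len(devices) for devices in devices_by_driver.values())
    return total == 0 and enable_instanceha and is_compute_instanceha_node

# A compute node with IHA enabled and no devices at all must fail:
assert instanceha_stonith_missing(
    {'xvm': {}, 'ipmilan': {}, 'kdump': {}}, True, True)
# One valid fence_ipmilan device is enough:
assert not instanceha_stonith_missing(
    {'ipmilan': {'52:54:00:aa:bb:cc': {'ipaddr': '10.0.0.1'}}}, True, True)
```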
this would have the added benefit of restoring the old behavior of deployments failing when there was no stonith resource for computes (with IHA enabled), which I broke while implementing stonith_levels.
Hi Luca,

The approach looks good to me. IIUC, with that change it is no longer allowed to use different levels of fence devices in the cluster. For example, if a controller node has fence_kdump enabled in level 1 and fence_ipmilan enabled in level 2, then ComputeInstanceHA nodes are required to have a fence device for both levels. I don't think users are likely to try such a deployment, but it might be useful to have a single parameter to bypass that validation if needed.

~~~
# let's store the number of stonith devices to be created for this server
# this will be used to detect if there is at least one (when instance_ha is configured)
$data_num = [
  length($xvm_devices), length($ironic_devices), length($redfish_devices),
  length($ipmilan_devices), length($kdump_devices), length($rhev_devices)
]

$sum = $data_num.reduce |$memo, $value| { $memo + $value }
~~~

I've not yet tested this, but I guess we could use $data directly here, like:

~~~
$num_fence_devices = $data.reduce(0) |$memo, $value| { $memo + length($value[1]) }
~~~

~~~
$data.each |$items| {
  ...
  elsif $sum == 0 and ($enable_instanceha and $is_compute_instanceha_node) {
    fail('Instance HA requires at least one valid stonith device')
  }
}
~~~

I think this validation could be executed outside of the $data.each |$items| block?

Also, now I'm worried about how pacemaker resources are defined if multi-layer fencing is defined for ComputeInstanceHA nodes with fence_compute enabled, but that would be a different topic...
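The simplification suggested above (deriving the count from $data itself instead of a separate $data_num list) is just a fold over the per-driver device hashes. A quick Python analogue of that reduction, with illustrative device data:

```python
from functools import reduce

# $data maps driver name -> hash of devices keyed by MAC (values illustrative)
data = {
    'xvm': {},
    'ironic': {},
    'redfish': {},
    'ipmilan': {'52:54:00:aa:bb:cc': {'ipaddr': '10.0.0.1'}},
    'kdump': {},
    'rhevm': {},
}

# Analogue of $data.reduce(0) |$memo, $value| { $memo + length($value[1]) }:
# each item is a (driver, devices) pair, so item[1] is the device hash.
num_fence_devices = reduce(lambda memo, item: memo + len(item[1]),
                           data.items(), 0)

# The fold gives the same result as summing the per-driver counts.
assert num_fence_devices == sum(len(d) for d in data.values()) == 1
```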
The 'stonith level' point is a good argument against my initial approach as we would end up (wrongly) evaluating that condition for each level instead of once for each server. There doesn't seem to be an easy way to persist a value across .each iterations so either I need to refactor how the whole stonith resources+levels are set up or I need to figure out something else. I'll try spending some more time on this.
May 26 06:12:27 compute-0.redhat.local puppet-user[313621]: Error: Evaluation Error: Error while evaluating a Function Call, Instance HA requires at least one valid stonith device (file: /etc/puppet/modules/tripleo/manifests/fencing.pp, line: 166, column: 9) on node compute-0.redhat.local
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543