Bug 2098069

Summary: Intermittent issues during overcloud deployment with clustercheck on compute nodes
Product: Red Hat OpenStack
Reporter: schari
Component: mariadb-galera
Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: pkomarov
Severity: high
Priority: high
Version: 17.0 (Wallaby)
CC: bshephar, dciabrin, lmiccini, mbayer, mburns, ramishra, schari
Target Milestone: ---
Keywords: Scale, Triaged
Target Release: ---
Flags: ifrangs: needinfo? (dciabrin)
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2022-07-18 04:07:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description schari 2022-06-17 07:54:33 UTC
Description of problem:
This is an OSP 17 environment with a RHEL 9 undercloud and, currently, 110 overcloud nodes. During some overcloud deployments, clustercheck fails on one or two compute nodes, apparently at random:

2022-06-16 09:20:39.932746 | 0cc47afa-1a12-e033-f7c6-00000008b65b |    CHANGED | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12
2022-06-16 09:20:39.933746 | 0cc47afa-1a12-e033-f7c6-00000008b65b |     TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12 | 0:49:38.558025 | 4.25s
2022-06-16 09:21:03.828476 | 0cc47afa-1a12-e033-f7c6-00000008b65c |      FATAL | Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12 | error={"changed": false, "msg": "Service clustercheck has not started yet"}
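
When triaging a large deployment log like the one excerpted above, it can help to pull out only the FATAL results per host. The following is a minimal sketch, not part of TripleO or the deployment tooling. It assumes the pipe-delimited layout seen in the excerpt (timestamp | task UUID | status | task | host | details), and the names `find_fatal_tasks` and `FatalTask` are hypothetical.

```python
import json
from typing import Iterable, NamedTuple


class FatalTask(NamedTuple):
    timestamp: str
    host: str
    task: str
    message: str


def find_fatal_tasks(lines: Iterable[str]) -> list[FatalTask]:
    """Collect FATAL task results from pipe-delimited deployment log lines."""
    failures = []
    for line in lines:
        # Assumed field order: timestamp | task uuid | status | task | host | details
        fields = [field.strip() for field in line.split(" | ", 5)]
        if len(fields) < 5 or fields[2] != "FATAL":
            continue
        timestamp, _uuid, _status, task, host = fields[:5]
        details = fields[5] if len(fields) > 5 else ""
        message = details
        if details.startswith("error="):
            try:
                # The details field carries a JSON object after "error="
                message = json.loads(details[len("error="):]).get("msg", details)
            except (json.JSONDecodeError, AttributeError):
                pass  # keep the raw details if the payload is not a JSON object
        failures.append(FatalTask(timestamp, host, task, message))
    return failures
```

Feeding the log lines to `find_fatal_tasks` and grouping the results by `host` across several deployment attempts would show whether the same compute nodes fail repeatedly or whether the failures move around.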

Triggering the overcloud deployment again fixes the issue. [1] has the overcloud deployment logs, a sosreport from the compute node where clustercheck failed, and the templates used for the deployment.

[1] http://perfscale.perf.lab.eng.bos.redhat.com/pub/schari/osp17_250_node_scale_clustercheck_issue/

Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220401.n.1