Bug 2098069
| Summary: | Intermittent issues during overcloud deployment with clustercheck on compute nodes | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | schari |
| Component: | mariadb-galera | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | pkomarov |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 17.0 (Wallaby) | CC: | bshephar, dciabrin, lmiccini, mbayer, mburns, ramishra, schari |
| Target Milestone: | --- | Keywords: | Scale, Triaged |
| Target Release: | --- | Flags: | ifrangs: needinfo? (dciabrin) |
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-07-18 04:07:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description of problem:

This is an OSP 17 environment with a RHEL 9 undercloud and 110 overcloud nodes currently. While deploying the overcloud, in some deployments, clustercheck fails on 1 or 2 compute nodes randomly.

2022-06-16 09:20:39.932746 | 0cc47afa-1a12-e033-f7c6-00000008b65b | CHANGED | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12
2022-06-16 09:20:39.933746 | 0cc47afa-1a12-e033-f7c6-00000008b65b | TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12 | 0:49:38.558025 | 4.25s
2022-06-16 09:21:03.828476 | 0cc47afa-1a12-e033-f7c6-00000008b65c | FATAL | Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12 | error={"changed": false, "msg": "Service clustercheck has not started yet"}

Triggering overcloud deployment again fixes the issue.

[1] has the overcloud deployment logs, sosreport from the compute node where clustercheck failed, and the templates used for deployment.

[1] http://perfscale.perf.lab.eng.bos.redhat.com/pub/schari/osp17_250_node_scale_clustercheck_issue/

Version-Release number of selected component (if applicable):

RHOS-17.0-RHEL-9-20220401.n.1
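With 110 nodes in the deployment log, it may help to extract only the FATAL entries and the hosts they hit. The following is a minimal sketch, not part of TripleO or any Red Hat tooling. It assumes every log line uses the " | "-separated layout seen in the excerpt above (timestamp, task UUID, status, task name, host, optional details). The names `FatalEntry` and `parse_fatal` are hypothetical helpers introduced for illustration.

```python
import json
from typing import NamedTuple, Optional


class FatalEntry(NamedTuple):
    """One FATAL record pulled from the deployment log (hypothetical helper type)."""
    timestamp: str
    task: str
    host: str
    message: Optional[str]


def parse_fatal(line: str) -> Optional[FatalEntry]:
    """Return a FatalEntry if the line is a FATAL record, otherwise None.

    Assumed field order, based only on the excerpt in this report:
    timestamp | task uuid | status | task name | host | details
    """
    parts = [p.strip() for p in line.split(" | ", 5)]
    if len(parts) < 5 or parts[2] != "FATAL":
        return None

    message = None
    if len(parts) == 6 and parts[5].startswith("error="):
        # The details field in the excerpt is "error=" followed by a JSON object.
        try:
            payload = json.loads(parts[5][len("error="):])
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict):
            message = payload.get("msg")

    return FatalEntry(parts[0], parts[3], parts[4], message)


# Sample line copied from the description above.
SAMPLE = (
    "2022-06-16 09:21:03.828476 | 0cc47afa-1a12-e033-f7c6-00000008b65c | FATAL | "
    "Manage container systemd services and cleanup old systemd healthchecks for "
    "/var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12 | "
    'error={"changed": false, "msg": "Service clustercheck has not started yet"}'
)

if __name__ == "__main__":
    entry = parse_fatal(SAMPLE)
    if entry is not None:
        print(entry.host, entry.message)
```

Running `parse_fatal` over each line of the deployment log and grouping the results by `host` would show whether the clustercheck failure clusters on particular compute nodes across runs or is spread randomly, as the report suggests.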