Bug 2098069 - Intermittent issues during overcloud deployment with clustercheck on compute nodes [NEEDINFO]
Summary: Intermittent issues during overcloud deployment with clustercheck on compute ...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: mariadb-galera
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Damien Ciabrini
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-17 07:54 UTC by schari
Modified: 2023-08-03 15:46 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-18 04:07:45 UTC
Target Upstream Version:
Embargoed:
ifrangs: needinfo? (dciabrin)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | OSP-15812 | 0 | None | None | None | 2022-06-17 08:01:28 UTC

Description schari 2022-06-17 07:54:33 UTC
Description of problem:
This is an OSP 17 environment with a RHEL 9 undercloud and, currently, 110 overcloud nodes. In some overcloud deployments, clustercheck fails on one or two compute nodes; the affected nodes vary from run to run.

2022-06-16 09:20:39.932746 | 0cc47afa-1a12-e033-f7c6-00000008b65b |    CHANGED | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12
2022-06-16 09:20:39.933746 | 0cc47afa-1a12-e033-f7c6-00000008b65b |     TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12 | 0:49:38.558025 | 4.25s
2022-06-16 09:21:03.828476 | 0cc47afa-1a12-e033-f7c6-00000008b65c |      FATAL | Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/step_2 | compute1029p-12 | error={"changed": false, "msg": "Service clustercheck has not started yet"}
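
With this many nodes, it can help to pull the failing hosts out of the deployment log automatically. The following is a minimal sketch, assuming the pipe-delimited layout seen in the excerpt above (timestamp | task UUID | status | task name | host | extra). The function name fatal_tasks_by_host is hypothetical and not part of TripleO or Ansible.

```python
from collections import defaultdict


def fatal_tasks_by_host(lines):
    """Group FATAL task names by host from pipe-delimited deployment log lines.

    Assumed field order (taken from the log excerpt in this report):
    timestamp | task UUID | status | task name | host | extra
    """
    failures = defaultdict(list)
    for line in lines:
        # Split into at most 6 fields so pipes inside the trailing
        # error payload do not shift the earlier columns.
        fields = [field.strip() for field in line.split("|", 5)]
        if len(fields) < 5 or fields[2] != "FATAL":
            continue
        task_name, host = fields[3], fields[4]
        failures[host].append(task_name)
    return dict(failures)
```

The function accepts any iterable of lines, so an open file handle on the deployment log can be passed directly. Lines that do not match the assumed layout are skipped.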

Re-running the overcloud deployment resolves the issue. [1] contains the overcloud deployment logs, a sosreport from the compute node where clustercheck failed, and the templates used for the deployment.

[1] http://perfscale.perf.lab.eng.bos.redhat.com/pub/schari/osp17_250_node_scale_clustercheck_issue/

Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220401.n.1

