Bug 2178614

Summary: "Create Cluster tripleo_cluster" fails if it's the second attempt
Product: Red Hat OpenStack
Component: puppet-pacemaker
Version: 16.2 (Train)
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Hardware: x86_64
OS: All
Reporter: David Hill <dhill>
Assignee: Luca Miccini <lmiccini>
QA Contact: Nobody <nobody>
CC: jjoyce, jmarcian, jschluet, lmiccini, slinaber, tvignaud
Keywords: Triaged
Doc Type: No Doc Update
Type: Bug
Last Closed: 2024-04-26 12:09:14 UTC

Description David Hill 2023-03-15 12:34:58 UTC
Description of problem:
"Create Cluster tripleo_cluster" fails on a second deployment attempt. In this customer case (not the first time we have seen this), the authentication failed for some reason (MTU size, etc.), and the second deployment then fails with:
~~~
<13>Mar 13 14:39:12 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: Error: Hosts 'overcloud-controller-1', 'overcloud-controller-2' are not known to pcs, try to authenticate the hosts using 'pcs host auth overcloud-controller-1 overcloud-controller-2' command
~~~
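
As a manual workaround aid, the unauthenticated hosts can be pulled out of that error line and turned into the `pcs host auth` command the message itself suggests. This is a minimal sketch: the helper names `unknown_hosts` and `auth_command` are hypothetical, and the single-host wording ("Host ... is not known") that the pattern also accepts is an assumption, since only the plural form appears in the log above.

```python
import re
import shlex
from typing import List

# Pattern modelled on the error text quoted in this bug.
# Assumption: a single-host variant ("Host 'x' is not known to pcs") uses the same shape.
_UNKNOWN_HOSTS = re.compile(r"Error: Hosts? (.+?) (?:is|are) not known to pcs")


def unknown_hosts(log_line: str) -> List[str]:
    """Return the host names pcs reported as not authenticated, in order."""
    match = _UNKNOWN_HOSTS.search(log_line)
    if match is None:
        return []
    # Host names are single-quoted and comma-separated in the message.
    return re.findall(r"'([^']+)'", match.group(1))


def auth_command(hosts: List[str]) -> str:
    """Build the 'pcs host auth' command line suggested by the error message."""
    return shlex.join(["pcs", "host", "auth", *hosts])
```

Run against the log line above, this yields the same command that pcs prints in its hint; it would still need to be executed on the node that runs the cluster setup.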

Relevant excerpt from the Pacemaker::Corosync class (the exec named in the log above):

    Exec <|tag == 'pacemaker-auth'|>
    ->
    exec {"Create Cluster ${cluster_name}":
      creates   => '/etc/cluster/cluster.conf',
      command   => $cluster_setup_cmd,
      timeout   => $cluster_start_timeout,
      tries     => $cluster_start_tries,
      try_sleep => $cluster_start_try_sleep,
      unless    => '/usr/bin/test -f /etc/corosync/corosync.conf',
      require   => Class['pacemaker::install'],
    }
    ->
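
To make the guards on that exec explicit: Puppet skips an exec when the `creates` path exists or when the `unless` command exits 0. The sketch below models only the two guards visible in the excerpt. The function name `create_cluster_would_run` and the `root` parameter (used to test against a scratch directory) are hypothetical.

```python
import os


def create_cluster_would_run(root: str = "/") -> bool:
    """Model the 'creates' and 'unless' guards of the Create Cluster exec.

    Only the guards shown in the excerpt are modelled; ordering and
    dependencies on other resources are not.
    """
    creates_path = os.path.join(root, "etc/cluster/cluster.conf")
    unless_path = os.path.join(root, "etc/corosync/corosync.conf")

    # 'creates': the exec is skipped if this path already exists.
    if os.path.exists(creates_path):
        return False
    # 'unless': '/usr/bin/test -f <path>' exits 0 when a regular file exists,
    # which also causes the exec to be skipped.
    if os.path.isfile(unless_path):
        return False
    return True
```

Neither guard looks at the pcs authentication state, so on a retry where `corosync.conf` was never written the exec runs again whether or not the hosts are authenticated.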


Version-Release number of selected component (if applicable):
All

How reproducible:
Whenever the first "Create Cluster tripleo_cluster" was not executed, or did not complete, for some reason.

Steps to Reproduce:
1. Run a deployment in which the hacluster password is set and authentication (probably) happens, but "Create Cluster tripleo_cluster" does not complete or is not executed at all. The exact cause of this first failure is unknown.
2. Retry the deployment.

Actual results:
The deployment fails because pcsd is not authenticated to all hosts.

Expected results:
The deployment should authenticate the hosts if they are not already authenticated.
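
The expected behavior could be sketched roughly as follows: attempt the cluster setup, and if it fails with the "not known to pcs" error, authenticate and try once more. This is an illustration of the idea, not the puppet-pacemaker implementation. The names `setup_with_reauth`, `run_command`, and the injectable `run` parameter are hypothetical, and the setup command is passed in as-is because `$cluster_setup_cmd` is not shown in this bug. Passing the password with `-p` exposes it on the command line, which may not be acceptable in every environment.

```python
import subprocess
from typing import Callable, Sequence, Tuple

Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(argv: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, combined stdout/stderr)."""
    proc = subprocess.run(list(argv), stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode, proc.stdout


def setup_with_reauth(setup_cmd: Sequence[str], hosts: Sequence[str],
                      password: str, run: Runner = run_command) -> int:
    """Run the cluster setup; on a pcs auth error, authenticate and retry once."""
    code, output = run(setup_cmd)
    if code == 0 or "not known to pcs" not in output:
        # Success, or a failure unrelated to authentication: nothing to fix here.
        return code

    # Authenticate as the hacluster user, as the pcs error message suggests.
    auth_cmd = ["pcs", "host", "auth", *hosts, "-u", "hacluster", "-p", password]
    auth_code, _ = run(auth_cmd)
    if auth_code != 0:
        return auth_code

    # Single retry of the original setup command.
    code, _ = run(setup_cmd)
    return code
```

The `run` parameter exists only so the logic can be exercised without a real cluster; in practice the default runner would be used on the node performing the setup.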

Additional info:
This is not the first time we have seen this behavior, but it is the first time we have opened a BZ for it.

Comment 8 Luca Miccini 2024-04-26 12:09:14 UTC
We won't be able to fix it as the risk of rewriting parts of the puppet manifest is too high this far into the product lifecycle. Our advice is to troubleshoot the failure and re-trigger the deployment (in case of FFU) or to delete the overcloud and start from a clean state if it is a new deployment.
The upcoming RHOSP18 release will not make use of pacemaker, which prevents this issue altogether.