Bug 1722082 - [OSP15] Controller replacement procedure for OSP15 contains invalid commands
Summary: [OSP15] Controller replacement procedure for OSP15 contains invalid commands
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 15.0 (Stein)
Assignee: Dan Macpherson
QA Contact: RHOS Documentation Team
URL:
Whiteboard: docs-accepted
Duplicates: 1740124
Depends On:
Blocks:
 
Reported: 2019-06-19 13:19 UTC by Artem Hrechanychenko
Modified: 2020-02-20 14:39 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-09 20:29:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID  Status  Summary  Last Updated
OpenStack gerrit 674925  MERGED  pcs 0.10: authenticate nodes before adding them to the cluster  2020-09-30 05:25:56 UTC
Red Hat Bugzilla 1663350  CLOSED  Update controller replacement procedure  2023-12-15 16:23:21 UTC

Description Artem Hrechanychenko 2019-06-19 13:19:20 UTC
Description of problem:
The controller replacement procedure documentation needs the following updates for OSP15:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/director_installation_and_usage/index#preparing-for-controller-replacement

 7. Check the following parameters on each node of the overcloud MariaDB cluster:

    ...
    Use the following command to check these parameters on each running Controller node. In this example, the Controller node IP addresses are 192.168.0.47 and 192.168.0.46:

    (undercloud) $ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo podman exec \$(sudo podman ps --filter name=galera-bundle -q) mysql -e \"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done


 8. Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the status:

(undercloud) $ ssh heat-admin@192.168.0.47 "sudo podman exec \$(sudo podman ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"


12.2. Removing a Ceph Monitor daemon

 5. Remove the monitor from the cluster:

# ceph mon remove <mon_id>
This command should instead be executed inside the ceph-mon container, for example:
sudo podman exec -it ceph-mon-controller-0 ceph mon remove overcloud-controller-1
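
To double-check the monitor container name before running the removal, something like the following can be used (the --format option shown here is only a suggested sketch, not part of the documented procedure):

sudo podman ps --filter name=ceph-mon --format "{{.Names}}"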


6. On all Controller nodes, remove the monitor entry from /etc/ceph/ceph.conf. For example, if you remove controller-1, then remove the IP and hostname for controller-1.
That section now contains changed content, for example:
mon host = [v2:172.17.3.34:3300,v1:172.17.3.34:6789],[v2:172.17.3.89:3300,v1:172.17.3.89:6789],[v2:172.17.3.96:3300,v1:172.17.3.96:6789]
mon initial members = controller-0,controller-2,controller-3

12.3. Preparing the cluster for Controller replacement
3. After stopping Pacemaker on the old node, delete the old node from the Corosync configuration on each node and restart Corosync. To check the status of Pacemaker on the old node, run the pcs status command and verify that the status is Stopped.
...
(undercloud) $ for NAME in overcloud-controller-0 overcloud-controller-2; do IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f 2) ; ssh heat-admin@$IP "sudo pcs cluster node remove overcloud-controller-1"; done

Version-Release number of selected component (if applicable):
OSP15

How reproducible:
always

Steps to Reproduce:
1. Try to replace a controller using the OSP15 documentation.


Actual results:
The content and commands required for OSP15 differ from what is documented; several commands no longer work as written.

Expected results:
The documentation matches OSP15 and no changes are required.

Additional info:

Comment 1 Artem Hrechanychenko 2019-06-19 13:21:13 UTC
Damien,
Can you check the following items:
(undercloud) $ for NAME in overcloud-controller-0 overcloud-controller-2; do IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f 2) ; ssh heat-admin@$IP "sudo pcs cluster localnode remove overcloud-controller-1; sudo pcs cluster reload corosync"; done
Is it valid to use "sudo pcs cluster node remove controller-1" instead of "sudo pcs cluster localnode remove overcloud-controller-1"?

Comment 2 Artem Hrechanychenko 2019-06-19 13:22:42 UTC
John,
can you check the following items:

 Remove the monitor from the cluster:

# ceph mon remove <mon_id>

Is it valid to run that command inside of the ceph-mon container?

6. On all Controller nodes, remove the monitor entry from /etc/ceph/ceph.conf. For example, if you remove controller-1, then remove the IP and hostname for controller-1.
That section now contains changed content, for example:
mon host = [v2:172.17.3.34:3300,v1:172.17.3.34:6789],[v2:172.17.3.89:3300,v1:172.17.3.89:6789],[v2:172.17.3.96:3300,v1:172.17.3.96:6789]
mon initial members = controller-0,controller-2,controller-3


Is it OK to remove both the v1 and v2 entries from the mon host list?

Comment 3 John Fulton 2019-07-22 16:47:18 UTC
(In reply to Artem Hrechanychenko from comment #2)
> John, 
> can you check for next items:
> 
>  Remove the monitor from the cluster:
> 
> # ceph mon remove <mon_id>
> 
> is it valid to run that command inside of ceph-mon container?

Yes, that's the preferred way as the 'ceph' command may not always be on the overcloud image. 

> 6.On all Controller nodes, remove the monitor entry from
> /etc/ceph/ceph.conf. For example, if you remove controller-1, then remove
> the IP and hostname for controller-1. 
> that section now contains changed content:
> example
> mon host =
> [v2:172.17.3.34:3300,v1:172.17.3.34:6789],[v2:172.17.3.89:3300,v1:172.17.3.
> 89:6789],[v2:172.17.3.96:3300,v1:172.17.3.96:6789]
> mon initial members = controller-0,controller-2,controller-3
> 
> 
> is that ok to remove for v1,v2 in mon host list?

Yes, if the node is removed, then both its v1 and v2 entries should be removed.
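
To make the edit concrete, here is a hypothetical before/after for removing controller-1, assuming purely for illustration that its monitor address was 172.17.3.50:

Before:
mon host = [v2:172.17.3.34:3300,v1:172.17.3.34:6789],[v2:172.17.3.50:3300,v1:172.17.3.50:6789],[v2:172.17.3.89:3300,v1:172.17.3.89:6789]
mon initial members = controller-0,controller-1,controller-2

After:
mon host = [v2:172.17.3.34:3300,v1:172.17.3.34:6789],[v2:172.17.3.89:3300,v1:172.17.3.89:6789]
mon initial members = controller-0,controller-2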

Comment 4 John Fulton 2019-07-22 16:48:44 UTC
(In reply to Artem Hrechanychenko from comment #1)
> Damien
> Can you check for next items:
> (undercloud) $ for NAME in overcloud-controller-0 overcloud-controller-2; do
> IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f
> 2) ; ssh heat-admin@$IP "sudo pcs cluster localnode remove
> overcloud-controller-1; sudo pcs cluster reload corosync"; done
> is it valid to use sudo pcs cluster node remove controller-1 instead of sudo
> pcs cluster localnode remove overcloud-controller-1

Setting needinfo back to dciabrin for the above question

Comment 7 Artem Hrechanychenko 2019-07-27 10:22:43 UTC
According to the OSP15-beta documentation, the commands are still invalid:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/15-beta/html-single/director_installation_and_usage/index#replacing-controller-nodes
 Check the following parameters on each node of the overcloud MariaDB cluster:

    wsrep_local_state_comment: Synced

    wsrep_cluster_size: 2

    Use the following command to check these parameters on each running Controller node. In this example, the Controller node IP addresses are 192.168.0.47 and 192.168.0.46:

    (undercloud) $ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql -p\$(sudo hiera -c /etc/puppet/hiera.yaml mysql::server::root_password) --execute=\"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done

(undercloud) [stack@undercloud-0 ~]$ for i in 192.168.24.6 192.168.24.7 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql -p\$(sudo hiera -c /etc/puppet/hiera.yaml mysql::server::root_password) --execute=\"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
*** 192.168.24.6 ***
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
sudo: mysql: command not found
*** 192.168.24.7 ***
Warning: Permanently added '192.168.24.7' (ECDSA) to the list of known hosts.
sudo: mysql: command not found

(undercloud) [stack@undercloud-0 ~]$ for i in 192.168.24.6 192.168.24.7 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo podman exec \$(sudo podman ps --filter name=galera-bundle -q) mysql -e \"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
*** 192.168.24.6 ***
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
Variable_name	Value
wsrep_local_state_comment	Synced
Variable_name	Value
wsrep_cluster_size	3
*** 192.168.24.7 ***
Warning: Permanently added '192.168.24.7' (ECDSA) to the list of known hosts.
Variable_name	Value
wsrep_local_state_comment	Synced
Variable_name	Value
wsrep_cluster_size	3




(undercloud) [stack@undercloud-0 ~]$ for NAME in controller-0 controller-2; do IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f 2) ; ssh heat-admin@$IP "sudo pcs cluster localnode remove controller-1; sudo pcs cluster reload corosync"; done

Warning: Permanently added '192.168.24.7' (ECDSA) to the list of known hosts.

Usage: pcs cluster ...
    setup <cluster name> (<node name> [addr=<node address>]...)...
            [transport knet|udp|udpu
                [<transport options>] [link <link options>]
                [compression <compression options>] [crypto <crypto options>]
            ] [totem <totem options>] [quorum <quorum options>]
            [--enable] [--start [--wait[=<n>]]] [--no-keys-sync]
        Create a cluster from the listed nodes and synchronize cluster
        configuration files to them.

        Nodes are specified by their names and optionally their addresses. If
        no addresses are specified for a node, pcs will configure corosync to
        communicate with that node using an address provided in 'pcs host auth'
        command. Otherwise, pcs will configure corosync to communicate with the
        node using the specified addresses.

        Transport knet:
        This is the default transport. It allows configuring traffic encryption
        and compression as well as using multiple addresses (links) for nodes.
        Transport options are:
            ip_version, knet_pmtud_interval, link_mode
        Link options are:
            ip_version, link_priority, linknumber, mcastport, ping_interval,
            ping_precision, ping_timeout, pong_count, transport (udp or sctp)
        Compression options are:
            level, model, threshold
        Crypto options are:
            cipher, hash, model
            By default, encryption is enabled with cipher=aes256 and
            hash=sha256. To disable encryption, set cipher=none and hash=none.

        Transports udp and udpu:
        WARNING: These transports are not supported in RHEL 8.
        These transports are limited to one address per node. They do not
        support traffic encryption nor compression.
        Transport options are:
            ip_version, netmtu
        Link options are:
            bindnetaddr, broadcast, mcastaddr, mcastport, ttl

        Totem and quorum can be configured regardles of used transport.
        Totem options are:
            consensus, downcheck, fail_recv_const, heartbeat_failures_allowed,
            hold, join, max_messages, max_network_delay, merge,
            miss_count_const, send_join, seqno_unchanged_const, token,
            token_coefficient, token_retransmit,
            token_retransmits_before_loss_const, window_size
        Quorum options are:
            auto_tie_breaker, last_man_standing, last_man_standing_window,
            wait_for_all

        Transports and their options, link, compression, crypto and totem
        options are all documented in corosync.conf(5) man page; knet link
        options are prefixed 'knet_' there, compression options are prefixed
        'knet_compression_' and crypto options are prefixed 'crypto_'. Quorum
        options are documented in votequorum(5) man page.

        --enable will configure the cluster to start on nodes boot.
        --start will start the cluster right after creating it.
        --wait will wait up to 'n' seconds for the cluster to start.
        --no-keys-sync will skip creating and distributing pcsd SSL certificate
            and key and corosync and pacemaker authkey files. Use this if you
            provide your own certificates and keys.

        Examples:
        Create a cluster with default settings:
            pcs cluster setup newcluster node1 node2
        Create a cluster using two links:
            pcs cluster setup newcluster node1 addr=10.0.1.11 addr=10.0.2.11 \
                node2 addr=10.0.1.12 addr=10.0.2.12
        Create a cluster using udp transport with a non-default port:
            pcs cluster setup newcluster node1 node2 transport udp link \
                mcastport=55405

    start [--all | <node>... ] [--wait[=<n>]] [--request-timeout=<seconds>]
        Start a cluster on specified node(s). If no nodes are specified then
        start a cluster on the local node. If --all is specified then start
        a cluster on all nodes. If the cluster has many nodes then the start
        request may time out. In that case you should consider setting
        --request-timeout to a suitable value. If --wait is specified, pcs
        waits up to 'n' seconds for the cluster to get ready to provide
        services after the cluster has successfully started.

    stop [--all | <node>... ] [--request-timeout=<seconds>]
        Stop a cluster on specified node(s). If no nodes are specified then
        stop a cluster on the local node. If --all is specified then stop
        a cluster on all nodes. If the cluster is running resources which take
        long time to stop then the stop request may time out before the cluster
        actually stops. In that case you should consider setting
        --request-timeout to a suitable value.

    kill
        Force corosync and pacemaker daemons to stop on the local node
        (performs kill -9). Note that init system (e.g. systemd) can detect that
        cluster is not running and start it again. If you want to stop cluster
        on a node, run pcs cluster stop on that node.

    enable [--all | <node>... ]
        Configure cluster to run on node boot on specified node(s). If node is
        not specified then cluster is enabled on the local node. If --all is
        specified then cluster is enabled on all nodes.

    disable [--all | <node>... ]
        Configure cluster to not run on node boot on specified node(s). If node
        is not specified then cluster is disabled on the local node. If --all
        is specified then cluster is disabled on all nodes.

    auth [-u <username>] [-p <password>]
        Authenticate pcs/pcsd to pcsd on nodes configured in the local cluster.

    status
        View current cluster status (an alias of 'pcs status cluster').

    pcsd-status [<node>]...
        Show current status of pcsd on nodes specified, or on all nodes
        configured in the local cluster if no nodes are specified.

    sync
        Sync cluster configuration (files which are supported by all
        subcommands of this command) to all cluster nodes.

    sync corosync
        Sync corosync configuration to all nodes found from current
        corosync.conf file.

    cib [filename] [scope=<scope> | --config]
        Get the raw xml from the CIB (Cluster Information Base).  If a filename
        is provided, we save the CIB to that file, otherwise the CIB is
        printed.  Specify scope to get a specific section of the CIB.  Valid
        values of the scope are: configuration, nodes, resources, constraints,
        crm_config, rsc_defaults, op_defaults, status.  --config is the same as
        scope=configuration.  Do not specify a scope if you want to edit
        the saved CIB using pcs (pcs -f <command>).

    cib-push <filename> [--wait[=<n>]]
            [diff-against=<filename_original> | scope=<scope> | --config]
        Push the raw xml from <filename> to the CIB (Cluster Information Base).
        You can obtain the CIB by running the 'pcs cluster cib' command, which
        is recommended first step when you want to perform desired
        modifications (pcs -f <command>) for the one-off push.
        If diff-against is specified, pcs diffs contents of filename against
        contents of filename_original and pushes the result to the CIB.
        Specify scope to push a specific section of the CIB.  Valid values
        of the scope are: configuration, nodes, resources, constraints,
        crm_config, rsc_defaults, op_defaults.  --config is the same as
        scope=configuration.  Use of --config is recommended.  Do not specify
        a scope if you need to push the whole CIB or be warned in the case
        of outdated CIB.
        If --wait is specified wait up to 'n' seconds for changes to be applied.
        WARNING: the selected scope of the CIB will be overwritten by the
        current content of the specified file.
        Example:
            pcs cluster cib > original.xml
            cp original.xml new.xml
            pcs -f new.xml constraint location apache prefers node2
            pcs cluster cib-push new.xml diff-against=original.xml

    cib-upgrade
        Upgrade the CIB to conform to the latest version of the document schema.

    edit [scope=<scope> | --config]
        Edit the cib in the editor specified by the $EDITOR environment
        variable and push out any changes upon saving.  Specify scope to edit
        a specific section of the CIB.  Valid values of the scope are:
        configuration, nodes, resources, constraints, crm_config, rsc_defaults,
        op_defaults.  --config is the same as scope=configuration.  Use of
        --config is recommended.  Do not specify a scope if you need to edit
        the whole CIB or be warned in the case of outdated CIB.

    node add <node name> [addr=<node address>]... [watchdog=<watchdog path>]
            [device=<SBD device path>]... [--start [--wait[=<n>]]] [--enable]
            [--no-watchdog-validation]
        Add the node to the cluster and synchronize all relevant configuration
        files to the new node. This command can only be run on an existing
        cluster node.

        The new node is specified by its name and optionally its addresses. If
        no addresses are specified for the node, pcs will configure corosync to
        communicate with the node using an address provided in 'pcs host auth'
        command. Otherwise, pcs will configure corosync to communicate with the
        node using the specified addresses.

        Use 'watchdog' to specify a path to a watchdog on the new node, when
        SBD is enabled in the cluster. If SBD is configured with shared storage,
        use 'device' to specify path to shared device(s) on the new node.

        If --start is specified also start cluster on the new node, if --wait
        is specified wait up to 'n' seconds for the new node to start. If
        --enable is specified configure cluster to start on the new node on
        boot. If --no-watchdog-validation is specified, validation of watchdog
        will be skipped.

        WARNING: By default, it is tested whether the specified watchdog is
                 supported. This may cause a restart of the system when
                 a watchdog with no-way-out-feature enabled is present. Use
                 --no-watchdog-validation to skip watchdog validation.

    node delete <node name> [<node name>]...
        Shutdown specified nodes and remove them from the cluster.

    node remove <node name> [<node name>]...
        Shutdown specified nodes and remove them from the cluster.

    node add-remote <node name> [<node address>] [options]
           [op <operation action> <operation options> [<operation action>
           <operation options>]...] [meta <meta options>...] [--wait[=<n>]]
        Add the node to the cluster as a remote node. Sync all relevant
        configuration files to the new node. Start the node and configure it to
        start the cluster on boot.
        Options are port and reconnect_interval. Operations and meta
        belong to an underlying connection resource (ocf:pacemaker:remote).
        If node address is not specified for the node, pcs will configure
        pacemaker to communicate with the node using an address provided in
        'pcs host auth' command. Otherwise, pcs will configure pacemaker to
        communicate with the node using the specified addresses.
        If --wait is specified, wait up to 'n' seconds for the node to start.

    node delete-remote <node identifier>
        Shutdown specified remote node and remove it from the cluster.
        The node-identifier can be the name of the node or the address of the
        node.

    node remove-remote <node identifier>
        Shutdown specified remote node and remove it from the cluster.
        The node-identifier can be the name of the node or the address of the
        node.

    node add-guest <node name> <resource id> [options] [--wait[=<n>]]
        Make the specified resource a guest node resource. Sync all relevant
        configuration files to the new node. Start the node and configure it to
        start the cluster on boot.
        Options are remote-addr, remote-port and remote-connect-timeout.
        If remote-addr is not specified for the node, pcs will configure
        pacemaker to communicate with the node using an address provided in
        'pcs host auth' command. Otherwise, pcs will configure pacemaker to
        communicate with the node using the specified addresses.
        If --wait is specified, wait up to 'n' seconds for the node to start.


(undercloud) [stack@undercloud-0 ~]$ for NAME in controller-0 controller-2; do IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f 2) ; ssh heat-admin@$IP "sudo pcs cluster node remove controller-1"; done
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
Destroying cluster on hosts: 'controller-1'...
controller-1: Successfully destroyed cluster
Sending updated corosync.conf to nodes...
controller-2: Succeeded
controller-0: Succeeded
controller-2: Corosync configuration reloaded
Warning: Permanently added '192.168.24.7' (ECDSA) to the list of known hosts.
Error: Node 'controller-1' does not appear to exist in configuration
Error: Errors have occurred, therefore pcs is unable to continue


    node delete-guest <node identifier>
        Shutdown specified guest node and remove it from the cluster.
        The node-identifier can be the name of the node or the address of the
        node or id of the resource that is used as the guest node.

    node remove-guest <node identifier>
        Shutdown specified guest node and remove it from the cluster.
        The node-identifier can be the name of the node or the address of the
        node or id of the resource that is used as the guest node.

    node clear <node name>
        Remove specified node from various cluster caches. Use this if a
        removed node is still considered by the cluster to be a member of the
        cluster.

    uidgid
        List the current configured uids and gids of users allowed to connect
        to corosync.

    uidgid add [uid=<uid>] [gid=<gid>]
        Add the specified uid and/or gid to the list of users/groups
        allowed to connect to corosync.

    uidgid delete [uid=<uid>] [gid=<gid>]
        Remove the specified uid and/or gid from the list of users/groups
        allowed to connect to corosync.

    uidgid remove [uid=<uid>] [gid=<gid>]
        Remove the specified uid and/or gid from the list of users/groups
        allowed to connect to corosync.

    corosync [node]
        Get the corosync.conf from the specified node or from the current node
        if node not specified.

    reload corosync
        Reload the corosync configuration on the current node.

    destroy [--all]
        Permanently destroy the cluster on the current node, killing all
        cluster processes and removing all cluster configuration files. Using
        --all will attempt to destroy the cluster on all nodes in the local
        cluster.
        WARNING: This command permanently removes any cluster configuration that
        has been created. It is recommended to run 'pcs cluster stop' before
        destroying the cluster.

    verify [--full] [-f <filename>]
        Checks the pacemaker configuration (CIB) for syntax and common
        conceptual errors. If no filename is specified the check is performed
        on the currently running cluster. If --full is used more verbose output
        will be printed.

    report [--from "YYYY-M-D H:M:S" [--to "YYYY-M-D H:M:S"]] <dest>
        Create a tarball containing everything needed when reporting cluster
        problems.  If --from and --to are not used, the report will include
        the past 24 hours.

Corosync reloaded

Comment 8 Artem Hrechanychenko 2019-07-27 10:24:08 UTC
Hi Damien,
it looks like the loop
for NAME in overcloud-controller-0 overcloud-controller-2; do IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f 2) ; ssh heat-admin@$IP "sudo pcs cluster node remove overcloud-controller-1"; done
is also invalid.

After deleting controller-1 from pcs on controller-0, is it automatically deleted from controller-2 as well?

Comment 9 Damien Ciabrini 2019-08-06 14:46:59 UTC
So the replacement procedure times out, and I can see that controller-3 was never added to the cluster, although it should have been added automatically.
This is managed by Puppet on the host that is the bootstrap node.

From the journal I can see:

# journalctl -t puppet-user
[...]
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: Host 'controller-3' is not known to pcs, try>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: None of hosts is known to pcs.
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: Errors have occurred, therefore pcs is unabl>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Error: '/sbin/pcs cluster node add controller-3 addr=172.17.1.148 --start --wait' returned 1 instead of one of [0]
Aug 06 12:08:22 controller-0 puppet-user[595062]: Error: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: change from 'notrun' to ['0'] failed: '/sbin/pcs clu>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[node-cluster-start-controller-3]: Dependency Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster] has failur>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Warning: /Stage[main]/Pacemaker::Corosync/Exec[node-cluster-start-controller-3]: Skipping because of failed dependencies
Aug 06 12:08:23 controller-0 puppet-user[595062]: Notice: Applied catalog in 222.91 seconds

The pcs in RHEL8 probably requires some additional setup calls.
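
For reference, a minimal sketch of the extra setup that pcs 0.10 expects before a node can be added (the hacluster user and the password placeholder are assumptions for illustration; the linked gerrit change 674925 makes puppet-pacemaker run this step automatically):

sudo pcs host auth controller-3 addr=172.17.1.148 -u hacluster -p <password>
sudo pcs cluster node add controller-3 addr=172.17.1.148 --start --wait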

Comment 10 Damien Ciabrini 2019-08-07 16:10:41 UTC
The timeout was due to a puppet-pacemaker bug, tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1737456.

There is still a doc bug though, since the procedure changed a bit for OSP15.
Here are the paragraphs to update in the doc [1]:


Step 14.3 "Preparing the cluster for Controller replacement"

Remove steps 14.3.3 and 14.3.4 and replace them with the following:

14.3._x_ After stopping Pacemaker on the old node, delete that old node from the pacemaker cluster and remove it from the list of hosts known by pcsd.

The following example command logs in to overcloud-controller-0 to remove overcloud-controller-1:

(undercloud) $ ssh heat-admin@192.168.0.47
[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster node remove overcloud-controller-1
[heat-admin@overcloud-controller-0 ~]$ sudo pcs host deauth overcloud-controller-1
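
As an optional sanity check (not part of the documented procedure), verify afterwards that overcloud-controller-1 no longer appears in the cluster node list:

[heat-admin@overcloud-controller-0 ~]$ sudo pcs status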



[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/director_installation_and_usage/index#preparing-for-controller-replacement

Comment 19 Dan Macpherson 2019-08-21 16:23:09 UTC
*** Bug 1740124 has been marked as a duplicate of this bug. ***

Comment 24 Dan Macpherson 2019-09-16 17:08:28 UTC
This content has been tested by Artem for OSP15 (thanks Artem!) and since the new content is mostly command substitutions, no peer review is required.

Switching to VERIFIED.

