Created attachment 1582194 [details]
ansible and ceph-ansible log files

Description of problem:
OSP15: an attempt to replace a controller failed while executing ceph-ansible.

2019-06-19 08:16:19,367 p=33144 u=root |  failed: [controller-3 -> 192.168.24.21] (item=controller-3) => changed=true
  ansible_loop_var: item
  cmd:
  - /usr/bin/env
  - bash
  - /tmp/restart_mon_daemon.sh
  delta: '0:01:04.885699'
  end: '2019-06-19 12:16:19.315997'
  invocation:
    module_args:
      _raw_params: /usr/bin/env bash /tmp/restart_mon_daemon.sh
      _uses_shell: false
      argv: null
      chdir: null
      creates: null
      executable: null
      removes: null
      stdin: null
      stdin_add_newline: true
      strip_empty_ends: true
      warn: true
  item: controller-3
  msg: non-zero return code
  rc: 1
  start: '2019-06-19 12:15:14.430298'
  stderr: |-
    exit status 1
    unable to exec into ceph-mon-controller-3: no container with name or ID ceph-mon-controller-3 found: no such container

Meanwhile, the container is up and running on the node:

[heat-admin@controller-3 ~]$ sudo podman ps
CONTAINER ID  IMAGE                                           COMMAND               CREATED         STATUS             PORTS  NAMES
97b62d4dfd29  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  35 minutes ago  Up 35 minutes ago         ceph-mon-controller-3
614ad196736c  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  35 minutes ago  Up 35 minutes ago         ceph-mgr-controller-3

Version-Release number of selected component (if applicable):
ceph-ansible-4.0.0-0.1.rc9.el8cp.noarch
openstack-tripleo-heat-templates-10.5.1-0.20190614201227.9fee07b.el8ost.noarch

How reproducible:


Steps to Reproduce:
1. Deploy OSP15 with 3 controller, 3 compute, and 3 ceph nodes.
2. Try to replace a controller using the OSP14 documentation with the following changes:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/director_installation_and_usage/index#preparing-for-controller-replacement

Check the following parameters on each node of the overcloud MariaDB cluster. Use the following command to check these parameters on each running Controller node:

sudo podman exec -it $(sudo podman ps --filter name=galera-bundle -q) mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';"

Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the status:

ssh heat-admin@192.168.0.47 "sudo podman exec \$(sudo podman ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"

12.2. Removing a Ceph Monitor daemon

sudo podman exec -it ceph-mon-controller-0 ceph mon remove controller-1

12.3. Preparing the cluster for Controller replacement

The following example command logs in to overcloud-controller-0 and overcloud-controller-2 to remove overcloud-controller-1:

(undercloud) $ for NAME in overcloud-controller-0 overcloud-controller-2; do IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f 2) ; ssh heat-admin@$IP "sudo pcs cluster node remove controller-1; sudo pcs cluster reload corosync"; done

Actual results:
ceph-ansible fails with:
unable to exec into ceph-mon-controller-3: no container with name or ID ceph-mon-controller-3 found: no such container

Expected results:
The controller replacement passes.

Additional info:
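For anyone triaging a similar failure: a quick way to check whether the mon container is actually reachable at the moment the handler fails is to repeat the same exec by hand on the affected node. A minimal sketch, assuming the ceph-mon-<short hostname> naming convention visible in the podman ps output above:

  # Run on the affected controller; mirrors the exec the handler script fails on.
  MON_NAME="ceph-mon-$(hostname -s)"   # e.g. ceph-mon-controller-3
  sudo podman ps --filter "name=${MON_NAME}" --format '{{.Names}} {{.Status}}'
  # If the container shows as Up, try the same kind of exec ceph-ansible attempts:
  sudo podman exec "${MON_NAME}" ceph --cluster ceph -s

If the manual exec succeeds while the ansible run failed moments earlier, that points at a timing race rather than a missing container.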
Asking for the blocker flag because this is a regression scenario.
*** This bug has been marked as a duplicate of bug 1719013 ***
I think these are different. There's a similarity to bug 1719013 but I'm re-opening to dig into it more. I think this might be a duplicate of a different bug.
Please retry this test but add the following to the deployment:

CephAnsibleExtraConfig:
  handler_health_mon_check_retries: 10
  handler_health_mon_check_delay: 20
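For context, these two values tune the health check that the generated /tmp/restart_mon_daemon.sh performs after restarting a mon. A simplified sketch of the kind of poll loop they control (an illustration only, not the literal ceph-ansible template):

  #!/usr/bin/env bash
  # Illustrative sketch -- the real script is templated by ceph-ansible.
  RETRIES=10   # handler_health_mon_check_retries
  DELAY=20     # handler_health_mon_check_delay
  MON="ceph-mon-$(hostname -s)"
  while [ "$RETRIES" -gt 0 ]; do
      # Treat the restart as successful once the mon answers cluster queries again.
      if podman exec "$MON" ceph --cluster ceph -s >/dev/null 2>&1; then
          exit 0
      fi
      sleep "$DELAY"
      RETRIES=$((RETRIES - 1))
  done
  echo "Error with quorum." >&2   # matches the error string seen in the logs
  exit 1

Raising the retries/delay simply gives a slow monitor more time to come back before the handler gives up.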
Until ceph-ansible bug 1718981 is resolved, you'll need to apply the workaround in comment #4, so I'm marking it as a blocker of this bug.
(In reply to John Fulton from comment #4)
> Please retry this test but add the following to the deployment:
>
> CephAnsibleExtraConfig:
>   handler_health_mon_check_retries: 10
>   handler_health_mon_check_delay: 20

[stack@undercloud-0 ~]$ cat overcloud_replace.sh
#!/bin/bash
openstack overcloud deploy \
  --timeout 100 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --stack overcloud \
  --libvirt-type kvm \
  --ntp-server clock.redhat.com \
  -e /home/stack/virt/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/network/dvr-override.yaml \
  -e /home/stack/virt/enable-tls.yaml \
  -e /home/stack/virt/inject-trust-anchor.yaml \
  -e /home/stack/virt/public_vip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/virt/nodes_data.yaml \
  -e ~/containers-prepare-parameter.yaml \
  -e /home/stack/virt/extra_templates.yaml \
  -e ~/remove-controller.yaml \
  -e ~/ceph_wa.yaml \
  --log-file overcloud_deployment_92.log

[stack@undercloud-0 ~]$ cat ceph_wa.yaml
parameter_defaults:
  CephAnsibleExtraConfig:
    handler_health_mon_check_retries: 10
    handler_health_mon_check_delay: 20

With the workaround applied, the deployment still fails with the same error:

"<192.168.24.21> Failed to connect to the host via ssh: ",
"failed: [controller-3 -> 192.168.24.21] (item=controller-3) => changed=true ",
"  - /usr/bin/env",
"  - bash",
"  - /tmp/restart_mon_daemon.sh",
"  delta: '0:03:47.469168'",
"  end: '2019-06-19 17:12:37.805898'",
"  _raw_params: /usr/bin/env bash /tmp/restart_mon_daemon.sh",
"  start: '2019-06-19 17:08:50.336730'",
"  exit status 1",
"  unable to exec into ceph-mon-controller-3: no container with name or ID ceph-mon-controller-3 found: no such container",
"  Error with quorum.",
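Given the "Failed to connect to the host via ssh" line above, it may also be worth verifying from the undercloud that the delegate address ceph-ansible uses (192.168.24.21) really maps to controller-3 and that the mon container is visible over the same ssh path ansible takes. A quick check, assuming the standard heat-admin access:

  # From the undercloud: confirm which node owns the delegate IP
  openstack server list -c Name -c Networks -f value | grep 192.168.24.21
  # Confirm the mon container is visible over the same ssh path ansible uses
  ssh heat-admin@192.168.24.21 \
    'sudo podman ps --filter name=ceph-mon --format "{{.Names}} {{.Status}}"'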
PR 1410 has not yet merged
(undercloud) [stack@undercloud-0 ~]$ rpm -qa ceph-ansible
ceph-ansible-4.0.0-0.1.rc10.el8cp.noarch

With rc10 the run fails later, on pool creation, with the same "no such container" symptom:

"failed: [ceph-2 -> 192.168.24.8] (item=[{'application': 'openstack_gnocchi', 'name': 'metrics', 'pg_num': 32, 'rule_name': 'replicated_rule'}, {'msg': 'non-zero return code', 'cmd': ['podman', 'exec', 'ceph-mon-controller-0', 'ceph', '--cluster', 'ceph', 'osd', 'pool', 'get', 'metrics', 'size'], 'stdout': '', 'stderr': 'unable to exec into ceph-mon-controller-0: no container with name or ID ceph-mon-controller-0 found: no such container', 'rc': 125, 'start': '2019-07-17 16:49:47.920625', 'end': '2019-07-17 16:49:47.966148', 'delta': '0:00:00.045523', 'changed': True, 'failed': False, 'invocation': {'module_args': {'_raw_params': 'podman exec ceph-mon-controller-0 ceph --cluster ceph osd pool get metrics size\\n', 'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': [], 'stderr_lines': ['unable to exec into ceph-mon-controller-0: no container with name or ID ceph-mon-controller-0 found: no such container'], 'failed_when_result': False, 'item': {'application': 'openstack_gnocchi', 'name': 'metrics', 'pg_num': 32, 'rule_name': 'replicated_rule'}, 'ansible_loop_var': 'item'}]) => changed=false ",
"  delta: '0:00:00.053923'",
"  end: '2019-07-17 16:49:49.504360'",
"  podman exec ceph-mon-controller-0 ceph --cluster ceph osd pool create metrics 32 32 replicated_rule 1",
"  - application: openstack_gnocchi",
"  - metrics",
"  delta: '0:00:00.045523'",
"  end: '2019-07-17 16:49:47.966148'",
"  podman exec ceph-mon-controller-0 ceph --cluster ceph osd pool get metrics size",
"  application: openstack_gnocchi",
"  name: metrics",
"  start: '2019-07-17 16:49:47.920625'",
"  start: '2019-07-17 16:49:49.450437'",

[heat-admin@ceph-2 ~]$ sudo podman ps -a
CONTAINER ID  IMAGE                                                COMMAND               CREATED       STATUS           PORTS  NAMES
77e3cf880b9c  192.168.24.1:8787/rhosp15/openstack-cron:20190711.1  dumb-init --singl...  23 hours ago  Up 23 hours ago         logrotate_crond
9947cb175aed  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-8
6321d76031e1  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-5
00ddb30cbf84  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-14
b83a4a18df38  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-11
47242e9e34b7  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-1
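Note that the podman ps -a output shows why the exec fails here: the task runs on ceph-2, which only hosts OSD (and cron) containers, while ceph-mon-controller-0 lives on a controller. One way to confirm where each mon container actually runs — a sketch, assuming heat-admin access to all overcloud nodes and the ctlplane Networks format used earlier in this report:

  # List any mon containers on every overcloud node
  for IP in $(openstack server list -c Networks -f value | cut -d '=' -f 2); do
      echo "== ${IP} =="
      ssh heat-admin@"${IP}" \
          'sudo podman ps --filter name=ceph-mon --format "{{.Names}}"' 2>/dev/null
  done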
Created attachment 1591736 [details] oc logs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0312