Bug 1722066 - Replace controller scenario - RUNNING HANDLER [ceph-handler : restart ceph mon daemon(s) - container] failed with "unable to exec into ceph-mon-controller-3: no container with name or ID ceph-mon-controller-3 found: no such container"
Summary: Replace controller scenario - RUNNING HANDLER [ceph-handler : restart ceph mo...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Ansible
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: 4.0
Assignee: Dimitri Savineau
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On: 1718981
Blocks: 1594251
 
Reported: 2019-06-19 13:00 UTC by Artem Hrechanychenko
Modified: 2020-02-05 07:35 UTC (History)
15 users

Fixed In Version: ceph-ansible-4.0.0-0.1.rc10.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1732157
Environment:
Last Closed: 2020-01-31 12:46:20 UTC
Target Upstream Version:


Attachments (Terms of Use)
ansible and ceph-ansible log files (3.22 MB, application/gzip)
2019-06-19 13:00 UTC, Artem Hrechanychenko
oc logs (10.43 MB, application/gzip)
2019-07-18 09:39 UTC, Artem Hrechanychenko


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-container pull 1410 0 'None' closed daemon/start_mon.sh: add mon without port value 2021-01-08 01:26:23 UTC
Github ceph ceph-container pull 1412 0 'None' closed daemon/start_mon.sh: add mon without port value (bp #1410) 2021-01-08 01:25:44 UTC
Red Hat Product Errata RHBA-2020:0312 0 None None None 2020-01-31 12:46:56 UTC

Internal Links: 1719013

Description Artem Hrechanychenko 2019-06-19 13:00:48 UTC
Created attachment 1582194 [details]
ansible and ceph-ansible log files

Description of problem:
OSP15 
An attempt to replace a controller failed while executing ceph-ansible

2019-06-19 08:16:19,367 p=33144 u=root |  failed: [controller-3 -> 192.168.24.21] (item=controller-3) => changed=true 
  ansible_loop_var: item
  cmd:
  - /usr/bin/env
  - bash
  - /tmp/restart_mon_daemon.sh
  delta: '0:01:04.885699'
  end: '2019-06-19 12:16:19.315997'
  invocation:
    module_args:
      _raw_params: /usr/bin/env bash /tmp/restart_mon_daemon.sh
      _uses_shell: false
      argv: null
      chdir: null
      creates: null
      executable: null
      removes: null
      stdin: null
      stdin_add_newline: true
      strip_empty_ends: true
      warn: true
  item: controller-3
  msg: non-zero return code
  rc: 1
  start: '2019-06-19 12:15:14.430298'
  stderr: |-
    exit status 1
    unable to exec into ceph-mon-controller-3: no container with name or ID ceph-mon-controller-3 found: no such container


[heat-admin@controller-3 ~]$ sudo podman ps
CONTAINER ID  IMAGE                                           COMMAND               CREATED         STATUS             PORTS  NAMES
97b62d4dfd29  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  35 minutes ago  Up 35 minutes ago         ceph-mon-controller-3
614ad196736c  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  35 minutes ago  Up 35 minutes ago         ceph-mgr-controller-3
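The mismatch above (the container is listed by `podman ps`, yet `podman exec` reports no such container) can be checked directly on the node. A minimal sketch of that check, with the captured names inlined in place of a live `sudo podman ps --format '{{.Names}}'` call:

```shell
# Names as reported by podman on controller-3 (inlined from the output
# above; on a live node use: sudo podman ps --format '{{.Names}}').
names='ceph-mon-controller-3
ceph-mgr-controller-3'

# The name the restart handler tries to exec into.
target="ceph-mon-controller-3"

# Exact-match check, as a stand-in for `sudo podman exec $target ...`.
if printf '%s\n' "$names" | grep -qx "$target"; then
  echo "name listed: $target"
else
  echo "no container named $target"
fi
```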


Version-Release number of selected component (if applicable):
ceph-ansible-4.0.0-0.1.rc9.el8cp.noarch
openstack-tripleo-heat-templates-10.5.1-0.20190614201227.9fee07b.el8ost.noarch

How reproducible:


Steps to Reproduce:
1. Deploy OSP15 with 3 controller, 3 compute, and 3 ceph nodes
2. Try to replace a controller using the documentation from OSP14 with the following changes:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/director_installation_and_usage/index#preparing-for-controller-replacement

Check the following parameters on each node of the overcloud MariaDB cluster.
Use the following command to check these parameters on each running Controller node:
sudo podman exec -it $(sudo podman ps --filter name=galera-bundle -q) mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';"

Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the status: 

ssh heat-admin@192.168.0.47 "sudo podman exec \$(sudo podman ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"


12.2. Removing a Ceph Monitor daemon
 sudo podman exec -it ceph-mon-controller-0 ceph mon remove controller-1
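After the `ceph mon remove` above, the removed mon should drop out of the monmap (visible via `ceph mon stat` or `ceph quorum_status`). A sketch of that verification, simulating the quorum list instead of exec'ing into the live mon container:

```shell
# Simulated quorum membership (on a live node this would come from:
# sudo podman exec ceph-mon-controller-0 ceph quorum_status).
quorum="controller-0,controller-2"

# controller-1 was removed above; it must no longer appear.
if printf '%s' "$quorum" | grep -qw 'controller-1'; then
  echo "controller-1 still present"
else
  echo "controller-1 removed from quorum"
fi
```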

12.3. Preparing the cluster for Controller replacement
 The following example command logs in to overcloud-controller-0 and overcloud-controller-2 to remove overcloud-controller-1:

(undercloud) $ for NAME in overcloud-controller-0 overcloud-controller-2; do IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f 2) ; ssh heat-admin@$IP "sudo pcs cluster node remove controller-1; sudo pcs cluster reload corosync"; done



Actual results:
Executing the ceph-ansible command failed with:
unable to exec into ceph-mon-controller-3: no container with name or ID ceph-mon-controller-3 found: no such container


Expected results:
Controller replacement passes.

Additional info:

Comment 1 Artem Hrechanychenko 2019-06-19 13:01:18 UTC
Asking for the blocker flag because this is a regression scenario.

Comment 2 Yogev Rabl 2019-06-19 14:04:42 UTC

*** This bug has been marked as a duplicate of bug 1719013 ***

Comment 3 John Fulton 2019-06-19 15:18:21 UTC
I think these are different. There's a similarity to bug 1719013 but I'm re-opening to dig into it more. I think this might be a duplicate of a different bug.

Comment 4 John Fulton 2019-06-19 15:20:58 UTC
Please retry this test but add the following to the deployment:

CephAnsibleExtraConfig:
  handler_health_mon_check_retries: 10
  handler_health_mon_check_delay: 20

Comment 5 John Fulton 2019-06-19 15:22:13 UTC
Until ceph-ansible bug 1718981 is resolved, you'll need to apply the workaround in comment #4, so I'm marking it as a blocker of this bug.

Comment 6 Artem Hrechanychenko 2019-06-20 09:12:22 UTC
(In reply to John Fulton from comment #4)
> Please retry this test but add the following to the deployment:
> 
> CephAnsibleExtraConfig:
>   handler_health_mon_check_retries: 10
>   handler_health_mon_check_delay: 20

[stack@undercloud-0 ~]$ cat overcloud_replace.sh 
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/network/dvr-override.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/extra_templates.yaml \
-e ~/remove-controller.yaml \
-e ~/ceph_wa.yaml \
--log-file overcloud_deployment_92.log

[stack@undercloud-0 ~]$ cat ceph_wa.yaml 
parameter_defaults:
  CephAnsibleExtraConfig:
    handler_health_mon_check_retries: 10
    handler_health_mon_check_delay: 20



       "<192.168.24.21> Failed to connect to the host via ssh: ",
        "failed: [controller-3 -> 192.168.24.21] (item=controller-3) => changed=true ",
        "  - /usr/bin/env",
        "  - bash",
        "  - /tmp/restart_mon_daemon.sh",
        "  delta: '0:03:47.469168'",
        "  end: '2019-06-19 17:12:37.805898'",
        "      _raw_params: /usr/bin/env bash /tmp/restart_mon_daemon.sh",
        "  start: '2019-06-19 17:08:50.336730'",
        "    exit status 1",
        "    unable to exec into ceph-mon-controller-3: no container with name or ID ceph-mon-controller-3 found: no such container",
        "    Error with quorum.",

Comment 11 John Fulton 2019-07-02 18:59:13 UTC
PR 1410 has not yet been merged.
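For context, the PR title above ("daemon/start_mon.sh: add mon without port value") suggests the fix registers the mon by its address without an explicit port. Purely as an illustration (the variable name and address below are hypothetical, not taken from start_mon.sh), dropping a port suffix in shell:

```shell
# Hypothetical mon address carrying an explicit port value.
mon_addr="172.17.3.10:6789"

# Keep only the address part, dropping the port value.
echo "${mon_addr%%:*}"   # prints 172.17.3.10
```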

Comment 16 Artem Hrechanychenko 2019-07-18 09:31:19 UTC
(undercloud) [stack@undercloud-0 ~]$ rpm -qa ceph-ansible
ceph-ansible-4.0.0-0.1.rc10.el8cp.noarch


 "failed: [ceph-2 -> 192.168.24.8] (item=[{'application': 'openstack_gnocchi', 'name': 'metrics', 'pg_num': 32, 'rule_name': 'replicated_rule'}, {'msg': 'non-zero return code', 'cmd': ['podman', 'exec', 'ceph-mon-controller-0', 'ceph', '--cluster', 'ceph', 'osd', 'pool', 'get', 'metrics', 'size'], 'stdout': '', 'stderr': 'unable to exec into ceph-mon-controller-0: no container with name or ID ceph-mon-controller-0 found: no such container', 'rc': 125, 'start': '2019-07-17 16:49:47.920625', 'end': '2019-07-17 16:49:47.966148', 'delta': '0:00:00.045523', 'changed': True, 'failed': False, 'invocation': {'module_args': {'_raw_params': 'podman exec ceph-mon-controller-0 ceph --cluster ceph osd pool get metrics size\\n', 'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': [], 'stderr_lines': ['unable to exec into ceph-mon-controller-0: no container with name or ID ceph-mon-controller-0 found: no such container'], 'failed_when_result': False, 'item': {'application': 'openstack_gnocchi', 'name': 'metrics', 'pg_num': 32, 'rule_name': 'replicated_rule'}, 'ansible_loop_var': 'item'}]) => changed=false ",
        "  delta: '0:00:00.053923'",
        "  end: '2019-07-17 16:49:49.504360'",
        "        podman exec ceph-mon-controller-0 ceph --cluster ceph osd pool create metrics 32 32 replicated_rule 1",
        "  - application: openstack_gnocchi",
        "    - metrics",
        "    delta: '0:00:00.045523'",
        "    end: '2019-07-17 16:49:47.966148'",
        "          podman exec ceph-mon-controller-0 ceph --cluster ceph osd pool get metrics size",
        "      application: openstack_gnocchi",
        "      name: metrics",
        "    start: '2019-07-17 16:49:47.920625'",
        "  start: '2019-07-17 16:49:49.450437'",

[heat-admin@ceph-2 ~]$ sudo podman ps -a
CONTAINER ID  IMAGE                                                COMMAND               CREATED       STATUS           PORTS  NAMES
77e3cf880b9c  192.168.24.1:8787/rhosp15/openstack-cron:20190711.1  dumb-init --singl...  23 hours ago  Up 23 hours ago         logrotate_crond
9947cb175aed  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-8
6321d76031e1  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-5
00ddb30cbf84  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-14
b83a4a18df38  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-11
47242e9e34b7  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest       /opt/ceph-contain...  23 hours ago  Up 23 hours ago         ceph-osd-1

Comment 18 Artem Hrechanychenko 2019-07-18 09:39:26 UTC
Created attachment 1591736 [details]
oc logs

Comment 37 errata-xmlrpc 2020-01-31 12:46:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0312

