Bug 1555002 - Ceph ansible playbooks set via CephAnsiblePlaybook run in parallel instead of serial
Summary: Ceph ansible playbooks set via CephAnsiblePlaybook run in parallel instead of serial
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-03-13 18:44 UTC by Marius Cornea
Modified: 2018-06-27 13:37 UTC (History)

Fixed In Version: openstack-tripleo-common-8.5.1-0.20180304032202.e8d9da9.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:36:14 UTC
Target Upstream Version:


Attachments
ceph-install-workflow.log (833.38 KB, text/plain)
2018-03-13 18:44 UTC, Marius Cornea


Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 553364 None MERGED Ensure ceph-ansible playbooks are run one at a time 2020-08-05 08:41:57 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:37:23 UTC

Description Marius Cornea 2018-03-13 18:44:37 UTC
Created attachment 1407700 [details]
ceph-install-workflow.log

Description of problem:
infrastructure-playbooks/rolling_update.yml fails while running the "container | waiting for the containerized monitor to join the quorum..." task during the OSP10 -> OSP13 Fast Forward Upgrade.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.27-1.el7cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph osd nodes
2. Run through the FFU procedure to upgrade to OSP13
3. Run the step to upgrade ceph services by migrating to containers

Actual results:
rolling_update.yml fails.

Expected results:
Upgrade succeeds without issues.

Additional info:
Attaching /var/log/mistral/ceph-install-workflow.log

Running the command manually seems to return the correct output:

[root@controller-1 ~]# docker exec ceph-mon-controller-1 ceph --cluster "ceph" -s --format json | jq .quorum_names
[
  "controller-2",
  "controller-1",
  "controller-0"
]

Comment 3 Marius Cornea 2018-03-13 18:45:09 UTC
This is the error:

2018-03-13 13:52:43,220 p=25772 u=mistral |  TASK [container | waiting for the containerized monitor to join the quorum...] ***
2018-03-13 13:52:43,339 p=25771 u=mistral |  FAILED - RETRYING: wait for monitor socket to exist (4 retries left).
2018-03-13 13:52:44,308 p=25772 u=mistral |  fatal: [192.168.24.11]: FAILED! => {"msg": "The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)[\"quorum_names\"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)[\"quorum_names\"]\n' failed. The error was: No JSON object could be decoded"}
2018-03-13 13:52:44,309 p=25772 u=mistral |  PLAY RECAP *********************************************************************

Comment 4 Marius Cornea 2018-03-13 21:24:31 UTC
I added a debug task to see what gets registered in ceph_health_raw and noticed that the docker exec command fails because the container is missing:

https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L171


2018-03-13 17:13:39,040 p=30957 u=mistral |  fatal: [192.168.24.11]: FAILED! => {"changed": true, "cmd": ["docker", "exec", "ceph-mon-controller-0", "ceph", "--cluster", "ceph", "-s", "--format", "json"], "delta": "0:00:00.034273", "end": "2018-03-13 21:13:37.636285", "msg": "non-zero return code", "rc": 1, "start": "2018-03-13 21:13:37.602012", "stderr": "Error response from daemon: No such container: ceph-mon-controller-0", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller-0"], "stdout": "", "stdout_lines": []}
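
The empty "stdout" in this result would explain the "No JSON object could be decoded" error from comment 3: the conditional passes ceph_health_raw.stdout through from_json, and an empty string is not valid JSON. Below is a minimal Python sketch of the same membership check, assuming a hypothetical helper named mon_in_quorum (it is not part of ceph-ansible) that treats empty or non-JSON output as "not in quorum" instead of raising:

```python
import json


def mon_in_quorum(stdout, hostname, fqdn):
    """Return True if hostname or fqdn is listed in quorum_names.

    Hypothetical helper for illustration only; not part of ceph-ansible.
    stdout is the raw text of `ceph -s --format json`.
    """
    try:
        status = json.loads(stdout)
    except ValueError:
        # Empty or non-JSON output, e.g. when "docker exec" itself failed.
        return False
    if not isinstance(status, dict):
        return False
    names = status.get("quorum_names", [])
    return hostname in names or fqdn in names
```

With the failing result above (stdout ""), this returns False rather than aborting, which would let a retry loop keep polling until the monitor container is back.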


Note that 192.168.24.11 is controller-1, not controller-0, in my environment:

[stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.11 'cat /etc/hostname'
Warning: Permanently added '192.168.24.11' (ECDSA) to the list of known hosts.
controller-1

Comment 5 Giulio Fidente 2018-03-13 22:20:07 UTC
Marius, the task is delegated to one node only (mon_host), set to the first member of the mons group. Can you try running the command on controller 0 and see what happens?

Comment 6 Marius Cornea 2018-03-13 23:24:23 UTC
(In reply to Giulio Fidente from comment #5)
> Marius, the task is delegated to one node only (mon_host), set to the first
> member of the mons group. Can you try running the command on controller 0
> and see what happens?

OK, so I delegated the task to mon_host, and it looks like the ceph-mon-controller-0 container is not running at the time the task fails. I saved the output of docker ps in /tmp/docker_ps.log right before the failing task:

[root@controller-0 ~]# grep ceph /tmp/docker_ps.log 
225923accfd8        registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest                                                "/entrypoint.sh"         24 minutes ago      Up 24 minutes                                 ceph-mgr-controller-0
[root@controller-0 ~]# 

Nevertheless, after the failure I can see that the container has been started:

[root@controller-0 ~]# docker ps | grep ceph
e5e652635578        registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest                                                "/entrypoint.sh"         5 minutes ago       Up 5 minutes                                  ceph-mgr-controller-0
4b2577a5e9d5        registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest                                                "/entrypoint.sh"         8 minutes ago       Up 8 minutes                                  ceph-mon-controller-0
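
The comparison between the two listings can also be done programmatically. This is a minimal sketch, assuming the container name is the last whitespace-separated column of each docker ps line, as in the output above; has_container is a hypothetical helper name:

```python
def has_container(docker_ps_output, name):
    """Return True if a docker ps listing contains a container called name.

    Hypothetical helper for illustration. Assumes the NAMES column is the
    last whitespace-separated field on each line.
    """
    for line in docker_ps_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == name:
            return True
    return False
```

Applied to the saved /tmp/docker_ps.log contents, has_container(text, "ceph-mon-controller-0") would be False before the failing task and True for the later listing.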

Comment 7 Marius Cornea 2018-03-14 00:33:47 UTC
Update: I commented out https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml#L86-L90 and this allowed the openstack overcloud deploy command to complete successfully. I noticed that the ceph-mon container is managed via systemd, so I guess this task was stopping it right before the docker exec ceph-mon-controller-0 ... command was executed.

Comment 9 Marius Cornea 2018-03-14 15:00:11 UTC
After some debugging, we found that the issue was caused by Mistral running the playbooks in parallel rather than serially.
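
The difference can be sketched as follows. This is an illustration only, not the actual tripleo-common or Mistral workflow code: run_playbooks_serially and the runner callable are hypothetical names. Each playbook starts only after the previous one has returned, and the run stops at the first failure, so one playbook cannot stop a monitor container while another is still waiting for it to join the quorum.

```python
def run_playbooks_serially(playbooks, runner):
    """Run playbooks one at a time, in order; stop at the first failure.

    Hypothetical sketch. runner is any callable that takes a playbook path
    and returns an exit code (for example, a wrapper around the
    ansible-playbook command line).
    Returns a list of (playbook, return_code) for the playbooks that ran.
    """
    results = []
    for playbook in playbooks:
        rc = runner(playbook)
        results.append((playbook, rc))
        if rc != 0:
            # Do not start later playbooks after a failure.
            break
    return results
```

Passing the runner in as a parameter keeps the ordering logic separate from how a playbook is actually launched, which also makes the behavior easy to check with a stub.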

Comment 15 leseb 2018-03-15 14:13:10 UTC
Pretty sure this is a firewalling issue. You need to open the port that lets ceph-mgr talk to the OSDs. IIRC the port is 6800.

Comment 18 Yogev Rabl 2018-05-01 00:42:20 UTC
Verified on openstack-tripleo-common-8.6.1-4.el7ost.noarch

Comment 20 errata-xmlrpc 2018-06-27 13:36:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

