Created attachment 1407700 [details]
ceph-install-workflow.log

Description of problem:
infrastructure-playbooks/rolling_update.yml fails while running the "container | waiting for the containerized monitor to join the quorum..." task during an OSP10 -> OSP13 Fast Forward Upgrade.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.27-1.el7cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph osd nodes
2. Run through the FFU procedure to upgrade to OSP13
3. Run the step that upgrades the ceph services by migrating them to containers

Actual results:
rolling_update.yml fails.

Expected results:
Upgrade succeeds without issues.

Additional info:
Attaching /var/log/mistral/ceph-install-workflow.log

Running the command manually, it seems to return the correct output:

[root@controller-1 ~]# docker exec ceph-mon-controller-1 ceph --cluster "ceph" -s --format json | jq .quorum_names
[
  "controller-2",
  "controller-1",
  "controller-0"
]
This is the error:

2018-03-13 13:52:43,220 p=25772 u=mistral | TASK [container | waiting for the containerized monitor to join the quorum...] ***
2018-03-13 13:52:43,339 p=25771 u=mistral | FAILED - RETRYING: wait for monitor socket to exist (4 retries left).
2018-03-13 13:52:44,308 p=25772 u=mistral | fatal: [192.168.24.11]: FAILED! => {"msg": "The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)[\"quorum_names\"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)[\"quorum_names\"]\n' failed. The error was: No JSON object could be decoded"}
2018-03-13 13:52:44,309 p=25772 u=mistral | PLAY RECAP *********************************************************************
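For reference, the failing quorum-wait task in rolling_update.yml has roughly the following shape. This is a reconstruction from the error output above and the ceph-ansible 3.0 sources, not a verbatim quote; in particular the retries/delay variable names are assumptions:

# hedged reconstruction of the quorum-wait task in
# infrastructure-playbooks/rolling_update.yml (ceph-ansible 3.0.x)
- name: container | waiting for the containerized monitor to join the quorum...
  command: >
    docker exec ceph-mon-{{ hostvars[mon_host]['ansible_hostname'] }}
    ceph --cluster {{ cluster }} -s --format json
  register: ceph_health_raw
  until: >
    hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)["quorum_names"] or
    hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)["quorum_names"]
  retries: "{{ health_mon_check_retries }}"
  delay: "{{ health_mon_check_delay }}"
  delegate_to: "{{ mon_host }}"

Note that the until condition runs from_json on whatever stdout contains, so if the docker exec fails and stdout is empty, the conditional itself blows up with "No JSON object could be decoded" instead of evaluating to false — which is why the task aborts immediately rather than retrying.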
I tried adding a debug task to see what ceph_health_raw gets registered to, and I noticed that the docker exec command fails because of a missing container:
https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L171

2018-03-13 17:13:39,040 p=30957 u=mistral | fatal: [192.168.24.11]: FAILED! => {"changed": true, "cmd": ["docker", "exec", "ceph-mon-controller-0", "ceph", "--cluster", "ceph", "-s", "--format", "json"], "delta": "0:00:00.034273", "end": "2018-03-13 21:13:37.636285", "msg": "non-zero return code", "rc": 1, "start": "2018-03-13 21:13:37.602012", "stderr": "Error response from daemon: No such container: ceph-mon-controller-0", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller-0"], "stdout": "", "stdout_lines": []}

Note that 192.168.24.11 is controller-1, not controller-0, on my environment:

[stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.11 'cat /etc/hostname'
Warning: Permanently added '192.168.24.11' (ECDSA) to the list of known hosts.
controller-1
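For anyone reproducing this, the debug task mentioned above can be as small as the following sketch, placed right after the quorum check (the task name is mine; for the debug to actually run, the quorum-check task also needs something like ignore_errors: true, otherwise the play aborts before reaching it):

# hedged sketch: dump whatever the quorum check registered
- name: debug | show what ceph_health_raw was registered to
  debug:
    var: ceph_health_raw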
Marius, the task is delegated to one node only (mon_host), set to the first member of the mons group. Can you try running the command on controller 0 and see what happens?
(In reply to Giulio Fidente from comment #5)
> Marius, the task is delegated to one node only (mon_host), set to the first
> member of the mons group. Can you try running the command on controller 0
> and see what happens?

OK, so I delegated the task to mon_host and it looks like the ceph-mon-controller-0 container is not running at the time the task fails. I saved the output of docker ps to /tmp/docker_ps.log right before the failing task:

[root@controller-0 ~]# grep ceph /tmp/docker_ps.log
225923accfd8 registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest "/entrypoint.sh" 24 minutes ago Up 24 minutes ceph-mgr-controller-0
[root@controller-0 ~]#

Nevertheless, after the failure I can see the container is started:

[root@controller-0 ~]# docker ps | grep ceph
e5e652635578 registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest "/entrypoint.sh" 5 minutes ago Up 5 minutes ceph-mgr-controller-0
4b2577a5e9d5 registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest "/entrypoint.sh" 8 minutes ago Up 8 minutes ceph-mon-controller-0
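The snapshot was taken with a throwaway task inserted just before the quorum check — roughly the following sketch (the task name and log path are mine):

# hedged sketch: snapshot running containers right before the quorum check
- name: debug | save docker ps output for later inspection
  shell: docker ps > /tmp/docker_ps.log
  delegate_to: "{{ mon_host }}"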
Update: I commented out https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml#L86-L90 and this allowed the openstack overcloud deploy command to complete successfully.

I noticed that the ceph-mon container is managed via systemd, so I guess this task was stopping it right as the docker exec ceph-mon-controller-0 ... command was executing.
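One way to confirm the systemd ownership of the container is a check like the sketch below. I'm assuming here that ceph-ansible's containerized mon uses the ceph-mon@<hostname> unit-name convention; verify the unit name on the node first:

# hedged sketch: confirm the mon container is driven by a systemd unit
- name: check the ceph-mon systemd unit status
  command: systemctl status ceph-mon@{{ hostvars[mon_host]['ansible_hostname'] }}
  register: mon_unit_status
  changed_when: false
  failed_when: false
  delegate_to: "{{ mon_host }}"

If the unit is active while docker ps shows the container disappearing and reappearing, a restart issued through systemd by the other playbook would explain the window in which docker exec finds no such container.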
After some debugging we found that the issue was caused by mistral running the playbooks in parallel and not serially.
Pretty sure this is a firewalling issue. You need to open the port for the ceph-mgr to talk to the OSDs. IIRC the port is 6800.
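If that's the case, something along these lines would open it (a sketch, assuming firewalld is in use; note that Ceph daemons in general bind in the 6800-7300 range, so 6800 alone may not be enough):

# hedged sketch: open the port range Ceph daemons (mgr/OSD) bind to
- name: open the ceph daemon port range
  firewalld:
    port: 6800-7300/tcp
    permanent: true
    immediate: true
    state: enabled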
Verified on openstack-tripleo-common-8.6.1-4.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086