Description of problem:
========================
It's failing in the MON upgrade, hence failed_QA.

TASK [waiting for the containerized monitor to join the quorum...] *************
task path: /root/slavv_new/rolling_update.yml:153
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (5 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (4 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (3 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (2 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (1 retries left).
fatal: [magna029 -> magna027]: FAILED! => {"attempts": 5, "changed": true, "cmd": "docker exec ceph-mon-magna027 ceph -s --cluster slave | grep quorum | sed 's/.*quorum//' | egrep -sq magna029", "delta": "0:00:00.019534", "end": "2017-05-31 22:09:58.148914", "failed": true, "rc": 1, "start": "2017-05-31 22:09:58.129380", "stderr": "Error response from daemon: Container c9d3588bd0766f7f43d3680cc42bfe7eb7bcb9556d480b0cffe18be806e3f82e is not running", "stdout": "", "stdout_lines": [], "warnings": []}
	to retry, use: --limit @/root/slavv_new/rolling_update.retry

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0
magna025                   : ok=46   changed=5    unreachable=0    failed=0
magna027                   : ok=45   changed=5    unreachable=0    failed=0
magna029                   : ok=44   changed=4    unreachable=0    failed=1
magna036                   : ok=3    changed=0    unreachable=0    failed=0
magna043                   : ok=3    changed=0    unreachable=0    failed=0
magna048                   : ok=3    changed=0    unreachable=0    failed=0
magna067                   : ok=3    changed=0    unreachable=0    failed=0

Had the following versions:
======================
ceph-ansible-2.2.7-1.el7scon.noarch
ansible-2.2.3.0-1.el7.noarch
ceph-2-rhel-7-docker-candidate-20170526111545

How reproducible:
================
Always

Steps to Reproduce:
====================
Upgrade to version - ceph-2-rhel-7-docker-candidate-20170530155520

Steps:
======
1. The cluster was installed using ceph-ansible. Use the same setup/variables/group_vars files; cd to that directory.
2. cp infrastructure-playbooks/rolling_update.yml .
3. Update group_vars/all.yml:
   ceph_docker_image_tag: ceph-2-rhel-7-docker-candidate-20170530155520
   mon_containerized_deployment: true
   (The rest of the variables/files are unchanged.)
4. ansible-playbook rolling_update.yml -i /etc/ansible/slave -vv
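For anyone triaging this by hand: the failing task is just the quorum check run through docker exec, and the "Container ... is not running" stderr means the container being exec'd into is down. A few commands to inspect it directly (a sketch; the container and cluster names are taken from the log above, adjust for your setup):

  # is the mon container running, and why did it stop?
  docker ps -a --filter name=ceph-mon-magna027
  docker logs --tail 50 ceph-mon-magna027

  # the status call the playbook greps, without the pipeline:
  docker exec ceph-mon-magna027 ceph -s --cluster slave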
This bug is created to track issues mentioned in Bug 1449159 - [ceph container]:- docker instances of osd and mons will not be spinned with new image after a docker pull update (comment #44)
Did you just copy/paste the content of Bug 1449159, or did you try again with the new image and end up with the same issue?
please provide info requested in c4
(In reply to seb from comment #4)
> Did you just copy/paste the content of Bug 1449159, or did you try again
> with the new image and end up with the same issue?

Copy/pasted from Bug 1449159, but as you can see in the description, we were trying to upgrade to the latest version available at that time ("upgrade to version - ceph-2-rhel-7-docker-candidate-20170530155520").
Andrew, please triage this.
I have seen this behavior when a field in the group_vars/all.yml file is incorrect. I once made a typo in the public_network field, and no errors were noticed until this point. I will try to reproduce this today.
So based on the errors that I have seen in the past, it would not surprise me if the 'ceph_docker_image_tag: ceph-2-rhel-7-docker-candidate-20170530155520' line is incorrect.
Reproduced the 'FAILED - RETRYING: waiting for the containerized monitor to join the quorum...' error. I needed to copy the [mons] and [osds] portions of the /etc/ansible/hosts file to /etc/ansible/slave, and make sure that root could ssh between my nodes, before I got the following:

TASK [waiting for the containerized monitor to join the quorum...] *************
task path: /usr/share/ceph-ansible/rolling_update.yml:153
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (5 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (4 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (3 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (2 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (1 retries left).
fatal: [magna045 -> magna060]: FAILED! => {"attempts": 5, "changed": true, "cmd": "docker exec ceph-mon-magna060 ceph -s --cluster ceph | grep quorum | sed 's/.*quorum//' | egrep -sq magna045", "delta": "0:00:00.019702", "end": "2017-06-06 17:33:35.050290", "failed": true, "rc": 1, "start": "2017-06-06 17:33:35.030588", "stderr": "Error response from daemon: No such container: ceph-mon-magna060", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-magna060"], "stdout": "", "stdout_lines": []}
	to retry, use: --limit @/usr/share/ceph-ansible/rolling_update.retry

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0
magna045                   : ok=45   changed=5    unreachable=0    failed=1
magna055                   : ok=3    changed=0    unreachable=0    failed=0
magna060                   : ok=3    changed=0    unreachable=0    failed=0
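For anyone recreating this setup: the inventory copied into /etc/ansible/slave only needs the usual ceph-ansible groups plus passwordless root ssh between the nodes. A hypothetical sketch (the hostnames are placeholders, not the actual file):

  [mons]
  mon1.example.com
  mon2.example.com
  mon3.example.com

  [osds]
  osd1.example.com
  osd2.example.com

  # and from the admin node, for each host:
  #   ssh-copy-id root@mon1.example.com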
I'm missing something here: if ceph_docker_image_tag is properly set, then ceph-ansible will change the mon systemd unit file and /usr/share/ceph-osd-run.sh for the OSDs to point to the new image.

If you test with the latest container image, the monitor should not fail.
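If it helps to verify that step by hand: once ceph_docker_image_tag is updated, the new tag should be visible in the rendered files Sebastien mentions. A quick check, where the systemd unit path is an assumption about where ceph-ansible installs it:

  # on a mon node:
  egrep 'rhceph|docker-candidate' /etc/systemd/system/ceph-mon@.service

  # on an osd node:
  egrep 'rhceph|docker-candidate' /usr/share/ceph-osd-run.sh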
(In reply to seb from comment #19)
> I'm missing something here: if ceph_docker_image_tag is properly set, then
> ceph-ansible will change the mon systemd unit file and
> /usr/share/ceph-osd-run.sh for the OSDs to point to the new image.
>
> If you test with the latest container image, the monitor should not fail.

The failure I see is that, after being updated, a MON does not rejoin the quorum. This is the same issue as mentioned above. I've created an upstream testing scenario that should expose the issue: https://github.com/ceph/ceph-ansible/pull/1599
I would like to see logs from journalctl to properly understand what's going on.
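For whoever is collecting those: on a containerized node, the mon output should end up in the journal through the mon's systemd unit. A sketch, assuming the ceph-mon@<hostname> unit name that containerized ceph-ansible deployments typically use:

  journalctl -u ceph-mon@magna029 --no-pager -n 500
  # or follow it live while the playbook runs:
  journalctl -u ceph-mon@magna029 -f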
(In reply to seb from comment #21)
> I would like to see logs from journalctl to properly understand what's
> going on.

It seems as though the rolling_update.yml playbook just needed a higher value for 'health_mon_check_delay' for the playbook to continue. There were also some bugs in the task that performs the PG check for updated OSDs. I've created an upstream PR with those fixes and a testing scenario for containerized rolling updates: https://github.com/ceph/ceph-ansible/pull/1599

With those patches I am able to use rolling_update.yml to update a containerized cluster.
This won't be in 2.3. I still think we can use rolling_update and simply increase "health_mon_check_delay". Andrew, what do you think?
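For reference, health_mon_check_delay can be raised without editing the playbook by passing an extra var on the command line; the value below is only an example:

  ansible-playbook rolling_update.yml -i /etc/ansible/slave -e health_mon_check_delay=30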
Created attachment 1336379 [details]
File contains ansible-playbook log

Hi,

Rolling update is failing in the task "waiting for the containerized monitor to join the quorum..." saying:

"The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)[\"quorum_names\"]\n' failed. The error was: No JSON object could be decoded"

I tried on a single-mon cluster. I observed that the mon was successfully updated, the new container was up and running, and quorum was there.

$ sudo docker ps
CONTAINER ID        IMAGE                                                                                                                COMMAND             CREATED             STATUS              PORTS               NAMES
f26f118b3eb7        brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-91008-20171006180220   "/entrypoint.sh"    6 minutes ago       Up 5 minutes                            ceph-mon-magna012
56a9fe73ac55        brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-31370-20171003232256   "/entrypoint.sh"    4 days ago          Up 4 days                               ceph-mgr-magna012

[ubuntu@magna012 ~]$ sudo docker exec ceph-mon-magna012 ceph -v
ceph version 12.2.1-10.el7cp (5ba1c3fa606d7bf16f72756b0026f04a40297673) luminous (stable)
[ubuntu@magna012 ~]$ sudo docker exec ceph-mgr-magna012 ceph -v
ceph version 12.2.1-9.el7cp (3972a2f60763dcf1be2e26457eee677515a2705d) luminous (stable)

Moving back to ASSIGNED state; please let me know if there are any concerns.

Regards,
Vasishta
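A note on that conditional: it parses the JSON form of the cluster status, so the same check can be reproduced by hand (container and cluster names below are the ones from this setup):

  sudo docker exec ceph-mon-magna012 ceph -s --cluster 2017_sub --format json

The "No JSON object could be decoded" error means that command printed nothing parseable at the moment the task ran, e.g. while the mon container was still restarting.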
Vasishta,

Your log shows:

TASK [waiting for the monitor to join the quorum...] ***************************
changed: [magna068 -> magna068]

TASK [waiting for the containerized monitor to join the quorum...] *************
skipping: [magna068]

Which means containerized_deployment is set to False, correct?
Can I see your group_vars/all.yml?

Thanks.
Hi Sebastien,

No, I had set containerized_deployment to true. My all.yml looks like this:

<magna012> $ cat group_vars/all.yml | egrep -v ^# | grep -v ^$
---
dummy:
fetch_directory: ~/ceph-ansible-keys
cluster: "2017_sub"
ceph_origin: repository
ceph_repository: rhcs
monitor_interface: eno1
public_network: 10.8.128.0/21
radosgw_interface: eno1
ceph_docker_image: "rhceph"
ceph_docker_image_tag: "ceph-3.0-rhel-7-docker-candidate-91008-20171006180220"
containerized_deployment: true
ceph_docker_registry: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888

Regards,
Vasishta
Well, that's not what the play says... I can't reproduce your issue. Can you get me an env where you can reproduce this every time? Thanks.
Created attachment 1338072 [details]
File contains contents of ansible-playbook log

Hi Sebastien,

I'm really sorry, I noticed that the attachment was the wrong one; my bad. I'm facing this issue again (as mentioned in Comment 29). This time I'm attaching the proper log, please take a look.

Regards,
Vasishta
Ok, I understand the issue: you are upgrading a single monitor.

When running this task: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L161, the first command returns 1, so the task exits immediately.

This is not a supported configuration; you **must** test with a minimum of 3 monitors. I'm closing this as invalid.
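To expand on why a single mon cannot pass this check: the task delegates the quorum query to a surviving mon while the mon being upgraded restarts (that is the "[magna029 -> magna027]" pattern in the logs above). A sketch of the three-mon case, with hypothetical host names:

  # mon1 is restarting; ask mon2 whether mon1 has rejoined the quorum:
  docker exec ceph-mon-mon2 ceph -s --cluster ceph | grep quorum

With only one mon there is no surviving container to exec into, so the command fails immediately instead of retrying until the mon rejoins.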
Hi Sebastien,

Sorry, my bad. I tried again with a 3-monitor setup and it worked fine in the task "waiting for the containerized monitor to join the quorum...".

I observed that rolling_update doesn't support upgrading rbd-mirroring and nfs-ganesha. Moving back to ASSIGNED state; please let me know if there are any concerns.

Regards,
Vasishta
fixed, will be in 3.0.3
Will be in 3.0.3; the upstream release is here: https://github.com/ceph/ceph-ansible/releases/tag/v3.0.3

Ken, can you build a package? Thanks.
Tried using ceph-ansible-3.0.3-1.el7cp; it's working fine. Moving to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3388