Bug 1458024 - [ceph-ansible] [ceph-container] : upgrade of containerized cluster fails
[ceph-ansible] [ceph-container] : upgrade of containerized cluster fails
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Container
Version: 2.3
Hardware: x86_64 Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: 3.0
Assigned To: leseb
QA Contact: Vasishta
Docs Contact: Bara Ancincova
: Reopened
Depends On:
Blocks: 1437916 1494421
 
Reported: 2017-06-01 15:26 EDT by Rachana Patel
Modified: 2017-12-05 18:18 EST
14 users

See Also:
Fixed In Version: RHEL: ceph-ansible-3.0.3-1.el7cp Ubuntu: ceph-ansible_3.0.3-2redhat1
Doc Type: Bug Fix
Doc Text:
.Upgrading a containerized Ceph cluster by using `rolling_update.yml` is supported
Previously, after upgrading a containerized Ceph cluster by using the `rolling_update.yml` playbook, the `ceph-mon` daemons were not restarted. As a consequence, they were unable to join the quorum after the upgrade. With this update, upgrading containerized Ceph clusters with `rolling_update.yml` works as expected. For details, see the link:https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/container_guide/#upgrading-a-red-hat-ceph-storage-cluster-that-runs-in-containers[Upgrading a Red Hat Ceph Storage Cluster That Runs in Containers] section in the Container Guide for Red Hat Ceph Storage 3.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-05 18:18:20 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
File contains ansible-playbook log (171.41 KB, text/plain)
2017-10-09 08:15 EDT, Vasishta
no flags Details
File contains contents ansible-playbook log (369.70 KB, text/plain)
2017-10-13 01:35 EDT, Vasishta
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Github ceph/ceph-ansible/pull/2055 None None None 2017-10-16 08:17 EDT
Red Hat Product Errata RHEA-2017:3388 normal SHIPPED_LIVE new container image: rhceph-3-rhel7 2017-12-05 21:43:29 EST

Description Rachana Patel 2017-06-01 15:26:23 EDT
Description of problem:
========================
The upgrade is failing during the MON upgrade, hence FailedQA.


TASK [waiting for the containerized monitor to join the quorum...] *************
task path: /root/slavv_new/rolling_update.yml:153
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (5 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (4 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (3 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (2 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (1 retries left).
fatal: [magna029 -> magna027]: FAILED! => {"attempts": 5, "changed": true, "cmd": "docker exec ceph-mon-magna027 ceph -s --cluster slave | grep quorum | sed 's/.*quorum//' | egrep -sq magna029", "delta": "0:00:00.019534", "end": "2017-05-31 22:09:58.148914", "failed": true, "rc": 1, "start": "2017-05-31 22:09:58.129380", "stderr": "Error response from daemon: Container c9d3588bd0766f7f43d3680cc42bfe7eb7bcb9556d480b0cffe18be806e3f82e is not running", "stdout": "", "stdout_lines": [], "warnings": []}
	to retry, use: --limit @/root/slavv_new/rolling_update.retry

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0   
magna025                   : ok=46   changed=5    unreachable=0    failed=0   
magna027                   : ok=45   changed=5    unreachable=0    failed=0   
magna029                   : ok=44   changed=4    unreachable=0    failed=1   
magna036                   : ok=3    changed=0    unreachable=0    failed=0   
magna043                   : ok=3    changed=0    unreachable=0    failed=0   
magna048                   : ok=3    changed=0    unreachable=0    failed=0   
magna067                   : ok=3    changed=0    unreachable=0    failed=0  
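The failing task boils down to one question: does the upgraded monitor's hostname appear in the quorum reported by `ceph -s`? A minimal Python sketch of the same check (hypothetical helper; the playbook itself shells out to the grep/sed/egrep pipeline shown above, and `quorum_names` is the field the playbook's own conditional reads from the JSON status):

```python
import json

def mon_in_quorum(ceph_status_json, hostname):
    """Return True if `hostname` is listed in the monitor quorum.

    `ceph_status_json` is assumed to be the output of
    `ceph -s --format json`, whose top level carries a
    `quorum_names` list of monitor hostnames.
    """
    status = json.loads(ceph_status_json)
    return hostname in status.get("quorum_names", [])

# magna029 was being upgraded and never rejoined, so this check
# keeps failing until the task's retries are exhausted.
sample = json.dumps({"quorum_names": ["magna025", "magna027"]})
print(mon_in_quorum(sample, "magna029"))  # False
print(mon_in_quorum(sample, "magna027"))  # True
```

Parsing the JSON form avoids the text-scraping fragility of the grep/sed pipeline, which silently fails when the container is not running at all.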


Had the following versions:
======================
ceph-ansible-2.2.7-1.el7scon.noarch
ansible-2.2.3.0-1.el7.noarch
ceph-2-rhel-7-docker-candidate-20170526111545




How reproducible:
================
always

Steps to Reproduce:
====================
upgrade to version - ceph-2-rhel-7-docker-candidate-20170530155520 

Steps:
======
1. The cluster was installed using ceph-ansible. Use the same setup/variables/group_vars files, and cd to that directory.
2. cp infrastructure-playbooks/rolling_update.yml .
3. Update group_vars/all.yml:
ceph_docker_image_tag: ceph-2-rhel-7-docker-candidate-20170530155520
mon_containerized_deployment: true
(The rest of the variables/files are unchanged.)
4. ansible-playbook rolling_update.yml -i /etc/ansible/slave -vv
Comment 3 Rachana Patel 2017-06-01 15:30:22 EDT
This bug was created to track the issue mentioned in Bug 1449159 - [ceph container]:- docker instances of osd and mons will not be spinned with new image after a docker pull update (comment #44).
Comment 4 seb 2017-06-02 05:23:52 EDT
Did you just copy/paste the content of Bug 1449159, or did you try again with the new image and end up with the same issue?
Comment 5 Gregory Meno 2017-06-05 11:11:19 EDT
Please provide the info requested in comment 4.
Comment 6 Rachana Patel 2017-06-05 11:16:32 EDT
(In reply to seb from comment #4)
> Did you just copy/paste the content of Bug 1449159 or did try again with the
> new image and ended up with the same issue?

Copy/pasted from Bug 1449159, but as you can see in the description, we were trying to upgrade to the latest version available at that time ("upgrade to version - ceph-2-rhel-7-docker-candidate-20170530155520").
Comment 7 Gregory Meno 2017-06-05 14:24:10 EDT
Andrew please triage this.
Comment 9 Warren 2017-06-06 11:42:02 EDT
I have seen this behavior when a field in the group_vars/all.yml file is incorrect. I once made a typo in the public_network field, and no errors were noticed until this point. I will try to reproduce this today.
Comment 10 Warren 2017-06-06 11:57:15 EDT
So based on the errors that I have seen in the past, it would not surprise me if the 'ceph_docker_image_tag: ceph-2-rhel-7-docker-candidate-20170530155520' line is incorrect.
Comment 12 Warren 2017-06-06 13:38:07 EDT
Reproduced the 'FAILED - RETRYING: waiting for the containerized monitor to join the quorum...' error. I needed to copy the [mons] and [osds] portions of the /etc/ansible/hosts file to /etc/ansible/slave, and make sure that root could ssh between my nodes, before I got the following:

TASK [waiting for the containerized monitor to join the quorum...] *************
task path: /usr/share/ceph-ansible/rolling_update.yml:153
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (5 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (4 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (3 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (2 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (1 retries left).
fatal: [magna045 -> magna060]: FAILED! => {"attempts": 5, "changed": true, "cmd": "docker exec ceph-mon-magna060 ceph -s --cluster ceph | grep quorum | sed 's/.*quorum//' | egrep -sq magna045", "delta": "0:00:00.019702", "end": "2017-06-06 17:33:35.050290", "failed": true, "rc": 1, "start": "2017-06-06 17:33:35.030588", "stderr": "Error response from daemon: No such container: ceph-mon-magna060", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-magna060"], "stdout": "", "stdout_lines": []}
	to retry, use: --limit @/usr/share/ceph-ansible/rolling_update.retry

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0   
magna045                   : ok=45   changed=5    unreachable=0    failed=1   
magna055                   : ok=3    changed=0    unreachable=0    failed=0   
magna060                   : ok=3    changed=0    unreachable=0    failed=0
Comment 19 seb 2017-06-12 09:28:23 EDT
I'm missing something here. If ceph_docker_image_tag is properly set, then ceph-ansible will change the mon systemd unit file and /usr/share/ceph-osd-run.sh (for the OSDs) to point to the new image.

If you test with the latest container image, the monitor should not fail.
Comment 20 Andrew Schoen 2017-06-12 09:41:59 EDT
(In reply to seb from comment #19)
> I'm missing something here, if ceph_docker_image_tag is properly set then
> ceph-ansible will change the mon systemd unit file and the
> /usr/share/ceph-osd-run.sh for osd to point to the new image.
> 
> If you test with the last container image the monitor should not fail.

The failure I see is that after being updated a MON does not rejoin the quorum. This is the same issue as mentioned above. I've created an upstream testing scenario that should expose the issue.

https://github.com/ceph/ceph-ansible/pull/1599
Comment 21 seb 2017-06-12 11:21:23 EDT
I would like to see logs from journalctl to properly understand what's going on.
Comment 22 Andrew Schoen 2017-06-12 17:40:42 EDT
(In reply to seb from comment #21)
> I would like to see logs from journalctl to properly understand what's going
> on.

It seems as though the rolling_update.yml playbook just needed a higher value for 'health_mon_check_delay' for the playbook to continue. There were also some bugs in the task that performs the PG check for updated OSDs.

I've created an upstream PR with those fixes and a testing scenario for containerized rolling updates: https://github.com/ceph/ceph-ansible/pull/1599

With those patches I am able to use rolling_update.yml to update a containerized cluster.
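The fix amounts to giving the quorum check more headroom before it gives up. A sketch of overriding the relevant knobs in group_vars ('health_mon_check_delay' is named above; 'health_mon_check_retries' and the values shown are assumptions for illustration, not recommendations):

```yaml
# group_vars/all.yml -- illustrative overrides only
health_mon_check_retries: 10   # polls before giving up on quorum
health_mon_check_delay: 20     # seconds to wait between polls
```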
Comment 24 seb 2017-06-14 05:06:35 EDT
This won't be in 2.3.
I still think we can use rolling_update and simply increase "health_mon_check_delay".

Andrew what do you think?
Comment 29 Vasishta 2017-10-09 08:15 EDT
Created attachment 1336379 [details]
File contains ansible-playbook log

Hi, 

rolling update is failing in the task "waiting for the containerized monitor to join the quorum..." saying  "The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)[\"quorum_names\"]\n' failed. The error was: No JSON object could be decoded"
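That error is what `from_json` produces when it is handed an empty string: if the `docker exec` emits nothing on stdout (for example because the targeted container is not running), there is nothing to decode. A minimal sketch of the failure mode (hypothetical helper, not ceph-ansible code):

```python
import json

def quorum_names(stdout):
    """Mimic the playbook's (ceph_health_raw.stdout | from_json)["quorum_names"].

    Raises ValueError on empty or non-JSON stdout; on Python 2 the
    message is the familiar "No JSON object could be decoded".
    """
    return json.loads(stdout)["quorum_names"]

try:
    quorum_names("")  # empty stdout from a failed docker exec
except ValueError as exc:
    print("decode failed:", exc)

print(quorum_names('{"quorum_names": ["magna012"]}'))  # ['magna012']
```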

I tried on a single-mon cluster. I observed that the mon was successfully updated, the new container was up and running, and quorum was established.


$ sudo docker ps
CONTAINER ID        IMAGE                                                                                                               COMMAND             CREATED             STATUS              PORTS               NAMES
f26f118b3eb7        brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-91008-20171006180220   "/entrypoint.sh"    6 minutes ago       Up 5 minutes                            ceph-mon-magna012
56a9fe73ac55        brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-31370-20171003232256   "/entrypoint.sh"    4 days ago          Up 4 days                               ceph-mgr-magna012
[ubuntu@magna012 ~]$ sudo docker exec ceph-mon-magna012 ceph -v 
ceph version 12.2.1-10.el7cp (5ba1c3fa606d7bf16f72756b0026f04a40297673) luminous (stable)
[ubuntu@magna012 ~]$ sudo docker exec ceph-mgr-magna012 ceph -v 
ceph version 12.2.1-9.el7cp (3972a2f60763dcf1be2e26457eee677515a2705d) luminous (stable)


Moving back to ASSIGNED state, please let me know if there are any concerns.

Regards,
Vasishta
Comment 31 leseb 2017-10-09 11:10:50 EDT
Vasishta,

Your log shows:


TASK [waiting for the monitor to join the quorum...] ***************************
changed: [magna068 -> magna068]

TASK [waiting for the containerized monitor to join the quorum...] *************
skipping: [magna068]

Which means: containerized_deployment is set to False, correct?
Can I see your group_vars/all.yml?

Thanks.
Comment 32 Vasishta 2017-10-09 11:18:50 EDT
Hi Sebastien,

No, I had set containerized_deployment to true, my all.yml looks like this -

<magna012> $ cat group_vars/all.yml | egrep -v ^# | grep -v ^$
---
dummy:
fetch_directory: ~/ceph-ansible-keys
cluster: "2017_sub" 
ceph_origin: repository
ceph_repository: rhcs
monitor_interface: eno1
public_network: 10.8.128.0/21
radosgw_interface: eno1
ceph_docker_image: "rhceph"
ceph_docker_image_tag: "ceph-3.0-rhel-7-docker-candidate-91008-20171006180220"
containerized_deployment: true
ceph_docker_registry: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888


Regards,
Vasishta
Comment 33 leseb 2017-10-09 11:45:30 EDT
Well, it's not what the play says... 
I can't reproduce your issue, can you get me an env where you can reproduce this every time?

Thanks.
Comment 34 Vasishta 2017-10-13 01:35 EDT
Created attachment 1338072 [details]
File contains contents ansible-playbook log

Hi Sebastien,

I'm really sorry, I noticed that the attachment was the wrong file. My bad.

I'm facing this issue again (as mentioned in Comment 29 ), this time I'm attaching the proper one, please take a look.

Regards,
Vasishta
Comment 35 leseb 2017-10-13 04:22:42 EDT
Ok, I understand the issue, you are upgrading a single monitor.
When running this task: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L161, the first command returns 1, so the task exits immediately.

This is not a supported configuration.

You **must** test with a minimum of 3 monitors.
I'm closing this as invalid.
Comment 36 Vasishta 2017-10-16 05:24:59 EDT
Hi Sebastien, 

Sorry, my bad.

I tried again with a 3-monitor setup, and the task "waiting for the containerized monitor to join the quorum..." worked fine.

I observed that rolling_update doesn't support upgrading rbd-mirroring and nfs-ganesha.

Moving back to ASSIGNED state, please let me know if there are any concerns.

Regards,
Vasishta
Comment 37 leseb 2017-10-17 09:21:00 EDT
fixed, will be in 3.0.3
Comment 38 leseb 2017-10-18 03:21:35 EDT
Will be in 3.0.3, release upstream is here: https://github.com/ceph/ceph-ansible/releases/tag/v3.0.3

Ken, can you build a package? Thanks.
Comment 42 Vasishta 2017-10-22 08:45:15 EDT
Tried using ceph-ansible-3.0.3-1.el7cp; working fine.
Moving to VERIFIED state.
Comment 46 errata-xmlrpc 2017-12-05 18:18:20 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3388
