Bug 1458024 - [ceph-ansible] [ceph-container] : upgrade of containerized cluster fails
Summary: [ceph-ansible] [ceph-container] : upgrade of containerized cluster fails
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Container
Version: 2.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.0
Assignee: leseb
QA Contact: Vasishta
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Keywords: Reopened
Depends On:
Blocks: 1437916 1494421
 
Reported: 2017-06-01 19:26 UTC by Rachana Patel
Modified: 2019-01-02 17:59 UTC (History)
15 users

.Upgrading a containerized Ceph cluster by using `rolling_update.yml` is supported

Previously, after upgrading a containerized Ceph cluster by using the `rolling_update.yml` playbook, the `ceph-mon` daemons were not restarted. As a consequence, they were unable to join the quorum after the upgrade. With this update, upgrading containerized Ceph clusters with `rolling_update.yml` works as expected. For details, see the link:https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/container_guide/#upgrading-a-red-hat-ceph-storage-cluster-that-runs-in-containers[Upgrading a Red Hat Ceph Storage Cluster That Runs in Containers] section in the Container Guide for Red Hat Ceph Storage 3.
Clone Of:
Last Closed: 2019-01-02 17:59:18 UTC


Attachments (Terms of Use)
File contains ansible-playbook log (171.41 KB, text/plain)
2017-10-09 12:15 UTC, Vasishta
no flags Details
File contains contents of ansible-playbook log (369.70 KB, text/plain)
2017-10-13 05:35 UTC, Vasishta
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:3388 normal SHIPPED_LIVE new container image: rhceph-3-rhel7 2017-12-06 02:43:29 UTC
Github ceph ceph-ansible pull 2055 None None None 2017-10-16 12:17 UTC

Internal Trackers: 1663026

Description Rachana Patel 2017-06-01 19:26:23 UTC
Description of problem:
========================
The upgrade is failing during the MON upgrade step, hence marking as FailedQA.


TASK [waiting for the containerized monitor to join the quorum...] *************
task path: /root/slavv_new/rolling_update.yml:153
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (5 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (4 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (3 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (2 retries left).
FAILED - RETRYING: TASK: waiting for the containerized monitor to join the quorum... (1 retries left).
fatal: [magna029 -> magna027]: FAILED! => {"attempts": 5, "changed": true, "cmd": "docker exec ceph-mon-magna027 ceph -s --cluster slave | grep quorum | sed 's/.*quorum//' | egrep -sq magna029", "delta": "0:00:00.019534", "end": "2017-05-31 22:09:58.148914", "failed": true, "rc": 1, "start": "2017-05-31 22:09:58.129380", "stderr": "Error response from daemon: Container c9d3588bd0766f7f43d3680cc42bfe7eb7bcb9556d480b0cffe18be806e3f82e is not running", "stdout": "", "stdout_lines": [], "warnings": []}
	to retry, use: --limit @/root/slavv_new/rolling_update.retry

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0   
magna025                   : ok=46   changed=5    unreachable=0    failed=0   
magna027                   : ok=45   changed=5    unreachable=0    failed=0   
magna029                   : ok=44   changed=4    unreachable=0    failed=1   
magna036                   : ok=3    changed=0    unreachable=0    failed=0   
magna043                   : ok=3    changed=0    unreachable=0    failed=0   
magna048                   : ok=3    changed=0    unreachable=0    failed=0   
magna067                   : ok=3    changed=0    unreachable=0    failed=0  
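For reference, the failing quorum check in rolling_update.yml reduces to the shell pipeline visible in the error above. A minimal sketch of that pipeline, using an illustrative `ceph -s` status line rather than a live cluster (the line contents below are made up):

```shell
# Illustrative "monmap" line as printed by `ceph -s`; on a live cluster the
# playbook obtains this via `docker exec ceph-mon-<host> ceph -s --cluster <name>`.
status_line='monmap e1: 3 mons at {...}, quorum 0,1,2 magna025,magna027,magna029'

# Strip everything up to "quorum", then check whether the upgraded mon appears.
if echo "$status_line" | grep quorum | sed 's/.*quorum//' | egrep -sq magna029; then
    echo "magna029 is in the quorum"
fi
```

When the mon container is not running (as in the failure above), the `docker exec` itself exits non-zero before any of the filters run, which is why the task reports "Container ... is not running".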


Had the following versions:
======================
ceph-ansible-2.2.7-1.el7scon.noarch
ansible-2.2.3.0-1.el7.noarch
ceph-2-rhel-7-docker-candidate-20170526111545




How reproducible:
================
always

Steps to Reproduce:
====================
upgrade to version - ceph-2-rhel-7-docker-candidate-20170530155520 

Steps:
======
1. The cluster was installed using ceph-ansible. Use the same setup/variable/group_vars files and cd to that directory.
2. cp infrastructure-playbooks/rolling_update.yml .
3. Update group_vars/all.yml:
ceph_docker_image_tag: ceph-2-rhel-7-docker-candidate-20170530155520
mon_containerized_deployment: true
(The rest of the variables/files are not changed.)
4. ansible-playbook rolling_update.yml -i /etc/ansible/slave -vv

Comment 3 Rachana Patel 2017-06-01 19:30:22 UTC
This bug was created to track issues mentioned in Bug 1449159 - [ceph container]:- docker instances of osd and mons will not be spinned with new image after a docker pull update (comment #44)

Comment 4 seb 2017-06-02 09:23:52 UTC
Did you just copy/paste the content of Bug 1449159, or did you try again with the new image and end up with the same issue?

Comment 5 Christina Meno 2017-06-05 15:11:19 UTC
Please provide the info requested in comment 4.

Comment 6 Rachana Patel 2017-06-05 15:16:32 UTC
(In reply to seb from comment #4)
> Did you just copy/paste the content of Bug 1449159 or did try again with the
> new image and ended up with the same issue?

Copied/pasted from Bug 1449159, but as you can see in the description, we were trying to upgrade to the latest version available at that time ("upgrade to version - ceph-2-rhel-7-docker-candidate-20170530155520").

Comment 7 Christina Meno 2017-06-05 18:24:10 UTC
Andrew, please triage this.

Comment 9 Warren 2017-06-06 15:42:02 UTC
I have seen this behavior when a field in the group_vars/all.yml file is incorrect. I once made a typo in the public_network field, and no errors were noticed until this point. I will try to reproduce this today.

Comment 10 Warren 2017-06-06 15:57:15 UTC
So based on the errors that I have seen in the past, it would not surprise me if the 'ceph_docker_image_tag: ceph-2-rhel-7-docker-candidate-20170530155520' line is incorrect.

Comment 12 Warren 2017-06-06 17:38:07 UTC
Reproduced the 'FAILED - RETRYING: waiting for the containerized monitor to join the quorum...' error. I needed to copy the [mons] and [osds] portions of the /etc/ansible/hosts file to /etc/ansible/slave, and make sure that root could ssh between my nodes, before I got the following:

TASK [waiting for the containerized monitor to join the quorum...] *************
task path: /usr/share/ceph-ansible/rolling_update.yml:153
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (5 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (4 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (3 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (2 retries left).
FAILED - RETRYING: waiting for the containerized monitor to join the quorum... (1 retries left).
fatal: [magna045 -> magna060]: FAILED! => {"attempts": 5, "changed": true, "cmd": "docker exec ceph-mon-magna060 ceph -s --cluster ceph | grep quorum | sed 's/.*quorum//' | egrep -sq magna045", "delta": "0:00:00.019702", "end": "2017-06-06 17:33:35.050290", "failed": true, "rc": 1, "start": "2017-06-06 17:33:35.030588", "stderr": "Error response from daemon: No such container: ceph-mon-magna060", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-magna060"], "stdout": "", "stdout_lines": []}
	to retry, use: --limit @/usr/share/ceph-ansible/rolling_update.retry

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0   
magna045                   : ok=45   changed=5    unreachable=0    failed=1   
magna055                   : ok=3    changed=0    unreachable=0    failed=0   
magna060                   : ok=3    changed=0    unreachable=0    failed=0

Comment 19 seb 2017-06-12 13:28:23 UTC
I'm missing something here. If ceph_docker_image_tag is properly set, then ceph-ansible will change the mon systemd unit file and /usr/share/ceph-osd-run.sh for the OSDs to point to the new image.

If you test with the latest container image, the monitor should not fail.

Comment 20 Andrew Schoen 2017-06-12 13:41:59 UTC
(In reply to seb from comment #19)
> I'm missing something here, if ceph_docker_image_tag is properly set then
> ceph-ansible will change the mon systemd unit file and the
> /usr/share/ceph-osd-run.sh for osd to point to the new image.
> 
> If you test with the last container image the monitor should not fail.

The failure I see is that, after being updated, a MON does not rejoin the quorum. This is the same issue as mentioned above. I've created an upstream testing scenario that should expose the issue.

https://github.com/ceph/ceph-ansible/pull/1599

Comment 21 seb 2017-06-12 15:21:23 UTC
I would like to see logs from journalctl to properly understand what's going on.

Comment 22 Andrew Schoen 2017-06-12 21:40:42 UTC
(In reply to seb from comment #21)
> I would like to see logs from journalctl to properly understand what's going
> on.

It seems as though the rolling_update.yml playbook just needed a higher value for 'health_mon_check_delay' for the playbook to continue. There were also some bugs in the task that performs the PG check for updated OSDs.

I've created an upstream PR with those fixes and a testing scenario for containerized rolling updates: https://github.com/ceph/ceph-ansible/pull/1599

With those patches I am able to use rolling_update.yml to update a containerized cluster.
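For anyone reproducing this before the fix lands, the delay is an ordinary playbook variable; a sketch of raising it in group_vars/all.yml (the value shown is illustrative, not a tested recommendation):

```yaml
# group_vars/all.yml
# Give restarted mon containers more time to rejoin the quorum before
# rolling_update.yml gives up (the value below is illustrative).
health_mon_check_delay: 15
```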

Comment 24 seb 2017-06-14 09:06:35 UTC
This won't be in 2.3.
I still think we can use rolling_update and simply increase "health_mon_check_delay".

Andrew what do you think?

Comment 29 Vasishta 2017-10-09 12:15 UTC
Created attachment 1336379 [details]
File contains ansible-playbook log

Hi, 

rolling update is failing in the task "waiting for the containerized monitor to join the quorum..." saying  "The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)[\"quorum_names\"]\n' failed. The error was: No JSON object could be decoded"

I tried on a single-mon cluster. I observed that the mon was successfully updated, the new container was up and running, and the quorum was there.
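The conditional in that task parses the captured `ceph -s` output as JSON and looks the hostname up in `quorum_names`; the decode error means the captured stdout was not valid JSON. A minimal sketch of the check itself, with a made-up JSON payload and hostname:

```shell
# Illustrative payload shaped like `ceph -s --format json` output; the
# hostname and field values are made up for this sketch.
ceph_status='{"quorum_names": ["magna012"], "election_epoch": 3}'

# The playbook's conditional effectively performs this membership test.
echo "$ceph_status" | python3 -c 'import json, sys; print("magna012" in json.load(sys.stdin)["quorum_names"])'
# prints: True
```

An empty or non-JSON stdout (for example, when the `ceph -s` command itself fails) raises exactly the "No JSON object could be decoded" error quoted above.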


$ sudo docker ps
CONTAINER ID        IMAGE                                                                                                               COMMAND             CREATED             STATUS              PORTS               NAMES
f26f118b3eb7        brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-91008-20171006180220   "/entrypoint.sh"    6 minutes ago       Up 5 minutes                            ceph-mon-magna012
56a9fe73ac55        brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-31370-20171003232256   "/entrypoint.sh"    4 days ago          Up 4 days                               ceph-mgr-magna012
[ubuntu@magna012 ~]$ sudo docker exec ceph-mon-magna012 ceph -v 
ceph version 12.2.1-10.el7cp (5ba1c3fa606d7bf16f72756b0026f04a40297673) luminous (stable)
[ubuntu@magna012 ~]$ sudo docker exec ceph-mgr-magna012 ceph -v 
ceph version 12.2.1-9.el7cp (3972a2f60763dcf1be2e26457eee677515a2705d) luminous (stable)


Moving back to ASSIGNED state, please let me know if there are any concerns.

Regards,
Vasishta

Comment 31 leseb 2017-10-09 15:10:50 UTC
Vasishta,

Your log shows:


TASK [waiting for the monitor to join the quorum...] ***************************
changed: [magna068 -> magna068]

TASK [waiting for the containerized monitor to join the quorum...] *************
skipping: [magna068]

Which means: containerized_deployment is set to False, correct?
Can I see your group_vars/all.yml?

Thanks.

Comment 32 Vasishta 2017-10-09 15:18:50 UTC
Hi Sebastien,

No, I had set containerized_deployment to true, my all.yml looks like this -

<magna012> $ cat group_vars/all.yml | egrep -v ^# | grep -v ^$
---
dummy:
fetch_directory: ~/ceph-ansible-keys
cluster: "2017_sub" 
ceph_origin: repository
ceph_repository: rhcs
monitor_interface: eno1
public_network: 10.8.128.0/21
radosgw_interface: eno1
ceph_docker_image: "rhceph"
ceph_docker_image_tag: "ceph-3.0-rhel-7-docker-candidate-91008-20171006180220"
containerized_deployment: true
ceph_docker_registry: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888


Regards,
Vasishta

Comment 33 leseb 2017-10-09 15:45:30 UTC
Well, it's not what the play says... 
I can't reproduce your issue, can you get me an env where you can reproduce this every time?

Thanks.

Comment 34 Vasishta 2017-10-13 05:35 UTC
Created attachment 1338072 [details]
File contains contents of the ansible-playbook log

Hi Sebastien,

I'm really sorry, I noticed that the attachment was inappropriate, my bad.

I'm facing this issue again (as mentioned in Comment 29 ), this time I'm attaching the proper one, please take a look.

Regards,
Vasishta

Comment 35 leseb 2017-10-13 08:22:42 UTC
OK, I understand the issue: you are upgrading a single monitor.
When running this task: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L161, the first command returns 1, so the task exits immediately.

This is not a supported configuration.

You **must** test with a minimum of 3 monitors.
I'm closing this as invalid.

Comment 36 Vasishta 2017-10-16 09:24:59 UTC
Hi Sebastien, 

Sorry, my bad.

I tried again with a 3-monitor setup, and it worked fine in the task "waiting for the containerized monitor to join the quorum...".

I observed that rolling_update doesn't support upgrading rbd-mirroring and nfs-ganesha.

Moving back to ASSIGNED state, please let me know if there are any concerns.

Regards,
Vasishta

Comment 37 leseb 2017-10-17 13:21:00 UTC
fixed, will be in 3.0.3

Comment 38 leseb 2017-10-18 07:21:35 UTC
Will be in 3.0.3, release upstream is here: https://github.com/ceph/ceph-ansible/releases/tag/v3.0.3

Ken, can you build a package? Thanks.

Comment 42 Vasishta 2017-10-22 12:45:15 UTC
Tried using ceph-ansible-3.0.3-1.el7cp; it is working fine.
Moving to VERIFIED state.

Comment 46 errata-xmlrpc 2017-12-05 23:18:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3388

