Bug 1546834 - [ceph-ansible]: Upgrading container cluster from 2.4 to 2.5 fails for waiting cluster to form quorum even though quorum is there
Summary: [ceph-ansible]: Upgrading container cluster from 2.4 to 2.5 fails for waiting...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: z1
Target Release: 3.0
Assignee: Sébastien Han
QA Contact: Vasishta
Docs Contact: Aron Gunn
URL:
Whiteboard:
Duplicates: 1553818
Depends On:
Blocks: 1546127
Reported: 2018-02-19 17:37 UTC by Ken Dreyer (Red Hat)
Modified: 2021-12-10 15:53 UTC (History)
19 users (show)

Fixed In Version: RHEL: ceph-ansible-3.0.26-1.el7cp Ubuntu: ceph-ansible_3.0.26-2redhat1 rhceph:ceph-3.0-rhel-7-docker-candidate-38019-20180222163657
Doc Type: Bug Fix
Doc Text:
.Upgrading a containerized cluster that uses an FQDN as the hostname no longer fails while searching for the wrong 'asok' file or while waiting for quorum. Previously, containerized Ceph deployments that used a Fully Qualified Domain Name (FQDN) in the "/etc/hostname" file failed when installing or upgrading Ceph with the ceph-ansible playbook. With this release, installing or upgrading Ceph no longer fails when an FQDN is used.
Clone Of: 1546127
Environment:
Last Closed: 2018-03-08 15:54:03 UTC
Embargoed:


Attachments
File contains contents of ansible-playbook log (657.42 KB, text/plain)
2018-02-27 04:45 UTC, Vasishta


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 2398 0 None closed update: look for short and fqdn in ceph_health_raw 2020-12-09 17:35:11 UTC
Github ceph ceph-container pull 907 0 None closed variables: sync with `ansible_hostname` ansible fact 2020-12-09 17:35:10 UTC
Red Hat Issue Tracker RHCEPH-2625 0 None None None 2021-12-10 15:53:57 UTC
Red Hat Product Errata RHBA-2018:0474 0 normal SHIPPED_LIVE Red Hat Ceph Storage 3.0 bug fix update 2018-03-08 20:51:53 UTC

Comment 8 Vasishta 2018-02-26 16:59:13 UTC
A similar error occurred for MGRs:

failed: [magna113 -> magna113] (item=magna113) => {
    "changed": true, 
    "cmd": [
        "/tmp/restart_mgr_daemon.sh"
    ], 
    "delta": "0:01:17.305509", 
    "end": "2018-02-26 16:47:21.102684", 
    "invocation": {
        "module_args": {
            "_raw_params": "/tmp/restart_mgr_daemon.sh", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "item": "magna113", 
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2018-02-26 16:46:03.797175", 
    "stderr": "Error response from daemon: No such container: ceph-mgr-magna113", 
    "stderr_lines": [
        "Error response from daemon: No such container: ceph-mgr-magna113"
    ], 
    "stdout": "Socket file /var/run/ceph/ceph1-mgr.magna113.asok could not be found, which means ceph manager is not running.", 
    "stdout_lines": [
        "Socket file /var/run/ceph/ceph1-mgr.magna113.asok could not be found, which means ceph manager is not running."
    ]
$ sudo docker exec ceph-mgr-magna113 ls /var/run/ceph
ceph1-mgr.magna113.ceph.redhat.com.asok
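The mismatch above (the restart script reports ceph1-mgr.magna113.asok as missing while the container holds ceph1-mgr.magna113.ceph.redhat.com.asok) can be illustrated with a minimal Python sketch that checks both the short name and the FQDN. This is not the ceph-ansible code; the function names `asok_candidates` and `find_asok` are hypothetical, and the `<cluster>-<daemon>.<name>.asok` pattern is assumed only from the two file names shown in this comment.

```python
from pathlib import Path
from typing import List, Optional


def asok_candidates(cluster: str, daemon: str, hostname: str,
                    run_dir: str = "/var/run/ceph") -> List[Path]:
    """Return possible admin socket paths for the short name and the FQDN.

    The naming pattern is an assumption taken from the file names in the
    log above, for example ceph1-mgr.magna113.asok.
    """
    short = hostname.split(".", 1)[0]
    # Avoid listing the same path twice when the hostname is already short.
    names = [short] if short == hostname else [short, hostname]
    return [Path(run_dir) / f"{cluster}-{daemon}.{name}.asok" for name in names]


def find_asok(cluster: str, daemon: str, hostname: str,
              run_dir: str = "/var/run/ceph") -> Optional[Path]:
    """Return the first candidate path that exists, or None if none does."""
    for candidate in asok_candidates(cluster, daemon, hostname, run_dir):
        if candidate.exists():
            return candidate
    return None
```

With the values from this comment, `asok_candidates("ceph1", "mgr", "magna113.ceph.redhat.com")` lists both the short-name and the FQDN socket path, so a check built this way would not report the manager as stopped only because of the name form.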

-----------------------------------------
Execution details -

A 3.0 cluster was configured while the hosts used short hostnames. The hostnames were then changed to FQDNs, ceph-ansible was updated to 3.0.26-1.el7cp.noarch, and a rolling update was run.

Moving back to the ASSIGNED state. Please let me know if there are any concerns.

Regards,
Vasishta 
AQE, Ceph

Comment 9 Sébastien Han 2018-02-26 18:58:50 UTC
Vasishta, the issue is fixed with the latest 3.0 container image. Ken, do we have it?
Thanks.

I don't see it in Brew, which sounds weird to me. I commented here: https://bugzilla.redhat.com/show_bug.cgi?id=1546127#c62

Comment 11 Sébastien Han 2018-02-26 20:55:15 UTC
Right, the only issue is that registry.access.redhat.com/rhceph/rhceph-3-rhel7 doesn't have the fix, which is present in rhceph:ceph-3.0-rhel-7-docker-candidate-38019-20180222163657

How can we proceed?

Comment 12 Ken Dreyer (Red Hat) 2018-02-26 22:37:02 UTC
Sébastien, QE is testing upgrading from the latest released container to the latest unreleased.

  From: registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest

  To: rhceph:ceph-3.0-rhel-7-docker-candidate-38019-20180222163657

When they've verified the fix is working in that container for this BZ, we will ship the final gold-signed container to customers on registry.access.redhat.com, and it will become rhceph-3-rhel7:latest.

I see both of these images defined on magna113.ceph.redhat.com (in `docker images`), so I imagine that's what Vasishta has been testing here already.

Comment 13 Vasishta 2018-02-27 04:45:45 UTC
Created attachment 1401136
File contains contents of ansible-playbook log

Hi Sebastien, 

As Ken mentioned, I was trying to upgrade from 3.0 live to ceph-3.0-rhel-7-docker-candidate-38019-20180222163657

I think that after the mon is upgraded, the mgr is restarted without being upgraded, so an asok file with the FQDN in its name is created, which results in this failure.

Regards,
Vasishta Shastry
AQE, Ceph

Comment 14 Sébastien Han 2018-02-27 07:53:18 UTC
In this case, the failure is expected if you don't apply the workaround.
I see your /etc/hostname still has the FQDN; you have to force the shortname.
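
A minimal Python sketch of the check implied here, assuming only that an FQDN is a hostname containing a dot and that the short name is the part before the first dot. The helper names are hypothetical, and the sketch only detects the condition; it does not perform the workaround itself, whose exact steps are not given in this comment.

```python
def short_hostname(hostname: str) -> str:
    """Return the part of the hostname before the first dot."""
    return hostname.strip().split(".", 1)[0]


def needs_shortname_workaround(etc_hostname_contents: str) -> bool:
    """Return True when the given /etc/hostname contents hold an FQDN."""
    name = etc_hostname_contents.strip()
    return short_hostname(name) != name


# Usage: pass the text read from /etc/hostname.
print(needs_shortname_workaround("magna113.ceph.redhat.com\n"))
print(needs_shortname_workaround("magna113\n"))
```

The first call reports that the workaround is needed because the name contains a domain part; the second reports that it is not.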

Comment 15 Vasishta 2018-02-27 17:10:14 UTC
Yes, with the workaround the rolling update worked fine for me.

However, it can be observed that, unlike previously, the new asok file created by the new container image doesn't contain the FQDN in its name even though the hostname is an FQDN.

The workaround is needed because ceph-ansible restarts the mgr daemon before updating it, which causes an asok file with the FQDN in its name to be created.

Regards,
Vasishta 
AQE, Ceph

Comment 16 Sébastien Han 2018-02-28 14:37:09 UTC
You don't see the fqdn anymore because the container enforces the shortname.
What's needed so we can move this to VERIFIED?

Thanks

Comment 17 Vasishta 2018-02-28 15:56:58 UTC
(In reply to leseb from comment #16)
> You don't see the fqdn anymore because the container enforces the shortname.

Though we don't see the FQDN, the upgrade fails because the mgr containers are restarted before being upgraded, which results in the creation of an asok file with the FQDN in its file name. I will file a separate bug for this.

> What's needed so we can move this to VERIFIED?
> 
> Thanks

Will move it to VERIFIED state once it comes ON_QA.

Ceph-ansible - ceph-ansible-3.0.26-1.el7cp
Container - 3.0 live to ceph-3.0-rhel-7-docker-candidate-38019-20180222163657

Regards,
Vasishta shastry
AQE, Ceph

Comment 21 errata-xmlrpc 2018-03-08 15:54:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0474

Comment 22 Sébastien Han 2018-04-05 10:11:39 UTC
*** Bug 1553818 has been marked as a duplicate of this bug. ***

