Bug 1546127

Summary: [ceph-ansible]: Upgrading container cluster from 2.4 to 2.5 fails for waiting cluster to form quorum even though quorum is there
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Ramakrishnan Periyasamy <rperiyas>
Component: Container Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED EOL QA Contact: Vasishta <vashastr>
Severity: urgent Docs Contact:
Priority: medium    
Version: 2.5 CC: adeza, agunn, aschoen, ceph-eng-bugs, dang, gabrioux, gmeno, hchen, hnallurv, jim.curtis, kdreyer, mhackett, mkasturi, nthomas, pprakash, rperiyas, sankarshan, shan, tserlin
Target Milestone: rc   
Target Release: 2.*   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rhceph-rhel7-docker-2.5-2 Doc Type: Known Issue
Doc Text:
.Installing and upgrading containerized Ceph fails
Installing and upgrading containerized Ceph fails when the `/etc/hostname` file contains a Fully Qualified Domain Name (FQDN).

When using the `ceph-ansible` playbook to install Ceph, the installation fails with the following error message:

----
"msg": "The task includes an option with an undefined variable. The error was: 'osd_pool_default_pg_num' is undefined
----

To work around the installation failure, change the FQDN in the `/etc/hostname` file to the short host name on all nodes in the storage cluster. Next, rerun the `ceph-ansible` playbook to install Ceph.

When upgrading Ceph with the `rolling_update` playbook, the upgrade fails with the following error message:

----
"FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum"
----

To work around the upgrade failure, change the FQDN in the `/etc/hostname` file to the short host name on all nodes in the storage cluster. Next, restart the corresponding Ceph daemons running on each node in the storage cluster, then rerun the `rolling_update` playbook to upgrade Ceph.
Story Points: ---
Clone Of:
: 1546834 Environment:
Last Closed: 2019-08-23 03:14:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1546834    
Bug Blocks: 1536401, 1572368    
Attachments:
Description Flags
ansible-playbook logs. none

Description Ramakrishnan Periyasamy 2018-02-16 12:01:23 UTC
Created attachment 1396996
ansible-playbook logs.

Description of problem:
ceph-ansible fails while upgrading a containerized cluster from 2.4 to 2.5: the playbook keeps waiting for the cluster to form a quorum even though a quorum already exists.

Thanks to Guillaume for debugging this.

Failure message: "FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum"

2018-02-16 11:40:10,566 p=26854 u=ubuntu |  FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (5 retries left).Result was: {
    "attempts": 1,
    "changed": true,
    "cmd": [
        "docker",
        "exec",
        "ceph-mon-magna082",
        "ceph",
        "--cluster",
        "slave",
        "-s",
        "--format",
        "json"
    ],
    "delta": "0:00:00.305037",
    "end": "2018-02-16 11:40:10.527565",
    "invocation": {
        "module_args": {
            "_raw_params": "docker exec ceph-mon-magna082 ceph --cluster \"slave\" -s --format json",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "rc": 0,
    "retries": 6,
    "start": "2018-02-16 11:40:10.222528",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "\n{\"health\":{\"health\":{\"health_services\":[{\"mons\":[{\"name\":\"magna069.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4429200,\"kb_avail\":908013800,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:42.396679\",\"store_stats\":{\"bytes_total\":57521079,\"bytes_sst\":55423911,\"bytes_log\":2031616,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"},{\"name\":\"magna072.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4276336,\"kb_avail\":908166664,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:34.496080\",\"store_stats\":{\"bytes_total\":34111166,\"bytes_sst\":30965422,\"bytes_log\":3080192,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"},{\"name\":\"magna082.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4287284,\"kb_avail\":908155716,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:48.694097\",\"store_stats\":{\"bytes_total\":35160310,\"bytes_sst\":30965990,\"bytes_log\":4128768,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"}]}]},\"timechecks\":{\"epoch\":56,\"round\":2,\"round_status\":\"finished\",\"mons\":[{\"name\":\"magna069.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.000000,\"health\":\"HEALTH_OK\"},{\"name\":\"magna072.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.012728,\"health\":\"HEALTH_OK\"},{\"name\":\"magna082.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.042653,\"health\":\"HEALTH_OK\"}]},\"summary\":[],\"overall_status\":\"HEALTH_OK\",\"detail\":[]},\"fsid\":\"5362be02-bf26-4b66-ac09-7496cadcd801\",\"election_epoch\":56,\"quorum\":[0,1,2],\"quorum_names\":[\"magna069.ceph.redhat.com\",\"magna072.ceph.redhat.com\",\"magna082.ceph.redhat.com\"],\"monmap\":{\"epoch\":5,\"fsid\":\"5362be02-bf26-4b66-ac09-7496cadcd801\",\"modified\":\"2018-02-16 09:42:21.176234\",\"created\":\"2018-02-13 
10:38:27.533432\",\"mons\":[{\"rank\":0,\"name\":\"magna069.ceph.redhat.com\",\"addr\":\"10.8.128.69:6789\\/0\"},{\"rank\":1,\"name\":\"magna072.ceph.redhat.com\",\"addr\":\"10.8.128.72:6789\\/0\"},{\"rank\":2,\"name\":\"magna082.ceph.redhat.com\",\"addr\":\"10.8.128.82:6789\\/0\"}]},\"osdmap\":{\"osdmap\":{\"epoch\":5261,\"num_osds\":8,\"num_up_osds\":8,\"num_in_osds\":8,\"full\":false,\"nearfull\":false,\"num_remapped_pgs\":0}},\"pgmap\":{\"pgs_by_state\":[{\"state_name\":\"active+clean\",\"count\":288}],\"version\":176347,\"num_pgs\":288,\"data_bytes\":5921356152,\"bytes_used\":18343018496,\"bytes_avail\":7947121070080,\"bytes_total\":7965464088576},\"fsmap\":{\"epoch\":5,\"id\":1,\"up\":1,\"in\":1,\"max\":1,\"by_rank\":[{\"filesystem_id\":1,\"rank\":0,\"name\":\"magna118\",\"status\":\"up:active\"}]}}",
    "stdout_lines": [
        "",
        "{\"health\":{\"health\":{\"health_services\":[{\"mons\":[{\"name\":\"magna069.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4429200,\"kb_avail\":908013800,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:42.396679\",\"store_stats\":{\"bytes_total\":57521079,\"bytes_sst\":55423911,\"bytes_log\":2031616,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"},{\"name\":\"magna072.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4276336,\"kb_avail\":908166664,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:34.496080\",\"store_stats\":{\"bytes_total\":34111166,\"bytes_sst\":30965422,\"bytes_log\":3080192,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"},{\"name\":\"magna082.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4287284,\"kb_avail\":908155716,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:48.694097\",\"store_stats\":{\"bytes_total\":35160310,\"bytes_sst\":30965990,\"bytes_log\":4128768,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"}]}]},\"timechecks\":{\"epoch\":56,\"round\":2,\"round_status\":\"finished\",\"mons\":[{\"name\":\"magna069.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.000000,\"health\":\"HEALTH_OK\"},{\"name\":\"magna072.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.012728,\"health\":\"HEALTH_OK\"},{\"name\":\"magna082.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.042653,\"health\":\"HEALTH_OK\"}]},\"summary\":[],\"overall_status\":\"HEALTH_OK\",\"detail\":[]},\"fsid\":\"5362be02-bf26-4b66-ac09-7496cadcd801\",\"election_epoch\":56,\"quorum\":[0,1,2],\"quorum_names\":[\"magna069.ceph.redhat.com\",\"magna072.ceph.redhat.com\",\"magna082.ceph.redhat.com\"],\"monmap\":{\"epoch\":5,\"fsid\":\"5362be02-bf26-4b66-ac09-7496cadcd801\",\"modified\":\"2018-02-16 09:42:21.176234\",\"created\":\"2018-02-13 
10:38:27.533432\",\"mons\":[{\"rank\":0,\"name\":\"magna069.ceph.redhat.com\",\"addr\":\"10.8.128.69:6789\\/0\"},{\"rank\":1,\"name\":\"magna072.ceph.redhat.com\",\"addr\":\"10.8.128.72:6789\\/0\"},{\"rank\":2,\"name\":\"magna082.ceph.redhat.com\",\"addr\":\"10.8.128.82:6789\\/0\"}]},\"osdmap\":{\"osdmap\":{\"epoch\":5261,\"num_osds\":8,\"num_up_osds\":8,\"num_in_osds\":8,\"full\":false,\"nearfull\":false,\"num_remapped_pgs\":0}},\"pgmap\":{\"pgs_by_state\":[{\"state_name\":\"active+clean\",\"count\":288}],\"version\":176347,\"num_pgs\":288,\"data_bytes\":5921356152,\"bytes_used\":18343018496,\"bytes_avail\":7947121070080,\"bytes_total\":7965464088576},\"fsmap\":{\"epoch\":5,\"id\":1,\"up\":1,\"in\":1,\"max\":1,\"by_rank\":[{\"filesystem_id\":1,\"rank\":0,\"name\":\"magna118\",\"status\":\"up:active\"}]}}"
    ]
}
2018-02-16 11:40:17,013 p=26854 u=ubuntu |   [ERROR]: User interrupted execution

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.25-1.el7cp.noarch
ansible-2.4.2.0-2.el7.noarch

How reproducible:
10/10

Steps to Reproduce:
1. Configure a 2.4 cluster.
2. Update the ansible packages to 2.5.
3. Upgrade using ceph-ansible, following the official documentation.

Actual results:
Even though a quorum exists, ansible fails with an error about the monitor being unable to join the quorum.
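
The status output above lists every monitor in `quorum_names` under its FQDN (for example `magna082.ceph.redhat.com`), while the container is named after the short host name (`ceph-mon-magna082`). The exact condition used by the playbook is not shown in this report, so the following is only a minimal sketch, assuming the check looks for the short host name in `quorum_names`. The function name `monitor_in_quorum` and the trimmed-down `sample` status are illustrative and not part of ceph-ansible.

```python
import json


def monitor_in_quorum(status_json: str, mon_name: str) -> bool:
    """Return True if mon_name is listed in quorum_names.

    A name matches if it is identical to an entry, or if both share the
    same first DNS label (short host name). Hypothetical helper, not the
    check that ceph-ansible itself performs.
    """
    status = json.loads(status_json)
    names = status.get("quorum_names", [])
    short = mon_name.split(".")[0]
    return any(n == mon_name or n.split(".")[0] == short for n in names)


# Trimmed-down status holding only the quorum fields from the log above.
sample = json.dumps({
    "quorum": [0, 1, 2],
    "quorum_names": [
        "magna069.ceph.redhat.com",
        "magna072.ceph.redhat.com",
        "magna082.ceph.redhat.com",
    ],
})

# An exact-match lookup of the short name fails although quorum exists.
strict = "magna082" in json.loads(sample)["quorum_names"]
# Comparing on the short host name finds the monitor.
tolerant = monitor_in_quorum(sample, "magna082")
print(strict, tolerant)  # False True
```

Under that assumption, an exact-match lookup never succeeds on a node whose monitor registered under its FQDN, which would be consistent with the retries seen in the log.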

Expected results:
NA

Additional info:
NA

Comment 25 Madhavi Kasturi 2018-02-19 14:17:41 UTC
Observed a similar failure during installation [kernel updated].

After changing the hostname from the FQDN to the short hostname in /etc/hostname, the installation completed successfully.
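
As a rough illustration of the workaround described in this comment, the sketch below checks whether the text of a hostname file holds an FQDN and derives the short host name from it. The helper names `is_fqdn` and `short_hostname` are hypothetical, and the code only inspects a string; it does not modify `/etc/hostname` or restart any daemon.

```python
def is_fqdn(hostname_file_text: str) -> bool:
    """Treat any host name containing a dot as an FQDN (simplifying assumption)."""
    return "." in hostname_file_text.strip()


def short_hostname(hostname_file_text: str) -> str:
    """Return the first DNS label of the host name, without surrounding whitespace."""
    return hostname_file_text.strip().split(".", 1)[0]


# Example using a monitor name from the log in the description.
current = "magna082.ceph.redhat.com\n"
if is_fqdn(current):
    print("FQDN found, short host name would be:", short_hostname(current))
```

A node flagged this way would need its host name changed to the printed short form on every node in the storage cluster before the playbook is rerun.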

Comment 26 Guillaume Abrioux 2018-02-19 17:19:05 UTC
v3.0.26 should fix this issue

Comment 40 Guillaume Abrioux 2018-02-20 13:42:04 UTC
Hi Ramakrishnan,

the initial error reported in this BZ is fixed in ceph-ansible v3.0.26
the error reported in c28 is fixed in the container image ceph-2-rhel-7-docker-candidate-62031-20180220125431

Comment 43 Ramakrishnan Periyasamy 2018-02-20 14:01:13 UTC
(In reply to Guillaume Abrioux from comment #40)
> Hi Ramakrishnan,
> 
> the initial error reported in this BZ is fixed in ceph-ansible v3.0.26
> the error reported in c28 is fixed in the container image
> ceph-2-rhel-7-docker-candidate-62031-20180220125431

Thanks for the update, Guillaume, and thanks for taking the time to troubleshoot the issue and explain the problem to me.

Comment 54 Ramakrishnan Periyasamy 2018-02-21 06:00:08 UTC
Hi Aron,

Provided doc text for the release notes; clearing the needinfo tag.

Regards,
Ramakrishnan