Bug 1546127 - [ceph-ansible]: Upgrading container cluster from 2.4 to 2.5 fails for waiting cluster to form quorum even though quorum is there
Summary: [ceph-ansible]: Upgrading container cluster from 2.4 to 2.5 fails for waiting...
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Container
Version: 2.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: rc
Target Release: 2.*
Assignee: Guillaume Abrioux
QA Contact: Vasishta
URL:
Whiteboard:
Depends On: 1546834
Blocks: 1536401 1572368
 
Reported: 2018-02-16 12:01 UTC by Ramakrishnan Periyasamy
Modified: 2019-08-23 03:14 UTC
CC List: 19 users

Fixed In Version: rhceph-rhel7-docker-2.5-2
Doc Type: Known Issue
Doc Text:
.Installing and upgrading containerized Ceph fails

Containerized Ceph deployments fail to install and upgrade when the `/etc/hostname` file contains a Fully Qualified Domain Name (FQDN).

When using the `ceph-ansible` playbook to install Ceph, the installation fails with the following error message:

----
"msg": "The task includes an option with an undefined variable. The error was: 'osd_pool_default_pg_num' is undefined
----

To work around the installation failure, change the FQDN in the `/etc/hostname` file to the short host name on all nodes in the storage cluster. Next, rerun the `ceph-ansible` playbook to install Ceph.

When upgrading Ceph with the `rolling_update` playbook, the upgrade fails with the following error message:

----
"FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum"
----

To work around the upgrade failure, change the FQDN in the `/etc/hostname` file to the short host name on all nodes in the storage cluster. Next, restart the corresponding Ceph daemons running on each node in the storage cluster, then rerun the `rolling_update` playbook to upgrade Ceph.
Clone Of:
: 1546834 (view as bug list)
Environment:
Last Closed: 2019-08-23 03:14:24 UTC
Embargoed:


Attachments
ansible-playbook logs. (3.31 MB, text/plain)
2018-02-16 12:01 UTC, Ramakrishnan Periyasamy


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-container pull 907 0 None closed variables: sync with `ansible_hostname` ansible fact 2020-06-18 12:22:47 UTC

Description Ramakrishnan Periyasamy 2018-02-16 12:01:23 UTC
Created attachment 1396996 [details]
ansible-playbook logs.

Description of problem:
ceph-ansible fails while upgrading a containerized cluster from 2.4 to 2.5: the playbook keeps waiting for the monitor to join the quorum even though the quorum is already formed.

Thanks to Guillaume for debugging this.

Failure message: "FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum"

2018-02-16 11:40:10,566 p=26854 u=ubuntu |  FAILED - RETRYING: container | waiting for the containerized monitor to join the quorum... (5 retries left).Result was: {
    "attempts": 1,
    "changed": true,
    "cmd": [
        "docker",
        "exec",
        "ceph-mon-magna082",
        "ceph",
        "--cluster",
        "slave",
        "-s",
        "--format",
        "json"
    ],
    "delta": "0:00:00.305037",
    "end": "2018-02-16 11:40:10.527565",
    "invocation": {
        "module_args": {
            "_raw_params": "docker exec ceph-mon-magna082 ceph --cluster \"slave\" -s --format json",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "rc": 0,
    "retries": 6,
    "start": "2018-02-16 11:40:10.222528",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "\n{\"health\":{\"health\":{\"health_services\":[{\"mons\":[{\"name\":\"magna069.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4429200,\"kb_avail\":908013800,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:42.396679\",\"store_stats\":{\"bytes_total\":57521079,\"bytes_sst\":55423911,\"bytes_log\":2031616,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"},{\"name\":\"magna072.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4276336,\"kb_avail\":908166664,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:34.496080\",\"store_stats\":{\"bytes_total\":34111166,\"bytes_sst\":30965422,\"bytes_log\":3080192,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"},{\"name\":\"magna082.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4287284,\"kb_avail\":908155716,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:48.694097\",\"store_stats\":{\"bytes_total\":35160310,\"bytes_sst\":30965990,\"bytes_log\":4128768,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"}]}]},\"timechecks\":{\"epoch\":56,\"round\":2,\"round_status\":\"finished\",\"mons\":[{\"name\":\"magna069.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.000000,\"health\":\"HEALTH_OK\"},{\"name\":\"magna072.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.012728,\"health\":\"HEALTH_OK\"},{\"name\":\"magna082.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.042653,\"health\":\"HEALTH_OK\"}]},\"summary\":[],\"overall_status\":\"HEALTH_OK\",\"detail\":[]},\"fsid\":\"5362be02-bf26-4b66-ac09-7496cadcd801\",\"election_epoch\":56,\"quorum\":[0,1,2],\"quorum_names\":[\"magna069.ceph.redhat.com\",\"magna072.ceph.redhat.com\",\"magna082.ceph.redhat.com\"],\"monmap\":{\"epoch\":5,\"fsid\":\"5362be02-bf26-4b66-ac09-7496cadcd801\",\"modified\":\"2018-02-16 09:42:21.176234\",\"created\":\"2018-02-13 
10:38:27.533432\",\"mons\":[{\"rank\":0,\"name\":\"magna069.ceph.redhat.com\",\"addr\":\"10.8.128.69:6789\\/0\"},{\"rank\":1,\"name\":\"magna072.ceph.redhat.com\",\"addr\":\"10.8.128.72:6789\\/0\"},{\"rank\":2,\"name\":\"magna082.ceph.redhat.com\",\"addr\":\"10.8.128.82:6789\\/0\"}]},\"osdmap\":{\"osdmap\":{\"epoch\":5261,\"num_osds\":8,\"num_up_osds\":8,\"num_in_osds\":8,\"full\":false,\"nearfull\":false,\"num_remapped_pgs\":0}},\"pgmap\":{\"pgs_by_state\":[{\"state_name\":\"active+clean\",\"count\":288}],\"version\":176347,\"num_pgs\":288,\"data_bytes\":5921356152,\"bytes_used\":18343018496,\"bytes_avail\":7947121070080,\"bytes_total\":7965464088576},\"fsmap\":{\"epoch\":5,\"id\":1,\"up\":1,\"in\":1,\"max\":1,\"by_rank\":[{\"filesystem_id\":1,\"rank\":0,\"name\":\"magna118\",\"status\":\"up:active\"}]}}",
    "stdout_lines": [
        "",
        "{\"health\":{\"health\":{\"health_services\":[{\"mons\":[{\"name\":\"magna069.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4429200,\"kb_avail\":908013800,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:42.396679\",\"store_stats\":{\"bytes_total\":57521079,\"bytes_sst\":55423911,\"bytes_log\":2031616,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"},{\"name\":\"magna072.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4276336,\"kb_avail\":908166664,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:34.496080\",\"store_stats\":{\"bytes_total\":34111166,\"bytes_sst\":30965422,\"bytes_log\":3080192,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"},{\"name\":\"magna082.ceph.redhat.com\",\"kb_total\":961297424,\"kb_used\":4287284,\"kb_avail\":908155716,\"avail_percent\":94,\"last_updated\":\"2018-02-16 11:39:48.694097\",\"store_stats\":{\"bytes_total\":35160310,\"bytes_sst\":30965990,\"bytes_log\":4128768,\"bytes_misc\":65552,\"last_updated\":\"0.000000\"},\"health\":\"HEALTH_OK\"}]}]},\"timechecks\":{\"epoch\":56,\"round\":2,\"round_status\":\"finished\",\"mons\":[{\"name\":\"magna069.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.000000,\"health\":\"HEALTH_OK\"},{\"name\":\"magna072.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.012728,\"health\":\"HEALTH_OK\"},{\"name\":\"magna082.ceph.redhat.com\",\"skew\":0.000000,\"latency\":0.042653,\"health\":\"HEALTH_OK\"}]},\"summary\":[],\"overall_status\":\"HEALTH_OK\",\"detail\":[]},\"fsid\":\"5362be02-bf26-4b66-ac09-7496cadcd801\",\"election_epoch\":56,\"quorum\":[0,1,2],\"quorum_names\":[\"magna069.ceph.redhat.com\",\"magna072.ceph.redhat.com\",\"magna082.ceph.redhat.com\"],\"monmap\":{\"epoch\":5,\"fsid\":\"5362be02-bf26-4b66-ac09-7496cadcd801\",\"modified\":\"2018-02-16 09:42:21.176234\",\"created\":\"2018-02-13 
10:38:27.533432\",\"mons\":[{\"rank\":0,\"name\":\"magna069.ceph.redhat.com\",\"addr\":\"10.8.128.69:6789\\/0\"},{\"rank\":1,\"name\":\"magna072.ceph.redhat.com\",\"addr\":\"10.8.128.72:6789\\/0\"},{\"rank\":2,\"name\":\"magna082.ceph.redhat.com\",\"addr\":\"10.8.128.82:6789\\/0\"}]},\"osdmap\":{\"osdmap\":{\"epoch\":5261,\"num_osds\":8,\"num_up_osds\":8,\"num_in_osds\":8,\"full\":false,\"nearfull\":false,\"num_remapped_pgs\":0}},\"pgmap\":{\"pgs_by_state\":[{\"state_name\":\"active+clean\",\"count\":288}],\"version\":176347,\"num_pgs\":288,\"data_bytes\":5921356152,\"bytes_used\":18343018496,\"bytes_avail\":7947121070080,\"bytes_total\":7965464088576},\"fsmap\":{\"epoch\":5,\"id\":1,\"up\":1,\"in\":1,\"max\":1,\"by_rank\":[{\"filesystem_id\":1,\"rank\":0,\"name\":\"magna118\",\"status\":\"up:active\"}]}}"
    ]
}
2018-02-16 11:40:17,013 p=26854 u=ubuntu |   [ERROR]: User interrupted execution
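
In the output above, `quorum_names` lists fully qualified names such as `magna082.ceph.redhat.com`, while the container name in the command (`ceph-mon-magna082`) uses the short host name. The retry condition of the failing task is not shown in this report, so the following Python sketch only assumes that it performs an exact-name match against `quorum_names`. The JSON is a trimmed copy of the logged status, and the helper names `in_quorum_exact` and `in_quorum_short` are hypothetical, not part of ceph-ansible.

```python
import json

# Trimmed copy of the `ceph -s --format json` output from the log above;
# only the fields relevant to a quorum membership check are kept.
status_json = """
{
  "quorum": [0, 1, 2],
  "quorum_names": [
    "magna069.ceph.redhat.com",
    "magna072.ceph.redhat.com",
    "magna082.ceph.redhat.com"
  ]
}
"""


def in_quorum_exact(status, mon_name):
    """Exact-match test (assumed behaviour of the failing check)."""
    return mon_name in status["quorum_names"]


def in_quorum_short(status, mon_name):
    """Test that compares only the part of each name before the first dot."""
    short = mon_name.split(".")[0]
    return short in {name.split(".")[0] for name in status["quorum_names"]}


status = json.loads(status_json)
print(in_quorum_exact(status, "magna082"))                  # False
print(in_quorum_exact(status, "magna082.ceph.redhat.com"))  # True
print(in_quorum_short(status, "magna082"))                  # True
```

Under that assumption, a check keyed on the short name never succeeds even though all three monitors are in quorum, which would match the retries seen in the log.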

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.25-1.el7cp.noarch
ansible-2.4.2.0-2.el7.noarch

How reproducible:
10/10

Steps to Reproduce:
1. Configure a 2.4 cluster.
2. Update to the 2.5 ansible packages.
3. Upgrade using ceph-ansible, following the official documentation.

Actual results:
Even though the monitors are in quorum, ansible fails while waiting for the monitor to join the quorum.

Expected results:
NA

Additional info:
NA

Comment 25 Madhavi Kasturi 2018-02-19 14:17:41 UTC
Observed a similar failure during installation [kernel updated].

After changing the hostname from the FQDN to the short hostname in /etc/hostname, the installation completed successfully.
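
A minimal Python sketch of how one might spot nodes that need this workaround, assuming a host name counts as an FQDN when it contains a dot. The helpers `short_hostname` and `has_fqdn` are hypothetical; they operate on the text of /etc/hostname passed in as a string and do not modify anything on the node.

```python
def short_hostname(name: str) -> str:
    """Return the part of a host name before the first dot."""
    return name.strip().split(".", 1)[0]


def has_fqdn(hostname_file_content: str) -> bool:
    """True when the /etc/hostname content looks like an FQDN (contains a dot)."""
    return "." in hostname_file_content.strip()


# Example with a host name taken from this report.
content = "magna082.ceph.redhat.com\n"
if has_fqdn(content):
    print("FQDN found; short host name would be:", short_hostname(content))
```

On a real node, the string would come from reading /etc/hostname; the replacement itself is still a manual step, as described in the comment above.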

Comment 26 Guillaume Abrioux 2018-02-19 17:19:05 UTC
v3.0.26 should fix this issue

Comment 40 Guillaume Abrioux 2018-02-20 13:42:04 UTC
Hi Ramakrishnan,

the initial error reported in this BZ is fixed in ceph-ansible v3.0.26
the error reported in c28 is fixed in the container image ceph-2-rhel-7-docker-candidate-62031-20180220125431

Comment 43 Ramakrishnan Periyasamy 2018-02-20 14:01:13 UTC
(In reply to Guillaume Abrioux from comment #40)
> Hi Ramakrishnan,
> 
> the initial error reported in this BZ is fixed in ceph-ansible v3.0.26
> the error reported in c28 is fixed in the container image
> ceph-2-rhel-7-docker-candidate-62031-20180220125431

Thanks for the update, Guillaume, and thanks for taking the time to troubleshoot the issue and explain the problem to me.

Comment 54 Ramakrishnan Periyasamy 2018-02-21 06:00:08 UTC
Hi Aron,

Provided doc text for release notes, clearing the needinfo tag. 

Regards,
Ramakrishnan

