Bug 1600943

Summary: [cee/sd] upgrade RHCS 2 -> RHCS 3 will fail if the cluster still sorts nibblewise (sortbitwise not set) Reporter: Tomas Petr <tpetr>
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Tomas Petr <tpetr>
Component: Ceph-Ansible Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA QA Contact: subhash <vpoliset>
Severity: medium Docs Contact: John Brier <jbrier>
Priority: medium    
Version: 3.0CC: agunn, aschoen, ceph-eng-bugs, ceph-qe-bugs, gmeno, hnallurv, jbrier, kdreyer, nthomas, sankarshan, seb, shan, tserlin, vpoliset
Target Milestone: rc   
Target Release: 3.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.rc21.el7cp Ubuntu: ceph-ansible_3.1.0~rc21-2redhat1
Doc Type: Bug Fix
Doc Text:
.Upgrading {product} 2 to version 3 will set the `sortbitwise` option properly
Previously, a rolling upgrade from {product} 2 to {product} 3 would fail because the OSDs would never initialize. This is because `sortbitwise` was not properly set by Ceph Ansible. With this release, Ceph Ansible sets `sortbitwise` properly, so the OSDs can start.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-26 18:22:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1584264    

Description Tomas Petr 2018-07-13 12:54:26 UTC
Description of problem:
An upgrade from RHCS 2 to RHCS 3 will fail if the cluster still sorts nibblewise (the sortbitwise flag is not set);
it stays stuck on "TASK [waiting for clean pgs...]" because RHCS 3 OSDs will not start while nibblewise sorting is in effect.

Running "ceph osd set sortbitwise" will fix this.
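For reference, the manual workaround can be run from any node with an admin keyring before re-running the playbook (a sketch; the grep inspection step is only illustrative):

```
# Inspect the currently set OSD flags; 'sortbitwise' should appear in the list
ceph osd dump | grep flags

# Enable bitwise object sorting so the RHCS 3 OSDs can start
ceph osd set sortbitwise
```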

The ceph-ansible playbook could check for this and either fail during the prerequisites check or set the flag itself.
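Such a prerequisite check could be expressed as Ansible tasks along these lines (a sketch only, not the change that was actually merged; task names and the string match on the osd dump output are my own assumptions):

```yaml
- name: check whether sortbitwise is set before upgrading
  command: ceph --cluster ceph osd dump --format json
  register: osd_dump
  changed_when: false

- name: fail early when sortbitwise is missing
  fail:
    msg: "Run 'ceph osd set sortbitwise' before starting the rolling upgrade."
  when: "'sortbitwise' not in osd_dump.stdout"
```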

Version-Release number of selected component (if applicable):
RHCS 3
ceph-ansible-3.0.39

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Sébastien Han 2018-08-08 15:05:31 UTC
In https://github.com/ceph/ceph-ansible/releases/tag/v3.0.41

Comment 4 Sébastien Han 2018-08-08 15:06:09 UTC
since .40 sorry

Comment 5 Sébastien Han 2018-08-08 15:07:36 UTC
and v3.1.0rc14

Comment 10 subhash 2018-08-21 07:26:33 UTC
verified the issue with the following steps

1. Deployed a Ceph 2.5 cluster (10.2.10-28) and unset the sortbitwise flag
2. Upgraded ceph-ansible to 3.1 and upgraded the cluster through rolling_update.yml

rolling_update.yml fails at the task below, so I am moving the BZ back to the ASSIGNED state:
TASK [waiting for clean pgs...] ******************************************************************************************
task path: /usr/share/ceph-ansible/rolling_update.yml:411
Tuesday 21 August 2018  06:42:51 +0000 (0:00:00.655)       0:12:40.336 ******** 
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<magna021> ESTABLISH SSH CONNECTION FOR USER: None
<magna021> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/root/.ansible/cp/%h-%r-%p magna021 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-dbyoistppgehsreyyylnqrtbmvbmpcsb; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<magna021> (0, '\n{"changed": true, "end": "2018-08-21 06:42:51.835425", "stdout": "\\n{\\"fsid\\":\\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\\",\\"health\\":{\\"summary\\":[{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs degraded\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck degraded\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck unclean\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck undersized\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs undersized\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"1 host (3 osds) down\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"3 osds down\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"noout,noscrub,nodeep-scrub flag(s) set\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"no legacy OSD present but \'sortbitwise\' flag is not set\\"}],\\"overall_status\\":\\"HEALTH_WARN\\",\\"detail\\":[]},\\"election_epoch\\":7,\\"quorum\\":[0],\\"quorum_names\\":[\\"magna021\\"],\\"monmap\\":{\\"epoch\\":2,\\"fsid\\":\\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\\",\\"modified\\":\\"2018-08-21 06:34:29.260101\\",\\"created\\":\\"2018-08-21 
05:43:32.108976\\",\\"features\\":{\\"persistent\\":[\\"kraken\\",\\"luminous\\"],\\"optional\\":[]},\\"mons\\":[{\\"rank\\":0,\\"name\\":\\"magna021\\",\\"addr\\":\\"10.8.128.21:6789/0\\",\\"public_addr\\":\\"10.8.128.21:6789/0\\"}]},\\"osdmap\\":{\\"osdmap\\":{\\"epoch\\":44,\\"num_osds\\":9,\\"num_up_osds\\":6,\\"num_in_osds\\":9,\\"full\\":false,\\"nearfull\\":false,\\"num_remapped_pgs\\":0}},\\"pgmap\\":{\\"pgs_by_state\\":[{\\"state_name\\":\\"active+undersized+degraded\\",\\"count\\":64}],\\"num_pgs\\":64,\\"num_pools\\":1,\\"num_objects\\":0,\\"data_bytes\\":0,\\"bytes_used\\":1018404864,\\"bytes_avail\\":8948088016896,\\"bytes_total\\":8949106421760},\\"fsmap\\":{\\"epoch\\":1,\\"by_rank\\":[]},\\"mgrmap\\":{\\"epoch\\":60,\\"active_gid\\":14113,\\"active_name\\":\\"magna021\\",\\"active_addr\\":\\"10.8.128.21:6812/185260\\",\\"available\\":true,\\"standbys\\":[],\\"modules\\":[\\"status\\"],\\"available_modules\\":[\\"balancer\\",\\"dashboard\\",\\"influx\\",\\"localpool\\",\\"prometheus\\",\\"restful\\",\\"selftest\\",\\"status\\",\\"zabbix\\"],\\"services\\":{}},\\"servicemap\\":{\\"epoch\\":1,\\"modified\\":\\"0.000000\\",\\"services\\":{}}}", "cmd": ["ceph", "--cluster", "ceph", "-s", "--format", "json"], "rc": 0, "start": "2018-08-21 06:42:51.484577", "stderr": "", "delta": "0:00:00.350848", "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": " ceph --cluster ceph -s --format json", "removes": null, "creates": null, "chdir": null, "stdin": null}}}\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017\r\ndebug1: Reading configuration data /root/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 8: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: 
mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 167277\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\n')
FAILED - RETRYING: waiting for clean pgs... (40 retries left).Result was: {
    "attempts": 1, 
    "changed": true, 
    "cmd": [
        "ceph", 
        "--cluster", 
        "ceph", 
        "-s", 
        "--format", 
        "json"
    ], 
    "delta": "0:00:00.350848", 
    "end": "2018-08-21 06:42:51.835425", 
    "failed": false, 
    "invocation": {
        "module_args": {
            "_raw_params": " ceph --cluster ceph -s --format json", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "rc": 0, 
    "retries": 41, 
    "start": "2018-08-21 06:42:51.484577", 
    "stderr": "", 
    "stderr_lines": [], 
 "stdout": "\n{\"fsid\":\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\",\"health\":{\"summary\":[{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs degraded\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck degraded\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck unclean\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck undersized\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs undersized\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"1 host (3 osds) down\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"3 osds down\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"noout,noscrub,nodeep-scrub flag(s) set\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"no legacy OSD present but 'sortbitwise' flag is not set\"}],\"overall_status\":\"HEALTH_WARN\",\"detail\":[]},\"election_epoch\":7,\"quorum\":[0],\"quorum_names\":[\"magna021\"],\"monmap\":{\"epoch\":2,\"fsid\":\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\",\"modified\":\"2018-08-21 06:34:29.260101\",\"created\":\"2018-08-21 
05:43:32.108976\",\"features\":{\"persistent\":[\"kraken\",\"luminous\"],\"optional\":[]},\"mons\":[{\"rank\":0,\"name\":\"magna021\",\"addr\":\"10.8.128.21:6789/0\",\"public_addr\":\"10.8.128.21:6789/0\"}]},\"osdmap\":{\"osdmap\":{\"epoch\":44,\"num_osds\":9,\"num_up_osds\":6,\"num_in_osds\":9,\"full\":false,\"nearfull\":false,\"num_remapped_pgs\":0}},\"pgmap\":{\"pgs_by_state\":[{\"state_name\":\"active+undersized+degraded\",\"count\":64}],\"num_pgs\":64,\"num_pools\":1,\"num_objects\":0,\"data_bytes\":0,\"bytes_used\":1018404864,\"bytes_avail\":8948088016896,\"bytes_total\":8949106421760},\"fsmap\":{\"epoch\":1,\"by_rank\":[]},\"mgrmap\":{\"epoch\":60,\"active_gid\":14113,\"active_name\":\"magna021\",\"active_addr\":\"10.8.128.21:6812/185260\",\"available\":true,\"standbys\":[],\"modules\":[\"status\"],\"available_modules\":[\"balancer\",\"dashboard\",\"influx\",\"localpool\",\"prometheus\",\"restful\",\"selftest\",\"status\",\"zabbix\"],\"services\":{}},\"servicemap\":{\"epoch\":1,\"modified\":\"0.000000\",\"services\":{}}}",
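The retry loop above keeps polling `ceph --cluster ceph -s --format json` and waits for the PGs to report clean, which never happens here because the health summary contains the "'sortbitwise' flag is not set" warning. A minimal sketch of checking that output for the warning (the helper name is my own; the sample is a trimmed version of the health summary captured in this log):

```python
import json

def missing_sortbitwise(status_json: str) -> bool:
    """Return True if the cluster health warns that 'sortbitwise' is not set."""
    status = json.loads(status_json)
    summaries = status.get("health", {}).get("summary", [])
    return any("'sortbitwise' flag is not set" in s.get("summary", "")
               for s in summaries)

# Trimmed sample taken from the ceph -s output captured above
sample = json.dumps({
    "health": {
        "summary": [
            {"severity": "HEALTH_WARN", "summary": "64 pgs degraded"},
            {"severity": "HEALTH_WARN",
             "summary": "no legacy OSD present but 'sortbitwise' flag is not set"},
        ],
        "overall_status": "HEALTH_WARN",
    }
})

print(missing_sortbitwise(sample))  # True
```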

Comment 12 Sébastien Han 2018-08-21 09:13:34 UTC
Can you tell me why your OSDs did not get updated?
It seems they got the right version of the package, but they report running ceph version 10.2.10-28.el7cp after restart.

Can I access the env and run one test?
I believe the OSDs cannot start if the 'sortbitwise' flag is not set, and thus they won't report their newer version.

I just need to trigger the command manually and see if the OSDs start reporting their new version after this, because at the moment they report their old version.

Thanks.

Comment 14 Ken Dreyer (Red Hat) 2018-08-21 21:23:13 UTC
https://github.com/ceph/ceph-ansible/pull/3047 was backported to the stable-3.1 branch today, and we need an upstream tag for stable-3.1 with this change.

Comment 19 subhash 2018-08-23 11:25:22 UTC
Verified with the following version: ceph-ansible-3.1.0-0.1.rc21.el7cp

steps:

1. Deployed a Ceph 2.5 cluster (10.2.10-28) and unset the sortbitwise flag
2. Upgraded ceph-ansible to 3.1 and upgraded the cluster through rolling_update.yml

The upgrade playbook ran fine and the PGs are in active+clean state. Moving to VERIFIED state.

Comment 22 errata-xmlrpc 2018-09-26 18:22:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819