Bug 1600943

Summary: [cee/sd] upgrade RHCS 2 -> RHCS 3 will fail if the cluster still sorts nibblewise (sortbitwise not set) Reporter: Tomas Petr <tpetr>
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Tomas Petr <tpetr>
Component: Ceph-Ansible Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA QA Contact: subhash <vpoliset>
Severity: medium Docs Contact: John Brier <jbrier>
Priority: medium    
Version: 3.0CC: agunn, aschoen, ceph-eng-bugs, ceph-qe-bugs, gmeno, hnallurv, jbrier, kdreyer, nthomas, sankarshan, seb, shan, tserlin, vpoliset
Target Milestone: rc   
Target Release: 3.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.rc21.el7cp Ubuntu: ceph-ansible_3.1.0~rc21-2redhat1
Doc Type: Bug Fix
Doc Text:
.Upgrading {product} 2 to version 3 will set the `sortbitwise` option properly
Previously, a rolling upgrade from {product} 2 to {product} 3 would fail because the OSDs would never initialize. This is because `sortbitwise` was not properly set by Ceph Ansible. With this release, Ceph Ansible sets `sortbitwise` properly, so the OSDs can start.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-26 18:22:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1584264    

Description Tomas Petr 2018-07-13 12:54:26 UTC
Description of problem:
An upgrade from RHCS 2 to RHCS 3 will fail if the cluster still sorts nibblewise (the sortbitwise flag is not set);
it stays stuck on "TASK [waiting for clean pgs...]" because RHCS 3 OSDs will not start while nibblewise sorting is in effect.

Running "ceph osd set sortbitwise" will fix this.
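For reference, the manual workaround can be run from any node with an admin keyring before re-running the playbook (a sketch; the grep inspection step is only illustrative):

```
# Inspect the currently set OSD flags; 'sortbitwise' should appear in the list
ceph osd dump | grep flags

# Enable bitwise object sorting so the RHCS 3 OSDs can start
ceph osd set sortbitwise
```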

The ceph-ansible playbook could check for this and either fail during the prerequisites check or set the flag itself.
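Such a prerequisite check could be expressed as Ansible tasks along these lines (a sketch only, not the change that was actually merged; task names and the string match on the osd dump output are my own assumptions):

```yaml
- name: check whether sortbitwise is set before upgrading
  command: ceph --cluster ceph osd dump --format json
  register: osd_dump
  changed_when: false

- name: fail early when sortbitwise is missing
  fail:
    msg: "Run 'ceph osd set sortbitwise' before starting the rolling upgrade."
  when: "'sortbitwise' not in osd_dump.stdout"
```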

Version-Release number of selected component (if applicable):
RHCS 3
ceph-ansible-3.0.39

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Sébastien Han 2018-08-08 15:05:31 UTC
In https://github.com/ceph/ceph-ansible/releases/tag/v3.0.41

Comment 4 Sébastien Han 2018-08-08 15:06:09 UTC
since .40 sorry

Comment 5 Sébastien Han 2018-08-08 15:07:36 UTC
and v3.1.0rc14

Comment 10 subhash 2018-08-21 07:26:33 UTC
verified the issue with the following steps

1. Deployed a Ceph 2.5 cluster (10.2.10-28) and unset the sortbitwise flag
2. Upgraded ceph-ansible to 3.1 and upgraded the cluster through rolling_update.yml

rolling_update.yml fails at the task below, so I am moving the BZ back to the ASSIGNED state:
TASK [waiting for clean pgs...] ******************************************************************************************
task path: /usr/share/ceph-ansible/rolling_update.yml:411
Tuesday 21 August 2018  06:42:51 +0000 (0:00:00.655)       0:12:40.336 ******** 
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<magna021> ESTABLISH SSH CONNECTION FOR USER: None
<magna021> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/root/.ansible/cp/%h-%r-%p magna021 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-dbyoistppgehsreyyylnqrtbmvbmpcsb; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<magna021> (0, '\n{"changed": true, "end": "2018-08-21 06:42:51.835425", "stdout": "\\n{\\"fsid\\":\\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\\",\\"health\\":{\\"summary\\":[{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs degraded\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck degraded\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck unclean\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck undersized\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs undersized\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"1 host (3 osds) down\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"3 osds down\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"noout,noscrub,nodeep-scrub flag(s) set\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"no legacy OSD present but \'sortbitwise\' flag is not set\\"}],\\"overall_status\\":\\"HEALTH_WARN\\",\\"detail\\":[]},\\"election_epoch\\":7,\\"quorum\\":[0],\\"quorum_names\\":[\\"magna021\\"],\\"monmap\\":{\\"epoch\\":2,\\"fsid\\":\\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\\",\\"modified\\":\\"2018-08-21 06:34:29.260101\\",\\"created\\":\\"2018-08-21 
05:43:32.108976\\",\\"features\\":{\\"persistent\\":[\\"kraken\\",\\"luminous\\"],\\"optional\\":[]},\\"mons\\":[{\\"rank\\":0,\\"name\\":\\"magna021\\",\\"addr\\":\\"10.8.128.21:6789/0\\",\\"public_addr\\":\\"10.8.128.21:6789/0\\"}]},\\"osdmap\\":{\\"osdmap\\":{\\"epoch\\":44,\\"num_osds\\":9,\\"num_up_osds\\":6,\\"num_in_osds\\":9,\\"full\\":false,\\"nearfull\\":false,\\"num_remapped_pgs\\":0}},\\"pgmap\\":{\\"pgs_by_state\\":[{\\"state_name\\":\\"active+undersized+degraded\\",\\"count\\":64}],\\"num_pgs\\":64,\\"num_pools\\":1,\\"num_objects\\":0,\\"data_bytes\\":0,\\"bytes_used\\":1018404864,\\"bytes_avail\\":8948088016896,\\"bytes_total\\":8949106421760},\\"fsmap\\":{\\"epoch\\":1,\\"by_rank\\":[]},\\"mgrmap\\":{\\"epoch\\":60,\\"active_gid\\":14113,\\"active_name\\":\\"magna021\\",\\"active_addr\\":\\"10.8.128.21:6812/185260\\",\\"available\\":true,\\"standbys\\":[],\\"modules\\":[\\"status\\"],\\"available_modules\\":[\\"balancer\\",\\"dashboard\\",\\"influx\\",\\"localpool\\",\\"prometheus\\",\\"restful\\",\\"selftest\\",\\"status\\",\\"zabbix\\"],\\"services\\":{}},\\"servicemap\\":{\\"epoch\\":1,\\"modified\\":\\"0.000000\\",\\"services\\":{}}}", "cmd": ["ceph", "--cluster", "ceph", "-s", "--format", "json"], "rc": 0, "start": "2018-08-21 06:42:51.484577", "stderr": "", "delta": "0:00:00.350848", "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": " ceph --cluster ceph -s --format json", "removes": null, "creates": null, "chdir": null, "stdin": null}}}\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017\r\ndebug1: Reading configuration data /root/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 8: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: 
mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 167277\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\n')
FAILED - RETRYING: waiting for clean pgs... (40 retries left).Result was: {
    "attempts": 1, 
    "changed": true, 
    "cmd": [
        "ceph", 
        "--cluster", 
        "ceph", 
        "-s", 
        "--format", 
        "json"
    ], 
    "delta": "0:00:00.350848", 
    "end": "2018-08-21 06:42:51.835425", 
    "failed": false, 
    "invocation": {
        "module_args": {
            "_raw_params": " ceph --cluster ceph -s --format json", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "rc": 0, 
    "retries": 41, 
    "start": "2018-08-21 06:42:51.484577", 
    "stderr": "", 
    "stderr_lines": [], 
 "stdout": "\n{\"fsid\":\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\",\"health\":{\"summary\":[{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs degraded\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck degraded\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck unclean\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck undersized\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs undersized\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"1 host (3 osds) down\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"3 osds down\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"noout,noscrub,nodeep-scrub flag(s) set\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"no legacy OSD present but 'sortbitwise' flag is not set\"}],\"overall_status\":\"HEALTH_WARN\",\"detail\":[]},\"election_epoch\":7,\"quorum\":[0],\"quorum_names\":[\"magna021\"],\"monmap\":{\"epoch\":2,\"fsid\":\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\",\"modified\":\"2018-08-21 06:34:29.260101\",\"created\":\"2018-08-21 
05:43:32.108976\",\"features\":{\"persistent\":[\"kraken\",\"luminous\"],\"optional\":[]},\"mons\":[{\"rank\":0,\"name\":\"magna021\",\"addr\":\"10.8.128.21:6789/0\",\"public_addr\":\"10.8.128.21:6789/0\"}]},\"osdmap\":{\"osdmap\":{\"epoch\":44,\"num_osds\":9,\"num_up_osds\":6,\"num_in_osds\":9,\"full\":false,\"nearfull\":false,\"num_remapped_pgs\":0}},\"pgmap\":{\"pgs_by_state\":[{\"state_name\":\"active+undersized+degraded\",\"count\":64}],\"num_pgs\":64,\"num_pools\":1,\"num_objects\":0,\"data_bytes\":0,\"bytes_used\":1018404864,\"bytes_avail\":8948088016896,\"bytes_total\":8949106421760},\"fsmap\":{\"epoch\":1,\"by_rank\":[]},\"mgrmap\":{\"epoch\":60,\"active_gid\":14113,\"active_name\":\"magna021\",\"active_addr\":\"10.8.128.21:6812/185260\",\"available\":true,\"standbys\":[],\"modules\":[\"status\"],\"available_modules\":[\"balancer\",\"dashboard\",\"influx\",\"localpool\",\"prometheus\",\"restful\",\"selftest\",\"status\",\"zabbix\"],\"services\":{}},\"servicemap\":{\"epoch\":1,\"modified\":\"0.000000\",\"services\":{}}}",
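The retry loop above keeps polling `ceph --cluster ceph -s --format json` and waits for the PGs to report clean, which never happens here because the health summary contains the "'sortbitwise' flag is not set" warning. A minimal sketch of checking that output for the warning (the helper name is my own; the sample is a trimmed version of the health summary captured in this log):

```python
import json

def missing_sortbitwise(status_json: str) -> bool:
    """Return True if the cluster health warns that 'sortbitwise' is not set."""
    status = json.loads(status_json)
    summaries = status.get("health", {}).get("summary", [])
    return any("'sortbitwise' flag is not set" in s.get("summary", "")
               for s in summaries)

# Trimmed sample taken from the ceph -s output captured above
sample = json.dumps({
    "health": {
        "summary": [
            {"severity": "HEALTH_WARN", "summary": "64 pgs degraded"},
            {"severity": "HEALTH_WARN",
             "summary": "no legacy OSD present but 'sortbitwise' flag is not set"},
        ],
        "overall_status": "HEALTH_WARN",
    }
})

print(missing_sortbitwise(sample))  # True
```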

Comment 12 Sébastien Han 2018-08-21 09:13:34 UTC
Can you tell me why your OSDs did not get updated?
It seems they got the right version of the package, but they report running ceph version 10.2.10-28.el7cp after restart.

Can I access the env and run one test?
I believe the OSDs cannot start if the 'sortbitwise' flag is not set, and thus they won't report their newer version.

I just need to trigger the command manually and see if the OSDs start reporting their new version after this, because at the moment they report their old version.

Thanks.

Comment 14 Ken Dreyer (Red Hat) 2018-08-21 21:23:13 UTC
https://github.com/ceph/ceph-ansible/pull/3047 was backported to the stable-3.1 branch today, and we need an upstream tag for stable-3.1 with this change.

Comment 19 subhash 2018-08-23 11:25:22 UTC
Verified with the following version: ceph-ansible-3.1.0-0.1.rc21.el7cp

steps:

1. Deployed a Ceph 2.5 cluster (10.2.10-28) and unset the sortbitwise flag
2. Upgraded ceph-ansible to 3.1 and upgraded the cluster through rolling_update.yml

The upgrade playbook ran fine and the PGs are in active+clean state. Moving to VERIFIED state.

Comment 22 errata-xmlrpc 2018-09-26 18:22:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819