Bug 1600943 - [cee/sd] upgrade RHCS 2 -> RHCS 3 will fail if the cluster still has nibblewise sorting set (sortbitwise unset)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 3.1
Assignee: Sébastien Han
QA Contact: subhash
Docs Contact: John Brier
URL:
Whiteboard:
Depends On:
Blocks: 1584264
 
Reported: 2018-07-13 12:54 UTC by Tomas Petr
Modified: 2021-09-09 15:02 UTC
CC: 14 users

Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.rc21.el7cp Ubuntu: ceph-ansible_3.1.0~rc21-2redhat1
Doc Type: Bug Fix
Doc Text:
.Upgrading {product} 2 to version 3 will set the `sortbitwise` option properly
Previously, a rolling upgrade from {product} 2 to {product} 3 would fail because the OSDs would never initialize. This was because `sortbitwise` was not properly set by Ceph Ansible. With this release, Ceph Ansible sets `sortbitwise` properly, so the OSDs can start.
Clone Of:
Environment:
Last Closed: 2018-09-26 18:22:32 UTC
Embargoed:




Links
Github ceph/ceph-ansible pull 2914 - last updated 2018-07-23 12:58:21 UTC
Github ceph/ceph-ansible pull 3047 - last updated 2018-08-21 09:18:42 UTC
Red Hat Issue Tracker RHCEPH-1636 - last updated 2021-09-09 15:02:54 UTC
Red Hat Product Errata RHBA-2018:2819 - last updated 2018-09-26 18:23:15 UTC

Description Tomas Petr 2018-07-13 12:54:26 UTC
Description of problem:
An upgrade from RHCS 2 to RHCS 3 will fail if the cluster still has nibblewise sorting set (sortbitwise unset);
it stays stuck on "TASK [waiting for clean pgs...]" because RHCS 3 OSDs will not start while nibblewise sorting is in effect.

running "ceph osd set sortbitwise" will fix this.

The ceph-ansible playbook could check for this and fail during the prerequisites check, or set the flag itself.
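
For reference, the manual workaround is a single monitor command. A minimal sketch, assuming the default cluster name "ceph" and admin keyring access on a monitor node:

  # Check the current OSD map flags; without the flag the partially upgraded cluster
  # reports "no legacy OSD present but 'sortbitwise' flag is not set" in ceph health.
  ceph osd dump | grep flags

  # Set the flag so the RHCS 3 (Luminous) OSDs can start.
  ceph osd set sortbitwise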

Version-Release number of selected component (if applicable):
RHCS 3
ceph-ansible-3.0.39

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Sébastien Han 2018-08-08 15:05:31 UTC
In https://github.com/ceph/ceph-ansible/releases/tag/v3.0.41

Comment 4 Sébastien Han 2018-08-08 15:06:09 UTC
Since v3.0.40 actually, sorry.

Comment 5 Sébastien Han 2018-08-08 15:07:36 UTC
and v3.1.0rc14

Comment 10 subhash 2018-08-21 07:26:33 UTC
Verified the issue with the following steps:

1. Deployed a Ceph 2.5 cluster (10.2.10-28) and unset the sortbitwise flag (see the commands sketched below)
2. Upgraded ceph-ansible to 3.1 and upgraded the cluster through rolling_update.yml
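
The pre-upgrade state from step 1 can be reproduced with something like the following on the RHCS 2 (Jewel) cluster (a sketch, assuming admin access on a monitor node):

  # Clear the flag before running rolling_update.yml.
  ceph osd unset sortbitwise

  # Confirm it no longer appears in the OSD map flags.
  ceph osd dump | grep flags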

rolling_update.yml fails at the task below, so moving the BZ back to ASSIGNED:
TASK [waiting for clean pgs...] ******************************************************************************************
task path: /usr/share/ceph-ansible/rolling_update.yml:411
Tuesday 21 August 2018  06:42:51 +0000 (0:00:00.655)       0:12:40.336 ******** 
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<magna021> ESTABLISH SSH CONNECTION FOR USER: None
<magna021> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/root/.ansible/cp/%h-%r-%p magna021 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-dbyoistppgehsreyyylnqrtbmvbmpcsb; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<magna021> (0, '\n{"changed": true, "end": "2018-08-21 06:42:51.835425", "stdout": "\\n{\\"fsid\\":\\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\\",\\"health\\":{\\"summary\\":[{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs degraded\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck degraded\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck unclean\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs stuck undersized\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"64 pgs undersized\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"1 host (3 osds) down\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"3 osds down\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"noout,noscrub,nodeep-scrub flag(s) set\\"},{\\"severity\\":\\"HEALTH_WARN\\",\\"summary\\":\\"no legacy OSD present but \'sortbitwise\' flag is not set\\"}],\\"overall_status\\":\\"HEALTH_WARN\\",\\"detail\\":[]},\\"election_epoch\\":7,\\"quorum\\":[0],\\"quorum_names\\":[\\"magna021\\"],\\"monmap\\":{\\"epoch\\":2,\\"fsid\\":\\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\\",\\"modified\\":\\"2018-08-21 06:34:29.260101\\",\\"created\\":\\"2018-08-21 05:43:32.108976\\",\\"features\\":{\\"persistent\\":[\\"kraken\\",\\"luminous\\"],\\"optional\\":[]},\\"mons\\":[{\\"rank\\":0,\\"name\\":\\"magna021\\",\\"addr\\":\\"10.8.128.21:6789/0\\",\\"public_addr\\":\\"10.8.128.21:6789/0\\"}]},\\"osdmap\\":{\\"osdmap\\":{\\"epoch\\":44,\\"num_osds\\":9,\\"num_up_osds\\":6,\\"num_in_osds\\":9,\\"full\\":false,\\"nearfull\\":false,\\"num_remapped_pgs\\":0}},\\"pgmap\\":{\\"pgs_by_state\\":[{\\"state_name\\":\\"active+undersized+degraded\\",\\"count\\":64}],\\"num_pgs\\":64,\\"num_pools\\":1,\\"num_objects\\":0,\\"data_bytes\\":0,\\"bytes_used\\":1018404864,\\"bytes_avail\\":8948088016896,\\"bytes_total\\":8949106421760},\\"fsmap\\":{\\"epoch\\":1,\\"by_rank\\":[]},\\"mgrmap\\":{\\"epoch\\":60,\\"active_gid\\":14113,\\"active_name\\":\\"magna021\\",\\"active_addr\\":\\"10.8.128.21:6812/185260\\",\\"available\\":true,\\"standbys\\":[],\\"modules\\":[\\"status\\"],\\"available_modules\\":[\\"balancer\\",\\"dashboard\\",\\"influx\\",\\"localpool\\",\\"prometheus\\",\\"restful\\",\\"selftest\\",\\"status\\",\\"zabbix\\"],\\"services\\":{}},\\"servicemap\\":{\\"epoch\\":1,\\"modified\\":\\"0.000000\\",\\"services\\":{}}}", "cmd": ["ceph", "--cluster", "ceph", "-s", "--format", "json"], "rc": 0, "start": "2018-08-21 06:42:51.484577", "stderr": "", "delta": "0:00:00.350848", "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": " ceph --cluster ceph -s --format json", "removes": null, "creates": null, "chdir": null, "stdin": null}}}\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017\r\ndebug1: Reading configuration data /root/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 8: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 167277\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\n')
FAILED - RETRYING: waiting for clean pgs... (40 retries left).Result was: {
    "attempts": 1, 
    "changed": true, 
    "cmd": [
        "ceph", 
        "--cluster", 
        "ceph", 
        "-s", 
        "--format", 
        "json"
    ], 
    "delta": "0:00:00.350848", 
    "end": "2018-08-21 06:42:51.835425", 
    "failed": false, 
    "invocation": {
        "module_args": {
            "_raw_params": " ceph --cluster ceph -s --format json", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "rc": 0, 
    "retries": 41, 
    "start": "2018-08-21 06:42:51.484577", 
    "stderr": "", 
    "stderr_lines": [], 
 "stdout": "\n{\"fsid\":\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\",\"health\":{\"summary\":[{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs degraded\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck degraded\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck unclean\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs stuck undersized\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"64 pgs undersized\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"1 host (3 osds) down\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"3 osds down\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"noout,noscrub,nodeep-scrub flag(s) set\"},{\"severity\":\"HEALTH_WARN\",\"summary\":\"no legacy OSD present but 'sortbitwise' flag is not set\"}],\"overall_status\":\"HEALTH_WARN\",\"detail\":[]},\"election_epoch\":7,\"quorum\":[0],\"quorum_names\":[\"magna021\"],\"monmap\":{\"epoch\":2,\"fsid\":\"eb52f84f-c6eb-4820-8a0a-4f9cc84e20ae\",\"modified\":\"2018-08-21 06:34:29.260101\",\"created\":\"2018-08-21 05:43:32.108976\",\"features\":{\"persistent\":[\"kraken\",\"luminous\"],\"optional\":[]},\"mons\":[{\"rank\":0,\"name\":\"magna021\",\"addr\":\"10.8.128.21:6789/0\",\"public_addr\":\"10.8.128.21:6789/0\"}]},\"osdmap\":{\"osdmap\":{\"epoch\":44,\"num_osds\":9,\"num_up_osds\":6,\"num_in_osds\":9,\"full\":false,\"nearfull\":false,\"num_remapped_pgs\":0}},\"pgmap\":{\"pgs_by_state\":[{\"state_name\":\"active+undersized+degraded\",\"count\":64}],\"num_pgs\":64,\"num_pools\":1,\"num_objects\":0,\"data_bytes\":0,\"bytes_used\":1018404864,\"bytes_avail\":8948088016896,\"bytes_total\":8949106421760},\"fsmap\":{\"epoch\":1,\"by_rank\":[]},\"mgrmap\":{\"epoch\":60,\"active_gid\":14113,\"active_name\":\"magna021\",\"active_addr\":\"10.8.128.21:6812/185260\",\"available\":true,\"standbys\":[],\"modules\":[\"status\"],\"available_modules\":[\"balancer\",\"dashboard\",\"influx\",\"localpool\",\"prometheus\",\"restful\",\"selftest\",\"status\",\"zabbix\"],\"services\":{}},\"servicemap\":{\"epoch\":1,\"modified\":\"0.000000\",\"services\":{}}}",

Comment 12 Sébastien Han 2018-08-21 09:13:34 UTC
Can you tell me why your OSDs did not get updated?
It seems they got the right version of the package, but they still report running ceph version 10.2.10-28.el7cp after the restart.

Can I access the env and run one test?
I believe the OSDs cannot start if the 'sortbitwise' flag is not set, and thus they won't report their newer version.

I just need to trigger the command manually and see if the OSDs start reporting their new version after that, because at the moment they still report the old version.
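
Concretely, the manual test would be along these lines (a sketch, run from a monitor node with the admin keyring):

  # Set the flag that is blocking the restarted OSDs.
  ceph osd set sortbitwise

  # Check whether the OSDs now come up and report the new (Luminous) version.
  ceph osd tree
  ceph tell osd.* version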

Thanks.

Comment 14 Ken Dreyer (Red Hat) 2018-08-21 21:23:13 UTC
https://github.com/ceph/ceph-ansible/pull/3047 was backported to the stable-3.1 branch today, and we need an upstream tag for stable-3.1 with this change.

Comment 19 subhash 2018-08-23 11:25:22 UTC
Verified with the following version: ceph-ansible-3.1.0-0.1.rc21.el7cp

Steps:

1. Deployed a Ceph 2.5 cluster (10.2.10-28) and unset the sortbitwise flag
2. Upgraded ceph-ansible to 3.1 and upgraded the cluster through rolling_update.yml

The upgrade playbook ran fine and the PGs are in active+clean state. Moving to VERIFIED.
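
For reference, the final state can be confirmed with something like:

  # All PGs should be active+clean and 'sortbitwise' should now appear in the flags.
  ceph -s
  ceph osd dump | grep flags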

Comment 22 errata-xmlrpc 2018-09-26 18:22:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819

