Description of problem:
After replacing the disk, `ceph osd tree` still shows the old disk entry.

Reference document: https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/v1.3/replace-osds.adoc

Version-Release number of selected component (if applicable):
1.3.1
ceph-0.94.3-2.el7cp.x86_64

How reproducible:
NA

Steps to Reproduce:
1. Followed the document to replace the failed drive with the new drive.
2. Output of `ceph osd tree` after completing steps 1-6:

ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 29.42995 root default
-2  7.62997     host cephqe5
 0  1.09000         osd.0        up  1.00000          1.00000
 1  1.09000         osd.1        up  1.00000          1.00000
 2  1.09000         osd.2        up  1.00000          1.00000
 3  1.09000         osd.3        up  1.00000          1.00000
 4  1.09000         osd.4        up  1.00000          1.00000
 5  1.09000         osd.5        up  1.00000          1.00000
 6  1.09000         osd.6        up  1.00000          1.00000
-3  8.71999     host cephqe6
 8  1.09000         osd.8        up  1.00000          1.00000
 9  1.09000         osd.9        up  1.00000          1.00000
10  1.09000         osd.10       up  1.00000          1.00000
11  1.09000         osd.11       up  1.00000          1.00000
12  1.09000         osd.12       up  1.00000          1.00000
13  1.09000         osd.13       up  1.00000          1.00000
14  1.09000         osd.14       up  1.00000          1.00000
15  1.09000         osd.15       up  1.00000          1.00000
-5  6.53999     host cephqe8
24  1.09000         osd.24       up  1.00000          1.00000
25  1.09000         osd.25       up  1.00000          1.00000
26  1.09000         osd.26       up  1.00000          1.00000
27  1.09000         osd.27       up  1.00000          1.00000
28  1.09000         osd.28       up  1.00000          1.00000
29  1.09000         osd.29       up  1.00000          1.00000
-4  6.53999     host NEW
16  1.09000         osd.16       up  1.00000          1.00000
17  1.09000         osd.17       up  1.00000          1.00000
18  1.09000         osd.18       up  1.00000          1.00000
19  1.09000         osd.19       up  1.00000          1.00000
20  1.09000         osd.20       up  1.00000          1.00000
21  1.09000         osd.21       up  1.00000          1.00000

3. Then executed step 8, i.e. recreate the OSD: `ceph osd create` returned 7 (the proper OSD number, as osd.7 was removed as part of replacing the failed disk).
4. Then added the new OSD following the document. The OSD gets added properly, but the newly added OSD gets a new OSD number (osd.22), and the older OSD (osd.7) shows as down. Output of `ceph osd tree`:

ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 30.51994 root default
-2  8.71997     host cephqe5
 0  1.09000         osd.0        up  1.00000          1.00000
 1  1.09000         osd.1        up  1.00000          1.00000
 2  1.09000         osd.2        up  1.00000          1.00000
 3  1.09000         osd.3        up  1.00000          1.00000
 4  1.09000         osd.4        up  1.00000          1.00000
 5  1.09000         osd.5        up  1.00000          1.00000
 6  1.09000         osd.6        up  1.00000          1.00000
22  1.09000         osd.22       up  1.00000          1.00000
-3  8.71999     host cephqe6
 8  1.09000         osd.8        up  1.00000          1.00000
 9  1.09000         osd.9        up  1.00000          1.00000
10  1.09000         osd.10       up  1.00000          1.00000
11  1.09000         osd.11       up  1.00000          1.00000
12  1.09000         osd.12       up  1.00000          1.00000
13  1.09000         osd.13       up  1.00000          1.00000
14  1.09000         osd.14       up  1.00000          1.00000
15  1.09000         osd.15       up  1.00000          1.00000
-5  6.53999     host cephqe8
24  1.09000         osd.24       up  1.00000          1.00000
25  1.09000         osd.25       up  1.00000          1.00000
26  1.09000         osd.26       up  1.00000          1.00000
27  1.09000         osd.27       up  1.00000          1.00000
28  1.09000         osd.28       up  1.00000          1.00000
29  1.09000         osd.29       up  1.00000          1.00000
-4  6.53999     host NEW
16  1.09000         osd.16       up  1.00000          1.00000
17  1.09000         osd.17       up  1.00000          1.00000
18  1.09000         osd.18       up  1.00000          1.00000
19  1.09000         osd.19       up  1.00000          1.00000
20  1.09000         osd.20       up  1.00000          1.00000
21  1.09000         osd.21       up  1.00000          1.00000
 7        0         osd.7      down        0          1.00000

Actual results:
osd.7 is showing as down, and the newly added OSD is assigned a new id (osd.22) instead of osd.7.

Expected results:
The newly added OSD should have been osd.7.

Additional info:
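For context, `ceph osd create` is expected to hand out the lowest OSD id not already present in the osdmap. A minimal sketch of that allocation rule (a simplified illustration, not Ceph's actual allocator):

```python
def next_osd_id(existing_ids):
    """Return the lowest non-negative id not already in the osdmap,
    mirroring how `ceph osd create` picks the next free slot."""
    i = 0
    while i in existing_ids:
        i += 1
    return i

# After osd.7 is fully removed, 7 is the lowest free slot
# (this cluster has no osd.22 or osd.23):
ids_after_removal = set(range(30)) - {7, 22, 23}
print(next_osd_id(ids_after_removal))  # 7

# But if an entry for osd.7 still exists in the osdmap when the new
# disk is prepared, allocation skips ahead to 22:
ids_with_stale_7 = set(range(22)) | {24, 25, 26, 27, 28, 29}
print(next_osd_id(ids_with_stale_7))  # 22
```

Under this rule, getting osd.22 instead of osd.7 implies that an osd.7 entry was already back in the osdmap by the time the new OSD was prepared.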
Sam, mind looking into this one (or re-assigning as appropriate)? Is this a bug in the docs (replace-osds.adoc), or something else?
A few more pieces of information: `ceph -s` output shows 29 OSDs, but only 28 up and in. Snippet:

osdmap e695: 29 osds: 28 up, 28 in

The Calamari GUI also shows wrong information: OSD 28/29 In & Up, 1 down. PFA the screenshots (Dashboard and OSD Workbench).
Created attachment 1087142 [details] Calamari_GUI
Created attachment 1087143 [details] Calamari_Dashboard
After restarting the newly added OSD, I am seeing an I/O error, as it is still pointing to the ceph-7 directory. osd.22 does get started, though. PFA the log of osd.22:

[root@cephqe5 ~]# /etc/init.d/ceph stop osd.22
find: ‘/var/lib/ceph/osd/ceph-7’: Input/output error
=== osd.22 ===
Stopping Ceph osd.22 on cephqe5...kill 66039...kill 66039...done
[root@cephqe5 ~]# /etc/init.d/ceph start osd.22
find: ‘/var/lib/ceph/osd/ceph-7’: Input/output error
=== osd.22 ===
create-or-move updated item name 'osd.22' weight 1.09 at location {host=cephqe5,root=default} to crush map
Starting Ceph osd.22 on cephqe5...
Running as unit run-71865.service.
Created attachment 1087157 [details] Log
I don't know why osd.7 wasn't found as the next open slot. I have not been able to reproduce this on v0.94.3; I even did rm/create multiple times in a tight loop to check for a race condition. If the customer created a new OSD before removing the old one, that would explain this. So would the customer creating two new OSDs (osd.7 and osd.22) and removing osd.7 a second time.
Previous comment not important.
Based on the bug description, the instructions are wrong because ceph-deploy must also be doing a "ceph osd create". New instructions are pending, and I've noted a concern that the new instructions might lead to the same problem, along with how to fix it. I've added a comment to https://bugzilla.redhat.com/show_bug.cgi?id=1210539 and am marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 1210539 ***
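The failure mode described here, the documented manual `ceph osd create` plus a second create issued by ceph-deploy, can be sketched as follows (a simplified model of lowest-free-id allocation, not Ceph's actual code):

```python
def osd_create(osdmap):
    """Allocate the lowest free OSD id and record it in the map
    (simplified model of `ceph osd create`)."""
    i = 0
    while i in osdmap:
        i += 1
    osdmap.add(i)
    return i

# osdmap of this cluster after osd.7 was removed per replace-osds.adoc:
osdmap = (set(range(22)) - {7}) | {24, 25, 26, 27, 28, 29}

manual_id = osd_create(osdmap)  # step 8 of the doc: returns 7
deploy_id = osd_create(osdmap)  # ceph-deploy's own create: returns 22

print(manual_id, deploy_id)  # 7 22
```

The manually created osd.7 never gets a daemon or a data directory, so it lingers as a down entry with weight 0, matching the `ceph osd tree` output in the bug description.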