Bug 1832993
| Summary: | RHCS 4.0: OSDs flapping issue as workload is applied on the cluster |
|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage |
| Component: | RADOS |
| Version: | 4.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED NEXTRELEASE |
| Severity: | urgent |
| Priority: | unspecified |
| Target Milestone: | z1 |
| Target Release: | 4.1 |
| Reporter: | karan singh <karan> |
| Assignee: | Neha Ojha <nojha> |
| QA Contact: | Manohar Murthy <mmurthy> |
| CC: | akupczyk, bhubbard, ceph-eng-bugs, dzafman, kchai, nojha, rzarzyns, sseshasa, vumrao |
| Type: | Bug |
| Last Closed: | 2020-05-10 10:00:43 UTC |
Description
karan singh
2020-05-07 15:20:46 UTC
Created attachment 1686216 [details]
Logs from one of the OSD
Attached are the logs from one of the OSDs.
Here are the outputs of some Ceph commands:
[root@rgw-4 /]# ceph -s
cluster:
id: ebe0aa4b-4fb5-4c68-84ab-cbf1118937a2
health: HEALTH_WARN
1 osds down
Reduced data availability: 3 pgs inactive, 3 pgs incomplete
Degraded data redundancy: 448392/146731677 objects degraded (0.306%), 345 pgs degraded
24 slow ops, oldest one blocked for 4014 sec, daemons [mon,rgw-1,mon,rgw-2,mon,rgw-3] have slow ops.
services:
mon: 3 daemons, quorum rgw-1,rgw-2,rgw-3 (age 47m)
mgr: rgw-5(active, since 38m), standbys: rgw-4, rgw-6
osd: 318 osds: 317 up (since 3s), 318 in (since 47h); 21 remapped pgs
rgw: 12 daemons active (rgw-1.rgw0, rgw-1.rgw1, rgw-2.rgw0, rgw-2.rgw1, rgw-3.rgw0, rgw-3.rgw1, rgw-4.rgw0, rgw-4.rgw1, rgw-5.rgw0, rgw-5.rgw1, rgw-6.rgw0, rgw-6.rgw1)
task status:
data:
pools: 7 pools, 21760 pgs
objects: 24.46M objects, 1.4 TiB
usage: 260 TiB used, 4.5 PiB / 4.8 PiB avail
pgs: 0.142% pgs not active
448392/146731677 objects degraded (0.306%)
21395 active+clean
269 active+undersized+degraded+remapped+backfill_wait
41 active+recovery_wait+undersized+degraded+remapped
28 activating+undersized+degraded+remapped
14 active+undersized
4 active+undersized+degraded
3 remapped+incomplete
3 active+undersized+remapped+backfill_wait
2 active+undersized+degraded+remapped+backfilling
1 active+recovering+undersized+degraded+remapped
io:
client: 1.7 KiB/s rd, 1 op/s rd, 0 op/s wr
recovery: 9.8 KiB/s, 0 objects/s
[root@rgw-4 /]#
[root@rgw-4 /]# ceph df
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 4.8 PiB 4.5 PiB 260 TiB 260 TiB 5.34
TOTAL 4.8 PiB 4.5 PiB 260 TiB 260 TiB 5.34
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
.rgw.root 10 1.2 KiB 4 768 KiB 0 1.4 PiB
default.rgw.control 11 0 B 8 0 B 0 1.5 PiB
default.rgw.meta 12 302 KiB 1.25k 235 MiB 0 1.4 PiB
default.rgw.log 13 0 B 208 0 B 0 1.4 PiB
default.rgw.buckets.index 14 0 B 626 0 B 0 1.4 PiB
default.rgw.data.root 15 0 B 0 0 B 0 1.4 PiB
default.rgw.buckets.data 16 1.5 TiB 24.45M 8.7 TiB 0.20 2.9 PiB
[root@rgw-4 /]#
[root@rgw-4 /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 4878.71826 root default
-7 813.11523 host rgw-1
0 hdd 15.34180 osd.0 up 1.00000 1.00000
1 hdd 15.34180 osd.1 up 1.00000 1.00000
2 hdd 15.34180 osd.2 up 1.00000 1.00000
3 hdd 15.34180 osd.3 up 1.00000 1.00000
4 hdd 15.34180 osd.4 up 1.00000 1.00000
5 hdd 15.34180 osd.5 up 1.00000 1.00000
6 hdd 15.34180 osd.6 up 1.00000 1.00000
7 hdd 15.34180 osd.7 up 1.00000 1.00000
8 hdd 15.34180 osd.8 up 1.00000 1.00000
9 hdd 15.34180 osd.9 up 1.00000 1.00000
10 hdd 15.34180 osd.10 up 1.00000 1.00000
11 hdd 15.34180 osd.11 up 1.00000 1.00000
14 hdd 15.34180 osd.14 up 1.00000 1.00000
15 hdd 15.34180 osd.15 up 1.00000 1.00000
18 hdd 15.34180 osd.18 up 1.00000 1.00000
19 hdd 15.34180 osd.19 up 1.00000 1.00000
20 hdd 15.34180 osd.20 up 1.00000 1.00000
23 hdd 15.34180 osd.23 up 1.00000 1.00000
24 hdd 15.34180 osd.24 up 1.00000 1.00000
26 hdd 15.34180 osd.26 up 1.00000 1.00000
28 hdd 15.34180 osd.28 up 1.00000 1.00000
29 hdd 15.34180 osd.29 up 1.00000 1.00000
32 hdd 15.34180 osd.32 up 1.00000 1.00000
33 hdd 15.34180 osd.33 up 1.00000 1.00000
35 hdd 15.34180 osd.35 up 1.00000 1.00000
37 hdd 15.34180 osd.37 up 1.00000 1.00000
38 hdd 15.34180 osd.38 up 1.00000 1.00000
41 hdd 15.34180 osd.41 up 1.00000 1.00000
42 hdd 15.34180 osd.42 up 1.00000 1.00000
45 hdd 15.34180 osd.45 up 1.00000 1.00000
46 hdd 15.34180 osd.46 up 1.00000 1.00000
47 hdd 15.34180 osd.47 up 1.00000 1.00000
50 hdd 15.34180 osd.50 up 1.00000 1.00000
51 hdd 15.34180 osd.51 up 1.00000 1.00000
54 hdd 15.34180 osd.54 up 1.00000 1.00000
55 hdd 15.34180 osd.55 up 1.00000 1.00000
58 hdd 15.34180 osd.58 up 1.00000 1.00000
59 hdd 15.34180 osd.59 up 1.00000 1.00000
61 hdd 15.34180 osd.61 up 1.00000 1.00000
63 hdd 15.34180 osd.63 up 1.00000 1.00000
65 hdd 15.34180 osd.65 up 1.00000 1.00000
67 hdd 15.34180 osd.67 up 1.00000 1.00000
68 hdd 15.34180 osd.68 up 1.00000 1.00000
71 hdd 15.34180 osd.71 up 1.00000 1.00000
72 hdd 15.34180 osd.72 up 1.00000 1.00000
75 hdd 15.34180 osd.75 up 1.00000 1.00000
76 hdd 15.34180 osd.76 up 1.00000 1.00000
79 hdd 15.34180 osd.79 up 1.00000 1.00000
80 hdd 15.34180 osd.80 up 1.00000 1.00000
83 hdd 15.34180 osd.83 up 1.00000 1.00000
84 hdd 15.34180 osd.84 up 1.00000 1.00000
87 hdd 15.34180 osd.87 up 1.00000 1.00000
88 hdd 15.34180 osd.88 up 1.00000 1.00000
-5 813.11523 host rgw-2
13 hdd 15.34180 osd.13 up 1.00000 1.00000
17 hdd 15.34180 osd.17 up 1.00000 1.00000
22 hdd 15.34180 osd.22 up 1.00000 1.00000
27 hdd 15.34180 osd.27 up 1.00000 1.00000
31 hdd 15.34180 osd.31 up 1.00000 1.00000
36 hdd 15.34180 osd.36 up 1.00000 1.00000
40 hdd 15.34180 osd.40 up 1.00000 1.00000
44 hdd 15.34180 osd.44 up 1.00000 1.00000
49 hdd 15.34180 osd.49 up 1.00000 1.00000
53 hdd 15.34180 osd.53 up 1.00000 1.00000
57 hdd 15.34180 osd.57 up 1.00000 1.00000
62 hdd 15.34180 osd.62 up 1.00000 1.00000
66 hdd 15.34180 osd.66 up 1.00000 1.00000
70 hdd 15.34180 osd.70 up 1.00000 1.00000
74 hdd 15.34180 osd.74 up 1.00000 1.00000
78 hdd 15.34180 osd.78 up 1.00000 1.00000
82 hdd 15.34180 osd.82 up 1.00000 1.00000
86 hdd 15.34180 osd.86 up 1.00000 1.00000
90 hdd 15.34180 osd.90 up 1.00000 1.00000
92 hdd 15.34180 osd.92 up 1.00000 1.00000
94 hdd 15.34180 osd.94 up 1.00000 1.00000
96 hdd 15.34180 osd.96 up 1.00000 1.00000
98 hdd 15.34180 osd.98 up 1.00000 1.00000
100 hdd 15.34180 osd.100 up 1.00000 1.00000
102 hdd 15.34180 osd.102 up 1.00000 1.00000
104 hdd 15.34180 osd.104 up 1.00000 1.00000
106 hdd 15.34180 osd.106 up 1.00000 1.00000
108 hdd 15.34180 osd.108 up 1.00000 1.00000
110 hdd 15.34180 osd.110 up 1.00000 1.00000
112 hdd 15.34180 osd.112 up 1.00000 1.00000
114 hdd 15.34180 osd.114 up 1.00000 1.00000
116 hdd 15.34180 osd.116 up 1.00000 1.00000
118 hdd 15.34180 osd.118 up 1.00000 1.00000
120 hdd 15.34180 osd.120 up 1.00000 1.00000
122 hdd 15.34180 osd.122 up 1.00000 1.00000
124 hdd 15.34180 osd.124 up 1.00000 1.00000
126 hdd 15.34180 osd.126 up 1.00000 1.00000
128 hdd 15.34180 osd.128 up 1.00000 1.00000
130 hdd 15.34180 osd.130 up 1.00000 1.00000
132 hdd 15.34180 osd.132 up 1.00000 1.00000
134 hdd 15.34180 osd.134 up 1.00000 1.00000
136 hdd 15.34180 osd.136 up 1.00000 1.00000
138 hdd 15.34180 osd.138 up 1.00000 1.00000
140 hdd 15.34180 osd.140 up 1.00000 1.00000
142 hdd 15.34180 osd.142 up 1.00000 1.00000
144 hdd 15.34180 osd.144 up 1.00000 1.00000
146 hdd 15.34180 osd.146 up 1.00000 1.00000
148 hdd 15.34180 osd.148 up 1.00000 1.00000
150 hdd 15.34180 osd.150 up 1.00000 1.00000
152 hdd 15.34180 osd.152 up 1.00000 1.00000
154 hdd 15.34180 osd.154 up 1.00000 1.00000
156 hdd 15.34180 osd.156 up 1.00000 1.00000
158 hdd 15.34180 osd.158 up 1.00000 1.00000
-3 813.11523 host rgw-3
43 hdd 15.34180 osd.43 up 1.00000 1.00000
52 hdd 15.34180 osd.52 up 1.00000 1.00000
107 hdd 15.34180 osd.107 up 1.00000 1.00000
109 hdd 15.34180 osd.109 up 1.00000 1.00000
111 hdd 15.34180 osd.111 up 1.00000 1.00000
113 hdd 15.34180 osd.113 up 1.00000 1.00000
115 hdd 15.34180 osd.115 up 1.00000 1.00000
117 hdd 15.34180 osd.117 up 1.00000 1.00000
119 hdd 15.34180 osd.119 up 1.00000 1.00000
121 hdd 15.34180 osd.121 up 1.00000 1.00000
123 hdd 15.34180 osd.123 up 1.00000 1.00000
125 hdd 15.34180 osd.125 up 1.00000 1.00000
127 hdd 15.34180 osd.127 up 1.00000 1.00000
129 hdd 15.34180 osd.129 up 1.00000 1.00000
131 hdd 15.34180 osd.131 up 1.00000 1.00000
133 hdd 15.34180 osd.133 up 1.00000 1.00000
135 hdd 15.34180 osd.135 up 1.00000 1.00000
137 hdd 15.34180 osd.137 up 1.00000 1.00000
139 hdd 15.34180 osd.139 up 1.00000 1.00000
141 hdd 15.34180 osd.141 up 1.00000 1.00000
143 hdd 15.34180 osd.143 up 1.00000 1.00000
145 hdd 15.34180 osd.145 up 1.00000 1.00000
147 hdd 15.34180 osd.147 up 1.00000 1.00000
149 hdd 15.34180 osd.149 up 1.00000 1.00000
151 hdd 15.34180 osd.151 up 1.00000 1.00000
153 hdd 15.34180 osd.153 up 1.00000 1.00000
155 hdd 15.34180 osd.155 up 1.00000 1.00000
157 hdd 15.34180 osd.157 up 1.00000 1.00000
187 hdd 15.34180 osd.187 up 1.00000 1.00000
188 hdd 15.34180 osd.188 up 1.00000 1.00000
190 hdd 15.34180 osd.190 up 1.00000 1.00000
191 hdd 15.34180 osd.191 up 1.00000 1.00000
192 hdd 15.34180 osd.192 up 1.00000 1.00000
193 hdd 15.34180 osd.193 up 1.00000 1.00000
194 hdd 15.34180 osd.194 up 1.00000 1.00000
196 hdd 15.34180 osd.196 up 1.00000 1.00000
197 hdd 15.34180 osd.197 up 1.00000 1.00000
198 hdd 15.34180 osd.198 up 1.00000 1.00000
199 hdd 15.34180 osd.199 up 1.00000 1.00000
200 hdd 15.34180 osd.200 up 1.00000 1.00000
202 hdd 15.34180 osd.202 up 1.00000 1.00000
203 hdd 15.34180 osd.203 up 1.00000 1.00000
204 hdd 15.34180 osd.204 up 1.00000 1.00000
205 hdd 15.34180 osd.205 up 1.00000 1.00000
206 hdd 15.34180 osd.206 up 1.00000 1.00000
208 hdd 15.34180 osd.208 up 1.00000 1.00000
209 hdd 15.34180 osd.209 up 1.00000 1.00000
210 hdd 15.34180 osd.210 up 1.00000 1.00000
211 hdd 15.34180 osd.211 up 1.00000 1.00000
212 hdd 15.34180 osd.212 up 1.00000 1.00000
214 hdd 15.34180 osd.214 up 1.00000 1.00000
215 hdd 15.34180 osd.215 up 1.00000 1.00000
216 hdd 15.34180 osd.216 up 1.00000 1.00000
-9 813.11523 host rgw-4
265 hdd 15.34180 osd.265 up 1.00000 1.00000
268 hdd 15.34180 osd.268 up 1.00000 1.00000
271 hdd 15.34180 osd.271 up 1.00000 1.00000
273 hdd 15.34180 osd.273 up 1.00000 1.00000
276 hdd 15.34180 osd.276 up 1.00000 1.00000
279 hdd 15.34180 osd.279 up 1.00000 1.00000
281 hdd 15.34180 osd.281 up 1.00000 1.00000
284 hdd 15.34180 osd.284 up 1.00000 1.00000
287 hdd 15.34180 osd.287 up 1.00000 1.00000
290 hdd 15.34180 osd.290 up 1.00000 1.00000
293 hdd 15.34180 osd.293 up 1.00000 1.00000
296 hdd 15.34180 osd.296 up 1.00000 1.00000
299 hdd 15.34180 osd.299 up 1.00000 1.00000
301 hdd 15.34180 osd.301 up 1.00000 1.00000
304 hdd 15.34180 osd.304 up 1.00000 1.00000
307 hdd 15.34180 osd.307 up 1.00000 1.00000
310 hdd 15.34180 osd.310 up 1.00000 1.00000
313 hdd 15.34180 osd.313 up 1.00000 1.00000
316 hdd 15.34180 osd.316 up 1.00000 1.00000
319 hdd 15.34180 osd.319 up 1.00000 1.00000
321 hdd 15.34180 osd.321 up 1.00000 1.00000
324 hdd 15.34180 osd.324 up 1.00000 1.00000
327 hdd 15.34180 osd.327 up 1.00000 1.00000
329 hdd 15.34180 osd.329 up 1.00000 1.00000
332 hdd 15.34180 osd.332 up 1.00000 1.00000
335 hdd 15.34180 osd.335 up 1.00000 1.00000
338 hdd 15.34180 osd.338 up 1.00000 1.00000
341 hdd 15.34180 osd.341 up 1.00000 1.00000
344 hdd 15.34180 osd.344 up 1.00000 1.00000
347 hdd 15.34180 osd.347 up 1.00000 1.00000
349 hdd 15.34180 osd.349 up 1.00000 1.00000
352 hdd 15.34180 osd.352 up 1.00000 1.00000
354 hdd 15.34180 osd.354 up 1.00000 1.00000
357 hdd 15.34180 osd.357 up 1.00000 1.00000
360 hdd 15.34180 osd.360 up 1.00000 1.00000
363 hdd 15.34180 osd.363 up 1.00000 1.00000
366 hdd 15.34180 osd.366 up 1.00000 1.00000
369 hdd 15.34180 osd.369 up 1.00000 1.00000
372 hdd 15.34180 osd.372 up 1.00000 1.00000
375 hdd 15.34180 osd.375 up 1.00000 1.00000
378 hdd 15.34180 osd.378 up 1.00000 1.00000
380 hdd 15.34180 osd.380 up 1.00000 1.00000
382 hdd 15.34180 osd.382 up 1.00000 1.00000
385 hdd 15.34180 osd.385 up 1.00000 1.00000
388 hdd 15.34180 osd.388 up 1.00000 1.00000
391 hdd 15.34180 osd.391 up 1.00000 1.00000
394 hdd 15.34180 osd.394 up 1.00000 1.00000
397 hdd 15.34180 osd.397 up 1.00000 1.00000
399 hdd 15.34180 osd.399 down 1.00000 1.00000
402 hdd 15.34180 osd.402 up 1.00000 1.00000
404 hdd 15.34180 osd.404 up 1.00000 1.00000
407 hdd 15.34180 osd.407 up 1.00000 1.00000
410 hdd 15.34180 osd.410 up 1.00000 1.00000
-13 813.14203 host rgw-5
189 hdd 15.34279 osd.189 up 1.00000 1.00000
195 hdd 15.34279 osd.195 up 1.00000 1.00000
201 hdd 15.34180 osd.201 up 1.00000 1.00000
207 hdd 15.34180 osd.207 up 1.00000 1.00000
213 hdd 15.34180 osd.213 up 1.00000 1.00000
217 hdd 15.34180 osd.217 up 1.00000 1.00000
218 hdd 15.34279 osd.218 up 1.00000 1.00000
219 hdd 15.34180 osd.219 up 1.00000 1.00000
220 hdd 15.34279 osd.220 up 1.00000 1.00000
221 hdd 15.34180 osd.221 up 1.00000 1.00000
222 hdd 15.34279 osd.222 up 1.00000 1.00000
223 hdd 15.34180 osd.223 up 1.00000 1.00000
224 hdd 15.34279 osd.224 up 1.00000 1.00000
225 hdd 15.34180 osd.225 up 1.00000 1.00000
226 hdd 15.34279 osd.226 up 1.00000 1.00000
227 hdd 15.34180 osd.227 up 1.00000 1.00000
228 hdd 15.34279 osd.228 up 1.00000 1.00000
229 hdd 15.34180 osd.229 up 1.00000 1.00000
230 hdd 15.34279 osd.230 up 1.00000 1.00000
231 hdd 15.34180 osd.231 up 1.00000 1.00000
232 hdd 15.34279 osd.232 up 1.00000 1.00000
233 hdd 15.34180 osd.233 up 1.00000 1.00000
234 hdd 15.34180 osd.234 up 1.00000 1.00000
235 hdd 15.34279 osd.235 up 1.00000 1.00000
236 hdd 15.34180 osd.236 up 1.00000 1.00000
237 hdd 15.34279 osd.237 up 1.00000 1.00000
238 hdd 15.34180 osd.238 up 1.00000 1.00000
239 hdd 15.34279 osd.239 up 1.00000 1.00000
240 hdd 15.34180 osd.240 up 1.00000 1.00000
241 hdd 15.34279 osd.241 up 1.00000 1.00000
242 hdd 15.34180 osd.242 up 1.00000 1.00000
243 hdd 15.34279 osd.243 up 1.00000 1.00000
244 hdd 15.34180 osd.244 up 1.00000 1.00000
245 hdd 15.34279 osd.245 up 1.00000 1.00000
246 hdd 15.34180 osd.246 up 1.00000 1.00000
247 hdd 15.34279 osd.247 up 1.00000 1.00000
248 hdd 15.34180 osd.248 up 1.00000 1.00000
249 hdd 15.34279 osd.249 up 1.00000 1.00000
250 hdd 15.34180 osd.250 up 1.00000 1.00000
251 hdd 15.34279 osd.251 up 1.00000 1.00000
252 hdd 15.34180 osd.252 up 1.00000 1.00000
253 hdd 15.34279 osd.253 up 1.00000 1.00000
254 hdd 15.34180 osd.254 up 1.00000 1.00000
255 hdd 15.34279 osd.255 up 1.00000 1.00000
256 hdd 15.34180 osd.256 up 1.00000 1.00000
257 hdd 15.34279 osd.257 up 1.00000 1.00000
258 hdd 15.34180 osd.258 up 1.00000 1.00000
259 hdd 15.34279 osd.259 up 1.00000 1.00000
260 hdd 15.34180 osd.260 up 1.00000 1.00000
261 hdd 15.34279 osd.261 up 1.00000 1.00000
262 hdd 15.34279 osd.262 up 1.00000 1.00000
263 hdd 15.34279 osd.263 up 1.00000 1.00000
264 hdd 15.34279 osd.264 up 1.00000 1.00000
-11 813.11523 host rgw-6
12 hdd 15.34180 osd.12 up 1.00000 1.00000
16 hdd 15.34180 osd.16 up 1.00000 1.00000
21 hdd 15.34180 osd.21 up 1.00000 1.00000
25 hdd 15.34180 osd.25 up 1.00000 1.00000
30 hdd 15.34180 osd.30 up 1.00000 1.00000
34 hdd 15.34180 osd.34 up 1.00000 1.00000
39 hdd 15.34180 osd.39 up 1.00000 1.00000
48 hdd 15.34180 osd.48 up 1.00000 1.00000
56 hdd 15.34180 osd.56 up 1.00000 1.00000
60 hdd 15.34180 osd.60 up 1.00000 1.00000
64 hdd 15.34180 osd.64 up 1.00000 1.00000
69 hdd 15.34180 osd.69 up 1.00000 1.00000
73 hdd 15.34180 osd.73 up 1.00000 1.00000
77 hdd 15.34180 osd.77 up 1.00000 1.00000
81 hdd 15.34180 osd.81 up 1.00000 1.00000
85 hdd 15.34180 osd.85 up 1.00000 1.00000
89 hdd 15.34180 osd.89 up 1.00000 1.00000
91 hdd 15.34180 osd.91 up 1.00000 1.00000
93 hdd 15.34180 osd.93 up 1.00000 1.00000
95 hdd 15.34180 osd.95 up 1.00000 1.00000
97 hdd 15.34180 osd.97 up 1.00000 1.00000
99 hdd 15.34180 osd.99 up 1.00000 1.00000
101 hdd 15.34180 osd.101 up 1.00000 1.00000
103 hdd 15.34180 osd.103 up 1.00000 1.00000
105 hdd 15.34180 osd.105 up 1.00000 1.00000
159 hdd 15.34180 osd.159 up 1.00000 1.00000
160 hdd 15.34180 osd.160 up 1.00000 1.00000
161 hdd 15.34180 osd.161 up 1.00000 1.00000
162 hdd 15.34180 osd.162 up 1.00000 1.00000
163 hdd 15.34180 osd.163 up 1.00000 1.00000
164 hdd 15.34180 osd.164 up 1.00000 1.00000
165 hdd 15.34180 osd.165 up 1.00000 1.00000
166 hdd 15.34180 osd.166 up 1.00000 1.00000
167 hdd 15.34180 osd.167 up 1.00000 1.00000
168 hdd 15.34180 osd.168 up 1.00000 1.00000
169 hdd 15.34180 osd.169 up 1.00000 1.00000
170 hdd 15.34180 osd.170 up 1.00000 1.00000
171 hdd 15.34180 osd.171 up 1.00000 1.00000
172 hdd 15.34180 osd.172 up 1.00000 1.00000
173 hdd 15.34180 osd.173 up 1.00000 1.00000
174 hdd 15.34180 osd.174 up 1.00000 1.00000
175 hdd 15.34180 osd.175 up 1.00000 1.00000
176 hdd 15.34180 osd.176 up 1.00000 1.00000
177 hdd 15.34180 osd.177 up 1.00000 1.00000
178 hdd 15.34180 osd.178 up 1.00000 1.00000
179 hdd 15.34180 osd.179 up 1.00000 1.00000
180 hdd 15.34180 osd.180 up 1.00000 1.00000
181 hdd 15.34180 osd.181 up 1.00000 1.00000
182 hdd 15.34180 osd.182 up 1.00000 1.00000
183 hdd 15.34180 osd.183 up 1.00000 1.00000
184 hdd 15.34180 osd.184 up 1.00000 1.00000
185 hdd 15.34180 osd.185 up 1.00000 1.00000
186 hdd 15.34180 osd.186 up 1.00000 1.00000
[root@rgw-4 /]#
[root@rgw-4 /]# ceph osd dump | grep -i pool
pool 10 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 12075 lfor 0/0/12064 flags hashpspool stripe_width 0 application rgw
pool 11 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 12075 lfor 0/0/12068 flags hashpspool stripe_width 0 application rgw
pool 12 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 12075 lfor 0/0/12066 flags hashpspool stripe_width 0 application rgw
pool 13 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 12075 lfor 0/0/12068 flags hashpspool stripe_width 0 application rgw
pool 14 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 12172 lfor 0/0/12085 flags hashpspool stripe_width 0 application rgw
pool 15 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 12173 flags hashpspool stripe_width 0 application rgw
pool 16 'default.rgw.buckets.data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode warn last_change 12174 lfor 0/0/12162 flags hashpspool stripe_width 16384 application rgw
[root@rgw-4 /]#
Created attachment 1686217 [details]
COSBench Workload file
Attaching the COSBench workload file.
To clarify the ceph -s output, here is the pattern:
- Multiple OSDs fail/flap from one node (5-10 at a time)
- All 53 OSDs on one node flap (but come back up within, say, 30 seconds)
I am also seeing slow ops:
[root@rgw-4 /]# ceph -s
cluster:
id: ebe0aa4b-4fb5-4c68-84ab-cbf1118937a2
health: HEALTH_WARN
1 osds down
Reduced data availability: 2 pgs inactive, 19 pgs incomplete
Degraded data redundancy: 491400/146999901 objects degraded (0.334%), 339 pgs degraded
24 slow ops, oldest one blocked for 4335 sec, daemons [mon,rgw-1,mon,rgw-2,mon,rgw-3] have slow ops.
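A minimal sketch of how slow ops like the ones flagged above could be inspected on a cluster in this state, assuming admin-socket access on the mon and OSD hosts (the daemon names below are taken from the output in this report; adjust as needed):

```
# Summary of the health warning, including which daemons report slow ops
ceph health detail

# In-flight ops currently tracked by one of the monitors (run on the mon host)
ceph daemon mon.rgw-1 ops

# Recent slow ops recorded by a given OSD (run on the host carrying that OSD)
ceph daemon osd.399 dump_historic_slow_ops
```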
Logs from one of the OSDs:
2020-05-07 08:51:17.141 7fd8360af700 1 osd.399 pg_epoch: 15005 pg[16.3368s5( v 14993'1582 (14977'1572,14993'1582] lb MIN (bitwise) local-lis/les=14995/14996 n=0 ec=12150/12099 lis/c 15002/13313 les/c/f 15003/13317/0 15004/15005/14022) [35,56,203,74,254,399]/[35,56,203,74,254,2147483647]p35(0) r=-1 lpr=15005 pi=[13313,15005)/4 crt=14993'1582 lcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2020-05-07 08:51:17.141 7fd8360af700 1 osd.399 pg_epoch: 15005 pg[16.35f7s5( v 14998'1548 (14929'1538,14998'1548] lb MIN (bitwise) local-lis/les=15000/15001 n=0 ec=12152/12099 lis/c 15002/13292 les/c/f 15003/13295/0 15004/15005/12251) [125,103,116,47,244,399]/[125,103,116,47,244,2147483647]p125(0) r=-1 lpr=15005 pi=[13292,15005)/3 crt=14998'1548 lcod 0'0 remapped NOTIFY mbc={}] start_peering_interval up [125,103,116,47,244,399] -> [125,103,116,47,244,399], acting [125,103,116,47,244,399] -> [125,103,116,47,244,2147483647], acting_primary 125(0) -> 125, up_primary 125(0) -> 125, role 5 -> -1, features acting 4611087854031667199 upacting 4611087854031667199
2020-05-07 08:51:17.141 7fd8360af700 1 osd.399 pg_epoch: 15005 pg[16.35f7s5( v 14998'1548 (14929'1538,14998'1548] lb MIN (bitwise) local-lis/les=15000/15001 n=0 ec=12152/12099 lis/c 15002/13292 les/c/f 15003/13295/0 15004/15005/12251) [125,103,116,47,244,399]/[125,103,116,47,244,2147483647]p125(0) r=-1 lpr=15005 pi=[13292,15005)/3 crt=14998'1548 lcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2020-05-07 08:51:18.152 7fd8340ab700 0 log_channel(cluster) log [INF] : 16.3452s0 continuing backfill to osd.249(2) from (14980'1610,15001'1627] MIN to 15001'1627
2020-05-07 08:51:18.152 7fd8350ad700 0 log_channel(cluster) log [INF] : 16.1b78s0 continuing backfill to osd.65(5) from (14969'1510,15001'1524] MIN to 15001'1524
2020-05-07 08:51:18.155 7fd8340ab700 0 log_channel(cluster) log [INF] : 16.267es0 continuing backfill to osd.65(3) from (14969'1539,15001'1553] MIN to 15001'1553
/builddir/build/BUILD/ceph-14.2.4/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::add(const pg_log_entry_t&, bool)' thread 7fd8360af700 time 2020-05-07 08:51:18.171087
/builddir/build/BUILD/ceph-14.2.4/src/osd/PGLog.h: 511: FAILED ceph_assert(head.version == 0 || e.version.version > head.version)
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x561590c1234c]
2: (()+0x51f566) [0x561590c12566]
3: (bool PGLog::append_log_entries_update_missing<pg_missing_set<true> >(hobject_t const&, bool, std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> > const&, bool, PGLog::IndexedLog*, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0xaed) [0x561590e4442d]
4: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xf6b) [0x561590e37a6b]
5: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x68) [0x561590d8e278]
6: (PG::RecoveryState::Stray::react(MLogRec const&)+0x23b) [0x561590dce9ab]
7: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x561590e2aa35]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x561590df8bda]
9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2c2) [0x561590de98e2]
10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2bc) [0x561590d2a36c]
11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x561590fa84b5]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1366) [0x561590d26c36]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x561591306134]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x561591308cf4]
15: (()+0x82de) [0x7fd8575ac2de]
16: (clone()+0x43) [0x7fd856356133]
*** Caught signal (Aborted) **
in thread 7fd8360af700 thread_name:tp_osd_tp
2020-05-07 08:51:18.174 7fd8360af700 -1 /builddir/build/BUILD/ceph-14.2.4/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::add(const pg_log_entry_t&, bool)' thread 7fd8360af700 time 2020-05-07 08:51:18.171087
/builddir/build/BUILD/ceph-14.2.4/src/osd/PGLog.h: 511: FAILED ceph_assert(head.version == 0 || e.version.version > head.version)
Suspected issue:
https://tracker.ceph.com/issues/44532
https://github.com/ceph/ceph/pull/33910
Some more commentary per my discussion with Vikhyat: the cluster was running fine and I ran several COSBench tests successfully. Then, all of a sudden, when I ran another test, I started to see this flapping issue. I stopped the workload and restarted the OSDs that had been down for a long time, and the cluster became healthy again. After that I re-applied the load from COSBench and again started to see flapping issues with the OSDs. When I run watch "ceph -s", I can see a random number of OSDs going down (1, 2, 10, or even 53 OSDs), and within a few seconds they come back; this is a repeated pattern.

I received wonderful support from Vikhyat and Neha in this case, many thanks guys, you ROCK!! Happy to report that the changes mentioned by Neha, i.e. switching back to the default values for the pg_log* tunables, have fixed the issue. I have not seen a similar assert since making those changes, so this is not a bug. Going forward, my plan is to ingest 10 billion RADOS objects via 12 RGWs onto this cluster. If I encounter this again, I will probably re-open this. Until then, all good :)

(In reply to karan singh from comment #14)
> I received wonderful support from Vikhyat and Neha in this case, many thanks
> guys, you ROCK!!
>
> Happy to report that the changes mentioned by Neha, i.e. switching back to the
> default values for the pg_log* tunables, have fixed the issue. I have not seen
> a similar assert since making those changes, so this is not a bug.

Well, if we allow our users to change from the default settings, it is a bug.

> Going forward, my plan is to ingest 10 billion RADOS objects via 12 RGWs onto
> this cluster. If I encounter this again, I will probably re-open this. Until
> then, all good :)

Based on the above, closing. Please re-open if we need a fix for 4.x.
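The report does not say exactly which pg_log* options had been tuned, so here is a hedged sketch of how such overrides could be checked and reverted to the built-in defaults on a Nautilus-based cluster; the option names listed below are the usual pg_log-related settings and are assumptions, not values quoted from this cluster:

```
# Sketch: revert pg_log-related overrides so the built-in defaults apply again.
# Assumes the overrides were set in the monitor config store via "ceph config set";
# if they live in ceph.conf instead (e.g. pushed by ceph-ansible), remove them there
# and restart the OSDs.

# 1. List any pg_log overrides currently stored in the mon config store.
ceph config dump | grep -i pg_log

# 2. Drop the cluster-wide overrides (only for options that were actually set).
ceph config rm osd osd_min_pg_log_entries
ceph config rm osd osd_max_pg_log_entries
ceph config rm osd osd_pg_log_dups_tracked
ceph config rm osd osd_pg_log_trim_min

# 3. Confirm the values a running OSD is now using (via its admin socket).
ceph daemon osd.0 config show | grep -i pg_log
```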