Description of problem:
Sometimes not all OSDs are picked correctly: some OSDs that meet the criteria for reweighting are not reweighted.
Version-Release number of selected component (if applicable):
ceph version 0.94.5-6redhat1trusty
Steps to Reproduce:
$ sudo ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE   AVAIL %USE  VAR
0  0.89999 0.80005  926G   782G  143G  84.50 1.10
1  0.89989 0        0      0     0     0     0
2  0.89999 0.80005  926G   649G  276G  70.11 0.91
3  0.89999 1.00000  926G   741G  185G  80.02 1.04
4  0.89999 1.00000  926G   783G  142G  84.61 1.10
5  0.89999 1.00000  926G   647G  278G  69.94 0.91
6  0.89999 0.80005  926G   678G  247G  73.24 0.95
7  0.89999 1.00000  926G   713G  212G  77.09 1.00
8  0.89999 1.00000  926G   737G  188G  79.65 1.03
9  0.89999 1.00000  926G   794G  131G  85.76 1.11
10 0.89999 1.00000  926G   670G  255G  72.41 0.94
11 0.89999 1.00000  926G   646G  279G  69.79 0.91
TOTAL               10186G 7844G 2341G 77.01
MIN/MAX VAR: 0/1.11 STDDEV: 5.95
$ sudo ceph osd test-reweight-by-utilization 105 .05 5
moved 59 / 1944 (3.03498%)
stddev 10.1273 -> 8.28022 (expected baseline 12.6752)
min osd.8 with 190 -> 183 pgs (1.0751 -> 1.03549 * mean)
max osd.6 with 157 -> 168 pgs (0.888374 -> 0.950617 * mean)
osd.9 weight 1.000000 -> 0.950012
osd.4 weight 1.000000 -> 0.950012
osd.0 weight 0.800049 -> 0.750061
osd.6 weight 0.800049 -> 0.841187
osd.2 weight 0.800049 -> 0.850037
osd.10 is at 72.41 %USE, compared to 70.11 %USE for osd.2.
Still, osd.2 is selected ahead of osd.10.
osd.10 should be considered ahead of osd.2.
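For reference, the overloaded side of the dry run can be cross-checked from the `ceph osd df` numbers. The following is a minimal sketch, assuming the first argument (105) is an overload threshold expressed as a percentage of the average utilization. The names `PCT_USE`, `AVG_PCT_USE` and `overloaded` are made up for illustration and are not part of Ceph.

```python
# %USE per OSD, copied from the `ceph osd df` output in this report.
# osd.1 is left out because it reports zero size (REWEIGHT 0).
PCT_USE = {
    0: 84.50, 2: 70.11, 3: 80.02, 4: 84.61, 5: 69.94, 6: 73.24,
    7: 77.09, 8: 79.65, 9: 85.76, 10: 72.41, 11: 69.79,
}
AVG_PCT_USE = 77.01  # %USE from the TOTAL line


def overloaded(pct_use, avg, threshold_pct):
    """Return the ids of OSDs whose %USE exceeds threshold_pct percent of avg.

    Assumption: this mirrors the meaning of the first CLI argument; it is
    not taken from the Ceph source.
    """
    cutoff = avg * threshold_pct / 100.0
    return sorted(osd for osd, use in pct_use.items() if use > cutoff)


if __name__ == "__main__":
    print(overloaded(PCT_USE, AVG_PCT_USE, 105))
```

Under that assumption the result is osd.0, osd.4 and osd.9, the same three OSDs whose weight the dry run lowers.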
This probably should not hold up 1.3.2; an advisory to the user would be the right thing.
Sam, what advisory will be given to the user in this case? Please share the details.
I think Sage would be the right person to ask. Maybe the advisory is that the user should run the test- variant first and verify that the behavior is OK?
Please let us know what advisory will be given to the user in this case.
Right. The user should
ceph osd test-reweight-by-utilization ...
ceph osd test-reweight-by-pg ...
prior to doing the non-test- variant to confirm that nothing drastic will happen.
They should also use small max_weight values. E.g.,
ceph osd test-reweight-by-utilization 120 .05 10
to update at most 10 osds with at most a change of .05.
Later, when we have backported this fix, the low-weight osds can be weighted up. If they can wait for that, they should, but if not, it's no big deal--just a bit more data movement.
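The "confirm that nothing drastic will happen" step can be partly automated. This is a minimal sketch, assuming the dry-run output keeps the `osd.N weight OLD -> NEW` line format shown in this report. `parse_changes` and `within_limits` are hypothetical helper names, not Ceph tooling.

```python
import re

# Weight lines copied from the test-reweight-by-utilization output above.
DRY_RUN = """\
osd.9 weight 1.000000 -> 0.950012
osd.4 weight 1.000000 -> 0.950012
osd.0 weight 0.800049 -> 0.750061
osd.6 weight 0.800049 -> 0.841187
osd.2 weight 0.800049 -> 0.850037
"""

# Assumed line format: "osd.<id> weight <old> -> <new>"
LINE_RE = re.compile(r"^osd\.(\d+) weight ([\d.]+) -> ([\d.]+)$")


def parse_changes(text):
    """Map OSD id -> (old_weight, new_weight) for every matching line."""
    changes = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            changes[int(match.group(1))] = (
                float(match.group(2)),
                float(match.group(3)),
            )
    return changes


def within_limits(changes, max_change, max_osds, eps=1e-6):
    """True if at most max_osds change, each by at most max_change."""
    if len(changes) > max_osds:
        return False
    return all(
        abs(new - old) <= max_change + eps for old, new in changes.values()
    )


if __name__ == "__main__":
    print(within_limits(parse_changes(DRY_RUN), max_change=0.05, max_osds=5))
```

If the check fails, the test- output should be reviewed by hand before running the non-test- variant.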
(In reply to Sage Weil from comment #7)
> Right. The user should
> ceph osd test-reweight-by-utilization ...
> ceph osd test-reweight-by-pg ...
> prior to doing the non-test- variant to confirm that nothing drastic will
> happen.
> They should also use small max_weight values. E.g.,
> ceph osd test-reweight-by-utilization 120 .05 10
> to update at most 10 osds with at most a change of .05.
> Later, when we have backported this fix,
Does "this fix" here refer to the fix for BZ 1331764, BZ 1331784, or both? Can you please confirm?
> the low-weight osds can be weighted
> up. If they can wait for that, they should, but if not, it's no big
> deal--just a bit more data movement.
The fix is the same for both BZs.
https://github.com/ceph/ceph/pull/9416 was merged to hammer after v0.94.7 was tagged, so this bug is fixed in v0.94.8 upstream.
With the new algorithm, OSDs are chosen based on their distance from the average utilization, i.e. the greater the distance from the average, the greater the chance of being selected.
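As a rough illustration of "distance from the average", the sketch below orders the OSDs from the original `ceph osd df` output by how far their %USE is from the cluster average. It is a simplification: it produces a fixed ranking rather than a selection chance, it ignores current weights, and it is not the code from the pull request. `rank_by_distance` is a hypothetical name.

```python
# %USE per OSD from the `ceph osd df` output in the description.
# osd.1 is left out because it reports zero size (REWEIGHT 0).
PCT_USE = {
    0: 84.50, 2: 70.11, 3: 80.02, 4: 84.61, 5: 69.94, 6: 73.24,
    7: 77.09, 8: 79.65, 9: 85.76, 10: 72.41, 11: 69.79,
}
AVG_PCT_USE = 77.01  # %USE from the TOTAL line


def rank_by_distance(pct_use, avg):
    """Return OSD ids ordered by |%USE - avg|, farthest from average first."""
    return sorted(
        pct_use, key=lambda osd: abs(pct_use[osd] - avg), reverse=True
    )


if __name__ == "__main__":
    for osd in rank_by_distance(PCT_USE, AVG_PCT_USE):
        print(f"osd.{osd}: {abs(PCT_USE[osd] - AVG_PCT_USE):.2f} from average")
```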
Hence marking this as verified.
Verified on 0.94.9-1.el7cp.x86_64
Looks good to me. Thanks, Bara!
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.