Bug 1331523 - [RADOS]:- osd gets heavy weight due to reweight-by-utilization with max_change set to 1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 1.3.3
Assignee: Samuel Just
QA Contact: shylesh
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On: 1335269
Blocks: 1372735
 
Reported: 2016-04-28 17:34 UTC by shylesh
Modified: 2017-07-30 15:22 UTC (History)
11 users

Fixed In Version: RHEL: ceph-0.94.7-5.el7cp Ubuntu: ceph_0.94.7-3redhat1trusty
Doc Type: Bug Fix
Doc Text:
.OSDs no longer receive unreasonably large weight during "reweight-by-utilization"

When the value of the `max_change` parameter was greater than an OSD weight, an underflow occurred. Consequently, the OSD node could receive an unreasonably large weight during the `reweight-by-utilization` process. This bug has been fixed, and OSDs no longer receive large weight in the described situation.
Clone Of:
Environment:
Last Closed: 2016-09-29 12:57:58 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 15655 None None None 2016-08-03 20:32:13 UTC
Red Hat Bugzilla 1316675 None None None Never
Red Hat Product Errata RHSA-2016:1972 normal SHIPPED_LIVE Moderate: Red Hat Ceph Storage 1.3.3 security, bug fix, and enhancement update 2016-09-29 16:51:21 UTC

Internal Links: 1316675

Description shylesh 2016-04-28 17:34:51 UTC
Description of problem:
Running reweight-by-utilization on a cluster with max_change set to 1 can lead to an OSD getting a huge weight of ~65534 due to an integer overflow in the weight variable.

Version-Release number of selected component (if applicable):
1.3.2 Async release.

[root@magna105 ~]# rpm -qa| grep ceph
ceph-mon-0.94.5-12.el7cp.x86_64
ceph-common-0.94.5-12.el7cp.x86_64
ceph-selinux-0.94.5-12.el7cp.x86_64
mod_fastcgi-2.4.7-1.ceph.el7.x86_64
iozone-3.424-2_ceph.el7.x86_64
ceph-0.94.5-12.el7cp.x86_64

How reproducible:


Steps to Reproduce:
1. Created a cluster with 9 OSDs.
2. Filled up data and observed some imbalance in the data distribution.
3. Ran "ceph osd reweight-by-utilization 110".

Actual results:
[root@magna105 ~]# ceph osd df
ID WEIGHT  REWEIGHT    SIZE  USE   AVAIL %USE  VAR
 0 0.89999     1.00000  926G  692G  233G 74.79 1.02
 1 0.89999     1.00000  926G  682G  243G 73.72 1.01
 2 0.89999 65535.94922  926G  709G  216G 76.64 1.05
 3 0.75000     1.00000  926G  626G  299G 67.62 0.93
 4 0.89999 65533.94922  926G  711G  215G 76.78 1.05
 5 0.89999     1.00000  926G  572G  353G 61.87 0.85
 6 0.79999     1.00000  926G  685G  240G 74.03 1.01
 7 0.89999     1.00000  926G  624G  301G 67.42 0.92
 8 0.89999     1.00000  926G  787G  138G 85.05 1.16
                 TOTAL 8334G 6092G 2241G 73.10
MIN/MAX VAR: 0.85/1.16  STDDEV: 3.61


Some of the OSDs got a very high reweight value due to an integer overflow in the calculation in src/mon/OSDMonitor.cc:

 new_weight = MAX(new_weight, weight - max_change);
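The bogus 65535.x values can be reproduced outside Ceph. Reweight values are stored as unsigned 16.16 fixed-point integers (0x10000 == 1.0), so when max_change (1.0) exceeds the current weight (0.950012 for osd.2), the subtraction wraps around. A minimal sketch, assuming only the fixed-point encoding:

```python
# Minimal sketch of the wrap-around. Ceph stores reweight values as
# unsigned 16.16 fixed-point integers, where 0x10000 represents 1.0.
FIXED_ONE = 0x10000

def to_fixed(w):
    """Convert a float weight to 16.16 fixed point."""
    return int(w * FIXED_ONE)

def from_fixed(f):
    """Convert a 16.16 fixed-point value back to a float."""
    return f / FIXED_ONE

weight = to_fixed(0.950012)   # current reweight of osd.2
max_change = to_fixed(1.0)    # max_change injected by the tester

# weight - max_change is negative; masked to unsigned 32 bits it wraps:
underflowed = (weight - max_change) & 0xFFFFFFFF
print(from_fixed(underflowed))  # ~65535.95, the bogus REWEIGHT in 'ceph osd df'
```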



Additional info:

[root@magna105 ~]# ceph -s
    cluster 6de276f4-42aa-4de9-85d7-6f879ce1faa3
     health HEALTH_WARN
            clock skew detected on mon.magna107, mon.magna108
            Monitor clock skew detected
     monmap e1: 3 mons at {magna105=10.8.128.105:6789/0,magna107=10.8.128.107:6789/0,magna108=10.8.128.108:6789/0}
            election epoch 24, quorum 0,1,2 magna105,magna107,magna108
     osdmap e430: 9 osds: 9 up, 9 in
      pgmap v34074: 128 pgs, 9 pools, 1978 GB data, 495 kobjects
            5944 GB used, 2390 GB / 8334 GB avail
                 128 active+clean
[root@magna105 ~]# ceph osd df
ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR
 0 0.89999  1.00000  926G  712G  213G 76.93 1.08
 1 0.89999  1.00000  926G  704G  221G 76.05 1.07
 2 0.89999  0.95001  926G  771G  154G 83.26 1.17
 3 0.75000  1.00000  926G  617G  308G 66.70 0.94
 4 0.89999  0.95001  926G  727G  198G 78.58 1.10
 5 0.89999  1.00000  926G  557G  368G 60.22 0.84
 6 0.79999  1.00000  926G  668G  257G 72.21 1.01
 7 0.89999  1.00000  926G  609G  316G 65.78 0.92
 8 0.89999  0.84999  926G  575G  350G 62.12 0.87
              TOTAL 8334G 5944G 2390G 71.32
MIN/MAX VAR: 0.84/1.17  STDDEV: 7.46

[root@magna105 ~]# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    8334G     2390G        5944G         71.32
POOLS:
    NAME                   ID     USED      %USED     MAX AVAIL     OBJECTS
    rbd                    0      1978G     23.74          450G      506530
    .rgw.root              1        848         0          450G           3
    .rgw.control           2          0         0          450G           8
    .rgw                   3        704         0          450G           4
    .rgw.gc                4          0         0          450G          32
    .users.uid             5        324         0          450G           2
    .users                 6         12         0          450G           1
    .rgw.buckets.index     7          0         0          450G           2
    .rgw.buckets           8       976k         0          450G        1000
[root@magna105 ~]# ceph osd reweight-by-utilization 110
moved 14 / 384 (3.64583%)
avg 42.6667
stddev 4.89898 -> 3.82971 (expected baseline 6.1584)
min osd.4 with 51 -> 51 pgs (1.19531 -> 1.19531 * mean)
max osd.8 with 33 -> 42 pgs (0.773438 -> 0.984375 * mean)

oload 110
max_change 1
max_change_osds 4  
average 0.713184   
overload 0.784502  
osd.2 weight 0.950012 -> 65535.949219
osd.4 weight 0.950012 -> 65535.949219
osd.8 weight 0.849991 -> 0.975845
[root@magna105 ~]# ceph osd reweight-by-utilization 110
moved 0 / 384 (0%) 
avg 42.6667
stddev 3.82971 -> 3.82971 (expected baseline 6.1584)
min osd.4 with 51 -> 51 pgs (1.19531 -> 1.19531 * mean)
max osd.5 with 36 -> 36 pgs (0.84375 -> 0.84375 * mean)

oload 110
max_change 1
max_change_osds 4  
average 0.721498   
overload 0.793647  
osd.4 weight 65535.949219 -> 65534.949219
osd.8 weight 0.975845 -> 1.000000
[root@magna105 ~]# ceph osd reweight-by-utilization 110 --no-increasing
moved 0 / 384 (0%) 
avg 42.6667
stddev 3.82971 -> 3.82971 (expected baseline 6.1584)
min osd.4 with 51 -> 51 pgs (1.19531 -> 1.19531 * mean)
max osd.5 with 36 -> 36 pgs (0.84375 -> 0.84375 * mean)
oload 110
max_change 1
max_change_osds 4  
average 0.721498   
overload 0.793648  
osd.4 weight 65534.949219 -> 65533.949219
[root@magna105 ~]# ceph osd df
ID WEIGHT  REWEIGHT    SIZE  USE   AVAIL %USE  VAR
 0 0.89999     1.00000  926G  712G  213G 76.93 1.07
 1 0.89999     1.00000  926G  704G  221G 76.05 1.05
 2 0.89999 65535.94922  926G  732G  193G 79.12 1.10
 3 0.75000     1.00000  926G  610G  315G 65.93 0.91
 4 0.89999 65533.94922  926G  769G  156G 83.10 1.15
 5 0.89999     1.00000  926G  557G  368G 60.22 0.83
 6 0.79999     1.00000  926G  668G  257G 72.21 1.00
 7 0.89999     1.00000  926G  609G  316G 65.78 0.91
 8 0.89999     1.00000  926G  648G  277G 69.99 0.97
                 TOTAL 8334G 6013G 2321G 72.15
MIN/MAX VAR: 0.83/1.15  STDDEV: 9.18
[root@magna105 ~]# ceph -s
    cluster 6de276f4-42aa-4de9-85d7-6f879ce1faa3
     health HEALTH_WARN
            clock skew detected on mon.magna107, mon.magna108
            5 pgs backfilling
            5 pgs stuck unclean
            recovery 66837/1571772 objects misplaced (4.252%)
            Monitor clock skew detected
     monmap e1: 3 mons at {magna105=10.8.128.105:6789/0,magna107=10.8.128.107:6789/0,magna108=10.8.128.108:6789/0}
            election epoch 24, quorum 0,1,2 magna105,magna107,magna108
     osdmap e443: 9 osds: 9 up, 9 in; 5 remapped pgs
      pgmap v35822: 128 pgs, 9 pools, 1978 GB data, 495 kobjects
            6065 GB used, 2269 GB / 8334 GB avail
            66837/1571772 objects misplaced (4.252%)
                 123 active+clean
                   5 active+remapped+backfilling
recovery io 37717 kB/s, 9 objects/s
[root@magna105 ~]# ceph osd df
ID WEIGHT  REWEIGHT    SIZE  USE   AVAIL %USE  VAR
 0 0.89999     1.00000  926G  712G  213G 76.93 1.06
 1 0.89999     1.00000  926G  704G  221G 76.05 1.05
 2 0.89999 65535.94922  926G  732G  193G 79.12 1.09
 3 0.75000     1.00000  926G  610G  315G 65.93 0.91
 4 0.89999 65533.94922  926G  769G  156G 83.10 1.14
 5 0.89999     1.00000  926G  557G  368G 60.22 0.83
 6 0.79999     1.00000  926G  668G  257G 72.21 0.99
 7 0.89999     1.00000  926G  609G  316G 65.78 0.90
 8 0.89999     1.00000  926G  699G  226G 75.58 1.04
                 TOTAL 8334G 6065G 2269G 72.77
MIN/MAX VAR: 0.83/1.14  STDDEV: 8.57

[root@magna105 ~]# ceph osd df
ID WEIGHT  REWEIGHT    SIZE  USE   AVAIL %USE  VAR  
 0 0.89999     1.00000  926G  692G  233G 74.79 1.02 
 1 0.89999     1.00000  926G  682G  243G 73.72 1.01 
 2 0.89999 65535.94922  926G  709G  216G 76.64 1.05 
 3 0.75000     1.00000  926G  626G  299G 67.62 0.93 
 4 0.89999 65533.94922  926G  711G  215G 76.78 1.05 
 5 0.89999     1.00000  926G  572G  353G 61.87 0.85 
 6 0.79999     1.00000  926G  685G  240G 74.03 1.01 
 7 0.89999     1.00000  926G  624G  301G 67.42 0.92 
 8 0.89999     1.00000  926G  787G  138G 85.05 1.16 
                 TOTAL 8334G 6092G 2241G 73.10      
MIN/MAX VAR: 0.85/1.16  STDDEV: 3.61

Comment 3 Samuel Just 2016-04-28 20:23:39 UTC
Actually, it's an unsigned underflow on the previous line. FYI: max_change defaults to 0.05; it's a ratio compared with the current weight. Simple enough fix. Fixing.
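The kind of clamp this calls for can be sketched as follows (illustrative Python, not the actual OSDMonitor.cc patch): bound the decrease at zero before taking the max, so the subtraction can never wrap in unsigned arithmetic.

```python
# Illustrative sketch of the fix: clamp the lower bound at zero so
# 'weight - max_change' can never go negative (and thus never wrap
# when computed in unsigned fixed-point arithmetic).
def clamped_new_weight(new_weight, weight, max_change):
    lower_bound = max(weight - max_change, 0.0)
    return max(new_weight, lower_bound)

print(clamped_new_weight(0.5, 0.950012, 1.0))  # 0.5 -- no wrap-around
```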

Comment 4 Samuel Just 2016-04-28 20:24:28 UTC
Just to confirm: how did you set max_change to 1?

Comment 5 shylesh 2016-04-28 20:35:27 UTC
(In reply to Samuel Just from comment #4)
> Just to confirm: how did you set max_change to 1?

using injectargs
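For reference, roughly how such a value could be injected at runtime (a sketch; the option name `mon_reweight_max_change` and the exact injectargs form are assumptions, not quoted from the report):

```shell
# Sketch: inject the setting into the running monitors.
# The option name (mon_reweight_max_change) is assumed, not from the report.
ceph tell mon.\* injectargs '--mon-reweight-max-change 1'
```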

Comment 6 Samuel Just 2016-04-28 21:07:48 UTC
In wip-sam-testing to run through upstream testing on master tonight. Backported as ceph-1.3-rhel-patches-15655 in gerrit for testing in the meantime. Should be able to backport to upstream hammer/jewel on Monday, teuthology permitting.

Comment 11 Sage Weil 2016-04-29 13:12:44 UTC
It'll delay the release to pull this in, so the plan is to release as-is and provide guidance. Specifically,

1- Make sure the customer uses a small max_change. This is what they will want to do anyway, FWIW. I suggest a value of .05 or smaller.

2- Advise the customer to always use test-reweight-by-utilization first to confirm that the reweight plan is sane. For example,

 ceph osd test-reweight-by-utilization 120 .05 10   # max .05 change for 10 osds

then verify the weight changes seem small and reasonable, and a smallish number of PGs will move, and then

 ceph osd reweight-by-utilization 120 .05 10

Sound okay?

Comment 18 shylesh 2016-09-13 07:41:05 UTC
No underflow observed, hence marking as verified. 
Verified on 0.94.9-1.el7cp.x86_64

Comment 23 errata-xmlrpc 2016-09-29 12:57:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html

