Bug 1342402
| Summary: | Ceph-FS IOPs slowed down after adding OSDS | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | rakesh-gm <rgowdege> |
| Component: | CephFS | Assignee: | John Spray <john.spray> |
| Status: | CLOSED NOTABUG | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.0 | CC: | ceph-eng-bugs, john.spray, kdreyer, rgowdege, tchandra |
| Target Milestone: | rc | | |
| Target Release: | 2.2 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-06 12:08:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
It's not immediately clear that this is a CephFS issue. Has the same test ever been done while using e.g. an RBD image? You mention restarting the MDS (which probably wasn't necessary); was the performance different before/after restarting it? Also, just a sanity check: the wget is from somewhere local, right, not from an internet source that might have slowed down on its own?

(In reply to John Spray from comment #2)

> It's not immediately clear that this is a cephfs issue. Has the same test
> ever been done while using e.g. an RBD image?

I have not tested with RBD, so I don't know the behaviour there.

> You mention restarting the MDS (which probably wasn't necessary), was the
> performance different before/after restarting that?

I was installing the OSDs using ceph-ansible, so the restart of the MDS was not in my control. The IOPS definitely slowed before the MDS restart, as the three additional OSD daemons were added.

> Also, just a sanity check, the wget is from somewhere local right, not from
> an internet source that might have slowed down on its own?

The wget is not from a local source; I was downloading an ISO from fedora.org. I downloaded the same ISO several times, redirecting the file name using -O, and the speeds on the previous nodes were close together, in the MB/s range. I saw the same issue while removing OSDs, but I can't confirm that now as I am still running the test.

OK, so for things to slow down while you add/remove OSDs is completely normal and expected: the OSDs are doing backfilling, which consumes some of their bandwidth. Once the status shows all the PGs as active+clean again, performance should go back to normal. Of course, if you have removed OSDs, the overall system bandwidth will be lower afterwards.

I notice that you have a relatively small number of PGs. This will lead to unpredictable performance.
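The rule of thumb behind that PG-count advice can be sketched as a quick calculation: target roughly 100 PGs per OSD, divide by the replica count, and round up to a power of two. This is a sketch only; `suggested_pg_num` is a hypothetical helper, and the 3x replication used in the example is an assumption (the report does not state the pool size):

```python
def suggested_pg_num(num_osds: int, replica_count: int, pgs_per_osd: int = 100) -> int:
    """Round (num_osds * pgs_per_osd) / replica_count up to the next power of two.

    Hypothetical helper illustrating the ~100-PGs-per-OSD rule of thumb;
    not part of any Ceph tooling.
    """
    target = (num_osds * pgs_per_osd) / replica_count
    pg_num = 1
    while pg_num < target:
        pg_num *= 2
    return pg_num

# For the 12-OSD cluster in this report, assuming 3x replication:
print(suggested_pg_num(12, 3))  # 512
```

By this estimate a single data pool on this cluster would want on the order of 512 PGs, rather than the 360 PGs spread across 8 pools shown in the status output below.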
See the documentation for choosing the number of PGs in your pools: http://docs.ceph.com/docs/jewel/rados/operations/placement-groups/#a-preselection-of-pg-num

Any performance-related test needs to be done with a dependable source for the data; wgets from the internet are just too unpredictable. It is better to use a benchmarking tool (such as fio) to generate a consistent load -- ask around in the QE group for advice on such benchmarks.

Still in needinfo -> re-targeting to 2.2.

No response from the reporter, and no evidence this is a real bug. Closing.
Description of problem:

I was doing a wget of a file, and at that point I started to add a mon and OSDs using ceph-ansible. IOPS stayed in the MB/s range while the mon was added and reached quorum. Then the OSD addition completed; at the end of this installation process the ceph MDS was restarted. Ceph recovered to the active+clean state, but IOPS then dropped to KB/s and remained there until the download of the file completed. Note that I had the same FS mounted via ceph-fuse and the kernel client on different nodes.

Some of the outputs:

```
76% [==============================================================================================================> ] 1,635,711,544 237KB/s eta 30m 24s

[root@magna070 ubuntu]# ceph -w
    cluster 27b4d1a0-a522-4866-b344-4ea5b101c8bd
     health HEALTH_OK
     monmap e2: 2 mons at {magna041=10.8.128.41:6789/0,magna070=10.8.128.70:6789/0}
            election epoch 12, quorum 0,1 magna041,magna070
      fsmap e29: 1/1/1 up {0=magna070=up:active}
     osdmap e225: 12 osds: 12 up, 12 in
            flags sortbitwise
      pgmap v9801: 360 pgs, 8 pools, 20684 MB data, 5280 objects
            63026 MB used, 11052 GB / 11114 GB avail
                 360 active+clean
  client io 850 kB/s wr, 0 op/s rd, 0 op/s wr

2016-06-03 07:57:08.579814 mon.0 [INF] pgmap v9800: 360 pgs: 360 active+clean; 20684 MB data, 63018 MB used, 11052 GB / 11114 GB avail; 429 kB/s wr, 0 op/s
2016-06-03 07:57:09.926819 mon.0 [INF] pgmap v9801: 360 pgs: 360 active+clean; 20684 MB data, 63026 MB used, 11052 GB / 11114 GB avail; 850 kB/s wr, 0 op/s
2016-06-03 07:57:10.985164 mon.0 [INF] pgmap v9802: 360 pgs: 360 active+clean; 20684 MB data, 63034 MB used, 11052 GB / 11114 GB avail
2016-06-03 07:57:14.293977 mon.0 [INF] pgmap v9803: 360 pgs: 360 active+clean; 20684 MB data, 63034 MB used, 11052 GB / 11114 GB avail
```

Attached MDS log file.
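As a side note, the "wait for active+clean" condition discussed in the comments can be checked programmatically against pgmap summary lines like the ones quoted above. This is a sketch; `all_pgs_active_clean` is a hypothetical helper, and the parsing assumes the `N pgs: M active+clean` line format shown in this report:

```python
import re

def all_pgs_active_clean(pgmap_line: str) -> bool:
    """Return True if every PG in a pgmap summary line is active+clean.

    Hypothetical helper; assumes a line shaped like
    'pgmap v9801: 360 pgs: 360 active+clean; ...' as quoted in this report.
    """
    total = re.search(r"(\d+)\s+pgs", pgmap_line)
    clean = re.search(r"(\d+)\s+active\+clean", pgmap_line)
    if not total or not clean:
        return False
    return int(total.group(1)) == int(clean.group(1))

# Using one of the monitor log lines from this report:
line = "pgmap v9801: 360 pgs: 360 active+clean; 20684 MB data"
print(all_pgs_active_clean(line))  # True: backfill is finished
```

A script polling `ceph -s` with this check could confirm when backfill has completed and performance should be back to baseline.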