Description of problem:

I was doing a wget of a file when I started adding mons and OSDs using ceph-ansible. Write throughput was still in the MB/s range while the mon was being added, and the new mon reached quorum. The OSD addition then completed, and at the end of this installation process the ceph-mds daemon was restarted. Ceph recovered to the active+clean state, but throughput dropped to the KB/s range and stayed there until the download of the file completed. Note that I had the same FS mounted via ceph-fuse and the kernel client on different nodes.

Some of the outputs:

76% [==============================================================================================================> ] 1,635,711,544 237KB/s eta 30m 24s

[root@magna070 ubuntu]# ceph -w
    cluster 27b4d1a0-a522-4866-b344-4ea5b101c8bd
     health HEALTH_OK
     monmap e2: 2 mons at {magna041=10.8.128.41:6789/0,magna070=10.8.128.70:6789/0}
            election epoch 12, quorum 0,1 magna041,magna070
      fsmap e29: 1/1/1 up {0=magna070=up:active}
     osdmap e225: 12 osds: 12 up, 12 in
            flags sortbitwise
      pgmap v9801: 360 pgs, 8 pools, 20684 MB data, 5280 objects
            63026 MB used, 11052 GB / 11114 GB avail
                 360 active+clean
  client io 850 kB/s wr, 0 op/s rd, 0 op/s wr

2016-06-03 07:57:08.579814 mon.0 [INF] pgmap v9800: 360 pgs: 360 active+clean; 20684 MB data, 63018 MB used, 11052 GB / 11114 GB avail; 429 kB/s wr, 0 op/s
2016-06-03 07:57:09.926819 mon.0 [INF] pgmap v9801: 360 pgs: 360 active+clean; 20684 MB data, 63026 MB used, 11052 GB / 11114 GB avail; 850 kB/s wr, 0 op/s
2016-06-03 07:57:10.985164 mon.0 [INF] pgmap v9802: 360 pgs: 360 active+clean; 20684 MB data, 63034 MB used, 11052 GB / 11114 GB avail
2016-06-03 07:57:14.293977 mon.0 [INF] pgmap v9803: 360 pgs: 360 active+clean; 20684 MB data, 63034 MB used, 11052 GB / 11114 GB avail

MDS log file attached.
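For reference, the two client mounts described above would look roughly like the sketch below. The mount points and secret file path are placeholders, not values taken from this report; only the monitor address comes from the monmap shown above.

    # kernel client mount on one node (mount point and secret file are placeholders)
    mount -t ceph 10.8.128.41:6789:/ /mnt/cephfs-kernel -o name=admin,secretfile=/etc/ceph/admin.secret

    # ceph-fuse mount on the other node (mount point is a placeholder)
    ceph-fuse -m 10.8.128.41:6789 /mnt/cephfs-fuse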
It's not immediately clear that this is a cephfs issue. Has the same test ever been done while using e.g. an RBD image? You mention restarting the MDS (which probably wasn't necessary), was the performance different before/after restarting that? Also, just a sanity check, the wget is from somewhere local right, not from an internet source that might have slowed down on its own?
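To compare against RBD, one rough way to drive a write load onto an image would be the sketch below. The pool and image names and the I/O sizes are illustrative assumptions, not anything from this report.

    # create and map a 10 GB test image (pool/image names are placeholders)
    rbd create rbd/perftest --size 10240
    DEV=$(rbd map rbd/perftest)            # rbd map prints the block device, e.g. /dev/rbd0

    # sequential direct writes to the mapped device; watch client io in 'ceph -w' meanwhile
    dd if=/dev/zero of="$DEV" bs=4M count=256 oflag=direct

    # clean up
    rbd unmap "$DEV"
    rbd rm rbd/perftest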
(In reply to John Spray from comment #2)

> It's not immediately clear that this is a cephfs issue. Has the same test
> ever been done while using e.g. an RBD image?

I have not tested with RBD, so I don't know the behaviour there.

> You mention restarting the MDS (which probably wasn't necessary), was the
> performance different before/after restarting that?

I was installing the OSDs using ceph-ansible, so the restart of the MDS was not in my control. The throughput had definitely already slowed before the MDS restart, while the 3 additional OSD daemons were being added.

> Also, just a sanity check, the wget is from somewhere local right, not from
> an internet source that might have slowed down on its own?

The wget is not from a local source; I was downloading an ISO from fedora.org. I downloaded the same ISO several times, redirecting the file name using -O, and the speeds stayed in the MB/s range on the previous nodes. I saw the same issue while removing OSDs, but I can't confirm that now as I am still running the test.
OK, so for things to slow down while you add/remove OSDs is completely normal and expected (the OSDs are doing backfilling, which consumes some of their bandwidth). Once the status has all the PGs as active+clean again, performance should go back to normal. Of course, if you have removed OSDs, the overall system bandwidth will be lower after you have removed them.

I notice that you have a relatively small number of PGs. This will lead to unpredictable performance. See the documentation for choosing the number of PGs in your pools:
http://docs.ceph.com/docs/jewel/rados/operations/placement-groups/#a-preselection-of-pg-num

Any performance-related test needs to be done with some dependable source for the data. wgets from the internet are just too unpredictable. It is better to use a benchmarking tool (such as fio) to generate a consistent load -- ask around in the QE group for advice on such benchmarks.
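As an illustration of both suggestions, a rough sketch of inspecting/raising the PG count and generating a repeatable load with fio follows. The pool name, mount point, target pg_num, and I/O sizes are assumptions for the example, not values from this cluster.

    # inspect and, if needed, raise the PG count for the data pool (placeholder pool name and target)
    ceph osd pool get cephfs_data pg_num
    ceph osd pool set cephfs_data pg_num 128
    ceph osd pool set cephfs_data pgp_num 128

    # generate a repeatable sequential-write load on the CephFS mount instead of wget
    fio --name=seqwrite --directory=/mnt/cephfs --rw=write --bs=4M --size=2G --numjobs=1 --direct=1

Raising pg_num/pgp_num itself triggers data movement, so any benchmark should be run only after the cluster has returned to active+clean.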
still in needinfo -> re-targeting to 2.2
No response from the reporter, and no evidence this is a real bug. Closing.