Bug 1569655 - Recovery impact on Client I/O latencies
Summary: Recovery impact on Client I/O latencies
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 3.*
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-19 17:10 UTC by John Harrigan
Modified: 2019-01-28 17:54 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-28 17:54:49 UTC
Target Upstream Version:


Attachments
summary of statistics from recovery test run (1.22 KB, text/plain)
2018-04-19 17:10 UTC, John Harrigan

Description John Harrigan 2018-04-19 17:10:02 UTC
Created attachment 1424221 [details]
summary of statistics from recovery test run

Description of problem:
Injected failures into RHCS while running a customer-based workload and observed a large increase in client I/O latencies.

Test Configuration:
 * 12x Supermicro 6048r (OSD and RGW nodes)
 * 3x Supermicro 6018r (MON nodes)
 * 24x Dell R620 (Client nodes) 
 * All systems running RHEL 7.4 GA
 * RHCS 3 (RHCEPH-3.0-RHEL-7-20171101.ci.0-x86_64-dvd.iso)
 * COSbench I/O workload generator
 * Failure script - https://github.com/jharriga/OSDfailure

Version-Release number of selected component (if applicable):
 * RHCEPH-3.0-RHEL-7-20171101.ci.0-x86_64-dvd.iso
 * Deploy using ceph-ansible
 * objectstore = filestore

Steps to Reproduce:
1. Cluster pre-filled to 25% capacity
   100 Containers
   21698 Objects/container
   Object sizes: 4KB; 64KB; 64MB (evenly divided)

2. Automation drives I/O workload through these three phases
   Phase 1 (NOfailure): No Failures
     All OSDs in, ceph -s == HEALTH_OK
     Establish baseline for Client I/O (latency and throughput)
   Phase 2 (OSDdrop): Drop single OSD device
     Remove a single OSD from the Ceph cluster
     Command issued: “systemctl stop ceph-osd@$origOSD”
   Phase 3 (OSDnode): Drop OSD node
     Disable two 40GbE NICs (public & cluster) on a single OSD node
     Command issued: “ifcfg down $iface”   
   Each phase runs with a failure time of 10 minutes and a recovery time of 60 minutes (the injection sequence is sketched below).

Actual results:
  Client latencies through the phases (in msec)
    Average         -  716 (NOfailure);  757 (OSDdrop);  922 (OSDnode)
    99th percentile - 1220 (NOfailure); 5010 (OSDdrop); 7660 (OSDnode)

Expected results:
  Less severe increase in client latencies during failures.

Additional Information:
  Writeup here: https://docs.google.com/presentation/d/1wwtYf9ymHwd8B1Utjn2JsqXRCTlAdeTm8VDoxdIWYxQ/edit?usp=sharing

Comment 3 Ben England 2019-01-27 16:36:29 UTC
John, your recent tests with Bluestore seem to show acceptable increases in latency for RHCS 3.2. Should this bz be closed?

https://docs.google.com/document/d/1tGpYX6WcNNxghpqeXl8Y59LKfHytKSgw9gAn68DzAQQ/edit#heading=h.saupco9v7wzy

Comment 4 John Harrigan 2019-01-28 17:54:49 UTC
Correct, the RHCS 3.2 test results show improvement.
I am closing this bug.

