Bug 1569655 - Recovery impact on Client I/O latencies
Summary: Recovery impact on Client I/O latencies
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 3.*
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-19 17:10 UTC by John Harrigan
Modified: 2019-01-28 17:54 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-28 17:54:49 UTC
Target Upstream Version:


Attachments
summary of statistics from recovery test run (1.22 KB, text/plain)
2018-04-19 17:10 UTC, John Harrigan

Description John Harrigan 2018-04-19 17:10:02 UTC
Created attachment 1424221 [details]
summary of statistics from recovery test run

Description of problem:
Injected failures into RHCS while running a customer-based workload and observed a large increase in client I/O latencies.

Test Configuration:
 * 12x Supermicro 6048r (OSD and RGW nodes)
 * 3x Supermicro 6018r (MON nodes)
 * 24x Dell R620 (Client nodes) 
 * All systems running RHEL 7.4 GA
 * RHCS 3 (RHCEPH-3.0-RHEL-7-20171101.ci.0-x86_64-dvd.iso)
 * COSbench I/O workload generator
 * Failure script - https://github.com/jharriga/OSDfailure

Version-Release number of selected component (if applicable):
 * RHCEPH-3.0-RHEL-7-20171101.ci.0-x86_64-dvd.iso
 * Deploy using ceph-ansible
 * objectstore = filestore

Steps to Reproduce:
1. Cluster pre-filled to 25% capacity
   100 Containers
   21698 Objects/container
   Object sizes: 4KB; 64KB; 64MB (evenly divided)

2. Automation drives I/O workload through these three phases
   Phase 1 (NOfailure): No Failures
     All OSDs in, ceph -s == HEALTH_OK
     Establish baseline for Client I/O (latency and throughput)
   Phase 2 (OSDdrop): Drop single OSD device
     Remove a single OSD from the Ceph cluster
     Command issued: “systemctl stop ceph-osd@$origOSD”
   Phase 3 (OSDnode): Drop OSD node
     Disable two 40GbE NICs (public & cluster) on a single OSD node
     Command issued: “ifcfg down $iface”   
   Each phase runs with a failure time of 10 minutes and a recovery time of 60 minutes (the injection sequence is sketched below).

Actual results:
  Client latencies through the phases (in msec)
    Average         -  716 (NOfailure);  757 (OSDdrop);  922 (OSDnode)
    99th percentile - 1220 (NOfailure); 5010 (OSDdrop); 7660 (OSDnode)

Expected results:
  Less severe increase in client latencies during failures.

Additional Information:
  Writeup here: https://docs.google.com/presentation/d/1wwtYf9ymHwd8B1Utjn2JsqXRCTlAdeTm8VDoxdIWYxQ/edit?usp=sharing

Comment 3 Ben England 2019-01-27 16:36:29 UTC
John, your recent tests with Bluestore seem to show acceptable increases in latency for RHCS 3.2. Should this bz be closed?

https://docs.google.com/document/d/1tGpYX6WcNNxghpqeXl8Y59LKfHytKSgw9gAn68DzAQQ/edit#heading=h.saupco9v7wzy

Comment 4 John Harrigan 2019-01-28 17:54:49 UTC
Correct, the RHCS 3.2 test results show improvement.
I am closing this bug.

