Created attachment 1265734 [details]
heketilog

Description of problem:
Heketi has become so sluggish that a PVC request takes 15+ minutes to complete.

The sequence of events that could have possibly led to this situation is as follows:

1) One of the nodes hosting a gluster pod was suspended for a test, say node 1.
2) A device remove request was sent for a device on the same node. This resulted in https://bugzilla.redhat.com/show_bug.cgi?id=1435097.
3) After restoring the suspended node, any request sent to heketi takes a very long time to complete.
4) This resulted in weird issues. For example, a volume create request was submitted, but the PVC remained in pending status for a long time. This pending PVC was then deleted. As a result, the device has an id and size carved out, but the LV was never actually created:

heketi-cli device info 685f6ed53fca4ae417831b5ecd297079
Device Id: 685f6ed53fca4ae417831b5ecd297079
Name: /dev/sdg
State: online
Size (GiB): 99
Used (GiB): 30
Free (GiB): 69
Bricks:
Id:207727b71a72f6e4accf8957887ac952   Size (GiB):8   Path: /var/lib/heketi/mounts/vg_685f6ed53fca4ae417831b5ecd297079/brick_207727b71a72f6e4accf8957887ac952/brick
Id:21c5ad645c8a53a0ba2f94744947b41c   Size (GiB):1   Path:
Id:6303f20265f329e4f4691342c07534ff   Size (GiB):7   Path:
Id:6bb39d061413930c4f93d2d54355e2a4   Size (GiB):5   Path: /var/lib/heketi/mounts/vg_685f6ed53fca4ae417831b5ecd297079/brick_6bb39d061413930c4f93d2d54355e2a4/brick
Id:9bac709a4857b5a304df5efb6b2459e8   Size (GiB):9   Path: /var/lib/heketi/mounts/vg_685f6ed53fca4ae417831b5ecd297079/brick_9bac709a4857b5a304df5efb6b2459e8/brick

This might just be a symptom; the actual issue is heketi becoming too sluggish.

5) When a PVC request was submitted and left waiting for a long time (15-20 mins), it eventually got created successfully.

Deleting the heketi pod and recreating a new one has not helped.
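For anyone hitting the same stale-brick symptom, a rough way to confirm that the brick entries with an empty Path exist only in heketi's database and not on disk is to compare the device info against the LVs on the node. This is only a sketch; the pod name is a placeholder, and the vg_<device-id>/brick_<brick-id> naming is inferred from the mount paths shown above.

# Assumptions: namespace is the current project, POD is the gluster pod on the
# node that owns /dev/sdg, DEVICE is the device id from the output above.
DEVICE=685f6ed53fca4ae417831b5ecd297079
POD=glusterfs-jg4kw

# What heketi believes is carved out on the device
heketi-cli device info "$DEVICE"

# What actually exists on the node: brick LVs (and their thin pools) in that VG
oc exec "$POD" -- lvs --noheadings -o lv_name,lv_size "vg_$DEVICE"

# Brick ids listed by heketi with an empty Path and no matching brick_<id> LV
# here are stale database entries left behind by the deleted pending PVC.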
Other useful information:

heketi-cli volume list
Id:1648bfdf33836dc0c8f88ca9f6a52438    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_1648bfdf33836dc0c8f88ca9f6a52438
Id:28b47869298f18d083a564a87f2ed8ac    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_28b47869298f18d083a564a87f2ed8ac
Id:4258c7b0b49dfda13c9c768ec69f7c3a    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_4258c7b0b49dfda13c9c768ec69f7c3a
Id:5e309a349fbe0bf9d3a2fcad69fbd7dd    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_5e309a349fbe0bf9d3a2fcad69fbd7dd
Id:5ece205088fec4022586f026dbf1c154    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_5ece205088fec4022586f026dbf1c154
Id:6285f3dc51fe7edd42cc62df4ce7705f    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_6285f3dc51fe7edd42cc62df4ce7705f
Id:6d6c6151c9980a3219d05dcdd5f3ef87    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_6d6c6151c9980a3219d05dcdd5f3ef87
Id:7ed50e293b497b9fa2e1ba2673fc09de    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_7ed50e293b497b9fa2e1ba2673fc09de
Id:7f2a19504254b9011184f2731c0d228d    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_7f2a19504254b9011184f2731c0d228d
Id:82490940149f92ef9157566e45a59d31    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_82490940149f92ef9157566e45a59d31
Id:8b490d5ac0ee9d2025472748c75b203f    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_8b490d5ac0ee9d2025472748c75b203f
Id:92a96b1843a3c8cfbe762aa260bd6b3a    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_92a96b1843a3c8cfbe762aa260bd6b3a
Id:94cc6fd23b42f9a7ddc61a962967dfd1    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_94cc6fd23b42f9a7ddc61a962967dfd1
Id:a60a2deab354472904a9fd91250ad14b    Cluster:ee0be395eee24de0af625fb70b598342    Name:heketidbstorage
Id:b94a7d77ed294b1af14f9b94076798df    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_b94a7d77ed294b1af14f9b94076798df
Id:c706b5d41dc8fc10d1e0de74fe7359ca    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_c706b5d41dc8fc10d1e0de74fe7359ca
Id:c7585026b7b4da1fd46f85da1df3f5d0    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_c7585026b7b4da1fd46f85da1df3f5d0
Id:c8d6c1da6ed264824900b6f452238999    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_c8d6c1da6ed264824900b6f452238999
Id:cae7dfc6c42af9d60a4fe77c8b5deb84    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_cae7dfc6c42af9d60a4fe77c8b5deb84
Id:debf37e2308369d3020e09a509f0324c    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_debf37e2308369d3020e09a509f0324c
Id:f32219d1d977fb00f0c5c1238c050cfa    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_f32219d1d977fb00f0c5c1238c050cfa
Id:f35d93fb9b5fe36d125f17aedcdc04ae    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_f35d93fb9b5fe36d125f17aedcdc04ae
Id:fc528f21a431ede3bef0153d1125eeb4    Cluster:ee0be395eee24de0af625fb70b598342    Name:vol_fc528f21a431ede3bef0153d1125eeb4

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-3.el7rhgs.x86_64
image: rhgs3/rhgs-volmanager-rhel7:3.2.0-3

How reproducible:
Seen this once; not sure if this can be consistently reproduced.

heketi logs have been captured and attached.
This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1320977; I am placing it here as a reference.
When a pv create request is submitted, two of the three nodes accept the request immediately, while the node which was down and restored yesterday takes a while to process it. I'll attach the complete heketi log from the time the volume request was submitted until it completed.

log snippet:

[negroni] Started POST /volumes
[negroni] Completed 202 Accepted in 1.195593ms
[asynchttp] INFO 2017/03/24 03:51:24 Started job 2956fbc50176ed7c2c15d179b64456e8
[heketi] INFO 2017/03/24 03:51:24 Creating volume b9baf9b7a534d6fcdb0d0cc5101a48fe
[negroni] Started GET /queue/2956fbc50176ed7c2c15d179b64456e8
[negroni] Completed 200 OK in 69.048µs
[heketi] INFO 2017/03/24 03:51:24 brick_num: 0
[heketi] INFO 2017/03/24 03:51:24 Creating brick d105f896d650aa40dfa4352261936836
[heketi] INFO 2017/03/24 03:51:24 Creating brick af467c53680ce258dca7ff5327dbea4f
[heketi] INFO 2017/03/24 03:51:24 Creating brick 2277909c5c4deeb497b3da7b06647df9
[kubeexec] DEBUG 2017/03/24 03:51:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:250: Host: dhcp47-51.lab.eng.blr.redhat.com Pod: glusterfs-vx1s0 Command: mkdir -p /var/lib/heketi/mounts/vg_fe3c78e3d576382ccb5acf8267db4623/brick_d105f896d650aa40dfa4352261936836 Result:
[kubeexec] DEBUG 2017/03/24 03:51:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:250: Host: dhcp46-165.lab.eng.blr.redhat.com Pod: glusterfs-hcp7j Command: mkdir -p /var/lib/heketi/mounts/vg_f0565469addfecd5ec737e86bdda2950/brick_2277909c5c4deeb497b3da7b06647df9 Result:
[negroni] Started GET /queue/2956fbc50176ed7c2c15d179b64456e8
[negroni] Completed 200 OK in 89.243µs
.
.
.
.
[kubeexec] DEBUG 2017/03/24 03:52:09 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:250: Host: dhcp47-21.lab.eng.blr.redhat.com Pod: glusterfs-jg4kw Command: mkdir -p /var/lib/heketi/mounts/vg_685f6ed53fca4ae417831b5ecd297079/brick_af467c53680ce258dca7ff5327dbea4f Result:
[negroni] Started GET /queue/2956fbc50176ed7c2c15d179b64456e8
[negroni] Completed 200 OK in 71.028µs
.
.
.
.
.
[kubeexec] DEBUG 2017/03/24 03:56:18 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:250: Host: dhcp47-21.lab.eng.blr.redhat.com Pod: glusterfs-jg4kw Command: chmod 2775 /var/lib/heketi/mounts/vg_685f6ed53fca4ae417831b5ecd297079/brick_af467c53680ce258dca7ff5327dbea4f/brick
.
.
[sshexec] INFO 2017/03/24 03:56:18 Creating volume vol_b9baf9b7a534d6fcdb0d0cc5101a48fe replica 3
[kubeexec] DEBUG 2017/03/24 03:56:19 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:250: Host: dhcp46-165.lab.eng.blr.redhat.com Pod: glusterfs-hcp7j Command: gluster --mode=script volume create vol_b9baf9b7a534d6fcdb0d0cc5101a48fe replica 3 10.70.46.165:/var/lib/heketi/mounts/vg_f0565469addfecd5ec737e86bdda2950/brick_2277909c5c4deeb497b3da7b06647df9/brick 10.70.47.51:/var/lib/heketi/mounts/vg_fe3c78e3d576382ccb5acf8267db4623/brick_d105f896d650aa40dfa4352261936836/brick 10.70.47.21:/var/lib/heketi/mounts/vg_685f6ed53fca4ae417831b5ecd297079/brick_af467c53680ce258dca7ff5327dbea4f/brick
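Before digging further into heketi itself, it may also be worth checking that the gluster pod on the slow node (dhcp47-21 in the snippet above) is actually Running/Ready and that the node still carries the label the glusterfs daemonset schedules on. A rough sketch, assuming the commonly used storagenode=glusterfs node selector:

# Is the gluster pod on the restored node still Running and Ready?
oc get pods -o wide | grep glusterfs

# Does the restored node still carry the glusterfs node label? (label key/value
# assumed; confirm against the daemonset's nodeSelector)
oc get nodes --show-labels | grep dhcp47-21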
Created attachment 1265966 [details] volumecreatelog_comment4
As observed in bug 1435401, the restarted node would have lost its labels. This leaves the cluster with only 2 working nodes out of 3. The third node does not respond at all, so the command is not sluggish; it is hung. The fix for now is to relabel the rebooted node.
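For reference, a minimal sketch of that workaround, assuming the deployment uses the usual storagenode=glusterfs node selector (check the glusterfs daemonset's spec.template.spec.nodeSelector to confirm the exact key/value):

# Re-apply the label so the glusterfs daemonset schedules a pod on the rebooted node
oc label node dhcp47-21.lab.eng.blr.redhat.com storagenode=glusterfs --overwrite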
This issue is not seen in the recent 3.6 builds. Moving the bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2879