Bug 1601874
| Summary: | While creating, deleting pvs in loop and running gluster volume heal in gluster pods 2 of the 4 gluster pods are in 0/1 state | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | vinutha <vinug> |
| Component: | Containers | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.9.0 | CC: | amukherj, aos-bugs, clichybi, hchiramm, jokerman, knarra, mmccomas, mpatel, pprakash, rhs-bugs, rtalur, sankarshan, sarumuga, vbellur, vinug |
| Target Milestone: | --- | | |
| Target Release: | 3.9.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-09 14:20:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1605158 | | |
Description
vinutha
2018-07-17 11:45:04 UTC
There was some discussion in the call regarding pod limits for a node.

Referring to https://access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html-single/scaling_and_performance_guide/#scaling-performance-current-cluster-limits:

Max number of pods per node = min(250, number of CPU cores * 10)

Considering that the CPU count on these machines is 32, we have pod limit = min(250, 32 * 10) = 250.

This holds true only if:
a. all pods have their CPU resource defined as 100 millicpus
b. the host machines (which might be VMs) have 32 dedicated CPU cores

```
[root@dhcp46-119 ~]# nproc
32
[root@dhcp46-119 ~]# lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
CPU(s):                32
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             8
```

The above comments are based on the nproc and lscpu output. I would need data to confirm that these 32 cores are dedicated to the VM. Please attach the data from the hypervisor to prove the same.

(In reply to Raghavendra Talur from comment #7)
> There was some discussion in the call regarding pod limits for a node.
>
> Referring to
> https://access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html-single/scaling_and_performance_guide/#scaling-performance-current-cluster-limits
>
> Max number of pods per node = min(250, number of CPU cores * 10)
>
> Considering that the CPU count on these machines is 32, we have
>
> pod limit = min(250, 32 * 10) = 250
>
> This holds true only if
> a. all pods have their CPU resource defined as 100 millicpus
> b. the host machines (which might be VMs) have 32 dedicated CPU cores

c. the default pod limits have not been overwritten in node-config

Is there any way you can test this out with the newer docker referenced in https://bugzilla.redhat.com/show_bug.cgi?id=1560428#c48?

Can you try docker-1.13.1-94, which we believe has the fixes? Placing needinfo on kasturi to try this out and get back. Removing needinfo on vinutha, who is no longer with Red Hat.

Below are the test steps I ran to verify this bug:

1) Created 100 mongodb pods, each with a 1 GB PVC and 1 GB of RAM allocated.
2) Started a loop running the volume heal info command from all the gluster pods present in the system (3 pods were present).
3) From another terminal, started running load on the mongodb pods.
4) From another terminal, started a loop of PVC create and PVC delete for 300 PVCs. (A rough sketch of the loops used is shown below.)

After the 300 PVCs were created and deleted, I did not see any gluster pod going into the 0/1 state; all pods were up and running and no gluster core was generated on any of the nodes.
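For reference, the heal-info and PVC churn loops from steps 2 and 4 were along these lines. This is a minimal sketch, not the exact script used in the test run: the glusterfs namespace, the glusterfs=storage-pod pod label, the pvc-template.yaml file, its CLAIM_NAME parameter, and the claim names are all illustrative assumptions.

```bash
#!/bin/bash
# Rough sketch of the verification loops. Namespace, pod label, template file
# and parameter names below are illustrative assumptions, not the exact ones used.

NS_GLUSTER=glusterfs            # assumed namespace of the gluster pods
PVC_TEMPLATE=pvc-template.yaml  # assumed PVC template exposing a CLAIM_NAME parameter

# Step 2: keep running `gluster volume heal <vol> info` from every gluster pod.
heal_info_loop() {
  while true; do
    for pod in $(oc get pods -n "$NS_GLUSTER" -l glusterfs=storage-pod -o name); do
      pod=${pod#pod/}
      for vol in $(oc exec -n "$NS_GLUSTER" "$pod" -- gluster volume list); do
        oc exec -n "$NS_GLUSTER" "$pod" -- gluster volume heal "$vol" info
      done
    done
    sleep 10
  done
}

# Step 4: create and delete 300 PVCs in a loop.
pvc_churn_loop() {
  for i in $(seq 1 300); do
    oc process -f "$PVC_TEMPLATE" -p CLAIM_NAME="test-claim-$i" | oc create -f -
    # Wait for the claim to be bound before deleting it again.
    until [ "$(oc get pvc "test-claim-$i" -o jsonpath='{.status.phase}')" = "Bound" ]; do
      sleep 2
    done
    oc delete pvc "test-claim-$i"
  done
}

heal_info_loop &
pvc_churn_loop
kill %1

# Afterwards, confirm that no gluster pod dropped to 0/1 READY.
oc get pods -n "$NS_GLUSTER" -o wide
```

The background heal-info loop and the foreground PVC churn mirror the reported reproduction scenario; the final oc get pods check corresponds to verifying that none of the gluster pods drops back to the 0/1 READY state.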
However, on the mongodb side I did see an issue similar to the one below; I will be raising a separate bug for that:

```
2019-03-30 19:03:40:170 12580 sec: 276 operations; 0.1 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=1, Max=30015487, Min=29999104, Avg=30007296, 90=30015487, 99=30015487, 99.9=30015487, 99.99=30015487] [READ-MODIFY-WRITE: Count=1, Max=60030975, Min=59998208, Avg=60014592, 90=60030975, 99=60030975, 99.9=60030975, 99.99=60030975] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:03:50:170 12590 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:04:00:170 12600 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}.
Client view of cluster state is {type=UNKNOWN, servers=[{address=:27017:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: :27017: invalid IPv6 address}, caused by {java.net.UnknownHostException: :27017: invalid IPv6 address}}]
2019-03-30 19:04:10:170 12610 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 11 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=1, Max=30015487, Min=29999104, Avg=30007296, 90=30015487, 99=30015487, 99.9=30015487, 99.99=30015487] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:04:20:170 12620 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 11 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:04:30:170 12630 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 12 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches PrimaryServerSelector.
Client view of cluster state is {type=UNKNOWN, servers=[{address=:27017:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: :27017: invalid IPv6 address}, caused by {java.net.UnknownHostException: :27017: invalid IPv6 address}}]
2019-03-30 19:04:40:170 12640 sec: 277 operations; 0.1 current ops/sec; est completion in 9 hours 9 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=1, Max=30015487, Min=29999104, Avg=30007296, 90=30015487, 99=30015487, 99.9=30015487, 99.99=30015487] [READ-MODIFY-WRITE: Count=1, Max=60030975, Min=59998208, Avg=60014592, 90=60030975, 99=60030975, 99.9=60030975, 99.99=60030975] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:04:50:170 12650 sec: 277 operations; 0 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:05:00:170 12660 sec: 277 operations; 0 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}.
Client view of cluster state is {type=UNKNOWN, servers=[{address=:27017:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: :27017: invalid IPv6 address}, caused by {java.net.UnknownHostException: :27017: invalid IPv6 address}}]
```

Since the actual issue reported in this bug is not seen, moving the bug to the verified state.

oc version:
```
[root@dhcp47-89 ~]# oc version
oc v3.9.71
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://dhcp47-89.lab.eng.blr.redhat.com:8443
openshift v3.9.71
kubernetes v1.9.1+a0ce1bc657
```

```
sh-4.2# rpm -qa | grep glusterfs
glusterfs-libs-3.12.2-32.el7rhgs.x86_64
glusterfs-3.12.2-32.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-32.el7rhgs.x86_64
glusterfs-server-3.12.2-32.el7rhgs.x86_64
glusterfs-api-3.12.2-32.el7rhgs.x86_64
glusterfs-cli-3.12.2-32.el7rhgs.x86_64
glusterfs-fuse-3.12.2-32.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-32.el7rhgs.x86_64
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0619