Description of problem:

While creating and deleting 200 PVs in a loop and running "gluster volume heal" in the gluster pods, 2 of the 4 gluster pods are in 0/1 state.

Version-Release number of selected component (if applicable):

# rpm -qa | grep openshift
openshift-ansible-roles-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-master-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-sdn-ovs-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-docs-3.9.31-1.git.34.154617d.el7.noarch
openshift-ansible-playbooks-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-docker-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-node-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-clients-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-3.9.31-1.git.34.154617d.el7.noarch

# oc rsh glusterfs-storage-mrfh4
sh-4.2# rpm -qa | grep gluster
glusterfs-libs-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-api-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-fuse-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-server-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
gluster-block-0.2.1-14.1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-cli-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-geo-replication-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-debuginfo-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64

# oc rsh heketi-storage-1-55bw4
sh-4.2# rpm -qa | grep heketi
python-heketi-6.0.0-7.4.el7rhgs.x86_64
heketi-client-6.0.0-7.4.el7rhgs.x86_64
heketi-6.0.0-7.4.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:

CNS 4-node setup, each node with a 1TB device, CPU = 32 (4 cores), Memory = 72GB.

1. Created 100 1GB mongodb pods and ran IO on them (using dd).
2. Upgraded the system from the 3.9 live build to the Experian hotfix build.
3. After all 4 gluster pods had spun up and were in 1/1 Running state, confirmed that all mongodb pods were also in Running state.
4. Initiated creation and deletion of 200 PVs along with running gluster volume heal on all 4 gluster pods, using the loops below.

---------- creation and deletion of PVs ----------
while true
do
    for i in {101..300}
    do
        ./pvc_create.sh c$i 1; sleep 30
    done
    sleep 40
    for i in {101..300}
    do
        oc delete pvc c$i; sleep 20
    done
done
---------------------------------------------------

Running gluster volume heal:

while true; do for i in $(gluster v list | grep vol); do gluster v heal $i; sleep 2; done; done

5. A core was generated on a gluster pod and 2 gluster pods went into 0/1 state.

Actual results:
A core file is generated on 1 of the gluster pods and 2 gluster pods are in 0/1 state.

Expected results:
No core files should be generated and all gluster pods should be in 1/1 Running state.

Additional info:
Will attach logs.
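For reference, pvc_create.sh itself is not attached to this report; the following is only a minimal sketch of what a helper of that shape might look like, assuming it takes a PVC name and a size in GiB and that the storage class is named "glusterfs-storage" (both are assumptions, not taken from this setup):

#!/bin/bash
# Hypothetical pvc_create.sh <name> <size-in-GiB> -- illustration only,
# not the actual script used in this report.
name=$1
size=$2
oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}Gi
  storageClassName: glusterfs-storage   # assumed storage class name
EOF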
There was some discussion in the call regarding pod limits for a node.

Referring to https://access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html-single/scaling_and_performance_guide/#scaling-performance-current-cluster-limits

Max number of pods per node = min(250, number of CPU cores * 10)

Considering that the CPU count on these machines is 32, we have:

pod limit = min(250, 32 * 10) = 250

This holds true only if
a. all pods have their CPU resource defined as 100 millicpus
b. the host machines (which might be VMs) have 32 dedicated CPU cores
[root@dhcp46-119 ~]# nproc
32
[root@dhcp46-119 ~]# lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
CPU(s):                32
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             8
The above comments are based on the nproc and lscpu output. I would need data to confirm that these 32 cores are dedicated to the VM. Please attach data from the hypervisor showing this.
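If these VMs are hosted on a KVM/libvirt hypervisor (an assumption; the report does not state the hypervisor type), output along these lines from the hypervisor would show the vCPU allocation and any pinning for the guest:

# Hypothetical commands on the hypervisor, assuming libvirt/KVM and a guest
# domain named "dhcp46-119" -- substitute the actual domain name.
virsh dominfo dhcp46-119     # number of vCPUs allocated to the guest
virsh vcpuinfo dhcp46-119    # per-vCPU placement and CPU affinity
virsh vcpupin dhcp46-119     # vCPU-to-physical-CPU pinning, if configured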
(In reply to Raghavendra Talur from comment #7)
> There was some discussion in the call regarding pod limits for a node.
>
> Referring to
> https://access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html-single/scaling_and_performance_guide/#scaling-performance-current-cluster-limits
>
> Max number of pods per node = min(250, number of CPU cores * 10)
>
> Considering that the CPU count on these machines is 32, we have:
>
> pod limit = min(250, 32 * 10) = 250
>
> This holds true only if
> a. all pods have their CPU resource defined as 100 millicpus
> b. the host machines (which might be VMs) have 32 dedicated CPU cores

c. the default pod limits have not been overridden in node-config
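A quick way to check point (c) is to look at the node configuration and at what the node actually advertises; a sketch follows (the path is the default location of the OCP 3.9 node config; if the settings are absent there, the defaults are in effect):

# On each node: check whether the pod limits were overridden in node-config
grep -A1 -E 'max-pods|pods-per-core' /etc/origin/node/node-config.yaml

# And confirm what the node actually advertises as its pod capacity
oc describe node <node-name> | grep -i pods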
Is there any way you can test this out with the newer docker referenced in https://bugzilla.redhat.com/show_bug.cgi?id=1560428#c48?
Can you try docker-1.13.1-94, which we believe has the fixes?
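For checking what the nodes currently run and moving to the suggested build, something like the following on each node should be enough (availability of the docker-1.13.1-94 build in the enabled repos is an assumption):

# Check the installed docker build on the node
rpm -q docker
docker version --format '{{.Server.Version}}'

# Update to the suggested build and restart the daemon
# (drain the node first on a live cluster; repo availability is assumed)
yum update -y 'docker-1.13.1-94*'
systemctl restart docker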
Placing needinfo on kasturi to try this out and get back. Removing needinfo on vinutha as he is no longer with Red Hat.
Below are the test steps I ran to verify this bug:

1) Created 100 mongodb pods, each with a 1GB PVC and 1GB RAM allocated.
2) Started running a loop of the volume heal info command from all the gluster pods present in the system (3 pods were present); a sketch of the loop is included at the end of this comment.
3) From another terminal, started running load on mongodb.
4) From another terminal, started running a loop of PVC create and PVC delete for 300 PVCs.

After the 300 PVCs were created and deleted I did not see any gluster pod going into 0/1 state; all pods are up and running, and no gluster core was generated on any of the nodes.

However, on the mongodb side I did hit an issue similar to the one below, for which I will be raising a separate bug:

=========================================================================================
2019-03-30 19:03:40:170 12580 sec: 276 operations; 0.1 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=1, Max=30015487, Min=29999104, Avg=30007296, 90=30015487, 99=30015487, 99.9=30015487, 99.99=30015487] [READ-MODIFY-WRITE: Count=1, Max=60030975, Min=59998208, Avg=60014592, 90=60030975, 99=60030975, 99.9=60030975, 99.99=60030975] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:03:50:170 12590 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:04:00:170 12600 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}.
Client view of cluster state is {type=UNKNOWN, servers=[{address=:27017:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: :27017: invalid IPv6 address}, caused by {java.net.UnknownHostException: :27017: invalid IPv6 address}}]
2019-03-30 19:04:10:170 12610 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 11 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=1, Max=30015487, Min=29999104, Avg=30007296, 90=30015487, 99=30015487, 99.9=30015487, 99.99=30015487] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:04:20:170 12620 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 11 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:04:30:170 12630 sec: 276 operations; 0 current ops/sec; est completion in 9 hours 12 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches PrimaryServerSelector.
Client view of cluster state is {type=UNKNOWN, servers=[{address=:27017:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: :27017: invalid IPv6 address}, caused by {java.net.UnknownHostException: :27017: invalid IPv6 address}}]
2019-03-30 19:04:40:170 12640 sec: 277 operations; 0.1 current ops/sec; est completion in 9 hours 9 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=1, Max=30015487, Min=29999104, Avg=30007296, 90=30015487, 99=30015487, 99.9=30015487, 99.99=30015487] [READ-MODIFY-WRITE: Count=1, Max=60030975, Min=59998208, Avg=60014592, 90=60030975, 99=60030975, 99.9=60030975, 99.99=60030975] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:04:50:170 12650 sec: 277 operations; 0 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
2019-03-30 19:05:00:170 12660 sec: 277 operations; 0 current ops/sec; est completion in 9 hours 10 minutes [READ: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-MODIFY-WRITE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [READ-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0] [UPDATE: Count=0, Max=0, Min=9223372036854775807, Avg=�, 90=0, 99=0, 99.9=0, 99.99=0]
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}.
Client view of cluster state is {type=UNKNOWN, servers=[{address=:27017:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: :27017: invalid IPv6 address}, caused by {java.net.UnknownHostException: :27017: invalid IPv6 address}}]

Since the actual issue that was reported is not seen, moving the bug to Verified state.

oc version:
======================
[root@dhcp47-89 ~]# oc version
oc v3.9.71
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp47-89.lab.eng.blr.redhat.com:8443
openshift v3.9.71
kubernetes v1.9.1+a0ce1bc657

sh-4.2# rpm -qa | grep glusterfs
glusterfs-libs-3.12.2-32.el7rhgs.x86_64
glusterfs-3.12.2-32.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-32.el7rhgs.x86_64
glusterfs-server-3.12.2-32.el7rhgs.x86_64
glusterfs-api-3.12.2-32.el7rhgs.x86_64
glusterfs-cli-3.12.2-32.el7rhgs.x86_64
glusterfs-fuse-3.12.2-32.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-32.el7rhgs.x86_64
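For reference, the heal info loop mentioned in step 2 was of the following shape (a sketch only, not the exact command line used; the "grep vol" filter mirrors the loop from the original description and is an assumption about the volume naming here):

# Run inside each gluster pod: repeatedly query heal info for every volume
while true; do
    for v in $(gluster volume list | grep vol); do
        gluster volume heal "$v" info
        sleep 2
    done
done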
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0619