Description of problem:

If we delete one of the Gluster pods and wait for it to be respawned, the new pod reaches the "Running" state, but de facto the "gluster-blockd" and "tcmu-runner" services fail to start. After that we are no longer able to delete PVs backed by Gluster block volumes. The "glusterd" service itself starts up successfully.

Version-Release number of selected component (if applicable):

OCP 3.10 GA, deployed 1 week ago.
- oc v3.10.34
- kubernetes v1.10.0+b81c8f8
- openshift v3.10.34
- kubernetes v1.10.0+b81c8f8

Image info:
- brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7:v3.10
- docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7@sha256:d4b4841b12c397cc9da2f50720ef347faa67ec7781fd6747139464b3626d96e9

Storage release version: Red Hat Gluster Storage Server 3.4.0 (Container)

Packages in the failed Gluster pod (glusterfs-cns-b88vt):
- glusterfs-client-xlators-3.12.2-18.el7rhgs.x86_64
- glusterfs-cli-3.12.2-18.el7rhgs.x86_64
- python2-gluster-3.12.2-18.el7rhgs.x86_64
- python-rtslib-2.1.fb63-12.el7_5.noarch
- targetcli-2.1.fb46-6.el7_5.noarch
- glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64
- glusterfs-libs-3.12.2-18.el7rhgs.x86_64
- glusterfs-3.12.2-18.el7rhgs.x86_64
- glusterfs-api-3.12.2-18.el7rhgs.x86_64
- glusterfs-fuse-3.12.2-18.el7rhgs.x86_64
- python-configshell-1.1.fb23-4.el7_5.noarch
- tcmu-runner-1.2.0-24.el7rhgs.x86_64
- glusterfs-server-3.12.2-18.el7rhgs.x86_64
- gluster-block-0.2.1-26.el7rhgs.x86_64

How reproducible:
Tried once and hit it.

Steps to Reproduce:
1. Create a PVC using a storage class with the gluster-block backend (see the command sketch after this report).
2. Delete one of the Gluster pods.
3. Wait for the Gluster pod to be recreated.

Actual results:
The "gluster-blockd" and "tcmu-runner" services fail to start on the recreated Gluster pod.

Expected results:
The "gluster-blockd" and "tcmu-runner" services start successfully on the recreated Gluster pod.

Additional info:
=====================================================
[root@vp-ansible-v310-ga2-master-0 ~]# oc rsh glusterfs-cns-b88vt
=====================================================
sh-4.2# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

sh-4.2# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: failed (Result: core-dump) since Tue 2018-09-25 17:11:19 UTC; 43min ago
  Process: 925 ExecStart=/usr/bin/tcmu-runner --tcmu-log-dir $TCMU_LOGDIR (code=dumped, signal=ABRT)
  Process: 669 ExecStartPre=/usr/libexec/gluster-block/wait-for-bricks.sh 120 (code=exited, status=0/SUCCESS)
 Main PID: 925 (code=dumped, signal=ABRT)
=====================================================
sh-4.2# journalctl -u gluster-blockd.service -b
No journal files were found.
-- No entries --

sh-4.2# journalctl -u tcmu-runner.service -b
No journal files were found.
-- No entries --
=====================================================
[root@vp-ansible-v310-ga2-master-0 ~]# oc get pods
NAME                                      READY     STATUS    RESTARTS   AGE
glusterblock-cns-provisioner-dc-1-29d8g   1/1       Running   0          7d
glusterfs-cns-5l69d                       1/1       Running   0          7d
glusterfs-cns-b88vt                       1/1       Running   0          22m
glusterfs-cns-pfhf9                       1/1       Running   0          7d
heketi-cns-1-5s8ln                        1/1       Running   0          2h

[root@vp-ansible-v310-ga2-master-0 ~]# oc describe pod glusterfs-cns-b88vt
Name:           glusterfs-cns-b88vt
Namespace:      cns
Node:           vp-ansible-v310-ga2-app-cns-1/10.70.47.1
Start Time:     Tue, 25 Sep 2018 17:06:35 +0000
Labels:         controller-revision-hash=1354052714
                glusterfs=cns-pod
                glusterfs-node=pod
                pod-template-generation=1
Annotations:    openshift.io/scc=privileged
Status:         Running
IP:             10.70.47.1
Controlled By:  DaemonSet/glusterfs-cns
Containers:
  glusterfs:
    Container ID:   docker://b0fb54dcdbaef628fe19dc0c67f48b552859fd6a1acbb3287fc2aa330b24cb27
    Image:          rhgs3/rhgs-server-rhel7:v3.10
    Image ID:       docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7@sha256:d4b4841b12c397cc9da2f50720ef347faa67ec7781fd6747139464b3626d96e9
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 25 Sep 2018 17:06:36 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   100Mi
    Liveness:   exec [/bin/bash -c systemctl status glusterd.service] delay=40s timeout=3s period=25s #success=1 #failure=50
    Readiness:  exec [/bin/bash -c systemctl status glusterd.service] delay=40s timeout=3s period=25s #success=1 #failure=50
    Environment:
      GB_GLFS_LRU_COUNT:  15
      TCMU_LOGDIR:        /var/log/glusterfs/gluster-block
      GB_LOGDIR:          /var/log/glusterfs/gluster-block
    Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /etc/ssl from glusterfs-ssl (ro)
      /etc/target from glusterfs-target (rw)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (ro)
      /usr/lib/modules from kernel-modules (ro)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/lib/misc/glusterfsd from glusterfs-misc (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7nt86 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  glusterfs-heketi:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/heketi
    HostPathType:
  glusterfs-run:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  glusterfs-lvm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/lvm
    HostPathType:
  glusterfs-etc:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/glusterfs
    HostPathType:
  glusterfs-logs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/glusterfs
    HostPathType:
  glusterfs-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/glusterd
    HostPathType:
  glusterfs-dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  glusterfs-misc:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/misc/glusterfsd
    HostPathType:
  glusterfs-cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:
  glusterfs-ssl:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl
    HostPathType:
  kernel-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib/modules
    HostPathType:
  glusterfs-target:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/target
    HostPathType:
  default-token-7nt86:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-7nt86
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  glusterfs=cns-host
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason     Age                From                                     Message
  ----     ------     ----               ----                                     -------
  Normal   Pulled     22m                kubelet, vp-ansible-v310-ga2-app-cns-1   Container image "rhgs3/rhgs-server-rhel7:v3.10" already present on machine
  Normal   Created    22m                kubelet, vp-ansible-v310-ga2-app-cns-1   Created container
  Normal   Started    22m                kubelet, vp-ansible-v310-ga2-app-cns-1   Started container
  Warning  Unhealthy  18m (x9 over 21m)  kubelet, vp-ansible-v310-ga2-app-cns-1   Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
  Warning  Unhealthy  18m (x9 over 21m)  kubelet, vp-ansible-v310-ga2-app-cns-1   Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
=====================================================
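For reference, a minimal sketch of the reproduction commands. The PVC name "block-pvc" and storage class name "glusterfs-block" are illustrative placeholders, not taken from this cluster; the glusterfs=cns-pod label comes from the pod description above.

# 1. Create a PVC against a gluster-block storage class:
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc                      # placeholder name
spec:
  storageClassName: glusterfs-block    # placeholder; use your block storage class
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

# 2. Delete one of the Gluster pods; the DaemonSet respawns it:
oc delete pod glusterfs-cns-b88vt

# 3. Wait for the replacement pod, then check the block services inside it:
oc get pods -l glusterfs=cns-pod -w
oc exec <new-gluster-pod> -- systemctl is-active glusterd gluster-blockd tcmu-runner

# After the failure, deleting the PVC (and the PV backing it) no longer succeeds:
oc delete pvc block-pvc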
Created attachment 1486849 [details]
tcmu-runner-glfs.log

Adding "tcmu-runner-glfs.log".
Created attachment 1486850 [details]
tcmu-runner.log

Adding "tcmu-runner.log".
Created attachment 1486851 [details]
gluster-block-gfapi.log

Adding "gluster-block-gfapi.log".
Created attachment 1486852 [details]
gluster-blockd.log

Adding "gluster-blockd.log".
Created attachment 1486853 [details]
gluster-block-configshell.log

Adding "gluster-block-configshell.log".
Created attachment 1486855 [details]
gluster-block-cli.log

Adding "gluster-block-cli.log".
Xiubo Li, what information should I provide in addition to what is already here?
Xiubo Li, it is still unclear to me what information you are waiting for from me. Fairly detailed information is already provided in the first comment here. If you think something should still be added, please specify exactly what you expect from me.
Valerii,

Could you please attach the requested sosreports? Thanks!
Prasanna Kumar Kalever,

JFYI, I cannot see private comments and never could, so I have no idea what is in them. For which nodes do I need to provide sosreports? All of them, or only the rebooted one(s)? Also, to clarify: I no longer have the lab with that error, as more than 4 months have passed since this bug was reported, so I will need to reproduce it again and then provide the sosreports.
Valerii,

I see. We need sosreports from all the Gluster server pods on which tcmu-runner and gluster-blockd were failing to start. Feel free to pick the latest OCS version when reproducing this again; at least we have not seen a similar issue in the past couple of releases. Thanks!
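For whoever reproduces this, a sketch of how the sosreports could be collected from all Gluster server pods, assuming the sos package is available inside the pods and reusing the glusterfs=cns-pod label from the pod description above; the archive name under /var/tmp varies per run, so the oc cp path is a placeholder:

for pod in $(oc get pods -l glusterfs=cns-pod -o jsonpath='{.items[*].metadata.name}'); do
    # Run sosreport non-interactively inside the pod:
    oc exec "$pod" -- sosreport --batch
    # sosreport writes its archive under /var/tmp; find it and copy it out:
    oc exec "$pod" -- ls /var/tmp
    oc cp "$pod":/var/tmp/<sosreport-archive>.tar.xz "./$pod-sosreport.tar.xz"
done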
Xiubo Li, Prasanna Kumar Kalever,

I just tried to reproduce this bug on a relatively new OpenShift cluster (OCP 3.10 + OCS 3.10), which is 6 days old. I verified that the "gluster-blockd" and "tcmu-runner" services started correctly on the restarted pod, and I was then able to create a block volume using a PVC. So this one can be considered fixed, although the problem really did exist last September.
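A minimal sketch of that service check, with a placeholder pod name (the actual commands used may have differed):

oc exec <new-gluster-pod> -- systemctl is-active glusterd gluster-blockd tcmu-runner
# Expected output on a healthy pod: three lines, each reading "active".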
Valerii,

Thanks for the effort you put into reproducing this issue. Yes, a lot of things have been fixed since last September. Since we have no details on the earlier root cause, due to insufficient logs/info, we cannot do much more about it. Feel free to close this bug as CLOSED-WORKSFORME, as you no longer see the issue. Cheers!
Done.