Bug 1632873 - gluster-blockd and tcmu-runner fail to start up on a recreated gluster POD
Summary: gluster-blockd and tcmu-runner fail to start up on a recreated gluster POD
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-block
Version: cns-3.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Xiubo Li
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-25 18:04 UTC by Valerii Ponomarov
Modified: 2019-02-07 15:24 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-07 15:24:44 UTC
Embargoed:


Attachments
- tcmu-runner-glfs.log (19.47 KB, text/plain), 2018-09-25 18:07 UTC, Valerii Ponomarov
- tcmu-runner.log (294 bytes, text/plain), 2018-09-25 18:07 UTC, Valerii Ponomarov
- gluster-block-gfapi.log (10.84 KB, text/plain), 2018-09-25 18:08 UTC, Valerii Ponomarov
- gluster-blockd.log (39.14 KB, text/plain), 2018-09-25 18:09 UTC, Valerii Ponomarov
- gluster-block-configshell.log (37.83 KB, text/plain), 2018-09-25 18:09 UTC, Valerii Ponomarov
- gluster-block-cli.log (47.88 KB, text/plain), 2018-09-25 18:10 UTC, Valerii Ponomarov

Description Valerii Ponomarov 2018-09-25 18:04:57 UTC
Description of problem:
If we delete one of the Gluster pods and wait for it to be respawned, the new pod reaches the "Running" state, but in fact the "gluster-blockd" and "tcmu-runner" services fail to start. After that we are unable to delete PVs backed by Gluster block volumes. The "glusterd" service itself starts up successfully.

Version-Release number of selected component (if applicable):
OCP-3.10 GA, deployed 1 week ago.
- oc v3.10.34
- kubernetes v1.10.0+b81c8f8
- openshift v3.10.34
- kubernetes v1.10.0+b81c8f8

Image info:
- brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7:v3.10
- docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7@sha256:d4b4841b12c397cc9da2f50720ef347faa67ec7781fd6747139464b3626d96e9

Storage release version: Red Hat Gluster Storage Server 3.4.0(Container)

Failed gluster POD (glusterfs-cns-b88vt) info:
- glusterfs-client-xlators-3.12.2-18.el7rhgs.x86_64
- glusterfs-cli-3.12.2-18.el7rhgs.x86_64
- python2-gluster-3.12.2-18.el7rhgs.x86_64
- python-rtslib-2.1.fb63-12.el7_5.noarch
- targetcli-2.1.fb46-6.el7_5.noarch
- glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64
- glusterfs-libs-3.12.2-18.el7rhgs.x86_64
- glusterfs-3.12.2-18.el7rhgs.x86_64
- glusterfs-api-3.12.2-18.el7rhgs.x86_64
- glusterfs-fuse-3.12.2-18.el7rhgs.x86_64
- python-configshell-1.1.fb23-4.el7_5.noarch
- tcmu-runner-1.2.0-24.el7rhgs.x86_64
- glusterfs-server-3.12.2-18.el7rhgs.x86_64
- gluster-block-0.2.1-26.el7rhgs.x86_64
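
(For completeness, a hedged sketch of how such a package list can be pulled from the running pod; the grep pattern is only an illustration:)

# Query the relevant package versions inside the gluster pod.
oc rsh glusterfs-cns-b88vt rpm -qa | grep -E 'gluster|tcmu|targetcli|rtslib|configshell'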


How reproducible: Tried once and hit it.


Steps to Reproduce:
1. Create PVC using storage class with Gluster-block backend
2. Delete one of the Gluster PODs
3. Wait for Gluster POD to be recreated
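
A minimal command sketch of the steps above (the "glusterfs-block" storage class name and the "cns" namespace are assumptions for illustration, not taken from this report):

# 1. Create a PVC against a gluster-block backed storage class.
cat <<EOF | oc create -n cns -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc-repro
spec:
  storageClassName: glusterfs-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

# 2. Delete one of the Gluster pods; the DaemonSet recreates it.
oc delete pod glusterfs-cns-b88vt -n cns

# 3. Watch until the replacement pod reports Running.
oc get pods -n cns -w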

Actual results:
"gluster-blockd" and "tcmu-runner" services fail to start up on the recreated Gluster POD.

Expected results:
"gluster-blockd" and "tcmu-runner" services successfully start up on the recreated Gluster POD.

Additional info:
=====================================================
[root@vp-ansible-v310-ga2-master-0 ~]# oc rsh glusterfs-cns-b88vt
=====================================================
sh-4.2# systemctl status gluster-blockd
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
sh-4.2# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: failed (Result: core-dump) since Tue 2018-09-25 17:11:19 UTC; 43min ago
  Process: 925 ExecStart=/usr/bin/tcmu-runner --tcmu-log-dir $TCMU_LOGDIR (code=dumped, signal=ABRT)
  Process: 669 ExecStartPre=/usr/libexec/gluster-block/wait-for-bricks.sh 120 (code=exited, status=0/SUCCESS)
 Main PID: 925 (code=dumped, signal=ABRT)
=====================================================
sh-4.2# journalctl -u gluster-blockd.service -b
No journal files were found.
-- No entries --
sh-4.2# journalctl -u tcmu-runner.service -b
No journal files were found.
-- No entries --
=====================================================
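
Since the journal inside the pod is empty, the most useful detail is in the gluster-block log directory; the attached logs appear to come from there. A rough, hedged sketch of collecting them and retrying the services (the directory path matches the TCMU_LOGDIR/GB_LOGDIR values shown in the pod description below):

# Inspect the gluster-block / tcmu-runner logs inside the pod.
ls /var/log/glusterfs/gluster-block
tail -n 50 /var/log/glusterfs/gluster-block/gluster-blockd.log
tail -n 50 /var/log/glusterfs/gluster-block/tcmu-runner.log

# Try a manual restart to capture a fresh failure.
systemctl restart tcmu-runner; systemctl status tcmu-runner
systemctl restart gluster-blockd; systemctl status gluster-blockd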
[root@vp-ansible-v310-ga2-master-0 ~]# oc get pods
NAME                                      READY     STATUS    RESTARTS   AGE
glusterblock-cns-provisioner-dc-1-29d8g   1/1       Running   0          7d
glusterfs-cns-5l69d                       1/1       Running   0          7d
glusterfs-cns-b88vt                       1/1       Running   0          22m
glusterfs-cns-pfhf9                       1/1       Running   0          7d
heketi-cns-1-5s8ln                        1/1       Running   0          2h
[root@vp-ansible-v310-ga2-master-0 ~]# oc describe pod glusterfs-cns-b88vt
Name:           glusterfs-cns-b88vt
Namespace:      cns
Node:           vp-ansible-v310-ga2-app-cns-1/10.70.47.1
Start Time:     Tue, 25 Sep 2018 17:06:35 +0000
Labels:         controller-revision-hash=1354052714
                glusterfs=cns-pod
                glusterfs-node=pod
                pod-template-generation=1
Annotations:    openshift.io/scc=privileged
Status:         Running
IP:             10.70.47.1
Controlled By:  DaemonSet/glusterfs-cns
Containers:
  glusterfs:
    Container ID:   docker://b0fb54dcdbaef628fe19dc0c67f48b552859fd6a1acbb3287fc2aa330b24cb27
    Image:          rhgs3/rhgs-server-rhel7:v3.10
    Image ID:       docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7@sha256:d4b4841b12c397cc9da2f50720ef347faa67ec7781fd6747139464b3626d96e9
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 25 Sep 2018 17:06:36 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   100Mi
    Liveness:   exec [/bin/bash -c systemctl status glusterd.service] delay=40s timeout=3s period=25s #success=1 #failure=50
    Readiness:  exec [/bin/bash -c systemctl status glusterd.service] delay=40s timeout=3s period=25s #success=1 #failure=50
    Environment:
      GB_GLFS_LRU_COUNT:  15
      TCMU_LOGDIR:        /var/log/glusterfs/gluster-block
      GB_LOGDIR:          /var/log/glusterfs/gluster-block
    Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /etc/ssl from glusterfs-ssl (ro)
      /etc/target from glusterfs-target (rw)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (ro)
      /usr/lib/modules from kernel-modules (ro)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/lib/misc/glusterfsd from glusterfs-misc (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7nt86 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  glusterfs-heketi:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/heketi
    HostPathType:
  glusterfs-run:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  glusterfs-lvm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/lvm
    HostPathType:
  glusterfs-etc:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/glusterfs
    HostPathType:
  glusterfs-logs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/glusterfs
    HostPathType:
  glusterfs-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/glusterd
    HostPathType:
  glusterfs-dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  glusterfs-misc:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/misc/glusterfsd
    HostPathType:
  glusterfs-cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:
  glusterfs-ssl:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl
    HostPathType:
  kernel-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib/modules
    HostPathType:
  glusterfs-target:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/target
    HostPathType:
  default-token-7nt86:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-7nt86
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  glusterfs=cns-host
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason     Age                From                                    Message
  ----     ------     ----               ----                                    -------
  Normal   Pulled     22m                kubelet, vp-ansible-v310-ga2-app-cns-1  Container image "rhgs3/rhgs-server-rhel7:v3.10" already present on machine
  Normal   Created    22m                kubelet, vp-ansible-v310-ga2-app-cns-1  Created container
  Normal   Started    22m                kubelet, vp-ansible-v310-ga2-app-cns-1  Started container
  Warning  Unhealthy  18m (x9 over 21m)  kubelet, vp-ansible-v310-ga2-app-cns-1  Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
  Warning  Unhealthy  18m (x9 over 21m)  kubelet, vp-ansible-v310-ga2-app-cns-1  Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
=====================================================
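
Note: as seen above, the pod's liveness and readiness probes only exec "systemctl status glusterd.service", which would explain why the pod reports Running/Ready even though gluster-blockd and tcmu-runner are down. A quick hedged check of all three services from the master (pod name as in this report):

oc exec glusterfs-cns-b88vt -- systemctl is-active glusterd gluster-blockd tcmu-runner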

Comment 2 Valerii Ponomarov 2018-09-25 18:07:25 UTC
Created attachment 1486849 [details]
tcmu-runner-glfs.log

Adding "tcmu-runner-glfs.log"

Comment 3 Valerii Ponomarov 2018-09-25 18:07:57 UTC
Created attachment 1486850 [details]
tcmu-runner.log

Adding "tcmu-runner.log".

Comment 4 Valerii Ponomarov 2018-09-25 18:08:42 UTC
Created attachment 1486851 [details]
gluster-block-gfapi.log

Adding "gluster-block-gfapi.log".

Comment 5 Valerii Ponomarov 2018-09-25 18:09:11 UTC
Created attachment 1486852 [details]
gluster-blockd.log

Adding "gluster-blockd.log".

Comment 6 Valerii Ponomarov 2018-09-25 18:09:42 UTC
Created attachment 1486853 [details]
gluster-block-configshell.log

Adding "gluster-block-configshell.log".

Comment 7 Valerii Ponomarov 2018-09-25 18:10:12 UTC
Created attachment 1486855 [details]
gluster-block-cli.log

Adding "gluster-block-cli.log".

Comment 9 Valerii Ponomarov 2018-09-26 05:21:07 UTC
Xiubo Li, what kind of info should I provide in addition to what is already here?

Comment 11 Valerii Ponomarov 2018-11-19 06:59:12 UTC
Xiubo Li, it is still unclear to me what information is pending from my side. Fairly detailed information is already provided in the initial description.
If you think more info should still be added, please specify exactly what you expect me to add.

Comment 13 Prasanna Kumar Kalever 2019-02-07 13:22:16 UTC
Valerii,

Could you please attach the requested sosreports?

Thanks!

Comment 14 Valerii Ponomarov 2019-02-07 13:41:14 UTC
Prasanna Kumar Kalever,

JFYI, I don't see private comments and never did, so I have no idea what was requested in them.

For which nodes do I need to provide sosreports? All of them, or only the rebooted one(s)?

Also, just to clarify: I no longer have the lab with this error, since more than 4 months have passed since the bug report.
So I will need to reproduce it again and then provide the sosreports.

Comment 15 Prasanna Kumar Kalever 2019-02-07 13:46:11 UTC
Valerii,

I see. We need sosreports from all the gluster server pods on which tcmu-runner and gluster-blockd were failing to start.
Feel free to pick the latest OCS version when you reproduce this again. At least we haven't seen a similar issue in the past couple of releases.

Thanks!
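
(A hedged sketch of collecting sosreports from every gluster server pod; the label and namespace are taken from the pod description in the report, and it assumes the sos package is available in the rhgs-server image:)

for p in $(oc get pods -n cns -l glusterfs=cns-pod -o name); do
  oc exec -n cns "$p" -- sosreport --batch
  # Copy the resulting tarball out of the pod afterwards, e.g. with "oc cp".
done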

Comment 16 Valerii Ponomarov 2019-02-07 15:09:11 UTC
Xiubo Li, Prasanna Kumar Kalever,

I just tried to reproduce this bug on a relatively new OpenShift cluster (OCP 3.10 + OCS 3.10), which is 6 days old.
I verified that the "gluster-blockd" and "tcmu-runner" services started correctly on the restarted pod, and I was then able to create a block volume using a PVC.

So this one can be considered fixed; the problem really did exist last September, though.

Comment 17 Prasanna Kumar Kalever 2019-02-07 15:16:41 UTC
Valerii,


Thanks for the effort you put into reproducing this issue.

Yeah, a lot of things have been fixed since last September.

As we don't really have details on the original root cause, due to insufficient logs/info, we cannot do much more about it.

Feel free to close this bug as CLOSED-WORKSFORME, since you don't see the issue anymore.


Cheers!

Comment 18 Valerii Ponomarov 2019-02-07 15:24:44 UTC
Done.

