Bug 2179286

Summary: [OCS] [Gluster] gluster pod not able to start tcmu-runner and gluster-blockd
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: tochan
Component: gluster-block
Assignee: John Mulligan <jmulligan>
Status: CLOSED WONTFIX
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: ocs-3.11
CC: abhishku, agabriel, jmulligan, prasanna.kalever, skrenger, xiubli
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-06 18:43:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description tochan 2023-03-17 08:44:15 UTC
Description of problem:
OCS 3.11.9
Three gluster pods are at 3.11.9 and one pod is still at 3.11.4, matching the customer environment.
To align it to 3.11.9 we simply replaced the image version in the DaemonSet, switched the update policy to OnDelete, and deleted the pod so that it restarts and pulls the new image (3.11.9); a command sketch is shown below.
After the restart, the gluster pod is able to start glusterd, but tcmu-runner and gluster-blockd do not start.
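
A minimal sketch of that update procedure, assuming the DaemonSet is named glusterfs-storage in the glusterfs namespace and the container is named glusterfs (the image tag is also an assumption):

oc -n glusterfs patch ds/glusterfs-storage -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
oc -n glusterfs set image ds/glusterfs-storage glusterfs=registry.redhat.io/rhgs3/rhgs-server-rhel7:v3.11.9
# with OnDelete, the new image is pulled only when a pod is deleted and recreated
oc -n glusterfs delete pod glusterfs-storage-68bpq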


Version-Release number of selected component (if applicable):
ocs 3.11.4 -> ocs 3.11.9
Pods already on OCS 3.11.9:
glusterfs-libs-6.0-63.el7rhgs.x86_64
glusterfs-6.0-63.el7rhgs.x86_64
glusterfs-client-xlators-6.0-63.el7rhgs.x86_64
glusterfs-fuse-6.0-63.el7rhgs.x86_64
glusterfs-geo-replication-6.0-63.el7rhgs.x86_64
glusterfs-api-6.0-63.el7rhgs.x86_64
glusterfs-cli-6.0-63.el7rhgs.x86_64
glusterfs-server-6.0-63.el7rhgs.x86_64
gluster-block-0.2.1-41.el7rhgs.x86_64

Pod with the issue, updated from OCS 3.11.4 to 3.11.9:
glusterfs-api-6.0-30.1.el7rhgs.x86_64
glusterfs-fuse-6.0-30.1.el7rhgs.x86_64
glusterfs-server-6.0-30.1.el7rhgs.x86_64
glusterfs-libs-6.0-30.1.el7rhgs.x86_64
glusterfs-6.0-30.1.el7rhgs.x86_64
glusterfs-client-xlators-6.0-30.1.el7rhgs.x86_64
glusterfs-cli-6.0-30.1.el7rhgs.x86_64
glusterfs-geo-replication-6.0-30.1.el7rhgs.x86_64
gluster-block-0.2.1-36.el7rhgs.x86_64


How reproducible:
The affected pod stays at 0/1 Ready and keeps restarting:
NAME                                          READY     STATUS    RESTARTS   AGE       IP              NODE                                           NOMINATED NODE
glusterblock-storage-provisioner-dc-2-kp69k   1/1       Running   0          1d        10.128.0.110    master-1.agabriel311.lab.psi.pnq2.redhat.com   <none>
glusterfs-storage-4lw9q                       1/1       Running   0          1d        10.74.214.241   infra-0.agabriel311.lab.psi.pnq2.redhat.com    <none>
glusterfs-storage-67lfr                       1/1       Running   0          1d        10.74.212.9     infra-1.agabriel311.lab.psi.pnq2.redhat.com    <none>
glusterfs-storage-68bpq                       0/1       Running   54         19h       10.74.213.90    node-0.infra4.lab.psi.pnq2.redhat.com          <none> <-----
glusterfs-storage-tdftc                       1/1       Running   0          1d        10.74.214.108   infra-2.agabriel311.lab.psi.pnq2.redhat.com    <none>
heketi-storage-2-gh66p                        1/1       Running   0          1d        10.129.0.78     master-2.agabriel311.lab.psi.pnq2.redhat.com   <none>



Steps to Reproduce:
1. Start with a cluster on OCS 3.11.9: three gluster pods at 3.11.9 and one gluster pod at 3.11.4.
2. Scale down heketi.
3. Delete the 3.11.4 gluster pod (oc delete pod).
4. A new 3.11.9 gluster pod is created; the issue appears and the pod remains at 0/1 Ready/Running (see the command sketch below).
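
The corresponding commands, as a rough sketch (the namespace and resource names are assumptions based on the pod listing above):

oc -n glusterfs scale dc/heketi-storage --replicas=0
oc -n glusterfs delete pod glusterfs-storage-68bpq
oc -n glusterfs get pods -o wide        # the replacement pod stays at 0/1 Running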

Actual results:
see above

Expected results:
The gluster pod should start tcmu-runner and gluster-blockd and report 1/1 Ready/Running.

Additional info:


sh-4.2# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor preset: disabled)
   Active: activating (start) since Wed 2023-03-15 09:41:26 UTC; 5min ago
  Process: 178 ExecStopPost=/usr/bin/bash -c /usr/bin/echo 1 > ${NETLINK_BLOCK};                                 /usr/bin/echo 1 > ${NETLINK_RESET};                                 /usr/bin/echo 0 > ${NETLINK_BLOCK}; (code=exited, status=0/SUCCESS)
  Process: 209 ExecStartPre=/usr/libexec/gluster-block/upgrade_activities.sh (code=exited, status=0/SUCCESS)
 Main PID: 238 (tcmu-runner)
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod81ebca45_c312_11ed_a23f_fa163e4943d3.slice/docker-b71753f32d66a615ca969d2fbdc8dd303e75bb789560761228bf04a0c8814d75.scope/system.slice/tcmu-runner.service
           └─238 /usr/bin/tcmu-runner --tcmu-log-dir /var/log/glusterfs/gluster-block


sh-4.2# systemctl status gluster-blockd.service
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2023-03-15 09:47:56 UTC; 24s ago
  Process: 196 ExecStart=/usr/sbin/gluster-blockd --glfs-lru-count $GB_GLFS_LRU_COUNT --log-level $GB_LOG_LEVEL $GB_EXTRA_ARGS (code=exited, status=19)
 Main PID: 196 (code=exited, status=19)
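
For further diagnosis, the service logs inside the affected pod can be collected with standard systemd tooling; the log directory comes from the --tcmu-log-dir option shown above, and the exact file names may differ:

sh-4.2# journalctl -u tcmu-runner -u gluster-blockd --no-pager | tail -n 100
sh-4.2# ls -l /var/log/glusterfs/gluster-block/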


Case 03460127.
This is being done on Red Hat's Quicklab cluster prior to replicating the update process in the customer environment.