Description of problem:
heketi pod crashes

Version-Release number of selected component (if applicable):
Image: rhgs3/rhgs-volmanager-rhel7:latest
Image ID: docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-volmanager-rhel7@sha256:d1594ec197c3620c21ff98d404f48c6decd386be194307d336b0521ce796a5b8

How reproducible:
Not clear how to reproduce this issue.

Steps to Reproduce:
1. NA

Actual results:
heketi pod crashes

Expected results:

Additional info:

--- oc logs of the heketi pod ---
# oc logs heketi-storage-1-8lss6
Heketi 6.0.0
[heketi] INFO 2018/02/23 08:18:02 Loaded kubernetes executor
[heketi] INFO 2018/02/23 08:18:02 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] ERROR 2018/02/23 08:18:02 /src/github.com/heketi/heketi/apps/glusterfs/app.go:150: Heketi terminated while performing one or more operations. Server will not start as long as pending operations are present in the db.
panic: Heketi terminated while performing one or more operations. Server will not start as long as pending operations are present in the db.

goroutine 1 [running]:
github.com/heketi/heketi/apps/glusterfs.NewApp(0x236bf40, 0xc4201402b8, 0x0)
        /builddir/build/BUILD/heketi-6.0.0/src/github.com/heketi/heketi/apps/glusterfs/app.go:154 +0xc0e
main.main()
        /builddir/build/BUILD/heketi-6.0.0/main.go:273 +0x46d
---
-------------------------------------------------------------------------------------

--- oc describe of the heketi pod ---
# oc describe pod heketi-storage-1-8lss6
Name:           heketi-storage-1-8lss6
Namespace:      glusterfs
Node:           app-node-87.scale-ci.example.com/192.168.99.135
Start Time:     Thu, 22 Feb 2018 16:05:45 -0500
Labels:         deployment=heketi-storage-1
                deploymentconfig=heketi-storage
                glusterfs=heketi-storage-pod
                heketi=storage-pod
Annotations:    openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=heketi-storage
                openshift.io/deployment.name=heketi-storage-1
                openshift.io/scc=uperf
Status:         Running
IP:             10.131.8.64
Controlled By:  ReplicationController/heketi-storage-1
Containers:
  heketi:
    Container ID:   docker://ea1b2ef16949b804a99368c262c042519511e0bb6439e173df200ad0f68fb94f
    Image:          rhgs3/rhgs-volmanager-rhel7:latest
    Image ID:       docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-volmanager-rhel7@sha256:d1594ec197c3620c21ff98d404f48c6decd386be194307d336b0521ce796a5b8
    Port:           8080/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 23 Feb 2018 03:18:02 -0500
      Finished:     Fri, 23 Feb 2018 03:18:02 -0500
    Ready:          False
    Restart Count:  130
    Liveness:       http-get http://:8080/hello delay=30s timeout=3s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8080/hello delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      HEKETI_USER_KEY:                3ul+XJr0Tc77P7biuizMBJXluUINAGW05kU6QfZwVhI=
      HEKETI_ADMIN_KEY:               7+LhqsCEe2L33J961jfezFsSDxkh4hWOi0n53mazufE=
      HEKETI_EXECUTOR:                kubernetes
      HEKETI_FSTAB:                   /var/lib/heketi/fstab
      HEKETI_SNAPSHOT_LIMIT:          14
      HEKETI_KUBE_GLUSTER_DAEMONSET:  1
    Mounts:
      /etc/heketi from config (rw)
      /var/lib/heketi from db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from heketi-storage-service-account-token-zqv7c (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  db:
    Type:           Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
    EndpointsName:  heketi-db-storage-endpoints
    Path:           heketidbstorage
    ReadOnly:       false
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  heketi-storage-config-secret
    Optional:    false
  heketi-storage-service-account-token-zqv7c:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  heketi-storage-service-account-token-zqv7c
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason   Age                  From                                       Message
  ----     ------   ----                 ----                                       -------
  Normal   Pulled   1h (x113 over 10h)   kubelet, app-node-87.scale-ci.example.com  Container image "rhgs3/rhgs-volmanager-rhel7:latest" already present on machine
  Warning  BackOff  2s (x2983 over 10h)  kubelet, app-node-87.scale-ci.example.com  Back-off restarting failed container
I have seen this too. For me it happened when I deleted a heketi pod while a volume create operation was ongoing.

Elvir, please attach the heketi.db and the heketi logs.
(In reply to Raghavendra Talur from comment #3)
> I have seen this too. For me it happened when I deleted a heketi pod while
> the volume create operation was ongoing.
>
> Elvir,
>
> Please attach the heketi.db and heketi logs.

The `oc logs <heketi pod>` output is already attached in comment #1. Do you want some other specific heketi logs? The heketi.db is attached.
This was hit again in the scale lab after the cluster was redeployed. The problem occurred when the first gluster-block PVC was created in OpenShift.

Adding the TestBlocker keyword since this currently prevents OCP logging and metrics deployment with glusterblock storage in the scalability CI cluster.
The heketi pod crash is *not* visible with these images:
--
openshift_storage_glusterfs_block_version: "3.3.0-362"
openshift_storage_glusterfs_version: "3.3.1-3"
openshift_storage_glusterfs_heketi_version: "3.3.0-364"
openshift_storage_glusterfs_s3_version: "3.3.0-362"
---
*** Bug 1549140 has been marked as a duplicate of this bug. ***
openshift version:
atomic-openshift.x86_64  3.9.2-1.git.0.eeb4d16.el7

openshift-ansible version:
[fedora@ip-172-31-33-174 openshift-ansible]$ git log --oneline -1
5225351c4 (HEAD -> release-3.9, origin/release-3.9) Merge pull request #7460 from gabemontero/jenkins-template-update

That is the latest version of CNS in brew for the moment:
openshift_storage_glusterfs_version=3.3.1-7
openshift_storage_glusterfs_heketi_version=3.3.1-4
openshift_storage_glusterfs_block_version=3.3.1-3

Steps to reproduce:
1. Create 60 pods, each of them with a glusterfs-storage PVC.
2. DURING the provisioning of the pods/PVCs, run:
   oc delete pod -n glusterfs "$(oc get pod -n glusterfs | grep heketi | awk '{print $1}')"

When this happens:

root@ip-172-31-17-124: ~ # oc logs -f -n glusterfs "$(oc get pod -n glusterfs | grep heketi | awk '{print $1}')"
Heketi 6.0.0
[heketi] ERROR 2018/03/08 21:55:45 /src/github.com/heketi/heketi/apps/glusterfs/app.go:79: invalid log level:
[heketi] INFO 2018/03/08 21:55:45 Loaded kubernetes executor
[heketi] INFO 2018/03/08 21:55:45 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/08 21:55:45 Server refusing to start.
[heketi] ERROR 2018/03/08 21:55:45 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
ERROR: Unable to start application

root@ip-172-31-17-124: ~ # oc get pod -n glusterfs -o wide
NAME                                          READY     STATUS             RESTARTS   AGE       IP              NODE
glusterblock-storage-provisioner-dc-2-mfhtw   1/1       Running            0          7m        172.23.2.24     ip-172-31-21-106.us-west-2.compute.internal
glusterfs-storage-n69j6                       1/1       Running            0          12m       172.31.54.46    ip-172-31-54-46.us-west-2.compute.internal
glusterfs-storage-xwcgj                       1/1       Running            0          12m       172.31.57.101   ip-172-31-57-101.us-west-2.compute.internal
glusterfs-storage-zlfc5                       1/1       Running            0          12m       172.31.55.128   ip-172-31-55-128.us-west-2.compute.internal
heketi-storage-2-f2jcb                        0/1       CrashLoopBackOff   3          1m        172.23.2.29     ip-172-31-21-106.us-west-2.compute.internal

More information: I tested yesterday with the older released version; there heketi could always heal itself.

openshift_storage_glusterfs_image=registry.access.redhat.com/rhgs3/rhgs-server-rhel7
openshift_storage_glusterfs_version=3.3.0-362
openshift_storage_glusterfs_heketi_image=registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7
openshift_storage_glusterfs_heketi_version=3.3.0-364
openshift_storage_glusterfs_block_image=registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7
openshift_storage_glusterfs_block_version=3.3.0-362
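Step 1 of the reproduction above can be scripted. A minimal sketch, assuming the storage class is named glusterfs-storage as in this report; the claim-name prefix "testpvc" and the helper names are made up (the actual test used a python script):

```shell
# Emit a PVC manifest for index $1; kept as a function so the YAML can be
# inspected or piped to "oc create -f -".
pvc_yaml() {
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testpvc-$1
spec:
  storageClassName: glusterfs-storage
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
}

# Create N PVCs (default 60) so that provisioning keeps heketi busy long
# enough to delete its pod mid-operation.
create_pvcs() {
    for i in $(seq 1 "${1:-60}"); do
        pvc_yaml "$i" | oc create -f -
    done
}
```

Usage would be `create_pvcs 60` against a cluster where the glusterfs-storage class exists.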
A volume create command was run. While the volume creation was running, the heketi pod was deleted. The pod entered the terminating state and a new heketi pod was created. The volume creation that was running at the same time was successful. Logs from the new heketi pod are given below along with the command output. The same test was tried with multiple PVC creations and no issues were seen there either; a new heketi pod was up and running.

[root@dhcp46-90 ~]# heketi-cli volume create --size=2
[root@dhcp46-90 ~]# oc delete pod heketi-1-x52cj
pod "heketi-1-x52cj" deleted
[root@dhcp46-90 ~]# oc get pods
NAME                                  READY     STATUS              RESTARTS   AGE
busybox10                             1/1       Running             22         22h
busybox11                             1/1       Running             22         22h
busybox12                             1/1       Running             21         21h
busybox3                              1/1       Running             44         1d
busybox5                              1/1       Running             27         1d
busybox6                              1/1       Running             26         1d
busybox9                              1/1       Running             23         23h
glusterblock-provisioner-dc-1-vz5p2   1/1       Running             1          2d
glusterfs-574s7                       1/1       Running             1          2d
glusterfs-r4dp6                       1/1       Running             1          2d
glusterfs-s2t76                       1/1       Running             0          1d
heketi-1-9sz5p                        0/1       ContainerCreating   0          8s
heketi-1-x52cj                        1/1       Terminating         0          4m
storage-project-router-3-jkwjn        1/1       Running             2          14d
[root@dhcp46-90 ~]# oc get pods
NAME                                  READY     STATUS    RESTARTS   AGE
busybox10                             1/1       Running   22         22h
busybox11                             1/1       Running   22         22h
busybox12                             1/1       Running   21         21h
busybox3                              1/1       Running   44         1d
busybox5                              1/1       Running   27         1d
busybox6                              1/1       Running   26         1d
busybox9                              1/1       Running   23         23h
glusterblock-provisioner-dc-1-vz5p2   1/1       Running   1          2d
glusterfs-574s7                       1/1       Running   1          2d
glusterfs-r4dp6                       1/1       Running   1          2d
glusterfs-s2t76                       1/1       Running   0          1d
heketi-1-9sz5p                        1/1       Running   0          39s
storage-project-router-3-jkwjn        1/1       Running   2          14d

=================================== Logs from the new heketi pod ===================================
[root@dhcp46-90 ~]# oc logs heketi-1-9sz5p
Heketi 6.0.0
[heketi] ERROR 2018/03/09 08:54:38 /src/github.com/heketi/heketi/apps/glusterfs/app.go:79: invalid log level:
[heketi] INFO 2018/03/09 08:54:38 Loaded kubernetes executor
[heketi] ERROR 2018/03/09 08:54:38 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
[heketi] INFO 2018/03/09 08:54:38 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/09 08:54:38 Ignoring stale pending operations.Server will be running with incomplete/inconsistent state in DB.
[heketi] INFO 2018/03/09 08:54:38 Block: Auto Create Block Hosting Volume set to true
[heketi] INFO 2018/03/09 08:54:38 Block: New Block Hosting Volume size 500 GB
[heketi] INFO 2018/03/09 08:54:38 GlusterFS Application Loaded
Listening on port 8080

Hence, moving the bug to verified.
@Rachael, I can still replicate the issue in this bz (3 times yesterday) with openshift_storage_glusterfs_block_version=3.3.1-3.

Which version did you use when verifying the bz? Can you replicate the bz with the version before the fix? See comment 16.

As far as I can see, the differences in the steps we followed are:

* I use a python script to create PVCs on OpenShift while you call 'heketi-cli' to create volumes. I do not think this is the reason the bz no longer shows up for you.

* But this could be important: how do you ensure that there are pending operations at the moment the heketi pod is deleted? That is why I created 60 pods with PVCs, which gave me enough time to ensure that. I keep one terminal watching the number of Bound PVCs; when I saw the number growing (ensuring the provisioning in heketi had started) but still far from the target of 60 (ensuring the provisioning had not finished), I used another terminal to delete the heketi pod.

Can you please try to replicate with my steps? Thanks.
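The timing described above can be sketched as a small script. The helper names and thresholds below are made up for illustration (the actual test watched `oc get pvc` manually in a second terminal):

```shell
# Count Bound PVCs from "oc get pvc" style output on stdin; separated
# out so the timing logic can be checked without a cluster.
count_bound() {
    grep -c ' Bound '
}

# True when provisioning has started (>= 5 Bound) but is still far from
# the 60-PVC target (< 30), i.e. pending operations are likely in flight.
in_window() {
    [ "$1" -ge 5 ] && [ "$1" -lt 30 ]
}

# Poll the Bound count and delete the heketi pod inside the window.
delete_heketi_in_window() {
    while :; do
        n=$(oc get pvc -n glusterfs --no-headers | count_bound)
        if in_window "$n"; then
            oc delete pod -n glusterfs \
                "$(oc get pod -n glusterfs | grep heketi | awk '{print $1}')"
            return
        fi
        sleep 2
    done
}
```

Run `delete_heketi_in_window` in one terminal while the PVC-creating script runs in another.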
(In reply to Hongkai Liu from comment #18)

Hi Hongkai,

As mentioned in comment 17, I had tried the same test with PVC creation and did not hit any issues. However, I tried the test again with the steps you mentioned. I also use a python script to create PVCs. The steps I followed are:

1. Created 60 PVCs.
2. When the number of bound PVCs reached 5, I deleted the heketi pod from a different terminal.
3. The old heketi pod went into the terminating state and a new heketi pod was created.
4. The new heketi pod was up and running and, after a while, the provisioning for all the pending PVCs was also successful.

I have attached logs from the new heketi pod.

Version: rhgs-volmanager-rhel7:3.3.1-4
Ok, I see. I will try to replicate this again today with another cluster and will update the result here.

Observation: from the uploaded log, it seems there are 59 volumes created by the new heketi pod:

# grep "success: please" attachment.txt | wc -l
59

That means only 1 (instead of 5) PVC was served by the killed heketi, unless my reading of the heketi log is wrong. That 1 PVC becomes very important for seeing the bz. It would be nice to see the log of the killed heketi as well.

Rachael, at the same time, can you give it a try with a version with which you think you can see the bz? If you have done this before, just let me know the version. I just want to rule out test environment differences between us, so that we know you did see the bz somewhere in your test env. Maybe something else in my env is causing the issue? Did I not install CNS correctly? I do not know. Thanks.
Replicated the bz with the same steps on another cluster.

Logs of the 2 heketi pods, version info, and pod status: http://file.rdu.redhat.com/~hongkliu/test_result/bz1548322/

Let me know if more info is needed. Thanks.
Rachael, please share your image version.

Hongkai, in the second-to-last build, heketi would abort its startup if there were pending entries. However, in the latest build we can control this behaviour with an environment variable which defaults to 'true'; because of that, the heketi server will continue its start operation. The log below shows this (c#17):

[heketi] ERROR 2018/03/09 08:54:38 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
[heketi] INFO 2018/03/09 08:54:38 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/09 08:54:38 Ignoring stale pending operations.Server will be running with incomplete/inconsistent state in DB.
Hi Humble, this is the version I use:

openshift_storage_glusterfs_heketi_image=registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7
openshift_storage_glusterfs_heketi_version=3.3.1-4

It is downloaded from the brew registry and pushed to reg-aws. I need to do this because my cluster is on AWS, which has no access to brew.

In #c20, Rachael mentioned the heketi version. It seems to be the same as mine.

Can you reveal the name of the env. var. from #c23 so that I can check it next time? I want to make sure that I am using the right image.

Another thought: could it be related to the way CNS was installed? I use openshift-ansible for this. The pod names listed in #c17 seem to follow a different naming strategy.
https://github.com/heketi/heketi/blob/263fbb72055d71b3763a77c051e7a00cf0c4e436/docs/troubleshooting.md

--snip--
Set the environment variable HEKETI_IGNORE_STALE_OPERATIONS=true. If provided, the value should be "true" or "false". When set to "true" the variable will force Heketi to start despite the existence of stale pending operation entries in the db.

Important: This option exists for expedience and forcing Heketi to start will not resolve the stale operations and any inconsistencies between Heketi and Gluster. One should schedule some planned downtime and use the methods below to really resolve the situation.

If you are running Heketi within a container this variable can be passed into the container using the appropriate tools for your system.
--/snip--

Maybe the ansible deployment missed this configuration.
Additional reference: https://github.com/gluster/gluster-kubernetes/pull/447/files
Hongkai,

Yes, the deployment type matters. I think the templates that are installed with cns-deploy already have this variable, whereas the templates that are installed with the ansible installer don't.

Please edit the heketi dc according to the PR https://github.com/gluster/gluster-kubernetes/pull/447 if you want to recover your setups. We will file a bug on OCP and update the templates in the ansible installer.
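Until the ansible-installer templates carry the variable, the same change from that PR can be applied directly to a running deployment. A sketch, assuming the dc name heketi-storage and the glusterfs namespace seen in the pod listings earlier in this report:

```shell
# Inject the variable into the existing deploymentconfig; this triggers
# a redeploy of the heketi pod with the new environment.
oc set env dc/heketi-storage -n glusterfs HEKETI_IGNORE_STALE_OPERATIONS=true

# Confirm the running pod picked it up.
oc rsh -n glusterfs "$(oc get pod -n glusterfs | grep heketi | awk '{print $1}')" \
    printenv HEKETI_IGNORE_STALE_OPERATIONS
```

This is a stopgap for recovery only; as the troubleshooting doc quoted above notes, it does not resolve the stale operations themselves.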
Raghavendra and Humble, thanks. I will do the verification today.

Also watching this: https://bugzilla.redhat.com/show_bug.cgi?id=1554219
This is my own hack to openshift-ansible. I will verify it again when the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1554219 is available.

[fedora@ip-172-31-55-221 openshift-ansible]$ git diff
diff --git a/roles/openshift_storage_glusterfs/files/v3.9/deploy-heketi-template.yml b/roles/openshift_storage_glusterfs/files/v3.9/deploy-heketi-template.yml
index 34af652c2..ce2f1f468 100644
--- a/roles/openshift_storage_glusterfs/files/v3.9/deploy-heketi-template.yml
+++ b/roles/openshift_storage_glusterfs/files/v3.9/deploy-heketi-template.yml
@@ -78,6 +78,8 @@ objects:
           value: '14'
         - name: HEKETI_KUBE_GLUSTER_DAEMONSET
           value: '1'
+        - name: HEKETI_IGNORE_STALE_OPERATIONS
+          value: "true"
         ports:
         - containerPort: 8080
         volumeMounts:
@@ -131,3 +133,6 @@ parameters:
   displayName: GlusterFS cluster name
   description: A unique name to identify this heketi service, useful for running multiple heketi instances
   value: glusterfs
+- name: HEKETI_IGNORE_STALE_OPERATIONS
+  displayName: Whether to ignore stale operations at startup
+  description: This allows to control whether heketi should start up when there are stale pending operation entries present in the database. Setting this to true lets heketi ignore existing pending operations at startup. Setting it to false causes heketi to refuse to start if pending operations are found in the database.
diff --git a/roles/openshift_storage_glusterfs/files/v3.9/heketi-template.yml b/roles/openshift_storage_glusterfs/files/v3.9/heketi-template.yml
index 28cdb2982..96de35fea 100644
--- a/roles/openshift_storage_glusterfs/files/v3.9/heketi-template.yml
+++ b/roles/openshift_storage_glusterfs/files/v3.9/heketi-template.yml
@@ -78,6 +78,8 @@ objects:
           value: '14'
         - name: HEKETI_KUBE_GLUSTER_DAEMONSET
           value: '1'
+        - name: HEKETI_IGNORE_STALE_OPERATIONS
+          value: "true"
         ports:
         - containerPort: 8080
         volumeMounts:
@@ -134,3 +136,6 @@ parameters:
   displayName: GlusterFS cluster name
   description: A unique name to identify this heketi service, useful for running multiple heketi instances
   value: glusterfs
+- name: HEKETI_IGNORE_STALE_OPERATIONS
+  displayName: Whether to ignore stale operations at startup
+  description: This allows to control whether heketi should start up when there are stale pending operation entries present in the database. Setting this to true lets heketi ignore existing pending operations at startup. Setting it to false causes heketi to refuse to start if pending operations are found in the database.
root@ip-172-31-20-3: ~/svt/openshift_scalability # oc get pod -o yaml -n glusterfs | grep "image:"
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-gluster-block-prov-rhel7:3.3.1-3
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-gluster-block-prov-rhel7:3.3.1-3
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7:3.3.1-4
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7:3.3.1-4
root@ip-172-31-20-3: ~/svt/openshift_scalability # oc get pod
NAME                                          READY     STATUS    RESTARTS   AGE
glusterblock-storage-provisioner-dc-2-mr897   1/1       Running   0          8m
glusterfs-storage-cfwp5                       1/1       Running   0          42m
glusterfs-storage-qn2cc                       1/1       Running   0          42m
glusterfs-storage-rx7pj                       1/1       Running   0          42m
heketi-storage-2-x8n9p                        1/1       Running   0          7m
root@ip-172-31-20-3: ~/svt/openshift_scalability # oc get pod -o wide -n glusterfs
NAME                                          READY     STATUS    RESTARTS   AGE       IP              NODE
glusterblock-storage-provisioner-dc-2-mr897   1/1       Running   0          8m        172.23.2.5      ip-172-31-25-253.us-west-2.compute.internal
glusterfs-storage-cfwp5                       1/1       Running   0          42m       172.31.56.215   ip-172-31-56-215.us-west-2.compute.internal
glusterfs-storage-qn2cc                       1/1       Running   0          42m       172.31.13.136   ip-172-31-13-136.us-west-2.compute.internal
glusterfs-storage-rx7pj                       1/1       Running   0          42m       172.31.61.148   ip-172-31-61-148.us-west-2.compute.internal
heketi-storage-2-x8n9p                        1/1       Running   0          7m        172.23.2.7      ip-172-31-25-253.us-west-2.compute.internal
root@ip-172-31-20-3: ~/svt/openshift_scalability # oc rsh -n glusterfs heketi-storage-2-x8n9p
sh-4.2# printenv | grep -i stale
HEKETI_IGNORE_STALE_OPERATIONS=true

Second, the heketi logs. I saw the previous heketi warning msg:
==============
Heketi 6.0.0
[heketi] ERROR 2018/03/12 15:26:46 /src/github.com/heketi/heketi/apps/glusterfs/app.go:79: invalid log level:
[heketi] INFO 2018/03/12 15:26:46 Loaded kubernetes executor
[heketi] ERROR 2018/03/12 15:26:46 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
[heketi] INFO 2018/03/12 15:26:46 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/12 15:26:46 Ignoring stale pending operations.Server will be running with incomplete/inconsistent state in DB.
[heketi] INFO 2018/03/12 15:26:46 Block: Auto Create Block Hosting Volume set to true
[heketi] INFO 2018/03/12 15:26:46 Block: New Block Hosting Volume size 500 GB
[heketi] INFO 2018/03/12 15:26:46 GlusterFS Application Loaded
Authorization loaded
Listening on port 8080
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] INFO 2018/03/12 15:27:05 Loaded simple allocator
[heketi] INFO 2018/03/12 15:27:05 Loaded simple allocator
[negroni] Started POST /volumes
[heketi] INFO 2018/03/12 15:27:05 Loaded simple allocator
[heketi] INFO 2018/03/12 15:27:05 brick_num: 0
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes

I deleted the heketi pod 3 times. It could always self-heal, and all pods are running with their PVCs mounted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0638