Description of problem:
heketi pod crashes

Version-Release number of selected component (if applicable):
Image: rhgs3/rhgs-volmanager-rhel7:latest
Image ID: docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-volmanager-rhel7@sha256:d1594ec197c3620c21ff98d404f48c6decd386be194307d336b0521ce796a5b8

How reproducible:
Not clear how to reproduce this issue.

Steps to Reproduce:
1. NA

Actual results:
heketi pod crashes

Expected results:

Additional info:

--- oc logs of the heketi pod ---
# oc logs heketi-storage-1-8lss6
Heketi 6.0.0
[heketi] INFO 2018/02/23 08:18:02 Loaded kubernetes executor
[heketi] INFO 2018/02/23 08:18:02 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] ERROR 2018/02/23 08:18:02 /src/github.com/heketi/heketi/apps/glusterfs/app.go:150: Heketi terminated while performing one or more operations. Server will not start as long as pending operations are present in the db.
panic: Heketi terminated while performing one or more operations. Server will not start as long as pending operations are present in the db.

goroutine 1 [running]:
github.com/heketi/heketi/apps/glusterfs.NewApp(0x236bf40, 0xc4201402b8, 0x0)
        /builddir/build/BUILD/heketi-6.0.0/src/github.com/heketi/heketi/apps/glusterfs/app.go:154 +0xc0e
main.main()
        /builddir/build/BUILD/heketi-6.0.0/main.go:273 +0x46d
---
-------------------------------------------------------------------------------------

--- oc describe of the heketi pod ---
# oc describe pod heketi-storage-1-8lss6
Name:           heketi-storage-1-8lss6
Namespace:      glusterfs
Node:           app-node-87.scale-ci.example.com/192.168.99.135
Start Time:     Thu, 22 Feb 2018 16:05:45 -0500
Labels:         deployment=heketi-storage-1
                deploymentconfig=heketi-storage
                glusterfs=heketi-storage-pod
                heketi=storage-pod
Annotations:    openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=heketi-storage
                openshift.io/deployment.name=heketi-storage-1
                openshift.io/scc=uperf
Status:         Running
IP:             10.131.8.64
Controlled By:  ReplicationController/heketi-storage-1
Containers:
  heketi:
    Container ID:   docker://ea1b2ef16949b804a99368c262c042519511e0bb6439e173df200ad0f68fb94f
    Image:          rhgs3/rhgs-volmanager-rhel7:latest
    Image ID:       docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-volmanager-rhel7@sha256:d1594ec197c3620c21ff98d404f48c6decd386be194307d336b0521ce796a5b8
    Port:           8080/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 23 Feb 2018 03:18:02 -0500
      Finished:     Fri, 23 Feb 2018 03:18:02 -0500
    Ready:          False
    Restart Count:  130
    Liveness:       http-get http://:8080/hello delay=30s timeout=3s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8080/hello delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      HEKETI_USER_KEY:                3ul+XJr0Tc77P7biuizMBJXluUINAGW05kU6QfZwVhI=
      HEKETI_ADMIN_KEY:               7+LhqsCEe2L33J961jfezFsSDxkh4hWOi0n53mazufE=
      HEKETI_EXECUTOR:                kubernetes
      HEKETI_FSTAB:                   /var/lib/heketi/fstab
      HEKETI_SNAPSHOT_LIMIT:          14
      HEKETI_KUBE_GLUSTER_DAEMONSET:  1
    Mounts:
      /etc/heketi from config (rw)
      /var/lib/heketi from db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from heketi-storage-service-account-token-zqv7c (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  db:
    Type:           Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
    EndpointsName:  heketi-db-storage-endpoints
    Path:           heketidbstorage
    ReadOnly:       false
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  heketi-storage-config-secret
    Optional:    false
  heketi-storage-service-account-token-zqv7c:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  heketi-storage-service-account-token-zqv7c
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason   Age                  From                                       Message
  ----     ------   ----                 ----                                       -------
  Normal   Pulled   1h (x113 over 10h)   kubelet, app-node-87.scale-ci.example.com  Container image "rhgs3/rhgs-volmanager-rhel7:latest" already present on machine
  Warning  BackOff  2s (x2983 over 10h)  kubelet, app-node-87.scale-ci.example.com  Back-off restarting failed container
I have seen this too. For me it happened when I deleted a heketi pod while a volume create operation was ongoing.

Elvir, please attach the heketi.db and the heketi logs.
(In reply to Raghavendra Talur from comment #3)
> I have seen this too. For me it happened when I deleted a heketi pod while
> the volume create operation was ongoing.
>
> Elvir,
>
> Please attach the heketi.db and heketi logs.

The `oc logs <heketi pod>` output is already attached in comment #1. Do you want some other specific heketi logs? The heketi.db is attached.
This was hit again in the scale lab after the cluster was redeployed. The problem occurred when the first gluster-block PVC was created in OpenShift.

Adding the TestBlocker keyword since this currently prevents OCP logging and metrics deployment with glusterblock storage in the scalability CI cluster.
The heketi pod crash is *not* visible with these images:
--
openshift_storage_glusterfs_block_version: "3.3.0-362"
openshift_storage_glusterfs_version: "3.3.1-3"
openshift_storage_glusterfs_heketi_version: "3.3.0-364"
openshift_storage_glusterfs_s3_version: "3.3.0-362"
---
*** Bug 1549140 has been marked as a duplicate of this bug. ***
openshift version:
atomic-openshift.x86_64  3.9.2-1.git.0.eeb4d16.el7

openshift-ansible version:
[fedora@ip-172-31-33-174 openshift-ansible]$ git log --oneline -1
5225351c4 (HEAD -> release-3.9, origin/release-3.9) Merge pull request #7460 from gabemontero/jenkins-template-update

That is the latest version of CNS in brew for the moment:
openshift_storage_glusterfs_version=3.3.1-7
openshift_storage_glusterfs_heketi_version=3.3.1-4
openshift_storage_glusterfs_block_version=3.3.1-3

Steps to reproduce:
1. Create 60 pods, each of them with a glusterfs-storage PVC.
2. DURING the provisioning of the pods/PVCs, run:
   oc delete pod -n glusterfs "$(oc get pod -n glusterfs | grep heketi | awk '{print $1}')"

When this happens:

root@ip-172-31-17-124: ~ # oc logs -f -n glusterfs "$(oc get pod -n glusterfs | grep heketi | awk '{print $1}')"
Heketi 6.0.0
[heketi] ERROR 2018/03/08 21:55:45 /src/github.com/heketi/heketi/apps/glusterfs/app.go:79: invalid log level:
[heketi] INFO 2018/03/08 21:55:45 Loaded kubernetes executor
[heketi] INFO 2018/03/08 21:55:45 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/08 21:55:45 Server refusing to start.
[heketi] ERROR 2018/03/08 21:55:45 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
ERROR: Unable to start application

root@ip-172-31-17-124: ~ # oc get pod -n glusterfs -o wide
NAME                                          READY     STATUS             RESTARTS   AGE       IP              NODE
glusterblock-storage-provisioner-dc-2-mfhtw   1/1       Running            0          7m        172.23.2.24     ip-172-31-21-106.us-west-2.compute.internal
glusterfs-storage-n69j6                       1/1       Running            0          12m       172.31.54.46    ip-172-31-54-46.us-west-2.compute.internal
glusterfs-storage-xwcgj                       1/1       Running            0          12m       172.31.57.101   ip-172-31-57-101.us-west-2.compute.internal
glusterfs-storage-zlfc5                       1/1       Running            0          12m       172.31.55.128   ip-172-31-55-128.us-west-2.compute.internal
heketi-storage-2-f2jcb                        0/1       CrashLoopBackOff   3          1m        172.23.2.29     ip-172-31-21-106.us-west-2.compute.internal

More information: I tested yesterday with the older released version; there heketi could always heal itself.

openshift_storage_glusterfs_image=registry.access.redhat.com/rhgs3/rhgs-server-rhel7
openshift_storage_glusterfs_version=3.3.0-362
openshift_storage_glusterfs_heketi_image=registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7
openshift_storage_glusterfs_heketi_version=3.3.0-364
openshift_storage_glusterfs_block_image=registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7
openshift_storage_glusterfs_block_version=3.3.0-362
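Step 1 of the reproduction above can be scripted. A minimal sketch, assuming the storage class is named glusterfs-storage as in this report; the claim-name prefix "testpvc" and the helper names are made up (the actual test used a python script):

```shell
# Emit a PVC manifest for index $1; kept as a function so the YAML can be
# inspected or piped to "oc create -f -".
pvc_yaml() {
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testpvc-$1
spec:
  storageClassName: glusterfs-storage
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
}

# Create N PVCs (default 60) so that provisioning keeps heketi busy long
# enough to delete its pod mid-operation.
create_pvcs() {
    for i in $(seq 1 "${1:-60}"); do
        pvc_yaml "$i" | oc create -f -
    done
}
```

Usage would be `create_pvcs 60` against a cluster where the glusterfs-storage class exists.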
A volume create command was run. While the volume creation was running, the heketi pod was deleted. The pod entered the terminating state and a new heketi pod was created. The volume creation that was running at the same time was successful. Logs from the new heketi pod are given below along with the command output. The same test was tried with multiple PVC creations and no issues were seen there either; a new heketi pod was up and running.

[root@dhcp46-90 ~]# heketi-cli volume create --size=2
[root@dhcp46-90 ~]# oc delete pod heketi-1-x52cj
pod "heketi-1-x52cj" deleted
[root@dhcp46-90 ~]# oc get pods
NAME                                  READY     STATUS              RESTARTS   AGE
busybox10                             1/1       Running             22         22h
busybox11                             1/1       Running             22         22h
busybox12                             1/1       Running             21         21h
busybox3                              1/1       Running             44         1d
busybox5                              1/1       Running             27         1d
busybox6                              1/1       Running             26         1d
busybox9                              1/1       Running             23         23h
glusterblock-provisioner-dc-1-vz5p2   1/1       Running             1          2d
glusterfs-574s7                       1/1       Running             1          2d
glusterfs-r4dp6                       1/1       Running             1          2d
glusterfs-s2t76                       1/1       Running             0          1d
heketi-1-9sz5p                        0/1       ContainerCreating   0          8s
heketi-1-x52cj                        1/1       Terminating         0          4m
storage-project-router-3-jkwjn        1/1       Running             2          14d
[root@dhcp46-90 ~]# oc get pods
NAME                                  READY     STATUS    RESTARTS   AGE
busybox10                             1/1       Running   22         22h
busybox11                             1/1       Running   22         22h
busybox12                             1/1       Running   21         21h
busybox3                              1/1       Running   44         1d
busybox5                              1/1       Running   27         1d
busybox6                              1/1       Running   26         1d
busybox9                              1/1       Running   23         23h
glusterblock-provisioner-dc-1-vz5p2   1/1       Running   1          2d
glusterfs-574s7                       1/1       Running   1          2d
glusterfs-r4dp6                       1/1       Running   1          2d
glusterfs-s2t76                       1/1       Running   0          1d
heketi-1-9sz5p                        1/1       Running   0          39s
storage-project-router-3-jkwjn        1/1       Running   2          14d

=================================== Logs from the new heketi pod ===================================
[root@dhcp46-90 ~]# oc logs heketi-1-9sz5p
Heketi 6.0.0
[heketi] ERROR 2018/03/09 08:54:38 /src/github.com/heketi/heketi/apps/glusterfs/app.go:79: invalid log level:
[heketi] INFO 2018/03/09 08:54:38 Loaded kubernetes executor
[heketi] ERROR 2018/03/09 08:54:38 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
[heketi] INFO 2018/03/09 08:54:38 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/09 08:54:38 Ignoring stale pending operations.Server will be running with incomplete/inconsistent state in DB.
[heketi] INFO 2018/03/09 08:54:38 Block: Auto Create Block Hosting Volume set to true
[heketi] INFO 2018/03/09 08:54:38 Block: New Block Hosting Volume size 500 GB
[heketi] INFO 2018/03/09 08:54:38 GlusterFS Application Loaded
Listening on port 8080

Hence, moving the bug to verified.
@Rachael, I can still replicate the issue in this bz (3 times yesterday) with openshift_storage_glusterfs_block_version=3.3.1-3.

Which version did you use when verifying the bz? Can you replicate the bz with the version before the fix? See comment 16.

As far as I can see, the differences in the steps we followed are:

* I use a python script to create PVCs on OpenShift while you call 'heketi-cli' to create volumes. I do not think this is the reason the bz no longer shows up for you.

* But this could be important: how do you ensure that there are pending operations at the moment the heketi pod is deleted? That is why I created 60 pods with PVCs, which gave me enough time to ensure that. I keep one terminal watching the number of Bound PVCs; when I saw the number growing (ensuring the provisioning in heketi had started) but still far from the target of 60 (ensuring the provisioning had not finished), I used another terminal to delete the heketi pod.

Can you please try to replicate with my steps? Thanks.
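The timing described above can be sketched as a small script. The helper names and thresholds below are made up for illustration (the actual test watched `oc get pvc` manually in a second terminal):

```shell
# Count Bound PVCs from "oc get pvc" style output on stdin; separated
# out so the timing logic can be checked without a cluster.
count_bound() {
    grep -c ' Bound '
}

# True when provisioning has started (>= 5 Bound) but is still far from
# the 60-PVC target (< 30), i.e. pending operations are likely in flight.
in_window() {
    [ "$1" -ge 5 ] && [ "$1" -lt 30 ]
}

# Poll the Bound count and delete the heketi pod inside the window.
delete_heketi_in_window() {
    while :; do
        n=$(oc get pvc -n glusterfs --no-headers | count_bound)
        if in_window "$n"; then
            oc delete pod -n glusterfs \
                "$(oc get pod -n glusterfs | grep heketi | awk '{print $1}')"
            return
        fi
        sleep 2
    done
}
```

Run `delete_heketi_in_window` in one terminal while the PVC-creating script runs in another.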
(In reply to Hongkai Liu from comment #18)

Hi Hongkai,

As mentioned in comment 17, I had tried the same test with PVC creation and did not hit any issues. However, I tried the test again with the steps you mentioned. I also use a python script to create PVCs. The steps I followed are:

1. Created 60 PVCs.
2. When the number of bound PVCs reached 5, I deleted the heketi pod from a different terminal.
3. The old heketi pod went into the terminating state and a new heketi pod was created.
4. The new heketi pod was up and running and, after a while, the provisioning for all the pending PVCs was also successful.

I have attached logs from the new heketi pod.

Version: rhgs-volmanager-rhel7:3.3.1-4
Ok, I see. I will try to replicate this again today with another cluster and will update the result here.

Observation: from the uploaded log, it seems there are 59 volumes created by the new heketi pod:

# grep "success: please" attachment.txt | wc -l
59

That means only 1 (instead of 5) PVC was served by the killed heketi, unless my reading of the heketi log is wrong. That 1 PVC becomes very important for seeing the bz. It would be nice to see the log of the killed heketi as well.

Rachael, at the same time, can you give it a try with a version with which you think you can see the bz? If you have done this before, just let me know the version. I just want to rule out test environment differences between us, so that we know you did see the bz somewhere in your test env. Maybe something else in my env is causing the issue? Did I not install CNS correctly? I do not know. Thanks.
Replicated the bz with the same steps on another cluster.

Logs of the 2 heketi pods, version info, and pod status: http://file.rdu.redhat.com/~hongkliu/test_result/bz1548322/

Let me know if more info is needed. Thanks.
Rachael, please share your image version.

Hongkai, in the second-to-last build, heketi would abort its startup if there were pending entries. However, in the latest build we can control this behaviour with an environment variable which defaults to 'true'; because of that, the heketi server will continue its start operation. The log below shows this (c#17):

[heketi] ERROR 2018/03/09 08:54:38 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
[heketi] INFO 2018/03/09 08:54:38 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/09 08:54:38 Ignoring stale pending operations.Server will be running with incomplete/inconsistent state in DB.
Hi Humble, this is the version I use:

openshift_storage_glusterfs_heketi_image=registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7
openshift_storage_glusterfs_heketi_version=3.3.1-4

It is downloaded from the brew registry and pushed to reg-aws. I need to do this because my cluster is on AWS, which has no access to brew.

In #c20, Rachael mentioned the heketi version. It seems to be the same as mine.

Can you reveal the name of the env. var. from #c23 so that I can check it next time? I want to make sure that I am using the right image.

Another thought: could it be related to the way CNS was installed? I use openshift-ansible for this. The pod names listed in #c17 seem to follow a different naming strategy.
https://github.com/heketi/heketi/blob/263fbb72055d71b3763a77c051e7a00cf0c4e436/docs/troubleshooting.md

--snip--
Set the environment variable HEKETI_IGNORE_STALE_OPERATIONS=true. If provided, the value should be "true" or "false". When set to "true" the variable will force Heketi to start despite the existence of stale pending operation entries in the db.

Important: This option exists for expedience and forcing Heketi to start will not resolve the stale operations and any inconsistencies between Heketi and Gluster. One should schedule some planned downtime and use the methods below to really resolve the situation.

If you are running Heketi within a container this variable can be passed into the container using the appropriate tools for your system.
--/snip--

Maybe the ansible deployment missed this configuration.
Additional reference: https://github.com/gluster/gluster-kubernetes/pull/447/files
Hongkai,

Yes, the deployment type matters. I think the templates that are installed with cns-deploy already have this variable, whereas the templates that are installed with the ansible installer don't.

Please edit the heketi dc according to the PR https://github.com/gluster/gluster-kubernetes/pull/447 if you want to recover your setups. We will file a bug on OCP and update the templates in the ansible installer.
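Until the ansible-installer templates carry the variable, the same change from that PR can be applied directly to a running deployment. A sketch, assuming the dc name heketi-storage and the glusterfs namespace seen in the pod listings earlier in this report:

```shell
# Inject the variable into the existing deploymentconfig; this triggers
# a redeploy of the heketi pod with the new environment.
oc set env dc/heketi-storage -n glusterfs HEKETI_IGNORE_STALE_OPERATIONS=true

# Confirm the running pod picked it up.
oc rsh -n glusterfs "$(oc get pod -n glusterfs | grep heketi | awk '{print $1}')" \
    printenv HEKETI_IGNORE_STALE_OPERATIONS
```

This is a stopgap for recovery only; as the troubleshooting doc quoted above notes, it does not resolve the stale operations themselves.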
Raghavendra and Humble, thanks. I will do the verification today.

Also watching this: https://bugzilla.redhat.com/show_bug.cgi?id=1554219
This is my own hack to openshift-ansible. I will verify it again when the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1554219 is available.

[fedora@ip-172-31-55-221 openshift-ansible]$ git diff
diff --git a/roles/openshift_storage_glusterfs/files/v3.9/deploy-heketi-template.yml b/roles/openshift_storage_glusterfs/files/v3.9/deploy-heketi-template.yml
index 34af652c2..ce2f1f468 100644
--- a/roles/openshift_storage_glusterfs/files/v3.9/deploy-heketi-template.yml
+++ b/roles/openshift_storage_glusterfs/files/v3.9/deploy-heketi-template.yml
@@ -78,6 +78,8 @@ objects:
           value: '14'
         - name: HEKETI_KUBE_GLUSTER_DAEMONSET
           value: '1'
+        - name: HEKETI_IGNORE_STALE_OPERATIONS
+          value: "true"
         ports:
         - containerPort: 8080
         volumeMounts:
@@ -131,3 +133,6 @@ parameters:
   displayName: GlusterFS cluster name
   description: A unique name to identify this heketi service, useful for running multiple heketi instances
   value: glusterfs
+- name: HEKETI_IGNORE_STALE_OPERATIONS
+  displayName: Whether to ignore stale operations at startup
+  description: This allows to control whether heketi should start up when there are stale pending operation entries present in the database. Setting this to true lets heketi ignore existing pending operations at startup. Setting it to false causes heketi to refuse to start if pending operations are found in the database.
diff --git a/roles/openshift_storage_glusterfs/files/v3.9/heketi-template.yml b/roles/openshift_storage_glusterfs/files/v3.9/heketi-template.yml
index 28cdb2982..96de35fea 100644
--- a/roles/openshift_storage_glusterfs/files/v3.9/heketi-template.yml
+++ b/roles/openshift_storage_glusterfs/files/v3.9/heketi-template.yml
@@ -78,6 +78,8 @@ objects:
           value: '14'
         - name: HEKETI_KUBE_GLUSTER_DAEMONSET
           value: '1'
+        - name: HEKETI_IGNORE_STALE_OPERATIONS
+          value: "true"
         ports:
         - containerPort: 8080
         volumeMounts:
@@ -134,3 +136,6 @@ parameters:
   displayName: GlusterFS cluster name
   description: A unique name to identify this heketi service, useful for running multiple heketi instances
   value: glusterfs
+- name: HEKETI_IGNORE_STALE_OPERATIONS
+  displayName: Whether to ignore stale operations at startup
+  description: This allows to control whether heketi should start up when there are stale pending operation entries present in the database. Setting this to true lets heketi ignore existing pending operations at startup. Setting it to false causes heketi to refuse to start if pending operations are found in the database.
root@ip-172-31-20-3: ~/svt/openshift_scalability # oc get pod -o yaml -n glusterfs | grep "image:"
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-gluster-block-prov-rhel7:3.3.1-3
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-gluster-block-prov-rhel7:3.3.1-3
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7:3.3.1-4
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7:3.3.1-4
root@ip-172-31-20-3: ~/svt/openshift_scalability # oc get pod
NAME                                          READY     STATUS    RESTARTS   AGE
glusterblock-storage-provisioner-dc-2-mr897   1/1       Running   0          8m
glusterfs-storage-cfwp5                       1/1       Running   0          42m
glusterfs-storage-qn2cc                       1/1       Running   0          42m
glusterfs-storage-rx7pj                       1/1       Running   0          42m
heketi-storage-2-x8n9p                        1/1       Running   0          7m
root@ip-172-31-20-3: ~/svt/openshift_scalability # oc get pod -o wide -n glusterfs
NAME                                          READY     STATUS    RESTARTS   AGE       IP              NODE
glusterblock-storage-provisioner-dc-2-mr897   1/1       Running   0          8m        172.23.2.5      ip-172-31-25-253.us-west-2.compute.internal
glusterfs-storage-cfwp5                       1/1       Running   0          42m       172.31.56.215   ip-172-31-56-215.us-west-2.compute.internal
glusterfs-storage-qn2cc                       1/1       Running   0          42m       172.31.13.136   ip-172-31-13-136.us-west-2.compute.internal
glusterfs-storage-rx7pj                       1/1       Running   0          42m       172.31.61.148   ip-172-31-61-148.us-west-2.compute.internal
heketi-storage-2-x8n9p                        1/1       Running   0          7m        172.23.2.7      ip-172-31-25-253.us-west-2.compute.internal
root@ip-172-31-20-3: ~/svt/openshift_scalability # oc rsh -n glusterfs heketi-storage-2-x8n9p
sh-4.2# printenv | grep -i stale
HEKETI_IGNORE_STALE_OPERATIONS=true

Second, the heketi logs. I saw the previous heketi warning msg:
==============
Heketi 6.0.0
[heketi] ERROR 2018/03/12 15:26:46 /src/github.com/heketi/heketi/apps/glusterfs/app.go:79: invalid log level:
[heketi] INFO 2018/03/12 15:26:46 Loaded kubernetes executor
[heketi] ERROR 2018/03/12 15:26:46 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
[heketi] INFO 2018/03/12 15:26:46 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/12 15:26:46 Ignoring stale pending operations.Server will be running with incomplete/inconsistent state in DB.
[heketi] INFO 2018/03/12 15:26:46 Block: Auto Create Block Hosting Volume set to true
[heketi] INFO 2018/03/12 15:26:46 Block: New Block Hosting Volume size 500 GB
[heketi] INFO 2018/03/12 15:26:46 GlusterFS Application Loaded
Authorization loaded
Listening on port 8080
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[heketi] INFO 2018/03/12 15:27:05 Loaded simple allocator
[heketi] INFO 2018/03/12 15:27:05 Loaded simple allocator
[negroni] Started POST /volumes
[heketi] INFO 2018/03/12 15:27:05 Loaded simple allocator
[heketi] INFO 2018/03/12 15:27:05 brick_num: 0
[negroni] Started POST /volumes
[negroni] Started POST /volumes
[negroni] Started POST /volumes

I deleted the heketi pod 3 times. It could always self-heal, and all pods are running with their PVCs mounted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0638