Bug 1554219 - heketi pod goes into error state when restarted
Summary: heketi pod goes into error state when restarted
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: 3.9.0
Assignee: Jose A. Rivera
QA Contact: Wenkai Shi
URL:
Whiteboard:
Depends On:
Blocks: 1415750 1548322
Reported: 2018-03-12 06:01 UTC by krishnaram Karthick
Modified: 2018-06-18 18:29 UTC (History)

Fixed In Version: openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-18 17:50:08 UTC
Target Upstream Version:
Embargoed:



Description krishnaram Karthick 2018-03-12 06:01:10 UTC
Description of problem:

On an OCP deployment with CNS configured via Ansible, the heketi pod goes into an error state when it is restarted.

The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1548322 isn't available in the ansible heketi templates.

# oc logs heketi-storage-1-rhdf7
Heketi 6.0.0
[heketi] ERROR 2018/03/12 05:58:29 /src/github.com/heketi/heketi/apps/glusterfs/app.go:79: invalid log level: 
[heketi] INFO 2018/03/12 05:58:29 Loaded kubernetes executor
[heketi] INFO 2018/03/12 05:58:29 Please refer to the Heketi troubleshooting documentation for more information on how to resolve this issue.
[heketi] WARNING 2018/03/12 05:58:29 Server refusing to start.
[heketi] ERROR 2018/03/12 05:58:29 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.
ERROR: Unable to start application
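
The refusal in the log above can also be detected from a script. This is a minimal sketch, assuming the warning text stays exactly as shown; the function name `heketi_refused_start` is hypothetical and is not part of heketi or openshift-ansible:

```shell
#!/bin/sh
# Hypothetical helper: reads heketi log text on stdin and succeeds only if
# it contains the "Server refusing to start" warning seen in the log above.
heketi_refused_start() {
    grep -q 'Server refusing to start'
}

# Usage sketch against a live pod (pod name taken from the log above):
#   oc logs heketi-storage-1-rhdf7 | heketi_refused_start \
#       && echo "heketi refused to start"
```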


Version-Release number of the following components:
rpm -q openshift-ansible - openshift-ansible-3.9.3-1.git.0.e166207.el7.noarch

rpm -q ansible - ansible-2.4.2.0-2.el7.noarch

ansible --version
ansible 2.4.2.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]


How reproducible:
Always

Steps to Reproduce:
1. Deploy OCP + CNS via Ansible.
2. Create a PVC and, at the same time, restart the heketi pod.


Actual results:
heketi pod goes into error state

Expected results:
heketi pod should come up

Additional info:

Comment 1 Jose A. Rivera 2018-03-12 15:17:27 UTC
Initial PR on master created: https://github.com/openshift/openshift-ansible/pull/7494

Comment 2 Hongkai Liu 2018-03-12 17:32:40 UTC
Thanks for the fix.

Verified with

[fedora@ip-172-31-55-221 openshift-ansible]$ git log --oneline -1
4d0941a02 (HEAD -> bz1554219) GlusterFS: Add HEKETI_IGNORE_STALE_OPERATIONS to templates

# yum list installed | grep openshift
atomic-openshift.x86_64         3.9.4-1.git.0.35fdfc4.el7


# oc get pod -n glusterfs -o yaml | grep "image:" | sort -u
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-gluster-block-prov-rhel7:3.3.1-3
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-server-rhel7:3.3.1-7
      image: registry.reg-aws.openshift.com:443/rhgs3/rhgs-volmanager-rhel7:3.3.1-4


# oc rsh -n glusterfs heketi-storage-1-wf68c 
sh-4.2# printenv | grep -i stale
HEKETI_IGNORE_STALE_OPERATIONS=true

Comment 3 Jose A. Rivera 2018-03-13 18:59:45 UTC
Backport for 3.9 created: https://github.com/openshift/openshift-ansible/pull/7511

Comment 6 Wenkai Shi 2018-03-15 16:53:58 UTC
Verified with version openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7; the code has merged and takes effect.

# oc get template heketi -n glusterfs -o yaml | grep -A1 HEKETI_IGNORE_STALE_OPERATIONS
          - name: HEKETI_IGNORE_STALE_OPERATIONS
            value: "true"

# oc rsh heketi-storage-1-b5vhp
sh-4.2# env | grep HEKETI_IGNORE_STALE_OPERATIONS
HEKETI_IGNORE_STALE_OPERATIONS=true
sh-4.2# exit
exit
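
The template check above can be wrapped for scripted verification. This is a rough sketch, assuming the template YAML keeps `value` on the line directly after `name` as in the output above; the function name `template_ignores_stale_ops` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads template YAML on stdin and succeeds only if
# HEKETI_IGNORE_STALE_OPERATIONS is immediately followed by value: "true".
template_ignores_stale_ops() {
    grep -A1 'name: HEKETI_IGNORE_STALE_OPERATIONS' | grep -q 'value: "true"'
}

# Usage sketch, reusing the command from this comment:
#   oc get template heketi -n glusterfs -o yaml | template_ignores_stale_ops \
#       && echo "template carries the fix"
```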

