Bug 1570045 - heketi-registry Pod crashes: ERROR: Unable to parse /etc/heketi/heketi.json: invalid character '#' after object key:value pair
Summary: heketi-registry Pod crashes: ERROR: Unable to parse /etc/heketi/heketi.json: ...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.9.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.9.z
Assignee: Jose A. Rivera
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-20 13:52 UTC by Luis Aparicio
Modified: 2018-04-23 15:31 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-23 15:31:54 UTC
Target Upstream Version:
Embargoed:


Attachments
Ansible hosts file inventory (8.39 KB, text/plain)
2018-04-20 13:52 UTC, Luis Aparicio

Description Luis Aparicio 2018-04-20 13:52:03 UTC
Created attachment 1424506 [details]
Ansible hosts file inventory

Description of problem:

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.9.14-1.git.3.c62bc34.el7.noarch

rpm -q ansible
ansible-2.4.3.0-1.el7ae.noarch

ansible --version
ansible 2.4.3.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:
Install OCP 3.9 with two CNS clusters (app + infra)

Steps to Reproduce:
1. Run deploy from bastion:
ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml

2. The deploy fails with an error in the "GlusterFS Install" phase

3. Go to the node where heketi-registry is trying to run

# oc get pod -n infra-storage -o wide
glusterfs-registry-htqwt         1/1       Running   0          16h       10.100.102.91   ocp1-sto-infra02.noc.empresas.da.tdcorp
glusterfs-registry-ng2mx         1/1       Running   0          16h       10.100.102.92   ocp1-sto-infra03.noc.empresas.da.tdcorp
glusterfs-registry-thn5n         1/1       Running   0          16h       10.100.102.90   ocp1-sto-infra01.noc.empresas.da.tdcorp
heketi-registry-1-deploy   1/1       Running   0         1m        10.16.13.4   ocp1-sto-infra02.noc.empresas.da.tdcorp
heketi-registry-1-g4bnq   0/1       CrashLoopBackOff   6         6m        10.16.11.5   ocp1-sto-infra01.noc.empresas.da.tdcorp

4. Look at the pod's log
# oc logs heketi-registry-1-g4bnq
Heketi 6.0.0
ERROR: Unable to parse /etc/heketi/heketi.json: invalid character '#' after object key:value pair

5. Look at the pod's config volume in /var/lib/origin
[root@ocp1-sto-app02 devicemapper]# cat /var/lib/origin/openshift.local.volumes/pods/a57641a3-449c-11e8-8315-00505693112a/volumes/kubernetes.io~secret/config/heketi.json 
{
	"_port_comment": "Heketi Server Port Number",
	"port" : "8080",

	"_use_auth": "Enable JWT authorization. Please enable for deployment",
	"use_auth" : false,

	"_jwt" : "Private keys for access",
	"jwt" : {
		"_admin" : "Admin has access to all APIs",
		"admin" : {
			"key" : "My Secret"
		},
		"_user" : "User only has access to /volumes endpoint",
		"user" : {
			"key" : "My Secret"
		}
	},

	"_glusterfs_comment": "GlusterFS Configuration",
	"glusterfs" : {

		"_executor_comment": "Execute plugin. Possible choices: mock, kubernetes, ssh",
		"executor" : "kubernetes",

		"_db_comment": "Database file name",
		"db" : "/var/lib/heketi/heketi.db",

		"sshexec" : {
			"keyfile" : "/etc/heketi/private_key",
			"port" : "22",
			"user" : "root",
			"sudo" : false
		},

		"_auto_create_block_hosting_volume": "Creates Block Hosting volumes automatically if not found or exsisting volume exhausted",
		"auto_create_block_hosting_volume": true    # enable automatic creation of block-hosting volumes in heketi upon receiving the first request for a block volume,

		"_block_hosting_volume_size": "New block hosting volume will be created in size mentioned, This is considered only if auto-create is enabled.",
		"block_hosting_volume_size": 210
	}
}
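
Note: JSON has no comment syntax, so the stray '#' text after the "auto_create_block_hosting_volume" value makes the whole file unparseable. As a quick check outside the pod (a minimal sketch, assuming python is available on the node), the same parse failure can be reproduced against the rendered file:

# python -m json.tool < /var/lib/origin/openshift.local.volumes/pods/a57641a3-449c-11e8-8315-00505693112a/volumes/kubernetes.io~secret/config/heketi.json

python refuses to parse the file and reports the position of the offending character.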

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

TASK [openshift_storage_glusterfs : Wait for heketi pod] *********************************************************************************************
FAILED - RETRYING: Wait for heketi pod (30 retries left).
FAILED - RETRYING: Wait for heketi pod (29 retries left).
FAILED - RETRYING: Wait for heketi pod (28 retries left).
FAILED - RETRYING: Wait for heketi pod (27 retries left).
FAILED - RETRYING: Wait for heketi pod (26 retries left).
FAILED - RETRYING: Wait for heketi pod (25 retries left).
FAILED - RETRYING: Wait for heketi pod (24 retries left).
FAILED - RETRYING: Wait for heketi pod (23 retries left).
FAILED - RETRYING: Wait for heketi pod (22 retries left).
FAILED - RETRYING: Wait for heketi pod (21 retries left).
FAILED - RETRYING: Wait for heketi pod (20 retries left).
FAILED - RETRYING: Wait for heketi pod (19 retries left).
FAILED - RETRYING: Wait for heketi pod (18 retries left).
FAILED - RETRYING: Wait for heketi pod (17 retries left).
FAILED - RETRYING: Wait for heketi pod (16 retries left).
FAILED - RETRYING: Wait for heketi pod (15 retries left).
FAILED - RETRYING: Wait for heketi pod (14 retries left).
FAILED - RETRYING: Wait for heketi pod (13 retries left).
FAILED - RETRYING: Wait for heketi pod (12 retries left).
FAILED - RETRYING: Wait for heketi pod (11 retries left).
FAILED - RETRYING: Wait for heketi pod (10 retries left).
FAILED - RETRYING: Wait for heketi pod (9 retries left).
FAILED - RETRYING: Wait for heketi pod (8 retries left).
FAILED - RETRYING: Wait for heketi pod (7 retries left).
FAILED - RETRYING: Wait for heketi pod (6 retries left).
FAILED - RETRYING: Wait for heketi pod (5 retries left).
FAILED - RETRYING: Wait for heketi pod (4 retries left).
FAILED - RETRYING: Wait for heketi pod (3 retries left).
FAILED - RETRYING: Wait for heketi pod (2 retries left).
FAILED - RETRYING: Wait for heketi pod (1 retries left).
fatal: [ocp1-master01.noc.empresas.da.tdcorp]: FAILED! => {"attempts": 30, "changed": false, "results": {"cmd": "/usr/bin/oc get pod --selector=glusterfs=heketi-registry-pod -o json -n infra-storage", "results": [{"apiVersion": "v1", "items": [{"apiVersion": "v1", "kind": "Pod", "metadata": {"annotations": {"openshift.io/deployment-config.latest-version": "1", "openshift.io/deployment-config.name": "heketi-registry", "openshift.io/deployment.name": "heketi-registry-1", "openshift.io/scc": "privileged"}, "creationTimestamp": "2018-04-20T12:46:17Z", "generateName": "heketi-registry-1-", "labels": {"deployment": "heketi-registry-1", "deploymentconfig": "heketi-registry", "glusterfs": "heketi-registry-pod", "heketi": "registry-pod"}, "name": "heketi-registry-1-g4bnq", "namespace": "infra-storage", "ownerReferences": [{"apiVersion": "v1", "blockOwnerDeletion": true, "controller": true, "kind": "ReplicationController", "name": "heketi-registry-1", "uid": "a238406f-4498-11e8-8315-00505693112a"}], "resourceVersion": "187425", "selfLink": "/api/v1/namespaces/infra-storage/pods/heketi-registry-1-g4bnq", "uid": "d1b7095c-4498-11e8-8315-00505693112a"}, "spec": {"containers": [{"env": [{"name": "HEKETI_USER_KEY", "value": "+H0WztUwkzVcIm2qeRQRKDSWMsOZGCEJktAIs8+fmSs="}, {"name": "HEKETI_ADMIN_KEY", "value": "oqspib4KT7DN6ADasWVVRExNwTV2XqP7+LJwLnN8+XM="}, {"name": "HEKETI_EXECUTOR", "value": "kubernetes"}, {"name": "HEKETI_FSTAB", "value": "/var/lib/heketi/fstab"}, {"name": "HEKETI_SNAPSHOT_LIMIT", "value": "14"}, {"name": "HEKETI_KUBE_GLUSTER_DAEMONSET", "value": "1"}, {"name": "HEKETI_IGNORE_STALE_OPERATIONS", "value": "true"}], "image": "rhgs3/rhgs-volmanager-rhel7:latest", "imagePullPolicy": "IfNotPresent", "livenessProbe": {"failureThreshold": 3, "httpGet": {"path": "/hello", "port": 8080, "scheme": "HTTP"}, "initialDelaySeconds": 30, "periodSeconds": 10, "successThreshold": 1, "timeoutSeconds": 3}, "name": "heketi", "ports": [{"containerPort": 8080, "protocol": "TCP"}], "readinessProbe": {"failureThreshold": 3, "httpGet": {"path": "/hello", "port": 8080, "scheme": "HTTP"}, "initialDelaySeconds": 3, "periodSeconds": 10, "successThreshold": 1, "timeoutSeconds": 3}, "resources": {}, "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "volumeMounts": [{"mountPath": "/var/lib/heketi", "name": "db"}, {"mountPath": "/etc/heketi", "name": "config"}, {"mountPath": "/var/run/secrets/kubernetes.io/serviceaccount", "name": "heketi-registry-service-account-token-2g88w", "readOnly": true}]}], "dnsPolicy": "ClusterFirst", "imagePullSecrets": [{"name": "heketi-registry-service-account-dockercfg-phdtk"}], "nodeName": "ocp1-sto-infra01.noc.empresas.da.tdcorp", "restartPolicy": "Always", "schedulerName": "default-scheduler", "securityContext": {}, "serviceAccount": "heketi-registry-service-account", "serviceAccountName": "heketi-registry-service-account", "terminationGracePeriodSeconds": 30, "volumes": [{"glusterfs": {"endpoints": "heketi-db-registry-endpoints", "path": "heketidbstorage"}, "name": "db"}, {"name": "config", "secret": {"defaultMode": 420, "secretName": "heketi-registry-config-secret"}}, {"name": "heketi-registry-service-account-token-2g88w", "secret": {"defaultMode": 420, "secretName": "heketi-registry-service-account-token-2g88w"}}]}, "status": {"conditions": [{"lastProbeTime": null, "lastTransitionTime": "2018-04-20T12:46:17Z", "status": "True", "type": "Initialized"}, {"lastProbeTime": null, "lastTransitionTime": "2018-04-20T12:46:17Z", "message": "containers with 
unready status: [heketi]", "reason": "ContainersNotReady", "status": "False", "type": "Ready"}, {"lastProbeTime": null, "lastTransitionTime": "2018-04-20T12:46:17Z", "status": "True", "type": "PodScheduled"}], "containerStatuses": [{"containerID": "docker://783dd779b6eda05671d42d77a86235723a210e16ec926648544b3f1242fb5c54", "image": "registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:latest", "imageID": "docker-pullable://registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7@sha256:02408e79aa76cc2031438a218b5ce8368c5e498b6dc1d5455e94e03da0a12456", "lastState": {"terminated": {"containerID": "docker://783dd779b6eda05671d42d77a86235723a210e16ec926648544b3f1242fb5c54", "exitCode": 1, "finishedAt": "2018-04-20T12:49:35Z", "reason": "Error", "startedAt": "2018-04-20T12:49:34Z"}}, "name": "heketi", "ready": false, "restartCount": 5, "state": {"waiting": {"message": "Back-off 2m40s restarting failed container=heketi pod=heketi-registry-1-g4bnq_infra-storage(d1b7095c-4498-11e8-8315-00505693112a)", "reason": "CrashLoopBackOff"}}}], "hostIP": "10.100.102.90", "phase": "Running", "podIP": "10.16.11.5", "qosClass": "BestEffort", "startTime": "2018-04-20T12:46:17Z"}}], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.retry

PLAY RECAP *******************************************************************************************************************************************
localhost                  : ok=14   changed=0    unreachable=0    failed=0   
ocp1-app01.noc.empresas.da.tdcorp : ok=122  changed=14   unreachable=0    failed=0   
ocp1-app02.noc.empresas.da.tdcorp : ok=122  changed=14   unreachable=0    failed=0   
ocp1-app03.noc.empresas.da.tdcorp : ok=122  changed=14   unreachable=0    failed=0   
ocp1-infra01.noc.empresas.da.tdcorp : ok=122  changed=14   unreachable=0    failed=0   
ocp1-infra02.noc.empresas.da.tdcorp : ok=122  changed=14   unreachable=0    failed=0   
ocp1-infra03.noc.empresas.da.tdcorp : ok=122  changed=14   unreachable=0    failed=0   
ocp1-lb01.noc.empresas.da.tdcorp : ok=49   changed=2    unreachable=0    failed=0   
ocp1-master01.noc.empresas.da.tdcorp : ok=481  changed=84   unreachable=0    failed=1   
ocp1-master02.noc.empresas.da.tdcorp : ok=292  changed=40   unreachable=0    failed=0   
ocp1-master03.noc.empresas.da.tdcorp : ok=292  changed=40   unreachable=0    failed=0   
ocp1-sto-app01.noc.empresas.da.tdcorp : ok=125  changed=15   unreachable=0    failed=0   
ocp1-sto-app02.noc.empresas.da.tdcorp : ok=125  changed=15   unreachable=0    failed=0   
ocp1-sto-app03.noc.empresas.da.tdcorp : ok=125  changed=15   unreachable=0    failed=0   
ocp1-sto-infra01.noc.empresas.da.tdcorp : ok=125  changed=15   unreachable=0    failed=0   
ocp1-sto-infra02.noc.empresas.da.tdcorp : ok=125  changed=15   unreachable=0    failed=0   
ocp1-sto-infra03.noc.empresas.da.tdcorp : ok=125  changed=15   unreachable=0    failed=0   


INSTALLER STATUS *************************************************************************************************************************************
Initialization             : Complete (0:00:51)
Health Check               : Complete (0:00:47)
etcd Install               : Complete (0:00:47)
Load balancer Install      : Complete (0:00:12)
Master Install             : Complete (0:02:35)
Master Additional Install  : Complete (0:00:32)
Node Install               : Complete (0:03:35)
GlusterFS Install          : In Progress (0:07:45)
	This phase can be restarted by running: playbooks/openshift-glusterfs/config.yml



Failure summary:


  1. Hosts:    ocp1-master01.noc.empresas.da.tdcorp
     Play:     Configure GlusterFS
     Task:     Wait for heketi pod
     Message:  Failed without returning a message.

Expected results:
heketi-registry pod should be in running state

Additional info:

heketi.json looks wrong; it contains this line with a trailing comment starting with '#':

 ...
 "auto_create_block_hosting_volume": true    # enable automatic creation of 
 block-hosting volumes in heketi upon receiving the first request for a block 
 volume
...


Comment 1 Luis Aparicio 2018-04-20 22:45:29 UTC
Found the issue.

The '#' comment is in the inventory file, and Ansible is carrying it over into heketi.json.

The line in the hosts/inventory file:

openshift_storage_glusterfs_registry_block_host_vol_create=true # enable automatic creation of block-hosting volumes in heketi upon receiving the first request for a block volume
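
You can confirm that the comment is being absorbed into the variable's value by dumping the parsed inventory (a minimal check, assuming the ansible-inventory command shipped with Ansible 2.4 is available on the bastion):

# ansible-inventory -i inventory --host ocp1-master01.noc.empresas.da.tdcorp | grep block_host_vol_create

The value should come back as the single string "true # enable automatic creation of block-hosting volumes ...", comment included, which is what the installer then templates into heketi.json.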

Comment 2 Jose A. Rivera 2018-04-20 23:01:32 UTC
What happens if you remove the comment from the end of the variable definition?

Comment 3 Luis Aparicio 2018-04-21 01:21:13 UTC
I removed the comment from the end of the variable, and the heketi-registry pod is now Running:

[root@ocp1-master01 ~]# oc get pod -n infra-storage -o wide -w
NAME                                           READY     STATUS    RESTARTS   AGE       IP              NODE
glusterblock-registry-provisioner-dc-1-rzbqx   1/1       Running   0          1h        10.16.2.2       ocp1-infra01.noc.empresas.da.tdcorp
glusterfs-registry-2stm9                       1/1       Running   0          1h        10.100.102.92   ocp1-sto-infra03.noc.empresas.da.tdcorp
glusterfs-registry-wkz7b                       1/1       Running   0          1h        10.100.102.91   ocp1-sto-infra02.noc.empresas.da.tdcorp
glusterfs-registry-xk4jc                       1/1       Running   0          1h        10.100.102.90   ocp1-sto-infra01.noc.empresas.da.tdcorp
heketi-registry-1-rjtdv                        1/1       Running   0          1h        10.16.7.2       ocp1-infra03.noc.empresas.da.tdcorp

Comment 4 Jose A. Rivera 2018-04-21 20:30:09 UTC
Sounds like a misconfigured inventory file, then. :) Problem solved?

Comment 5 Luis Aparicio 2018-04-23 15:18:41 UTC
(In reply to Jose A. Rivera from comment #4)
> Sounds like a misconfigured inventory file, then. :) Problem solved?

The problem is solved.

I don't know whether this is expected behavior when you put a comment after a value like this:

variable=value #comment
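
In an INI inventory, at least in this Ansible version, '#' starts a comment only at the beginning of a line, so the safe form is to keep the comment on its own line above the variable (a sketch of the corrected inventory entry):

# Enable automatic creation of block-hosting volumes in heketi upon
# receiving the first request for a block volume
openshift_storage_glusterfs_registry_block_host_vol_create=true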

Comment 6 Jose A. Rivera 2018-04-23 15:31:54 UTC
I believe Ansible does not support inline comments like that.

Closing the BZ.

