Bug 1517652

Summary: [CRI-O] Cassandra doesn't start on crio
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Containers
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED CURRENTRELEASE
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.7.0
CC: amurdaca, anli, aos-bugs, jhonce, jokerman, miburman, mmccomas, mwringe, trankin, vlaad
Target Milestone: ---
Keywords: Regression
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1607984
Environment:
Last Closed: 2018-09-11 17:36:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: openshift infra logs
  Flags: none

Description Anping Li 2017-11-27 07:59:54 UTC
Description of problem:
When metrics are deployed on a CRI-O system, hawkular-metrics cannot connect to hawkular-cassandra. The readiness probe appears to fail, so there is no endpoint for hawkular-cassandra.

Version-Release number of selected component (if applicable):
openshift-ansible-3.7.7-1.git.0.3e1b62b.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP 3.7 with CRI-O, setting these inventory variables:
openshift_use_crio=true
openshift_crio_systemcontainer_image_override=registry.access.stage.redhat.com/openshift3/cri-o:v3.7

2. Deploy metrics

3. Check the metrics pod status
# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-jpz2v   0/1       Running   0          2m
hawkular-metrics-w7sr9       0/1       Running   7          1h
heapster-94vnp               0/1       Running   7          1h

# oc describe pod hawkular-cassandra-1-jpz2v

Actual results:
The container is not ready. The output of oc describe pod for hawkular-cassandra shows: Readiness probe failed: Could not get the Cassandra status

[root@host-172-16-120-6 ~]# oc describe pod hawkular-cassandra-1-jpz2v
Name:		hawkular-cassandra-1-jpz2v
Namespace:	openshift-infra
Node:		172.16.120.40/172.16.120.40
Start Time:	Mon, 27 Nov 2017 00:46:09 -0500
Labels:		metrics-infra=hawkular-cassandra
		name=hawkular-cassandra-1
		type=hawkular-cassandra
Annotations:	kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"hawkular-cassandra-1","uid":"43502c23-d316-11...
		openshift.io/scc=restricted
Status:		Running
IP:		10.129.0.13
Created By:	ReplicationController/hawkular-cassandra-1
Controlled By:	ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:	cri-o://cff2143c32fe916766c24c885471f0096d23c639358be22da0cd74bd8b242f6f
    Image:		registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
    Image ID:		c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c
    Ports:		9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
    State:		Running
      Started:		Mon, 27 Nov 2017 00:46:09 -0500
    Ready:		False
    Restart Count:	0
    Limits:
      memory:	2G
    Requests:
      memory:	1G
    Readiness:	exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CASSANDRA_MASTER:			true
      CASSANDRA_DATA_VOLUME:		/cassandra_data
      JVM_OPTS:				-Dcassandra.commitlog.ignorereplayerrors=true
      ENABLE_PROMETHEUS_ENDPOINT:	True
      TRUSTSTORE_NODES_AUTHORITIES:	/hawkular-cassandra-certs/tls.peer.truststore.crt
      TRUSTSTORE_CLIENT_AUTHORITIES:	/hawkular-cassandra-certs/tls.client.truststore.crt
      POD_NAMESPACE:			openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:			2000000000 (limits.memory)
      CPU_LIMIT:			node allocatable (limits.cpu)
    Mounts:
      /cassandra_data from cassandra-data (rw)
      /hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-zvpk7 (ro)
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  cassandra-data:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	
  hawkular-cassandra-certs:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-cassandra-certs
    Optional:	false
  cassandra-token-zvpk7:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	cassandra-token-zvpk7
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath				Type		Reason			Message
  ---------	--------	-----	----			-------------				--------	------			-------
  10s		10s		1	default-scheduler						Normal		Scheduled		Successfully assigned hawkular-cassandra-1-jpz2v to 172.16.120.40
  9s		9s		1	kubelet, 172.16.120.40						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "cassandra-data" 
  9s		9s		1	kubelet, 172.16.120.40						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "cassandra-token-zvpk7" 
  9s		9s		1	kubelet, 172.16.120.40						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "hawkular-cassandra-certs" 
  9s		9s		1	kubelet, 172.16.120.40	spec.containers{hawkular-cassandra-1}	Normal		Pulled			Container image "registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7" already present on machine
  9s		9s		1	kubelet, 172.16.120.40	spec.containers{hawkular-cassandra-1}	Normal		Created			Created container
  9s		9s		1	kubelet, 172.16.120.40	spec.containers{hawkular-cassandra-1}	Normal		Started			Started container
  2s		2s		1	kubelet, 172.16.120.40	spec.containers{hawkular-cassandra-1}	Warning		Unhealthy		Readiness probe failed: Could not get the Cassandra status. This may mean that the Cassandra instance is not up yet. Will try again
/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found

Expected results:
Metrics work correctly on CRI-O.

Additional info:

Comment 1 Matt Wringe 2017-11-27 14:22:54 UTC
Hawkular Metrics cannot enter the ready state until Cassandra has entered the ready state. If Hawkular Metrics can't successfully connect to Cassandra after a certain time period, it will automatically restart the pod.

For any metrics issues you will need to attach:

- the logs for the metric components (Hawkular Metrics, Cassandra, Heapster). [But in this case we only need the Cassandra logs because the other pods can't start yet]

- the output of 'oc get pods -n openshift-infra -o yaml'

- the output of 'oc describe pod ${HAWKULAR_CASSANDRA_POD_NAME}'
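
The three items above can be gathered in one pass. The following is only a minimal sketch, not an official collection tool: the function name collect_metrics_diagnostics, the output directory, and the file names are hypothetical, and it assumes every pod in the namespace has a single container so that oc logs needs no container name.

```shell
#!/bin/sh
# Sketch: gather the diagnostics requested above into one directory.
# Function name, directory layout, and file names are illustrative only.
collect_metrics_diagnostics() {
    ns="${1:-openshift-infra}"          # namespace holding the metrics pods
    out="${2:-./metrics-diagnostics}"   # hypothetical output directory
    mkdir -p "$out" || return 1

    # Full pod definitions and status for the namespace.
    oc get pods -n "$ns" -o yaml > "$out/pods.yaml"

    # Per-pod logs and descriptions (Hawkular Metrics, Cassandra, Heapster).
    for pod in $(oc get pods -n "$ns" -o name); do
        name="${pod##*/}"               # strip the "pod/" or "pods/" prefix
        oc logs -n "$ns" "$name" > "$out/$name.log" 2>&1
        oc describe pod -n "$ns" "$name" > "$out/$name.describe.txt"
    done
}
```

Calling collect_metrics_diagnostics with no arguments would write everything under ./metrics-diagnostics, which can then be archived and attached to the bug.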

Comment 2 Anping Li 2017-11-29 10:25:13 UTC
Created attachment 1360255 [details]
openshift infra logs

Comment 3 Matt Wringe 2017-11-29 13:55:06 UTC
It looks like it is failing due to:

/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found


@stefan: can you please take a look at this and make sure it's not something wrong with the docker image they are using?

Comment 5 Michael Burman 2017-11-29 20:57:37 UTC
Anping, can you verify your version of CRI-O?

Comment 8 Antonio Murdaca 2017-11-29 21:09:02 UTC
Is the very same image working with a docker cluster? I suspect something is wrong in the image rather than a difference between cri-o and docker clusters.

Comment 9 Antonio Murdaca 2017-11-29 21:15:04 UTC
/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found


If the readiness probe is run *inside* the container and fails with the error above, then it's an image issue.

Comment 10 Michael Burman 2017-11-29 21:41:17 UTC
At least the latest image seems fine:

[root@miranda rhq]# docker run -i -t c3d909b40322 /bin/bash
bash-4.2$ ls -l /opt/apache-cassandra/bin/nodetool
-rwxrwxrwx. 1 root root 3359 Jun 29 13:51 /opt/apache-cassandra/bin/nodetool
bash-4.2$ nodetool
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
..

[miburman@miranda Downloads]$ docker pull registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
Trying to pull repository registry.access.stage.redhat.com/openshift3/metrics-cassandra ... 
sha256:02580cf6f69a49f0e9e1018aa2694b275d77779862dfd50794e2e3bddc85e216: Pulling from registry.access.stage.redhat.com/openshift3/metrics-cassandra
...

docker pull registry.access.stage.redhat.com/openshift3/metrics-cassandra@sha256:c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c

Seems to reply with 404, so I can't verify that one. Anping, can you verify with the current image?

Comment 11 Antonio Murdaca 2017-11-29 21:48:16 UTC
(In reply to Michael Burman from comment #10)
> At least the latest image seems fine:
> 
> [root@miranda rhq]# docker run -i -t c3d909b40322 /bin/bash
> bash-4.2$ ls -l /opt/apache-cassandra/bin/nodetool
> -rwxrwxrwx. 1 root root 3359 Jun 29 13:51 /opt/apache-cassandra/bin/nodetool
> bash-4.2$ nodetool
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> ..
> 
> [miburman@miranda Downloads]$ docker pull
> registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
> Trying to pull repository
> registry.access.stage.redhat.com/openshift3/metrics-cassandra ... 
> sha256:02580cf6f69a49f0e9e1018aa2694b275d77779862dfd50794e2e3bddc85e216:
> Pulling from registry.access.stage.redhat.com/openshift3/metrics-cassandra
> ...
> 
> docker pull
> registry.access.stage.redhat.com/openshift3/metrics-cassandra@sha256:
> c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c
> 
> Seems to reply with 404, so I can't verify that one. Anping, can you verify
> with the current image?

Can you check "/opt/apache-cassandra/bin/cassandra-docker-ready.sh"?

Comment 12 Anping Li 2017-11-30 03:35:50 UTC
@Antonio, I am not sure that this is an image issue.

The cri-o images/version
registry.access.stage.redhat.com/openshift3/cri-o:v3.7
crio -v 
crio version 1.0.4

Both rsh and exec work:

# oc rsh hawkular-cassandra-1-f6bcf /opt/apache-cassandra/bin/cassandra-docker-ready.sh
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
Cassandra is in the up and normal state. It is now ready.
# oc exec hawkular-cassandra-1-f6bcf /opt/apache-cassandra/bin/cassandra-docker-ready.sh
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
Cassandra is in the up and normal state. It is now ready.
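
The probe error is "nodetool: command not found", yet comment 10 shows the binary at /opt/apache-cassandra/bin/nodetool and the script succeeds through oc rsh and oc exec. One explanation that fits these observations, though nothing in this bug confirms it, is that the probe runs with a different PATH than an exec session. The sketch below is a hypothetical check for that: path_has_dir is an illustrative helper name, and the commented usage assumes the image provides sh.

```shell
#!/bin/sh
# Sketch: report whether a directory appears in a colon-separated PATH value.
# path_has_dir is a hypothetical helper, not part of any OpenShift tooling.
path_has_dir() {
    path_value="$1"
    dir="$2"
    case ":$path_value:" in
        *":$dir:"*) return 0 ;;   # directory is on the search path
        *)          return 1 ;;   # directory is missing
    esac
}

# Example (pod name from this comment):
#   pod_path=$(oc exec hawkular-cassandra-1-f6bcf -- sh -c 'echo "$PATH"')
#   path_has_dir "$pod_path" /opt/apache-cassandra/bin \
#       || echo "nodetool directory is not on PATH"
```

Because oc exec already works here, this only records what a working environment looks like. The environment seen by the probe itself would have to be inspected on the CRI-O side.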

Comment 13 Antonio Murdaca 2017-11-30 08:34:26 UTC
(In reply to Anping Li from comment #12)
> @Antonio, I am not sure that this is an image issue.
> 
> The cri-o images/version
> registry.access.stage.redhat.com/openshift3/cri-o:v3.7
> crio -v 
> crio version 1.0.4
> 
> Both rsh and exec work:
> 
> # oc rsh hawkular-cassandra-1-f6bcf
> /opt/apache-cassandra/bin/cassandra-docker-ready.sh
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> Cassandra is in the up and normal state. It is now ready.
> # oc exec hawkular-cassandra-1-f6bcf
> /opt/apache-cassandra/bin/cassandra-docker-ready.sh
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> Cassandra is in the up and normal state. It is now ready.

Could you provide steps to deploy metrics so I can reproduce?

Comment 14 Antonio Murdaca 2017-11-30 09:06:19 UTC
OK, I can reproduce this with Kubernetes as well.

Comment 15 Antonio Murdaca 2017-11-30 09:50:26 UTC
The fix is here: https://github.com/kubernetes-incubator/cri-o/pull/1187

We'll ship a new patch release and a new system container once that's merged.

Comment 17 Matt Wringe 2018-01-05 14:26:22 UTC
*** Bug 1531495 has been marked as a duplicate of this bug. ***

Comment 18 Junqi Zhao 2018-01-22 06:30:34 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1529233

Comment 19 Junqi Zhao 2018-01-22 07:39:38 UTC
(In reply to Junqi Zhao from comment #18)
> Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1529233

Ignore comment 18; it is fixed in 3.9 now.

Comment 20 Junqi Zhao 2018-01-22 11:11:00 UTC
Tested with crio version 1.9.0; the Cassandra pod starts up on CRI-O.

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-nktsc   1/1       Running   0          3m
hawkular-metrics-tl844       1/1       Running   0          3m
heapster-wtg7q               1/1       Running   0          3m
[root@host-172-16-120-155 ~]# runc exec -t cri-o bash
bash: /sbin/consoletype: No such file or directory
[root@host-172-16-120-155 /]# crio -v
crio version 1.9.0

Comment 21 Junqi Zhao 2018-02-27 09:52:13 UTC
The issue happens again with crio 1.9.7: the metrics pods cannot start up, and the Cassandra pod reports "Readiness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1".
Reopening this bug.

metrics version: v3.9.0-0.53.0.0
# crio --version
crio version 1.9.7

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-j2nzr   0/1       Running   0          8m
hawkular-metrics-76qxt       0/1       Running   1          8m
heapster-z7g84               0/1       Running   0          7m

# oc describe po hawkular-cassandra-1-j2nzr
Name:           hawkular-cassandra-1-j2nzr
Namespace:      openshift-infra
Node:           ip-172-18-9-9.ec2.internal/172.18.9.9
Start Time:     Tue, 27 Feb 2018 04:40:41 -0500
Labels:         metrics-infra=hawkular-cassandra
                name=hawkular-cassandra-1
                type=hawkular-cassandra
Annotations:    openshift.io/scc=restricted
Status:         Running
IP:             10.129.0.39
Controlled By:  ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:  cri-o://672ecae1a98febb64530d5bc329716d466876f2d118cf10c3c8416f2458e6f32
    Image:         registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0
    Image ID:      registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra@sha256:b8c367205542a4ff725bad029fb89a4142f2c0eb63940c094232690dda12325f
    Ports:         9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
    State:          Running
      Started:      Tue, 27 Feb 2018 04:43:48 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2G
    Requests:
      memory:   1G
    Readiness:  exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CASSANDRA_MASTER:               true
      CASSANDRA_DATA_VOLUME:          /cassandra_data
      JVM_OPTS:                       -Dcassandra.commitlog.ignorereplayerrors=true
      ENABLE_PROMETHEUS_ENDPOINT:     True
      TRUSTSTORE_NODES_AUTHORITIES:   /hawkular-cassandra-certs/tls.peer.truststore.crt
      TRUSTSTORE_CLIENT_AUTHORITIES:  /hawkular-cassandra-certs/tls.client.truststore.crt
      POD_NAMESPACE:                  openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:                   2000000000 (limits.memory)
      CPU_LIMIT:                      node allocatable (limits.cpu)
    Mounts:
      /cassandra_data from cassandra-data (rw)
      /hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-s6ksh (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  cassandra-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  metrics-cassandra-1
    ReadOnly:   false
  hawkular-cassandra-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hawkular-cassandra-certs
    Optional:    false
  cassandra-token-s6ksh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cassandra-token-s6ksh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                 Age               From                                 Message
  ----     ------                 ----              ----                                 -------
  Normal   Scheduled              8m                default-scheduler                    Successfully assigned hawkular-cassandra-1-j2nzr to ip-172-18-9-9.ec2.internal
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-9-9.ec2.internal  MountVolume.SetUp succeeded for volume "cassandra-token-s6ksh"
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-9-9.ec2.internal  MountVolume.SetUp succeeded for volume "hawkular-cassandra-certs"
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-9-9.ec2.internal  MountVolume.SetUp succeeded for volume "pvc-1229192c-1ba2-11e8-845a-0ef9f426f1dc"
  Normal   Pulling                8m                kubelet, ip-172-18-9-9.ec2.internal  pulling image "registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0"
  Normal   Pulled                 5m                kubelet, ip-172-18-9-9.ec2.internal  Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0"
  Normal   Created                5m                kubelet, ip-172-18-9-9.ec2.internal  Created container
  Normal   Started                5m                kubelet, ip-172-18-9-9.ec2.internal  Started container
  Warning  Unhealthy              1m (x19 over 4m)  kubelet, ip-172-18-9-9.ec2.internal  Readiness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1
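
The probe above is configured with timeout=1s, and the event reports "command timed out". Whether the ready script genuinely needs more than one second on this node is not established in this bug, but it can be measured. The following is a minimal sketch under stated assumptions: GNU coreutils timeout is available where it runs, and run_with_deadline is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: run a command under a deadline, roughly as an exec probe with
# timeout=1s would. Assumes GNU coreutils `timeout`, which exits with
# status 124 when the deadline expires.
run_with_deadline() {
    seconds="$1"
    shift
    status=0
    timeout "$seconds" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${seconds}s" >&2
    fi
    return "$status"
}

# Example (pod name from this comment; this also measures the oc round trip,
# so it only approximates what the kubelet-driven probe experiences):
#   run_with_deadline 1 oc exec hawkular-cassandra-1-j2nzr -- \
#       /opt/apache-cassandra/bin/cassandra-docker-ready.sh
```

A status of 124 would indicate that the script did not finish within the deadline. Any other non-zero status is the exit code of the command itself.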

Comment 22 Antonio Murdaca 2018-02-28 15:53:19 UTC
Should be fixed by https://github.com/kubernetes-incubator/cri-o/pull/1386

Comment 23 Junqi Zhao 2018-03-02 03:49:08 UTC
Please change to ON_QA. The issue is fixed and the metrics pods start up:
# oc get po -n openshift-infra
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-8cz8s   1/1       Running   0          58m
hawkular-metrics-8gx7z       1/1       Running   0          58m
heapster-lxctn               1/1       Running   0          58m

# openshift version
openshift v3.9.1
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

# crio --version
crio version 1.9.7

cri-o images: v3.9.0-0.53.0.0
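
The verification above amounts to confirming that every pod's READY column shows all containers ready. A minimal sketch of that check follows, parsing the tabular output of oc get po as shown in this comment. The function name all_pods_ready is hypothetical, and the parsing assumes the default column layout with a header row and READY in the second column.

```shell
#!/bin/sh
# Sketch: succeed only if every pod line read from stdin has READY as n/n.
# Expects the default `oc get po` table (header line, READY in column 2).
all_pods_ready() {
    awk '
        NR == 1 { next }                      # skip the header row
        {
            split($2, ready, "/")             # e.g. "0/1" -> ready[1]=0, ready[2]=1
            seen = 1
            if (ready[1] != ready[2]) bad = 1
        }
        END { exit (seen && !bad) ? 0 : 1 }   # no pods at all counts as not ready
    '
}

# Example:
#   oc get po -n openshift-infra | all_pods_ready && echo "metrics pods ready"
```

Treating an empty pod list as a failure is a deliberate choice in this sketch, so that a missing deployment is not mistaken for a healthy one.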

Comment 24 Junqi Zhao 2018-03-02 08:47:35 UTC
Removing the TestBlocker keyword since the issue is fixed.

Comment 25 Junqi Zhao 2018-03-06 04:22:01 UTC
Set to VERIFIED as per Comment 23