Bug 1517652
| Summary: | [CRI-O] Cassandra doesn't start on crio | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anping Li <anli> |
| Component: | Containers | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.7.0 | CC: | amurdaca, anli, aos-bugs, jhonce, jokerman, miburman, mmccomas, mwringe, trankin, vlaad |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | 3.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1607984 | Environment: | |
| Last Closed: | 2018-09-11 17:36:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Hawkular Metrics requires that Cassandra enter the ready state before Hawkular Metrics itself can become ready. If Hawkular Metrics can't successfully connect to Cassandra within a certain time period, the Hawkular Metrics pod is restarted automatically.
For any metrics issues you will need to attach:
- the logs for the metric components (Hawkular Metrics, Cassandra, Heapster). [But in this case we only need the Cassandra logs because the other pods can't start yet]
- the output of 'oc get pods -n openshift-infra -o yaml'
- the output of 'oc describe pod ${HAWKULAR_CASSANDRA_POD_NAME}'
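The items requested above can be gathered in one pass. The following is a minimal sketch, assuming `oc` is already logged in to the cluster; the function name `collect_metrics_diagnostics`, the output directory, and the file names are illustrative and not part of this report.

```shell
#!/bin/sh
# Sketch: gather the metrics diagnostics requested above into one directory.
# Assumptions (not from this report): the function name, the output
# directory layout, and the file names are all illustrative.
collect_metrics_diagnostics() {
    ns="${1:-openshift-infra}"         # namespace the metrics pods run in
    out="${2:-./metrics-diagnostics}"  # where the collected files are written
    mkdir -p "$out" || return 1

    # Full pod definitions and status for the namespace.
    oc get pods -n "$ns" -o yaml > "$out/pods.yaml"

    # Describe output and logs for every pod in the namespace
    # (Hawkular Metrics, Cassandra, Heapster).
    for pod in $(oc get pods -n "$ns" -o name); do
        name="${pod##*/}"   # strip the resource-type prefix before the slash
        oc describe pod "$name" -n "$ns" > "$out/$name.describe.txt"
        oc logs "$name" -n "$ns" > "$out/$name.log" 2>&1
    done
}
```

Calling `collect_metrics_diagnostics openshift-infra ./diag` would leave one `.describe.txt` and one `.log` file per pod next to `pods.yaml`, ready to attach.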
Created attachment 1360255 [details]
openshift infra logs
It looks like it failing due to:
/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found

@stefan: can you please take a look at this and make sure its not something wrong with that docker image they are using?

Anping, can you verify your version of CRI-O ? is the very same image here working with a docker cluster? I suspect something is wrong in the image rather than something different between cri-o/docker clusters

/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found

if the readiness probe is run *inside* the container, failing with the error above, then it's an image issue.

At least the latest image seems fine:

[root@miranda rhq]# docker run -i -t c3d909b40322 /bin/bash
bash-4.2$ ls -l /opt/apache-cassandra/bin/nodetool
-rwxrwxrwx. 1 root root 3359 Jun 29 13:51 /opt/apache-cassandra/bin/nodetool
bash-4.2$ nodetool
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
..

[miburman@miranda Downloads]$ docker pull registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
Trying to pull repository registry.access.stage.redhat.com/openshift3/metrics-cassandra ...
sha256:02580cf6f69a49f0e9e1018aa2694b275d77779862dfd50794e2e3bddc85e216: Pulling from registry.access.stage.redhat.com/openshift3/metrics-cassandra
...

docker pull registry.access.stage.redhat.com/openshift3/metrics-cassandra@sha256:c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c

Seems to reply with 404, so I can't verify that one. Anping, can you verify with current image?

(In reply to Michael Burman from comment #10)
> At least the latest image seems fine:
>
> [root@miranda rhq]# docker run -i -t c3d909b40322 /bin/bash
> bash-4.2$ ls -l /opt/apache-cassandra/bin/nodetool
> -rwxrwxrwx. 1 root root 3359 Jun 29 13:51 /opt/apache-cassandra/bin/nodetool
> bash-4.2$ nodetool
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> ..
>
> [miburman@miranda Downloads]$ docker pull registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
> Trying to pull repository registry.access.stage.redhat.com/openshift3/metrics-cassandra ...
> sha256:02580cf6f69a49f0e9e1018aa2694b275d77779862dfd50794e2e3bddc85e216: Pulling from registry.access.stage.redhat.com/openshift3/metrics-cassandra
> ...
>
> docker pull registry.access.stage.redhat.com/openshift3/metrics-cassandra@sha256:c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c
>
> Seems to reply with 404, so I can't verify that one. Anping, can you verify with current image?

can you check "/opt/apache-cassandra/bin/cassandra-docker-ready.sh"?

@Antonio, I am not sure if that is a image issue.

The cri-o images/version
registry.access.stage.redhat.com/openshift3/cri-o:v3.7
crio -v
crio version 1.0.4

Both rsh and exec works

# oc rsh hawkular-cassandra-1-f6bcf /opt/apache-cassandra/bin/cassandra-docker-ready.sh
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
Cassandra is in the up and normal state. It is now ready.
# oc exec hawkular-cassandra-1-f6bcf /opt/apache-cassandra/bin/cassandra-docker-ready.sh
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
Cassandra is in the up and normal state. It is now ready.

(In reply to Anping Li from comment #12)
> @Antonio, I am not sure if that is a image issue.
>
> The cri-o images/version
> registry.access.stage.redhat.com/openshift3/cri-o:v3.7
> crio -v
> crio version 1.0.4
>
> Both rsh and exec works
>
> # oc rsh hawkular-cassandra-1-f6bcf /opt/apache-cassandra/bin/cassandra-docker-ready.sh
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> Cassandra is in the up and normal state. It is now ready.
> # oc exec hawkular-cassandra-1-f6bcf /opt/apache-cassandra/bin/cassandra-docker-ready.sh
> Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
> Cassandra is in the up and normal state. It is now ready.

Could you provide steps to deploy metrics so I can reproduce?

Ok, I can reproduce this with kubernetes as well

Fix is here https://github.com/kubernetes-incubator/cri-o/pull/1187
We'll release a new patch release and a new system container once that's merged and released

*** Bug 1531495 has been marked as a duplicate of this bug. ***

(In reply to Junqi Zhao from comment #18)
> Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1529233

Ignore this comment, it is fixed in 3.9 now

Tested with crio version 1.9.0, Cassandra pod can start up on crio
# oc get po
NAME READY STATUS RESTARTS AGE
hawkular-cassandra-1-nktsc 1/1 Running 0 3m
hawkular-metrics-tl844 1/1 Running 0 3m
heapster-wtg7q 1/1 Running 0 3m
[root@host-172-16-120-155 ~]# runc exec -t cri-o bash
bash: /sbin/consoletype: No such file or directory
[root@host-172-16-120-155 /]# crio -v
crio version 1.9.0

Same issue happens in crio 1.9.7,metrics pods could not be started up, "Readiness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1" for cassandra pod.
Re-open it
metrics version: v3.9.0-0.53.0.0
# crio --version
crio version 1.9.7
# oc get po
NAME READY STATUS RESTARTS AGE
hawkular-cassandra-1-j2nzr 0/1 Running 0 8m
hawkular-metrics-76qxt 0/1 Running 1 8m
heapster-z7g84 0/1 Running 0 7m
# oc describe po hawkular-cassandra-1-j2nzr
Name: hawkular-cassandra-1-j2nzr
Namespace: openshift-infra
Node: ip-172-18-9-9.ec2.internal/172.18.9.9
Start Time: Tue, 27 Feb 2018 04:40:41 -0500
Labels: metrics-infra=hawkular-cassandra
name=hawkular-cassandra-1
type=hawkular-cassandra
Annotations: openshift.io/scc=restricted
Status: Running
IP: 10.129.0.39
Controlled By: ReplicationController/hawkular-cassandra-1
Containers:
hawkular-cassandra-1:
Container ID: cri-o://672ecae1a98febb64530d5bc329716d466876f2d118cf10c3c8416f2458e6f32
Image: registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0
Image ID: registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra@sha256:b8c367205542a4ff725bad029fb89a4142f2c0eb63940c094232690dda12325f
Ports: 9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
Command:
/opt/apache-cassandra/bin/cassandra-docker.sh
--cluster_name=hawkular-metrics
--data_volume=/cassandra_data
--internode_encryption=all
--require_node_auth=true
--enable_client_encryption=true
--require_client_auth=true
State: Running
Started: Tue, 27 Feb 2018 04:43:48 -0500
Ready: False
Restart Count: 0
Limits:
memory: 2G
Requests:
memory: 1G
Readiness: exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
CASSANDRA_MASTER: true
CASSANDRA_DATA_VOLUME: /cassandra_data
JVM_OPTS: -Dcassandra.commitlog.ignorereplayerrors=true
ENABLE_PROMETHEUS_ENDPOINT: True
TRUSTSTORE_NODES_AUTHORITIES: /hawkular-cassandra-certs/tls.peer.truststore.crt
TRUSTSTORE_CLIENT_AUTHORITIES: /hawkular-cassandra-certs/tls.client.truststore.crt
POD_NAMESPACE: openshift-infra (v1:metadata.namespace)
MEMORY_LIMIT: 2000000000 (limits.memory)
CPU_LIMIT: node allocatable (limits.cpu)
Mounts:
/cassandra_data from cassandra-data (rw)
/hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-s6ksh (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
cassandra-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: metrics-cassandra-1
ReadOnly: false
hawkular-cassandra-certs:
Type: Secret (a volume populated by a Secret)
SecretName: hawkular-cassandra-certs
Optional: false
cassandra-token-s6ksh:
Type: Secret (a volume populated by a Secret)
SecretName: cassandra-token-s6ksh
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m default-scheduler Successfully assigned hawkular-cassandra-1-j2nzr to ip-172-18-9-9.ec2.internal
Normal SuccessfulMountVolume 8m kubelet, ip-172-18-9-9.ec2.internal MountVolume.SetUp succeeded for volume "cassandra-token-s6ksh"
Normal SuccessfulMountVolume 8m kubelet, ip-172-18-9-9.ec2.internal MountVolume.SetUp succeeded for volume "hawkular-cassandra-certs"
Normal SuccessfulMountVolume 8m kubelet, ip-172-18-9-9.ec2.internal MountVolume.SetUp succeeded for volume "pvc-1229192c-1ba2-11e8-845a-0ef9f426f1dc"
Normal Pulling 8m kubelet, ip-172-18-9-9.ec2.internal pulling image "registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0"
Normal Pulled 5m kubelet, ip-172-18-9-9.ec2.internal Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9.0-0.53.0.0"
Normal Created 5m kubelet, ip-172-18-9-9.ec2.internal Created container
Normal Started 5m kubelet, ip-172-18-9-9.ec2.internal Started container
Warning Unhealthy 1m (x19 over 4m) kubelet, ip-172-18-9-9.ec2.internal Readiness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1
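The describe output above shows an exec readiness probe with timeout=1s and an event reporting "command timed out". One way to check whether the readiness script itself finishes within that budget is to run it under a local deadline. This is only a rough approximation built on coreutils `timeout`; it does not reproduce how the kubelet or CRI-O enforce probe timeouts, and the function name `probe_with_deadline` is illustrative.

```shell
#!/bin/sh
# Sketch: run a command under a deadline, roughly as an exec probe with
# timeout=1s would. This uses coreutils `timeout` as a local stand-in and
# is not how CRI-O or the kubelet implement probe timeouts.
# The function name probe_with_deadline is illustrative.
probe_with_deadline() {
    deadline="$1"   # seconds allowed before the command is killed
    shift
    timeout "$deadline" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # coreutils timeout exits with status 124 when the deadline expires
        echo "probe timed out after ${deadline}s" >&2
    fi
    return "$rc"
}
```

For example, `probe_with_deadline 1 oc exec hawkular-cassandra-1-j2nzr -- /opt/apache-cassandra/bin/cassandra-docker-ready.sh` would show whether the script returns within one second when driven from the client. That measurement includes the API round trip, so it overestimates the in-container run time. Because nodetool starts a JVM, a run longer than one second would not be surprising, but this report does not state that as the cause.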
Should be fixed by https://github.com/kubernetes-incubator/cri-o/pull/1386

Please change to ON_QA, issue is fixed, metrics pods could be started up
# oc get po -n openshift-infra
NAME READY STATUS RESTARTS AGE
hawkular-cassandra-1-8cz8s 1/1 Running 0 58m
hawkular-metrics-8gx7z 1/1 Running 0 58m
heapster-lxctn 1/1 Running 0 58m
# openshift version
openshift v3.9.1
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16
# crio --version
crio version 1.9.7
cri-o images: v3.9.0-0.53.0.0

Remove TestBlocker keyword since issue is fixed

Set to VERIFIED as per Comment 23
Description of problem:
When metrics are deployed on a CRI-O system, hawkular-metrics cannot connect to hawkular-cassandra. The readiness probe check appears to fail, so there is no endpoint for hawkular-cassandra.

Version-Release number of selected component (if applicable):
openshift-ansible-3.7.7-1.git.0.3e1b62b.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP-3.7 with crio
openshift_use_crio=true
openshift_crio_systemcontainer_image_override=registry.access.stage.redhat.com/openshift3/cri-o:v3.7
2. Deploy metrics
3. Check the metrics pod status
# oc get pods
NAME READY STATUS RESTARTS AGE
hawkular-cassandra-1-jpz2v 0/1 Running 0 2m
hawkular-metrics-w7sr9 0/1 Running 7 1h
heapster-94vnp 0/1 Running 7 1h
# oc describe pod hawkular-cassandra-1-jpz2v

Actual results:
The container is not ready. oc describe on the hawkular-cassandra pod shows "Readiness probe failed: Could not get the Cassandra status".

[root@host-172-16-120-6 ~]# oc describe pod hawkular-cassandra-1-jpz2v
Name: hawkular-cassandra-1-jpz2v
Namespace: openshift-infra
Node: 172.16.120.40/172.16.120.40
Start Time: Mon, 27 Nov 2017 00:46:09 -0500
Labels: metrics-infra=hawkular-cassandra
name=hawkular-cassandra-1
type=hawkular-cassandra
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"hawkular-cassandra-1","uid":"43502c23-d316-11...
openshift.io/scc=restricted
Status: Running
IP: 10.129.0.13
Created By: ReplicationController/hawkular-cassandra-1
Controlled By: ReplicationController/hawkular-cassandra-1
Containers:
hawkular-cassandra-1:
Container ID: cri-o://cff2143c32fe916766c24c885471f0096d23c639358be22da0cd74bd8b242f6f
Image: registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7
Image ID: c38e9f5edf372e446b03abc16c5d966cc1afa74909d1868436c17d2ccaa39d4c
Ports: 9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
Command:
/opt/apache-cassandra/bin/cassandra-docker.sh
--cluster_name=hawkular-metrics
--data_volume=/cassandra_data
--internode_encryption=all
--require_node_auth=true
--enable_client_encryption=true
--require_client_auth=true
State: Running
Started: Mon, 27 Nov 2017 00:46:09 -0500
Ready: False
Restart Count: 0
Limits:
memory: 2G
Requests:
memory: 1G
Readiness: exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
CASSANDRA_MASTER: true
CASSANDRA_DATA_VOLUME: /cassandra_data
JVM_OPTS: -Dcassandra.commitlog.ignorereplayerrors=true
ENABLE_PROMETHEUS_ENDPOINT: True
TRUSTSTORE_NODES_AUTHORITIES: /hawkular-cassandra-certs/tls.peer.truststore.crt
TRUSTSTORE_CLIENT_AUTHORITIES: /hawkular-cassandra-certs/tls.client.truststore.crt
POD_NAMESPACE: openshift-infra (v1:metadata.namespace)
MEMORY_LIMIT: 2000000000 (limits.memory)
CPU_LIMIT: node allocatable (limits.cpu)
Mounts:
/cassandra_data from cassandra-data (rw)
/hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-zvpk7 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
cassandra-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
hawkular-cassandra-certs:
Type: Secret (a volume populated by a Secret)
SecretName: hawkular-cassandra-certs
Optional: false
cassandra-token-zvpk7:
Type: Secret (a volume populated by a Secret)
SecretName: cassandra-token-zvpk7
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
10s 10s 1 default-scheduler Normal Scheduled Successfully assigned hawkular-cassandra-1-jpz2v to 172.16.120.40
9s 9s 1 kubelet, 172.16.120.40 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "cassandra-data"
9s 9s 1 kubelet, 172.16.120.40 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "cassandra-token-zvpk7"
9s 9s 1 kubelet, 172.16.120.40 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "hawkular-cassandra-certs"
9s 9s 1 kubelet, 172.16.120.40 spec.containers{hawkular-cassandra-1} Normal Pulled Container image "registry.access.stage.redhat.com/openshift3/metrics-cassandra:v3.7" already present on machine
9s 9s 1 kubelet, 172.16.120.40 spec.containers{hawkular-cassandra-1} Normal Created Created container
9s 9s 1 kubelet, 172.16.120.40 spec.containers{hawkular-cassandra-1} Normal Started Started container
2s 2s 1 kubelet, 172.16.120.40 spec.containers{hawkular-cassandra-1} Warning Unhealthy Readiness probe failed: Could not get the Cassandra status. This may mean that the Cassandra instance is not up yet. Will try again
/opt/apache-cassandra/bin/cassandra-docker-ready.sh: line 25: nodetool: command not found

Expected results:
Metrics work well on CRI-O.

Additional info:
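The "nodetool: command not found" message in the probe event means the shell running the readiness script could not resolve nodetool on its PATH, even though the binary exists at /opt/apache-cassandra/bin/nodetool in the image. A small helper like the one below can be used to compare PATH values between environments (for example, the PATH seen by `oc rsh` versus the one seen by the probe). It is a sketch only: the function name `find_in_path` is illustrative, and this report does not confirm that a PATH difference was the root cause.

```shell
#!/bin/sh
# Sketch: report whether a command resolves in a given PATH string.
# The function name find_in_path is illustrative and not part of the
# Cassandra image or of CRI-O.
find_in_path() {
    name="$1"          # command to look for, e.g. nodetool
    search_path="$2"   # colon-separated PATH value to search
    old_ifs="$IFS"
    IFS=:
    for dir in $search_path; do
        candidate="${dir:-.}/$name"   # an empty PATH entry means "."
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            IFS="$old_ifs"
            echo "$candidate"
            return 0
        fi
    done
    IFS="$old_ifs"
    echo "$name: not found in $search_path" >&2
    return 1
}
```

Inside the container, `find_in_path nodetool "$PATH"` prints the resolved location when the lookup succeeds and returns a non-zero status when it does not, which makes it easy to test a PATH value captured from another environment.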