Description of problem:
Cluster metrics with NFS as persistent storage fails: hawkular-metrics and heapster do not start.

Version-Release number of selected component (if applicable): 3.1

How reproducible: 100%

Steps to Reproduce:
1. Follow the steps from the docs up to:
~~~
oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/v1.1/infrastructure-templates/enterprise/metrics-deployer.yaml -v HAWKULAR_METRICS_HOSTNAME=metrics.xpaas.xyz,IMAGE_PREFIX=registry.access.redhat.com/openshift3/,IMAGE_VERSION=latest,USE_PERSISTENT_STORAGE=true,CASSANDRA_PV_SIZE=5Gi | oc create -f -
~~~

Actual results:
Of the three pods, hawkular-{cassandra,metrics} and heapster, only hawkular-cassandra runs without failing. Something prevents hawkular-metrics from starting, and because hawkular-metrics is failing, heapster fails too.

Expected results:
Three working pods with NFS storage.

Additional info:
~~~
[root@master1 ~]# oc get po
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-7ihsf   1/1       Running            0          9m
hawkular-metrics-qzv7q       0/1       CrashLoopBackOff   6          9m
heapster-q763k               0/1       CrashLoopBackOff   7          9m

[root@master1 ~]# oc get pvc
NAME                  LABELS                             STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
metrics-cassandra-1   metrics-infra=hawkular-cassandra   Bound     pv0001    10Gi       RWO,RWX       9m
~~~
The Heapster container requires the Hawkular Metrics container to be running or else it cannot start properly, and the Hawkular Metrics container requires the Cassandra container to fully start before it can be used. So in this case (and judging from the logs) Hawkular Metrics is having a problem using Cassandra, which causes the Hawkular Metrics container to not start, which in turn causes the Heapster container to not start.

It's only Cassandra that uses persistent storage, and since Cassandra is starting I am not sure whether this is an issue with persistent storage or not.

Can you describe a bit of the process you are using to deploy this? When you run:
~~~
oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/v1.1/infrastructure-templates/enterprise/metrics-deployer.yaml -v HAWKULAR_METRICS_HOSTNAME=metrics.xpaas.xyz,IMAGE_PREFIX=registry.access.redhat.com/openshift3/,IMAGE_VERSION=latest,USE_PERSISTENT_STORAGE=true,CASSANDRA_PV_SIZE=5Gi | oc create -f -
~~~
- Is the persistent storage empty here, or does it already contain some data?
- Is this the first time running this, or are you restarting the containers? etc.
The storage is empty, and once the pod runs the following is in the PV:
~~~
[root@nfs pv0001]# ll
total 0
drwxr-xr-x. 2 nfsnobody nfsnobody 78 Jan 20 17:27 commitlog
drwxr-xr-x. 6 nfsnobody nfsnobody 82 Jan 20 17:31 data
~~~
Export details:
~~~
# cat /etc/exports
/exports/pv0001 *(rw,sync,all_squash)
~~~
The same behavior remains after restarting the pods.
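In case it's useful for comparison, the PV object backing an export like this would look roughly as follows (a minimal sketch: the NFS server hostname 'nfs.example.com' is an assumption, and the capacity/access modes are copied from the 'oc get pvc' output in the description):
~~~
# Hypothetical PV definition for the /exports/pv0001 export shown above.
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  nfs:
    server: nfs.example.com   # assumption: replace with the actual NFS server
    path: /exports/pv0001
EOF
~~~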
Sorry for taking so long to get back to this. Can you clarify a few things:

- When you start it the first time with the PV (e.g. when the PV is empty), does this currently work? This should be equivalent to starting without the PV, which seems to be working. Is there anything in the Cassandra logs?
- Was another version of Hawkular Metrics ever run against a Cassandra using that PV, for instance the origin-metrics images? Or has it always been the OSE images? There may be a problem if you were originally using a different version of Hawkular Metrics and then try to use the OSE one. The schema that Hawkular Metrics uses differs between those versions and won't automatically be resolved if the version changes.
@matt: The only pod that starts and creates data is the hawkular-cassandra pod.
~~~
[root@nfs ~]# ll /exports/pv1n1g/*
/exports/pv1n1g/commitlog:
total 176
-rw-r--r--. 1 nfsnobody nfsnobody 33554432 Feb  9 10:25 CommitLog-5-1455006298657.log
-rw-r--r--. 1 nfsnobody nfsnobody 33554432 Feb  9 10:06 CommitLog-5-1455006298658.log

/exports/pv1n1g/data:
total 12
drwxr-xr-x. 22 nfsnobody nfsnobody 4096 Feb  9 10:04 system
drwxr-xr-x.  6 nfsnobody nfsnobody 4096 Feb  9 10:10 system_auth
drwxr-xr-x.  4 nfsnobody nfsnobody 4096 Feb  9 10:08 system_distributed
drwxr-xr-x.  4 nfsnobody nfsnobody  100 Feb  9 10:08 system_traces
~~~
Hawkular-metrics is failing with:
~~~
03:32:50,742 DEBUG [org.jboss.as.config] (MSC service thread 1-3) VM Arguments: -D[Standalone] -XX:+UseCompressedOops -verbose:gc -Xloggc:/opt/eap/standalone/log/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1303m -Xmx1303m -XX:MaxPermSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager -Djava.awt.headless=true -Djboss.modules.policy-permissions=true -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/main/jboss-logmanager-1.5.4.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/javax.json-1.0.4.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -javaagent:/opt/eap/jolokia.jar=port=8778,host=127.0.0.1,discoveryEnabled=false -Dorg.jboss.boot.log.file=/opt/eap/standalone/log/server.log -Dlogging.configuration=file:/opt/eap/standalone/configuration/logging.properties
03:32:52,934 INFO  [org.xnio] (MSC service thread 1-3) XNIO Version 3.0.14.GA-redhat-1
03:32:52,956 INFO  [org.jboss.as.server] (Controller Boot Thread) JBAS015888: Creating http management service using socket-binding (management-http)
03:32:53,018 INFO  [org.xnio.nio] (MSC service thread 1-3) XNIO NIO Implementation Version 3.0.14.GA-redhat-1
03:32:53,061 INFO  [org.jboss.remoting] (MSC service thread 1-3) JBoss Remoting version 3.3.5.Final-redhat-1
*** JBossAS process (174) received TERM signal ***
*** JBossAS process (174) received TERM signal ***
~~~
Heapster fails because hawkular-metrics is failing. The containers used are the latest from the RH registry.
It looks like something external is killing your Hawkular Metrics instance before it gets a chance to start up. There is no fatal error message in the logs. Is there anything under events when you do 'oc describe pod ${POD_NAME}'?

We do use a postStart script to determine when the Hawkular Metrics instance is started, but that shouldn't normally cause the Hawkular Metrics instance to be terminated. Would it be possible for you to remove the 'lifecycle' and 'livenessProbe' sections of the 'hawkular-metrics' rc, just to see if we can rule out the postStart script causing this problem?

If the postStart script does turn out to be causing the problem, can you please run it manually in the container (/opt/hawkular/scripts/hawkular-metrics-poststart.py) and see what kind of error it outputs? Ideally we would be able to fetch the logs from the postStart script, but I don't believe that is currently possible.
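Something like this, for reference (a sketch; the rc name comes from the template and the pod name from the 'oc get po' output above):
~~~
# Remove the 'lifecycle:' and 'livenessProbe:' stanzas under the
# hawkular-metrics container spec, then let the pod redeploy.
oc edit rc hawkular-metrics

# Run the postStart script by hand inside the new pod and capture
# its output and exit code.
oc exec hawkular-metrics-qzv7q -- /opt/hawkular/scripts/hawkular-metrics-poststart.py
echo $?
~~~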
@Matt: Removing the 'lifecycle' and 'livenessProbe' sections of 'hawkular-metrics' in the rc didn't make the pod fail. When running the postStart script, no output is shown:
~~~
$ oc rsh hawkular-metrics-tax0m
id: cannot find name for user ID 1000010000
[I have no name!@hawkular-metrics-tax0m ~]$ /opt/hawkular/scripts/hawkular-metrics-poststart.py
[I have no name!@hawkular-metrics-tax0m ~]$ python /opt/hawkular/scripts/hawkular-metrics-poststart.py
[I have no name!@hawkular-metrics-tax0m ~]$
~~~
Do you know what the exit code is for the script?
@matt:
~~~
# oc rsh hawkular-metrics-tax0m
id: cannot find name for user ID 1000010000
[I have no name!@hawkular-metrics-tax0m ~]$ /opt/hawkular/scripts/hawkular-metrics-poststart.py
[I have no name!@hawkular-metrics-tax0m ~]$ echo $?
1
~~~
Ok, that is not good: the script is failing. And it looks like there are no print statements in the script either to say why it's failing :(

Could you post the output of https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics/status?
@matt:
~~~
# curl https://metrics.xpaas.xyz/hawkular/metrics/status
{"MetricsService":"FAILED","Implementation-Version":"0.8.0.Final-redhat-1","Built-From-Git-SHA1":"826f08dd34912ad455a4cb2b34f2e79cd79ace9a"}
~~~
Ok, the postStart script is working properly: the container should be restarted if the state is 'FAILED'. Why it's in a FAILED state is another question, though.

Can you paste the logs somewhere? There should be an error about Hawkular Metrics not being able to connect to Cassandra. The logs should be under /opt/eap/standalone/log/server.log
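Something like the following should pull the log out of the container (the pod name is taken from the 'oc get po' output in the description):
~~~
# Dump the EAP server log from the hawkular-metrics container to a local file.
oc exec hawkular-metrics-qzv7q -- cat /opt/eap/standalone/log/server.log > server.log
~~~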
Created attachment 1123510 [details]
server.log
@matt: I've attached the log in BZ.
Can you provide the output of running the following on the Cassandra instance: 'nodetool cfstats hawkular_metrics'?

Can you also provide the Cassandra logs?
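For example (a sketch; the pod name is from the 'oc get po' output in the description, and the nodetool path is an assumption based on the image's Cassandra install under /opt/apache-cassandra):
~~~
# Column-family stats for the hawkular_metrics keyspace.
oc exec hawkular-cassandra-1-7ihsf -- /opt/apache-cassandra/bin/nodetool cfstats hawkular_metrics

# Container logs for the Cassandra pod.
oc logs hawkular-cassandra-1-7ihsf > cassandra.log
~~~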
I hit a similar issue: the hawkular-cassandra pod fails to start, as below.

Tested latest images:
~~~
openshift3/metrics-hawkular-metrics   5c02894a36cd
openshift3/metrics-deployer           eddb89c5bd34
openshift3/metrics-cassandra          d01f8f782def
openshift3/metrics-heapster           341bad0bb73f
~~~
~~~
[chunchen@F17-CCY daily]$ oc get pvc
NAME                  STATUS    VOLUME                              CAPACITY   ACCESSMODES   AGE
metrics-cassandra-1   Bound     logging-elasticsearch-pv-vyneg2ue   5Gi        RWO           1h

[chunchen@F17-CCY daily]$ oc describe pod hawkular-cassandra-1-n417r
Name:           hawkular-cassandra-1-n417r
Namespace:      chunmetrics
Image(s):       brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-cassandra:latest
Node:           openshift-114.lab.sjc.redhat.com/10.14.6.114
Start Time:     Wed, 02 Mar 2016 15:41:15 +0800
Labels:         metrics-infra=hawkular-cassandra,name=hawkular-cassandra-1,type=hawkular-cassandra
Status:         Pending
Reason:
Message:
IP:
Controllers:    ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:
    Image:          brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-cassandra:latest
    Image ID:
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --seed_provider_classname=org.hawkular.openshift.cassandra.OpenshiftSeedProvider
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
      --keystore_file=/secret/cassandra.keystore
      --keystore_password_file=/secret/cassandra.keystore.password
      --truststore_file=/secret/cassandra.truststore
      --truststore_password_file=/secret/cassandra.truststore.password
      --cassandra_pem_file=/secret/cassandra.pem
    QoS Tier:
      cpu:          BestEffort
      memory:       BestEffort
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment Variables:
      CASSANDRA_MASTER:  true
      POD_NAMESPACE:     chunmetrics (v1:metadata.namespace)
Conditions:
  Type    Status
  Ready   False
Volumes:
  cassandra-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  metrics-cassandra-1
    ReadOnly:   false
  hawkular-cassandra-secrets:
    Type:       Secret (a secret that should populate this volume)
    SecretName: hawkular-cassandra-secrets
  cassandra-token-iozd5:
    Type:       Secret (a secret that should populate this volume)
    SecretName: cassandra-token-iozd5
Events:
  FirstSeen  LastSeen  Count  From  SubobjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  ----  ------  -------
  1h   1h   1   {scheduler }  Scheduled  Successfully assigned hawkular-cassandra-1-n417r to openshift-114.lab.sjc.redhat.com
  1h   58m  29  {kubelet openshift-114.lab.sjc.redhat.com}  FailedMount  Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": failed to instantiate volume plugin for cassandra-data: The claim metrics-cassandra-1 is not yet bound to a volume
  1h   58m  29  {kubelet openshift-114.lab.sjc.redhat.com}  FailedSync   Error syncing pod, skipping: failed to instantiate volume plugin for cassandra-data: The claim metrics-cassandra-1 is not yet bound to a volume
  53m  53m  1   {kubelet openshift-114.lab.sjc.redhat.com}  FailedSync   Error syncing pod, skipping: failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: net/http: TLS handshake timeout
  53m  53m  1   {kubelet openshift-114.lab.sjc.redhat.com}  FailedMount  Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: net/http: TLS handshake timeout
  53m  53m  2   {kubelet openshift-114.lab.sjc.redhat.com}  FailedMount  Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: EOF
  53m  53m  2   {kubelet openshift-114.lab.sjc.redhat.com}  FailedSync   Error syncing pod, skipping: failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: EOF
  57m  48m  19  {kubelet openshift-114.lab.sjc.redhat.com}  FailedSync   Error syncing pod, skipping: exit status 32
  57m  48m  19  {kubelet openshift-114.lab.sjc.redhat.com}  FailedMount  Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": exit status 32
  47m  14m  70  {kubelet openshift-114.lab.sjc.redhat.com}  FailedMount  Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": exit status 32
  47m  14m  70  {kubelet openshift-114.lab.sjc.redhat.com}  FailedSync   Error syncing pod, skipping: exit status 32
  13m  11s  30  {kubelet openshift-114.lab.sjc.redhat.com}  FailedMount  Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": exit status 32
  13m  11s  30  {kubelet openshift-114.lab.sjc.redhat.com}  FailedSync   Error syncing pod, skipping: exit status 32
~~~
@chunchen It appears that there is something wrong with your persistent volume setup:
~~~
Error syncing pod, skipping: failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: net/http: TLS handshake timeout

Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: EOF
~~~
Can you deploy any other pods that use persistent volumes on this system? I also don't believe this is related to the original issue posted here; could you please open a new bugzilla if you determine it is in fact related to metrics and not the persistent volume setup on that system?
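A quick way to check the claim and volume state (a sketch; resource names are from the output above):
~~~
# Verify the PV exists, confirm the claim actually bound, and look at
# the claim's events for binding errors.
oc get pv
oc get pvc metrics-cassandra-1
oc describe pvc metrics-cassandra-1
~~~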
@Matt Wringe: I tried with new images (metrics-hawkular:d1fe5a5605da) and a new persistent volume on the latest OSE env, but my issue is no longer reproduced.
Is there anything else that needs to be done with this issue? Or can it be closed?
According to comment #19, marking it as verified.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days