Bug 1300329 - cluster metrics with nfs as persistent storage fails hawkular-metrics and heapster to start [NEEDINFO]
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.1.0
Hardware: All Linux
Priority: medium, Severity: low
Assigned To: Matt Wringe
QA Contact: chunchen
Reported: 2016-01-20 08:39 EST by Christophe Augello
Modified: 2016-09-29 22:16 EDT (History)
5 users

Doc Type: Bug Fix
Last Closed: 2016-09-07 17:15:41 EDT
Type: Bug
mwringe: needinfo? (caugello)


Attachments
server.log (11.81 KB, text/plain)
2016-02-12 09:09 EST, Christophe Augello

Description Christophe Augello 2016-01-20 08:39:50 EST
Description of problem:
With NFS as persistent storage for cluster metrics, hawkular-metrics and heapster fail to start.

Version-Release number of selected component (if applicable):
3.1

How reproducible:
100 %

Steps to Reproduce:
1. Follow the steps from the docs up to:

oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/v1.1/infrastructure-templates/enterprise/metrics-deployer.yaml -v HAWKULAR_METRICS_HOSTNAME=metrics.xpaas.xyz,IMAGE_PREFIX=registry.access.redhat.com/openshift3/,IMAGE_VERSION=latest,USE_PERSISTENT_STORAGE=true,CASSANDRA_PV_SIZE=5Gi | oc create -f -

Actual results:

Of the 3 pods (hawkular-cassandra, hawkular-metrics, and heapster), only hawkular-cassandra runs without failing. Something prevents hawkular-metrics from starting, and since hawkular-metrics is failing, heapster fails too.

Expected results:

3 working pods with NFS storage

Additional info:

[root@master1 ~]# oc get po
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-7ihsf   1/1       Running            0          9m
hawkular-metrics-qzv7q       0/1       CrashLoopBackOff   6          9m
heapster-q763k               0/1       CrashLoopBackOff   7          9m
[root@master1 ~]# oc get pvc
NAME                  LABELS                             STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
metrics-cassandra-1   metrics-infra=hawkular-cassandra   Bound     pv0001    10Gi       RWO,RWX       9m
Comment 2 Matt Wringe 2016-01-20 10:28:57 EST
The Heapster container requires the Hawkular Metrics container to be running or else it cannot start properly. And the Hawkular Metrics container requires the Cassandra container to fully start before it can be used.

So in this case (and looking at the logs), Hawkular Metrics is having a problem using Cassandra, which causes the Hawkular Metrics container to not start, which in turn causes the Heapster container to not start.

It's only Cassandra that uses persistent storage, and since Cassandra is starting I am not sure whether it's an issue with persistent storage or not.
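The startup ordering described above can be sketched as follows. This is a minimal illustration only, not actual deployer code; the `DEPENDS_ON` table and `can_start` helper are hypothetical names:

```python
# Startup dependency chain from comment 2: Heapster needs Hawkular Metrics,
# which in turn needs Cassandra to have fully started.
DEPENDS_ON = {
    "heapster": "hawkular-metrics",
    "hawkular-metrics": "hawkular-cassandra",
    "hawkular-cassandra": None,  # only Cassandra has no dependency
}

def can_start(pod, running):
    """A pod may start only once its dependency is already running."""
    dep = DEPENDS_ON[pod]
    return dep is None or dep in running

# With only Cassandra running (the situation in this bug), hawkular-metrics
# is eligible to start, but heapster still has to wait for it.
print(can_start("hawkular-metrics", {"hawkular-cassandra"}))  # True
print(can_start("heapster", {"hawkular-cassandra"}))          # False
```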

Can you describe a bit of the process you are using to deploy this?

When you run
"oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/v1.1/infrastructure-templates/enterprise/metrics-deployer.yaml -v HAWKULAR_METRICS_HOSTNAME=metrics.xpaas.xyz,IMAGE_PREFIX=registry.access.redhat.com/openshift3/,IMAGE_VERSION=latest,USE_PERSISTENT_STORAGE=true,CASSANDRA_PV_SIZE=5Gi | oc create -f -"

Is the persistent storage empty here, or does it already contain some data? Is this the first time running this, or are you restarting the containers?
Comment 3 Christophe Augello 2016-01-20 10:59:22 EST
The storage is empty, and once the pod runs the PV contains the following:
~~~
[root@nfs pv0001]# ll
total 0
drwxr-xr-x. 2 nfsnobody nfsnobody 78 Jan 20 17:27 commitlog
drwxr-xr-x. 6 nfsnobody nfsnobody 82 Jan 20 17:31 data
~~~
Export details
~~~
# cat /etc/exports
/exports/pv0001         *(rw,sync,all_squash)
~~~
The same behavior remains after restarting the pods.
Comment 4 Matt Wringe 2016-02-03 17:42:54 EST
Sorry for taking so long to get back to this.

Can you clarify a few things:

- When you start it the first time with the PV (e.g. when the PV is empty), does this currently work? It should be equivalent to starting without the PV, which seems to be working. Is there anything in the Cassandra logs?

- Was another version of Hawkular Metrics (for instance the origin-metrics images) ever run against a Cassandra using that PV, or has it always been the OSE images? There may be a problem if you were originally using a different version of Hawkular Metrics and then try to use the OSE one: the schema Hawkular Metrics uses differs between those versions and won't automatically be resolved if the version changes.
Comment 5 Christophe Augello 2016-02-09 03:48:02 EST
@matt:

The only pod that starts and creates data is the hawkular-cassandra pod.
~~~
[root@nfs ~]# ll /exports/pv1n1g/*
/exports/pv1n1g/commitlog:
total 176
-rw-r--r--. 1 nfsnobody nfsnobody 33554432 Feb  9 10:25 CommitLog-5-1455006298657.log
-rw-r--r--. 1 nfsnobody nfsnobody 33554432 Feb  9 10:06 CommitLog-5-1455006298658.log

/exports/pv1n1g/data:
total 12
drwxr-xr-x. 22 nfsnobody nfsnobody 4096 Feb  9 10:04 system
drwxr-xr-x.  6 nfsnobody nfsnobody 4096 Feb  9 10:10 system_auth
drwxr-xr-x.  4 nfsnobody nfsnobody 4096 Feb  9 10:08 system_distributed
drwxr-xr-x.  4 nfsnobody nfsnobody  100 Feb  9 10:08 system_traces
~~~

Hawkular-metrics is failing with:
~~~
03:32:50,742 DEBUG [org.jboss.as.config] (MSC service thread 1-3) VM Arguments: -D[Standalone] -XX:+UseCompressedOops -verbose:gc -Xloggc:/opt/eap/standalone/log/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1303m -Xmx1303m -XX:MaxPermSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager -Djava.awt.headless=true -Djboss.modules.policy-permissions=true -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/main/jboss-logmanager-1.5.4.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/javax.json-1.0.4.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -javaagent:/opt/eap/jolokia.jar=port=8778,host=127.0.0.1,discoveryEnabled=false -Dorg.jboss.boot.log.file=/opt/eap/standalone/log/server.log -Dlogging.configuration=file:/opt/eap/standalone/configuration/logging.properties 
03:32:52,934 INFO  [org.xnio] (MSC service thread 1-3) XNIO Version 3.0.14.GA-redhat-1
03:32:52,956 INFO  [org.jboss.as.server] (Controller Boot Thread) JBAS015888: Creating http management service using socket-binding (management-http)
03:32:53,018 INFO  [org.xnio.nio] (MSC service thread 1-3) XNIO NIO Implementation Version 3.0.14.GA-redhat-1
03:32:53,061 INFO  [org.jboss.remoting] (MSC service thread 1-3) JBoss Remoting version 3.3.5.Final-redhat-1
*** JBossAS process (174) received TERM signal ***
*** JBossAS process (174) received TERM signal ***
~~~

Heapster fails because hawkular-metrics is failing. The containers used are the latest from the Red Hat registry.
Comment 6 Matt Wringe 2016-02-09 10:01:54 EST
It looks like something external is killing your Hawkular-Metrics instance before it gets a chance to start up. There is no fatal error message in the logs.

Is there anything under events when you do 'oc describe pod ${POD_NAME}'?

We do use a postStart script to determine when the Hawkular Metrics instance is started, but that shouldn't normally cause the Hawkular Metrics instance to be terminated.

Would it be possible for you to remove the 'lifecycle' and 'livenessProbe' sections of the 'hawkular-metrics' rc, just to see if we can rule out the postStart script causing this problem?

If the postStart script is determined to be causing the problem, can you please run it manually in the container (/opt/hawkular/scripts/hawkular-metrics-poststart.py) and see what kind of error it outputs?

Ideally we would be able to fetch the logs from the postStart script, but I don't believe that is currently possible.
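One way to script the edit suggested above is with a JSON patch. This is a sketch only: it assumes the hawkular-metrics container is the first container in the rc's pod template (index 0), which may not hold on every deployment, and `strip_probe_patch` is a hypothetical helper name:

```python
import json

def strip_probe_patch(container_index=0):
    """Build a JSON patch (RFC 6902) removing the postStart hook ('lifecycle')
    and the 'livenessProbe' from one container of an rc's pod template."""
    base = "/spec/template/spec/containers/%d" % container_index
    return json.dumps([
        {"op": "remove", "path": base + "/lifecycle"},
        {"op": "remove", "path": base + "/livenessProbe"},
    ])

# Hypothetical invocation against the cluster:
#   oc patch rc hawkular-metrics --type=json -p "$(python build_patch.py)"
print(strip_probe_patch())
```

Alternatively, `oc edit rc hawkular-metrics` and deleting the two sections by hand achieves the same thing.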
Comment 7 Christophe Augello 2016-02-10 02:39:41 EST
@Matt:

Removing the 'lifecycle' and 'livenessProbe' sections of 'hawkular-metrics' in the rc didn't make the pod fail.

When running the postStart script, no output is shown:
~~~
oc rsh hawkular-metrics-tax0m
id: cannot find name for user ID 1000010000
<etrics-tax0m ~]$ /opt/hawkular/scripts/hawkular-metrics-poststart.py        
<etrics-tax0m ~]$ python /opt/hawkular/scripts/hawkular-metrics-poststart.py 
[I have no name!@hawkular-metrics-tax0m ~]$ 
~~~
Comment 8 Matt Wringe 2016-02-10 09:02:56 EST
Do you know what the exit code is for the script?
Comment 9 Christophe Augello 2016-02-10 09:36:31 EST
@matt:

# oc rsh hawkular-metrics-tax0m
id: cannot find name for user ID 1000010000
<etrics-tax0m ~]$ /opt/hawkular/scripts/hawkular-metrics-poststart.py        
[I have no name!@hawkular-metrics-tax0m ~]$ echo $?
1
Comment 10 Matt Wringe 2016-02-10 10:32:07 EST
Ok, that is not good: the script is failing. And it looks like there are no print statements in the script to say why it's failing :(

Could you post the output of https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics/status?
Comment 11 Christophe Augello 2016-02-10 10:59:37 EST
@matt:

#curl https://metrics.xpaas.xyz/hawkular/metrics/status
{"MetricsService":"FAILED","Implementation-Version":"0.8.0.Final-redhat-1","Built-From-Git-SHA1":"826f08dd34912ad455a4cb2b34f2e79cd79ace9a"}
Comment 12 Matt Wringe 2016-02-10 11:27:28 EST
Ok, the postStart script is working properly: the container should be restarted if the state is 'FAILED'.

Why it's in a FAILED state is another question, though. Can you paste the logs somewhere? There should be an error about Hawkular Metrics not being able to connect to Cassandra. The logs should be under /opt/eap/standalone/log/server.log.
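The check the postStart script presumably performs can be sketched like this. It is illustrative only: `metrics_ready` is a hypothetical helper, the "STARTED" value is an assumption about the healthy state name, and the JSON below is the actual status payload from comment 11:

```python
import json

def metrics_ready(status_payload):
    """Return True only when the status endpoint reports MetricsService STARTED."""
    return json.loads(status_payload).get("MetricsService") == "STARTED"

# Status payload actually returned in this bug:
failed = ('{"MetricsService":"FAILED",'
          '"Implementation-Version":"0.8.0.Final-redhat-1",'
          '"Built-From-Git-SHA1":"826f08dd34912ad455a4cb2b34f2e79cd79ace9a"}')
print(metrics_ready(failed))  # False, so a postStart-style check exits non-zero
```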
Comment 13 Christophe Augello 2016-02-12 09:09 EST
Created attachment 1123510 [details]
server.log

Comment 14 Christophe Augello 2016-02-12 09:10:14 EST
@matt: I've attached the log in BZ.
Comment 15 Matt Wringe 2016-02-16 11:49:44 EST
Can you provide the output of running the following on the Cassandra instance: 'nodetool cfstats hawkular_metrics'

Can you also provide the Cassandra logs?
Comment 17 chunchen 2016-03-02 03:49:02 EST
I met a similar issue; it failed to start the hawkular-cassandra pod, like below:

Tested latest images:
openshift3/metrics-hawkular-metrics     5c02894a36cd
openshift3/metrics-deployer     eddb89c5bd34
openshift3/metrics-cassandra     d01f8f782def
openshift3/metrics-heapster     341bad0bb73f

[chunchen@F17-CCY daily]$ oc get pvc
NAME                  STATUS    VOLUME                              CAPACITY   ACCESSMODES   AGE
metrics-cassandra-1   Bound     logging-elasticsearch-pv-vyneg2ue   5Gi        RWO           1h

[chunchen@F17-CCY daily]$ oc describe pod hawkular-cassandra-1-n417r
Name:		hawkular-cassandra-1-n417r
Namespace:	chunmetrics
Image(s):	brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-cassandra:latest
Node:		openshift-114.lab.sjc.redhat.com/10.14.6.114
Start Time:	Wed, 02 Mar 2016 15:41:15 +0800
Labels:		metrics-infra=hawkular-cassandra,name=hawkular-cassandra-1,type=hawkular-cassandra
Status:		Pending
Reason:		
Message:	
IP:		
Controllers:	ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:	
    Image:		brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-cassandra:latest
    Image ID:		
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --seed_provider_classname=org.hawkular.openshift.cassandra.OpenshiftSeedProvider
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
      --keystore_file=/secret/cassandra.keystore
      --keystore_password_file=/secret/cassandra.keystore.password
      --truststore_file=/secret/cassandra.truststore
      --truststore_password_file=/secret/cassandra.truststore.password
      --cassandra_pem_file=/secret/cassandra.pem
    QoS Tier:
      cpu:		BestEffort
      memory:		BestEffort
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Environment Variables:
      CASSANDRA_MASTER:	true
      POD_NAMESPACE:	chunmetrics (v1:metadata.namespace)
Conditions:
  Type		Status
  Ready 	False 
Volumes:
  cassandra-data:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	metrics-cassandra-1
    ReadOnly:	false
  hawkular-cassandra-secrets:
    Type:	Secret (a secret that should populate this volume)
    SecretName:	hawkular-cassandra-secrets
  cassandra-token-iozd5:
    Type:	Secret (a secret that should populate this volume)
    SecretName:	cassandra-token-iozd5
Events:
  FirstSeen	LastSeen	Count	From						SubobjectPath	Type		Reason		Message
  ---------	--------	-----	----						-------------	--------	------		-------
  1h		1h		1	{scheduler }									Scheduled	Successfully assigned hawkular-cassandra-1-n417r to openshift-114.lab.sjc.redhat.com
  1h		58m		29	{kubelet openshift-114.lab.sjc.redhat.com}					FailedMount	Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": failed to instantiate volume plugin for cassandra-data: The claim metrics-cassandra-1 is not yet bound to a volume
  1h		58m		29	{kubelet openshift-114.lab.sjc.redhat.com}					FailedSync	Error syncing pod, skipping: failed to instantiate volume plugin for cassandra-data: The claim metrics-cassandra-1 is not yet bound to a volume
  53m		53m		1	{kubelet openshift-114.lab.sjc.redhat.com}					FailedSync	Error syncing pod, skipping: failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: net/http: TLS handshake timeout
  53m		53m		1	{kubelet openshift-114.lab.sjc.redhat.com}					FailedMount	Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: net/http: TLS handshake timeout
  53m		53m		2	{kubelet openshift-114.lab.sjc.redhat.com}					FailedMount	Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: EOF
  53m		53m		2	{kubelet openshift-114.lab.sjc.redhat.com}					FailedSync	Error syncing pod, skipping: failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: EOF
  57m		48m		19	{kubelet openshift-114.lab.sjc.redhat.com}					FailedSync	Error syncing pod, skipping: exit status 32
  57m		48m		19	{kubelet openshift-114.lab.sjc.redhat.com}					FailedMount	Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": exit status 32
  47m		14m		70	{kubelet openshift-114.lab.sjc.redhat.com}					FailedMount	Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": exit status 32
  47m		14m		70	{kubelet openshift-114.lab.sjc.redhat.com}					FailedSync	Error syncing pod, skipping: exit status 32
  13m		11s		30	{kubelet openshift-114.lab.sjc.redhat.com}					FailedMount	Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": exit status 32
  13m		11s		30	{kubelet openshift-114.lab.sjc.redhat.com}					FailedSync	Error syncing pod, skipping: exit status 32
Comment 18 Matt Wringe 2016-03-02 09:28:03 EST
@chunchen: It appears that there is something wrong with your persistent volume setup:

Error syncing pod, skipping: failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: net/http: TLS handshake timeout

Unable to mount volumes for pod "hawkular-cassandra-1-n417r_chunmetrics": failed to instantiate volume plugin for cassandra-data: Get https://openshift-138.lab.sjc.redhat.com:8443/api/v1/namespaces/chunmetrics/persistentvolumeclaims/metrics-cassandra-1: EOF

Can you deploy any other pods which use persistent volumes on this system?

I also don't believe this is related to the original issue posted here. Could you please open a new bugzilla if you determine it is in fact related to metrics and not the persistent volume setup on that system?
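A quick way to sanity-check that every claim is bound before debugging further is to inspect the `oc get pvc` table. This is a parsing sketch against the column layout shown in comment 17; `pvc_all_bound` is a hypothetical helper, and in practice something like `oc get pvc -o jsonpath` would be more robust than splitting columns:

```python
def pvc_all_bound(oc_get_pvc_output):
    """Return True if every claim row in 'oc get pvc' output reports STATUS
    'Bound'. Assumes the NAME/STATUS/VOLUME/... column layout from comment 17."""
    rows = oc_get_pvc_output.strip().splitlines()[1:]  # skip the header row
    return all(row.split()[1] == "Bound" for row in rows)

# The output reported in comment 17, where the claim did eventually bind:
sample = """NAME                  STATUS    VOLUME                              CAPACITY   ACCESSMODES   AGE
metrics-cassandra-1   Bound     logging-elasticsearch-pv-vyneg2ue   5Gi        RWO           1h"""
print(pvc_all_bound(sample))  # True
```

The "claim metrics-cassandra-1 is not yet bound to a volume" events in the pod description correspond to the window where this check would have returned False.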
Comment 19 chunchen 2016-03-08 02:20:59 EST
@Matt Wringe: I tried with new images (metrics-hawkular:d1fe5a5605da) and a new persistent volume on the latest OSE env, but my issue is no longer reproduced.
Comment 21 Matt Wringe 2016-07-18 17:52:56 EDT
Is there anything else that needs to be done with this issue? Or can it be closed?
Comment 22 chunchen 2016-08-15 03:18:49 EDT
According to comment #19, marking it as verified.
