Description of problem: Each time we restart the heapster pod, metrics work again for a short while, but after that we again get nothing (empty metrics).

Scenario 1) --stats_resolution=5s
As you can see, our performance tool (Zabbix) reported a peak of 20 Mbit/s for a few minutes, then it stopped. We could observe this peak along with some metrics during those few minutes. After that we never got new metrics in the OpenShift console. We stopped our test after 10-15 minutes because no more metrics were available in the console.

Scenario 2) increase to --stats_resolution=30s
As you can see, our performance tool reported a peak of 100 Mbit/s for a few minutes, then it stopped. We could observe this peak along with some metrics during those few minutes. After that we never got new metrics in the OpenShift console. We stopped our test after 10-15 minutes because no more metrics were available in the console.

Scenario 3) increase to --stats_resolution=59s
As you can see, our performance tool reported a peak of 200 Mbit/s (twice as much as in scenario 2) for a few minutes, then it stopped. You can also see on the CPU screenshot that the CPUs are much more stressed and can hit the 8 vCPUs dedicated to the VM. This time we got more metrics than before, and CPU and network bandwidth remained around 200 Mbit/s and 6-8 vCPUs used. However, these metrics do not look right: there were a lot of gaps in the memory graph, and there was only one peak value in the CPU graph. After 30 minutes the (incomplete?) metrics continued to arrive. Note that if we compare the same graph with the time ranges "last 30 minutes" and "last 1 hour", the graphs look completely different. As shown in the files timerange_last_30m.jpg and timerange_last_1h.jpg, in the first one we can observe a lot of gaps (it looks like data is missing), but in the second graph everything looks fine.
Created attachment 1221133: Cassandra logs
The stats_resolution value dictates how often metrics are gathered. In 3.1 the default is 10s. Setting stats_resolution to 5s causes metrics to be gathered more often, which increases the load on the system. Setting stats_resolution above roughly 28-29s causes problems with the 30-minute interval in the console, since it's expecting data to be collected more often than that. This results in empty segments in the graph at the 30-minute view, but not at the 1-hour view or greater.
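A rough back-of-the-envelope sketch (not an official formula) of why the 30-minute view is sensitive to this: the fewer samples that land in the console's 30-minute (1800 s) window, the more likely the view is to show gaps.

```shell
# Count how many samples land in a 30-minute (1800 s) window
# for each of the stats_resolution values tried in this report.
for res in 5 10 30 59; do
  echo "stats_resolution=${res}s -> $((1800 / res)) samples per 30-minute view"
done
```

At 59s this leaves only ~30 samples per 30-minute window, which is consistent with the gaps reported in scenario 3 appearing at the 30-minute view but not the 1-hour view.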
If you want to use a hostPath setup, follow the steps below. Deploy metrics, but with 'USE_PERSISTENT_STORAGE=false', since we don't want it creating and waiting for a PVC.

# scale down the metric components:
oc scale rc heapster --replicas=0; oc scale rc hawkular-metrics --replicas=0; oc scale rc hawkular-cassandra-1 --replicas=0

# grant permissions to the 'cassandra' service account to allow it to use a hostPath:
oadm policy add-scc-to-user privileged system:serviceaccount:openshift-infra:cassandra

# update the hawkular-cassandra-1 template to specify that it needs privileged permission, since it wants to use a hostPath:
oc patch rc hawkular-cassandra-1 -p '{"spec":{"template":{"spec":{"containers":[{"name":"hawkular-cassandra-1","securityContext":{"privileged": true}}]}}}}'

# change the volume from emptyDir to a hostPath (where $CASSANDRA_DATA_DIRECTORY is the directory you want to store metrics in):
oc set volumes rc hawkular-cassandra-1 --add --overwrite --name=cassandra-data --type=hostPath --path=$CASSANDRA_DATA_DIRECTORY

# specify that the hawkular-cassandra-1 pod can only be deployed to a specific node:
oc patch rc hawkular-cassandra-1 -p '{"spec":{"template":{"spec":{"nodeSelector":{"${NODE_LABEL}":"${NODE_KEY}"}}}}}'

# bring the metric components back up:
oc scale rc heapster --replicas=1; oc scale rc hawkular-metrics --replicas=1; oc scale rc hawkular-cassandra-1 --replicas=1

Note: we will probably make this a more official option in a future release of OpenShift Metrics so that you only need to specify a few parameters to the deployer pod and it will take care of the rest.
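Before running the volume step above, it is worth verifying on the target node that the hostPath directory exists and is writable; a minimal sketch (the path below is a hypothetical example, substitute your own):

```shell
# Hypothetical example path; substitute the directory you plan to use on the node.
CASSANDRA_DATA_DIRECTORY=/tmp/cassandra-data-demo

# Create the directory if it does not exist yet.
mkdir -p "$CASSANDRA_DATA_DIRECTORY"

# Sanity-check that it exists and is writable before pointing the RC at it.
if [ -d "$CASSANDRA_DATA_DIRECTORY" ] && [ -w "$CASSANDRA_DATA_DIRECTORY" ]; then
  echo "hostPath ready: $CASSANDRA_DATA_DIRECTORY"
else
  echo "hostPath not usable: $CASSANDRA_DATA_DIRECTORY" >&2
fi
```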
I am closing this issue as 'CANTFIX'; the root cause is the speed of writes to the network-attached storage. The user can use other types of network-attached storage, or host volumes, which do not have this issue.