Bug 1365787

Summary:

Failed to start hawkular-metrics pod when using registry.ops

Product:

OpenShift Container Platform

Reporter:

chunchen <chunchen>

Component:

Hawkular

Assignee:

Troy Dawson <tdawson>

Status:

CLOSED ERRATA

QA Contact:

chunchen <chunchen>

Severity:

medium

Priority:

medium

Version:

3.3.0

CC:

aos-bugs, chunchen, penli, pweil, wsun, xtian

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2016-09-27 09:43:29 UTC

Type:

Bug

Attachments:

Description	Flags
hawkular metrics pod log	none

Description chunchen 2016-08-10 08:34:21 UTC

Description of problem:
The hawkular-metrics pod fails to start when using images from registry.ops, but it works well when using images from registry.qe. The following error appears:

"Failed to start service jboss.deployment.unit."activemq-rar.rar".STRUCTURE: org.jboss.msc.service.StartException in service jboss.deployment.unit."activemq-rar.rar".STRUCTURE: Failed to start service"

Version-Release number of selected component (if applicable):
openshift v3.3.0.17
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-46.el7.10.x86_64
 Go version:      go1.6.2
 Git commit:      2a93377-unsupported
 Built:           Fri Jul 29 13:45:25 2016
 OS/Arch:         linux/amd64

registry.ops.../openshift3/metrics-hawkular-metrics   3.3.0               3ab97c3f7395
registry.ops.../openshift3/metrics-cassandra          3.3.0               f460976d4f99
registry.ops.../openshift3/metrics-deployer           3.3.0               91a831b58627
registry.ops.../openshift3/metrics-heapster           3.3.0               179e3ed5c3b2

How reproducible:
Always

Steps to Reproduce:
1. Log into OpenShift server
2. Deploy the metrics stack using images from the Ops registry (registry.ops...com)
$ oc create serviceaccount metrics-deployer

$ oadm policy add-cluster-role-to-user cluster-reader system:serviceaccount:openshift-infra:heapster

$ oc policy add-role-to-user edit system:serviceaccount:openshift-infra:metrics-deployer

$ oc secrets new metrics-deployer nothing=/dev/null

$ oc new-app metrics-deployer-template -p IMAGE_PREFIX=registry.ops...com/openshift3/,IMAGE_VERSION=3.3.0,CASSANDRA_PV_SIZE=10Gi,CASSANDRA_NODES=1,MASTER_URL=https://ec2-...com:443,HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.0810-2ui.qe.rhcloud.com,USE_PERSISTENT_STORAGE=false,USER_WRITE_ACCESS=true

3. Check the hawkular-metrics pod status and logs
$ oc get pods
$ oc logs hawkular-metrics-5smf8

Actual results:
[chunchen@F17-CCY daily]$ oc get pod
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-7fs9o   1/1       Running            0          1h
hawkular-metrics-5smf8       0/1       CrashLoopBackOff   16         1h
heapster-jf9ev               0/1       Running            12         1h
metrics-deployer-g50hu       1/1       Running            0          1h

The hawkular-metrics pod log is attached.
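
As a rough illustration of step 3, the pod listing above can be checked programmatically. This is only a sketch: the helper name `unhealthy_pods` is hypothetical, and it assumes the plain-text column layout of `oc get pod` shown in this report (NAME, READY, STATUS as the first three columns).

```python
from typing import List

# Sample taken from the "Actual results" listing in this report.
SAMPLE = """\
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-7fs9o   1/1       Running            0          1h
hawkular-metrics-5smf8       0/1       CrashLoopBackOff   16         1h
heapster-jf9ev               0/1       Running            12         1h
metrics-deployer-g50hu       1/1       Running            0          1h
"""


def unhealthy_pods(oc_get_pods_output: str) -> List[str]:
    """Return names of pods that are not Running or not fully ready."""
    lines = [ln for ln in oc_get_pods_output.splitlines() if ln.strip()]
    flagged = []
    for line in lines[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for pod in unhealthy_pods(SAMPLE):
        print(pod)
```

On the sample above, this flags both hawkular-metrics-5smf8 (CrashLoopBackOff) and heapster-jf9ev (Running but 0/1 ready).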

Expected results:
The hawkular-metrics pod should start up successfully when using images from the Ops registry.

Additional info:

Comment 1 chunchen 2016-08-10 08:35:41 UTC

Created attachment 1189496 [details]
hawkular metrics pod log

Comment 2 Matt Wringe 2016-08-10 20:46:35 UTC

I can't reproduce this. I just deployed the 3.3.0 images from registry.ops and it all works.

Is this something you can reproduce? Does this happen consistently?

Comment 3 chunchen 2016-08-11 03:56:46 UTC

It happens consistently in my testing. Could you try to deploy on an OSE containerized installation?

Comment 4 chunchen 2016-08-11 09:02:07 UTC

I also tried the metrics deployment against an OSE RPM installation, and it works well in that environment.

Comment 5 Matt Wringe 2016-08-11 13:01:12 UTC

What exactly do you mean by an 'OSE containerized installation'?

Looking more closely at the logs, it appears the problem is that something is killing the pod while it is starting. Can you check and see what is listed under events for the Hawkular pod?

Comment 6 chunchen 2016-08-12 03:44:04 UTC

"OSE containerized installation" means installing the OSE environment via the containerized method.

The hawkular pod events are as follows:

Events:
  FirstSeen	LastSeen	Count	From					SubobjectPath				Type		Reason		Message
  ---------	--------	-----	----					-------------				--------	------		-------
  12m		12m		1	{default-scheduler }								Normal		Scheduled	Successfully assigned hawkular-metrics-xhnnb to ip-172-18-4-201.ec2.internal
  12m		12m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Pulling		pulling image "registry.ops.openshift.com/openshift3/metrics-hawkular-metrics:3.3.0"
  7m		7m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Pulled		Successfully pulled image "registry.ops.openshift.com/openshift3/metrics-hawkular-metrics:3.3.0"
  7m		7m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Created		Created container with docker id 13f981cc5473
  7m		7m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Started		Started container with docker id 13f981cc5473
  4m		4m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Killing		Killing container with docker id 13f981cc5473: pod "hawkular-metrics-xhnnb_openshift-infra(30ed96c9-603d-11e6-9d1a-0eeb7993154f)" container "hawkular-metrics" is unhealthy, it will be killed and re-created.
  4m		4m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Created		Created container with docker id 891bea9986ca
  4m		4m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Started		Started container with docker id 891bea9986ca
  4m		4m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Killing		Killing container with docker id 891bea9986ca: pod "hawkular-metrics-xhnnb_openshift-infra(30ed96c9-603d-11e6-9d1a-0eeb7993154f)" container "hawkular-metrics" is unhealthy, it will be killed and re-created.
  4m		4m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Created		Created container with docker id d336a478d216
  4m		4m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Started		Started container with docker id d336a478d216
  3m		3m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Killing		Killing container with docker id d336a478d216: pod "hawkular-metrics-xhnnb_openshift-infra(30ed96c9-603d-11e6-9d1a-0eeb7993154f)" container "hawkular-metrics" is unhealthy, it will be killed and re-created.
  3m		3m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Created		Created container with docker id c491cbb597ee
  3m		3m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Started		Started container with docker id c491cbb597ee
  3m		3m		1	{kubelet ip-172-18-4-201.ec2.internal}						Warning		FailedSync	Error syncing pod, skipping: Error response from daemon: devmapper: Unknown device 88ada34788941c910b494d4587d21c4dd2315ce2d47e2a238f7d7ae903ceecf0
  2m		2m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Killing		Killing container with docker id c491cbb597ee: pod "hawkular-metrics-xhnnb_openshift-infra(30ed96c9-603d-11e6-9d1a-0eeb7993154f)" container "hawkular-metrics" is unhealthy, it will be killed and re-created.
  2m		2m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Created		Created container with docker id f8188adde671
  2m		2m		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Started		Started container with docker id f8188adde671
  5m		14s		9	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Warning		Unhealthy	Liveness probe failed: 
  4m		13s		5	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Pulled		Container image "registry.ops.openshift.com/openshift3/metrics-hawkular-metrics:3.3.0" already present on machine
  13s		13s		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Killing		Killing container with docker id f8188adde671: pod "hawkular-metrics-xhnnb_openshift-infra(30ed96c9-603d-11e6-9d1a-0eeb7993154f)" container "hawkular-metrics" is unhealthy, it will be killed and re-created.
  11s		11s		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Created		Created container with docker id 70a9683be499
  10s		10s		1	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Normal		Started		Started container with docker id 70a9683be499
  7m		<invalid>	43	{kubelet ip-172-18-4-201.ec2.internal}	spec.containers{hawkular-metrics}	Warning		Unhealthy	Readiness probe failed:

Comment 7 Matt Wringe 2016-08-15 21:03:42 UTC

""OSE containerized installation" means installing the OSE environment via the containerized method."

Can you please be specific about how you are installing OSE? Do you just mean you are running OpenShift itself in a docker container? Or running it by some other means? What exact steps are you following?

The issue here is that the Hawkular Metrics pod is being killed. The message in the logs about the "activemq-rar.rar" can be completely ignored; that error message appears because the rar is being killed while it is starting (see "*** JBossAS process (160) received TERM signal ***" in the logs right above it).

Why it's being killed is the problem. It looks like the liveness probe has failed: "container "hawkular-metrics" is unhealthy, it will be killed and re-created"

But the liveness probe should only fail in two situations:

1) The Hawkular Metrics service status is 'FAILED' (https://github.com/openshift/origin-metrics/blob/master/hawkular-metrics/hawkular-metrics-liveness.py#L41). This isn't the case here because the Hawkular Metrics war hasn't even started yet.

2) More than 3 minutes have passed since the start of the metrics startup and the Hawkular Metrics service status is still not 'STARTED' (https://github.com/openshift/origin-metrics/blob/master/hawkular-metrics/hawkular-metrics-liveness.py#L50). From the logs, this shouldn't be the case either.
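
For illustration, the two conditions above can be sketched roughly as follows. This is a simplified sketch of the decision logic only, not the actual hawkular-metrics-liveness.py script; the function name `is_alive`, its parameters, and the constant name are assumptions made for the example. Only the 'FAILED' and 'STARTED' status values and the 3-minute limit come from the description above.

```python
# Assumed constant: "more than 3 minutes" from the description above.
STARTUP_TIMEOUT_SECONDS = 180


def is_alive(status: str, seconds_since_start: float,
             timeout: float = STARTUP_TIMEOUT_SECONDS) -> bool:
    """Return False only in the two failure situations listed above."""
    # Situation 1: the service reports that it has failed.
    if status == "FAILED":
        return False
    # Situation 2: startup has taken too long and the service
    # has still not reached the 'STARTED' state.
    if status != "STARTED" and seconds_since_start > timeout:
        return False
    # Otherwise the pod is either started or still within its startup window.
    return True
```

Under this logic, a pod whose status is neither 'FAILED' nor 'STARTED' should still pass the probe during its first 3 minutes, which is why the early kills seen in the events are unexpected.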

If you run the non-ops images in this containerized environment, does it still fail? Is there any more information in the OpenShift logs (not the container logs) about why this is failing?

I need an environment where this issue can be reproduced; otherwise there is not much more I can do with this issue based on the logs alone.

I have also opened https://bugzilla.redhat.com/show_bug.cgi?id=1367204 because the event logs really should be showing the reason for the failure.

Comment 12 chunchen 2016-08-18 05:39:22 UTC

Changing the status to MODIFIED since the latest images have not yet been synced to the OPS registry.

Comment 14 chunchen 2016-08-19 03:48:53 UTC

The issue is also reproduced in an OSE RPM installation environment.

Comment 15 chunchen 2016-08-19 06:06:55 UTC

Could you help check whether the metrics images in the OPS registry were synced and built correctly?

Comment 16 Troy Dawson 2016-08-19 21:06:57 UTC

mwringe has built new images.
New images have been synced to registry.ops.

Comment 18 chunchen 2016-08-22 05:49:26 UTC

It's fixed. I checked with the latest metrics images; the test results are below:

[root@ip-172-18-12-152 ~]# oc get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-sjjyo   1/1       Running   0          9m
hawkular-metrics-i7tti       1/1       Running   2          9m
heapster-erb9k               1/1       Running   2          9m

[root@ip-172-18-12-152 ~]# oc describe pod hawkular-metrics-i7tti
Name:			hawkular-metrics-i7tti
Namespace:		openshift-infra
Security Policy:	restricted
Node:			ip-172-18-0-250.ec2.internal/172.18.0.250
Start Time:		Mon, 22 Aug 2016 01:34:53 -0400
Labels:			metrics-infra=hawkular-metrics
			name=hawkular-metrics
Status:			Running
IP:			10.1.0.5
Controllers:		ReplicationController/hawkular-metrics
Containers:
  hawkular-metrics:
    Container ID:	docker://b7b957858e2cc95d3cd78aa3d84d991fa40b05d158557d52bcdf9b143ce1573f
    Image:		registry.ops.openshift.com/openshift3/metrics-hawkular-metrics:3.3.0
    Image ID:		docker://sha256:cd137686f61ef443d9319d9b7568b7609dda198e401d4e7324585d1a26fe5496
<----------snip------------>

Comment 20 errata-xmlrpc 2016-09-27 09:43:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933