Bug 1368095 - Failure to deploy router as part of OSE
Summary: Failure to deploy router as part of OSE
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Scott Dodson
QA Contact: Johnny Liu
URL:
Whiteboard: container
Depends On:
Blocks:
Reported: 2016-08-18 11:57 UTC by Pavel Zagalsky
Modified: 2016-09-12 12:31 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-31 12:39:15 UTC
Target Upstream Version:


Attachments
HawkularCrash (34.23 KB, text/plain), 2016-08-18 11:57 UTC, Pavel Zagalsky
CassandraLog (444.91 KB, text/plain), 2016-08-18 14:34 UTC, Pavel Zagalsky

Description Pavel Zagalsky 2016-08-18 11:57:03 UTC
Created attachment 1191859 [details]
HawkularCrash

Description of problem:
Hawkular metrics crashes upon start

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Hawkular crashes immediately upon start
Please check in my Master:
10.35.161.112


Actual results:
Hawkular Metrics crashes on start

Expected results:
Hawkular should be running and gathering metrics

Additional info:
Crash log attached

Comment 1 Matt Wringe 2016-08-18 14:04:34 UTC
Can you give a bit more information about your setup?

How was metrics installed? What secrets did you use, and what deployment options were used?

The message in Hawkular Metrics is about not being able to perform a write to Cassandra. Can you please attach the Cassandra logs to this BZ?

Comment 2 Pavel Zagalsky 2016-08-18 14:14:58 UTC
I used this MOJO doc to install metrics:
https://mojo.redhat.com/docs/DOC-1060820
How can I get the Cassandra logs? Do you have a command?

Comment 3 Matt Wringe 2016-08-18 14:25:00 UTC
It's just 'oc logs ${POD_NAME}'

Or if you want a copy and paste command:
oc logs $(oc get pods | grep -i hawkular-cassandra | awk '{print $1}')
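
For scripted use, the same selection can be done on captured `oc get pods` output. This is a minimal Python sketch, not part of the original comment: the function name `matching_pods` is hypothetical, and it only approximates the `grep -i ... | awk '{print $1}'` pipeline above (it also skips the header row).

```python
import re


def matching_pods(listing, pattern):
    """Return pod names (first column) from `oc get pods` output for
    lines that match `pattern`, ignoring case."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/READY/... header row.
        if not fields or fields[0] == "NAME":
            continue
        if re.search(pattern, line, re.IGNORECASE):
            names.append(fields[0])
    return names
```

The one-liner above assumes exactly one pod matches. With the list returned here, each name can be passed to `oc logs` separately if several pods match.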

Comment 4 Pavel Zagalsky 2016-08-18 14:34:25 UTC
Created attachment 1191886 [details]
CassandraLog

Comment 5 Pavel Zagalsky 2016-08-18 14:34:50 UTC
Log attached ^

Comment 6 John Sanda 2016-08-18 16:25:25 UTC
Did the crash happen multiple times?

Comment 7 Pavel Zagalsky 2016-08-18 17:11:54 UTC
Hard to tell. The log file does not show dates, only hours:minutes.

Comment 8 Matt Wringe 2016-08-18 18:00:38 UTC
'oc get pods' should indicate the number of times the pod has been restarted

Comment 9 Pavel Zagalsky 2016-08-19 11:19:03 UTC
hawkular-cassandra-1-nj2w6   1/1       Running   5          75d
hawkular-metrics-32epb       1/1       Running   6          50d
heapster-i9wfy               1/1       Running   5          37d
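
In the listing above the fourth column is the restart count. As a rough illustration, the counts can be pulled out of such output with a few lines of Python. The function name `restart_counts` is hypothetical, and the sketch assumes the column order NAME, READY, STATUS, RESTARTS, AGE.

```python
def restart_counts(listing):
    """Map pod name -> restart count from `oc get pods` output.

    Assumes the column order NAME READY STATUS RESTARTS AGE.
    """
    counts = {}
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row, if present.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        counts[fields[0]] = int(fields[3])
    return counts
```

A non-zero count only says that the container was restarted at some point; it does not say when.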

Comment 10 Matt Wringe 2016-08-19 20:44:22 UTC
The attached Hawkular Metrics logs are not the logs of a component that crashed at startup. They cover multiple hours.

Can you please attach the logs for the container which has crashed?

"oc logs -p $POD_NAME" will return the logs for the pod which has just crashed (-p returns the logs of the previous container)

Comment 11 Matt Wringe 2016-08-19 20:48:32 UTC
Also, according to https://bugzilla.redhat.com/show_bug.cgi?id=1368095#c9, Hawkular Metrics (and all the other metrics components) are fully running and ready, which could not happen if the pod were constantly crashing at startup.

Can you please explain what the exact issue is that you are seeing? The original description of the issue does not match the logs or information you have since presented.

Comment 12 Matt Wringe 2016-08-19 20:51:17 UTC
Lowering the priority since https://bugzilla.redhat.com/show_bug.cgi?id=1368095#c9 shows that the pods are up and running and are not constantly crashing.

Comment 13 Nelly Credi 2016-08-29 10:46:30 UTC
We see it in the automation environments as well, on both OSE 3.2 and 3.3.
The problem is not the metrics but router-1-deploy, which stays in "ContainerCreating" forever.
If you try to deploy metrics while it is in this state, router-1-deploy switches to "Error" and its logs say that it did not succeed within 600 seconds.
Metrics will then fail to deploy (obviously), but the root cause is the router-1 deployment.

Comment 14 Ben Bennett 2016-08-29 14:36:34 UTC
I suspect that this is due to the certificate generation that the service is requesting.  If OpenShift is not configured to generate certs then you get this error when deploying a router.

This comment has information on what needs to be present in the config:
  https://bugzilla.redhat.com/show_bug.cgi?id=1349144#c19

The ansible installer should be enabling the feature:
  https://github.com/openshift/openshift-ansible/issues/2345
  https://github.com/openshift/openshift-ansible/pull/2358

So, please make sure that the ansible installer you used had the fix.  Or, if you did not use the ansible installer, can you post how you did the install?

Comment 15 Pavel Zagalsky 2016-08-30 07:53:01 UTC
I checked the /etc/origin/master/ path and there are no
service-signer.crt and service-signer.key files there.
How do I obtain them?
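
The check described in this comment can be scripted. Below is a minimal Python sketch, assuming the directory and the two file names quoted above; the helper name `missing_signer_files` is hypothetical. It only reports which files are absent and does not create them.

```python
from pathlib import Path

# File names as quoted in this comment.
SIGNER_FILES = ("service-signer.crt", "service-signer.key")


def missing_signer_files(master_dir="/etc/origin/master"):
    """Return the service signer files that are absent from master_dir."""
    base = Path(master_dir)
    return [name for name in SIGNER_FILES if not (base / name).is_file()]
```

An empty list means both files are present in the given directory.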

Comment 16 Nelly Credi 2016-08-30 08:10:40 UTC
For the automation it seems that this fix is working,
but now the metrics deployment is failing:


[cloud-user@ose3-master ~]$ oc get pods --all-namespaces
NAMESPACE         NAME                      READY     STATUS             RESTARTS   AGE
default           docker-registry-1-yqxxa   1/1       Running            0          12m
default           router-1-bpnys            1/1       Running            0          12m
openshift-infra   metrics-deployer-7mpmf    0/1       ImagePullBackOff   0          12m


This is on 3.3 (I didn't check 3.2 yet), working from puddles.
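
For automation runs, a listing like the one above can be filtered for pods that are not fully ready. This is a minimal Python sketch under stated assumptions: the function name `not_ready_pods` is hypothetical, and the column order NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE is taken from the `oc get pods --all-namespaces` output shown in this comment.

```python
def not_ready_pods(listing):
    """Return (namespace, name, status) for pods that are not fully ready.

    Assumes `oc get pods --all-namespaces` columns:
    NAMESPACE NAME READY STATUS RESTARTS AGE.
    """
    problems = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have a READY value such as "1/1"; this skips the
        # header row, shell prompts, and blank lines.
        if len(fields) < 6 or "/" not in fields[2]:
            continue
        namespace, name, ready, status = fields[:4]
        ready_now, _, wanted = ready.partition("/")
        if status != "Running" or ready_now != wanted:
            problems.append((namespace, name, status))
    return problems
```

Pods in states such as ContainerCreating, Error, or ImagePullBackOff are reported. Pods that finished normally (status Completed) would be reported as well, so a real check may want to exclude them.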

Comment 17 Nelly Credi 2016-08-30 09:18:55 UTC
For 3.2 it doesn't look like the router issue was fixed.
Was this patch merged into the 3.2 branch as well?

Comment 18 Brenton Leanhardt 2016-08-30 13:29:08 UTC
As far as I know automatic certificate creation was never proposed for backporting.  We'd probably need someone from Ben's team to comment on the feasibility.  I know they did a lot of work to make the upgrade go smoothly so I think the solution is for the customer to upgrade.

Comment 19 Dafna Ron 2016-08-30 13:49:13 UTC
This is blocking metrics for us. Added appropriate keywords and flags.

Comment 21 Nelly Credi 2016-08-30 13:59:47 UTC
Adding more info:
My 3.2 is stuck as I described in comment #13,
working with registry brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
and using image openshift3/ose-deployer:v3.2.1.15

For the metrics I'll open a different bug.

Comment 22 Brenton Leanhardt 2016-08-30 14:06:02 UTC
Hi Dafna/Nelly,

There's a lot of confusion in this bugzilla.  Can you please state the versions of the ansible playbooks that you are using?

From what I can tell the issue is resolved on OCP 3.3.  It's not clear to me which versions of metrics, OCP and openshift-ansible are in use.

Comment 23 Pavel Zagalsky 2016-08-30 14:29:28 UTC
Brenton, I used the playbooks that can be found in this Mojo:
https://mojo.redhat.com/docs/DOC-1060820

Comment 24 Nelly Credi 2016-08-30 14:33:45 UTC
I opened a new bug for the metrics issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1371578
so let's leave that aside.
At this point the 3.3 router is deployed; the 3.2 router is not.
I am still seeing the same 'ContainerCreating' state,

working with registry brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
and using image openshift3/ose-deployer:v3.2.1.15

Comment 25 Scott Dodson 2016-08-30 18:57:27 UTC
(In reply to Nelly Credi from comment #24)
> openshift3/ose-deployer:v3.2.1.15

This image was just built today, so that's probably the reason it's failing. It should be there now.

Comment 26 Nelly Credi 2016-08-31 08:11:10 UTC
Thanks Scott,
it's working in the automation.
From my side this issue is resolved.

Comment 27 Ben Bennett 2016-09-12 12:31:51 UTC
Sorry, missed the question since the issue was resolved.  We would not be able to backport easily.  The way we implemented the default certs relies upon features only present in 3.3.

