Bug 1369646 - Encounter "503 Service Unavailable" while accessing Kibana OPS UI after upgrading from logging 3.2.0 to 3.3.0
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Luke Meyer
QA Contact: chunchen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-24 05:31 UTC by Xia Zhao
Modified: 2017-03-08 18:26 UTC (History)
CC: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-27 09:45:39 UTC
Target Upstream Version:


Attachments (Terms of Use)
Upgrade_pod_log (177.12 KB, text/plain)
2016-08-24 05:41 UTC, Xia Zhao
no flags Details
OPS UI Kibana screenshot which is running fine (79.19 KB, image/png)
2016-08-24 05:41 UTC, Xia Zhao
no flags Details
Non-OPS Kibana UI screenshot where the bug reproduces (206.37 KB, image/png)
2016-08-24 05:42 UTC, Xia Zhao
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1933 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.3 Release Advisory 2016-09-27 13:24:36 UTC

Description Xia Zhao 2016-08-24 05:31:38 UTC
Description of problem:
After upgrading logging from 3.2.0 to 3.3.0 with ENABLE_OPS_CLUSTER=true, and with the upgrade pod finishing successfully, accessing the Kibana OPS UI returns "503 Service Unavailable". The non-OPS UI worked fine.

Version-Release number of selected component (if applicable):
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd            3.3.0               4d87d421e950        5 days ago          238.7 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy         3.3.0               196ecb30fc93        2 weeks ago         229.2 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch      3.3.0               e71d2b04669c        4 weeks ago         426.9 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer           3.3.0               1c127f4f36a0        4 weeks ago         747.9 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator            3.3.0               2c88e1273c11        4 weeks ago         253.8 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-kibana             3.3.0               32d276bb46ae        8 weeks ago 

How reproducible:
Always

Steps to Reproduce:

0. Deploy 3.2.0 logging systems ( with OPS cluster enabled) :
IMAGE_PREFIX = brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/
IMAGE_VERSION = 3.2.0
ENABLE_OPS_CLUSTER=true

Make sure the EFK pods are running fine and the Kibana and Kibana OPS UIs are accessible and functional.

1. Add yourself to cluster-admin
$ oadm policy add-cluster-role-to-user cluster-admin xiazhao@redhat.com

2. Delete the existing templates if they exist
$ oc delete template logging-deployer-account-template logging-deployer-template
Error from server: templates "logging-deployer-account-template" not found
Error from server: templates "logging-deployer-template" not found

3. Create missing templates according to doc https://github.com/openshift/origin-aggregated-logging/tree/master/deployer#create-missing-templates:

$ oc create -f https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/deployer/deployer.yaml
template "logging-deployer-account-template" created
template "logging-deployer-template" created

Modify the deployer template to use the new image name for the 3.3.0 deployer:
$ oc edit template logging-deployer-template -o yaml
changed from
 image: ${IMAGE_PREFIX}logging-deployment:${IMAGE_VERSION}
into 
 image: ${IMAGE_PREFIX}logging-deployer:${IMAGE_VERSION}
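The same rename can be scripted instead of done interactively in `oc edit`. A minimal sketch, demonstrated here on a throwaway copy of the affected template line; in practice the same `sed` would run against the YAML exported with `oc get template logging-deployer-template -o yaml` before re-applying it:

```shell
# Demo of the image-name swap on a throwaway file standing in for the
# exported template YAML (hypothetical local path, not from this report).
printf 'image: ${IMAGE_PREFIX}logging-deployment:${IMAGE_VERSION}\n' > /tmp/deployer-template.yaml
sed -i 's/logging-deployment:/logging-deployer:/' /tmp/deployer-template.yaml
cat /tmp/deployer-template.yaml
```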

4. Create SA and permissions according to doc https://github.com/openshift/origin-aggregated-logging/tree/master/deployer#create-supporting-serviceaccount-and-permissions :

$ oc new-app logging-deployer-account-template
--> Deploying template logging-deployer-account-template for "logging-deployer-account-template"
--> Creating resources ...
    error: serviceaccounts "logging-deployer" already exists
    error: serviceaccounts "aggregated-logging-kibana" already exists
    error: serviceaccounts "aggregated-logging-elasticsearch" already exists
    error: serviceaccounts "aggregated-logging-fluentd" already exists
    serviceaccount "aggregated-logging-curator" created
    clusterrole "oauth-editor" created
    clusterrole "daemonset-admin" created
    rolebinding "logging-deployer-edit-role" created
    rolebinding "logging-deployer-dsadmin-role" created

$ oc policy add-role-to-user edit --serviceaccount logging-deployer
$ oc policy add-role-to-user daemonset-admin --serviceaccount logging-deployer
$ oadm policy add-cluster-role-to-user oauth-editor system:serviceaccount:logging:logging-deployer

5. Run the logging deployer with MODE=upgrade, IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ and IMAGE_VERSION=3.3.0:

$ oc process logging-deployer-template -v\
ENABLE_OPS_CLUSTER=true,\
IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/,\
KIBANA_HOSTNAME=kibana.0822-is1.qe.rhcloud.com,\
KIBANA_OPS_HOSTNAME=kibana-ops.0822-is1.qe.rhcloud.com,\
PUBLIC_MASTER_URL=https://host-8-172-89.host.centralci.eng.rdu2.redhat.com:8443,\
ES_INSTANCE_RAM=1024M,\
ES_CLUSTER_SIZE=1,\
MODE=upgrade,\
IMAGE_VERSION=3.3.0,\
MASTER_URL=https://host-8-172-89.host.centralci.eng.rdu2.redhat.com:8443\
|oc create -f -

6. Check logging pods after upgrade:
# oc get po
NAME                              READY     STATUS             RESTARTS   AGE
logging-curator-1-9zs7s           1/1       Running            0          6m
logging-curator-ops-1-n6f8r       1/1       Running            0          6m
logging-deployer-tw2e2            0/1       Completed          0          8m
logging-es-be6nb8x3-3-0zh3g       1/1       Running            0          6m
logging-es-ops-ht5m08g3-3-vir1b   1/1       Running            0          6m
logging-fluentd-0grmx             1/1       Running            0          6m
logging-fluentd-krx4v             1/1       Running            0          6m
logging-kibana-2-eybgx            2/2       Running            0          5m
logging-kibana-ops-2-owgru        2/2       Running            0          5m

7. Visit Kibana and Kibana OPS UI

Actual results:
Accessing the Kibana OPS UI returns "503 Service Unavailable". The non-OPS UI works fine.

Expected results:
Kibana OPS UI should work fine post upgrade

Additional info: 
Screenshots attached
Upgrade pod log attached

Comment 1 Xia Zhao 2016-08-24 05:41:13 UTC
Created attachment 1193472 [details]
Upgrade_pod_log

Comment 2 Xia Zhao 2016-08-24 05:41:56 UTC
Created attachment 1193473 [details]
OPS UI Kibana screenshot which is running fine

Comment 3 Xia Zhao 2016-08-24 05:42:42 UTC
Created attachment 1193474 [details]
Non-OPS Kibana UI screenshot where the bug reproduces

Comment 14 Luke Meyer 2016-08-30 01:16:01 UTC
It is a cert problem as Paul said; the problem seems to be that the logging-kibana-proxy and logging-kibana-ops-proxy secrets get different server certs (which is right) signed by different signers, which should not happen, as every cert in the deployment should have the same signer. So the routes should have been right (having the same CA for both), but the server cert on the kibana-ops instance is wrong. I need to figure out if that's something new or we just never noticed before we had an upgrade creating a reencrypt route.
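One way to verify a signer mismatch like this is to compare the issuer field of the two server certs. A local demonstration with two throwaway self-signed certs as stand-ins; in the real case the certs would be extracted from the logging-kibana-proxy and logging-kibana-ops-proxy secrets:

```shell
# Two throwaway self-signed certs with different CNs stand in for certs
# signed by different signers; mismatched issuers confirm the problem.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/a.key -out /tmp/a.crt \
    -days 1 -subj "/CN=signer-one" 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/b.key -out /tmp/b.crt \
    -days 1 -subj "/CN=signer-two" 2>/dev/null
# In a healthy deployment both issuer lines would match.
openssl x509 -noout -issuer -in /tmp/a.crt
openssl x509 -noout -issuer -in /tmp/b.crt
```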

Comment 15 Xia Zhao 2016-08-30 01:25:57 UTC
(In reply to Luke Meyer from comment #14)
> It is a cert problem as Paul said; the problem seems to be that
> logging-kibana-proxy and logging-kibana-ops-proxy secrets get different
> server certs (which is right) signed by different signers, which should not
> happen as every cert in the deployment should have the same signer. So, the
> routes should have been right (having the same CA for both), but the server
> cert on the kibana-ops instance is wrong. I need to figure out if that's
> something new or we just never noticed before we had and upgrade creating a
> reencrypt route.

Thanks for the info Luke. I'll keep the test env in comment #12 until you're finished using it.

Comment 16 Luke Meyer 2016-08-30 16:49:38 UTC
The problem is that in OSE 3.2, kibana and kibana-ops pods were created with separate secrets (though they had the same contents) and in 3.3 they are both created to use the same secret, logging-kibana-proxy. The logging-kibana-ops-proxy secret from the 3.2 installation is left unaltered by the upgrade, as is the kibana-ops DC secret volume mount, while all the other secrets are regenerated with a new signer. The routes are replaced with reencrypt routes looking for the new signer, so the kibana-ops cert isn't trusted.

I need to fix the upgrade so that it deletes the old secret and patches the kibana-ops DC to look at the right one.
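A rough manual equivalent of that fix, as an untested sketch; the volume name is an assumption, since it is not given in this report:

```shell
# Untested sketch of the manual fix described above: remove the stale 3.2
# secret and point the kibana-ops DC at the shared logging-kibana-proxy secret.
# The volume name "kibana-proxy" is an assumption, not taken from this report.
oc delete secret logging-kibana-ops-proxy
oc volume dc/logging-kibana-ops --add --overwrite --type=secret \
    --name=kibana-proxy --secret-name=logging-kibana-proxy
oc deploy logging-kibana-ops --latest   # redeploy so the pod mounts the new secret
```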

Comment 18 Xia Zhao 2016-08-31 12:56:40 UTC
Retested with the latest 3.3.0 logging images on brew; the kibana ops pod did not start up successfully after the upgrade:

$ oc get po
NAME                              READY     STATUS              RESTARTS   AGE
logging-curator-1-2j802           1/1       Running             0          2h
logging-curator-ops-1-i219t       1/1       Running             0          2h
logging-deployer-zczlp            0/1       Error               0          2h
logging-es-60qdpasn-3-8grsp       1/1       Running             0          2h
logging-es-ops-pvesokep-3-cz3e1   1/1       Running             0          2h
logging-fluentd-seosv             1/1       Running             0          2h
logging-kibana-2-8068e            2/2       Running             0          2h
logging-kibana-ops-2-dapa2        0/2       ContainerCreating   0          2h


And the upgrade deployer pod failed with this error:

+++ oc get pod logging-kibana-ops-2-dapa2 -o 'jsonpath={.status.phase}'
++ [[ Running == \P\e\n\d\i\n\g ]]
+ sleep 1
+ (( i++  ))
+ (( i<=300 ))
+ eval '[[ "Running" == "$(oc get pod logging-kibana-ops-2-dapa2 -o jsonpath='\''{.status.phase}'\'')" ]]'
+++ oc get pod logging-kibana-ops-2-dapa2 -o 'jsonpath={.status.phase}'
++ [[ Running == \P\e\n\d\i\n\g ]]
+ sleep 1
logging-kibana-ops-2-dapa2 not started within 300 seconds
+ (( i++  ))
+ (( i<=300 ))
+ return 1
+ echo 'logging-kibana-ops-2-dapa2 not started within 300 seconds'
+ return 1


I will retry and update later.
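The wait shown in the trace above is a simple polling loop. A runnable sketch, with get_phase as a stub standing in for the real `oc get pod <name> -o 'jsonpath={.status.phase}'` lookup; a pod stuck in ContainerCreating never satisfies the check, which is why the deployer timed out here:

```shell
# Minimal sketch of the deployer's readiness poll: retry until the pod
# phase is Running or the attempt budget is exhausted.
get_phase() { echo "Running"; }   # stub for the real `oc get pod` jsonpath lookup

wait_for_running() (
  i=1
  while [ "$i" -le 300 ]; do
    [ "$(get_phase)" = "Running" ] && return 0
    sleep 1
    i=$((i + 1))
  done
  echo "pod not started within 300 seconds" >&2
  return 1
)

wait_for_running && echo "pod is Running"
```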

Comment 19 Luke Meyer 2016-08-31 14:07:03 UTC
I think I need to redeploy the kibana-ops DC after modifying it; it's probably looking for a secret that no longer exists. Built logging-deployer:3.3.0-9.

Comment 20 openshift-github-bot 2016-08-31 21:35:10 UTC
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/221beecd5920f3f76d3694623d46faa6f372366e
origin fix for bug 1369646

Make the upgrade set the correct secret volume on the logging-kibana-ops
DC; in earlier versions, it got a separate-but-equal secret, but in the
present versions both kibana DCs should use the same
logging-kibana-proxy secret.

Comment 21 chunchen 2016-09-01 09:52:29 UTC
It's fixed with the latest images below:

brew-pulp-docker01...com:8888/openshift3/logging-deployer        3.3.0               de84ad1448af        11 hours ago        760.1 MB
brew-pulp-docker01...com:8888/openshift3/logging-kibana          3.3.0               ad2713df85a7        11 hours ago        266.9 MB
brew-pulp-docker01...com:8888/openshift3/logging-fluentd         3.3.0               74505c2dd791        12 hours ago        238.7 MB
brew-pulp-docker01...com:8888/openshift3/logging-elasticsearch   3.3.0               f204bea758eb        5 days ago          426 MB
brew-pulp-docker01...com:8888/openshift3/logging-auth-proxy      3.3.0               196ecb30fc93        3 weeks ago         229.2 MB
brew-pulp-docker01...com:8888/openshift3/logging-curator         3.3.0               2c88e1273c11

Comment 23 errata-xmlrpc 2016-09-27 09:45:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933

