Description of problem:
Upgrade logging from 3.2.0 to 3.3.0 with ENABLE_OPS_CLUSTER=true. After the upgrade pod finished successfully, accessing the Kibana OPS UI returns "503 Service Unavailable". The non-OPS UI works fine.

Version-Release number of selected component (if applicable):
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd         3.3.0   4d87d421e950   5 days ago    238.7 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy      3.3.0   196ecb30fc93   2 weeks ago   229.2 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch   3.3.0   e71d2b04669c   4 weeks ago   426.9 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer        3.3.0   1c127f4f36a0   4 weeks ago   747.9 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-curator         3.3.0   2c88e1273c11   4 weeks ago   253.8 MB
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-kibana          3.3.0   32d276bb46ae   8 weeks ago

How reproducible:
Always

Steps to Reproduce:
0. Deploy a 3.2.0 logging stack with the OPS cluster enabled:
   IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/
   IMAGE_VERSION=3.2.0
   ENABLE_OPS_CLUSTER=true
   Make sure the EFK pods are running fine and both the Kibana and Kibana OPS UIs are accessible and functional.

1. Add yourself to cluster-admin:
   $ oadm policy add-cluster-role-to-user cluster-admin xiazhao

2. Delete the existing templates if they exist:
   $ oc delete template logging-deployer-account-template logging-deployer-template
   Error from server: templates "logging-deployer-account-template" not found
   Error from server: templates "logging-deployer-template" not found

3. Create the missing templates according to https://github.com/openshift/origin-aggregated-logging/tree/master/deployer#create-missing-templates:
   $ oc create -f https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/deployer/deployer.yaml
   template "logging-deployer-account-template" created
   template "logging-deployer-template" created

   Modify the deployer template to use the new image name for the 3.3.0 deployer:
   $ oc edit template logging-deployer-template -o yaml
   changed
     image: ${IMAGE_PREFIX}logging-deployment:${IMAGE_VERSION}
   to
     image: ${IMAGE_PREFIX}logging-deployer:${IMAGE_VERSION}

4. Create the service account and permissions according to https://github.com/openshift/origin-aggregated-logging/tree/master/deployer#create-supporting-serviceaccount-and-permissions:
   $ oc new-app logging-deployer-account-template
   --> Deploying template logging-deployer-account-template for "logging-deployer-account-template"
   --> Creating resources ...
       error: serviceaccounts "logging-deployer" already exists
       error: serviceaccounts "aggregated-logging-kibana" already exists
       error: serviceaccounts "aggregated-logging-elasticsearch" already exists
       error: serviceaccounts "aggregated-logging-fluentd" already exists
       serviceaccount "aggregated-logging-curator" created
       clusterrole "oauth-editor" created
       clusterrole "daemonset-admin" created
       rolebinding "logging-deployer-edit-role" created
       rolebinding "logging-deployer-dsadmin-role" created
   $ oc policy add-role-to-user edit --serviceaccount logging-deployer
   $ oc policy add-role-to-user daemonset-admin --serviceaccount logging-deployer
   $ oadm policy add-cluster-role-to-user oauth-editor system:serviceaccount:logging:logging-deployer
5. Run the logging deployer with MODE=upgrade, IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ and IMAGE_VERSION=3.3.0:

   $ oc process logging-deployer-template -v\
   ENABLE_OPS_CLUSTER=true,\
   IMAGE_PREFIX=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/,\
   KIBANA_HOSTNAME=kibana.0822-is1.qe.rhcloud.com,\
   KIBANA_OPS_HOSTNAME=kibana-ops.0822-is1.qe.rhcloud.com,\
   PUBLIC_MASTER_URL=https://host-8-172-89.host.centralci.eng.rdu2.redhat.com:8443,\
   ES_INSTANCE_RAM=1024M,\
   ES_CLUSTER_SIZE=1,\
   MODE=upgrade,\
   IMAGE_VERSION=3.3.0,\
   MASTER_URL=https://host-8-172-89.host.centralci.eng.rdu2.redhat.com:8443\
   | oc create -f -

6. Check the logging pods after the upgrade:

   # oc get po
   NAME                              READY     STATUS      RESTARTS   AGE
   logging-curator-1-9zs7s           1/1       Running     0          6m
   logging-curator-ops-1-n6f8r       1/1       Running     0          6m
   logging-deployer-tw2e2            0/1       Completed   0          8m
   logging-es-be6nb8x3-3-0zh3g       1/1       Running     0          6m
   logging-es-ops-ht5m08g3-3-vir1b   1/1       Running     0          6m
   logging-fluentd-0grmx             1/1       Running     0          6m
   logging-fluentd-krx4v             1/1       Running     0          6m
   logging-kibana-2-eybgx            2/2       Running     0          5m
   logging-kibana-ops-2-owgru        2/2       Running     0          5m

7. Visit the Kibana and Kibana OPS UIs.

Actual results:
"503 Service Unavailable" when accessing the Kibana OPS UI. The non-OPS UI works fine.

Expected results:
The Kibana OPS UI should work fine after the upgrade.

Additional info:
Screenshots attached.
Upgrade pod log attached.
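For reference, the symptom can also be confirmed from the command line against the routes used above; this is just an illustrative check with the hostnames from the deployer parameters, not something run as part of the original report:

# Non-OPS route responds normally; the OPS route answers with the 503.
$ curl -k -I https://kibana.0822-is1.qe.rhcloud.com
$ curl -k -I https://kibana-ops.0822-is1.qe.rhcloud.com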
Created attachment 1193472 [details] Upgrade_pod_log
Created attachment 1193473 [details] OPS UI Kibana screenshot, which is running fine
Created attachment 1193474 [details] Non-OPS Kibana UI screenshot where the bug reproduces
It is a cert problem as Paul said; the problem seems to be that the logging-kibana-proxy and logging-kibana-ops-proxy secrets get different server certs (which is expected) signed by different signers, which should not happen, as every cert in the deployment should have the same signer. So the routes should have been right (having the same CA for both), but the server cert on the kibana-ops instance is wrong. I need to figure out whether that's something new or we just never noticed it before we had an upgrade creating a reencrypt route.
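For anyone reproducing this, the signer mismatch can be checked directly from the two secrets. This is only a sketch, and the server-cert key name inside the Kibana proxy secrets is an assumption about the deployer's secret layout:

# Print the issuer/subject of each Kibana proxy serving cert; on an affected
# cluster the two issuers (signers) differ.
$ for s in logging-kibana-proxy logging-kibana-ops-proxy; do
    echo "== $s =="
    oc get secret "$s" -o jsonpath='{.data.server-cert}' | base64 -d | \
      openssl x509 -noout -issuer -subject
  done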
(In reply to Luke Meyer from comment #14)
> It is a cert problem as Paul said; the problem seems to be that the
> logging-kibana-proxy and logging-kibana-ops-proxy secrets get different
> server certs (which is expected) signed by different signers, which should
> not happen, as every cert in the deployment should have the same signer. So
> the routes should have been right (having the same CA for both), but the
> server cert on the kibana-ops instance is wrong. I need to figure out
> whether that's something new or we just never noticed it before we had an
> upgrade creating a reencrypt route.

Thanks for the info, Luke. I'll keep the test env in comment #12 until you're finished using it.
The problem is that in OSE 3.2, kibana and kibana-ops pods were created with separate secrets (though they had the same contents) and in 3.3 they are both created to use the same secret, logging-kibana-proxy. The logging-kibana-ops-proxy secret from the 3.2 installation is left unaltered by the upgrade, as is the kibana-ops DC secret volume mount, while all the other secrets are regenerated with a new signer. The routes are replaced with reencrypt routes looking for the new signer, so the kibana-ops cert isn't trusted. I need to fix the upgrade so that it deletes the old secret and patches the kibana-ops DC to look at the right one.
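A minimal sketch of doing that by hand on an affected cluster; the volume name "kibana-proxy" is an assumption, so check the actual volume name with `oc get dc logging-kibana-ops -o yaml` before patching:

# Drop the stale 3.2-era secret and point the ops DC at the shared secret.
$ oc delete secret logging-kibana-ops-proxy
# Strategic merge patch: volumes merge by name, so this rewires the assumed
# "kibana-proxy" volume to mount logging-kibana-proxy instead.
$ oc patch dc/logging-kibana-ops -p '{"spec":{"template":{"spec":{"volumes":[{"name":"kibana-proxy","secret":{"secretName":"logging-kibana-proxy"}}]}}}}'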
Retested with the latest 3.3.0 logging images on brew. The kibana ops pod did not start up successfully after the upgrade:

$ oc get po
NAME                              READY     STATUS              RESTARTS   AGE
logging-curator-1-2j802           1/1       Running             0          2h
logging-curator-ops-1-i219t       1/1       Running             0          2h
logging-deployer-zczlp            0/1       Error               0          2h
logging-es-60qdpasn-3-8grsp       1/1       Running             0          2h
logging-es-ops-pvesokep-3-cz3e1   1/1       Running             0          2h
logging-fluentd-seosv             1/1       Running             0          2h
logging-kibana-2-8068e            2/2       Running             0          2h
logging-kibana-ops-2-dapa2        0/2       ContainerCreating   0          2h

And the upgrade deployer pod failed with this error:

+++ oc get pod logging-kibana-ops-2-dapa2 -o 'jsonpath={.status.phase}'
++ [[ Running == \P\e\n\d\i\n\g ]]
+ sleep 1
+ (( i++ ))
+ (( i<=300 ))
+ eval '[[ "Running" == "$(oc get pod logging-kibana-ops-2-dapa2 -o jsonpath='\''{.status.phase}'\'')" ]]'
+++ oc get pod logging-kibana-ops-2-dapa2 -o 'jsonpath={.status.phase}'
++ [[ Running == \P\e\n\d\i\n\g ]]
+ sleep 1
logging-kibana-ops-2-dapa2 not started within 300 seconds
+ (( i++ ))
+ (( i<=300 ))
+ return 1
+ echo 'logging-kibana-ops-2-dapa2 not started within 300 seconds'
+ return 1

I will retry and update later.
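When a pod hangs in ContainerCreating like this, the pod's events usually name the missing mount; a quick check (pod name taken from the output above):

# Look for FailedMount / "secret not found" events referencing the old
# logging-kibana-ops-proxy secret.
$ oc describe pod logging-kibana-ops-2-dapa2
$ oc get events | grep -i kibana-ops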
I think I need to redeploy the kibana-ops DC after modifying it; it's probably looking for a secret that no longer exists. Built logging-deployer:3.3.0-9.
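For anyone already sitting on a broken cluster, the manual equivalent of redeploying the DC with that era's oc client would be roughly the following; a sketch, not part of the deployer fix itself:

# Kick off a new deployment so the pod is recreated with the corrected
# secret volume.
$ oc deploy logging-kibana-ops --latest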
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/221beecd5920f3f76d3694623d46faa6f372366e
origin fix for bug 1369646

Make the upgrade set the correct secret volume on the logging-kibana-ops DC; in earlier versions, it got a separate-but-equal secret, but in the present versions both kibana DCs should use the same logging-kibana-proxy secret.
It's fixed with the latest images below:

brew-pulp-docker01...com:8888/openshift3/logging-deployer        3.3.0   de84ad1448af   11 hours ago   760.1 MB
brew-pulp-docker01...com:8888/openshift3/logging-kibana          3.3.0   ad2713df85a7   11 hours ago   266.9 MB
brew-pulp-docker01...com:8888/openshift3/logging-fluentd         3.3.0   74505c2dd791   12 hours ago   238.7 MB
brew-pulp-docker01...com:8888/openshift3/logging-elasticsearch   3.3.0   f204bea758eb   5 days ago     426 MB
brew-pulp-docker01...com:8888/openshift3/logging-auth-proxy      3.3.0   196ecb30fc93   3 weeks ago    229.2 MB
brew-pulp-docker01...com:8888/openshift3/logging-curator         3.3.0   2c88e1273c11
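One way to double-check the fixed state is to confirm that both Kibana DCs now mount the same logging-kibana-proxy secret; a sketch, assuming the standard DC layout:

# Each DC should list logging-kibana-proxy among its secret volumes.
$ oc get dc logging-kibana logging-kibana-ops \
    -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.spec.template.spec.volumes[*].secret.secretName}{"\n"}{end}'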
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1933