Bug 1546365
| Summary: | [free-int] kube-service-catalog/apiserver pod in crash loop after upgrade | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> | |
| Component: | Cluster Version Operator | Assignee: | Michael Gugino <mgugino> | |
| Status: | CLOSED ERRATA | QA Contact: | Weihua Meng <wmeng> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 3.9.0 | CC: | aos-bugs, bingli, dyocum, gpei, jialiu, jokerman, jpeeler, jupierce, mgugino, mmccomas, pmorie, rteague, sdodson, vrutkovs, wmeng, xtian | |
| Target Milestone: | --- | Flags: | jupierce: needinfo- | |
| Target Release: | 3.9.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | | Doc Type: | If docs needed, set a value | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | ||||
| : | 1555394 (view as bug list) | Environment: | ||
| Last Closed: | 2018-03-28 14:29:21 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1547803 | |||
| Bug Blocks: | 1555394 | |||
Description
Justin Pierce
2018-02-16 21:23:15 UTC
Jeff, will you please take a look at this (with me) on Monday?

If the requestheader-client-ca file is missing, that points to the aggregator not being set up. There was an internal email requesting this to be handled, but since this is on a new install it must not have been. Was this a new install or an upgrade?

I meant to say above that it's likely the upgrade path is not handled. A fresh install should be working, though, especially since it was for new 3.7 installs.

jpeeler, this was an upgrade from 3.9 to a slightly newer build of 3.9. However, it appears this is the first time the service catalog has been enabled during an upgrade of this cluster (kube-service-catalog did not previously exist).

Need to ensure that wire_aggregator is called via control_plane_upgrade and moved out of upgrade.yml, which only runs during all-in-one upgrades. Also need to fix 3.7 to ensure that the aggregator is installed during upgrades there as well.

The aggregator is set up during installs on new clusters since 3.7. The aggregator is also configured on 3.7 upgrades on the 3.7 branch. However, the aggregator is not configured on 3.7 upgrades on 3.9. I will add the aggregator to 3.7 upgrades on 3.9; then all hosts should have the aggregator by 3.7, and there is no need to run this during later upgrades. It looks like, to replicate this, one must have a 3.6 release and upgrade to 3.7 on master.

PR submitted: https://github.com/openshift/openshift-ansible/pull/7233

Paths in the PR need to be corrected. New PR created: https://github.com/openshift/openshift-ansible/pull/7270

Hi, Justin. Can you help verify this bug? Thanks.

(In reply to Michael Gugino from comment #7)
> aggregator is also configured on 3.7 upgrades on 3.7 branch. However,
> aggregator is not configured on 3.7 upgrades on 3.9.
Does that mean the free-int upgrade is using 3.9 code to run a 3.6->3.7 upgrade? If yes, I personally think this should be an invalid case; QE has never encountered such an issue.

(In reply to Michael Gugino from comment #10)
> New PR Created: https://github.com/openshift/openshift-ansible/pull/7270
In the 3.9 installer this gives a fix for v3_7/upgrade.yml. As far as I know, that playbook is only used for 3.6->3.7. Does that mean we also support, or agree to let users, run a 3.6->3.7 upgrade with the 3.9 installer? This looks really strange and would bring a lot of noise. To avoid that noise, we should not ship the old upgrade code (e.g. 3.6->3.7) in the 3.9 installer, and only keep the 3.7->3.9 code.

I tried with openshift-ansible-3.9.1-1.git.0.9862628.el7.noarch:
# ansible-playbook -vvv /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml
got a failure during the upgrade.
Failure summary:
1. Hosts: host-8-250-81.host.centralci.eng.rdu2.redhat.com
Play: Upgrade Service Catalog
Task: wait for api server to be ready
Message: Status code was not [200]: HTTP Error 500: Internal Server Error
This failure is the same as bug https://bugzilla.redhat.com/show_bug.cgi?id=1547803,
so that bug blocks verification of this one.
The logic for the "wait for api server to be ready" task changed starting in openshift-ansible-3.9.0-0.47.0: https://github.com/openshift/openshift-ansible/commit/79e283ad98af57ecd4a4105fe561f0b0c4c53f6e#diff-ebcf31d9a3d2b05b096049dd00fb0b1b
So the upgrade playbook can finish with openshift-ansible v3.9.0-0.45.0. BZ#1547803 is a separate issue, unrelated to this bug.

@Justin, to move testing forward, could you help confirm comment 13? If you agree this is an invalid test scenario, I propose closing this bug as NOTABUG. If not, QE will only run some regression testing to make sure no new issues are introduced (because QE could not reproduce this bug, we cannot be sure the PR really resolves your issue).

The free-int upgrade was run using 3.9 playbooks and it was upgrading a slightly older 3.9 environment. Since I was instructed to disable the service broker for subsequent deployments to free-int, I will not be able to validate this either.

Fixed.
openshift-ansible-3.9.3-1.git.0.e166207.el7.noarch
upgrade from openshift v3.9.0-0.38.0
# oc get pods -n kube-service-catalog
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-cclsb            1/1       Running   0          33m
controller-manager-c6tr5   1/1       Running   0          33m

On free-int, the kube-service-catalog/apiserver pod is in a crash loop:
# oc get pods -n kube-service-catalog
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-v5d5t            0/1       CrashLoopBackOff   2414       8d
controller-manager-wp4vw   0/1       CrashLoopBackOff   1524       8d

That looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1547803 - the broker gets deployed correctly, but once the healthz endpoint is hit, the broker crashes.

Looking at free-int it's clear that the aggregator has not been deployed, so I think we should ensure that it gets invoked during the control plane upgrade in 3.9 as well.
https://github.com/openshift/openshift-ansible/pull/7478
I've verified locally that the patch will run the wire_aggregator tasks, but since this problem is reported against an environment, we cannot test this until that environment is upgraded again using a version of openshift-ansible which contains the fix. I'll move this to MODIFIED once the fix is merged, though.
Is free-int the only environment in which we've attempted to deploy the service catalog?

Hi, @Scott. I cannot reproduce it. Could you give more detailed steps to reproduce it? Thanks.

(In reply to Weihua Meng from comment #25)
> Hi, @Scott
> I cannot reproduce it.
>
> Could you give more detailed steps to reproduce it?
> Thanks.
To replicate the issue: upgrade a 3.6 cluster to a 3.7 cluster with the 3.7 GA release tag/rpm (we were missing this logic in that version of openshift-ansible), then upgrade to 3.9 using the 3.9 branch before the fix commit.

Yeah, I think in free-int the series of events was:
1) Install 3.6
2) Upgrade to 3.7 using playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml and playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml
3) Upgrade to 3.9 using playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml and playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_nodes.yml
4) Install the service catalog
5) The service catalog crash loops because the API aggregator was not configured in either step 2 or 3 as it should have been.

Failed.
openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
1. RPM install OCP v3.6.173.0.104
2. upgrade with openshift-ansible-3.7.14-1.git.0.4b35b2d.el7.noarch
openshift_enable_service_catalog=false
openshift_web_console_install=false
3. upgrade with openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
openshift_enable_service_catalog=false
openshift_web_console_install=false
4. install service catalog with openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
playbooks/openshift-service-catalog/config.yml
openshift_enable_service_catalog=true
openshift_service_catalog_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_selector={"role": "node"}
openshift_web_console_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
Note: in the 3.9.7 upgrade log, did not find the play
name: Configure API aggregation on masters
which is added by the new PR.
Ansible failed when installing the service catalog. It seems to be an etcd issue:
# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld

What does "oc describe po -n kube-service-catalog -lapp=apiserver" report? My guess at the moment is that even though the port number for etcd has been corrected to use 2379 instead of 4001, no ansible was added to correct installs in 3.9 because it's assumed the latest 3.7 code upgrade was done first. (Btw, when is the 3.7.24+ errata being released?)

I am looking into this to see if I can replicate it with an older version of 3.7.

Okay, I believe I have isolated the root cause for this. It's our old
pal, openshift_facts.
If "master": {"etcd_port": "1111"}" is present inside
/etc/ansible/facts.d/openshift.fact, our installer goes with that
value, no matter what. (I set 1111 for testing in that file). We
don't override it, we preserve whatever was there.
Steps to reproduce:
1) Install cluster with 3.9, no service catalog
2) Inject aforementioned value into openshift.fact file (master key
will already be present in that json file, just need to add etcd_port
bits).
3) Attempt to install service catalog.
This will affect anyone who ever deployed with the old value as it
will be preserved by openshift_facts.
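
For illustration only, a stale fact file on the first master carries the old port under the master key roughly like this (a minimal sketch; a real openshift.fact holds many more keys, and 4001 is the old etcd port mentioned above). Any match from a simple grep for the key means the stale value will override the installer default:

# cat /etc/ansible/facts.d/openshift.fact   # illustrative contents below
{
  "master": {
    "etcd_port": "4001"
  }
}
# grep etcd_port /etc/ansible/facts.d/openshift.fact   # any output means this cluster is affected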
# oc describe po -n kube-service-catalog -lapp=apiserver
Name: apiserver-lzfzn
Namespace: kube-service-catalog
Node: wmengupgraderpm363-master-1/10.240.0.190
Start Time: Tue, 13 Mar 2018 01:07:49 -0400
Labels: app=apiserver
controller-revision-hash=687006152
pod-template-generation=1
Annotations: ca_hash=cbdc9f97cf232061e7083729fcd96335ee813aa6
openshift.io/scc=hostmount-anyuid
Status: Running
IP: 10.128.0.7
Controlled By: DaemonSet/apiserver
Containers:
apiserver:
Container ID: docker://744a0d4f47cf5647467a584a493779573808df6224106235c816f813bc8bf72f
Image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.7
Image ID: docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:5de3bab01891975d221a03ad1905dc5671f9d4f23ee1099fae5f122f9715e950
Port: 6443/TCP
Command:
/usr/bin/service-catalog
Args:
apiserver
--storage-type
etcd
--secure-port
6443
--etcd-servers
https://wmengupgraderpm363-master-1:2379
--etcd-cafile
/etc/origin/master/master.etcd-ca.crt
--etcd-certfile
/etc/origin/master/master.etcd-client.crt
--etcd-keyfile
/etc/origin/master/master.etcd-client.key
-v
10
--cors-allowed-origins
localhost
--admission-control
KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck
--feature-gates
OriginatingIdentity=true
State: Running
Started: Tue, 13 Mar 2018 01:07:56 -0400
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/etc/origin/master from etcd-host-cert (ro)
/var/run/kubernetes-service-catalog from apiserver-ssl (ro)
/var/run/secrets/kubernetes.io/serviceaccount from service-catalog-apiserver-token-bqklm (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
apiserver-ssl:
Type: Secret (a volume populated by a Secret)
SecretName: apiserver-ssl
Optional: false
etcd-host-cert:
Type: HostPath (bare host directory volume)
Path: /etc/origin/master
HostPathType:
data-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
service-catalog-apiserver-token-bqklm:
Type: Secret (a volume populated by a Secret)
SecretName: service-catalog-apiserver-token-bqklm
Optional: false
QoS Class: BestEffort
Node-Selectors: openshift-infra=apiserver
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events: <none>
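
The pod spec above shows the catalog apiserver pointed at https://wmengupgraderpm363-master-1:2379 with the master's etcd client certificates. A quick way to check from that master whether the endpoint is actually serving etcd is to hit its health URL with the same certificates (a sketch; substitute the host from your own pod spec):

# curl --cacert /etc/origin/master/master.etcd-ca.crt \
       --cert /etc/origin/master/master.etcd-client.crt \
       --key /etc/origin/master/master.etcd-client.key \
       https://wmengupgraderpm363-master-1:2379/health   # host and cert paths taken from the args above

A healthy member answers with a small JSON health document; a refusal or error here would line up with the "[-]etcd failed" result reported from the catalog apiserver's healthz earlier in the thread.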
We need a robust fix for this, as we cannot guarantee all customers are using the latest 3.7 errata build before upgrading to 3.9, and the latest public errata has 3.7.23 for now.

Jianlin, can you answer the question in comment 33? When is the 3.7.24+ errata being released? Thanks.

PR created: https://github.com/openshift/openshift-ansible/pull/7516
This will affect all recent branches; will backport/forward port.

@Jianlin, @Weihua, if `grep etcd_port /etc/ansible/facts.d/openshift.fact` returns a match on the first master host, then the service catalog will not get the right value for etcd_port. This is due to the behavior that local_facts take precedence over defaults in openshift_facts. It will affect any old cluster that had previously placed 'etcd_port' inside the 'master' dictionary in that file. The master branch no longer places that value inside the fact file, but it will still respect old values that were placed there before. Hopefully in 3.10 we can remove this file entirely.

(In reply to Weihua Meng from comment #38)
> Jianlin, can you answer the question in comment 33?
> when is 3.7.24+ errata being released?
> Thanks.
The latest 3.7.24+ errata will be released in https://errata.devel.redhat.com/advisory/32336. It is still in NEW_FILE state; because 3.9 is higher priority, it will probably be released after the 3.9 GA.

Backport to 3.7 created: https://github.com/openshift/openshift-ansible/pull/7523

Cloned for 3.7.z.

@Gaoyun, please check if scale etcd is OK with this change. Thanks.

@Justin, what is our online cluster config, dedicated etcd hosts or etcd on the master hosts? Thanks.

I found the cause for the failure.
--etcd-servers
https://wmengupgraderpm364-master-1:2379
External etcd is used for this cluster, so the etcd server should be
https://wmengupgraderpm364-etcd-1:2379
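
To see which etcd endpoint the catalog apiserver is actually configured with on an affected cluster, checking the daemonset spec is enough (a sketch; the daemonset name matches the "Controlled By: DaemonSet/apiserver" line in the describe output above):

# oc get daemonset apiserver -n kube-service-catalog -o yaml | grep -A1 etcd-servers   # prints the --etcd-servers arg and the URL that follows it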
PR created for 3.9: https://github.com/openshift/openshift-ansible/pull/7542
I'm unsure if 3.7 will be affected by this condition; it may be a regression in 3.9.

Forking the specific scenario about upgraded 3.9 environments with external etcd to https://bugzilla.redhat.com/show_bug.cgi?id=1557036

Fixed with etcd on the master hosts.
openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch
The issue with dedicated etcd hosts is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1557036
# oc get pods --all-namespaces
NAMESPACE                           NAME                              READY     STATUS      RESTARTS   AGE
default                             docker-registry-6-b7brh           1/1       Running     0          20m
default                             registry-console-3-2kxmj          1/1       Running     0          27m
default                             router-3-8fbpx                    1/1       Running     0          22m
default                             router-3-fmc5g                    1/1       Running     0          20m
install-test                        mongodb-1-kj7s5                   1/1       Running     0          20m
install-test                        nodejs-mongodb-example-1-p8j74    1/1       Running     0          20m
kube-service-catalog                apiserver-rxt7w                   1/1       Running     0          7m
kube-service-catalog                controller-manager-w6jst          1/1       Running     0          7m
openshift-ansible-service-broker    asb-1-r2kmf                       1/1       Running     2          6m
openshift-ansible-service-broker    asb-etcd-1-llcxt                  1/1       Running     0          6m
openshift-template-service-broker   apiserver-s9fzt                   1/1       Running     0          6m
openshift-template-service-broker   apiserver-xhxvd                   1/1       Running     0          6m
openshift-template-service-broker   apiserver-zldrz                   1/1       Running     0          6m
wmeng                               cakephp-mysql-example-1-7rkcx     1/1       Running     0          1m
wmeng                               cakephp-mysql-example-1-build     0/1       Completed   0          2m
wmeng                               mysql-1-nm6ww                     1/1       Running     0          2m
# curl -k https://apiserver.kube-service-catalog.svc/healthz
ok

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489