Description of problem:
After deploying a new build of OCP v3.9 to free-int, the kube-service-catalog/apiserver pod is in a crash loop backoff.

Version-Release number of the following components: v3.9.0-0.45.0

Additional info:
[root@free-int-master-3c664 ~]# oc get pods -w
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-zrb2v            0/1       CrashLoopBackOff   8          19m
controller-manager-sz4l5   1/1       Running            5          19m

[root@free-int-master-3c664 ~]# oc logs apiserver-zrb2v
(each log ends with the line)
Error: cluster doesn't provide requestheader-client-ca-file

Not sure if this is the installer or the service broker, but guessing the installer since a CA is mentioned.
Jeff, will you please take a look at this (with me) on Monday?
If the requestheader-client-ca-file is missing, that points to the aggregator not being set up. There was an internal email requesting that this be handled, but since this is a new install, it must not have been.
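For reference, a quick way to confirm that missing aggregator wiring is the cause (a diagnostic sketch, assuming the default configmap name the aggregation layer publishes and the default master config path):

# oc get configmap extension-apiserver-authentication -n kube-system -o yaml | grep requestheader
# grep -A 3 aggregatorConfig /etc/origin/master/master-config.yaml

If the first command returns no requestheader-* keys and the second shows an empty aggregatorConfig, the aggregator was never wired on this cluster.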
Was this a new install or an upgrade? I meant to say above that it's likely the upgrade path that is not handled. A fresh install should be working, especially since this was handled for new 3.7 installs.
jpeeler, this was an upgrade from 3.9 to a slightly newer build of 3.9. However, it appears this is the first time the service catalog has been enabled during an upgrade of this cluster (kube-service-catalog did not previously exist).
We need to ensure that wire_aggregator is called via control_plane_upgrade and moved out of upgrade.yml, which only runs during all-in-one upgrades. We also need to fix 3.7 to ensure the aggregator is installed during upgrades there as well.
The aggregator is set up during installs of new clusters since 3.7. The aggregator is also configured on 3.7 upgrades on the 3.7 branch. However, the aggregator is not configured on 3.7 upgrades on the 3.9 branch. I will add the aggregator to 3.7 upgrades on 3.9. Then all hosts should have the aggregator by 3.7, and there is no need to run this during later upgrades. To replicate this, it looks like one must have a 3.6 release and upgrade to 3.7 on the master branch.
PR Submitted: https://github.com/openshift/openshift-ansible/pull/7233
Paths in the PR need to be corrected.
New PR Created: https://github.com/openshift/openshift-ansible/pull/7270
Hi, Justin. Can you help verify this bug? Thanks.
(In reply to Michael Gugino from comment #7)
> aggregator is also configured on 3.7 upgrades on 3.7 branch. However,
> aggregator is not configured on 3.7 upgrades on 3.9.

Does that mean the free-int upgrade used 3.9 code to run a 3.6->3.7 upgrade? If so, I personally think this is an invalid case; QE has never encountered such an issue.

(In reply to Michael Gugino from comment #10)
> New PR Created: https://github.com/openshift/openshift-ansible/pull/7270

In the 3.9 installer, this gives a fix for v3_7/upgrade.yml. As far as I know, that playbook is only used for 3.6->3.7 upgrades. Does that mean we also support, or agree to let users, use the 3.9 installer to run a 3.6->3.7 upgrade? That looks really strange and would bring a lot of noise. Personally, to avoid that noise, we should not ship the old upgrade code (e.g. 3.6->3.7) in the 3.9 installer and only keep the 3.7->3.9 code.
I tried with openshift-ansible-3.9.1-1.git.0.9862628.el7.noarch:

# ansible-playbook -vvv /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml

and got a failure during the upgrade.

Failure summary:
1. Hosts:   host-8-250-81.host.centralci.eng.rdu2.redhat.com
   Play:    Upgrade Service Catalog
   Task:    wait for api server to be ready
   Message: Status code was not [200]: HTTP Error 500: Internal Server Error

This failure is the same as bug https://bugzilla.redhat.com/show_bug.cgi?id=1547803, so that bug blocks the verification of this one.
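The failing task is an HTTP health check, so the 500 can typically be reproduced by hand (a sketch run from a master, using the in-cluster service address that appears later in this bug):

# curl -k https://apiserver.kube-service-catalog.svc/healthz

A 500 here, rather than "ok", confirms the catalog apiserver itself is the failing component and not the ansible task.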
The logic for the task "wait for api server to be ready" changed starting in openshift-ansible-3.9.0-0.47.0: https://github.com/openshift/openshift-ansible/commit/79e283ad98af57ecd4a4105fe561f0b0c4c53f6e#diff-ebcf31d9a3d2b05b096049dd00fb0b1b So the upgrade playbook can finish with openshift-ansible v3.9.0-0.45.0.
BZ#1547803 is another issue, unrelated to this bug. @Justin, to keep testing moving, could you help confirm comment 13? If you agree this is an invalid test scenario, I propose closing this bug as NOTABUG. If not, QE will only run some regression testing to make sure no new issues are introduced (because QE could not reproduce this bug, we cannot be sure the PR really resolves your issue).
The free-int upgrade was run using the 3.9 playbooks, and it was upgrading a slightly older 3.9 environment. Since I was instructed to disable the service broker for subsequent deployments to free-int, I will not be able to validate this either.
Fixed.
openshift-ansible-3.9.3-1.git.0.e166207.el7.noarch
upgrade from openshift v3.9.0-0.38.0

# oc get pods -n kube-service-catalog
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-cclsb            1/1       Running   0          33m
controller-manager-c6tr5   1/1       Running   0          33m
On free-int, the kube-service-catalog/apiserver pod is in a crash loop:

# oc get pods -n kube-service-catalog
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-v5d5t            0/1       CrashLoopBackOff   2414       8d
controller-manager-wp4vw   0/1       CrashLoopBackOff   1524       8d
That looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1547803 - the broker gets deployed correctly, but once the healthz endpoint is hit, the broker crashes.
Looking at free-int, it's clear that the aggregator has not been deployed, so I think we should ensure that it gets invoked during the control plane upgrade in 3.9 as well. https://github.com/openshift/openshift-ansible/pull/7478
I've verified locally that the patch runs the wire_aggregator tasks, but since this problem is reported against a specific environment, we cannot test it until that environment is upgraded again using a version of openshift-ansible that contains the fix. I'll move this to MODIFIED once the fix is merged, though. Is free-int the only environment in which we've attempted to deploy the service catalog?
Hi, @Scott
I cannot reproduce it. Could you give more detailed steps to reproduce it? Thanks.
(In reply to Weihua Meng from comment #25)
> Hi, @Scott
> I cannot reproduce it.
>
> Could you give more detailed steps to reproduce it?
> Thanks.

To replicate the issue:
1. Upgrade a 3.6 cluster to 3.7 with the 3.7 GA release tag/rpm. We were missing this logic in that version of openshift-ansible.
2. Upgrade to 3.9 using the 3.9 branch before the fix commit.
Yeah, I think in free-int the series of events was:
1) Install 3.6
2) Upgrade to 3.7 using playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml and playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml
3) Upgrade to 3.9 using playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml and playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_nodes.yml
4) Install the service catalog
5) The service catalog crash loops because the API aggregator was not configured in either step 2 or 3, as it should have been.
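Put together, the reproduction boils down to this upgrade sequence (a sketch; the inventory path is hypothetical, and each step must be run with the openshift-ansible version noted above — 3.7 GA for the first pair, a pre-fix 3.9 build for the second):

# ansible-playbook -i /path/to/inventory playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml
# ansible-playbook -i /path/to/inventory playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml
# ansible-playbook -i /path/to/inventory playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml
# ansible-playbook -i /path/to/inventory playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_nodes.yml
# ansible-playbook -i /path/to/inventory playbooks/openshift-service-catalog/config.yml

The catalog config playbook path is taken from the QE retest below; the key point is that neither upgrade wires the aggregator before the catalog install.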
Failed.
openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

1. RPM install OCP v3.6.173.0.104
2. Upgrade with openshift-ansible-3.7.14-1.git.0.4b35b2d.el7.noarch
   openshift_enable_service_catalog=false
   openshift_web_console_install=false
3. Upgrade with openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
   openshift_enable_service_catalog=false
   openshift_web_console_install=false
4. Install the service catalog with openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
   playbooks/openshift-service-catalog/config.yml
   openshift_enable_service_catalog=true
   openshift_service_catalog_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
   ansible_service_broker_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
   template_service_broker_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
   template_service_broker_selector={"role": "node"}
   openshift_web_console_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-

Note: in the 397upgradelog, I did not find the play name "Configure API aggregation on masters", which is added by the new PR.
Ansible failed when installing the service catalog. It seems to be an etcd issue:

# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
What does "oc describe po -n kube-service-catalog -lapp=apiserver" report? My guess at the moment is even though the port number for etcd has been corrected to use 2379 instead of 4001, no ansible was added to correct installs in 3.9 because it's assumed the latest 3.7 code upgrade was done first. (Btw, when is 3.7.24+ errata being released?)
I am looking into this to see if I can replicate it with an older version of 3.7.
Okay, I believe I have isolated the root cause for this. It's our old pal, openshift_facts.

If "master": {"etcd_port": "1111"} is present inside /etc/ansible/facts.d/openshift.fact, our installer goes with that value, no matter what. (I set 1111 for testing in that file.) We don't override it; we preserve whatever was there.

Steps to reproduce:
1) Install a cluster with 3.9, no service catalog.
2) Inject the aforementioned value into the openshift.fact file (the master key will already be present in that json file; you just need to add the etcd_port bits).
3) Attempt to install the service catalog.

This will affect anyone who ever deployed with the old value, as it will be preserved by openshift_facts.
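For step 2, a minimal sketch of injecting the stale fact (assuming the fact file is the plain JSON document openshift_facts writes; "1111" is just the test placeholder from above):

# python -c 'import json; p="/etc/ansible/facts.d/openshift.fact"; d=json.load(open(p)); d.setdefault("master", {})["etcd_port"]="1111"; json.dump(d, open(p, "w"), indent=2)'

After this, rerunning the service catalog install should render --etcd-servers with port 1111, demonstrating that the local fact wins over the installer default.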
# oc describe po -n kube-service-catalog -lapp=apiserver
Name:           apiserver-lzfzn
Namespace:      kube-service-catalog
Node:           wmengupgraderpm363-master-1/10.240.0.190
Start Time:     Tue, 13 Mar 2018 01:07:49 -0400
Labels:         app=apiserver
                controller-revision-hash=687006152
                pod-template-generation=1
Annotations:    ca_hash=cbdc9f97cf232061e7083729fcd96335ee813aa6
                openshift.io/scc=hostmount-anyuid
Status:         Running
IP:             10.128.0.7
Controlled By:  DaemonSet/apiserver
Containers:
  apiserver:
    Container ID:  docker://744a0d4f47cf5647467a584a493779573808df6224106235c816f813bc8bf72f
    Image:         registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.7
    Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:5de3bab01891975d221a03ad1905dc5671f9d4f23ee1099fae5f122f9715e950
    Port:          6443/TCP
    Command:
      /usr/bin/service-catalog
    Args:
      apiserver
      --storage-type etcd
      --secure-port 6443
      --etcd-servers https://wmengupgraderpm363-master-1:2379
      --etcd-cafile /etc/origin/master/master.etcd-ca.crt
      --etcd-certfile /etc/origin/master/master.etcd-client.crt
      --etcd-keyfile /etc/origin/master/master.etcd-client.key
      -v 10
      --cors-allowed-origins localhost
      --admission-control KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck
      --feature-gates OriginatingIdentity=true
    State:          Running
      Started:      Tue, 13 Mar 2018 01:07:56 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/origin/master from etcd-host-cert (ro)
      /var/run/kubernetes-service-catalog from apiserver-ssl (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from service-catalog-apiserver-token-bqklm (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  apiserver-ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  apiserver-ssl
    Optional:    false
  etcd-host-cert:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/origin/master
    HostPathType:
  data-dir:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  service-catalog-apiserver-token-bqklm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  service-catalog-apiserver-token-bqklm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  openshift-infra=apiserver
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:          <none>
We need a robust fix for this, as we cannot guarantee all customers are using the latest 3.7 errata build before upgrading to 3.9. And the latest public errata is 3.7.23 for now.
Jianlin, can you answer the question in comment 33? When is the 3.7.24+ errata being released? Thanks.
PR Created: https://github.com/openshift/openshift-ansible/pull/7516

This will affect all recent branches; I will backport/forward-port.

@Jianlin, @Weihua: if `grep etcd_port /etc/ansible/facts.d/openshift.fact` returns a match on the first master host, then the service catalog will not get the right value for etcd_port. This is due to the behavior that local_facts take precedence over defaults in openshift_facts. This will affect any old cluster that had previously placed 'etcd_port' inside the 'master' dictionary in that file. The master branch no longer places that value inside the fact file, but it will still respect old values that were placed there before. Hopefully in 3.10 we can remove this file entirely.
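To spot affected hosts, and clean them up by hand if needed, a sketch along the lines of the reproduction snippet above (the cleanup one-liner is my own workaround suggestion, not part of the PR, which instead changes how the fact is resolved):

# grep etcd_port /etc/ansible/facts.d/openshift.fact
# python -c 'import json; p="/etc/ansible/facts.d/openshift.fact"; d=json.load(open(p)); d.get("master", {}).pop("etcd_port", None); json.dump(d, open(p, "w"), indent=2)'

The first command detects the stale key; the second removes it so openshift_facts falls back to its default.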
(In reply to Weihua Meng from comment #38)
> Jianlin, can you answer the question in comment 33?
> When is the 3.7.24+ errata being released?
> Thanks.

The 3.7.24+ errata will be released via https://errata.devel.redhat.com/advisory/32336; it is still in the NEW_FILE state. Because 3.9 is higher priority, it may be released after 3.9 GA.
Backport to 3.7 created: https://github.com/openshift/openshift-ansible/pull/7523
Cloned for 3.7.z
@Gaoyun, please check if scaling etcd is OK with this change. Thanks.
@Justin, what is our online cluster config: dedicated etcd hosts, or etcd on the master hosts? Thanks.
I found the cause of the failure:

--etcd-servers https://wmengupgraderpm364-master-1:2379

External etcd is used for this cluster, so the etcd server should be https://wmengupgraderpm364-etcd-1:2379.
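One way to cross-check what the catalog apiserver should be pointing at is the master's own etcd client configuration (a sketch, assuming the default master config path; the etcdClientInfo stanza lists the etcd URLs the master uses):

# grep -A 5 etcdClientInfo /etc/origin/master/master-config.yaml
# oc get ds apiserver -n kube-service-catalog -o yaml | grep etcd-servers

The two should agree; here the daemonset points at the master host while the master config points at the dedicated etcd host.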
PR created for 3.9: https://github.com/openshift/openshift-ansible/pull/7542 I'm unsure whether 3.7 is affected by this condition; it may be a regression in 3.9.
Forking the specific scenario about upgraded 3.9 environments with external etcd to https://bugzilla.redhat.com/show_bug.cgi?id=1557036
Fixed with etcd on master hosts.
openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch

The issue with dedicated etcd hosts is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1557036

# oc get pods --all-namespaces
NAMESPACE                           NAME                             READY     STATUS      RESTARTS   AGE
default                             docker-registry-6-b7brh          1/1       Running     0          20m
default                             registry-console-3-2kxmj         1/1       Running     0          27m
default                             router-3-8fbpx                   1/1       Running     0          22m
default                             router-3-fmc5g                   1/1       Running     0          20m
install-test                        mongodb-1-kj7s5                  1/1       Running     0          20m
install-test                        nodejs-mongodb-example-1-p8j74   1/1       Running     0          20m
kube-service-catalog                apiserver-rxt7w                  1/1       Running     0          7m
kube-service-catalog                controller-manager-w6jst         1/1       Running     0          7m
openshift-ansible-service-broker    asb-1-r2kmf                      1/1       Running     2          6m
openshift-ansible-service-broker    asb-etcd-1-llcxt                 1/1       Running     0          6m
openshift-template-service-broker   apiserver-s9fzt                  1/1       Running     0          6m
openshift-template-service-broker   apiserver-xhxvd                  1/1       Running     0          6m
openshift-template-service-broker   apiserver-zldrz                  1/1       Running     0          6m
wmeng                               cakephp-mysql-example-1-7rkcx    1/1       Running     0          1m
wmeng                               cakephp-mysql-example-1-build    0/1       Completed   0          2m
wmeng                               mysql-1-nm6ww                    1/1       Running     0          2m

# curl -k https://apiserver.kube-service-catalog.svc/healthz
ok
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489