Bug 1546365 - [free-int] kube-service-catalog/apiserver pod in crash loop after upgrade
Summary: [free-int] kube-service-catalog/apiserver pod in crash loop after upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.9.0
Assignee: Michael Gugino
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On: 1547803
Blocks: 1555394
Reported: 2018-02-16 21:23 UTC by Justin Pierce
Modified: 2021-09-09 13:13 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1555394
Environment:
Last Closed: 2018-03-28 14:29:21 UTC
Target Upstream Version:
Embargoed:
jupierce: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1557036 0 unspecified CLOSED Environments upgraded to 3.9 and running external etcd hosts cannot install service catalog 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2018:0489 0 None None None 2018-03-28 14:29:53 UTC

Internal Links: 1557036

Description Justin Pierce 2018-02-16 21:23:15 UTC
Description of problem:
After deploying a new build of OCP v3.9 to free-int, the kube-service-catalog/apiserver pod is in CrashLoopBackOff.


Version-Release number of the following components:
v3.9.0-0.45.0

Additional info:
[root@free-int-master-3c664 ~]# oc get pods -w
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-zrb2v            0/1       CrashLoopBackOff   8          19m
controller-manager-sz4l5   1/1       Running            5          19m

[root@free-int-master-3c664 ~]# oc logs apiserver-zrb2v
....each log ends with the line....
Error: cluster doesn't provide requestheader-client-ca-file

Not sure if this is the installer or the service broker, but guessing the installer since a CA is mentioned.
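
A quick triage note for this error (a sketch; it assumes the catalog apiserver reads its requestheader CA from the standard kube-system/extension-apiserver-authentication configmap, which has not been verified against free-int):

# oc get configmap extension-apiserver-authentication -n kube-system -o yaml | grep requestheader

If that returns nothing, the cluster never published a requestheader client CA for aggregated apiservers.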

Comment 1 Paul Morie 2018-02-18 01:23:23 UTC
Jeff, will you please take a look at this (with me) on Monday?

Comment 2 Jeff Peeler 2018-02-19 15:19:48 UTC
If the requestheader-client-ca file is missing, that points to the aggregator not being set up. There was an internal email requesting that this be handled, but since this is a new install, it must not have been.
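
One way to confirm this on a master host (a sketch; the paths and file names assume the default 3.x layout written by the aggregator wiring and are not taken from free-int):

# grep -A3 'aggregatorConfig' /etc/origin/master/master-config.yaml
# grep -A6 'requestHeader' /etc/origin/master/master-config.yaml
# ls /etc/origin/master/ | grep -iE 'front-proxy|aggregator'

If aggregatorConfig/requestHeader are absent from master-config.yaml, the aggregator tasks never ran on that host.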

Comment 3 Jeff Peeler 2018-02-19 15:25:07 UTC
Was this a new install or an upgrade? I meant to say above that it's likely the upgrade path that is not handled. A fresh install should be working, especially since it was working for new 3.7 installs.

Comment 4 Justin Pierce 2018-02-19 16:16:58 UTC
jpeeler this was an upgrade from 3.9 to a slightly newer build of 3.9. However, it appears this is the first time the service catalog has been enabled during an upgrade of this cluster (kube-service-catalog did not previously exist).

Comment 6 Scott Dodson 2018-02-19 20:35:44 UTC
We need to ensure that wire_aggregator is called via control_plane_upgrade and moved out of upgrade.yml, which only runs during all-in-one upgrades. We also need to fix 3.7 to ensure that the aggregator is installed during upgrades there as well.

Comment 7 Michael Gugino 2018-02-21 15:29:57 UTC
The aggregator has been set up during installs of new clusters since 3.7.

The aggregator is also configured during 3.7 upgrades on the 3.7 branch. However, it is not configured during 3.7 upgrades run from the 3.9 branch.

I will add the aggregator to the 3.7 upgrade plays in the 3.9 branch. Then all hosts should have the aggregator by 3.7 and there is no need to run this during later upgrades.

It looks like, to replicate this, one must have a 3.6 release and upgrade to 3.7 on master.

Comment 8 Michael Gugino 2018-02-21 15:41:14 UTC
PR Submitted: https://github.com/openshift/openshift-ansible/pull/7233

Comment 9 Russell Teague 2018-02-23 17:04:28 UTC
Paths in the PR need to be corrected.

Comment 10 Michael Gugino 2018-02-23 17:52:38 UTC
New PR Created: https://github.com/openshift/openshift-ansible/pull/7270

Comment 12 Weihua Meng 2018-02-27 10:00:25 UTC
Hi, Justin
Can you help verify this bug? 
Thanks.

Comment 13 Johnny Liu 2018-03-02 09:24:09 UTC
(In reply to Michael Gugino from comment #7)
> aggregator is also configured on 3.7 upgrades on 3.7 branch.  However,
> aggregator is not configured on 3.7 upgrades on 3.9.
Does that mean the free-int upgrade used 3.9 code to run a 3.6->3.7 upgrade? If so, personally I think this is an invalid case; QE has never encountered such an issue.

(In reply to Michael Gugino from comment #10)
> New PR Created: https://github.com/openshift/openshift-ansible/pull/7270
The 3.9 installer fix here touches v3_7/upgrade.yml. As far as I know, that playbook is only used for 3.6->3.7, so does this mean we also support (or agree to let) users run a 3.6->3.7 upgrade with the 3.9 installer? That looks really strange and would bring a lot of noise. Personally, to avoid that noise, we should not ship the old upgrade code (e.g. 3.6->3.7) in the 3.9 installer and should keep only the 3.7->3.9 code.

Comment 14 Weihua Meng 2018-03-02 23:48:54 UTC
I tried with openshift-ansible-3.9.1-1.git.0.9862628.el7.noarch

# ansible-playbook -vvv /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml

I got a failure during the upgrade.

Failure summary:


  1. Hosts:    host-8-250-81.host.centralci.eng.rdu2.redhat.com
     Play:     Upgrade Service Catalog
     Task:     wait for api server to be ready
     Message:  Status code was not [200]: HTTP Error 500: Internal Server Error

This failure is the same as bug https://bugzilla.redhat.com/show_bug.cgi?id=1547803

So that bug blocks the verification of this bug.

Comment 15 Weihua Meng 2018-03-03 00:20:11 UTC
The logic for the task "wait for api server to be ready" changed starting in openshift-ansible-3.9.0-0.47.0

https://github.com/openshift/openshift-ansible/commit/79e283ad98af57ecd4a4105fe561f0b0c4c53f6e#diff-ebcf31d9a3d2b05b096049dd00fb0b1b

So the upgrade playbook can finish with openshift-ansible v3.9.0-0.45.0

Comment 16 Johnny Liu 2018-03-05 09:18:08 UTC
BZ#1547803 is another issue, unrelated to this bug.

@Justin, to keep testing moving, could you help confirm comment 13? If you agree this is an invalid test scenario, I propose closing this bug as NOTABUG. If not, QE will only run some regression testing to make sure no new issues are introduced (because QE could not reproduce this bug, we cannot be sure the PR really resolves your issue).

Comment 17 Justin Pierce 2018-03-05 14:00:12 UTC
The free-int upgrade was run using 3.9 playbooks and it was upgrading a slightly older 3.9 environment.

Since I was instructed to disable service broker for subsequent deployments to free-int, I will not be able to validate this either.

Comment 18 Weihua Meng 2018-03-07 09:36:49 UTC
Fixed.
openshift-ansible-3.9.3-1.git.0.e166207.el7.noarch

upgrade from openshift v3.9.0-0.38.0

# oc get pods -n kube-service-catalog
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-cclsb            1/1       Running   0          33m
controller-manager-c6tr5   1/1       Running   0          33m

Comment 19 Bing Li 2018-03-09 02:26:58 UTC
On free-int, kube-service-catalog/apiserver pod in crash loop:

# oc get pods -n kube-service-catalog
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-v5d5t            0/1       CrashLoopBackOff   2414       8d
controller-manager-wp4vw   0/1       CrashLoopBackOff   1524       8d

Comment 21 Vadim Rutkovsky 2018-03-09 14:55:35 UTC
That looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1547803 - the broker gets deployed correctly, but once the healthz endpoint is hit, the broker crashes.

Comment 23 Scott Dodson 2018-03-09 19:36:04 UTC
Looking at free-int, it's clear that the aggregator has not been deployed, so I think we should ensure that it gets invoked during the control plane upgrade in 3.9 as well.

https://github.com/openshift/openshift-ansible/pull/7478

Comment 24 Scott Dodson 2018-03-09 22:53:31 UTC
I've verified locally that the patch will run the wire_aggregator tasks, but since this problem was reported against an environment, we cannot test this until that environment is upgraded again using a version of openshift-ansible that contains the fix.

I'll move this to MODIFIED once the fix is merged, though.

Is free-int the only environment in which we've attempted to deploy the service catalog?

Comment 25 Weihua Meng 2018-03-12 14:14:13 UTC
Hi, @Scott
I cannot reproduce it.

Could you give more detailed steps to reproduce it?
Thanks.

Comment 26 Michael Gugino 2018-03-12 14:58:12 UTC
(In reply to Weihua Meng from comment #25)
> Hi, @Scott
> I cannot reproduce it.
> 
> Could you give more detailed steps to reproduce it?
> Thanks.

To replicate the issue:

Upgrade a 3.6 cluster to a 3.7 cluster with the 3.7 GA release tag/rpm. We were missing this logic in that version of openshift-ansible.

Then upgrade to 3.9 using the 3.9 branch before the fix commit.

Comment 27 Scott Dodson 2018-03-12 15:13:00 UTC
Yeah, I think in free-int the series of events was

1) Install 3.6
2) Upgrade to 3.7 using playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml and playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml
3) Upgrade to 3.9 using playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml and playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_nodes.yml
4) Install service catalog
5) Service catalog crash loops because the API Aggregator was not configured in either step 2 or 3 as it should've been.
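
For completeness, that sequence corresponds roughly to the following commands (a sketch; the inventory path is a placeholder, the upgrade playbook paths are the ones quoted in steps 2-3, and the service catalog playbook path is the one used later in comment 28):

# ansible-playbook -i /path/to/hosts playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml
# ansible-playbook -i /path/to/hosts playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml
# ansible-playbook -i /path/to/hosts playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml
# ansible-playbook -i /path/to/hosts playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_nodes.yml
# ansible-playbook -i /path/to/hosts playbooks/openshift-service-catalog/config.yml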

Comment 28 Weihua Meng 2018-03-13 07:46:06 UTC
Failed.
openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

1. RPM install OCP v3.6.173.0.104
2. upgrade with openshift-ansible-3.7.14-1.git.0.4b35b2d.el7.noarch
openshift_enable_service_catalog=false
openshift_web_console_install=false

3. upgrade with openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
openshift_enable_service_catalog=false
openshift_web_console_install=false

4. install service catalog with openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

playbooks/openshift-service-catalog/config.yml

openshift_enable_service_catalog=true
openshift_service_catalog_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_selector={"role": "node"}
openshift_web_console_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-

Note: in the 3.9.7 upgrade log, I did not find the play "Configure API aggregation on masters", which was added by the new PR.

Comment 32 Weihua Meng 2018-03-13 07:59:56 UTC
Ansible failed when installing the service catalog.

Seems to be an etcd issue.

# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld

Comment 33 Jeff Peeler 2018-03-13 14:23:15 UTC
What does "oc describe po -n kube-service-catalog -lapp=apiserver" report? My guess at the moment is even though the port number for etcd has been corrected to use 2379 instead of 4001, no ansible was added to correct installs in 3.9 because it's assumed the latest 3.7 code upgrade was done first. (Btw, when is 3.7.24+ errata being released?)

Comment 34 Michael Gugino 2018-03-13 15:45:14 UTC
I am looking into this to see if I can replicate it with an older version of 3.7.

Comment 35 Michael Gugino 2018-03-13 18:30:38 UTC
Okay, I believe I have isolated the root cause for this. It's our old pal, openshift_facts.

If "master": {"etcd_port": "1111"} is present inside /etc/ansible/facts.d/openshift.fact, our installer goes with that value, no matter what. (I set 1111 in that file for testing.) We don't override it; we preserve whatever was there.

Steps to reproduce:

1) Install a cluster with 3.9, no service catalog
2) Inject the aforementioned value into the openshift.fact file (the master key will already be present in that JSON file, you just need to add the etcd_port bits).
3) Attempt to install the service catalog.

This will affect anyone who ever deployed with the old value, as it will be preserved by openshift_facts.
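
To check whether a host is carrying the stale cached value before or after step 2 (a sketch; python is only used here to pretty-print the JSON and is not required):

# grep etcd_port /etc/ansible/facts.d/openshift.fact
# python -m json.tool /etc/ansible/facts.d/openshift.fact | grep -B2 -A2 etcd_port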

Comment 36 Weihua Meng 2018-03-14 00:35:29 UTC
# oc describe po -n kube-service-catalog -lapp=apiserver
Name:           apiserver-lzfzn
Namespace:      kube-service-catalog
Node:           wmengupgraderpm363-master-1/10.240.0.190
Start Time:     Tue, 13 Mar 2018 01:07:49 -0400
Labels:         app=apiserver
                controller-revision-hash=687006152
                pod-template-generation=1
Annotations:    ca_hash=cbdc9f97cf232061e7083729fcd96335ee813aa6
                openshift.io/scc=hostmount-anyuid
Status:         Running
IP:             10.128.0.7
Controlled By:  DaemonSet/apiserver
Containers:
  apiserver:
    Container ID:  docker://744a0d4f47cf5647467a584a493779573808df6224106235c816f813bc8bf72f
    Image:         registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.7
    Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:5de3bab01891975d221a03ad1905dc5671f9d4f23ee1099fae5f122f9715e950
    Port:          6443/TCP
    Command:
      /usr/bin/service-catalog
    Args:
      apiserver
      --storage-type
      etcd
      --secure-port
      6443
      --etcd-servers
      https://wmengupgraderpm363-master-1:2379
      --etcd-cafile
      /etc/origin/master/master.etcd-ca.crt
      --etcd-certfile
      /etc/origin/master/master.etcd-client.crt
      --etcd-keyfile
      /etc/origin/master/master.etcd-client.key
      -v
      10
      --cors-allowed-origins
      localhost
      --admission-control
      KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck
      --feature-gates
      OriginatingIdentity=true
    State:          Running
      Started:      Tue, 13 Mar 2018 01:07:56 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/origin/master from etcd-host-cert (ro)
      /var/run/kubernetes-service-catalog from apiserver-ssl (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from service-catalog-apiserver-token-bqklm (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  apiserver-ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  apiserver-ssl
    Optional:    false
  etcd-host-cert:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/origin/master
    HostPathType:  
  data-dir:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  
  service-catalog-apiserver-token-bqklm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  service-catalog-apiserver-token-bqklm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  openshift-infra=apiserver
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:          <none>

Comment 37 Weihua Meng 2018-03-14 00:43:09 UTC
We need a robust fix for this, as we cannot guarantee that all customers are using the latest 3.7 errata build before upgrading to 3.9.
The latest public errata has 3.7.23 for now.

Comment 38 Weihua Meng 2018-03-14 00:44:40 UTC
Jianlin, can you answer the question in comment 33?
When is the 3.7.24+ errata being released?
Thanks.

Comment 39 Michael Gugino 2018-03-14 02:16:03 UTC
PR Created: https://github.com/openshift/openshift-ansible/pull/7516

This will affect all recent branches; I will backport/forward-port.

@Jianlin, @Weihua,

If `grep etcd_port /etc/ansible/facts.d/openshift.fact` returns a match on the first master host, then the service catalog will not get the right value for etcd_port. This is because local_facts take precedence over defaults in openshift_facts. This will affect any old cluster that had previously placed 'etcd_port' inside the 'master' dictionary in that file. The master branch no longer places that value in the fact file, but it will still respect old values that were placed there before.

Hopefully in 3.10 we can remove this file entirely.
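
Once the fix is in, the quickest confirmation is to look at the rendered daemonset args (a sketch; the daemonset name "apiserver" comes from the describe output in comment 36):

# oc get ds apiserver -n kube-service-catalog -o yaml | grep -A1 'etcd-servers'

The line after --etcd-servers should point at port 2379, not 4001 or whatever stale value was cached in openshift.fact.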

Comment 40 Johnny Liu 2018-03-14 07:13:07 UTC
(In reply to Weihua Meng from comment #38)
> Jianlin, can you answer the question in commet 33?
> when is 3.7.24+ errata being released?
> Thanks.

The latest 3.7.24+ errata will be released via https://errata.devel.redhat.com/advisory/32336; it is still in NEW_FILE state. Because 3.9 is higher priority, it will probably be released after 3.9 GA.

Comment 41 Michael Gugino 2018-03-14 14:10:51 UTC
Backport to 3.7 created: https://github.com/openshift/openshift-ansible/pull/7523

Comment 42 Scott Dodson 2018-03-14 15:30:08 UTC
Cloned for 3.7.z

Comment 43 Weihua Meng 2018-03-15 02:39:56 UTC
@Gaoyun please check if scale etcd is OK with this change. Thanks.

Comment 45 Weihua Meng 2018-03-15 10:44:15 UTC
@Justin, what is our online cluster config: dedicated etcd hosts, or etcd on the master hosts?
Thanks.

Comment 46 Weihua Meng 2018-03-15 10:54:06 UTC
I found the cause for the failure.

      --etcd-servers
      https://wmengupgraderpm364-master-1:2379

External etcd is used for this cluster, so the etcd server should be
https://wmengupgraderpm364-etcd-1:2379
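
One way to compare what the masters use against what the catalog apiserver was given (a sketch; the master-config path is the standard 3.x location):

# grep -A5 'etcdClientInfo' /etc/origin/master/master-config.yaml
# oc describe po -n kube-service-catalog -lapp=apiserver | grep -A1 'etcd-servers'

On an external-etcd cluster the first should list the dedicated etcd host(s), and the second should match it rather than pointing at a master.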

Comment 47 Michael Gugino 2018-03-15 18:38:32 UTC
PR created for 3.9: https://github.com/openshift/openshift-ansible/pull/7542

I'm unsure whether 3.7 is affected by this condition; it may be a regression in 3.9.

Comment 48 Scott Dodson 2018-03-15 20:14:42 UTC
Forking the specific scenario about upgraded 3.9 environments with external etcd to https://bugzilla.redhat.com/show_bug.cgi?id=1557036

Comment 49 Weihua Meng 2018-03-16 09:01:06 UTC
Fixed with etcd on the master hosts.
openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch

Issue with dedicated etcd hosts is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1557036

# oc get pods --all-namespaces
NAMESPACE                           NAME                             READY     STATUS      RESTARTS   AGE
default                             docker-registry-6-b7brh          1/1       Running     0          20m
default                             registry-console-3-2kxmj         1/1       Running     0          27m
default                             router-3-8fbpx                   1/1       Running     0          22m
default                             router-3-fmc5g                   1/1       Running     0          20m
install-test                        mongodb-1-kj7s5                  1/1       Running     0          20m
install-test                        nodejs-mongodb-example-1-p8j74   1/1       Running     0          20m
kube-service-catalog                apiserver-rxt7w                  1/1       Running     0          7m
kube-service-catalog                controller-manager-w6jst         1/1       Running     0          7m
openshift-ansible-service-broker    asb-1-r2kmf                      1/1       Running     2          6m
openshift-ansible-service-broker    asb-etcd-1-llcxt                 1/1       Running     0          6m
openshift-template-service-broker   apiserver-s9fzt                  1/1       Running     0          6m
openshift-template-service-broker   apiserver-xhxvd                  1/1       Running     0          6m
openshift-template-service-broker   apiserver-zldrz                  1/1       Running     0          6m
wmeng                               cakephp-mysql-example-1-7rkcx    1/1       Running     0          1m
wmeng                               cakephp-mysql-example-1-build    0/1       Completed   0          2m
wmeng                               mysql-1-nm6ww                    1/1       Running     0          2m

# curl -k https://apiserver.kube-service-catalog.svc/healthz
ok

Comment 52 errata-xmlrpc 2018-03-28 14:29:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

