Bug 1546365
| Summary: | [free-int] kube-service-catalog/apiserver pod in crash loop after upgrade | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> | |
| Component: | Cluster Version Operator | Assignee: | Michael Gugino <mgugino> | |
| Status: | CLOSED ERRATA | QA Contact: | Weihua Meng <wmeng> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 3.9.0 | CC: | aos-bugs, bingli, dyocum, gpei, jialiu, jokerman, jpeeler, jupierce, mgugino, mmccomas, pmorie, rteague, sdodson, vrutkovs, wmeng, xtian | |
| Target Milestone: | --- | Flags: | jupierce: needinfo- | |
| Target Release: | 3.9.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | | Doc Type: | If docs needed, set a value | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | ||||
| : | 1555394 (view as bug list) | Environment: | ||
| Last Closed: | 2018-03-28 14:29:21 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1547803 | |||
| Bug Blocks: | 1555394 | |||
Description
Justin Pierce
2018-02-16 21:23:15 UTC
Jeff, will you please take a look at this (with me) on Monday?

If the requestheader-client-ca file is missing, that points to the aggregator not being set up. There was an internal email requesting this to be handled, but since this is on a new install it must not have been. Was this a new install or an upgrade?

I meant to say above that it's likely the upgrade path is not handled. A fresh install should be working, though, especially since it was for new 3.7 installs.

jpeeler, this was an upgrade from 3.9 to a slightly newer build of 3.9. However, it appears this is the first time the service catalog has been enabled during an upgrade of this cluster (kube-service-catalog did not previously exist).

Need to ensure that wire_aggregator is called via control_plane_upgrade and moved out of upgrade.yml, which only runs during all-in-one upgrades. Also need to fix 3.7 to ensure that the aggregator is installed during upgrades there as well.

The aggregator is set up during installs on new clusters since 3.7. The aggregator is also configured on 3.7 upgrades on the 3.7 branch. However, the aggregator is not configured on 3.7 upgrades on 3.9. I will add the aggregator to 3.7 upgrades on 3.9; then all hosts should have the aggregator by 3.7, and there is no need to run this during later upgrades. It looks like, to replicate this, one must have a 3.6 release and upgrade to 3.7 on master.

PR submitted: https://github.com/openshift/openshift-ansible/pull/7233

Paths in the PR need to be corrected. New PR created: https://github.com/openshift/openshift-ansible/pull/7270

Hi, Justin. Can you help verify this bug? Thanks.

(In reply to Michael Gugino from comment #7)
> aggregator is also configured on 3.7 upgrades on 3.7 branch. However,
> aggregator is not configured on 3.7 upgrades on 3.9.
Does that mean the free-int upgrade is using 3.9 code to run a 3.6->3.7 upgrade? If yes, I personally think this should be an invalid case; QE has never encountered such an issue.

(In reply to Michael Gugino from comment #10)
> New PR Created: https://github.com/openshift/openshift-ansible/pull/7270
In the 3.9 installer this gives a fix for v3_7/upgrade.yml. As far as I know, that playbook is only used for 3.6->3.7. Does that mean we also support, or agree to let users, run a 3.6->3.7 upgrade with the 3.9 installer? This looks really strange and would bring a lot of noise. To avoid that noise, we should not ship the old upgrade code (e.g. 3.6->3.7) in the 3.9 installer, and only keep the 3.7->3.9 code.

I tried with openshift-ansible-3.9.1-1.git.0.9862628.el7.noarch:
# ansible-playbook -vvv /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml
got a failure during the upgrade.
Failure summary:
1. Hosts: host-8-250-81.host.centralci.eng.rdu2.redhat.com
Play: Upgrade Service Catalog
Task: wait for api server to be ready
Message: Status code was not [200]: HTTP Error 500: Internal Server Error
This failure is the same as bug https://bugzilla.redhat.com/show_bug.cgi?id=1547803,
so that bug blocks verification of this one.
The logic for the "wait for api server to be ready" task changed starting in openshift-ansible-3.9.0-0.47.0: https://github.com/openshift/openshift-ansible/commit/79e283ad98af57ecd4a4105fe561f0b0c4c53f6e#diff-ebcf31d9a3d2b05b096049dd00fb0b1b
So the upgrade playbook can finish with openshift-ansible v3.9.0-0.45.0. BZ#1547803 is a separate issue, unrelated to this bug.

@Justin, to move testing forward, could you help confirm comment 13? If you agree this is an invalid test scenario, I propose closing this bug as NOTABUG. If not, QE will only run some regression testing to make sure no new issues are introduced (because QE could not reproduce this bug, we cannot be sure the PR really resolves your issue).

The free-int upgrade was run using 3.9 playbooks and it was upgrading a slightly older 3.9 environment. Since I was instructed to disable the service broker for subsequent deployments to free-int, I will not be able to validate this either.

Fixed.
openshift-ansible-3.9.3-1.git.0.e166207.el7.noarch
upgrade from openshift v3.9.0-0.38.0
# oc get pods -n kube-service-catalog
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-cclsb            1/1       Running   0          33m
controller-manager-c6tr5   1/1       Running   0          33m

On free-int, the kube-service-catalog/apiserver pod is in a crash loop:
# oc get pods -n kube-service-catalog
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-v5d5t            0/1       CrashLoopBackOff   2414       8d
controller-manager-wp4vw   0/1       CrashLoopBackOff   1524       8d

That looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1547803 - the broker gets deployed correctly, but once the healthz endpoint is hit, the broker crashes.

Looking at free-int it's clear that the aggregator has not been deployed, so I think we should ensure that it gets invoked during the control plane upgrade in 3.9 as well.
https://github.com/openshift/openshift-ansible/pull/7478
I've verified locally that the patch will run the wire_aggregator tasks, but since this problem is reported against an environment, we cannot test this until that environment is upgraded again using a version of openshift-ansible which contains the fix. I'll move this to MODIFIED once the fix is merged, though.
Is free-int the only environment in which we've attempted to deploy the service catalog?

Hi, @Scott. I cannot reproduce it. Could you give more detailed steps to reproduce it? Thanks.

(In reply to Weihua Meng from comment #25)
> Hi, @Scott
> I cannot reproduce it.
>
> Could you give more detailed steps to reproduce it?
> Thanks.
To replicate the issue: upgrade a 3.6 cluster to a 3.7 cluster with the 3.7 GA release tag/rpm (we were missing this logic in that version of openshift-ansible), then upgrade to 3.9 using the 3.9 branch before the fix commit.

Yeah, I think in free-int the series of events was:
1) Install 3.6
2) Upgrade to 3.7 using playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml and playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml
3) Upgrade to 3.9 using playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml and playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_nodes.yml
4) Install the service catalog
5) The service catalog crash loops because the API aggregator was not configured in either step 2 or 3 as it should have been.

Failed.
openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
1. RPM install OCP v3.6.173.0.104
2. upgrade with openshift-ansible-3.7.14-1.git.0.4b35b2d.el7.noarch
openshift_enable_service_catalog=false
openshift_web_console_install=false
3. upgrade with openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
openshift_enable_service_catalog=false
openshift_web_console_install=false
4. install service catalog with openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch
playbooks/openshift-service-catalog/config.yml
openshift_enable_service_catalog=true
openshift_service_catalog_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_selector={"role": "node"}
openshift_web_console_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
Note: in the 3.9.7 upgrade log, did not find the play
name: Configure API aggregation on masters
which is added by the new PR.
Ansible failed when installing the service catalog. It seems to be an etcd issue:
# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld

What does "oc describe po -n kube-service-catalog -lapp=apiserver" report? My guess at the moment is that even though the port number for etcd has been corrected to use 2379 instead of 4001, no ansible was added to correct installs in 3.9 because it's assumed the latest 3.7 code upgrade was done first. (Btw, when is the 3.7.24+ errata being released?)

I am looking into this to see if I can replicate it with an older version of 3.7.

Okay, I believe I have isolated the root cause for this. It's our old
pal, openshift_facts.
If "master": {"etcd_port": "1111"}" is present inside
/etc/ansible/facts.d/openshift.fact, our installer goes with that
value, no matter what. (I set 1111 for testing in that file). We
don't override it, we preserve whatever was there.
Steps to reproduce:
1) Install cluster with 3.9, no service catalog
2) Inject aforementioned value into openshift.fact file (master key
will already be present in that json file, just need to add etcd_port
bits).
3) Attempt to install service catalog.
This will affect anyone who ever deployed with the old value as it
will be preserved by openshift_facts.
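
For illustration only, a stale fact file on the first master carries the old port under the master key roughly like this (a minimal sketch; a real openshift.fact holds many more keys, and 4001 is the old etcd port mentioned above). Any match from a simple grep for the key means the stale value will override the installer default:

# cat /etc/ansible/facts.d/openshift.fact   # illustrative contents below
{
  "master": {
    "etcd_port": "4001"
  }
}
# grep etcd_port /etc/ansible/facts.d/openshift.fact   # any output means this cluster is affected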
# oc describe po -n kube-service-catalog -lapp=apiserver
Name: apiserver-lzfzn
Namespace: kube-service-catalog
Node: wmengupgraderpm363-master-1/10.240.0.190
Start Time: Tue, 13 Mar 2018 01:07:49 -0400
Labels: app=apiserver
controller-revision-hash=687006152
pod-template-generation=1
Annotations: ca_hash=cbdc9f97cf232061e7083729fcd96335ee813aa6
openshift.io/scc=hostmount-anyuid
Status: Running
IP: 10.128.0.7
Controlled By: DaemonSet/apiserver
Containers:
apiserver:
Container ID: docker://744a0d4f47cf5647467a584a493779573808df6224106235c816f813bc8bf72f
Image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.7
Image ID: docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:5de3bab01891975d221a03ad1905dc5671f9d4f23ee1099fae5f122f9715e950
Port: 6443/TCP
Command:
/usr/bin/service-catalog
Args:
apiserver
--storage-type
etcd
--secure-port
6443
--etcd-servers
https://wmengupgraderpm363-master-1:2379
--etcd-cafile
/etc/origin/master/master.etcd-ca.crt
--etcd-certfile
/etc/origin/master/master.etcd-client.crt
--etcd-keyfile
/etc/origin/master/master.etcd-client.key
-v
10
--cors-allowed-origins
localhost
--admission-control
KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck
--feature-gates
OriginatingIdentity=true
State: Running
Started: Tue, 13 Mar 2018 01:07:56 -0400
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/etc/origin/master from etcd-host-cert (ro)
/var/run/kubernetes-service-catalog from apiserver-ssl (ro)
/var/run/secrets/kubernetes.io/serviceaccount from service-catalog-apiserver-token-bqklm (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
apiserver-ssl:
Type: Secret (a volume populated by a Secret)
SecretName: apiserver-ssl
Optional: false
etcd-host-cert:
Type: HostPath (bare host directory volume)
Path: /etc/origin/master
HostPathType:
data-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
service-catalog-apiserver-token-bqklm:
Type: Secret (a volume populated by a Secret)
SecretName: service-catalog-apiserver-token-bqklm
Optional: false
QoS Class: BestEffort
Node-Selectors: openshift-infra=apiserver
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events: <none>
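
The pod spec above shows the catalog apiserver pointed at https://wmengupgraderpm363-master-1:2379 with the master's etcd client certificates. A quick way to check from that master whether the endpoint is actually serving etcd is to hit its health URL with the same certificates (a sketch; substitute the host from your own pod spec):

# curl --cacert /etc/origin/master/master.etcd-ca.crt \
       --cert /etc/origin/master/master.etcd-client.crt \
       --key /etc/origin/master/master.etcd-client.key \
       https://wmengupgraderpm363-master-1:2379/health   # host and cert paths taken from the args above

A healthy member answers with a small JSON health document; a refusal or error here would line up with the "[-]etcd failed" result reported from the catalog apiserver's healthz earlier in the thread.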
We need a robust fix for this, as we cannot guarantee all customers are using the latest 3.7 errata build before upgrading to 3.9, and the latest public errata has 3.7.23 for now.

Jianlin, can you answer the question in comment 33? When is the 3.7.24+ errata being released? Thanks.

PR created: https://github.com/openshift/openshift-ansible/pull/7516
This will affect all recent branches; will backport/forward port.

@Jianlin, @Weihua, if `grep etcd_port /etc/ansible/facts.d/openshift.fact` returns a match on the first master host, then the service catalog will not get the right value for etcd_port. This is due to the behavior that local_facts take precedence over defaults in openshift_facts. It will affect any old cluster that had previously placed 'etcd_port' inside the 'master' dictionary in that file. The master branch no longer places that value inside the fact file, but it will still respect old values that were placed there before. Hopefully in 3.10 we can remove this file entirely.

(In reply to Weihua Meng from comment #38)
> Jianlin, can you answer the question in comment 33?
> when is 3.7.24+ errata being released?
> Thanks.
The latest 3.7.24+ errata will be released in https://errata.devel.redhat.com/advisory/32336. It is still in NEW_FILE state; because 3.9 is higher priority, it will probably be released after the 3.9 GA.

Backport to 3.7 created: https://github.com/openshift/openshift-ansible/pull/7523

Cloned for 3.7.z.

@Gaoyun, please check if scale etcd is OK with this change. Thanks.

@Justin, what is our online cluster config, dedicated etcd hosts or etcd on the master hosts? Thanks.

I found the cause for the failure.
--etcd-servers
https://wmengupgraderpm364-master-1:2379
External etcd is used for this cluster, so the etcd server should be
https://wmengupgraderpm364-etcd-1:2379
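
To see which etcd endpoint the catalog apiserver is actually configured with on an affected cluster, checking the daemonset spec is enough (a sketch; the daemonset name matches the "Controlled By: DaemonSet/apiserver" line in the describe output above):

# oc get daemonset apiserver -n kube-service-catalog -o yaml | grep -A1 etcd-servers   # prints the --etcd-servers arg and the URL that follows it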
PR created for 3.9: https://github.com/openshift/openshift-ansible/pull/7542
I'm unsure if 3.7 will be affected by this condition; it may be a regression in 3.9.

Forking the specific scenario about upgraded 3.9 environments with external etcd to https://bugzilla.redhat.com/show_bug.cgi?id=1557036

Fixed with etcd on the master hosts.
openshift-ansible-3.9.9-1.git.0.1a1f7d8.el7.noarch
The issue with dedicated etcd hosts is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1557036
# oc get pods --all-namespaces
NAMESPACE                           NAME                              READY     STATUS      RESTARTS   AGE
default                             docker-registry-6-b7brh           1/1       Running     0          20m
default                             registry-console-3-2kxmj          1/1       Running     0          27m
default                             router-3-8fbpx                    1/1       Running     0          22m
default                             router-3-fmc5g                    1/1       Running     0          20m
install-test                        mongodb-1-kj7s5                   1/1       Running     0          20m
install-test                        nodejs-mongodb-example-1-p8j74    1/1       Running     0          20m
kube-service-catalog                apiserver-rxt7w                   1/1       Running     0          7m
kube-service-catalog                controller-manager-w6jst          1/1       Running     0          7m
openshift-ansible-service-broker    asb-1-r2kmf                       1/1       Running     2          6m
openshift-ansible-service-broker    asb-etcd-1-llcxt                  1/1       Running     0          6m
openshift-template-service-broker   apiserver-s9fzt                   1/1       Running     0          6m
openshift-template-service-broker   apiserver-xhxvd                   1/1       Running     0          6m
openshift-template-service-broker   apiserver-zldrz                   1/1       Running     0          6m
wmeng                               cakephp-mysql-example-1-7rkcx     1/1       Running     0          1m
wmeng                               cakephp-mysql-example-1-build     0/1       Completed   0          2m
wmeng                               mysql-1-nm6ww                     1/1       Running     0          2m
# curl -k https://apiserver.kube-service-catalog.svc/healthz
ok

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489