Bug 1902247

Summary:	openshift-oauth-apiserver apiserver pod crashloopbackoffs
Product:	OpenShift Container Platform	Reporter:	Ricardo Carrillo Cruz <ricarril>
Component:	Etcd	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED ERRATA	QA Contact:	ge liu <geliu>
Severity:	medium	Docs Contact:
Priority:	low
Version:	4.7	CC:	anbhat, aos-bugs, lszaszki, mfojtik
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	LifecycleReset
Fixed In Version:		Doc Type:	Enhancement
Doc Text:	Feature: The operator should fall back to Spec.ServiceNetwork if Status.ServiceNetwork is not populated. Reason: The etcd-endpoints configmap ended up empty when network.Status.ServiceNetwork was not populated, and cluster-etcd-operator failed to scale up in that case. Result: With the enhancement implemented, the cluster comes up even if Status.ServiceNetwork is not populated.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 22:34:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1958416

Description Ricardo Carrillo Cruz 2020-11-27 13:24:15 UTC

[ricky@localhost cluster-network-operator]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          166m    Working towards 4.7.0-0.ci-2020-11-27-070754: 73% complete
[ricky@localhost cluster-network-operator]$ oc -n openshift-oauth-apiserver get pods
NAME                         READY   STATUS             RESTARTS   AGE
apiserver-748f8fcc55-fbldf   0/1     Init:0/1           0          155m
apiserver-7987c8fb46-t5zrh   0/1     CrashLoopBackOff   34         150m
apiserver-dc64445f6-2nkxt    0/1     Init:0/1           0          155m
[ricky@localhost cluster-network-operator]$ oc -n openshift-oauth-apiserver logs apiserver-7987c8fb46-t5zrh | head -n10
Error: --etcd-servers must be specified
Usage:
oauth-apiserver start [flags]

Flags:
--admission-control-config-file string File with admission control configuration.
--advertise-address ip The IP address on which to advertise the apiserver to members of the cluster. This address must be reachable by the rest of the cluster. If blank, the --bind-address will be used. If --bind-address is unspecified, the host's default interface will be used.
--audit-log-batch-buffer-size int The size of the buffer to store events before batching and writing. Only used in batch mode. (default 10000)
--audit-log-batch-max-size int The maximum size of a batch. Only used in batch mode. (default 1)
--audit-log-batch-max-wait duration The amount of time to wait before force writing the batch that hadn't reached the max size. Only used in batch mode.

Comment 1 Ricardo Carrillo Cruz 2020-11-27 13:24:44 UTC

https://drive.google.com/file/d/1jmytOyiI7mAFDP16b6xa4Ax2i9Hpa2Ui/view?usp=sharing

Comment 2 Lukasz Szaszkiewicz 2020-11-30 11:32:01 UTC

The provided must-gather shows that the mandatory etcd-endpoints configmap was empty: it didn't contain any IP addresses.
The cluster-authentication-operator (CAO) didn't check the content of the configmap; it simply tried to install the API server.
For the API server, "--etcd-servers" is mandatory.
I agree that the operators should examine the content of the configmap before installing the API servers.



openshift-authentication-operator:
E1127 11:34:23.937053       1 base_controller.go:250] "ConfigObserver" controller failed to sync "key", err: configmaps openshift-etcd/etcd-endpoints: no etcd endpoint addresses found
E1127 11:34:24.937055       1 base_controller.go:250] "ConfigObserver" controller failed to sync "key", err: configmaps openshift-etcd/etcd-endpoints: no etcd endpoint addresses found


k get configmap -n openshift-etcd etcd-endpoints -oyaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    alpha.installer.openshift.io/etcd-bootstrap: 10.0.0.6
  creationTimestamp: "2020-11-27T09:45:04Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:alpha.installer.openshift.io/etcd-bootstrap: {}
    manager: cluster-bootstrap
    operation: Update
    time: "2020-11-27T09:45:04Z"
  name: etcd-endpoints
  namespace: openshift-etcd
  resourceVersion: "481"
  selfLink: /api/v1/namespaces/openshift-etcd/configmaps/etcd-endpoints
  uid: 70076ec7-9ebd-4c56-8a4f-ab5bdb1e52ca
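
The content check suggested above could look roughly like the following. This is a minimal Python sketch for illustration, not the operator's actual Go code. The function name `etcd_servers_from_configmap`, the exception `NoEtcdEndpointsError`, the assumption that each value under `data` is a bare IPv4 address, and the default client port 2379 are all assumptions that do not come from this bug.

```python
from typing import Any, Dict, List


class NoEtcdEndpointsError(Exception):
    """Raised when the configmap carries no endpoint addresses (hypothetical)."""


def etcd_servers_from_configmap(cm: Dict[str, Any], port: int = 2379) -> List[str]:
    """Build --etcd-servers URLs from a configmap dict, refusing an empty one.

    Assumes each value under "data" is a bare IPv4 address.
    """
    # A configmap with only metadata, like the dump above, has no "data" key.
    data = cm.get("data") or {}
    addresses = sorted(v.strip() for v in data.values() if v and v.strip())
    if not addresses:
        # Fail here instead of starting the API server without --etcd-servers.
        raise NoEtcdEndpointsError(
            "configmaps openshift-etcd/etcd-endpoints: no etcd endpoint addresses found"
        )
    return ["https://{}:{}".format(ip, port) for ip in addresses]
```

With a check like this, a caller would hold back the API server rollout until the configmap yields at least one address, rather than letting the pod crash on the missing flag.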

Comment 3 Lukasz Szaszkiewicz 2020-11-30 11:33:14 UTC

I'm assigning this issue to the etcd team to have a look and see why the openshift-etcd/etcd-endpoints configmap was empty.

Comment 4 Sam Batschelet 2020-11-30 22:14:44 UTC

Usually, the install-config for the cluster is persisted to kube-system configmaps but it is not included in this must-gather. Could you post your install-config for review please?

Comment 5 Michal Fojtik 2020-12-30 22:58:21 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 6 Sam Batschelet 2021-01-22 16:03:32 UTC

We don't see any endpoints because we have not scaled up etcd.

```
- apiVersion: v1
  kind: ConfigMap
  metadata:
    annotations:
      alpha.installer.openshift.io/etcd-bootstrap: 10.0.0.6
```

The etcd-operator logs suggest it cannot find the dependencies it needs from the node to perform scale actions; we require internalIP to be populated [1].

> "BootstrapTeardownController" controller failed to sync "key", err: failed to get internal IP for node: networks.config.openshift.io/cluster: status.serviceNetwork not found


Moving to the SDN team to triage the missing status.

[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.7/pkg/dnshelpers/util.go#L39

Comment 7 Dan Winship 2021-02-10 13:36:58 UTC

(In reply to Sam Batschelet from comment #6)
> > "BootstrapTeardownController" controller failed to sync "key", err: failed to get internal IP for node: networks.config.openshift.io/cluster: status.serviceNetwork not found
> 
> 
> moving to SDN team to triage missing status.

The network config spec is validated and copied to the status by the CNO, but CNO doesn't start until after bootstrap is complete. Code that runs at bootstrap time needs to look at spec.serviceNetwork instead of status.serviceNetwork.

We probably need to do something better here... I filed https://issues.redhat.com/browse/SDN-1615 about that.

But for now, cluster-etcd-operator should be using spec.serviceNetwork if status.serviceNetwork is unset.
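
As a rough illustration of that fallback, here is a minimal Python sketch. It is not the cluster-etcd-operator's Go implementation. The helper name `service_network` is hypothetical, and it assumes the `networks.config.openshift.io/cluster` object has been loaded into a plain dict.

```python
from typing import Any, Dict, List


def service_network(network: Dict[str, Any]) -> List[str]:
    """Return the service network CIDRs, preferring status over spec."""
    # status.serviceNetwork is only populated once the CNO has run.
    status = (network.get("status") or {}).get("serviceNetwork") or []
    if status:
        return list(status)
    # At bootstrap time the CNO has not started yet, so fall back to the spec.
    spec = (network.get("spec") or {}).get("serviceNetwork") or []
    if spec:
        return list(spec)
    raise ValueError(
        "networks.config.openshift.io/cluster: "
        "serviceNetwork not found in status or spec"
    )
```

For example, calling it with `{"spec": {"serviceNetwork": ["172.30.0.0/16"]}, "status": {}}` returns the spec value; the CIDR here is only an illustrative input.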

Comment 8 Michal Fojtik 2021-02-10 14:25:09 UTC

The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently.
The bug assignee was notified.

Comment 9 Michal Fojtik 2021-03-12 15:07:21 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 10 Michal Fojtik 2021-05-07 16:14:35 UTC

The LifecycleStale keyword was removed because the needinfo? flag was reset.
The bug assignee was notified.

Comment 11 Sam Batschelet 2021-05-07 16:17:02 UTC

> But for now, cluster-etcd-operator should be using spec.serviceNetwork if status.serviceNetwork is unset.

Addressed Dan's comments in the PR. Since this issue can result in a failed install, fixing the bug for 4.8.

Comment 18 errata-xmlrpc 2021-07-27 22:34:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438