Bug 1804681

Summary: kube-scheduler was deployed without a CA bundle containing kube-apiserver's CAs on bootstrap
Product: OpenShift Container Platform Reporter: Tomáš Nožička <tnozicka>
Component: kube-scheduler Assignee: Tomáš Nožička <tnozicka>
Status: CLOSED DEFERRED QA Contact: RamaKasturi <knarra>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.3.0 CC: aos-bugs, jdesousa, lmohanty, maszulik, mfojtik, wking
Target Milestone: ---   
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1810528 (view as bug list) Environment:
Last Closed: 2020-09-11 11:02:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Bug Depends On: 1810528    
Bug Blocks:    

Description Tomáš Nožička 2020-02-19 12:08:30 UTC
level=error msg="Cluster operator kube-controller-manager Degraded is True with MultipleConditionsMatching: NodeControllerDegraded: The master nodes not ready: node \"ci-op-n856n-m-0.c.openshift-gce-devel-ci.internal\" not ready since 2020-02-19 05:31:39 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)\nNodeInstallerDegraded: 1 nodes are failing on revision 3:\nNodeInstallerDegraded: \nInstallerPodContainerWaitingDegraded: Pod \"installer-4-ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal\" on node \"ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal\" container \"installer\" is waiting for 36m3.68513657s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-4-ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal\" on node \"ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal\" observed degraded networking: Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-4-ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal_openshift-kube-controller-manager_45778f91-2fc1-4cf7-a05c-6bc2dd79bf9f_0(15cdbe4051ec6ae29b1c89ee64ffa369f36d9fc3fbd846c4adc4352261e28937): Multus: error adding pod to network \"ovn-kubernetes\": delegateAdd: error invoking DelegateAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '"


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/471

82 hits in the past day (more overall) https://search.svc.ci.openshift.org/?search=Missing+CNI+default+network&maxAge=24h&context=2&type=junit

Comment 1 Juan Luis de Sousa-Valadas 2020-02-19 12:54:21 UTC
I see the pods are Pending, so this is happening earlier than the reported failure.

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/471/artifacts/e2e-gcp/pods.json  | prowser-get-pod  | grep ovn | column -t
openshift-ovn-kubernetes  ovnkube-master-5kq5q  0/4  0  2020-02-19T05:31:29Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-master-c8nmw  0/4  0  2020-02-19T05:33:35Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-master-gpfb5  0/4  0  2020-02-19T05:39:27Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-node-4ztbb    0/3  0  2020-02-19T05:32:05Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-node-c8h2s    0/3  0  2020-02-19T05:31:37Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-node-nbtbw    3/3  0  2020-02-19T05:25:59Z  [{'ip':'10.0.0.6'}]  ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal
openshift-ovn-kubernetes  ovnkube-node-nzcs7    3/3  1  2020-02-19T05:25:59Z  [{'ip':'10.0.0.3'}]  ci-op-n856n-m-1.c.openshift-gce-devel-ci.internal
openshift-ovn-kubernetes  ovnkube-node-qrjmz    0/3  0  2020-02-19T05:31:58Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-node-sk278    0/3  0  2020-02-19T05:32:03Z  <Pending>            <Pending>

I've had a quick look at the events and I don't see any errors. Checking the kube-scheduler logs, I see kube-apiserver is refusing connections:
E0219 05:33:05.696098       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused
E0219 05:33:05.696962       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: Get https://localhost:6443/apis/storage.k8s.io/v1beta1/csinodes?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused
E0219 05:33:05.697887       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicationController: Get https://localhost:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused
E0219 05:33:05.699330       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused

Checking kube-apiserver logs I see bad certs:
I0219 06:07:59.255651       1 log.go:172] http: TLS handshake error from [::1]:54580: remote error: tls: bad certificate
I0219 06:07:59.425076       1 log.go:172] http: TLS handshake error from [::1]:54604: remote error: tls: bad certificate
I0219 06:07:59.506057       1 log.go:172] http: TLS handshake error from [::1]:54616: remote error: tls: bad certificate
I0219 06:07:59.549217       1 log.go:172] http: TLS handshake error from [::1]:54628: remote error: tls: bad certificate
I0219 06:07:59.549340       1 log.go:172] http: TLS handshake error from [::1]:54626: remote error: tls: bad certificate

I believe these certs are generated by the installer, so I'm moving this to the installer. Please let me know if they are generated by another component.

Comment 2 Scott Dodson 2020-02-19 18:42:50 UTC
This is kube-scheduler talking to kube-apiserver with incorrect certificates. Moving to kube-scheduler.

Comment 3 Tomáš Nožička 2020-02-27 16:50:39 UTC
---
I0219 05:27:44.415179       1 server.go:162] Starting Kubernetes Scheduler version v1.16.2

I0219 05:29:34.660585       1 event.go:255] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"98aa3807-f41e-4d0a-8d6d-507311ee00f7", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapCreated' Created ConfigMap/kube-apiserver-server-ca -n openshift-config-managed because it was missing
---

kube-scheduler was deployed without a CA bundle containing kube-apiserver's CAs because we ignore errors in

  https://github.com/openshift/cluster-kube-scheduler-operator/blame/6c156510ffb7737c190b36631dc2340e4e0d11fd/vendor/github.com/openshift/library-go/pkg/operator/resourcesynccontroller/core.go#L21-L23

combined with the lack of a health check on the client connection to kube-apiserver. We rolled out both of the first 2 revisions, and the 3rd one had an unavailable local kube-apiserver whose deployment got stuck: due to the lack of a scheduler we couldn't run the OVN pods, which (through the kubelet) also broke the installer pods, even though those don't need the scheduler because they have nodeName set by the operator.


TODO:
- we need to wait for the ConfigMap to be available, or fail when building the CA bundle (see the sketch below)
- kube-scheduler needs a health check on its client connection to kube-apiserver
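
For illustration only, here is a minimal Go sketch of what these two items could look like. This is not the actual cluster-kube-scheduler-operator/library-go code; the function names are hypothetical and it assumes a recent client-go.

// Hypothetical sketch only: not the actual operator code.
package sketch

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// combineCABundles assembles a CA bundle from the "ca-bundle.crt" key of the
// given ConfigMaps. Unlike the referenced resourcesynccontroller code, it does
// not swallow errors: a missing source ConfigMap fails the whole call, so the
// caller retries instead of rolling out an incomplete CA bundle.
func combineCABundles(ctx context.Context, client corev1client.ConfigMapsGetter, sources []types.NamespacedName) (string, error) {
	var pems []string
	for _, src := range sources {
		cm, err := client.ConfigMaps(src.Namespace).Get(ctx, src.Name, metav1.GetOptions{})
		if err != nil {
			// Not-found is propagated, not ignored.
			return "", fmt.Errorf("getting CA ConfigMap %s/%s: %w", src.Namespace, src.Name, err)
		}
		pem, ok := cm.Data["ca-bundle.crt"]
		if !ok {
			return "", fmt.Errorf("ConfigMap %s/%s has no ca-bundle.crt key", src.Namespace, src.Name)
		}
		pems = append(pems, pem)
	}
	return strings.Join(pems, "\n"), nil
}

// checkAPIServerConnection is what a client-connection health check could look
// like: it asks the connected kube-apiserver for /livez and returns an error
// if the request fails (e.g. TLS rejected because of a bad CA bundle), so a
// probe can surface the broken connection instead of the scheduler running
// silently degraded.
func checkAPIServerConnection(ctx context.Context, client kubernetes.Interface) error {
	if err := client.Discovery().RESTClient().Get().AbsPath("/livez").Do(ctx).Error(); err != nil {
		return fmt.Errorf("kube-apiserver connection check failed: %w", err)
	}
	return nil
}

Propagating the error would make the sync loop retry until, for example, openshift-config-managed/kube-apiserver-server-ca exists (the ConfigMap that, per the log above, was only created at 05:29:34), instead of rolling out a revision with an incomplete bundle.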

Comment 4 Lalatendu Mohanty 2020-02-27 18:16:34 UTC
Please provide answers to the following questions to help with the impact analysis. It is fine if you do not know the answers to some; we can try to find answers to those later.

- What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
- What kind of clusters are impacted because of the bug?
- What cluster functionality is degraded while hitting the bug?
- Can this bug cause data loss? (Data loss = API server data loss, CRD state information loss, etc.)
- Is it possible to recover the cluster from the bug?
  - Is recovery automatic without intervention, i.e. is the buggy condition transient?
  - Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
  - Is recovery possible only after more extensive cluster-admin intervention?
  - Is recovery impossible (bricked cluster)?
- What is the observed rate of failure we see in CI?
- Is there a manual workaround that exists to recover from the bug? If so, what are the manual steps?
- How long before the bug is fixed?

Comment 5 Tomáš Nožička 2020-03-02 14:35:56 UTC
@lmohanty is this an automated message? Why is it adding the Upgrades/UpgradeBlocker keywords without any other info? The only failure referenced isn't an upgrade, so it would be helpful to identify which one you are talking about / care about.

The failure reported here was a bootstrap issue. That also implies some of the answers to the questions raised above, and makes some of them not applicable.

The "Missing CNI default network" network seems to be manifestation of broken kube-scheduler (certs in this instance) when OVN can't deploy, but it may be case by case that needs to be identified. It may be a result of different failure then kube-scheduler as well. As a result of this particular failure I am currently working on 2 fixes for cluster-kube-scheduler-operator. In this case the cert-recovery steps would fix the issue, although for failed bootstrap it's easier and faster to try again. 4.5 should recover automatically.

Comment 6 Lalatendu Mohanty 2020-03-02 15:54:02 UTC
@Tomas this issue recently came up in CI tests and is tracked as part of the OTA team's effort to identify issues that affect upgrades. It affects our ability to enable over-the-air updates in 4.3.z. The Upgrades keyword is used for bugs that are related to upgrades. When we do not get any response to the impact-analysis questions, we mark the bug as UpgradeBlocker to track it for quicker triage and root-cause analysis. This bug came up while discussing https://bugzilla.redhat.com/show_bug.cgi?id=1802246. If you think this is not a blocker, you can give an explanation and remove the keyword. The questions above should give you a fair idea of what we are looking for.

Comment 8 Maciej Szulik 2020-03-17 17:44:42 UTC
This will land in a .z stream; moving accordingly.

Comment 9 Michal Fojtik 2020-05-12 10:33:23 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing severity from "medium" to "low".

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 10 Tomáš Nožička 2020-05-14 11:54:08 UTC
waiting for https://bugzilla.redhat.com/show_bug.cgi?id=1810528 to be backported

Comment 11 Tomáš Nožička 2020-05-20 08:36:02 UTC
This bug is actively worked on.

Comment 12 Tomáš Nožička 2020-06-18 09:14:32 UTC
This bug is actively worked on.

Comment 16 Michal Fojtik 2020-08-24 13:12:22 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.

If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.