Bug 1804681

Summary: kube-scheduler was deployed without a CA bundle containing kube-apiserver's CAs on bootstrap
Product: OpenShift Container Platform Reporter: Tomáš Nožička <tnozicka>
Component: kube-scheduler Assignee: Tomáš Nožička <tnozicka>
Status: CLOSED DEFERRED QA Contact: RamaKasturi <knarra>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.3.0 CC: aos-bugs, jdesousa, lmohanty, maszulik, mfojtik, wking
Target Milestone: ---   
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1810528 (view as bug list) Environment:
Last Closed: 2020-09-11 11:02:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Bug Depends On: 1810528    
Bug Blocks:    

Description Tomáš Nožička 2020-02-19 12:08:30 UTC
level=error msg="Cluster operator kube-controller-manager Degraded is True with MultipleConditionsMatching: NodeControllerDegraded: The master nodes not ready: node \"ci-op-n856n-m-0.c.openshift-gce-devel-ci.internal\" not ready since 2020-02-19 05:31:39 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)\nNodeInstallerDegraded: 1 nodes are failing on revision 3:\nNodeInstallerDegraded: \nInstallerPodContainerWaitingDegraded: Pod \"installer-4-ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal\" on node \"ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal\" container \"installer\" is waiting for 36m3.68513657s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-4-ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal\" on node \"ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal\" observed degraded networking: Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-4-ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal_openshift-kube-controller-manager_45778f91-2fc1-4cf7-a05c-6bc2dd79bf9f_0(15cdbe4051ec6ae29b1c89ee64ffa369f36d9fc3fbd846c4adc4352261e28937): Multus: error adding pod to network \"ovn-kubernetes\": delegateAdd: error invoking DelegateAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '"


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/471

82 hits in the past day (more overall) https://search.svc.ci.openshift.org/?search=Missing+CNI+default+network&maxAge=24h&context=2&type=junit

Comment 1 Juan Luis de Sousa-Valadas 2020-02-19 12:54:21 UTC
I see the pods are Pending, so this is happening earlier than the reported failure.

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/471/artifacts/e2e-gcp/pods.json  | prowser-get-pod  | grep ovn | column -t
openshift-ovn-kubernetes  ovnkube-master-5kq5q  0/4  0  2020-02-19T05:31:29Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-master-c8nmw  0/4  0  2020-02-19T05:33:35Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-master-gpfb5  0/4  0  2020-02-19T05:39:27Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-node-4ztbb    0/3  0  2020-02-19T05:32:05Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-node-c8h2s    0/3  0  2020-02-19T05:31:37Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-node-nbtbw    3/3  0  2020-02-19T05:25:59Z  [{'ip':'10.0.0.6'}]  ci-op-n856n-m-2.c.openshift-gce-devel-ci.internal
openshift-ovn-kubernetes  ovnkube-node-nzcs7    3/3  1  2020-02-19T05:25:59Z  [{'ip':'10.0.0.3'}]  ci-op-n856n-m-1.c.openshift-gce-devel-ci.internal
openshift-ovn-kubernetes  ovnkube-node-qrjmz    0/3  0  2020-02-19T05:31:58Z  <Pending>            <Pending>
openshift-ovn-kubernetes  ovnkube-node-sk278    0/3  0  2020-02-19T05:32:03Z  <Pending>            <Pending>

I've had a quick look at the events and I don't see any errors. Checking the kube-scheduler logs, I see kube-apiserver is refusing connections:
E0219 05:33:05.696098       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused
E0219 05:33:05.696962       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: Get https://localhost:6443/apis/storage.k8s.io/v1beta1/csinodes?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused
E0219 05:33:05.697887       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicationController: Get https://localhost:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused
E0219 05:33:05.699330       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused

Checking kube-apiserver logs I see bad certs:
I0219 06:07:59.255651       1 log.go:172] http: TLS handshake error from [::1]:54580: remote error: tls: bad certificate
I0219 06:07:59.425076       1 log.go:172] http: TLS handshake error from [::1]:54604: remote error: tls: bad certificate
I0219 06:07:59.506057       1 log.go:172] http: TLS handshake error from [::1]:54616: remote error: tls: bad certificate
I0219 06:07:59.549217       1 log.go:172] http: TLS handshake error from [::1]:54628: remote error: tls: bad certificate
I0219 06:07:59.549340       1 log.go:172] http: TLS handshake error from [::1]:54626: remote error: tls: bad certificate

I believe these certs are generated by the installer, so I'm moving this to the installer. Please let me know if they are generated by another component.

Comment 2 Scott Dodson 2020-02-19 18:42:50 UTC
This is kube-scheduler talking to kube-apiserver with incorrect certificates. Moving to kube-scheduler.

Comment 3 Tomáš Nožička 2020-02-27 16:50:39 UTC
---
I0219 05:27:44.415179       1 server.go:162] Starting Kubernetes Scheduler version v1.16.2

I0219 05:29:34.660585       1 event.go:255] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"98aa3807-f41e-4d0a-8d6d-507311ee00f7", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapCreated' Created ConfigMap/kube-apiserver-server-ca -n openshift-config-managed because it was missing
---

kube-scheduler was deployed without a CA bundle containing kube-apiserver's CAs because we ignore errors in

  https://github.com/openshift/cluster-kube-scheduler-operator/blame/6c156510ffb7737c190b36631dc2340e4e0d11fd/vendor/github.com/openshift/library-go/pkg/operator/resourcesynccontroller/core.go#L21-L23

combined with the lack of a health check on the client connection to kube-apiserver. We rolled out both of the first 2 revisions, and the 3rd one had an unavailable local kube-apiserver whose deployment got stuck: due to the lack of a scheduler we couldn't run the OVN pods, which (through the kubelet) also broke the installer pods, even though those don't need the scheduler because they have nodeName set by the operator.


TODO:
- we need to wait for the ConfigMap to be available, or fail when building the CA bundle (see the sketch below)
- kube-scheduler needs a health check on its client connection to kube-apiserver
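
For illustration only, here is a minimal Go sketch of what these two items could look like. This is not the actual cluster-kube-scheduler-operator/library-go code; the function names are hypothetical and it assumes a recent client-go.

// Hypothetical sketch only: not the actual operator code.
package sketch

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// combineCABundles assembles a CA bundle from the "ca-bundle.crt" key of the
// given ConfigMaps. Unlike the referenced resourcesynccontroller code, it does
// not swallow errors: a missing source ConfigMap fails the whole call, so the
// caller retries instead of rolling out an incomplete CA bundle.
func combineCABundles(ctx context.Context, client corev1client.ConfigMapsGetter, sources []types.NamespacedName) (string, error) {
	var pems []string
	for _, src := range sources {
		cm, err := client.ConfigMaps(src.Namespace).Get(ctx, src.Name, metav1.GetOptions{})
		if err != nil {
			// Not-found is propagated, not ignored.
			return "", fmt.Errorf("getting CA ConfigMap %s/%s: %w", src.Namespace, src.Name, err)
		}
		pem, ok := cm.Data["ca-bundle.crt"]
		if !ok {
			return "", fmt.Errorf("ConfigMap %s/%s has no ca-bundle.crt key", src.Namespace, src.Name)
		}
		pems = append(pems, pem)
	}
	return strings.Join(pems, "\n"), nil
}

// checkAPIServerConnection is what a client-connection health check could look
// like: it asks the connected kube-apiserver for /livez and returns an error
// if the request fails (e.g. TLS rejected because of a bad CA bundle), so a
// probe can surface the broken connection instead of the scheduler running
// silently degraded.
func checkAPIServerConnection(ctx context.Context, client kubernetes.Interface) error {
	if err := client.Discovery().RESTClient().Get().AbsPath("/livez").Do(ctx).Error(); err != nil {
		return fmt.Errorf("kube-apiserver connection check failed: %w", err)
	}
	return nil
}

Propagating the error would make the sync loop retry until, for example, openshift-config-managed/kube-apiserver-server-ca exists (the ConfigMap that, per the log above, was only created at 05:29:34), instead of rolling out a revision with an incomplete bundle.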

Comment 4 Lalatendu Mohanty 2020-02-27 18:16:34 UTC
Please provide answers to the following questions to help with the impact analysis. It is fine if you do not know the answers to some; we can try to find answers to those later.

- What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
- What kind of clusters are impacted because of the bug?
- What cluster functionality is degraded while hitting the bug?
- Can this bug cause data loss? (Data loss = API server data loss, CRD state information loss, etc.)
- Is it possible to recover the cluster from the bug?
  - Is recovery automatic without intervention, i.e. is the buggy condition transient?
  - Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
  - Is recovery possible only after more extensive cluster-admin intervention?
  - Is recovery impossible (bricked cluster)?
- What is the observed rate of failure we see in CI?
- Is there a manual workaround that exists to recover from the bug? If so, what are the manual steps?
- How long before the bug is fixed?

Comment 5 Tomáš Nožička 2020-03-02 14:35:56 UTC
@lmohanty is this an automated message? Why is it adding the Upgrades/UpgradeBlocker keywords without any other info? The only failure referenced isn't an upgrade, so it would be helpful to identify which one you are talking about / care about.

The failure reported here was a bootstrap issue. That also implies some of the answers to the questions raised above, and makes some of them not applicable.

The "Missing CNI default network" network seems to be manifestation of broken kube-scheduler (certs in this instance) when OVN can't deploy, but it may be case by case that needs to be identified. It may be a result of different failure then kube-scheduler as well. As a result of this particular failure I am currently working on 2 fixes for cluster-kube-scheduler-operator. In this case the cert-recovery steps would fix the issue, although for failed bootstrap it's easier and faster to try again. 4.5 should recover automatically.

Comment 6 Lalatendu Mohanty 2020-03-02 15:54:02 UTC
@Tomas this issue recently came up in CI tests and is tracked as part of the OTA team's effort to identify issues that affect upgrades. It affects our ability to enable over-the-air updates in 4.3.z. The Upgrades keyword is used for bugs that are related to upgrades. When we do not get any response to the impact-analysis questions, we mark the bug as UpgradeBlocker to track it for quicker triage and root-cause analysis. This bug came up while discussing https://bugzilla.redhat.com/show_bug.cgi?id=1802246. If you think this is not a blocker, you can give an explanation and remove the keyword. The questions above should give you a fair idea of what we are looking for.

Comment 8 Maciej Szulik 2020-03-17 17:44:42 UTC
This will land in a .z stream; moving accordingly.

Comment 9 Michal Fojtik 2020-05-12 10:33:23 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing severity from "medium" to "low".

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 10 Tomáš Nožička 2020-05-14 11:54:08 UTC
waiting for https://bugzilla.redhat.com/show_bug.cgi?id=1810528 to be backported

Comment 11 Tomáš Nožička 2020-05-20 08:36:02 UTC
This bug is actively worked on.

Comment 12 Tomáš Nožička 2020-06-18 09:14:32 UTC
This bug is actively worked on.

Comment 16 Michal Fojtik 2020-08-24 13:12:22 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.

If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.