Bug 1736168
| Summary: | Azure cluster stopped serving openshift apiserver resources | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Derek Carr <decarr> |
| Component: | openshift-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED WORKSFORME | QA Contact: | Xingxing Xia <xxia> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.2.0 | CC: | aos-bugs, jokerman, mfojtik, nagrawal, sanchezl, wking, xxia, yinzhou |
| Target Milestone: | --- | Keywords: | Reopened, TestBlocker |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-08-21 04:31:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1743114 | | |
| Bug Blocks: | | | |
Description

Derek Carr, 2019-08-01 14:58:05 UTC
---

Michal Fojtik (comment #2):

I think this might be the same problem as https://bugzilla.redhat.com/show_bug.cgi?id=1736800

Derek, are you able to fetch the logs from the openshift-apiserver pods and check if you see that cert error?

Also I wonder why the must-gather does not include the pod logs....

---

Comment #3:

https://bugzilla.redhat.com/show_bug.cgi?id=1736800 merged; let's move this to QE and verify whether it also fixes the Azure problem.

---

Xingxing Xia (comment #4):

(In reply to Michal Fojtik from comment #2)
> I think this might be the same problem as
> https://bugzilla.redhat.com/show_bug.cgi?id=1736800
>
> Derek, are you able to fetch the logs from the openshift-apiserver pods and
> check if you see that cert error?
>
> Also I wonder why the must-gather does not include the pod logs....

I checked Derek's attachment (untarred it and cd'ed into the directory) and found it is a different issue from bug 1736800. See the important errors I found below:

```
cd must-gather
cat cluster-scoped-resources/operator.openshift.io/openshiftapiservers/cluster.yaml
...
  - lastTransitionTime: 2019-08-01T10:48:11Z
    message: no openshift-apiserver daemon pods available on any node.
    reason: NoAPIServerPod
    status: "False"
    type: Available
...

cat namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-64d54dc77f-wxp2b/openshift-apiserver-operator-64d54dc77f-wxp2b.yaml
...
  - lastProbeTime: null
    lastTransitionTime: 2019-07-31T15:54:51Z
    status: "False"
    type: Ready
...
  containerStatuses:
  - containerID: cri-o://fc204d9be3c57f43253625eca8d6502dcd360e9569c7d1f04ae5a57e1c93e49b
    image: registry.svc.ci.openshift.org/origin/4.2-2019-07-31-154026@sha256:473253d559e9fb3ac2c4dadafd48ae82d641d68950121fd56785878ef7f93e10
    imageID: registry.svc.ci.openshift.org/origin/4.2-2019-07-31-154026@sha256:473253d559e9fb3ac2c4dadafd48ae82d641d68950121fd56785878ef7f93e10
    lastState:
      terminated:
        containerID: cri-o://e91c6135c53e6e563b6884ab9322881f688754d843cbd4ce6479ee79eb805df0
        exitCode: 255
        finishedAt: 2019-07-31T15:54:49Z
        message: |
          }): type: 'Warning' reason: 'ConfigMapCreateFailed' Failed to create ConfigMap/client-ca -n openshift-apiserver: namespaces "openshift-apiserver" not found
          I0731 15:54:39.056798       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"2550f2ba-b3ab-11e9-a698-000d3a3e7fb4", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ConfigMapCreateFailed' Failed to create ConfigMap/aggregator-client-ca -n openshift-apiserver: namespaces "openshift-apiserver" not found
          I0731 15:54:39.120015       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"2550f2ba-b3ab-11e9-a698-000d3a3e7fb4", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ConfigMapCreateFailed' Failed to create ConfigMap/etcd-serving-ca -n openshift-apiserver: namespaces "openshift-apiserver" not found
          I0731 15:54:39.173887       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"2550f2ba-b3ab-11e9-a698-000d3a3e7fb4", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'SecretCreateFailed' Failed to create Secret/etcd-client -n openshift-apiserver: namespaces "openshift-apiserver" not found
          I0731 15:54:49.856415       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "77caa34ea7e84a5509751294b69be3416206acf8b61d775526c8cb50f1448ebc", lastKnown: "")
          W0731 15:54:49.856923       1 builder.go:108] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created
          I0731 15:54:49.856998       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "8b78c62dbfd093d3d69437a4d1e1cca7b62b2de151e9b9d860ffb7ec5040962b", lastKnown: "")
          F0731 15:54:49.857016       1 leaderelection.go:66] leaderelection lost
        reason: Error
        startedAt: 2019-07-31T15:53:17Z
    name: openshift-apiserver-operator
    ready: true
    restartCount: 1
    state:
      running:
        startedAt: 2019-07-31T15:54:50Z
...
```

Note the repeated `namespaces "openshift-apiserver" not found` errors: the operator could not create its ConfigMaps/Secrets because the openshift-apiserver namespace was never created.
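As an aside, when triaging a must-gather like the one above, a quick way to surface the operator `Available` conditions is a recursive grep over the dumped YAML. A minimal sketch, assuming the standard must-gather layout shown in comment #4; it relies on the condition field order (message, reason, status, type) seen in those excerpts:

```sh
#!/bin/sh
# Rough must-gather triage helper: print each "Available" condition together
# with the three preceding lines, which carry the message/reason/status
# fields in the dumps above. Eyeball the output for status: "False" entries
# like the NoAPIServerPod one in comment #4.
cd must-gather || exit 1
grep -rn -B3 'type: Available' cluster-scoped-resources/operator.openshift.io/
```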
---

Xingxing Xia (comment #5):

All COs in my env:

```
oc get co
NAME                                        VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                              4.2.0-0.nightly-2019-08-10-002649   True        True          True       24h
cloud-credential                            4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
cluster-autoscaler                          4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
console                                     4.2.0-0.nightly-2019-08-10-002649   True        True          False      24h
dns                                         4.2.0-0.nightly-2019-08-10-002649   False       True          True       75m
image-registry                              4.2.0-0.nightly-2019-08-10-002649   False       True          False      75m
ingress                                     4.2.0-0.nightly-2019-08-10-002649   False       True          False      75m
insights                                    4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
kube-apiserver                              4.2.0-0.nightly-2019-08-10-002649   True        True          True       24h
kube-controller-manager                     4.2.0-0.nightly-2019-08-10-002649   True        False         True       24h
kube-scheduler                              4.2.0-0.nightly-2019-08-10-002649   True        False         True       24h
machine-api                                 4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
machine-config                              4.2.0-0.nightly-2019-08-10-002649   False       False         True       65m
marketplace                                 4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
monitoring                                  4.2.0-0.nightly-2019-08-10-002649   False       True          True       68m
network                                     4.2.0-0.nightly-2019-08-10-002649   True        True          False      24h
node-tuning                                 4.2.0-0.nightly-2019-08-10-002649   False       False         False      75m
openshift-apiserver                         4.2.0-0.nightly-2019-08-10-002649   False       False         False      75m
openshift-controller-manager                4.2.0-0.nightly-2019-08-10-002649   False       False         False      75m
openshift-samples                           4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
operator-lifecycle-manager                  4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
operator-lifecycle-manager-catalog          4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
operator-lifecycle-manager-packageserver    4.2.0-0.nightly-2019-08-10-002649   False       True          False      75m
service-ca                                  4.2.0-0.nightly-2019-08-10-002649   True        True          False      24h
service-catalog-apiserver                   4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
service-catalog-controller-manager          4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
storage                                     4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
```

---

Xingxing Xia (comment #8):

An AWS IPI env also hit the same issue:
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65364/artifact/workdir/install-dir/auth/kubeconfig

Env launching job parameters:
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65364/

4.2.0-0.nightly-2019-08-13-183722
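For what it's worth, a dump like the one in comment #5 can be reduced to just the unhealthy operators by filtering the conditions with jq. A minimal sketch, assuming a logged-in `oc` session and jq installed; the output format is illustrative, not from this bug:

```sh
# Print only cluster operators whose Available condition is not True,
# together with the condition's message (falling back to the reason when
# the message is absent).
oc get clusteroperators -o json | jq -r '
  .items[] | . as $co
  | .status.conditions[]
  | select(.type == "Available" and .status != "True")
  | "\($co.metadata.name): \(.message // .reason)"'
```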
---

Seth Jennings (comment #9):

What is the rationale for moving this bug to the Node component? No reasoning is given.

---

Seth Jennings (comment #10):

I see the summary change. It seems like it might have changed the nature of the report as well. I'm not sure this is the same issue as initially reported.

---

Seth Jennings (comment #11):

I downloaded the kubeconfig from comment #8, but I can't get pod logs and I can't run `oc debug node` because the kubelets don't have their serving certs installed. Without direct ssh access to the instances in that cluster, I can't make progress on this.

---

The kubelets are creating CSRs with the node-bootstrapper cert, which I would not expect after they get their client/serving certs during installation.
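For background on the CSR observation above: the pending requests can be listed, and if legitimate approved, with standard `oc` commands. A minimal sketch, assuming cluster-admin credentials; nothing here is taken from this bug's cluster:

```sh
# List certificate signing requests; unexpected bootstrap CSRs show up as
# Pending, with a node-bootstrapper service account as the requestor.
oc get csr

# Approving legitimate serving-cert CSRs lets the kubelets obtain their
# certs, which is what `oc logs` and `oc debug node` were failing without.
oc adm certificate approve <csr-name>
```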
---

Ryan Phillips (comment #13):

Has this cluster been disaster-recovered using this document [1]?

[1] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html

---

Ryan Phillips (comment #14):

Looks like a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1736800, and the PR merged 7 days ago: https://github.com/openshift/openshift-apiserver/pull/16

*** This bug has been marked as a duplicate of bug 1736800 ***

---

Xingxing Xia:

(In reply to Seth Jennings from comment #9)
> What is the rationale for moving this bug to Node component? No reasoning
> is given.

Not sure which is proper. Maybe kube-apiserver?

(In reply to Seth Jennings from comment #10)
> I see the summary change. Seems like it might have changed the nature of
> the report as well. I'm not sure this is the same issue as initially
> reported.

The symptom described since comment 4 is exactly the same as in the original report's attachment.

(In reply to Seth Jennings from comment #11)
> I downloaded the kubeconfig from comment #8 but I can't get pods logs and I
> can't run `oc debug node` because the kubelets don't have their serving
> certs installed. Without direct ssh access to the instances in that
> cluster, I can't make progress on this.

Yes, as observed in comment 4, `oc logs` cannot be run due to this bug. Maybe you could try to reproduce it? After all, the issue was hit not only in IPI on Azure but also in IPI on AWS, per the comments.

(In reply to Ryan Phillips from comment #13)
> Has this cluster been disaster recovered using this document [1]?
>
> [1] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html

No, it is a fresh installation running for > 24 hours. No DR operations were done to it.

(In reply to Ryan Phillips from comment #14)
> Looks like a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1736800 and
> the PR merged 7 days ago
> https://github.com/openshift/openshift-apiserver/pull/16
>
> *** This bug has been marked as a duplicate of bug 1736800 ***

As stated in comment 4, the symptom is different from bug 1736800: with 1736800 you can run `oc logs`, while here you cannot (per comment 4).

---

Xingxing Xia:

(In reply to Xingxing Xia from comment #8)
> AWS IPI env also found the same issue:
> https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65364/artifact/workdir/install-dir/auth/kubeconfig
> Env launching job parameters:
> https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65364/
> 4.2.0-0.nightly-2019-08-13-183722

Hmm, strangely, another newer payload, 4.2.0-0.nightly-2019-08-15-033605, in an IPI-on-AWS env running > 25 hours did not hit the issue. So reverting the bug title. But to state comment 15 again: the symptom from this bug's attachment (and comment 5's successful reproduction) is different from bug 1736800.

---

Migrating to openshift-apiserver based on https://bugzilla.redhat.com/show_bug.cgi?id=1736168#c4, which states that the openshift-apiserver namespace is not created.

---

Xingxing Xia:

As in comments 4 and 5, which reproduced it by launching and leaving a latest Azure env, today I launched an Azure env again to confirm whether the bug is gone. I will give the result after it runs > 24 hours, the next day.

---

Xingxing Xia:

The launched env failed due to bug 1743114#c1. Once a new installation attempt succeeds, I will answer the "still reproducible?" question above.

---

Neelesh Agrawal:

Hi XingXing, confirming that there is no fix needed from the apiserver side? If so, why not keep this bug as ON_QA?

---

Verification will be dependent on when 1743114 is fixed.

---

Xingxing Xia:

Neelesh Agrawal: hmm, no. Comments 4 ~ 5 reproduced it; the env was left running but didn't get checked again, so there is uncertainty whether the issue is gone. I will retry launching an env. If it is still reproducible, a fix would be needed. Fine to keep ON_QA till my retry has a result.

---

Xingxing Xia:

Checked an Azure env of 4.2.0-0.nightly-2019-08-19-201622 running > 24h; it does not reproduce the above issues again, though strangely comment 5 reproduced it in a payload newer than the version in which bug 1736800 was verified. Thus closing it.
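For reference, the verification that closed this bug amounts to letting a fresh cluster soak for more than 24 hours and then re-checking the apiserver state. A rough sketch of that check with standard `oc` commands, assuming a logged-in session; the commands are generic, not taken from this bug:

```sh
# After > 24h of soak time, confirm openshift-apiserver has stayed healthy.
oc get clusteroperator openshift-apiserver   # expect AVAILABLE=True
oc get pods -n openshift-apiserver           # expect Running apiserver pods

# Spot-check that pod logs are retrievable again; `oc logs` failed while the
# bug reproduced, since the kubelet serving certs were never issued.
pod=$(oc get pods -n openshift-apiserver -o name | head -n 1)
oc logs -n openshift-apiserver "$pod" --tail=20
```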