Bug 1736168

Summary: Azure cluster stopped serving openshift apiserver resources

Product: OpenShift Container Platform
Reporter: Derek Carr <decarr>
Component: openshift-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED WORKSFORME
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Priority: urgent
Version: 4.2.0
CC: aos-bugs, jokerman, mfojtik, nagrawal, sanchezl, wking, xxia, yinzhou
Target Milestone: ---
Keywords: Reopened, TestBlocker
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-08-21 04:31:51 UTC
Bug Depends On: 1743114    

Description Derek Carr 2019-08-01 14:58:05 UTC
Description of problem:
Installed OpenShift on Azure; after ~22h the cluster was no longer able to serve openshift-apiserver resources, which made it impossible to log into the web console.

Version-Release number of selected component (if applicable):

4.2 dev stream (see must-gather report)

Additional info:

see attached must-gather report.
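For anyone checking the symptom on a live cluster, a rough sketch of the kind of check involved (standard oc commands, nothing cluster-specific):

# The API groups aggregated through openshift-apiserver should report Available=True
oc get apiservices | grep openshift.io
# Any resource served by openshift-apiserver will fail while it is down, e.g.
oc get projects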

Comment 2 Michal Fojtik 2019-08-08 08:44:43 UTC
I think this might be the same problem as https://bugzilla.redhat.com/show_bug.cgi?id=1736800

Derek, are you able to fetch the logs from the openshift-apiserver pods and check if you see that cert error?
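For example, something along these lines (a rough sketch; pod names differ per cluster):

# List the openshift-apiserver pods, then scan each one's log for certificate errors
oc get pods -n openshift-apiserver
for p in $(oc get pods -n openshift-apiserver -o name); do
  oc logs -n openshift-apiserver "$p" | grep -iE 'x509|certificate|tls'
done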

Also I wonder why the must-gather does not include the pod logs....

Comment 3 Michal Fojtik 2019-08-12 08:01:01 UTC
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1736800 merged; let's move this to QE and verify whether it also fixes the Azure problem.

Comment 4 Xingxing Xia 2019-08-13 10:06:19 UTC
(In reply to Michal Fojtik from comment #2)
> I think this might be the same problem as
> https://bugzilla.redhat.com/show_bug.cgi?id=1736800
> 
> Derek, are you able to fetch the logs from the openshift-apiserver pods and
> check if you see that cert error?
> 
> Also I wonder why the must-gather does not include the pod logs....

I checked Derek's attachment (untarred it and cd'd into the directory) and found it is a different issue from bug 1736800. See the important errors I found below:
cd must-gather
cat cluster-scoped-resources/operator.openshift.io/openshiftapiservers/cluster.yaml
...
  - lastTransitionTime: 2019-08-01T10:48:11Z
    message: no openshift-apiserver daemon pods available on any node.
    reason: NoAPIServerPod
    status: "False"
    type: Available
...

cat namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-64d54dc77f-wxp2b/openshift-apiserver-operator-64d54dc77f-wxp2b.yaml
...
  - lastProbeTime: null
    lastTransitionTime: 2019-07-31T15:54:51Z
    status: "False"
    type: Ready
...
  containerStatuses:
  - containerID: cri-o://fc204d9be3c57f43253625eca8d6502dcd360e9569c7d1f04ae5a57e1c93e49b
    image: registry.svc.ci.openshift.org/origin/4.2-2019-07-31-154026@sha256:473253d559e9fb3ac2c4dadafd48ae82d641d68950121fd56785878ef7f93e10
    imageID: registry.svc.ci.openshift.org/origin/4.2-2019-07-31-154026@sha256:473253d559e9fb3ac2c4dadafd48ae82d641d68950121fd56785878ef7f93e10
    lastState:
      terminated:
        containerID: cri-o://e91c6135c53e6e563b6884ab9322881f688754d843cbd4ce6479ee79eb805df0
        exitCode: 255
        finishedAt: 2019-07-31T15:54:49Z
        message: |
          }): type: 'Warning' reason: 'ConfigMapCreateFailed' Failed to create ConfigMap/client-ca -n openshift-apiserver: namespaces "openshift-apiserver" not found
          I0731 15:54:39.056798       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"2550f2ba-b3ab-11e9-a698-000d3a3e7fb4", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ConfigMapCreateFailed' Failed to create ConfigMap/aggregator-client-ca -n openshift-apiserver: namespaces "openshift-apiserver" not found
          I0731 15:54:39.120015       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"2550f2ba-b3ab-11e9-a698-000d3a3e7fb4", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ConfigMapCreateFailed' Failed to create ConfigMap/etcd-serving-ca -n openshift-apiserver: namespaces "openshift-apiserver" not found
          I0731 15:54:39.173887       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"2550f2ba-b3ab-11e9-a698-000d3a3e7fb4", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'SecretCreateFailed' Failed to create Secret/etcd-client -n openshift-apiserver: namespaces "openshift-apiserver" not found
          I0731 15:54:49.856415       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "77caa34ea7e84a5509751294b69be3416206acf8b61d775526c8cb50f1448ebc", lastKnown: "")
          W0731 15:54:49.856923       1 builder.go:108] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created
          I0731 15:54:49.856998       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "8b78c62dbfd093d3d69437a4d1e1cca7b62b2de151e9b9d860ffb7ec5040962b", lastKnown: "")
          F0731 15:54:49.857016       1 leaderelection.go:66] leaderelection lost
        reason: Error
        startedAt: 2019-07-31T15:53:17Z
    name: openshift-apiserver-operator
    ready: true
    restartCount: 1
    state:
      running:
        startedAt: 2019-07-31T15:54:50Z
...
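For reference, the errors above can be surfaced quickly from the unpacked must-gather with something like the following (same relative paths as used above):

cd must-gather
# Operator-level availability condition
grep -B1 -A3 'NoAPIServerPod' cluster-scoped-resources/operator.openshift.io/openshiftapiservers/cluster.yaml
# Create failures recorded in the operator pod's last termination message
grep -rE 'ConfigMapCreateFailed|SecretCreateFailed' namespaces/openshift-apiserver-operator/pods/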

Comment 7 Xingxing Xia 2019-08-13 10:30:58 UTC
All cluster operators (co) in my env:
oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-08-10-002649   True        True          True       24h
cloud-credential                           4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
cluster-autoscaler                         4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
console                                    4.2.0-0.nightly-2019-08-10-002649   True        True          False      24h
dns                                        4.2.0-0.nightly-2019-08-10-002649   False       True          True       75m
image-registry                             4.2.0-0.nightly-2019-08-10-002649   False       True          False      75m
ingress                                    4.2.0-0.nightly-2019-08-10-002649   False       True          False      75m
insights                                   4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
kube-apiserver                             4.2.0-0.nightly-2019-08-10-002649   True        True          True       24h
kube-controller-manager                    4.2.0-0.nightly-2019-08-10-002649   True        False         True       24h
kube-scheduler                             4.2.0-0.nightly-2019-08-10-002649   True        False         True       24h
machine-api                                4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
machine-config                             4.2.0-0.nightly-2019-08-10-002649   False       False         True       65m
marketplace                                4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
monitoring                                 4.2.0-0.nightly-2019-08-10-002649   False       True          True       68m
network                                    4.2.0-0.nightly-2019-08-10-002649   True        True          False      24h
node-tuning                                4.2.0-0.nightly-2019-08-10-002649   False       False         False      75m
openshift-apiserver                        4.2.0-0.nightly-2019-08-10-002649   False       False         False      75m
openshift-controller-manager               4.2.0-0.nightly-2019-08-10-002649   False       False         False      75m
openshift-samples                          4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
operator-lifecycle-manager                 4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-08-10-002649   False       True          False      75m
service-ca                                 4.2.0-0.nightly-2019-08-10-002649   True        True          False      24h
service-catalog-apiserver                  4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
service-catalog-controller-manager         4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
storage                                    4.2.0-0.nightly-2019-08-10-002649   True        False         False      24h
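A quick way to filter output like the above down to the unhealthy operators (a sketch over the default columns NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE):

oc get co --no-headers | awk '$3 != "True" || $5 == "True" {print $1, "Available="$3, "Degraded="$5}'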

Comment 9 Seth Jennings 2019-08-15 13:33:23 UTC
What is the rationale for moving this bug to the Node component? No reasoning was given.

Comment 10 Seth Jennings 2019-08-15 13:35:19 UTC
I see the summary change.  Seems like it might have changed the nature of the report as well.  I'm not sure this is the same issue as initially reported.

Comment 11 Seth Jennings 2019-08-15 14:15:46 UTC
I downloaded the kubeconfig from comment #8, but I can't get pod logs and I can't run `oc debug node` because the kubelets don't have their serving certs installed. Without direct ssh access to the instances in that cluster, I can't make progress on this.

Comment 12 Seth Jennings 2019-08-15 15:23:20 UTC
The kubelets are creating CSRs with the node-bootstrapper cert, which I would not expect after they have received their client/serving certs during installation.
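For anyone hitting the same state, the pending CSRs and their requesters can be inspected with standard commands (<csr-name> below is a placeholder), and legitimate node CSRs approved:

# REQUESTOR should normally be system:node:<node> after bootstrap,
# not the node-bootstrapper service account
oc get csr
oc describe csr <csr-name>
# Approve an outstanding node CSR if it is legitimate
oc adm certificate approve <csr-name>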

Comment 13 Ryan Phillips 2019-08-15 16:18:01 UTC
Has this cluster been disaster recovered using this document [1]?

[1] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html

Comment 14 Ryan Phillips 2019-08-15 16:55:22 UTC
Looks like a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1736800 and the PR merged 7 days ago https://github.com/openshift/openshift-apiserver/pull/16

*** This bug has been marked as a duplicate of bug 1736800 ***

Comment 15 Xingxing Xia 2019-08-16 01:27:10 UTC
(In reply to Seth Jennings from comment #9)
> What is the rationale for moving this bug to Node component?  No reasoning
> is given.
Not sure which is proper. Maybe kube-apiserver?

(In reply to Seth Jennings from comment #10)
> I see the summary change.  Seems like it might have changed the nature of
> the report as well.  I'm not sure this is the same issue as initially
> reported.
The symptom described from comment 4 onward is exactly the same as in the original report's attachment.

(In reply to Seth Jennings from comment #11)
> I downloaded the kubeconfig from comment #8 but I can't get pods logs and I
> can't run `oc debug node` because the kubelets don't have their serving
> certs installed.  Without direct ssh access to the instances in that
> cluster, I can't make progress on this.
Yes, as observed in comment 4, `oc logs` cannot be run because of this bug. Could you try to reproduce it? After all, the issue was hit not only in IPI on Azure but also in IPI on AWS, per the comments.

(In reply to Ryan Phillips from comment #13)
> Has this cluster been disaster recovered using this document [1]?
> 
> [1]
> https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-
> 3-expired-certs.html
No, it is a fresh installation running for > 24 hours. No DR operations were done to it.

(In reply to Ryan Phillips from comment #14)
> Looks like a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1736800 and
> the PR merged 7 days ago
> https://github.com/openshift/openshift-apiserver/pull/16
> 
> *** This bug has been marked as a duplicate of bug 1736800 ***
As stated in comment 4, the symptom is different from bug 1736800: with bug 1736800 `oc logs` can still be run, while here it cannot (per comment 4).

Comment 16 Xingxing Xia 2019-08-16 08:23:17 UTC
(In reply to Xingxing Xia from comment #8)
> AWS IPI env also found the same issue:
> https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/
> Launch%20Environment%20Flexy/65364/artifact/workdir/install-dir/auth/
> kubeconfig
> Env launching job parameters:
> https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/
> Launch%20Environment%20Flexy/65364/
> 4.2.0-0.nightly-2019-08-13-183722

Hmm, strangely, another IPI-on-AWS env with a newer payload, 4.2.0-0.nightly-2019-08-15-033605, running > 25 hours did not hit the issue, so I am reverting the bug title. But to restate comment 15: the symptom in this bug's attachment (and in comment 5's successful reproduction) is different from bug 1736800.

Comment 17 Ryan Phillips 2019-08-16 14:51:36 UTC
Migrating to openshift-apiserver based on https://bugzilla.redhat.com/show_bug.cgi?id=1736168#c4, which states the openshift-apiserver namespace was not created.
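A quick sanity check for that on a live cluster (sketch):

# The namespace should exist and the operator's recent events should be clean
oc get namespace openshift-apiserver
oc get events -n openshift-apiserver-operator --sort-by=.lastTimestamp | tail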

Comment 19 Xingxing Xia 2019-08-19 10:36:56 UTC
As in comments 4 and 5, which reproduced this by launching a latest Azure env and leaving it running, I launched an Azure env again today to confirm whether the bug is gone. I will give the result tomorrow, after it has run for > 24 hours.

Comment 20 Xingxing Xia 2019-08-19 12:13:20 UTC
The launched env failed due to bug 1743114#c1. Once a new installation attempt succeeds, I will answer the "still reproducible?" question above.

Comment 21 Neelesh Agrawal 2019-08-19 19:22:30 UTC
Hi XingXing,

Can you confirm that no fix is needed from the apiserver side? If so, why not keep this bug ON_QA? Verification will depend on when bug 1743114 is fixed.

Comment 22 Xingxing Xia 2019-08-20 02:06:32 UTC
Neelesh Agrawal: hmm, no. Comments 4 and 5 reproduced it, and the env was left running but did not get checked further, so there is still uncertainty about whether the issue is gone. I will retry launching an env; if it is still reproducible, a fix will be needed. Fine to keep this ON_QA until my retry has a result.

Comment 23 Xingxing Xia 2019-08-21 04:31:51 UTC
Checked an Azure env of 4.2.0-0.nightly-2019-08-19-201622 running > 24h; it does not reproduce the above issues, though strangely comment 5 had reproduced them in a payload newer than the version bug 1736800 was verified against. Thus closing this.
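Roughly, the verification boils down to checks like these on the long-running cluster:

oc get co openshift-apiserver          # should be Available=True, Degraded=False
oc get pods -n openshift-apiserver     # API pods should be Running on all masters
oc get projects                        # aggregated API groups reachable again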