Bug 1876486

Summary: Static pod installer controller deadlocks with non-existing installer pod, WAS: kube-apiserver cluster operator always has incorrect status due to PLEG error
Product: OpenShift Container Platform Reporter: Maciej Szulik <maszulik>
Component: kube-apiserver Assignee: Stefan Schimanski <sttts>
Status: CLOSED CURRENTRELEASE QA Contact: Ke Wang <kewang>
Severity: high Docs Contact:
Priority: high    
Version: 4.3.0 CC: aos-bugs, cblecker, cmeadors, jokerman, jupierce, kewang, maszulik, mfojtik, sanchezl, schoudha, sdodson, sttts, vlaad, wking, xxia
Target Milestone: --- Keywords: ServiceDeliveryImpact, Upgrades
Target Release: 4.3.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1876484 Environment:
Last Closed: 2021-01-20 14:46:21 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1876484    
Bug Blocks: 1822016    

Comment 1 Stefan Schimanski 2020-09-11 15:36:25 UTC
We are waiting for PRs to merge and to be verified for 4.5 and 4.4. Adding UpcomingSprint.

Comment 3 Stefan Schimanski 2020-10-02 09:04:50 UTC
These are waiting for the 4.4 PRs to merge. Adding UpcomingSprint.

Comment 4 Ke Wang 2020-10-28 11:18:19 UTC
$ oc get clusterversion
NAME      VERSION                                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         93m     Cluster version is 4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb

$ oc get infrastructures.config.openshift.io  -o json | jq .items[0].status.platform
"GCP"

$ oc get nodes
NAME                                                      STATUS   ROLES    AGE    VERSION
ci-ln-5qqkm-m-0.c.openshift-gce-devel-ci.internal         Ready    master   118m   v1.16.2+853223d
ci-ln-5qqkm-m-1.c.openshift-gce-devel-ci.internal         Ready    master   117m   v1.16.2+853223d
ci-ln-5qqkm-m-2.c.openshift-gce-devel-ci.internal         Ready    master   117m   v1.16.2+853223d
ci-ln-5qqkm-w-b-xbgwx.c.openshift-gce-devel-ci.internal   Ready    worker   103m   v1.16.2+853223d
ci-ln-5qqkm-w-c-9db7n.c.openshift-gce-devel-ci.internal   Ready    worker   103m   v1.16.2+853223d
ci-ln-5qqkm-w-d-j2w24.c.openshift-gce-devel-ci.internal   Ready    worker   103m   v1.16.2+853223d

$ oc get co
NAME                                       VERSION                                           AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      95m
cloud-credential                           4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      117m
cluster-autoscaler                         4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      104m
console                                    4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      98m
dns                                        4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      115m
image-registry                             4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      102m
ingress                                    4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      102m
insights                                   4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m
kube-apiserver                             4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      114m
kube-controller-manager                    4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      114m
kube-scheduler                             4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      114m
machine-api                                4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m
machine-config                             4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      115m
marketplace                                4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m
monitoring                                 4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      99m
network                                    4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      116m
node-tuning                                4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m
openshift-apiserver                        4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      107m
openshift-controller-manager               4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      114m
openshift-samples                          4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      103m
operator-lifecycle-manager                 4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m
operator-lifecycle-manager-catalog         4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m
operator-lifecycle-manager-packageserver   4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      98m
service-ca                                 4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      116m
service-catalog-apiserver                  4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m
service-catalog-controller-manager         4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m
storage                                    4.3.0-0.ci.test-2020-10-28-090331-ci-ln-mb5hmnb   True        False         False      105m

The cluster works well, so moving the bug to VERIFIED.

Comment 5 Ke Wang 2020-10-28 14:57:22 UTC
Another installation on AWS:
$ oc get clusterversion
NAME      VERSION                                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         169m    Cluster version is 4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2

$ oc get infrastructures.config.openshift.io  -o json | jq .items[0].status.platform
"AWS"

$ oc get no
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-130-235.us-east-2.compute.internal   Ready    master   3h6m   v1.16.2+853223d
ip-10-0-130-68.us-east-2.compute.internal    Ready    worker   175m   v1.16.2+853223d
ip-10-0-132-184.us-east-2.compute.internal   Ready    master   3h6m   v1.16.2+853223d
ip-10-0-134-163.us-east-2.compute.internal   Ready    worker   175m   v1.16.2+853223d
ip-10-0-156-227.us-east-2.compute.internal   Ready    master   3h6m   v1.16.2+853223d
ip-10-0-157-107.us-east-2.compute.internal   Ready    worker   175m   v1.16.2+853223d

$ oc get co
NAME                                       VERSION                                           AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      168m
cloud-credential                           4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h6m
cluster-autoscaler                         4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      178m
console                                    4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      171m
dns                                        4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h3m
image-registry                             4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      173m
ingress                                    4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      173m
insights                                   4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      179m
kube-apiserver                             4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h2m
kube-controller-manager                    4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h2m
kube-scheduler                             4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h2m
machine-api                                4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      179m
machine-config                             4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h4m
marketplace                                4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      178m
monitoring                                 4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      173m
network                                    4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h5m
node-tuning                                4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      168m
openshift-apiserver                        4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h
openshift-controller-manager               4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h2m
openshift-samples                          4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      178m
operator-lifecycle-manager                 4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h
operator-lifecycle-manager-catalog         4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h
operator-lifecycle-manager-packageserver   4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      179m
service-ca                                 4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h5m
service-catalog-apiserver                  4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h
service-catalog-controller-manager         4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      3h
storage                                    4.3.0-0.ci.test-2020-10-28-113014-ci-ln-r5hszw2   True        False         False      179m

The cluster works well.

Comment 6 W. Trevor King 2020-10-28 15:34:43 UTC
Straight from POST to VERIFIED, with all three linked PRs still open?  I'm confused...

Comment 7 Xingxing Xia 2020-10-29 02:04:47 UTC
Ke Wang, do not move to VERIFIED.
We follow the pre-merge verification process (https://issues.redhat.com/browse/DPTP-660) and its goal (https://issues.redhat.com/browse/OCPQE-815) only to make verification happen earlier, against dev-approved PRs. Without this process, once a bug fails QA verification it has to be assigned back for further fixes, which is bad for a smooth z-stream release.
So let's just do the pre-merge PR verification, add the QE /lgtm in the PR, and leave the bug status as is instead of moving it to VERIFIED. Once the dev-approved PR(s) are merged and the bug changes to ON_QA, the bot is supposed to move it to VERIFIED automatically if the PR(s) are already in the related nightly build; if the bot's automation does not work, you move it manually. More details are in DPTP-660.

Comment 8 Ke Wang 2020-10-29 02:21:18 UTC
Xingxing, thank you for helping me understand the pre-merge verification process more clearly. I will handle this process carefully next time.

Comment 9 Maciej Szulik 2020-10-29 09:31:15 UTC
The linked PRs merged, moving to MODIFIED.

Comment 11 Ke Wang 2020-11-02 03:08:56 UTC
- Checked that all PRs are already included in the 4.3.z nightly build:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-10-31-224411 | grep cluster-kube-apiserver-operator 
  cluster-kube-apiserver-operator               https://github.com/openshift/cluster-kube-apiserver-operator               e6a57821a09bbd4b06d617216ffc78a3edf5de54

$ git log --date local '--pretty=%h %an %cd - %s' e6a57821 | grep '#960'
e6a57821 OpenShift Merge Robot Thu Oct 29 03:14:49 2020 - Merge pull request #960 from soltysh/bug1876486

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-10-31-224411 | grep cluster-kube-scheduler-operator
  cluster-kube-scheduler-operator               https://github.com/openshift/cluster-kube-scheduler-operator               0acedc7bfdaf0b7b369fcc4feb2c71c684b51571

$ git log --date local '--pretty=%h %an %cd - %s' 0acedc7 | grep '#285'
0acedc7b OpenShift Merge Robot Thu Oct 29 02:40:42 2020 - Merge pull request #285 from soltysh/bug1876486

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-10-31-224411 | grep cluster-kube-controller-manager-operator 
  cluster-kube-controller-manager-operator      https://github.com/openshift/cluster-kube-controller-manager-operator      85291ab536a3a258d847e12964b4c08bb402ba87
  
$ git log --date local '--pretty=%h %an %cd - %s' 85291ab | grep '#458'
85291ab5 OpenShift Merge Robot Thu Oct 29 02:21:17 2020 - Merge pull request #458 from soltysh/bug1876486
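
The three checks above repeat one pattern: read the commit that the nightly payload records for an operator, then confirm in a checkout of that repository that the merge commit of the PR is in its history. A minimal sketch of the first half follows. It assumes the three-column layout (name, repository URL, commit) shown in the output above; the function name component_commit is hypothetical and not part of oc.

```shell
#!/usr/bin/env bash
# Print the commit recorded for one component, reading the output of
# `oc adm release info --commits <release-image>` from stdin.
# Assumption: each component line has three whitespace-separated columns:
# name, repository URL, commit hash (as in the output pasted above).
component_commit() {
  local component="$1"
  # Match the name column exactly, so a component whose name merely
  # contains the requested string is not picked up by accident.
  awk -v name="$component" '$1 == name { print $3 }'
}

# Example use (needs oc and registry access, so it is left commented out):
#   oc adm release info --commits "$RELEASE_IMAGE" \
#     | component_commit cluster-kube-apiserver-operator
```

The printed hash can then be passed to the git log command used above, run inside a clone of the corresponding operator repository.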

- Tried installations with the 4.3.z nightly build that includes the PRs:
----------
Connected IPI install on Azure.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-10-31-224411   True        False         19m     Cluster version is 4.3.0-0.nightly-2020-10-31-224411

$ oc get infrastructures.config.openshift.io  -o json | jq .items[0].status.platform
"Azure"

$ oc get node
NAME                                           STATUS   ROLES    AGE   VERSION
kewang0131-2h6ch-master-0                      Ready    master   42m   v1.16.2+853223d
kewang0131-2h6ch-master-1                      Ready    master   42m   v1.16.2+853223d
kewang0131-2h6ch-master-2                      Ready    master   41m   v1.16.2+853223d
kewang0131-2h6ch-worker-northcentralus-jnj7t   Ready    worker   28m   v1.16.2+853223d
kewang0131-2h6ch-worker-northcentralus-tk624   Ready    worker   27m   v1.16.2+853223d

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      19m
cloud-credential                           4.3.0-0.nightly-2020-10-31-224411   True        False         False      42m
cluster-autoscaler                         4.3.0-0.nightly-2020-10-31-224411   True        False         False      35m
console                                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      22m
dns                                        4.3.0-0.nightly-2020-10-31-224411   True        False         False      40m
image-registry                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      26m
ingress                                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      25m
insights                                   4.3.0-0.nightly-2020-10-31-224411   True        False         False      37m
kube-apiserver                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      38m
kube-controller-manager                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      38m
kube-scheduler                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      39m
machine-api                                4.3.0-0.nightly-2020-10-31-224411   True        False         False      36m
machine-config                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      40m
marketplace                                4.3.0-0.nightly-2020-10-31-224411   True        False         False      36m
monitoring                                 4.3.0-0.nightly-2020-10-31-224411   True        False         False      22m
network                                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      41m
node-tuning                                4.3.0-0.nightly-2020-10-31-224411   True        False         False      24m
openshift-apiserver                        4.3.0-0.nightly-2020-10-31-224411   True        False         False      30m
openshift-controller-manager               4.3.0-0.nightly-2020-10-31-224411   True        False         False      40m
openshift-samples                          4.3.0-0.nightly-2020-10-31-224411   True        False         False      35m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-10-31-224411   True        False         False      37m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-10-31-224411   True        False         False      37m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-10-31-224411   True        False         False      36m
service-ca                                 4.3.0-0.nightly-2020-10-31-224411   True        False         False      41m
service-catalog-apiserver                  4.3.0-0.nightly-2020-10-31-224411   True        False         False      36m
service-catalog-controller-manager         4.3.0-0.nightly-2020-10-31-224411   True        False         False      37m
storage                                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      37m



--------
Disconnected UPI install on vSphere.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-10-31-224411   True        False         43m     Cluster version is 4.3.0-0.nightly-2020-10-31-224411

$ oc get infrastructures.config.openshift.io  -o json | jq .items[0].status.platform
"VSphere"

$ oc get node
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   61m   v1.16.2+853223d
compute-1         Ready    worker   61m   v1.16.2+853223d
control-plane-0   Ready    master   61m   v1.16.2+853223d
control-plane-1   Ready    master   61m   v1.16.2+853223d
control-plane-2   Ready    master   61m   v1.16.2+853223d

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      47m
cloud-credential                           4.3.0-0.nightly-2020-10-31-224411   True        False         False      61m
cluster-autoscaler                         4.3.0-0.nightly-2020-10-31-224411   True        False         False      53m
console                                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      20m
dns                                        4.3.0-0.nightly-2020-10-31-224411   True        False         False      60m
image-registry                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      25m
ingress                                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      25m
insights                                   4.3.0-0.nightly-2020-10-31-224411   True        False         False      53m
kube-apiserver                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      54m
kube-controller-manager                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      54m
kube-scheduler                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      55m
machine-api                                4.3.0-0.nightly-2020-10-31-224411   True        False         False      53m
machine-config                             4.3.0-0.nightly-2020-10-31-224411   True        False         False      54m
marketplace                                4.3.0-0.nightly-2020-10-31-224411   True        False         False      19m
monitoring                                 4.3.0-0.nightly-2020-10-31-224411   True        False         False      34m
network                                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      60m
node-tuning                                4.3.0-0.nightly-2020-10-31-224411   True        False         False      23m
openshift-apiserver                        4.3.0-0.nightly-2020-10-31-224411   True        False         False      21m
openshift-controller-manager               4.3.0-0.nightly-2020-10-31-224411   True        False         False      55m
openshift-samples                          4.3.0-0.nightly-2020-10-31-224411   True        False         False      14m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-10-31-224411   True        False         False      54m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-10-31-224411   True        False         False      54m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-10-31-224411   True        False         False      20m
service-ca                                 4.3.0-0.nightly-2020-10-31-224411   True        False         False      60m
service-catalog-apiserver                  4.3.0-0.nightly-2020-10-31-224411   True        False         False      54m
service-catalog-controller-manager         4.3.0-0.nightly-2020-10-31-224411   True        False         False      54m
storage                                    4.3.0-0.nightly-2020-10-31-224411   True        False         False      53m

Both clusters work fine, so moving the bug to VERIFIED.
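
The "works fine" judgement above comes from reading the oc get co tables by eye: every operator shows AVAILABLE=True, PROGRESSING=False, DEGRADED=False. A small sketch of the same check in script form follows. The function name unhealthy_operators is hypothetical, and it assumes the column order shown in the listings above, with every operator reporting a non-empty VERSION (an empty VERSION column would shift the fields).

```shell
#!/usr/bin/env bash
# Read `oc get co` table output on stdin and print the name of every
# operator that is not Available=True, Progressing=False, Degraded=False.
# Returns non-zero if at least one such operator is found.
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
unhealthy_operators() {
  awk 'NR > 1 && NF >= 5 && !($3 == "True" && $4 == "False" && $5 == "False") {
         print $1
         bad = 1
       }
       END { exit bad }'
}

# Example use (needs a logged-in oc session, so it is left commented out):
#   oc get co | unhealthy_operators && echo "all cluster operators healthy"
```

An empty result with a zero exit status corresponds to the tables pasted in this bug.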