Bug 1881660
| Summary: | READYMACHINECOUNT 0 for worker nodes on most of the recent OCP installations | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Anurag saxena <anusaxen> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, jokerman, kgarriso, zzhao |
| Version: | 4.6 | Flags: | anusaxen: needinfo- |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-24 21:42:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Anurag saxena
2020-09-22 20:17:11 UTC
Anurag, which network plugin is your env using? If it's OVN, it should be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1880974; the root cause is that the machine-config-daemon pod cannot access 172.30.0.1:443.

Noticed on both sdn and ovn, so it seems to be irrespective of the network plugin and rather machine-config related.

Yep, per multiple runs today it's not happening on OpenShiftSDN but on OVNKubernetes, so you're right, Zhanqi; I suspect https://bugzilla.redhat.com/show_bug.cgi?id=1880974 might be the cause. Will let dev evaluate the statement. Thanks.

Please provide a must-gather from this cluster.

Also, the privatebin link above requires a password...

(In reply to Kirsten Garrison from comment #5)
> Also the privatebin link above requires a password...

Ahh, that's because the trailing "=" got excluded from the hyperlink somehow: https://privatebin-it-iso.int.open.paas.redhat.com/?82565729db2b9703#FdTB8xKK64KIz8NiO5SzHhR+vx9NKL48qqLgsMCNI2E=

Also, can you please include a full must-gather from this cluster, as requested above?

(In reply to Kirsten Garrison from comment #7)
> Also can you please include a full must gather from this cluster as requested above?

Yes, Kirsten. That's on my action item list, but my last cluster got pruned, so I'm reproducing this on a new cluster and will provide the must-gather details.

Thanks, Anurag! Let's leave the needinfo on the BZ until the must-gather is shared, for tracking purposes.

Thanks for the kubeconfig! The MCO looks like it's operating as expected, but the daemons on the worker nodes can't reach the API service. I see the same unexpected error in all of the daemon logs on the workers:

    E0923 21:05:21.644735 2124 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
    I0923 21:06:10.673012 2124 trace.go:205] Trace[16495265]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (23-Sep-2020 21:05:40.672) (total time: 30000ms):
    Trace[16495265]: [30.000763329s] [30.000763329s] END
    E0923 21:06:10.673089 2124 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

This is preventing the MCO from finishing. Looking at the logs, this cluster is OVN as well, so I'd confirm your idea from https://bugzilla.redhat.com/show_bug.cgi?id=1881660#c3. I'm going to pass it to that team and let them verify/dupe this BZ to https://bugzilla.redhat.com/show_bug.cgi?id=1880974. (See the diagnostic sketch after the thread below.)

Thanks, Kirsten, for the initial investigation. This looks good on today's build, 4.6.0-0.nightly-2020-09-24-111253, after the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1880974 merged.

Good to hear, @Anurag - thanks for the update! Feel free to close this as a dupe.

*** This bug has been marked as a duplicate of bug 1880974 ***
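The thread above turns on whether the machine-config-daemon pods on the workers can reach the default kubernetes Service IP (172.30.0.1:443). Below is a minimal diagnostic sketch, not part of the original report, assuming the standard openshift-machine-config-operator namespace and the default 172.30.0.0/16 service network; the pod name placeholder and the availability of curl inside the daemon image are assumptions.

```shell
# Symptom: the worker pool reports READYMACHINECOUNT 0.
oc get machineconfigpool

# Confirm which network plugin the cluster runs (OpenShiftSDN vs OVNKubernetes).
oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'

# Look for the "dial tcp 172.30.0.1:443: i/o timeout" errors in a worker's
# machine-config-daemon logs (replace <mcd-pod> with a pod from the listing).
oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon
oc -n openshift-machine-config-operator logs <mcd-pod> -c machine-config-daemon | grep 172.30.0.1

# Try to reach the default kubernetes Service IP from inside that pod
# (assumes curl is present in the daemon image).
oc -n openshift-machine-config-operator rsh <mcd-pod> \
  curl -k --connect-timeout 10 https://172.30.0.1:443/version

# Collect the must-gather requested in the thread.
oc adm must-gather
```

On an affected OVN-Kubernetes cluster the curl would be expected to time out, matching the i/o timeout lines in the daemon logs, while the same check on an OpenShiftSDN cluster should return the API server's version payload.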