Description of problem:
READYMACHINECOUNT is displaying 0 under machineconfigpools, causing installation issues on a local QE env.

$ oc get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-bf3de50a465da97c0aa74b3de2fa7c1f   True      False      False      3              3                   3                     0                      92m
worker   rendered-worker-4bf15950e4c24dc17126a09e0d50b0f4   False     True       False      3              0                   0                     0                      92m

Output of "oc get clusteroperators machine-config -oyaml" and "oc get machineconfigpools -oyaml" is captured here:
https://privatebin-it-iso.int.open.paas.redhat.com/?82565729db2b9703#FdTB8xKK64KIz8NiO5SzHhR+vx9NKL48qqLgsMCNI2E=

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-22-130743

How reproducible:
Intermittent

Steps to Reproduce:
1. Install a multinode OCP cluster on 4.6 (3 masters, 3 workers)
2.
3.

Actual results:
READYMACHINECOUNT is not 3

Expected results:
READYMACHINECOUNT should be 3

Additional info:

$ oc get nodes
NAME                                                        STATUS   ROLES    AGE    VERSION
qe-anuragcp3-62sfz-master-0.c.openshift-qe.internal         Ready    master   101m   v1.19.0+7e8389f
qe-anuragcp3-62sfz-master-1.c.openshift-qe.internal         Ready    master   101m   v1.19.0+7e8389f
qe-anuragcp3-62sfz-master-2.c.openshift-qe.internal         Ready    master   101m   v1.19.0+7e8389f
qe-anuragcp3-62sfz-worker-a-kntcs.c.openshift-qe.internal   Ready    worker   73m    v1.19.0+7e8389f
qe-anuragcp3-62sfz-worker-b-c87rv.c.openshift-qe.internal   Ready    worker   73m    v1.19.0+7e8389f
qe-anuragcp3-62sfz-worker-c-7f5gt.c.openshift-qe.internal   Ready    worker   73m    v1.19.0+7e8389f

[anusaxen@anusaxen verification-tests]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-22-130743   True        False         False      64m
cloud-credential                           4.6.0-0.nightly-2020-09-22-130743   True        False         False      103m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-22-130743   True        False         False      92m
config-operator                            4.6.0-0.nightly-2020-09-22-130743   True        False         False      99m
console                                    4.6.0-0.nightly-2020-09-22-130743   True        False         False      69m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-22-130743   True        False         False      76m
dns                                        4.6.0-0.nightly-2020-09-22-130743   True        False         False      97m
etcd                                       4.6.0-0.nightly-2020-09-22-130743   True        False         False      97m
image-registry                             4.6.0-0.nightly-2020-09-22-130743   True        False         False      72m
ingress                                    4.6.0-0.nightly-2020-09-22-130743   True        False         False      72m
insights                                   4.6.0-0.nightly-2020-09-22-130743   True        False         False      92m
kube-apiserver                             4.6.0-0.nightly-2020-09-22-130743   True        False         False      95m
kube-controller-manager                    4.6.0-0.nightly-2020-09-22-130743   True        False         False      96m
kube-scheduler                             4.6.0-0.nightly-2020-09-22-130743   True        False         False      97m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-22-130743   True        False         False      72m
machine-api                                4.6.0-0.nightly-2020-09-22-130743   True        False         False      72m
machine-approver                           4.6.0-0.nightly-2020-09-22-130743   True        False         False      94m
machine-config                             4.6.0-0.nightly-2020-09-22-130743   True        False         False      93m
marketplace                                4.6.0-0.nightly-2020-09-22-130743   True        False         False      76m
monitoring                                 4.6.0-0.nightly-2020-09-22-130743   True        False         False      66m
network                                    4.6.0-0.nightly-2020-09-22-130743   True        False         False      99m
node-tuning                                4.6.0-0.nightly-2020-09-22-130743   True        False         False      99m
openshift-apiserver                        4.6.0-0.nightly-2020-09-22-130743   True        False         False      92m
openshift-controller-manager               4.6.0-0.nightly-2020-09-22-130743   True        False         False      91m
openshift-samples                          4.6.0-0.nightly-2020-09-22-130743   True        False         False      92m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-22-130743   True        False         False      98m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-22-130743   True        False         False      98m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-22-130743   True        False         False      93m
service-ca                                 4.6.0-0.nightly-2020-09-22-130743   True        False         False      99m
storage                                    4.6.0-0.nightly-2020-09-22-130743   True        False         False      99m
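As a general triage note (not part of the original report): when a pool is stuck like this, a first pass is usually to compare the pool status against the per-node machineconfiguration annotations and the machine-config-daemon pods. A minimal sketch, assuming the standard machineconfiguration.openshift.io annotation keys and the default k8s-app=machine-config-daemon label:

# Show why the worker pool reports 0 ready machines
$ oc describe mcp worker

# Compare desired vs. current rendered config and MCD state per worker node
$ oc get nodes -l node-role.kubernetes.io/worker \
    -o custom-columns='NAME:.metadata.name,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'

# Check the machine-config-daemon pods and a recent slice of their logs
$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide
$ oc -n openshift-machine-config-operator logs ds/machine-config-daemon -c machine-config-daemon --tail=50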
Anurag, which network plugin is your env using? If it's OVN, it should be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1880974; the root cause there is that the machine-config-daemon pod cannot reach 172.30.0.1:443.
Noticed it on both SDN and OVN, so it seems to be independent of the network plugin and rather machine-config related.
Yep, per multiple runs today it seems like it's not happening on OpenShiftSDN but only on OVNKubernetes, so you're right Zhanqi, I suspect https://bugzilla.redhat.com/show_bug.cgi?id=1880974 might be the cause. Will let dev evaluate that. Thanks
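As an aside for anyone confirming which plugin a given cluster runs: the configured type is recorded on the cluster Network config object, so a quick check would be something like the following (it prints OpenShiftSDN or OVNKubernetes):

$ oc get network.config/cluster -o jsonpath='{.spec.networkType}{"\n"}'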
Please provide a must gather from this cluster.
Also the privatebin link above requires a password...
(In reply to Kirsten Garrison from comment #5)
> Also the privatebin link above requires a password...

Ahh, that's because the trailing "=" got excluded from the hyperlink somehow. The full link is:
https://privatebin-it-iso.int.open.paas.redhat.com/?82565729db2b9703#FdTB8xKK64KIz8NiO5SzHhR+vx9NKL48qqLgsMCNI2E=
Also can you please include a full must gather from this cluster as requested above?
(In reply to Kirsten Garrison from comment #7)
> Also can you please include a full must gather from this cluster as
> requested above?

Yes, Kirsten. That's on my action item list, but my last cluster got pruned, so I'm reproducing this on a new cluster and will provide the must-gather details.
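For reference (not part of the original exchange), the requested data is normally collected and packed up with the standard must-gather tooling, roughly as below; the destination directory name is just an example:

$ oc adm must-gather --dest-dir=./must-gather   # collects cluster state into ./must-gather
$ tar czf must-gather.tar.gz must-gather/       # archive to attach to the BZ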
Thanks Anurag! Let's leave the need info on the BZ until the must gather is shared for tracking purposes.
Thanks for the kubeconfig! The MCO looks like it's operating as expected, but the daemons on the workers can't reach the API server at 172.30.0.1:443. I see the same unexpected error in all of the daemon logs on the workers:

E0923 21:05:21.644735    2124 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I0923 21:06:10.673012    2124 trace.go:205] Trace[16495265]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (23-Sep-2020 21:05:40.672) (total time: 30000ms):
Trace[16495265]: [30.000763329s] [30.000763329s] END
E0923 21:06:10.673089    2124 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

This is preventing the MCO from finishing. Looking at the logs, this cluster is OVN as well, so I'd confirm your idea from https://bugzilla.redhat.com/show_bug.cgi?id=1881660#c3. I'm going to pass it to that team and let them verify/dupe this BZ to https://bugzilla.redhat.com/show_bug.cgi?id=1880974.
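One rough way to check for the same failure mode on another cluster (an assumption-laden sketch, not taken from the comment above) is to grep the daemon logs for the timeout and try to reach the service network address from inside a daemon pod. MCD_POD is a placeholder, and this assumes bash and coreutils are available in the daemon image:

$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide
$ MCD_POD=machine-config-daemon-xxxxx   # substitute a real pod name from the affected worker
$ oc -n openshift-machine-config-operator logs "$MCD_POD" -c machine-config-daemon | grep -i 'i/o timeout'
# bash's /dev/tcp avoids relying on curl being present in the image
$ oc -n openshift-machine-config-operator exec "$MCD_POD" -c machine-config-daemon -- \
    timeout 5 bash -c 'exec 3<>/dev/tcp/172.30.0.1/443 && echo reachable || echo unreachable'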
Thanks Kirsten for the initial investigation. This looks good on today's build, 4.6.0-0.nightly-2020-09-24-111253, after the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1880974 was merged.
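(For completeness, verification on a fixed build presumably amounts to re-running the original check and confirming the worker pool reports UPDATED=True with READYMACHINECOUNT=3:

$ oc get machineconfigpools worker
)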
Good to hear @Anurag - thanks for the update! Feel free to close this as a dupe.
*** This bug has been marked as a duplicate of bug 1880974 ***