Description of problem:

The e2e tests for MCO (TestMCDeployed) are failing pretty regularly:
https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op?buildId=

TestMCDeployed creates three machine configs and checks whether the system is able to reconcile to the new MCs. In my observation the reconciliation gets stuck on the 2nd MC. The MCD pod logs errors that suggest networking is down:

------------
E0119 15:57:47.358940    2432 reflector.go:123] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0119 15:57:47.359765    2432 trace.go:116] Trace[1070293612]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (started: 2020-01-19 15:57:17.359198877 +0000 UTC m=+2049.631878558) (total time: 30.000545685s):
Trace[1070293612]: [30.000545685s] [30.000545685s] END
E0119 15:57:47.359784    2432 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0119 15:58:18.359806    2432 trace.go:116] Trace[192921532]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-01-19 15:57:48.359164106 +0000 UTC m=+2080.631843557) (total time: 30.000603851s):
Trace[192921532]: [30.000603851s] [30.000603851s] END
E0119 15:58:18.359839    2432 reflector.go:123] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0119 15:58:18.361096    2432 trace.go:116] Trace[54103616]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (started: 2020-01-19 15:57:48.360292081 +0000 UTC m=+2080.632971513) (total time: 30.00076369s):
Trace[54103616]: [30.00076369s] [30.00076369s] END
E0119 15:58:18.361115    2432 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0119 15:58:49.360768    2432 trace.go:116] Trace[842277839]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-01-19 15:58:19.360058085 +0000 UTC m=+2111.632737476) (total time: 30.00063226s):
Trace[842277839]: [30.00063226s] [30.00063226s] END
E0119 15:58:49.360797    2432 reflector.go:123] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0119 15:58:49.361692    2432 trace.go:116] Trace[1265690540]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (started: 2020-01-19 15:58:19.36122169 +0000 UTC m=+2111.633901218) (total time: 30.000452683s):
Trace[1265690540]: [30.000452683s] [30.000452683s] END
E0119 15:58:49.361714    2432 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0119 15:59:20.361596    2432 trace.go:116] Trace[1996325786]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (started: 2020-01-19 15:58:50.360994566 +0000 UTC m=+2142.633674215) (total time: 30.000559905s):
Trace[1996325786]: [30.000559905s] [30.000559905s] END
E0119 15:59:20.361629    2432 reflector.go:123] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0119 15:59:20.362553    2432 trace.go:116] Trace[1228547051]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (started: 2020-01-19 15:58:50.362082103 +0000 UTC m=+2142.634761555) (total time: 30.000452881s):
Trace[1228547051]: [30.000452881s] [30.000452881s] END
E0119 15:59:20.362577    2432 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
-------

The SDN pods are stuck in an error state and are not getting recreated/reconciled because the kubelet is in `NotReady,SchedulingDisabled` state:

-----
[alpatel@alpatel dev-installer]$ oc get pods -n openshift-sdn
NAME                   READY   STATUS             RESTARTS   AGE
ovs-69k2t              1/1     Running            1          120m
ovs-fnd6q              1/1     Running            0          129m
ovs-g9x9n              0/1     Error              2          120m
ovs-hgtps              1/1     Running            2          120m
ovs-jp9xl              1/1     Running            0          129m
ovs-x7th9              1/1     Running            0          129m
sdn-bhcvq              1/1     Running            0          129m
sdn-controller-hfpfl   1/1     Running            0          129m
sdn-controller-sz2xw   1/1     Running            0          129m
sdn-controller-w4658   1/1     Running            0          129m
sdn-h2k8w              1/1     Running            1          120m
sdn-htd6g              1/1     Running            0          129m
sdn-htqzd              0/1     CrashLoopBackOff   12         120m
sdn-kbjx8              1/1     Running            2          120m
sdn-zbdxp              1/1     Running            0          129m
--------

Status of pod sdn-htqzd:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-01-19T15:23:08Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-01-19T15:22:54Z"
    message: 'containers with unready status: [sdn]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-01-19T15:23:04Z"
    message: 'containers with unready status: [sdn]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-01-19T14:00:03Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://c204d23ec9e717a869a9f94bebdc60beb5d4b889335f919c2f7c4ff8c3fbb6ce
    image: registry.svc.ci.openshift.org/ocp/4.4-2020-01-16-152257@sha256:f556fedc5e6984f5461407063fa1919b09f5d7d3ac8e3e361a27264b5c4a811c
    imageID: registry.svc.ci.openshift.org/ocp/4.4-2020-01-16-152257@sha256:f556fedc5e6984f5461407063fa1919b09f5d7d3ac8e3e361a27264b5c4a811c
    lastState:
      terminated:
        containerID: cri-o://c204d23ec9e717a869a9f94bebdc60beb5d4b889335f919c2f7c4ff8c3fbb6ce
        exitCode: 255
        finishedAt: "2020-01-19T16:05:21Z"
        message: |
          23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:11.262570   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:12.262529   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:13.262563   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:14.262533   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:15.262567   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:16.262569   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:17.262519   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:18.262703   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:19.262543   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:20.262529   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:21.262560   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0119 16:05:21.262805   23473 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          F0119 16:05:21.262897   23473 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
        reason: Error
        startedAt: "2020-01-19T16:04:21Z"
    name: sdn
    ready: false
    restartCount: 13
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=sdn pod=sdn-htqzd_openshift-sdn(a9124b95-a6aa-4444-af49-37b65c18920c)
        reason: CrashLoopBackOff
  hostIP: 10.0.169.160
  initContainerStatuses:
  - containerID: cri-o://dd1247b7c1a597226c210b25794a73c31cb0fe4812b5798775f16d2970820fdd
    image: registry.svc.ci.openshift.org/ocp/4.4-2020-01-16-152257@sha256:f8de726661ce92ee52c4de8498a9f2868a4569b7ae62e59442d09ccbb78302b5
    imageID: registry.svc.ci.openshift.org/ocp/4.4-2020-01-16-152257@sha256:f8de726661ce92ee52c4de8498a9f2868a4569b7ae62e59442d09ccbb78302b5
    lastState: {}
    name: install-cni-plugins
    ready: true
    restartCount: 1
    state:
      terminated:
        containerID: cri-o://dd1247b7c1a597226c210b25794a73c31cb0fe4812b5798775f16d2970820fdd
        exitCode: 0
        finishedAt: "2020-01-19T15:23:07Z"
        reason: Completed
        startedAt: "2020-01-19T15:23:07Z"
  phase: Running
  podIP: 10.0.169.160
  podIPs:
  - ip: 10.0.169.160
  qosClass: Burstable
  startTime: "2020-01-19T14:00:03Z"

Status of pod:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-01-19T14:00:04Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-01-19T15:22:54Z"
    message: 'containers with unready status: [openvswitch]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-01-19T15:23:05Z"
    message: 'containers with unready status: [openvswitch]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-01-19T14:00:04Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://83044d22e4b5123ef9e34439b5357344e22608d4210ea30c30683ed2c1ea242b
    image: registry.svc.ci.openshift.org/ocp/4.4-2020-01-16-152257@sha256:f556fedc5e6984f5461407063fa1919b09f5d7d3ac8e3e361a27264b5c4a811c
    imageID: registry.svc.ci.openshift.org/ocp/4.4-2020-01-16-152257@sha256:f556fedc5e6984f5461407063fa1919b09f5d7d3ac8e3e361a27264b5c4a811c
    lastState: {}
    name: openvswitch
    ready: false
    restartCount: 2
    started: false
    state:
      terminated:
        containerID: cri-o://83044d22e4b5123ef9e34439b5357344e22608d4210ea30c30683ed2c1ea242b
        exitCode: 255
        finishedAt: "2020-01-19T15:23:03Z"
        message: |
          tion reset by peer)
          2020-01-19T15:21:42.833Z|00118|connmgr|INFO|br0<->unix#1285: 2 flow_mods in the last 0 s (2 deletes)
          2020-01-19T15:21:42.985Z|00119|connmgr|INFO|br0<->unix#1288: 4 flow_mods in the last 0 s (4 deletes)
          2020-01-19T15:21:43.053Z|00120|bridge|INFO|bridge br0: deleted interface veth5f6d5e65 on port 8
          2020-01-19T15:21:43.177Z|00121|connmgr|INFO|br0<->unix#1291: 2 flow_mods in the last 0 s (2 deletes)
          2020-01-19T15:21:43.301Z|00122|connmgr|INFO|br0<->unix#1295: 4 flow_mods in the last 0 s (4 deletes)
          2020-01-19T15:21:43.371Z|00123|bridge|INFO|bridge br0: deleted interface vethf3d84766 on port 7
          2020-01-19T15:21:43.447Z|00124|connmgr|INFO|br0<->unix#1298: 2 flow_mods in the last 0 s (2 deletes)
          2020-01-19T15:21:43.612Z|00125|connmgr|INFO|br0<->unix#1301: 4 flow_mods in the last 0 s (4 deletes)
          2020-01-19T15:21:43.652Z|00126|bridge|INFO|bridge br0: deleted interface veth1d4e8e49 on port 11
          2020-01-19T15:21:43.712Z|00127|connmgr|INFO|br0<->unix#1304: 2 flow_mods in the last 0 s (2 deletes)
          2020-01-19T15:21:43.764Z|00128|connmgr|INFO|br0<->unix#1307: 4 flow_mods in the last 0 s (4 deletes)
          2020-01-19T15:21:43.826Z|00129|bridge|INFO|bridge br0: deleted interface veth9ea99a48 on port 9
          2020-01-19T15:21:43.905Z|00130|connmgr|INFO|br0<->unix#1310: 2 flow_mods in the last 0 s (2 deletes)
          2020-01-19T15:21:43.961Z|00131|connmgr|INFO|br0<->unix#1313: 4 flow_mods in the last 0 s (4 deletes)
          2020-01-19T15:21:44.056Z|00132|bridge|INFO|bridge br0: deleted interface vethaaf9278c on port 5
          2020-01-19T15:22:12.291Z|00133|connmgr|INFO|br0<->unix#1335: 2 flow_mods in the last 0 s (2 deletes)
          2020-01-19T15:22:12.325Z|00134|connmgr|INFO|br0<->unix#1339: 4 flow_mods in the last 0 s (4 deletes)
          2020-01-19T15:22:12.354Z|00135|bridge|INFO|bridge br0: deleted interface vethba87ba79 on port 10
          2020-01-19T15:22:12.298Z|00019|jsonrpc|WARN|unix#1239: receive error: Connection reset by peer
          2020-01-19T15:22:12.298Z|00020|reconnect|WARN|unix#1239: connection dropped (Connection reset by peer)
          Exiting ovs-vswitchd (2570).
          Terminated
        reason: Error
        startedAt: "2020-01-19T14:51:49Z"
  hostIP: 10.0.169.160
  phase: Running
  podIP: 10.0.169.160
  podIPs:
  - ip: 10.0.169.160
  qosClass: Burstable
  startTime: "2020-01-19T14:00:04Z"
------

Node statuses:

$ oc get nodes
alpatel: Sun Jan 19 11:12:15 2020
NAME                           STATUS                        ROLES    AGE    VERSION
ip-10-0-130-26.ec2.internal    Ready                         worker   131m   v1.17.0
ip-10-0-130-92.ec2.internal    Ready                         master   141m   v1.17.0
ip-10-0-149-91.ec2.internal    Ready                         master   141m   v1.17.0
ip-10-0-154-163.ec2.internal   Ready                         worker   132m   v1.17.0
ip-10-0-162-226.ec2.internal   Ready                         master   141m   v1.17.0
ip-10-0-169-160.ec2.internal   NotReady,SchedulingDisabled   worker   132m   v1.17.0
------

$ oc describe node ip-10-0-169-160.ec2.internal | grep reason:NetworkPluginNotReady -C 2
  DiskPressure   False   Sun, 19 Jan 2020 11:08:16 -0500   Sun, 19 Jan 2020 10:23:04 -0500   KubeletHasNoDiskPressure   kubelet has no disk pressure
  PIDPressure    False   Sun, 19 Jan 2020 11:08:16 -0500   Sun, 19 Jan 2020 10:23:04 -0500   KubeletHasSufficientPID    kubelet has sufficient PID available
  Ready          False   Sun, 19 Jan 2020 11:08:16 -0500   Sun, 19 Jan 2020 10:23:04 -0500   KubeletNotReady            runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Addresses:
  InternalIP:  10.0.169.160

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Fairly reproducible.

Steps to Reproduce:
1. Create a 4.4 cluster with a recent build.
2. Run `make test-e2e`.
3. Watch the sdn pods, nodes, and mco/mcd pods.

Additional info:
If the kubelet is manually restarted by running `systemctl restart kubelet.service`, the system gets temporarily fixed until it hits a similar issue on another worker node.
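In case it helps with step 3 and with the temporary workaround above, a rough sketch of the commands involved. The node name below is just the affected worker from this report (substitute your own), and openshift-machine-config-operator is assumed to be the usual MCO namespace:

$ # watch the SDN pods, nodes, and MCO/MCD pods while the test runs
$ oc get pods -n openshift-sdn -w
$ oc get nodes -w
$ oc get pods -n openshift-machine-config-operator -w

$ # temporary workaround: restart the kubelet on the NotReady worker without SSH access
$ oc debug node/ip-10-0-169-160.ec2.internal -- chroot /host systemctl restart kubelet.service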
Is this something for the SDN team? The only recent network-related change I can find is https://github.com/openshift/machine-config-operator/commit/b0637fc8a51618842aac832b7363ce219ac731c2 (cleanup sdn ips on reboot). CC'ing Ryan on that and moving this to SDN.
*** Bug 1793012 has been marked as a duplicate of this bug. ***
The RHCOS/CRI-O rollback brought the MCO CI back to normal. There's at least one PR in CRI-O meant to fix this bug, so I'm moving this to Node.
Might be a dup of bug 1794493. We'll see if this clears up once [1] lands (and we revert the RHCOS rollback?).

[1]: https://github.com/openshift/machine-config-operator/pull/1405
Sorry to bother. Just to understand what to track: is this going to be fixed by the RHCOS rollback, by the CRI-O fix [1], or via https://bugzilla.redhat.com/show_bug.cgi?id=1794493?

[1]: https://github.com/cri-o/cri-o/pull/3138
(In reply to Federico Paolinelli from comment #11)
> Sorry to bother.
> Just to understand what to track: is this going to be fixed by the RHCOS
> rollback, by the CRI-O fix [1], or via
> https://bugzilla.redhat.com/show_bug.cgi?id=1794493?
>
> [1]: https://github.com/cri-o/cri-o/pull/3138

This has been fixed temporarily in CI by rolling back CRI-O to v1.16.x. The CRI-O PR aims to fix this permanently (it's just a revert though; I'll let the CRI-O team speak for it). Also, the BZ you linked (https://bugzilla.redhat.com/show_bug.cgi?id=1794493) has nothing to do with this; a different issue is being tracked there (which targets 4.4).
Hi Alay, can you tell me how to get the repo used in your reproduce step "Run `make test-e2e`"?
Confirmed with cri-o 1.17 on RHCOS 44.81.202003092122-0, verified!

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-10-002851   True        False         26h     Cluster version is 4.4.0-0.nightly-2020-03-10-002851

$ oc get node -owide
NAME                                        STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-56-239.us-east-2.compute.internal   Ready    master   27h   v1.17.1   10.0.56.239   <none>        Red Hat Enterprise Linux CoreOS 44.81.202003092122-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
ip-10-0-60-181.us-east-2.compute.internal   Ready    worker   24h   v1.17.1   10.0.60.181   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el7
ip-10-0-62-127.us-east-2.compute.internal   Ready    master   27h   v1.17.1   10.0.62.127   <none>        Red Hat Enterprise Linux CoreOS 44.81.202003092122-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
ip-10-0-62-200.us-east-2.compute.internal   Ready    worker   24h   v1.17.1   10.0.62.200   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el7
ip-10-0-63-25.us-east-2.compute.internal    Ready    worker   26h   v1.17.1   10.0.63.25    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003092122-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
ip-10-0-63-67.us-east-2.compute.internal    Ready    worker   24h   v1.17.1   10.0.63.67    <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el7
ip-10-0-63-75.us-east-2.compute.internal    Ready    worker   26h   v1.17.1   10.0.63.75    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003092122-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
ip-10-0-68-250.us-east-2.compute.internal   Ready    master   27h   v1.17.1   10.0.68.250   <none>        Red Hat Enterprise Linux CoreOS 44.81.202003092122-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
ip-10-0-69-52.us-east-2.compute.internal    Ready    worker   26h   v1.17.1   10.0.69.52    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003092122-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8

$ oc debug node/ip-10-0-56-239.us-east-2.compute.internal
sh-4.4# rpm -qa | grep cri-o
cri-o-1.17.0-8.dev.rhaos4.4.git36920a5.el8.x86_64
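For reference, a more compact way to pull out just the per-node container runtime version when checking the fix (same information as the CONTAINER-RUNTIME column in the -owide output above; this is only a convenience, not part of the verification steps):

$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'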
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days