Bug 1793012
| Summary: | Node NotReady after applying two machine configs to the same pool | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Federico Paolinelli <fpaoline> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.4 | CC: | fromani, msluiter, scuppett |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | All | OS: | All |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-20 14:42:28 UTC | Type: | Bug |
| Regression: | --- | | |
Is this the same as https://bugzilla.redhat.com/show_bug.cgi?id=1792749?

*** This bug has been marked as a duplicate of bug 1792749 ***
Description of problem:

If I apply a machine config to the worker pool, wait for it to be ready, and then apply a second one, the node does not get back to the Ready state.

Version-Release number of selected component (if applicable):

```
oc version
Client Version: v4.2.0
Server Version: 4.4.0-0.nightly-2020-01-20-081123
Kubernetes Version: v1.17.0
```

How reproducible:

Really often, if not always.

Steps to Reproduce:

1. Apply any MachineConfig, like:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: samplemc
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8,hello
        filesystem: root
        mode: 420
        path: /tmp/foo.txt
```

and wait for the MachineConfigPool to get back to the Updated state:

```
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a377cf240a65d7a8fa30619f6c37fb90   True      False      False      1              1                   1                     0                      47m
worker   rendered-worker-f866ee1180b333e57dde24e794b876d5   True      False      False      1              1                   1                     0                      47m
```

2. Apply another one:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: foo
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8,hello
        filesystem: root
        mode: 420
        path: /tmp/foo1.txt
```

Actual results:

The first node that rebooted never becomes Ready and the MachineConfigPool is stuck in Updating.

Expected results:

All the nodes get back to Ready and the MachineConfigPool is Updated.

Additional info:

The node is not Ready because of:

```
Ready   False   Mon, 20 Jan 2020 14:30:08 +0100   Mon, 20 Jan 2020 13:34:56 +0100   KubeletNotReady   runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
```

The sdn pod on that node is not able to start because it cannot find the OVS socket:

```
oc -n openshift-sdn get pod sdn-gxlgp -o yaml
...
    lastState:
      terminated:
        containerID: cri-o://0998aef659e81d9ae0a641fe5ada78c564a1048a2d1b1965a4082f769c170d3f
        exitCode: 255
        finishedAt: "2020-01-20T13:11:10Z"
        message: |
          21186 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0120 13:11:00.076144 21186 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0120 13:11:01.075847 21186 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          [....]
          F0120 13:11:10.076349 21186 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
```

This is because the ovs pod is not starting (and I think this is the root cause):

```
oc get pods -n openshift-sdn ovs-cjrzd
NAME        READY   STATUS   RESTARTS   AGE
ovs-cjrzd   0/1     Error    1          100m
```

which is in turn caused by:

```
Warning   FailedCreatePodSandBox   51m (x3 over 51m)       kubelet, test1-5f65q-worker-0-sq48t   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_ovs-cjrzd_openshift-sdn_046bb2b4-5ac7-48b4-9348-f79de18866d1_1 for id eab93df02f5d960d3e182fc719660cf30b736d06d7b0b6803bd07a887d250e4b: name is reserved
Normal    SandboxChanged           2m47s (x235 over 52m)   kubelet, test1-5f65q-worker-0-sq48t   Pod sandbox changed, it will be killed and re-created.
```
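For anyone trying to reproduce or triage this, here is a rough sketch of the checks I ran while gathering the output above. The node and pod names (test1-5f65q-worker-0-sq48t, ovs-cjrzd) are specific to my cluster, and the machine-config-daemon label selector is my assumption of the usual one on 4.x; adjust both for your environment:

```
# Watch the worker pool roll out the second rendered config
oc get mcp worker -w

# Find the node that stays NotReady after the reboot
oc get nodes -o wide

# Inspect the node conditions (shows the "Missing CNI default network" reason)
oc describe node test1-5f65q-worker-0-sq48t

# Check the sdn/ovs pods scheduled on that node
oc -n openshift-sdn get pods -o wide --field-selector spec.nodeName=test1-5f65q-worker-0-sq48t

# Look at why the ovs pod sandbox cannot be created ("name is reserved")
oc -n openshift-sdn describe pod ovs-cjrzd

# Machine-config-daemon logs, to correlate with the reboot
# (assumes the daemon pods carry the k8s-app=machine-config-daemon label)
oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-daemon -c machine-config-daemon --tail=100
```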