Bug 1793012 - Node NotReady after applying two machine configs to the same pool
Summary: Node NotReady after applying two machine configs to the same pool
Keywords:
Status: CLOSED DUPLICATE of bug 1792749
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.4
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-20 13:51 UTC by Federico Paolinelli
Modified: 2020-01-20 17:29 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-20 14:42:28 UTC
Target Upstream Version:
Embargoed:



Description Federico Paolinelli 2020-01-20 13:51:17 UTC
Description of problem:
If I apply a machine config to the worker pool, wait for it to be ready, and apply a second one, the node does not get back to the ready state.

Version-Release number of selected component (if applicable):

oc version
Client Version: v4.2.0
Server Version: 4.4.0-0.nightly-2020-01-20-081123
Kubernetes Version: v1.17.0


How reproducible:
Really often, if not always

Steps to Reproduce:
1. Apply any MachineConfig like:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: samplemc
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8,hello
          filesystem: root
          mode: 420
          path: /tmp/foo.txt
```

and wait for the MachineConfigPool to get back to the Updated state

```
oc get mcp                       
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a377cf240a65d7a8fa30619f6c37fb90   True      False      False      1              1                   1                     0                      47m
worker   rendered-worker-f866ee1180b333e57dde24e794b876d5   True      False      False      1              1                   1                     0                      47m
```
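
For reference, the apply-and-wait step is roughly the following (samplemc.yaml is just the file name I'm assuming for the YAML above):

```
# Apply the MachineConfig and let the MCO roll it out to the worker pool.
oc apply -f samplemc.yaml

# Block until the pool reports Updated again, i.e. all worker nodes
# have rebooted into the new rendered config.
oc wait machineconfigpool/worker --for=condition=Updated --timeout=20m
```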



2. Apply another one:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: foo
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8,hello
          filesystem: root
          mode: 420
          path: /tmp/foo1.txt
```
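
Same flow for the second config (again assuming a local file name, here foo.yaml), watching the pool and the nodes while it rolls out:

```
oc apply -f foo.yaml

# The worker pool flips to Updating and drains/reboots one node at a time;
# watching the nodes is where the stuck NotReady shows up.
oc get mcp worker -w
oc get nodes -w
```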


Actual results:
The first node to be rebooted never gets back to Ready, and the MCP is stuck in Updating.

Expected results:
All the nodes get back to Ready and the MachineConfigPool is Updated.


Additional info:

The node is not ready because of:

```
  Ready            False   Mon, 20 Jan 2020 14:30:08 +0100   Mon, 20 Jan 2020 13:34:56 +0100   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
```
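
That condition can be pulled straight from the node object; for example (with <node> as a placeholder for the stuck worker):

```
# Print only the Ready condition of the affected node.
oc get node <node> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
```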

The sdn pod on that node is not able to start because it cannot find the OVS socket:

```
oc -n openshift-sdn get pod sdn-gxlgp -o yaml
    ...
    lastState:
      terminated:
        containerID: cri-o://0998aef659e81d9ae0a641fe5ada78c564a1048a2d1b1965a4082f769c170d3f
        exitCode: 255
        finishedAt: "2020-01-20T13:11:10Z"
        message: |
          21186 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0120 13:11:00.076144   21186 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          I0120 13:11:01.075847   21186 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
          [....]
          F0120 13:11:10.076349   21186 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
```
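
To double-check that the socket really is missing on the host, a debug shell on the node works (node name taken from the events further down; this is just one way to get there):

```
# List the OVS db socket from a debug pod on the affected node.
oc debug node/test1-5f65q-worker-0-sq48t -- chroot /host ls -l /var/run/openvswitch/db.sock
```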

This is because the ovs pod is not starting (and I think this is the root cause).

```
oc get pods -n openshift-sdn ovs-cjrzd
NAME        READY   STATUS   RESTARTS   AGE
ovs-cjrzd   0/1     Error    1          100m
```


This is because of the following events on the ovs pod:

```
Warning  FailedCreatePodSandBox  51m (x3 over 51m)      kubelet, test1-5f65q-worker-0-sq48t  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_ovs-cjrzd_openshift-sdn_046bb2b4-5ac7-48b4-9348-f79de18866d1_1 for id eab93df02f5d960d3e182fc719660cf30b736d06d7b0b6803bd07a887d250e4b: name is reserved
Normal   SandboxChanged          2m47s (x235 over 52m)  kubelet, test1-5f65q-worker-0-sq48t  Pod sandbox changed, it will be killed and re-created.
```
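
The "name is reserved" error is CRI-O refusing to reuse a pod sandbox name; the stale sandbox can be inspected from the node itself (a sketch only, assuming crictl on the RHCOS host):

```
# List CRI-O pod sandboxes matching the ovs pod, from a debug shell on the node.
oc debug node/test1-5f65q-worker-0-sq48t -- chroot /host crictl pods --name ovs
```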

Comment 2 Antonio Murdaca 2020-01-20 14:28:08 UTC
Is this the same as https://bugzilla.redhat.com/show_bug.cgi?id=1792749?

Comment 3 Antonio Murdaca 2020-01-20 14:42:28 UTC

*** This bug has been marked as a duplicate of bug 1792749 ***

