Bug 1729510 - MCD does not wait for nodes to drain
Summary: MCD does not wait for nodes to drain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.1.z
Assignee: Antonio Murdaca
QA Contact: Micah Abbott
URL:
Whiteboard:
Duplicates: 1744948 (view as bug list)
Depends On: 1729512 1737379 1743846
Blocks:
 
Reported: 2019-07-12 12:49 UTC by Antonio Murdaca
Modified: 2019-09-10 15:59 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1729243
Clones: 1743846 (view as bug list)
Environment:
Last Closed: 2019-09-10 15:59:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1071 0 None closed Bug 1729510: [release-4.1] backports for drain fixes 2020-04-14 14:26:01 UTC
Red Hat Product Errata RHSA-2019:2594 0 None None None 2019-09-10 15:59:38 UTC

Description Antonio Murdaca 2019-07-12 12:49:59 UTC
The MCO is also affected by this bug as we use the library to drain nodes.

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node.  Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain.  The bug causes the library to report that a pod has been evicted or deleted before the eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1.  Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
We should wait to ensure services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875

Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

Comment 1 Antonio Murdaca 2019-08-23 10:40:38 UTC
*** Bug 1744948 has been marked as a duplicate of this bug. ***

Comment 4 Micah Abbott 2019-09-03 14:39:03 UTC
Discussed with Antonio how to reproduce this on the MCO side.  Any new MachineConfig triggers a drain of the node when the new MC is applied, so the steps are:

1.  Create cluster
2.  Create MachineConfig for worker
3.  `oc logs -f <worker MCD pod>`
4.  Watch for messages about draining the node

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-08-30-035445   True        False         47m     Cluster version is 4.1.0-0.nightly-2019-08-30-035445

### Figure out which nodes are workers
$ oc get node                 
NAME                                         STATUS   ROLES    AGE   VERSION 
ip-10-0-140-86.us-west-2.compute.internal    Ready    master   62m   v1.13.4+3bd346709                                                                                                                                                                                                      
ip-10-0-143-236.us-west-2.compute.internal   Ready    worker   55m   v1.13.4+3bd346709                                                       
ip-10-0-155-84.us-west-2.compute.internal    Ready    worker   55m   v1.13.4+3bd346709                                 
ip-10-0-157-48.us-west-2.compute.internal    Ready    master   62m   v1.13.4+3bd346709             
ip-10-0-166-99.us-west-2.compute.internal    Ready    worker   54m   v1.13.4+3bd346709                                                                                                                                                                                                      
ip-10-0-168-59.us-west-2.compute.internal    Ready    master   63m   v1.13.4+3bd346709

### Find a MCD pod on a worker node
$ oc -n openshift-machine-config-operator describe po machine-config-daemon-4njpl | grep Node:
Node:               ip-10-0-155-84.us-west-2.compute.internal/10.0.155.84                                     

### Create a MachineConfig
$ cat ../machineConfigs/chrony.yaml 
apiVersion: machineconfiguration.openshift.io/v1                                                                                              
kind: MachineConfig                                                                                                                           
metadata:                                                                                                                                     
  labels:                                                                                                                                     
    machineconfiguration.openshift.io/role: worker                                                                                            
  name: 60-examplecorp-chrony                                                                                                                 
spec:                                                                                                                                         
  config:                                                                                                                                                                                                                                                                                   
    ignition:                                                   
      version: 2.2.0                                                                                                                          
    storage:                                                                                                                                  
      files:                                                                                                                                  
      - contents:                                                                                                                             
          source: data:text/plain;charset=utf-8;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK                                                     
        filesystem: root                                                                                                                      
        mode: 0644                                                                                                                            
        path: /etc/chrony.conf                                                                                                                

### Apply the MachineConfig
$ oc apply -f ../machineConfigs/chrony.yaml 
machineconfig.machineconfiguration.openshift.io/60-examplecorp-chrony created

### Watch the MCD pod for drain messages
$ oc -n openshift-machine-config-operator logs -f pod/machine-config-daemon-4njpl
I0903 13:30:57.790426    4480 start.go:67] Version: 4.1.14-201908291507-dirty (e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab)
I0903 13:30:57.791293    4480 start.go:100] Starting node writer                                                                              
I0903 13:30:57.796310    4480 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0903 13:30:58.038685    4480 daemon.go:200] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b68f108c63ae9f7245621be001cda06d0d909565dc63624530144570c11dba3 (410.8.20190830.0)
I0903 13:30:58.044703    4480 start.go:196] Calling chroot("/rootfs")                                                                         
I0903 13:30:58.044739    4480 start.go:206] Starting MachineConfigDaemon                               
I0903 13:30:58.044811    4480 update.go:847] Starting to manage node: ip-10-0-155-84.us-west-2.compute.internal
I0903 13:30:58.074577    4480 run.go:22] Running captured: rpm-ostree status                        
I0903 13:30:58.473083    4480 daemon.go:740] State: idle   
AutomaticUpdates: disabled                                                                                                                    
Deployments:                                                                                                                                  
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b68f108c63ae9f7245621be001cda06d0d909565dc63624530144570c11dba3
              CustomOrigin: Managed by pivot tool                                                                                             
                   Version: 410.8.20190830.0 (2019-08-30T02:31:46Z)                                                                           

  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/ootpa@sha256:683a6a866a8ec789fedb5da63b6a2ff68c1b0788ec90e7def778f0c4c13197a4
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T20:10:04Z)
I0903 13:30:58.473113    4480 run.go:22] Running captured: journalctl --list-boots
I0903 13:30:58.526842    4480 daemon.go:747] journalctl --list-boots:
-1 d7495f00adbb4f6e836df08416314f18 Tue 2019-09-03 13:25:34 UTC—Tue 2019-09-03 13:28:33 UTC
 0 7aa2913ce3b943ad8dfbca79db301fac Tue 2019-09-03 13:28:46 UTC—Tue 2019-09-03 13:30:58 UTC
I0903 13:30:58.526939    4480 daemon.go:494] Enabling Kubelet Healthz Monitor
I0903 13:30:59.451443    4480 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node ip-10-0-155-84.us-west-2.compute.internal: map[machine.openshift.io/machine:openshift-machine-api/miabbott-4-1-14-5t67r-worker-us-west-2b-vcpjd volumes.kubernetes.io/contro
ller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json
I0903 13:30:59.457046    4480 node.go:45] Setting initial node config: rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
I0903 13:30:59.501338    4480 update.go:734] logger doesn't support --jounald, grepping the journal
I0903 13:30:59.542539    4480 update.go:847] error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory
I0903 13:30:59.554603    4480 daemon.go:667] In bootstrap mode
I0903 13:30:59.554627    4480 daemon.go:695] Current+desired config: rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
I0903 13:30:59.561204    4480 daemon.go:854] No bootstrap pivot required; unlinking bootstrap node annotations
I0903 13:30:59.561255    4480 update.go:847] Validating against current config rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
I0903 13:30:59.563831    4480 daemon.go:898] Validating against current config rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
I0903 13:30:59.576253    4480 daemon.go:904] Validated on-disk state
I0903 13:30:59.585952    4480 daemon.go:951] In desired config rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
E0903 13:36:10.918151    4480 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=9, ErrCode=NO_ERROR, debug=""
E0903 13:36:10.918151    4480 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=9, ErrCode=NO_ERROR, debug=""
W0903 13:36:11.141549    4480 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 12587 (14989)

I0903 14:26:18.060959    4480 update.go:185] Checking reconcilable for config rendered-worker-ed526cce1edf9d8c4d643d17760ccd25 to rendered-worker-24ffb68b002cb7511301664cb41c0a68
I0903 14:26:18.062693    4480 update.go:847] Starting update from rendered-worker-ed526cce1edf9d8c4d643d17760ccd25 to rendered-worker-24ffb68b002cb7511301664cb41c0a68
I0903 14:26:18.065515    4480 update.go:399] Updating files
I0903 14:26:18.065528    4480 update.go:601] Writing file "/etc/tmpfiles.d/cleanup-cni.conf"
I0903 14:26:18.067103    4480 update.go:601] Writing file "/etc/systemd/system.conf.d/kubelet-cgroups.conf"
I0903 14:26:18.068593    4480 update.go:601] Writing file "/var/lib/kubelet/config.json"
I0903 14:26:18.070069    4480 update.go:601] Writing file "/etc/kubernetes/ca.crt"
I0903 14:26:18.071754    4480 update.go:601] Writing file "/etc/sysctl.d/forward.conf"
I0903 14:26:18.073279    4480 update.go:601] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0903 14:26:18.074157    4480 update.go:601] Writing file "/etc/containers/registries.conf"
I0903 14:26:18.075597    4480 update.go:601] Writing file "/etc/containers/storage.conf"
I0903 14:26:18.077298    4480 update.go:601] Writing file "/etc/crio/crio.conf"
I0903 14:26:18.079188    4480 update.go:601] Writing file "/etc/kubernetes/cloud.conf"
I0903 14:26:18.080075    4480 update.go:601] Writing file "/etc/kubernetes/kubelet-ca.crt"
I0903 14:26:18.081642    4480 update.go:601] Writing file "/etc/kubernetes/kubelet.conf"
I0903 14:26:18.083098    4480 update.go:601] Writing file "/etc/chrony.conf"
I0903 14:26:18.084600    4480 update.go:556] Writing systemd unit "kubelet.service"
I0903 14:26:18.086184    4480 update.go:573] Enabling systemd unit "kubelet.service"
I0903 14:26:18.086275    4480 update.go:493] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0903 14:26:18.086287    4480 update.go:418] Deleting stale data
I0903 14:26:18.086306    4480 update.go:663] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0903 14:26:18.089973    4480 update.go:812] logger doesn't support --jounald, logging json directly
I0903 14:26:18.093733    4480 update.go:847] Update prepared; beginning drain
I0903 14:26:18.101639    4480 update.go:89] cordoned node "ip-10-0-155-84.us-west-2.compute.internal"
I0903 14:26:18.144107    4480 update.go:93] ignoring DaemonSet-managed pods: tuned-jvg2z, dns-default-268kq, node-ca-rwrml, machine-config-daemon-4njpl, node-exporter-ldmv7, multus-tpwj4, ovs-4c5sg, sdn-4lml2; deleting pods with local storage: alertmanager-main-2, kube-state-metrics-
64c4bdf979-m6jsj, prometheus-adapter-7f77c6997f-dbbnz, prometheus-adapter-7f77c6997f-ptlrr
I0903 14:26:22.056307    4480 update.go:89] pod "router-default-5cf946464-l6st8" removed (evicted)
I0903 14:26:22.856358    4480 update.go:89] pod "certified-operators-c4ff7fc9b-8kclm" removed (evicted)
I0903 14:26:24.256264    4480 update.go:89] pod "prometheus-adapter-7f77c6997f-ptlrr" removed (evicted)
I0903 14:26:24.656182    4480 update.go:89] pod "redhat-operators-7699569b86-fnvtm" removed (evicted)
I0903 14:26:26.456437    4480 update.go:89] pod "image-registry-f7cd488bf-xmth9" removed (evicted)
I0903 14:26:26.656932    4480 update.go:89] pod "alertmanager-main-2" removed (evicted)
I0903 14:26:27.456095    4480 update.go:89] pod "telemeter-client-b65c57ff7-kh6h9" removed (evicted)
I0903 14:26:27.856365    4480 update.go:89] pod "community-operators-6f95cf67c6-fh8l6" removed (evicted)
I0903 14:26:28.056288    4480 update.go:89] pod "prometheus-adapter-7f77c6997f-dbbnz" removed (evicted)
I0903 14:26:29.411252    4480 update.go:89] pod "kube-state-metrics-64c4bdf979-m6jsj" removed (evicted)
I0903 14:26:29.411283    4480 update.go:89] drained node "ip-10-0-155-84.us-west-2.compute.internal"
I0903 14:26:29.411291    4480 update.go:847] drain complete
I0903 14:26:29.413821    4480 update.go:847] initiating reboot: Node will reboot into config rendered-worker-24ffb68b002cb7511301664cb41c0a68
I0903 14:26:29.471061    4480 start.go:215] Shutting down MachineConfigDaemon

### Verify MachineConfigPool is updated and all nodes are healthy
$ oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   CREATED
00-master                                                   e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
00-worker                                                   e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
01-master-container-runtime                                 e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
01-master-kubelet                                           e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
01-worker-container-runtime                                 e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
01-worker-kubelet                                           e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
60-examplecorp-chrony                                                                                  2.2.0             12m
99-master-f1c24a07-ce4d-11e9-99b1-02f047b21df0-registries   e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
99-master-ssh                                                                                          2.2.0             75m
99-worker-f1c3fa01-ce4d-11e9-99b1-02f047b21df0-registries   e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
99-worker-ssh                                                                                          2.2.0             75m
rendered-master-8b9815ae15d192479af438faa89e5271            e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
rendered-worker-24ffb68b002cb7511301664cb41c0a68            e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             12m
rendered-worker-ed526cce1edf9d8c4d643d17760ccd25            e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m

$ oc describe machineconfigpool/worker
Name:         worker
Namespace:                                                 
Labels:       <none>
Annotations:  <none>                                              
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool        
Metadata:                                                         
  Creation Timestamp:  2019-09-03T13:22:57Z
  Generation:          3                                 
  Resource Version:    33272                                      
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:                 f1c3fa01-ce4d-11e9-99b1-02f047b21df0
Spec:                                                             
  Configuration:                           
    Name:  rendered-worker-24ffb68b002cb7511301664cb41c0a68
    Source:                                                       
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig                                                                                                             
      Name:         00-worker                                     
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig          
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         60-examplecorp-chrony
      API Version:  machineconfiguration.openshift.io/v1                                                                                      
      Kind:         MachineConfig
      Name:         99-worker-f1c3fa01-ce4d-11e9-99b1-02f047b21df0-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Max Unavailable:                             <nil>
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:  
  Paused:                              false
Status:
  Conditions:
    Last Transition Time:  2019-09-03T13:23:31Z
    Message:               
    Reason:                
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-09-03T13:23:36Z
    Message:               
    Reason:                
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-09-03T13:23:36Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-09-03T14:35:45Z
    Message:               
    Reason:                All nodes are updated with rendered-worker-24ffb68b002cb7511301664cb41c0a68
    Status:                True
    Type:                  Updated
    Last Transition Time:  2019-09-03T14:35:45Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updating
  Configuration:
    Name:  rendered-worker-24ffb68b002cb7511301664cb41c0a68
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   60-examplecorp-chrony
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-f1c3fa01-ce4d-11e9-99b1-02f047b21df0-registries
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-ssh
  Degraded Machine Count:     0
  Machine Count:              3
  Observed Generation:        3
  Ready Machine Count:        3
  Unavailable Machine Count:  0
  Updated Machine Count:      3
Events:                       <none>

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-140-86.us-west-2.compute.internal    Ready    master   76m   v1.13.4+3bd346709
ip-10-0-143-236.us-west-2.compute.internal   Ready    worker   69m   v1.13.4+3bd346709
ip-10-0-155-84.us-west-2.compute.internal    Ready    worker   69m   v1.13.4+3bd346709
ip-10-0-157-48.us-west-2.compute.internal    Ready    master   76m   v1.13.4+3bd346709
ip-10-0-166-99.us-west-2.compute.internal    Ready    worker   69m   v1.13.4+3bd346709
ip-10-0-168-59.us-west-2.compute.internal    Ready    master   77m   v1.13.4+3bd346709
```

Comment 6 errata-xmlrpc 2019-09-10 15:59:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2594

