Bug 2005127 - BYOH Windows instance configured with DNS name got deconfigured immediately on UPI baremetal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.z
Assignee: Mohammad Saif Shaikh
QA Contact: gaoshang
URL:
Whiteboard:
Depends On: 2005126
Blocks:
Reported: 2021-09-16 20:28 UTC by OpenShift BugZilla Robot
Modified: 2021-09-21 11:11 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-21 11:11:45 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift windows-machine-config-operator pull 679 | 0 | None | Merged | [release-4.8] Bug 2005127: Validate instance addresses before deconfiguring BYOH nodes | 2021-09-17 17:23:00 UTC
Red Hat Product Errata | RHBA-2021:3215 | 0 | None | None | None | 2021-09-21 11:11:52 UTC

Description OpenShift BugZilla Robot 2021-09-16 20:28:22 UTC
+++ This bug was initially created as a clone of Bug #2001547 +++

Description of problem:
After a BYOH Windows instance is configured with a DNS name in the config map, it gets deconfigured immediately. This happens on a UPI cluster on baremetal. BYOH Windows instances configured with an IP address do not have this issue.

Following is the DNS name and config map used:

PS C:\Users\Administrator> nslookup 10.0.55.187
Server:  ip-10-0-0-2.us-east-2.compute.internal
Address:  10.0.0.2

Name:    ip-10-0-55-187.us-east-2.compute.internal
Address:  10.0.55.187

# cat configmap_byoh.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: windows-instances
  namespace: openshift-windows-machine-config-operator
data:
  ip-10-0-55-187.us-east-2.compute.internal: |-
    username=Administrator

# oc get node -l kubernetes.io/os=windows
NAME        STATUS                     ROLES    AGE     VERSION
sgao-win1   Ready,SchedulingDisabled   worker   8m15s   v1.22.1-1660+bbcc9aea9e4bef

# oc describe node sgao-win1
Name:               sgao-win1
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=windows
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=sgao-win1
                    kubernetes.io/os=windows
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/windows-build=10.0.17763
                    node.openshift.io/os_id=Windows
                    windowsmachineconfig.openshift.io/byoh=true
Annotations:        k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac: 00-15-5D-C9-F5-8A
                    k8s.ovn.org/hybrid-overlay-node-subnet: 10.132.10.0/24
                    volumes.kubernetes.io/controller-managed-attach-detach: true
                    windowsmachineconfig.openshift.io/pub-key-hash: 1df2c166b1c401180523270e9cf6bc2cd2724b9279ea65668a3b95298525a0f5
                    windowsmachineconfig.openshift.io/username:
                      -----BEGIN ENCRYPTED DATA-----<wmcoMarker><wmcoMarker>wx4EBwMIGyHM95CxsERgtbij7q4k3mYrEsaFVNoTO8jS5gF07WsxBH7z0Xp/aegs<wmcoMarker>VVx3CEY4...
                    windowsmachineconfig.openshift.io/version: 3.1.0+d5fd8c8
CreationTimestamp:  Mon, 06 Sep 2021 06:36:54 -0400
Taints:             node.kubernetes.io/unschedulable:NoSchedule
                    os=Windows:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  sgao-win1
  AcquireTime:     <unset>
  RenewTime:       Mon, 06 Sep 2021 06:44:08 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 06 Sep 2021 06:41:56 -0400   Mon, 06 Sep 2021 06:36:54 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 06 Sep 2021 06:41:56 -0400   Mon, 06 Sep 2021 06:36:54 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 06 Sep 2021 06:41:56 -0400   Mon, 06 Sep 2021 06:36:54 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 06 Sep 2021 06:41:56 -0400   Mon, 06 Sep 2021 06:37:04 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.0.55.187
  Hostname:    sgao-win1
Capacity:
  cpu:                2
  ephemeral-storage:  31455228Ki
  memory:             8125980Ki
  pods:               250
Allocatable:
  cpu:                1500m
  ephemeral-storage:  27915396253
  memory:             6975004Ki
  pods:               250
System Info:
  Machine ID:                 sgao-win1
  System UUID:                EC277FBA-77CD-9B78-69E1-578CDC479EFA
  Boot ID:                    
  Kernel Version:             10.0.17763.2061
  OS Image:                   Windows Server 2019 Datacenter
  Operating System:           windows
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.6
  Kubelet Version:            v1.22.1-1660+bbcc9aea9e4bef
  Kube-Proxy Version:         v1.22.1-1660+bbcc9aea9e4bef
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
Events:
  Type     Reason                    Age                     From     Message
  ----     ------                    ----                    ----     -------
  Normal   NodeNotSchedulable        164m (x2 over 5h18m)    kubelet  Node sgao-win1 status is now: NodeNotSchedulable
  Normal   Starting                  139m                    kubelet  Starting kubelet.
  Normal   NodeHasSufficientMemory   139m (x2 over 139m)     kubelet  Node sgao-win1 status is now: NodeHasSufficientPID
  Warning  CheckLimitsForResolvConf  139m                    kubelet  open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeReady                 139m                    kubelet  Node sgao-win1 status is now: NodeReady
  Normal   NodeNotSchedulable        139m                    kubelet  Node sgao-win1 status is now: NodeNotSchedulable
  Normal   Starting                  136m                    kubelet  Starting kubelet.
  Normal   NodeHasSufficientMemory   136m                    kubelet  Node sgao-win1 status is now: NodeHasSufficientMemory
  Normal   NodeSchedulable           135m                    kubelet  Node sgao-win1 status is now: NodeSchedulable
  Normal   NodeNotSchedulable        135m (x2 over 136m)     kubelet  Node sgao-win1 status is now: NodeNotSchedulable
  Normal   Starting                  134m                    kubelet  Starting kubelet.
  Warning  CheckLimitsForResolvConf  134m                    kubelet  open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   134m                    kubelet  Node sgao-win1 status is now: NodeHasSufficientMemory
  Normal   NodeNotSchedulable        134m                    kubelet  Node sgao-win1 status is now: NodeNotSchedulable
  Normal   Starting                  131m                    kubelet  Starting kubelet.
  Normal   NodeHasSufficientMemory   131m                    kubelet  Node sgao-win1 status is now: NodeHasSufficientMemory
  Normal   NodeHasSufficientPID      131m                    kubelet  Node sgao-win1 status is now: NodeHasSufficientPID
  Normal   NodeSchedulable           130m                    kubelet  Node sgao-win1 status is now: NodeSchedulable
  Normal   NodeNotSchedulable        130m (x2 over 131m)     kubelet  Node sgao-win1 status is now: NodeNotSchedulable
  Normal   Starting                  124m                    kubelet  Starting kubelet.
  Warning  CheckLimitsForResolvConf  124m                    kubelet  open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   124m (x2 over 124m)     kubelet  Node sgao-win1 status is now: NodeHasSufficientPID
  Normal   NodeReady                 123m                    kubelet  Node sgao-win1 status is now: NodeReady
  Normal   NodeNotSchedulable        123m                    kubelet  Node sgao-win1 status is now: NodeNotSchedulable
  Normal   Starting                  121m                    kubelet  Starting kubelet.

Version-Release number of selected component (if applicable):
OCP version: 4.9.0-0.nightly-2021-09-05-204238
WMCO master branch commit: d5fd8c8d9b7ed21f4dc5eac1f410e893c305e840

How reproducible:
Always

Steps to Reproduce:
1, Install OCP 4.9 UPI on baremetal
2, Build WMCO locally with latest commit in master branch and install it on cluster
3, Manually install a BYOH Windows instance and configure it with a DNS name in the config map

Actual results:
The BYOH Windows instance gets deconfigured immediately after it becomes Ready and is stuck in a "Configure" - "Deconfigure" cycle

Expected results:
The BYOH Windows instance should be configured as a node and stay in "Ready" status

Additional info:
It looks like this happens because the DNS name does not exist in node.Status.Addresses, which only lists the InternalIP and the Hostname; see https://github.com/openshift/windows-machine-config-operator/blob/d5fd8c8d9b7ed21f4dc5eac1f410e893c305e840/controllers/configmap_controller.go#L215

# oc describe node sgao-win1
...
Addresses:
  InternalIP:  10.0.55.187
  Hostname:    sgao-win1
Capacity:
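
The suspected cause can be illustrated with a minimal sketch. It is written in Python for brevity (WMCO itself is written in Go) and is not the code from the merged pull request. The function names resolve_to_ips and instance_matches_node, and the injectable lookup parameter, are hypothetical and exist only for this illustration. The idea is that a config map key which is a DNS name never appears literally in the node's address list, so a plain string comparison fails, whereas resolving the name to its IP addresses first lets it match the InternalIP. The sketch handles IPv4 lookups only.

```python
import ipaddress
import socket
from typing import Callable, List, Optional


def resolve_to_ips(address: str,
                   lookup: Optional[Callable[[str], List[str]]] = None) -> List[str]:
    """Return the IP addresses that a config map key stands for.

    An IP literal is returned as-is. Anything else is treated as a DNS name
    and passed to `lookup` (by default the system resolver, IPv4 only).
    """
    try:
        return [str(ipaddress.ip_address(address))]
    except ValueError:
        pass  # not an IP literal, so treat it as a DNS name
    if lookup is None:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist)
        return socket.gethostbyname_ex(address)[2]
    return lookup(address)


def instance_matches_node(address: str,
                          node_addresses: List[str],
                          lookup: Optional[Callable[[str], List[str]]] = None) -> bool:
    """True if the config map key refers to the node, directly or via DNS."""
    if address in node_addresses:
        return True
    return any(ip in node_addresses for ip in resolve_to_ips(address, lookup))


# Values taken from the report above. The DNS answer is stubbed so the
# example runs without network access.
node_addresses = ["10.0.55.187", "sgao-win1"]
dns_name = "ip-10-0-55-187.us-east-2.compute.internal"


def stub_lookup(name: str) -> List[str]:
    records = {dns_name: ["10.0.55.187"]}
    return records.get(name, [])


print(dns_name in node_addresses)                                    # False
print(instance_matches_node(dns_name, node_addresses, stub_lookup))  # True
```

The first print corresponds to the literal comparison that would treat the instance as removed from the config map. The second shows the same instance matching once its DNS name is resolved.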

--- Additional comment from mohashai on 2021-09-08 15:45:18 UTC ---

@sgao We are currently unable to reproduce this bug as we do not have access to a baremetal UPI setup (or any platform=none environment). Would it be possible for you to give me access to a QE cluster where this bug was seen?

--- Additional comment from sgao on 2021-09-09 11:58:19 UTC ---

@mohashai Sure, I'll prepare a cluster later today and DM it to you, thanks.

--- Additional comment from mohashai on 2021-09-13 20:27:05 UTC ---

Current status:
- QE has seen this across multiple platforms
- I have a fix up, PR is reviewable
  - successfully tested locally on 4.9 vSphere (and other platforms)

--- Additional comment from mohashai on 2021-09-15 16:28:04 UTC ---

The fix is approved and ready to merge, but a change/deprecation in how test images are referenced is holding up our CI.

Comment 2 gaoshang 2021-09-18 07:23:58 UTC
This bug has been verified on OCP 4.8.0-0.nightly-2021-09-17-214908, thanks.

Comment 4 errata-xmlrpc 2021-09-21 11:11:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Windows Container Support for Red Hat OpenShift 3.1.0 product release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3215

