Bug 1776506

Summary: [wsu] Rerunning the Ansible playbook fails when no Pending CSR exists
Product: OpenShift Container Platform
Component: Windows Containers
Version: 4.3.0
Target Release: 4.4.0
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: gaoshang <sgao>
Assignee: Sebastian Soto <ssoto>
QA Contact: gaoshang <sgao>
CC: aos-bugs, gmarkley, rgudimet
Last Closed: 2020-05-04 11:17:08 UTC

Description gaoshang 2019-11-25 20:18:43 UTC
Description of problem:
Rerunning the wsu Ansible playbook fails in TASK [Check for bootstrap CSR] when no Pending CSR exists. There are 2 situations:
1. All CSRs are already in Approved status, so the Ansible shell command "oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'" always returns "".

# oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-9csdn   50s     system:node:winworker-ay3n2                                                 Approved,Issued
csr-hhkpj   3m31s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

# oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'
#

# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml

...
TASK [Check for bootstrap CSR] ***********
FAILED - RETRYING: Check for bootstrap CSR (2 retries left).
FAILED - RETRYING: Check for bootstrap CSR (1 retries left).
fatal: [40.69.171.210 -> localhost]: FAILED! => {"attempts": 2, "changed": true, "cmd": "oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'", "delta": "0:00:01.098141", "end": "2019-11-26 02:51:59.182270", "rc": 0, "start": "2019-11-26 02:51:58.084129", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP ***********
40.69.171.210              : ok=2    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

2. The cluster cleans up approved CSRs after a while (maybe hours), after which the Ansible shell command also always returns "".

# oc get csr
No resources found.

# oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}' 
No resources found.
#

# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml

...
TASK [Check for bootstrap CSR] ***********
FAILED - RETRYING: Check for bootstrap CSR (2 retries left).
FAILED - RETRYING: Check for bootstrap CSR (1 retries left).
fatal: [40.69.171.210 -> localhost]: FAILED! => {"attempts": 2, "changed": true, "cmd": "oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'", "delta": "0:00:06.170185", "end": "2019-11-26 02:15:27.416078", "rc": 0, "start": "2019-11-26 02:15:21.245893", "stderr": "No resources found.", "stderr_lines": ["No resources found."], "stdout": "", "stdout_lines": []}

PLAY RECAP ***********
40.69.171.210              : ok=6    changed=4    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
localhost                  : ok=7    changed=6    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
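
Both situations can be reproduced without a cluster by feeding the captured listing to the same awk filter. The following is a minimal sketch: the function name pending_bootstrap_csrs and the csr-example line are hypothetical and added only for illustration, while the Approved listing is copied from situation 1 above. It shows that the filter exits with rc 0 and empty stdout, which matches the "rc": 0, "stdout": "" seen in the failed task output.

```shell
#!/bin/sh
# Same awk filter as the failing task; reads `oc get csr` output on stdin.
# (The function name is hypothetical, used only for this sketch.)
pending_bootstrap_csrs() {
    awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'
}

# Listing copied from situation 1: every CSR is already Approved,Issued.
approved_listing='NAME        AGE     REQUESTOR                                                                   CONDITION
csr-9csdn   50s     system:node:winworker-ay3n2                                                 Approved,Issued
csr-hhkpj   3m31s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued'

# Hypothetical listing with one Pending bootstrap CSR, for contrast.
pending_listing='NAME          AGE   REQUESTOR                                                                   CONDITION
csr-example   10s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending'

# awk matches nothing but still exits 0, so stdout is empty and rc is 0.
result=$(printf '%s\n' "$approved_listing" | pending_bootstrap_csrs)
rc=$?
echo "approved only: rc=$rc stdout='$result'"

# With a Pending bootstrap CSR present, the filter prints its name.
printf '%s\n' "$pending_listing" | pending_bootstrap_csrs
```

Because the command itself succeeds in both cases, the failure presumably comes from the task treating empty stdout as "not ready yet" and exhausting its retries.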


Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-24-183610   True        False         5m35s   Cluster version is 4.3.0-0.nightly-2019-11-24-183610
windows-machine-config-operator commit:
# git show
commit 1eb1f983774101b5077828fd2efb4dfb711d5886

How reproducible:
Always

Steps to Reproduce:
1. Install an OCP 4.3 cluster with ovn-kubernetes
2. Edit the ovn-kubernetes network configuration as follows
# oc edit Network.operator.openshift.io cluster
# oc get Network.operator.openshift.io cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2019-11-25T13:02:46Z"
  generation: 2
  name: cluster
  resourceVersion: "21021"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: c0315a6b-41fa-446d-971f-70c846607467
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16
status: {}
3. Create a Windows instance with wni
# ./wni azure create --kubeconfig ~/window_container/azure/cluster/kubeconfig --credentials ~/.azure/osServicePrincipal.json --image-id MicrosoftWindowsServer:WindowsServer:2019-Datacenter-with-Containers:latest --instance-type Standard_D2s_v3
4. Run the wsu Ansible playbook for the first time
# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml
...
PLAY RECAP ******
40.69.171.210              : ok=11   changed=8    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
localhost                  : ok=7    changed=6    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
5. Rerun the wsu Ansible playbook

Actual results:
Rerunning the Ansible playbook fails when no Pending CSR exists

Expected results:
Rerunning the Ansible playbook should not block the following tasks even when no Pending CSR exists
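
One way to get the expected behaviour is to treat "no Pending bootstrap CSR" as a successful no-op rather than a failure. The sketch below is only an illustration under stated assumptions, not the fix that was merged: the function names and the APPROVE_CMD variable are hypothetical, and APPROVE_CMD exists only so the logic can be dry-run without a cluster. On a real cluster the input would come from "oc get csr 2>/dev/null" and the default approval command would be "oc adm certificate approve".

```shell
#!/bin/sh
# Hypothetical helper: prints the names of Pending bootstrap CSRs found in
# `oc get csr` output supplied on stdin (same awk filter as the task).
pending_bootstrap_csrs() {
    awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'
}

# Hypothetical wrapper: approve Pending bootstrap CSRs if any exist,
# otherwise report that there is nothing to do and return success.
approve_pending_if_any() {
    names=$(pending_bootstrap_csrs)
    if [ -z "$names" ]; then
        echo "no Pending bootstrap CSR, nothing to approve"
        return 0
    fi
    for name in $names; do
        # APPROVE_CMD is an assumption of this sketch; it is left unquoted
        # on purpose so that a multi-word command is split into words.
        ${APPROVE_CMD:-oc adm certificate approve} "$name" || return 1
    done
}

# Dry run: print what would be approved instead of calling oc.
APPROVE_CMD="echo would approve"

# Situation 2: `oc get csr` prints "No resources found." to stderr, so stdin is empty.
printf '' | approve_pending_if_any

# Real usage against a cluster (not executed in this sketch):
#   unset APPROVE_CMD; oc get csr 2>/dev/null | approve_pending_if_any
```

This only covers the rerun case. On a first run the playbook still has to wait until the bootstrap CSR appears, so a real task would need some way to tell "node already joined" apart from "CSR not created yet".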

Additional info:

Comment 3 gaoshang 2020-01-13 08:59:03 UTC
This bug has been verified and passed on OCP 4.4.0-0.nightly-2020-01-12-032939, thanks.

Version:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-01-12-032939   True        False         77m     Cluster version is 4.4.0-0.nightly-2020-01-12-032939
WMCO repo:
# git show
commit 389ae941d6113d8a741719a5fb559e5deca0a506

Steps:
1. Run WSU against a Windows node attached to a 4.4 cluster
# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml -v
2. Check that the Ansible run succeeds and the Windows node is added to the cluster
# oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
...
ip-10-0-31-23.us-east-2.compute.internal     Ready    worker   129m    v1.16.2
3. Run WSU again; check that the Ansible run succeeds and the Windows node is not affected
4. Delete the Windows node and run Ansible again; check that it can be added back
# oc delete node ip-10-0-31-23.us-east-2.compute.internal
node "ip-10-0-31-23.us-east-2.compute.internal" deleted
# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml -v
...
# oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
...
ip-10-0-31-23.us-east-2.compute.internal     Ready    worker   5m42s   v1.16.2
5. Check the Windows workload
# oc create -f WinWebServer116.yaml
deployment.apps/win-webserver created
...
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP           NODE                                       NOMINATED NODE   READINESS GATES
win-webserver-79b64df8b9-p2txh   1/1     Running   0          2m8s   10.132.2.2   ip-10-0-31-23.us-east-2.compute.internal   <none>           <none>

Comment 7 errata-xmlrpc 2020-05-04 11:17:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581