Bug 1776506 - [wsu] rerun ansible playbook fail when no Pending csr exist
Summary: [wsu] rerun ansible playbook fail when no Pending csr exist
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sebastian Soto
QA Contact: gaoshang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-11-25 20:18 UTC by gaoshang
Modified: 2020-05-04 11:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:17:08 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift windows-machine-config-bootstrapper pull 124 | 0 | None | closed | Bug 1776506: Allow WSU to be run multiple times | 2020-09-11 17:46:51 UTC
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:17:36 UTC

Description gaoshang 2019-11-25 20:18:43 UTC
Description of problem:
Rerunning the WSU Ansible playbook fails in TASK [Check for bootstrap CSR] when no Pending CSR exists. There are 2 situations:
1, All CSRs are already in Approved status, so the Ansible shell command "oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'" always returns "".

# oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-9csdn   50s     system:node:winworker-ay3n2                                                 Approved,Issued
csr-hhkpj   3m31s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

# oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'
#

# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml

...
TASK [Check for bootstrap CSR] ***********
FAILED - RETRYING: Check for bootstrap CSR (2 retries left).
FAILED - RETRYING: Check for bootstrap CSR (1 retries left).
fatal: [40.69.171.210 -> localhost]: FAILED! => {"attempts": 2, "changed": true, "cmd": "oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'", "delta": "0:00:01.098141", "end": "2019-11-26 02:51:59.182270", "rc": 0, "start": "2019-11-26 02:51:58.084129", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP ***********
40.69.171.210              : ok=2    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
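
For illustration, the empty result in situation 1 can be reproduced without a cluster by feeding the listing above to the same awk filter. This is a minimal sketch: the variable names csr_listing and pending are made up for the example, and the remark about the retry condition is an inference from the log (rc is 0, stdout is empty), not taken from the playbook source.

```shell
# Sample `oc get csr` output copied from this report (no cluster needed).
csr_listing='NAME        AGE     REQUESTOR                                                                   CONDITION
csr-9csdn   50s     system:node:winworker-ay3n2                                                 Approved,Issued
csr-hhkpj   3m31s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued'

# Same awk filter as the failing task: the line must name the
# node-bootstrapper requestor AND contain the word Pending.
pending=$(printf '%s\n' "$csr_listing" | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}')

# The pipeline exits 0 but prints nothing, so a task that (presumably)
# retries until stdout is non-empty can never succeed on a rerun.
if [ -z "$pending" ]; then
  echo "no Pending bootstrap CSR"
fi
```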

2, The cluster cleans up approved CSRs after a while (maybe hours), after which the Ansible shell command also always returns "".

# oc get csr
No resources found.

# oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}' 
No resources found.
#

# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml

...
TASK [Check for bootstrap CSR] ***********
FAILED - RETRYING: Check for bootstrap CSR (2 retries left).
FAILED - RETRYING: Check for bootstrap CSR (1 retries left).
fatal: [40.69.171.210 -> localhost]: FAILED! => {"attempts": 2, "changed": true, "cmd": "oc get csr | awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'", "delta": "0:00:06.170185", "end": "2019-11-26 02:15:27.416078", "rc": 0, "start": "2019-11-26 02:15:21.245893", "stderr": "No resources found.", "stderr_lines": ["No resources found."], "stdout": "", "stdout_lines": []}

PLAY RECAP ***********
40.69.171.210              : ok=6    changed=4    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
localhost                  : ok=7    changed=6    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0


Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-24-183610   True        False         5m35s   Cluster version is 4.3.0-0.nightly-2019-11-24-183610
windows-machine-config-operator commit:
# git show
commit 1eb1f983774101b5077828fd2efb4dfb711d5886

How reproducible:
Always

Steps to Reproduce:
1. Install an OCP 4.3 cluster with ovn-kubernetes
2. Edit the ovn-kubernetes configuration as follows
# oc edit Network.operator.openshift.io cluster
# oc get Network.operator.openshift.io cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2019-11-25T13:02:46Z"
  generation: 2
  name: cluster
  resourceVersion: "21021"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: c0315a6b-41fa-446d-971f-70c846607467
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16
status: {}
3. Create a Windows instance with wni
# ./wni azure create --kubeconfig ~/window_container/azure/cluster/kubeconfig --credentials ~/.azure/osServicePrincipal.json --image-id MicrosoftWindowsServer:WindowsServer:2019-Datacenter-with-Containers:latest --instance-type Standard_D2s_v3
4. Run the WSU Ansible playbook for the first time
# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml
...
PLAY RECAP ******
40.69.171.210              : ok=11   changed=8    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
localhost                  : ok=7    changed=6    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
5. Rerun the WSU Ansible playbook

Actual results:
Rerunning the Ansible playbook fails when no Pending CSR exists

Expected results:
Rerunning the Ansible playbook should not block the following tasks even when no Pending CSR exists
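
One way to picture the expected behaviour is a check that exits 0 whether or not a Pending bootstrap CSR is found, and only reports which case applied. The sketch below is an assumption for illustration, not the change made in the linked pull request; the helper names pending_bootstrap_csrs and check_bootstrap_csr are hypothetical.

```shell
# Hypothetical helper: read an `oc get csr` listing on stdin and print the
# names of Pending CSRs requested by the node-bootstrapper service account.
pending_bootstrap_csrs() {
  awk '/system:serviceaccount:openshift-machine-config-operator:node-bootstrapper/ && /Pending/ {print $1}'
}

# Hypothetical helper: always return 0. Pending CSR names go to stdout;
# when there are none, a note goes to stderr and stdout stays empty.
check_bootstrap_csr() {
  names=$(pending_bootstrap_csrs)
  if [ -n "$names" ]; then
    printf '%s\n' "$names"
  else
    echo "skip: no Pending bootstrap CSR" >&2
  fi
  return 0
}

# Example use against a live cluster (not run here):
#   oc get csr 2>/dev/null | check_bootstrap_csr
```

With this shape, an empty listing or an all-Approved listing is treated as "nothing to do" rather than as a failure, which is the behaviour the expected results ask for.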

Additional info:

Comment 3 gaoshang 2020-01-13 08:59:03 UTC
This bug has been verified and passed on OCP 4.4.0-0.nightly-2020-01-12-032939, thanks.

Version:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-01-12-032939   True        False         77m     Cluster version is 4.4.0-0.nightly-2020-01-12-032939
WMCO repo:
# git show
commit 389ae941d6113d8a741719a5fb559e5deca0a506

Steps:
1, Run WSU against a Windows node attached to a 4.4 cluster
# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml -v
2, Check that the Ansible run succeeds and the Windows node is added to the cluster
# oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
...
ip-10-0-31-23.us-east-2.compute.internal     Ready    worker   129m    v1.16.2
3, Run WSU again, check that the Ansible run succeeds and the Windows node is not affected
4, Delete the Windows node and run Ansible again, check that the node can be added back
# oc delete node ip-10-0-31-23.us-east-2.compute.internal
node "ip-10-0-31-23.us-east-2.compute.internal" deleted
# ansible-playbook -i hosts ~/go/src/windows-machine-config-operator/tools/ansible/tasks/wsu/main.yaml -v
...
# oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
...
ip-10-0-31-23.us-east-2.compute.internal     Ready    worker   5m42s   v1.16.2
5, Check the Windows workload
# oc create -f WinWebServer116.yaml
deployment.apps/win-webserver created
...
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP           NODE                                       NOMINATED NODE   READINESS GATES
win-webserver-79b64df8b9-p2txh   1/1     Running   0          2m8s   10.132.2.2   ip-10-0-31-23.us-east-2.compute.internal   <none>           <none>
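
The node checks in steps 2 and 4 could be scripted rather than read by eye. The sketch below is only an illustration: node_is_ready is a hypothetical helper, and it assumes the default `oc get node` column layout shown above (NAME first, STATUS second).

```shell
# Hypothetical helper: read an `oc get node` listing on stdin and exit 0
# only if the node named in $1 is listed with STATUS exactly "Ready".
# A combined status such as "Ready,SchedulingDisabled" does not match.
node_is_ready() {
  awk -v node="$1" '$1 == node && $2 == "Ready" {found = 1} END {exit (found ? 0 : 1)}'
}

# Example use against a live cluster (not run here):
#   oc get node | node_is_ready ip-10-0-31-23.us-east-2.compute.internal
```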

Comment 7 errata-xmlrpc 2020-05-04 11:17:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

