Bug 1888372

Summary: [3.11] - when redeploying new CA the SDN pods are not restarted to take the latest CA
Product: OpenShift Container Platform
Reporter: Vladislav Walek <vwalek>
Component: Installer
Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible
QA Contact: Gaoyun Pei <gpei>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: medium
CC: bretm, scuppett, sreber
Version: 3.11.0
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-18 14:09:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vladislav Walek 2020-10-14 17:14:00 UTC
When the redeploy-openshift-ca playbook is executed, one of its tasks updates the kubeconfigs with the new CA.

That works as it should.

However, when the redeploy-certificates playbook finishes, both the SDN pod and the router pod report that the connection to the master API URL fails because the certificate is signed by an unknown authority.

I tried to reproduce this, but for me it works as expected.

However, I see one difference: in my test, the application processes inside the router and SDN pod containers were restarted (I monitored the restart counts in oc get pods).

On the customer's cluster, however, the SDN pods are not restarted; the workaround is to restart the pods manually.

The only difference I found is that the customer uses CRI-O while I was using Docker.

Currently checking this in a CRI-O lab.

There is not much to attach; there are really no logs to provide, as the root cause is known.
The error shown by the SDN pod is "x509 certificate signed by unknown issuer".

I checked on the customer side that the kubeconfigs used by the SDN pods were in fact updated.
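
That check can be sketched as below. This is a hedged sketch, not taken from the customer environment: the kubeconfig and CA paths (/etc/origin/node/node.kubeconfig, /etc/origin/master/ca.crt) are assumptions based on default 3.11 locations, and the kubeconfig is assumed to embed the CA as certificate-authority-data rather than referencing a file.

```shell
# Hedged sketch: compare the CA embedded in the node kubeconfig against the
# newly deployed master CA. Paths are assumptions for a default 3.11 install;
# adjust to your environment.
grep certificate-authority-data /etc/origin/node/node.kubeconfig \
  | awk '{print $2}' | base64 -d \
  | openssl x509 -noout -subject -enddate

# The subject and validity dates above should match the new CA on the master:
openssl x509 -in /etc/origin/master/ca.crt -noout -subject -enddate
```

If the two outputs differ, the kubeconfig was not updated with the new CA.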

Comment 1 Vladislav Walek 2020-10-14 18:37:55 UTC
There is a workaround which fixes the issue; I just got confirmation from the customer.

Between running the redeploy-openshift-ca playbook and the redeploy-certificates playbook, the SDN pods were deleted and new ones were started.
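
The manual workaround can be sketched as follows; the openshift-sdn namespace and the app=sdn / app=ovs label selectors are assumptions based on a default 3.11 SDN daemonset, not confirmed from the customer environment.

```shell
# Hedged sketch of the manual workaround: delete the SDN/OVS pods so their
# daemonsets recreate them and the replacement pods pick up the updated CA
# from the refreshed kubeconfigs. Namespace and label selectors are
# assumptions for a default 3.11 cluster.
oc delete pods -n openshift-sdn -l app=sdn
oc delete pods -n openshift-sdn -l app=ovs

# Watch the replacement pods come back up:
oc get pods -n openshift-sdn -o wide
```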

Comment 2 Scott Dodson 2020-10-14 19:34:47 UTC
Given there's a workaround I'd like to leave it up to the team to set relative priority, Severity is the more appropriate field to represent the impact to the customer.

Comment 3 Russell Teague 2020-10-22 14:39:22 UTC
What specific version of openshift-ansible is being used?
Were any certificates expired at the time the redeploy was being run?  This could result in node services not being restarted.

Even though playbooks complete successfully, logs can still be helpful in determining a root cause.  Please attach playbook logs from when pods were left in a failed state.

Comment 4 Russell Teague 2020-10-23 13:16:58 UTC
In testing I have run into some issues with cert redeploy when using CRI-O, which I'm still investigating.  Could you also attach a complete inventory?

Comment 5 Vladislav Walek 2020-10-24 00:09:26 UTC
Hello,

Yes, the customer is also using CRI-O.

>> What specific version of openshift-ansible is being used?

The latest version; we also tested it directly from the GitHub repo and it fails.

>> Were any certificates expired at the time the redeploy was being run?  This could result in node services not being restarted.

No, the redeployment was done while the certificates were still valid. No certificates had expired.

>> Even though playbooks complete successfully, logs can still be helpful in determining a root cause.  Please attach playbook logs from when pods were left in a failed state.

I am afraid I no longer have the logs. However, if we can identify the task responsible for restarting the SDN pods (restarting the process in the container; in my lab the test did not schedule new pods, I only saw restarts of the process), it would add more context to the logic.

Comment 14 Gaoyun Pei 2020-11-16 10:33:26 UTC
Verify this bug with openshift-ansible-3.11.318-1.git.0.bccee5b.el7.noarch.rpm.

Set up a 3.11 cluster using CRI-O as the container runtime, ran the redeploy openshift ca playbook, then the redeploy certificates playbook.
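
For reference, the run can be sketched as below; the playbook paths are assumptions based on the openshift-ansible 3.11 repository layout, and the inventory path is a placeholder.

```shell
# Hedged sketch of the verification run. Playbook paths are assumptions
# from the openshift-ansible 3.11 layout; replace the inventory path.
cd /usr/share/ansible/openshift-ansible
ansible-playbook -i /path/to/inventory \
  playbooks/openshift-master/redeploy-openshift-ca.yml
ansible-playbook -i /path/to/inventory \
  playbooks/redeploy-certificates.yml
```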

11-16 18:08:25  TASK [openshift_node : Delete OpenShift SDN/OVS pods prior to upgrade] *********
11-16 18:08:27  changed: [ec2-54-174-151-131.compute-1.amazonaws.com -> ec2-54-174-151-131.compute-1.amazonaws.com] => {"changed": true, "cmd": "oc get pods --config=/etc/origin/master/admin.kubeconfig --field-selector=spec.nodeName=ip-172-18-8-123.ec2.internal -o json -n openshift-sdn | oc delete --config=/etc/origin/master/admin.kubeconfig --force --grace-period=0 -f -\n", "delta": "0:00:01.655693", "end": "2020-11-16 05:08:27.346656", "rc": 0, "start": "2020-11-16 05:08:25.690963", "stderr": "warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.", "stderr_lines": ["warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely."], "stdout": "pod \"ovs-v4hv7\" force deleted\npod \"sdn-frnd6\" force deleted", "stdout_lines": ["pod \"ovs-v4hv7\" force deleted", "pod \"sdn-frnd6\" force deleted"]}
...

The SDN pods on all nodes were deleted.

After the redeploy-certificates playbook finished, I checked the new router and SDN pods; no cert issues in the logs.

Comment 16 errata-xmlrpc 2020-11-18 14:09:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.318 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5107