Bug 1888372 - [3.11] - when redeploying new CA the SDN pods are not restarted to take the latest CA
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Russell Teague
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-14 17:14 UTC by Vladislav Walek
Modified: 2020-11-18 14:10 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-18 14:09:55 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift openshift-ansible pull 12261 0 None closed Bug 1888372: openshift_node: Drain nodes and restart during join.yml 2020-12-08 08:04:10 UTC
Red Hat Product Errata RHBA-2020:5107 0 None None None 2020-11-18 14:10:44 UTC

Description Vladislav Walek 2020-10-14 17:14:00 UTC
When the redeploy-openshift-ca playbook is executed, one of its tasks updates the kubeconfigs with the new CA.

That part works as it should.

However, after the redeploy-certificates playbook finishes, the SDN pod and the router pod both report that the connection to the master API URL fails because the server certificate is signed by an unknown certificate authority.

I tried to reproduce this, but for me it works as expected.

However, I see one difference: in my test, the application process inside the router and SDN containers was restarted (I monitored the restart count in oc get pods).

On the customer's cluster, however, the SDN pods are not restarted, and the workaround is to restart the pods manually.

The only other difference I saw is that the customer uses CRI-O while I was using Docker.

I am currently checking this in a CRI-O lab.

There is not much to provide in the way of logs, as the root cause is known.
The error shown by the SDN pod is "x509 certificate signed by unknown issuer".

I checked on the customer side that the kubeconfigs used by the SDN pods were in fact updated.

Comment 1 Vladislav Walek 2020-10-14 18:37:55 UTC
There is a workaround that fixes the issue; I just got confirmation from the customer.

Between running redeploy-openshift-ca and redeploy-certificates, the SDN pods were deleted and new ones were started.
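For reference, the workaround boils down to deleting the SDN (and OVS) pods so their DaemonSets recreate them and the replacement pods pick up the updated CA. A minimal sketch, assuming a default 3.11 install where these pods run in the openshift-sdn namespace with app=sdn / app=ovs labels (verify the labels on your cluster first):

```shell
# Delete the SDN and OVS pods; their DaemonSets recreate them,
# and the new pods read the regenerated kubeconfigs/CA from the host.
oc delete pods -n openshift-sdn -l app=sdn
oc delete pods -n openshift-sdn -l app=ovs

# Watch until the replacement pods are Running again.
oc get pods -n openshift-sdn -w
```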

Comment 2 Scott Dodson 2020-10-14 19:34:47 UTC
Given there's a workaround, I'd like to leave it up to the team to set relative priority; Severity is the more appropriate field to represent the impact to the customer.

Comment 3 Russell Teague 2020-10-22 14:39:22 UTC
What specific version of openshift-ansible is being used?
Were any certificates expired at the time the redeploy was being run?  This could result in node services not being restarted.

Even though playbooks complete successfully, logs can still be helpful in determining a root cause.  Please attach playbook logs from when pods were left in a failed state.

Comment 4 Russell Teague 2020-10-23 13:16:58 UTC
In testing I have run into some issues with cert redeploy when using crio which I'm still investigating.  Could you also attach a complete inventory?

Comment 5 Vladislav Walek 2020-10-24 00:09:26 UTC
Hello,

yeah, the customer is also using CRI-O.

>> What specific version of openshift-ansible is being used?

The latest version; we also tested it directly from the GitHub repo and it fails.

>> Were any certificates expired at the time the redeploy was being run?  This could result in node services not being restarted.

No, the redeployment was done while the certificates were still valid. No certificates had expired.

>> Even though playbooks complete successfully, logs can still be helpful in determining a root cause.  Please attach playbook logs from when pods were left in a failed state.

I am afraid I no longer have the logs. However, if we can find the task responsible for restarting the SDN pods (restarting the process in the container; on my lab the test didn't schedule new pods, I just saw restarts of the process), it would add more context to the logic.

Comment 14 Gaoyun Pei 2020-11-16 10:33:26 UTC
Verify this bug with openshift-ansible-3.11.318-1.git.0.bccee5b.el7.noarch.rpm.

Set up a 3.11 cluster using cri-o as the container runtime, ran the redeploy openshift ca playbook, then the redeploy certificates playbook.

11-16 18:08:25  TASK [openshift_node : Delete OpenShift SDN/OVS pods prior to upgrade] *********
11-16 18:08:27  changed: [ec2-54-174-151-131.compute-1.amazonaws.com -> ec2-54-174-151-131.compute-1.amazonaws.com] => {"changed": true, "cmd": "oc get pods --config=/etc/origin/master/admin.kubeconfig --field-selector=spec.nodeName=ip-172-18-8-123.ec2.internal -o json -n openshift-sdn | oc delete --config=/etc/origin/master/admin.kubeconfig --force --grace-period=0 -f -\n", "delta": "0:00:01.655693", "end": "2020-11-16 05:08:27.346656", "rc": 0, "start": "2020-11-16 05:08:25.690963", "stderr": "warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.", "stderr_lines": ["warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely."], "stdout": "pod \"ovs-v4hv7\" force deleted\npod \"sdn-frnd6\" force deleted", "stdout_lines": ["pod \"ovs-v4hv7\" force deleted", "pod \"sdn-frnd6\" force deleted"]}
...

The SDN pods on all nodes were deleted.

After the redeploy-certificates playbook finished, checked the new router and SDN pods; no cert issues in the logs.
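For readability, the force-delete task in the log above is equivalent to running the following once per node (the node name here is taken from the log excerpt):

```shell
# Select every pod scheduled on the given node in the openshift-sdn
# namespace and force-delete it so its DaemonSet recreates it.
NODE=ip-172-18-8-123.ec2.internal
oc get pods --config=/etc/origin/master/admin.kubeconfig \
    --field-selector=spec.nodeName="$NODE" -o json -n openshift-sdn \
  | oc delete --config=/etc/origin/master/admin.kubeconfig \
      --force --grace-period=0 -f -
```

As the warning in the log notes, --force --grace-period=0 does not wait for confirmation that the old pods have actually terminated.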

Comment 16 errata-xmlrpc 2020-11-18 14:09:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.318 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5107

