Bug 1888372
Summary: | [3.11] - when redeploying new CA the SDN pods are not restarted to take the latest CA | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Vladislav Walek <vwalek>
Component: | Installer | Assignee: | Russell Teague <rteague>
Installer sub component: | openshift-ansible | QA Contact: | Gaoyun Pei <gpei>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | medium | CC: | bretm, scuppett, sreber
Version: | 3.11.0 | |
Target Milestone: | --- | |
Target Release: | 3.11.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-11-18 14:09:55 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (Vladislav Walek, 2020-10-14 17:14:00 UTC)
There is a workaround which fixes the issue; I just got confirmation from the customer. Between rerunning redeploy-openshift-ca and redeploy-certificates, the SDN pods were deleted and new ones were started.

Given there's a workaround, I'd like to leave it up to the team to set the relative priority; Severity is the more appropriate field to represent the impact to the customer.

What specific version of openshift-ansible is being used? Were any certificates expired at the time the redeploy was being run? This could result in node services not being restarted. Even though the playbooks complete successfully, logs can still be helpful in determining a root cause; please attach playbook logs from when pods were left in a failed state. In testing I have run into some issues with cert redeploy when using CRI-O which I'm still investigating. Could you also attach a complete inventory?

Hello, yes, the customer is also using CRI-O.

>> What specific version of openshift-ansible is being used?
The latest version; we also tested it directly from the GitHub repo and it fails.

>> Were any certificates expired at the time the redeploy was being run? This could result in node services not being restarted.
No, the redeployment was done while the certificates were still valid. No certificates had expired.

>> Even though playbooks complete successfully, logs can still be helpful in determining a root cause. Please attach playbook logs from when pods were left in a failed state.
I am afraid I no longer have the logs. However, if we can find the task responsible for restarting the SDN pods (restarting the process in the container, because in my lab the test didn't schedule new pods, I only saw the process restart), it would add more context to the logic.

Verified this bug with openshift-ansible-3.11.318-1.git.0.bccee5b.el7.noarch.rpm. Set up a 3.11 cluster using CRI-O as the container runtime, ran the redeploy OpenShift CA playbook, then the redeploy certificates playbook.

11-16 18:08:25 TASK [openshift_node : Delete OpenShift SDN/OVS pods prior to upgrade] *********
11-16 18:08:27 changed: [ec2-54-174-151-131.compute-1.amazonaws.com -> ec2-54-174-151-131.compute-1.amazonaws.com] => {"changed": true, "cmd": "oc get pods --config=/etc/origin/master/admin.kubeconfig --field-selector=spec.nodeName=ip-172-18-8-123.ec2.internal -o json -n openshift-sdn | oc delete --config=/etc/origin/master/admin.kubeconfig --force --grace-period=0 -f -\n", "delta": "0:00:01.655693", "end": "2020-11-16 05:08:27.346656", "rc": 0, "start": "2020-11-16 05:08:25.690963", "stderr": "warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.", "stderr_lines": ["warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely."], "stdout": "pod \"ovs-v4hv7\" force deleted\npod \"sdn-frnd6\" force deleted", "stdout_lines": ["pod \"ovs-v4hv7\" force deleted", "pod \"sdn-frnd6\" force deleted"]}
...

SDN pods on all the nodes were deleted. After the redeploy-certificates playbook finished, checked the new router and SDN pods; no cert issues in the logs.
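For reference, the flow exercised above (and the workaround reported at the top of this bug) corresponds roughly to the commands below. This is only a sketch: the inventory path and node name are placeholders, not values from this bug, and the playbook paths are the ones shipped with openshift-ansible 3.11; adjust them to your environment.

```
# Placeholders (assumptions for illustration): point these at your own inventory and node.
INVENTORY=/path/to/hosts
NODE=node1.example.com

# 1. Redeploy the OpenShift CA (playbook path as shipped with openshift-ansible 3.11).
ansible-playbook -i "$INVENTORY" playbooks/openshift-master/redeploy-openshift-ca.yml

# 2. Workaround discussed above: force-delete the SDN/OVS pods on each node so the
#    daemonsets recreate them against the new CA. This mirrors the
#    "Delete OpenShift SDN/OVS pods prior to upgrade" task in the verification log.
oc get pods --config=/etc/origin/master/admin.kubeconfig -n openshift-sdn \
    --field-selector=spec.nodeName="$NODE" -o json \
  | oc delete --config=/etc/origin/master/admin.kubeconfig --force --grace-period=0 -f -

# 3. Redeploy the remaining certificates.
ansible-playbook -i "$INVENTORY" playbooks/redeploy-certificates.yml

# 4. Confirm the SDN/OVS pods were recreated and show no certificate errors.
oc -n openshift-sdn get pods -o wide
```

The manual pod deletion in step 2 is the same command the fixed playbook task runs per node, so with openshift-ansible-3.11.318 or later it should no longer be needed by hand.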
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.318 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5107