Created attachment 1862414 [details]
must-gather

Description of problem:
The cloud provider configuration is changed during the Octavia LB tests in order to switch from the default amphora Octavia driver to the OVN Octavia driver, but a worker node remains unschedulable after one hour. It can happen in any OCP version, not only in 4.10.

Version-Release number of selected component (if applicable):
OCP 4.10.0-0.nightly-2022-02-17-234353
OSP RHOS-16.2-RHEL-8-20211129.n.1

How reproducible:
~10% of the time

Steps to Reproduce:
1. Install OCP (the network type can be OpenShiftSDN or OVNKubernetes)
2. Change the Octavia driver in the cloud provider according to [1]:

"
You can update your cloud provider configuration after you run the installer. On a command line, run:

$ oc edit configmap -n openshift-config cloud-provider-config
[...]
config: |
  [Global]
  secret-name = openstack-credentials
  secret-namespace = kube-system
  region = regionOne
  ca-file = /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem
  [LoadBalancer]
  use-octavia = True
  lb-provider = ovn
  lb-method = SOURCE_IP_PORT
  floating-network-id = "xx"
  subnet-id = "yy"
[...]

After you save your changes, your cluster will take some time to reconfigure itself. The process is complete if none of your nodes have a SchedulingDisabled status.
"

Actual results:
A worker node remains unschedulable after one hour (the reconfiguration usually completes in about 20 minutes).

Expected results:
No node should remain unschedulable after the cluster is reconfigured.

Additional info:
2022-02-19 07:43:29.386 | TASK [tests : Wait until there are no unschedulable nodes - this means the config has been applied] ***
2022-02-19 07:43:29.391 | Saturday 19 February 2022 07:43:27 +0000 (0:01:32.134) 0:18:42.428 *****
2022-02-19 07:43:29.394 | FAILED - RETRYING: Wait until there are no unschedulable nodes - this means the config has been applied (30 retries left).
[...]
2022-02-19 08:42:45.481 | FAILED - RETRYING: Wait until there are no unschedulable nodes - this means the config has been applied (1 retries left).
[...]
2022-02-19 08:44:48.159 | fatal: [undercloud-0]: FAILED! => {
2022-02-19 08:44:48.162 |     "attempts": 30,
2022-02-19 08:44:48.164 |     "changed": false,
2022-02-19 08:44:48.169 |     "resources": [
2022-02-19 08:44:48.172 |         {
2022-02-19 08:44:48.174 |             "apiVersion": "v1",
2022-02-19 08:44:48.177 |             "kind": "Node",
[...]
2022-02-19 08:44:48.803 |             "spec": {
2022-02-19 08:44:48.806 |                 "providerID": "openstack:///eeef43db-b2e4-4591-876d-4bdd53085db9",
2022-02-19 08:44:48.808 |                 "taints": [
2022-02-19 08:44:48.811 |                     {
2022-02-19 08:44:48.814 |                         "effect": "PreferNoSchedule",
2022-02-19 08:44:48.816 |                         "key": "UpdateInProgress"
2022-02-19 08:44:48.819 |                     },
2022-02-19 08:44:48.822 |                     {
2022-02-19 08:44:48.824 |                         "effect": "NoSchedule",
2022-02-19 08:44:48.827 |                         "key": "node.kubernetes.io/unschedulable",
2022-02-19 08:44:48.830 |                         "timeAdded": "2022-02-19T07:46:11Z"
2022-02-19 08:44:48.832 |                     }
2022-02-19 08:44:48.835 |                 ],
2022-02-19 08:44:48.838 |                 "unschedulable": true
2022-02-19 08:44:48.840 |             },
2022-02-19 08:44:48.843 |             "status": {
2022-02-19 08:44:48.845 |                 "addresses": [
2022-02-19 08:44:48.848 |                     {
2022-02-19 08:44:48.851 |                         "address": "10.196.1.114",
2022-02-19 08:44:48.853 |                         "type": "InternalIP"
2022-02-19 08:44:48.856 |                     },
2022-02-19 08:44:48.858 |                     {
2022-02-19 08:44:48.861 |                         "address": "ostest-f7774-worker-0-49k5n",
2022-02-19 08:44:48.863 |                         "type": "Hostname"
2022-02-19 08:44:48.866 |                     }
2022-02-19 08:44:48.868 |                 ],
[...]
[1] https://docs.openshift.com/container-platform/4.9/installing/installing_openstack/installing-openstack-installer-custom.html#installation-osp-setting-cloud-provider-options_installing-openstack-installer-custom
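For reference, this is roughly what the failing CI task does. The following is a minimal shell sketch of that wait, not part of the actual playbook; it assumes an oc client logged in with cluster-admin, and the 30 x 120s retry loop mirrors the 30 attempts (~1 hour) visible in the log above.

#!/usr/bin/env bash
# Hypothetical helper: poll until no node is unschedulable (i.e. none shows
# SchedulingDisabled), giving up after ~1 hour.
set -euo pipefail

for i in $(seq 1 30); do
  # .spec.unschedulable is unset on schedulable nodes, so counting "true"
  # counts only nodes that still carry the node.kubernetes.io/unschedulable taint.
  pending=$(oc get nodes -o jsonpath='{range .items[*]}{.spec.unschedulable}{"\n"}{end}' | grep -c true || true)
  if [ "$pending" -eq 0 ]; then
    echo "All nodes are schedulable; the cloud-provider change has been rolled out."
    exit 0
  fi
  echo "Attempt $i/30: $pending node(s) still unschedulable, retrying in 120s..."
  sleep 120
done
echo "Timed out: some nodes are still SchedulingDisabled." >&2
exit 1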
Seen in 4.8 as well; in that case it happened after adding the [LoadBalancer] section to the cloud-provider ConfigMap (in 4.8, enabling Octavia has to be done as a day-2 operation):

$ oc edit configmap -n openshift-config cloud-provider-config
[...]
config: |
  [...]
  [LoadBalancer]
  use-octavia = True
[...]

2022-03-01 04:53:38.896 | TASK [tests : Wait until there are no unschedulable nodes - this means the config has been applied] ***
2022-03-01 04:53:38.902 | Tuesday 01 March 2022 04:53:35 +0000 (0:00:24.794) 0:02:07.969 *********
[...]
2022-03-01 05:52:56.702 | FAILED - RETRYING: Wait until there are no unschedulable nodes - this means the config has been applied (1 retries left).
2022-03-01 05:54:59.423 | fatal: [undercloud-0]: FAILED! => {
2022-03-01 05:54:59.426 |     "attempts": 30,
2022-03-01 05:54:59.431 |     "changed": false,
2022-03-01 05:54:59.435 |     "resources": [
2022-03-01 05:54:59.438 |         {
2022-03-01 05:54:59.441 |             "apiVersion": "v1",
2022-03-01 05:54:59.444 |             "kind": "Node",
[...]
2022-03-01 05:55:00.078 |             "spec": {
2022-03-01 05:55:00.081 |                 "providerID": "openstack:///32125a88-8505-4e29-984f-514b229281d7",
2022-03-01 05:55:00.084 |                 "taints": [
2022-03-01 05:55:00.087 |                     {
2022-03-01 05:55:00.090 |                         "effect": "NoSchedule",
2022-03-01 05:55:00.093 |                         "key": "node.kubernetes.io/unschedulable",
2022-03-01 05:55:00.096 |                         "timeAdded": "2022-03-01T04:53:32Z"
2022-03-01 05:55:00.100 |                     }
2022-03-01 05:55:00.103 |                 ],
2022-03-01 05:55:00.106 |                 "unschedulable": true
2022-03-01 05:55:00.110 |             },
2022-03-01 05:55:00.113 |             "status": {
2022-03-01 05:55:00.117 |                 "addresses": [
2022-03-01 05:55:00.120 |                     {
2022-03-01 05:55:00.123 |                         "address": "10.196.2.151",
2022-03-01 05:55:00.127 |                         "type": "InternalIP"
2022-03-01 05:55:00.130 |                     },
2022-03-01 05:55:00.134 |                     {
2022-03-01 05:55:00.137 |                         "address": "ostest-dkswg-worker-0-x75m5",
2022-03-01 05:55:00.140 |                         "type": "Hostname"
2022-03-01 05:55:00.144 |                     }
2022-03-01 05:55:00.148 |                 ],
[...]
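For anyone reproducing this, a small sketch of how to spot the stuck node and its taints with plain oc commands; it just reads the same Node fields that the failed task dumps above (the commands below are illustrative, not taken from the CI job).

# Nodes showing SchedulingDisabled in the STATUS column are the stuck ones.
oc get nodes

# Print each node's name, unschedulable flag and taint keys, i.e. the fields
# visible in the JSON excerpt above.
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\t"}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}'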
I can see that after the reboot the problematic node has a K8s API connectivity issue. I can also see that the nodes seem to have an additional network attached. Could this be related to [1]?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2059330
Okay, it shouldn't be connected to the aforementioned bug, as that one should already be fixed in the version used here.
Moving the BZ to the ovn-kubernetes team, as it's related to the default route metrics changing when the cloud-provider configuration is modified.

Changing the cloud-provider config triggers the issue:

$ oc edit configmap -n openshift-config cloud-provider-config
[...]
[LoadBalancer]
use-octavia = True
lb-provider = ovn
[...]

One worker node stays unschedulable after the change:

NAME                          STATUS                     ROLES    AGE   VERSION           INTERNAL-IP
ostest-qgfdz-master-0         Ready                      master   82m   v1.23.3+bba255d   10.196.1.71
ostest-qgfdz-master-1         Ready                      master   82m   v1.23.3+bba255d   10.196.0.134
ostest-qgfdz-master-2         Ready                      master   82m   v1.23.3+bba255d   10.196.3.214
ostest-qgfdz-worker-0-fkk2h   Ready,SchedulingDisabled   worker   62m   v1.23.3+bba255d   10.196.3.93
ostest-qgfdz-worker-0-qtj89   Ready                      worker   62m   v1.23.3+bba255d   10.196.3.1
ostest-qgfdz-worker-0-tnbzg   Ready                      worker   62m   v1.23.3+bba255d   10.196.1.39

Openstack subnets:
+--------------------------------------+--------------------+--------------------------------------+---------------+
| ID                                   | Name               | Network                              | Subnet        |
+--------------------------------------+--------------------+--------------------------------------+---------------+
| 0c4a9fab-e91a-4250-8306-5389216dcf7c | ostest-qgfdz-nodes | e3c3fca6-d140-4098-85cc-ab2a8ee0e11a | 10.196.0.0/16 |
| 48f330e1-00bd-4e17-abee-f944248e2399 | StorageNFSSubnet   | 9b23f05e-0871-4206-a467-ec571cba27ef | 172.17.5.0/24 |
+--------------------------------------+--------------------+--------------------------------------+---------------+

Before the change:

ostest-qgfdz-worker-0-fkk2h  <<----
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100

ostest-qgfdz-worker-0-qtj89
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100

ostest-qgfdz-worker-0-tnbzg
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100

After the change:

ostest-qgfdz-worker-0-fkk2h  <<----
---------------------------
default via 172.17.5.1 dev br-ex proto dhcp metric 49   <<---- the storage network now has higher priority
default via 10.196.0.1 dev ens3 proto dhcp metric 101

ostest-qgfdz-worker-0-qtj89 (no changes)
----------------------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100

ostest-qgfdz-worker-0-tnbzg (no changes)
----------------------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100

$ oc describe node ostest-qgfdz-worker-0-fkk2h
Name:               ostest-qgfdz-worker-0-fkk2h
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=nova
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ostest-qgfdz-worker-0-fkk2h
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.cinder.csi.openstack.org/zone=nova
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=nova
Annotations:        csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"f4cb37d6-12b3-448e-8416-34225b4ad0a6","manila.csi.openstack.org":"ostest-qgfdz-worker-0-fkk2h"}
                    k8s.ovn.org/host-addresses: ["10.196.3.93","172.17.5.216"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_ostest-qgfdz-worker-0-fkk2h","mac-address":"fa:16:3e:aa:b9:44","ip-addresses":["172.17.5...
                    k8s.ovn.org/node-chassis-id: 54cc8a65-7ed7-4bdb-adb5-57f26f6d0e8a
                    k8s.ovn.org/node-mgmt-port-mac-address: 9a:b8:37:5e:ea:62
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.17.5.216/24"}
                    k8s.ovn.org/node-subnets: {"default":"10.129.2.0/23"}
                    machine.openshift.io/machine: openshift-machine-api/ostest-qgfdz-worker-0-fkk2h
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-ff3365f8923121ad1960df303d0d147c
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-cc201ca80e2b969ef05d2a4b291debb4
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 03 Mar 2022 11:31:11 +0000
Taints:             node.kubernetes.io/unschedulable:NoSchedule
                    UpdateInProgress:PreferNoSchedule
Unschedulable:      true
[...]

** Seen in 4.8 and 4.10 for now.
Hey there! Do you have sosreports and the full journal log (tar -czf /host/var/log/journal.tar.gz /host/var/log/journal) for one of the affected nodes? That would be super helpful. Thanks! - Andreas
Can you test and reproduce this on a 4.8 cluster? On the node with issues, can you provide a sosreport and the full journal log (`tar -czf /host/var/log/journal.tar.gz /host/var/log/journal`) for one of the affected nodes? I will need the journal, ovs-configure and NM logs to figure out why this happens in 4.8. But I imagine that anything that triggers a reconfiguration of br-ex would have the potential to trigger this.
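In case it helps whoever reproduces it, a sketch of one way to collect those artifacts via a debug pod instead of SSH (the node name is just an example from this report; the sosreport step follows the usual RHCOS toolbox flow):

# Open a debug pod on the affected worker; the host filesystem is mounted at /host.
oc debug node/ostest-qgfdz-worker-0-fkk2h

# From the debug shell, grab the full journal exactly as requested above:
tar -czf /host/var/log/journal.tar.gz /host/var/log/journal

# For the sosreport, switch to the host and use the support toolbox:
chroot /host
toolbox        # starts the support-tools container
sosreport      # the archive path is printed at the end of the run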
In 4.10, this will then definitely trigger https://bugzilla.redhat.com/show_bug.cgi?id=2057160 ; so in that case, it's a dup. https://github.com/openshift/machine-config-operator/commit/9e3e74d4d2c31e449352bb5b9efcdb25f1234c98 will trigger a replumbing of br-ex on every reboot, and that can lead to the issues described here.
(In reply to Andreas Karis from comment #10)
> Can you test and reproduce this on a 4.8 cluster and on the node with
> issues, can you provide a sosreport and the full journal log (`tar -czf
> /host/var/log/journal.tar.gz /host/var/log/journal`) for one of the affected
> nodes? I will need the journal ovs-configure and NM logs to figure out why
> this happens in 4.8. But I imagine that anything that triggers a
> reconfiguration of br-ex would have the potential to trigger this.

Hey Andreas,

I'll try manually reproducing it and getting the info. We're hitting it on our CI but cannot extract that info from there.

Thanks
Hi Jon! Did you have any luck reproducing this on 4.8? Thanks!
(In reply to Andreas Karis from comment #13)
> Hi Jon! Did you have any luck reproducing this on 4.8? Thanks!

Hey Andreas, not yet. If we see it's no longer reproducible in 4.8, we can close it, but let's wait a couple of weeks and see the results.

Thanks
Ok, I'll close this as a duplicate in the meantime to clean up my backlog. If you can reproduce this on 4.8, please simply reopen this BZ and I'll have a look at it.

*** This bug has been marked as a duplicate of bug 2057160 ***