Bug 2056605
| Summary: | [osp][octavia lb] unschedulable worker node after cloud provider config change | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jon Uriarte <juriarte> |
| Component: | Networking | Assignee: | Andreas Karis <akaris> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | aos-bugs, m.andre, mdulko, mfedosin, pprinett, rravaiol |
| Version: | 4.10 | Keywords: | TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-04-04 15:53:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Jon Uriarte 2022-02-21 15:08:37 UTC
Seen in 4.8 as well, in this case after adding the LoadBalancer section to the cloud-provider ConfigMap (since enabling Octavia has to be done as a day-2 operation):
$ oc edit configmap -n openshift-config cloud-provider-config
[...]
config: |
[...]
[LoadBalancer]
use-octavia = True
[...]
2022-03-01 04:53:38.896 | TASK [tests : Wait until there are no unschedulable nodes - this means the config has been applied] ***
2022-03-01 04:53:38.902 | Tuesday 01 March 2022 04:53:35 +0000 (0:00:24.794) 0:02:07.969 *********
[...]
2022-03-01 05:52:56.702 | FAILED - RETRYING: Wait until there are no unschedulable nodes - this means the config has been applied (1 retries left).
2022-03-01 05:54:59.423 | fatal: [undercloud-0]: FAILED! => {
2022-03-01 05:54:59.426 | "attempts": 30,
2022-03-01 05:54:59.431 | "changed": false,
2022-03-01 05:54:59.435 | "resources": [
2022-03-01 05:54:59.438 | {
2022-03-01 05:54:59.441 | "apiVersion": "v1",
2022-03-01 05:54:59.444 | "kind": "Node",
[...]
2022-03-01 05:55:00.078 | "spec": {
2022-03-01 05:55:00.081 | "providerID": "openstack:///32125a88-8505-4e29-984f-514b229281d7",
2022-03-01 05:55:00.084 | "taints": [
2022-03-01 05:55:00.087 | {
2022-03-01 05:55:00.090 | "effect": "NoSchedule",
2022-03-01 05:55:00.093 | "key": "node.kubernetes.io/unschedulable",
2022-03-01 05:55:00.096 | "timeAdded": "2022-03-01T04:53:32Z"
2022-03-01 05:55:00.100 | }
2022-03-01 05:55:00.103 | ],
2022-03-01 05:55:00.106 | "unschedulable": true
2022-03-01 05:55:00.110 | },
2022-03-01 05:55:00.113 | "status": {
2022-03-01 05:55:00.117 | "addresses": [
2022-03-01 05:55:00.120 | {
2022-03-01 05:55:00.123 | "address": "10.196.2.151",
2022-03-01 05:55:00.127 | "type": "InternalIP"
2022-03-01 05:55:00.130 | },
2022-03-01 05:55:00.134 | {
2022-03-01 05:55:00.137 | "address": "ostest-dkswg-worker-0-x75m5",
2022-03-01 05:55:00.140 | "type": "Hostname"
2022-03-01 05:55:00.144 | }
2022-03-01 05:55:00.148 | ],
[...]
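The CI task above simply polls until no node carries spec.unschedulable. A rough manual equivalent (a sketch, not taken from the CI playbook) would be:

$ oc get nodes --field-selector spec.unschedulable=true
# the wait succeeds once this returns no nodes; here ostest-dkswg-worker-0-x75m5 never leaves the list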
I kinda see that after the reboot the problematic node has a K8s API connectivity issue. The nodes also seem to have an additional network attached. Could this be related to [1]?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2059330

Okay, it shouldn't be connected with the aforementioned bug, as that one should already be fixed in the version used here. Moving the BZ to the ovn-kubernetes team, since it's related to the default route metrics changing when the cloud-provider configuration is modified.
Changing the cloud-provider config triggers the issue:
$ oc edit configmap -n openshift-config cloud-provider-config
[...]
[LoadBalancer]
use-octavia = True
lb-provider = ovn
[...]
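The edit to cloud-provider-config is rolled out by the Machine Config Operator, which cordons and reboots the worker nodes one at a time, so an unschedulable node is expected to be transient. A quick way to follow the rollout (a sketch; these commands are not from the original report):

$ oc get machineconfigpool worker   # UPDATING=True while the new rendered config is being applied
$ oc get nodes -w                   # each node briefly goes Ready,SchedulingDisabled and should return to Ready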
One worker node remains unschedulable after the change:
NAME STATUS ROLES AGE VERSION INTERNAL-IP
ostest-qgfdz-master-0 Ready master 82m v1.23.3+bba255d 10.196.1.71
ostest-qgfdz-master-1 Ready master 82m v1.23.3+bba255d 10.196.0.134
ostest-qgfdz-master-2 Ready master 82m v1.23.3+bba255d 10.196.3.214
ostest-qgfdz-worker-0-fkk2h Ready,SchedulingDisabled worker 62m v1.23.3+bba255d 10.196.3.93
ostest-qgfdz-worker-0-qtj89 Ready worker 62m v1.23.3+bba255d 10.196.3.1
ostest-qgfdz-worker-0-tnbzg Ready worker 62m v1.23.3+bba255d 10.196.1.39
OpenStack subnets:
+--------------------------------------+--------------------+--------------------------------------+---------------+
| ID | Name | Network | Subnet |
+--------------------------------------+--------------------+--------------------------------------+---------------+
| 0c4a9fab-e91a-4250-8306-5389216dcf7c | ostest-qgfdz-nodes | e3c3fca6-d140-4098-85cc-ab2a8ee0e11a | 10.196.0.0/16 |
| 48f330e1-00bd-4e17-abee-f944248e2399 | StorageNFSSubnet | 9b23f05e-0871-4206-a467-ec571cba27ef | 172.17.5.0/24 |
+--------------------------------------+--------------------+--------------------------------------+---------------+
Before the change:
ostest-qgfdz-worker-0-fkk2h <<----
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100
ostest-qgfdz-worker-0-qtj89
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100
ostest-qgfdz-worker-0-tnbzg
---------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100
After the change:
ostest-qgfdz-worker-0-fkk2h <<----
---------------------------
default via 172.17.5.1 dev br-ex proto dhcp metric 49 <<---- the storage network route now has higher priority
default via 10.196.0.1 dev ens3 proto dhcp metric 101
ostest-qgfdz-worker-0-qtj89 (no changes)
----------------------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100
ostest-qgfdz-worker-0-tnbzg (no changes)
----------------------------------------
default via 10.196.0.1 dev br-ex proto dhcp metric 49
default via 172.17.5.1 dev ens4 proto dhcp metric 100
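The routes above can be checked directly on a node with a debug pod (a sketch; any shell on the host works equally well):

$ oc debug node/ostest-qgfdz-worker-0-fkk2h -- chroot /host ip route show default
# on the affected node the StorageNFS route (via 172.17.5.1) ended up on br-ex with the lowest metric,
# so it wins over the machine-network route (via 10.196.0.1), which matches the K8s API connectivity issue noted above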
$ oc describe node ostest-qgfdz-worker-0-fkk2h
Name: ostest-qgfdz-worker-0-fkk2h
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m4.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=regionOne
failure-domain.beta.kubernetes.io/zone=nova
kubernetes.io/arch=amd64
kubernetes.io/hostname=ostest-qgfdz-worker-0-fkk2h
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m4.xlarge
node.openshift.io/os_id=rhcos
topology.cinder.csi.openstack.org/zone=nova
topology.kubernetes.io/region=regionOne
topology.kubernetes.io/zone=nova
Annotations: csi.volume.kubernetes.io/nodeid:
{"cinder.csi.openstack.org":"f4cb37d6-12b3-448e-8416-34225b4ad0a6","manila.csi.openstack.org":"ostest-qgfdz-worker-0-fkk2h"}
k8s.ovn.org/host-addresses: ["10.196.3.93","172.17.5.216"]
k8s.ovn.org/l3-gateway-config:
{"default":{"mode":"shared","interface-id":"br-ex_ostest-qgfdz-worker-0-fkk2h","mac-address":"fa:16:3e:aa:b9:44","ip-addresses":["172.17.5...
k8s.ovn.org/node-chassis-id: 54cc8a65-7ed7-4bdb-adb5-57f26f6d0e8a
k8s.ovn.org/node-mgmt-port-mac-address: 9a:b8:37:5e:ea:62
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.17.5.216/24"}
k8s.ovn.org/node-subnets: {"default":"10.129.2.0/23"}
machine.openshift.io/machine: openshift-machine-api/ostest-qgfdz-worker-0-fkk2h
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-ff3365f8923121ad1960df303d0d147c
machineconfiguration.openshift.io/desiredConfig: rendered-worker-cc201ca80e2b969ef05d2a4b291debb4
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Working
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 03 Mar 2022 11:31:11 +0000
Taints: node.kubernetes.io/unschedulable:NoSchedule
UpdateInProgress:PreferNoSchedule
Unschedulable: true
[...]
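The currentConfig/desiredConfig mismatch together with machineconfiguration.openshift.io/state: Working shows the Machine Config Operator cordoned the node for the update and never finished it. Those annotations can be pulled directly (a sketch; note the escaped dots in the jsonpath):

$ oc get node ostest-qgfdz-worker-0-fkk2h -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
$ oc get node ostest-qgfdz-worker-0-fkk2h -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'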
Seen in 4.8 and 4.10 for now.
Hey there! Do you have sosreports and the full journal log (tar -czf /host/var/log/journal.tar.gz /host/var/log/journal) for one of the affected nodes? That would be super helpful. Thanks! - Andreas

Can you test and reproduce this on a 4.8 cluster, and on the node with issues can you provide a sosreport and the full journal log (`tar -czf /host/var/log/journal.tar.gz /host/var/log/journal`)? I will need the journal ovs-configure and NM logs to figure out why this happens in 4.8. But I imagine that anything that triggers a reconfiguration of br-ex has the potential to trigger this. In 4.10 this will then definitely trigger https://bugzilla.redhat.com/show_bug.cgi?id=2057160, so in that case it's a duplicate. https://github.com/openshift/machine-config-operator/commit/9e3e74d4d2c31e449352bb5b9efcdb25f1234c98 will trigger a replumbing of br-ex on every reboot, and that can lead to the issues described here.

(In reply to Andreas Karis from comment #10)
> Can you test and reproduce this on a 4.8 cluster, and on the node with
> issues can you provide a sosreport and the full journal log (`tar -czf
> /host/var/log/journal.tar.gz /host/var/log/journal`)? I will need the
> journal ovs-configure and NM logs to figure out why this happens in 4.8.
> But I imagine that anything that triggers a reconfiguration of br-ex has
> the potential to trigger this.

Hey Andreas, I'll try reproducing it manually and getting the info. We're hitting it in our CI but cannot extract that info from there. Thanks

Hi Jon! Did you have any luck reproducing this on 4.8? Thanks!

(In reply to Andreas Karis from comment #13)
> Hi Jon! Did you have any luck reproducing this on 4.8? Thanks!

Hey Andreas, not yet. If we see that it's no longer reproducible in 4.8 we can close it, but let's wait a couple of weeks and see the results. Thanks

Ok, I'll close this as a duplicate in the meantime to clean up my backlog. If you can reproduce this on 4.8, please simply reopen this BZ and I'll have a look at it.

*** This bug has been marked as a duplicate of bug 2057160 ***
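For reference, the data requested above can be gathered from an affected node roughly like this (a sketch; the tar command is the one quoted in the comments, and the sosreport step assumes the usual toolbox workflow on RHCOS):

$ oc debug node/ostest-qgfdz-worker-0-fkk2h
sh-4.4# tar -czf /host/var/log/journal.tar.gz /host/var/log/journal   # full journal, as requested
sh-4.4# chroot /host
sh-4.4# toolbox                                                       # then run sosreport inside the toolbox container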