Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:
Deploy an OCP 4.3 cluster on a POWER9 fleetwood/zz/zepplin server. Use Terraform scripts with an OpenStack provider to create the cluster VMs. This issue is reproducible on all POWER9 models when the VMs use an SEA-based network connection type.

Steps to Reproduce:
1.
2.
3.

Actual results:
The OpenShift install fails at the step below.

[root@arc-p9flt-ocp43-b0a2-bastion openstack-upi]# ./openshift-install wait-for install-complete
INFO Waiting up to 30m0s for the cluster at https://api.arc-p9flt-ocp43-b0a2.p9flt.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with RouteStatusDegradedFailedCreate: RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)
INFO Cluster operator authentication Progressing is Unknown with NoData:
INFO Cluster operator authentication Available is Unknown with NoData:
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available. Moving to release version "4.3.0-0.nightly-ppc64le-2020-03-02-144601". Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3896981f985871cd7b286866f1987d885d10a52698f7117014b9ccdb26b8bd9".
INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
INFO Cluster operator insights Disabled is False with :
INFO Cluster operator kube-controller-manager Progressing is True with Progressing: Progressing: 1 nodes are at revision 5; 2 nodes are at revision 6
INFO Cluster operator kube-scheduler Progressing is True with Progressing: Progressing: 1 nodes are at revision 5; 2 nodes are at revision 6
ERROR Cluster operator monitoring Degraded is True with UpdatingconfigurationsharingFailed: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s)
INFO Cluster operator monitoring Available is False with :
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator openshift-apiserver Available is False with _OpenShiftAPICheckFailed: Available: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) Available: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) Available: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) Available: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) Available: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO Cluster operator operator-lifecycle-manager-packageserver Available is False with :
INFO Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.13.0
FATAL failed to initialize the cluster: Multiple errors are preventing progress:
* Could not update oauthclient "console" (289 of 496): the server does not recognize this resource, check extension API servers
* Could not update role "openshift-console-operator/prometheus-k8s" (446 of 496): resource may have been deleted

All nodes and pods are healthy, but the authentication clusteroperator is in an unknown state.

[root@arc-p9flt-ocp43-b0a2-bastion auth]# oc get clusteroperators | grep auth
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication             Unknown     Unknown       True       20m

Expected results:

Additional info:
We don't see this issue when we use the SR-IOV type of network connection for the cluster VM deployment.

One of our internal developers had the following finding: the vxlan tunnel not transmitting is related to the MTU size, PMTU, and large sends. After changing the Path MTU discovery (PMTU) setting as shown below, we no longer see any dropped packets on vxlan_sys_4789 on the master.

The workaround below was applied to all master nodes, after which the authentication clusteroperator was up and running.

# oc debug node/arc-p9flt-ocp43-b0a2-master-0
# chroot /host
# vi /etc/sysctl.conf
net.ipv4.ip_no_pmtu_disc=1
# sysctl -p
# restart

We need this to be fixed in the OpenShift code.
aconstan, this is the same issue that we discussed offline over chat. We followed the workaround from https://access.redhat.com/solutions/4585001 and it worked.
Changing the system config feels like an installer thing, not an SDN one. If you feel otherwise, can you give us guidance on how to set the system config?
This would be applied through a MachineConfig definition. The MCO team can help with that process, but to me this bug doesn't make it clear when such configuration is necessary. Moving to multi-arch in hopes that they know best when to enable these settings on this platform.
We can create a template conf file in MCO along with the other sysctl variables and put this setting in there. I will investigate this.
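Roughly, the shape of such an MCO-delivered sysctl drop-in might be the following. This is only a sketch: the object name, role, file path, and Ignition spec version are assumptions here and depend on the OCP release.

# Hypothetical MachineConfig that lays down a sysctl drop-in on the masters.
# The 4.3/4.4-era MCO expects Ignition spec 2.x; newer releases use 3.x.
cat > 99-master-ibmveth-pmtu.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-ibmveth-pmtu
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - filesystem: root
        path: /etc/sysctl.d/99-ibmveth-pmtu.conf
        mode: 0644
        contents:
          # URL-encoded "net.ipv4.ip_no_pmtu_disc = 1"
          source: data:,net.ipv4.ip_no_pmtu_disc%20%3D%201%0A
EOF
oc create -f 99-master-ibmveth-pmtu.yaml   # the MCO then rolls the file out and reboots the masters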
Opened this issue to discuss options with the MCO team on ways to do this: https://github.com/openshift/machine-config-operator/issues/1638
OpenStack is not going to be supported in OCP 4.3. The only supported method for end-user installs in 4.3 is UPI on bare metal, where bare metal means RHCOS (Linux running on top of system firmware) or a pre-provisioned PowerVM LPAR.
I think using path MTU is not the best solution here. That is going to make us use the min_pmtu value, which by default is 552 bytes, and network performance will be terrible. We should not have to tweak the MTU values in general, but it is a valid workaround. Using this approach in MCO seems like it would lock us into an MTU value for all Power deployments, which is not going to work well for us.

A similar bug was seen in ICP deployments a while ago, again only in PowerVM environments. A better solution would be to disable large sends in the ibmveth driver:

insmod ibmveth.ko old_large_send=0

I'm not sure how to automate this via MCO, so perhaps this will just have to be an extra configuration step in PowerVM deployments...?

Alternatively, if we don't want to disable large sends (which, IMO, is the better solution because it avoids the MTU configs), we can change the pmtu value to something higher than 552 via:

net.ipv4.route.min_pmtu=1430 or 1450

Now the question would be: what's the appropriate MTU value? The idea is to keep the MTU as large as possible without creating a lot of fragmentation and obviously without dropping packets. Too small an MTU and network performance will suffer. This is why I think the large-send approach will be better going forward. Since large_send is enabled by default in this driver, I'm pretty sure disabling it will allow the authentication cluster operator to come up healthy. I'm waiting to get access to this cluster to run some tests.
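For reference, the mechanics of making either alternative persistent on a node would be roughly the following. This is a sketch, not a recommendation: it assumes ibmveth is loaded as a module on the node and that the old_large_send parameter behaves as described above.

# Option A (sketch): carry the module option across reboots
echo "options ibmveth old_large_send=0" > /etc/modprobe.d/ibmveth.conf
# takes effect the next time the ibmveth module is loaded (e.g. after a reboot)

# Option B (sketch): raise the minimum PMTU instead of touching large send
echo "net.ipv4.route.min_pmtu = 1450" > /etc/sysctl.d/99-min-pmtu.conf
sysctl --system   # reload all sysctl configuration files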
Prashanth is out until August. We are fixing other bugs during the current sprint, so I am adding the "UpcomingSprint" tag until Prashanth returns.
Hi Mick, you mentioned in a previous comment that you are planning to run some tests. Any findings on the potential fix for this bug?
I have been investigating an issue on 4.4 using ibmveth (not SEA), but I suspect it is the same issue.

I found that a cached route entry to the address of the other LPAR (or node) exists with an associated MTU of 1450. This cached entry results in some frames sent via the vxlan tunnel being dropped. Any packet of size 1450 with Don't Fragment set, sent via the vxlan and destined to the address matching the cached entry, will be dropped in ip_fragment() with error -EMSGSIZE.

I can demonstrate this using ping, sending a larger-than-MTU ICMP ECHO frame:

# ping -c1 -s2000 10.129.0.1   <<< address of tun0 interface on node 9.47.88.234

This results in two packets being sent, making up a 2000-byte ICMP ECHO. Debug tracing:

172741.648290: vxlan_xmit_one: skb=000000001d6ce7c9 prot=1 10.128.0.1:0-->10.129.0.1:0 len=618    << small packet
172741.648293: vxlan_xmit_one: skb=0000000076d0dea2 prot=1 10.128.0.1:0-->10.129.0.1:0 len=1458   << large packet
172741.648300: vxlan_xmit_one: skb=000000001d6ce7c9 df=0x40 len=626 9.47.88.239-->9.47.88.234     << DF is set
172741.648300: vxlan_xmit_one: skb=0000000076d0dea2 df=0x40 len=1466 9.47.88.239-->9.47.88.234    ""
172741.648332: iptunnel_xmit: skb=000000001d6ce7c9 err=0x0   <<< small packet sent with no error
172741.648332: ip_fragment.constprop.4: IPSTATS_MIB_FRAGFAILS skb=0000000076d0dea2 mtu=1450 flags=0x0 gso_size=0 dst=000000009d832520 dev=env32 frag_max_size=0
172741.648363: iptunnel_xmit: skb=0000000076d0dea2 err=0xffffffa6   <<< larger packet returns -EMSGSIZE
172741.648363: iptunnel_xmit: ++tx-drop: skb=0000000076d0dea2

Once the larger of the two packets is encapsulated in a vxlan header it becomes larger than the MTU of 1450, and the packet is dropped. However, the MTU of env32 should be 1500, not 1450:

# ip link show | grep env32
2: env32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500

The MTU passed to ip_fragment() is 1450. Where did that come from? Checking cached routes shows the answer:

[root@master-0 etc]# ip -d route get to 9.47.88.234
unicast 9.47.88.234 dev env32 table main src 9.47.88.239 uid 0
    cache expires 597sec mtu 1450   <<<<

If I flush the cached FIB, then ping starts to work:

[root@master-0 etc]# ip route flush cache
[root@master-0 etc]# ip -d route get to 9.47.88.234
unicast 9.47.88.234 dev env32 table main src 9.47.88.239 uid 0
    cache
[root@master-0 etc]# ping -c1 -s2000 10.129.0.1
PING 10.129.0.1 (10.129.0.1) 2000(2028) bytes of data.
2008 bytes from 10.129.0.1: icmp_seq=1 ttl=64 time=1.17 ms

--- 10.129.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.174/1.174/1.174/0.000 ms

After a bit the cached FIB entry returns and ping stops working again. The cache entry must be the result of PMTU discovery. But path MTU should be disabled. Is this correct?

[root@master-0 core]# sysctl -a | grep pmtu
net.ipv4.ip_forward_use_pmtu = 0
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.route.min_pmtu = 552

Next questions: Why is the PMTU sent from the other LPAR? Why is this not a problem on other platforms?

Possible solution:

ip route flush cache
ip route add 9.47.88.234 dev env32 mtu lock 1500

This appears to avoid the problem: the vxlan tx_drop counter stops increasing and I don't see the cached FIB entry returning. Will this break anything else??
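In case someone wants to try that route-pinning mitigation against more than one peer, a rough sketch follows. The interface name and node addresses are just this environment's examples and would need to be adjusted per node.

#!/bin/bash
# Sketch: pin the route MTU toward each peer node so a cached PMTU of 1450
# cannot take effect for it. IFACE and PEERS are assumptions for this cluster.
IFACE=env32
PEERS="9.47.88.234 9.47.88.239"

ip route flush cache
for ip in $PEERS; do
    # skip this node's own address
    ip addr show dev "$IFACE" | grep -qwF "$ip" && continue
    ip route replace "$ip" dev "$IFACE" mtu lock 1500
done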
Thank you, Dave. Can this bug be assigned to you, since you are investigating the similar issue? Or will this bug need RH input?
(In reply to Dan Li from comment #15)
> Thank you, Dave. Can this bug be assigned to you, since you are
> investigating the similar issue? Or will this bug need RH input?

You can assign it to me. However, I would appreciate input on my findings from OpenShift development as I continue to investigate.
Sure. Let's see if there are any responses on this thread. If not, you can either raise this during our weekly RH/IBM call or let me know so that I can help with finding the right person for an answer.
Created attachment 1712099 [details] MustFrag-worker0-env32-Filter-TCP.pcap
We know that vxlan dropping tx packets is caused by the creation of a cached route entry with MTU=1450. This cache entry must be created due to the receipt of an ICMP Unreachable/Must-Fragment error. Using tcpdump I captured several of these ICMP packets sent from worker0. tcpdump was run on worker-0 (tcpdump -n -i env32 -w env32-startup.pcap).

Reference trace: MustFrag-worker0-env32-Filter-TCP.pcap
master0 - 9.47.88.239
worker0 - 9.47.88.234

The pod interfaces (eth0) have MTU=1450. The external interface (ibmveth) has MTU=1500. The pcap was filtered to follow the TCP connection showing the ICMP error. The following packets are of interest.

1) Connection setup:
PK1 9.47.88.234:sun-sr-https > SYN MSS:1410
PK5 9.47.88.234:5439 > SYN/ACK MSS:1410
<....>

2) A large packet is sent from master0 to worker0. Packet length=3080, MSS=3014 (DF is set); worker0 ACKs this in PK10. MSS=3014 is larger than the negotiated MSS of 1410, which is OK due to TSO being enabled on all interfaces.
PK9: 9.47.88.239:sun-sr-https > 9.47.88.234:54390 Frame length: 3080 TCP-Len (MSS): 3014

3) master0 sends a 1489-byte packet (MSS=1423/DF) to worker0, and worker0 responds with an ICMP Must-Fragment. This packet is still larger than the negotiated MSS but smaller than PK9. The MTU of the interface is 1450. Why did this packet result in an ICMP Must-Fragment when PK9 did not?
PK27 9.47.88.239:sun-sr-https > 9.47.88.234:54390 Frame length: 1489 TCP-Len: 1423
PK28 9.47.88.234 > 9.47.88.239 ICMP Destination unreachable/Fragmentation needed. Points to PK27 as the packet that needs fragmentation.

My guess is that the size of PK9 was compared to the gso_size, and PK27's size was erroneously compared to the MTU of 1450.
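For live debugging (rather than a saved pcap), the same ICMP errors can be watched with a capture filter along these lines; this is just a sketch, and env32 is this environment's interface name.

# capture only ICMP "destination unreachable / fragmentation needed" (type 3, code 4)
tcpdump -n -i env32 'icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 4'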
I have identified the bug in the ibmveth driver causing the generation of the ICMP Must-Fragment error. I tested a change to the ibmveth driver and it successfully resolved the issue. My change was reviewed by Power Virtualization development. A further discussion of the change with PHYP and VIOS development is pending.
Hi Dave, it seems that there was an action pending on the bug. Do you think this bug will be resolved before the sprint ends this week? If not, I would like to add an "UpcomingSprint" tag to this bug so that it can be evaluated in a future sprint.
No, it will not be resolved this week. I don't have an estimate at this time.
Thank you, Dave! Adding "UpcomingSprint" tag to this bug
Hi Dave, I am doing this as a part of Sprint close-off activity every 3 weeks. Do you think this bug will be resolved before end of this sprint (before October 3rd)? If not, I would like to add an "UpcomingSprint" tag for evaluation in the future.
(In reply to Dan Li from comment #22)
> Hi Dave, it seems that there was an action pending on the bug. Do you think
> this bug will be resolved before the sprint ends this week? If not, I would
> like to add an "UpcomingSprint" tag to this bug so that it can be evaluated
> in a future sprint.

Hi Dan, please mark as "UpcomingSprint". Here is a status; sorry I have not updated sooner.

The issue is with the ibmveth driver and how it decides whether an ingress frame is a GSO frame or not. We understand the problem and a fix has been proposed. The firmware and the VIOS version also affect the problem, therefore we need to validate the fix against older firmware. We are in the process of this validation. Once we are happy with the fix I will submit the patch to lkml and then ask for it to be incorporated into the RHCOS kernel.

A workaround exists that can be used until the new kernel is available. This only affects configurations using "interpartition logical LAN" (using the ibmveth driver). The workaround is to set:

net.ipv4.route.min_pmtu = 1450
net.ipv4.ip_no_pmtu_disc = 1
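For completeness, applying and checking the workaround on a single node might look like this. It is only a sketch: the peer address is this environment's example, and on RHCOS the settings would normally be delivered via a MachineConfig as discussed earlier rather than set by hand.

# apply the two settings from the workaround
sysctl -w net.ipv4.route.min_pmtu=1450
sysctl -w net.ipv4.ip_no_pmtu_disc=1

# confirm the running values
sysctl net.ipv4.route.min_pmtu net.ipv4.ip_no_pmtu_disc

# inspect what PMTU, if any, is now cached toward a peer node
ip -d route get to 9.47.88.234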
Thank you for the update, Dave. Adding the "UpcomingSprint" label to this bug.
*** Bug 1827376 has been marked as a duplicate of this bug. ***
------- Comment From zhangcho.com 2020-04-27 16:28 EDT-------
1. Create 88-sysctl.conf with:
   net.ipv4.tcp_mtu_probing = 1
   net.ipv4.tcp_base_mss = 1024
2. Add the 88-sysctl.conf to the ignition files.
3. Install OCP4 with the new ignition from step 2.

Now the OCP4 install can complete without error.
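For anyone reproducing this, the fragment added to the node ignition files in step 2 might look roughly like the following. This is a sketch that assumes Ignition spec 2.2.0 and the path /etc/sysctl.d/88-sysctl.conf; the files entry would be merged into the storage.files array of the installer-generated ignition config rather than used standalone.

{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/sysctl.d/88-sysctl.conf",
        "mode": 420,
        "contents": {
          "source": "data:,net.ipv4.tcp_mtu_probing%20%3D%201%0Anet.ipv4.tcp_base_mss%20%3D%201024%0A"
        }
      }
    ]
  }
}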
The following patch series has been submitted to netdev; the two patches in it correct the problem with ibmveth described in this bug.
https://www.spinics.net/lists/netdev/msg690799.html
https://www.spinics.net/lists/netdev/msg690800.html
https://www.spinics.net/lists/netdev/msg690801.html
(In reply to David J. Wilder from comment #31)
> The following patch series has been submitted to netdev; the two patches in
> it correct the problem with ibmveth described in this bug.
> https://www.spinics.net/lists/netdev/msg690799.html
> https://www.spinics.net/lists/netdev/msg690800.html
> https://www.spinics.net/lists/netdev/msg690801.html

FYI ... the above-mentioned patches are now requested for RHEL 8.4/8.3/8.2 through LTC bug 185730 - RH1887038 - "RHEL8.1 - ibmveth is producing TX errors over VXLAN when large send (TSO) is enabled" ...
Hi @Dave, an FYI that after OCP 4.6 releases next week, OCP 4.3 will reach end of maintenance support. If your patches fix this bug, then we should try to get the fix in and close out this bug. Otherwise, we should consider either closing this bug if it won't be fixed, or re-targeting it to a later reported version (4.4, 4.5, 4.6) if it affects other OCP versions.
------- Comment From wilder.com 2020-10-13 19:25 EDT-------
(In reply to comment #25)
> Hi @Dave, an FYI that after OCP 4.6 releases next week, OCP 4.3 will reach
> end of maintenance support. If your patches fix this bug, then we should try
> to get the fix in and close out this bug. Otherwise, we should consider
> either closing this bug if it won't be fixed, or re-targeting it to a later
> reported version (4.4, 4.5, 4.6) if it affects other OCP versions.

The fix is in the kernel, so it will affect all versions of OCP on Power. I tried to update the Version field on the bug but the pull-down only lists 4.3(x). Can you update it on your end?
Yes, I just bumped the "Version" to 4.4. Let me know if you would like to change it to any other release.
Hi @Dave, I'm going through this exercise once again. Do you think this bug will be resolved before the end of the current sprint (Oct 24th)? If not, I would like to add "UpcomingSprint" tag
Dave's patches are being added to the RHEL kernel: https://bugzilla.redhat.com/show_bug.cgi?id=1887038#c10. These patches should also be included in CoreOS. Should I create a new bug for the RHEL 8.3 z-stream to make sure this is included in OpenShift 4.7? Or will the fix in CoreOS be tracked here?
I believe we need to create a BZ for every release the fix will go into. Dennis, is that right?
I'm adding an "UpcomingSprint" tag as per BZ 1887038 this bug is still at "POST" and seems unlikely to be resolved before the end of the current sprint (Oct 24th).
Hi @Dave, do you see this bug as a blocker for OCP 4.7? If not, I would like to set the blocker flag to indicate that this bug is not a blocker.
Hi Dave, I see that BZ 1887038 is currently ON_QA. Will this bug reach ON_QA before the end of this sprint (November 14th)?
We need to file a bug for master, then clone that bug to each release we need to backport to.
------- Comment From wilder.com 2020-11-09 14:41 EDT------- (In reply to comment #32) > Hi @Dave, do you see this bug as a blocker for OCP 4.7? If not, I would like > to set the blocker flag to indicate that this bug is not a blocker. No, not a blocker.
Setting "blocker-" per Dave's comment and adding UpcomingSprint label as this may not be resolved before the end of this sprint
Hi Dave, is this bug still at "Post" state? Seems like the related bug BZ 1887038 is currently at Verified.
Adding "UpcomingSprint" as this bug will not be resolved before end of this sprint.
Hi Dave, a couple of questions, 1. Will this bug be resolved before the end of this sprint (Dec 26th)? If not, I'd like to add "UpcomingSprint" 2. The current "Target Release" for this bug is 4.7? Do we know if this bug will be resolved during 4.7? If not, we should set it to "blank" or "4.8"
(In reply to Dan Li from comment #47)
> Hi Dave, a couple of questions,
>
> 1. Will this bug be resolved before the end of this sprint (Dec 26th)? If
> not, I'd like to add "UpcomingSprint"
> 2. The current "Target Release" for this bug is 4.7? Do we know if this bug
> will be resolved during 4.7? If not, we should set it to "blank" or "4.8"

Dan, I presume Dave is "Dave Wilder" from IBM. He is out till Jan 4th, so he is unlikely to respond before then.

That said, Dave Wilder's patches were accepted upstream several months ago (see comment #31, which describes the submission upstream). In comment #37, Mick Trasel from IBM asked if these patches will be included in CoreOS as well. I don't understand the nuances of what is gating acceptance into OCP 4.x. An explanation would really help us understand the process that RH follows (so we don't need to keep bugging folks at RH).
Hi Pradeep, thank you for the response. As a part of the bug triage team, the OpenShift team requires us to check in with each bug owner every sprint (3 weeks) to confirm the status of each bug; hence my ask in Comment 47.

Since Dave Wilder's patch has been accepted, does it indicate that we can move this bug to a further state for the creator to test the fix?

In regards to Mick's comment 37, I am not the best person to answer this question, unfortunately. Does Dennis' answer in Comment 42 help with the question? Based on his answer, I believe the best practice is to create a bug for master, then create a "Clone" of the bug and point the target fix of the clone to each OCP release (4.7, 4.6, 4.5, etc.)
(In reply to Dan Li from comment #49)
> Hi Pradeep, thank you for the response. As a part of the bug triage team,
> the OpenShift team requires us to check in with each bug owner every sprint
> (3 weeks) to confirm the status of each bug; hence my ask in Comment 47.

Thanks for the explanation, Dan. Understood.

> Since Dave Wilder's patch has been accepted, does it indicate that we can
> move this bug to a further state for the creator to test the fix?

Yes, please move this bug to the next stage for testing the fix.

> In regards to Mick's comment 37, I am not the best person to answer this
> question, unfortunately. Does Dennis' answer in Comment 42 help with the
> question? Based on his answer, I believe the best practice is to create a
> bug for master, then create a "Clone" of the bug and point the target fix
> of the clone to each OCP release (4.7, 4.6, 4.5, etc.)

Looking at bug #1887038, it appears that has already been done. Do I understand that correctly?
Yes, it seems that bug 1887038 is already VERIFIED, which means that it is fixed and tested at least in the environment that the other bug is reported in.
Hi Archana, could you verify this bug from your side as the reporter? It seems that this bug has been fixed according to Comment 31 and bug 1887038.
(In reply to Dan Li from comment #52)
> Hi Archana, could you verify this bug from your side as the reporter? It
> seems that this bug has been fixed according to Comment 31 and bug 1887038.

FYI ... the related kernel patches are available in the RHEL 8 z-stream kernels:
- for 8.2.0.z in kernel-4.18.0-193.37.1.el8_2.src.rpm (as of RHBZ 1896300 - "RHEL8.1 - ibmveth is producing TX errors over VXLAN when large send (TSO) is enabled (-> related to Red Hat bug 1816254 - OCP 4.3 - Authentication clusteroperator is in unknown state on POWER 9 servers) [rhel-8.2.0.z]")
- for 8.3.0.z in kernel-4.18.0-240.8.1.el8_3.src.rpm (as of RHBZ 1896299 - "RHEL8.1 - ibmveth is producing TX errors over VXLAN when large send (TSO) is enabled (-> related to Red Hat bug 1816254 - OCP 4.3 - Authentication clusteroperator is in unknown state on POWER 9 servers) [rhel-8.3.0.z]")
Per Hanns-Joachim's comment 53, should this bug be moved to "ON_QA"?
Moving this bug to ON_QA as a part of the triage process per Comment 51 and Comment 53. Hi Archana, if you and the team have time, please verify this bug's fix, as the related bugs are now in Verified/Closed status.
We have verified this bug with an OCP 4.7 cluster and the install completed fine. We are deploying test applications on this cluster. We intend to monitor the cluster operator behavior for a few days to ensure that the auth CO doesn't get degraded. After that we will close this bug, sometime next week.
@aprabhak reminder
The cluster was stable and the COs were healthy. We can close this bug.
Closing this bug per Archana's successful verification results.