Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:
Deploy an OCP 4.3 cluster on a POWER9 fleetwood/zz/zepplin server. Use Terraform scripts with an OpenStack provider to create the cluster VMs. This issue is reproducible on all POWER9 models when the VMs use an SEA-based network connection type.

Steps to Reproduce:
1.
2.
3.

Actual results:
The OpenShift install fails at the step below.

[root@arc-p9flt-ocp43-b0a2-bastion openstack-upi]# ./openshift-install wait-for install-complete
INFO Waiting up to 30m0s for the cluster at https://api.arc-p9flt-ocp43-b0a2.p9flt.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with RouteStatusDegradedFailedCreate: RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)
INFO Cluster operator authentication Progressing is Unknown with NoData:
INFO Cluster operator authentication Available is Unknown with NoData:
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available. Moving to release version "4.3.0-0.nightly-ppc64le-2020-03-02-144601". Moving to ingress-controller image version "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3896981f985871cd7b286866f1987d885d10a52698f7117014b9ccdb26b8bd9".
INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
INFO Cluster operator insights Disabled is False with :
INFO Cluster operator kube-controller-manager Progressing is True with Progressing: Progressing: 1 nodes are at revision 5; 2 nodes are at revision 6
INFO Cluster operator kube-scheduler Progressing is True with Progressing: Progressing: 1 nodes are at revision 5; 2 nodes are at revision 6
ERROR Cluster operator monitoring Degraded is True with UpdatingconfigurationsharingFailed: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s)
INFO Cluster operator monitoring Available is False with :
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator openshift-apiserver Available is False with _OpenShiftAPICheckFailed: Available: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) Available: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) Available: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) Available: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request) Available: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
INFO Cluster operator operator-lifecycle-manager-packageserver Available is False with :
INFO Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.13.0
FATAL failed to initialize the cluster: Multiple errors are preventing progress:
* Could not update oauthclient "console" (289 of 496): the server does not recognize this resource, check extension API servers
* Could not update role "openshift-console-operator/prometheus-k8s" (446 of 496): resource may have been deleted

All nodes and pods are healthy, but the authentication clusteroperator is in an unknown state.

[root@arc-p9flt-ocp43-b0a2-bastion auth]# oc get clusteroperators | grep auth
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication             Unknown     Unknown       True       20m

Expected results:

Additional info:
We don't see this issue when we use the SR-IOV type of network connection for the cluster VM deployment.

One of our internal developers had the following finding: the vxlan tunnel not transmitting is related to the MTU size, PMTU, and large sends. After changing the Path MTU discovery (PMTU) setting as shown below, we no longer see any dropped packets on vxlan_sys_4789 on the master.

The workaround below was applied to all master nodes, after which the authentication clusteroperator was up and running.

# oc debug node/arc-p9flt-ocp43-b0a2-master-0
# chroot /host
# vi /etc/sysctl.conf
net.ipv4.ip_no_pmtu_disc=1
# sysctl -p
# restart

We need this to be fixed in the OpenShift code.
aconstan, this is the same issue that we discussed offline over chat. We followed the workaround from https://access.redhat.com/solutions/4585001 and it worked.
Changing the system config feels like an installer thing, not an SDN one. If you feel otherwise, can you give us guidance on how to set the system config?
This would be applied through a MachineConfig definition. The MCO team can help with that process, but to me this bug doesn't make it clear when such configuration is necessary. Moving to multi-arch in hopes that they know best when to enable these settings on this platform.
We can create a template conf file in MCO along with the other sysctl variables and put this setting in there. I will investigate this.
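Roughly, the shape of such an MCO-delivered sysctl drop-in might be the following. This is only a sketch: the object name, role, file path, and Ignition spec version are assumptions here and depend on the OCP release.

# Hypothetical MachineConfig that lays down a sysctl drop-in on the masters.
# The 4.3/4.4-era MCO expects Ignition spec 2.x; newer releases use 3.x.
cat > 99-master-ibmveth-pmtu.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-ibmveth-pmtu
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - filesystem: root
        path: /etc/sysctl.d/99-ibmveth-pmtu.conf
        mode: 0644
        contents:
          # URL-encoded "net.ipv4.ip_no_pmtu_disc = 1"
          source: data:,net.ipv4.ip_no_pmtu_disc%20%3D%201%0A
EOF
oc create -f 99-master-ibmveth-pmtu.yaml   # the MCO then rolls the file out and reboots the masters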
Opened this issue to discuss options with the MCO team on ways to do this: https://github.com/openshift/machine-config-operator/issues/1638
OpenStack is not going to be supported in OCP 4.3. The only supported method for end-user installs in 4.3 is UPI on bare metal, where bare metal means RHCOS (Linux running on top of system firmware) or a pre-provisioned PowerVM LPAR.
I think using path MTU is not the best solution here. That is going to make us use the min_pmtu value, which by default is 552 bytes, and network performance will be terrible. We should not have to tweak the MTU values in general, but it is a valid workaround. Using this approach in MCO seems like it would lock us into an MTU value for all Power deployments, which is not going to work well for us.

A similar bug was seen in ICP deployments a while ago, again only in PowerVM environments. A better solution would be to disable large sends in the ibmveth driver:

insmod ibmveth.ko old_large_send=0

I'm not sure how to automate this via MCO, so perhaps this will just have to be an extra configuration step in PowerVM deployments...?

Alternatively, if we don't want to disable large sends (which, IMO, is the better solution because it avoids the MTU configs), we can change the pmtu value to something higher than 552 via:

net.ipv4.route.min_pmtu=1430 or 1450

Now the question would be: what's the appropriate MTU value? The idea is to keep the MTU as large as possible without creating a lot of fragmentation and obviously without dropping packets. Too small an MTU and network performance will suffer. This is why I think the large-send approach will be better going forward. Since large_send is enabled by default in this driver, I'm pretty sure disabling it will allow the authentication cluster operator to come up healthy. I'm waiting to get access to this cluster to run some tests.
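For reference, the mechanics of making either alternative persistent on a node would be roughly the following. This is a sketch, not a recommendation: it assumes ibmveth is loaded as a module on the node and that the old_large_send parameter behaves as described above.

# Option A (sketch): carry the module option across reboots
echo "options ibmveth old_large_send=0" > /etc/modprobe.d/ibmveth.conf
# takes effect the next time the ibmveth module is loaded (e.g. after a reboot)

# Option B (sketch): raise the minimum PMTU instead of touching large send
echo "net.ipv4.route.min_pmtu = 1450" > /etc/sysctl.d/99-min-pmtu.conf
sysctl --system   # reload all sysctl configuration files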
Prashanth is out until August. We are fixing other bugs during the current sprint, so I am adding the "UpcomingSprint" tag until Prashanth returns.
Hi Mick, you mentioned in a previous comment that you are planning to run some tests. Any findings on the potential fix for this bug?
I have been investigating an issue on 4.4 using ibmveth (not SEA), but I suspect it is the same issue.

I found that a cached route entry to the address of the other LPAR (or node) exists with an associated MTU of 1450. This cached entry results in some frames sent via the vxlan tunnel being dropped. Any packet of size 1450 with Don't Fragment set, sent via the vxlan and destined to the address matching the cached entry, will be dropped in ip_fragment() with error -EMSGSIZE.

I can demonstrate this using ping, sending a larger-than-MTU ICMP ECHO frame:

# ping -c1 -s2000 10.129.0.1   <<< address of tun0 interface on node 9.47.88.234

This results in two packets being sent, making up a 2000-byte ICMP ECHO. Debug tracing:

172741.648290: vxlan_xmit_one: skb=000000001d6ce7c9 prot=1 10.128.0.1:0-->10.129.0.1:0 len=618    << small packet
172741.648293: vxlan_xmit_one: skb=0000000076d0dea2 prot=1 10.128.0.1:0-->10.129.0.1:0 len=1458   << large packet
172741.648300: vxlan_xmit_one: skb=000000001d6ce7c9 df=0x40 len=626 9.47.88.239-->9.47.88.234     << DF is set
172741.648300: vxlan_xmit_one: skb=0000000076d0dea2 df=0x40 len=1466 9.47.88.239-->9.47.88.234    ""
172741.648332: iptunnel_xmit: skb=000000001d6ce7c9 err=0x0   <<< small packet sent with no error
172741.648332: ip_fragment.constprop.4: IPSTATS_MIB_FRAGFAILS skb=0000000076d0dea2 mtu=1450 flags=0x0 gso_size=0 dst=000000009d832520 dev=env32 frag_max_size=0
172741.648363: iptunnel_xmit: skb=0000000076d0dea2 err=0xffffffa6   <<< larger packet returns -EMSGSIZE
172741.648363: iptunnel_xmit: ++tx-drop: skb=0000000076d0dea2

Once the larger of the two packets is encapsulated in a vxlan header it becomes larger than the MTU of 1450, and the packet is dropped. However, the MTU of env32 should be 1500, not 1450:

# ip link show | grep env32
2: env32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500

The MTU passed to ip_fragment() is 1450. Where did that come from? Checking cached routes shows the answer:

[root@master-0 etc]# ip -d route get to 9.47.88.234
unicast 9.47.88.234 dev env32 table main src 9.47.88.239 uid 0
    cache expires 597sec mtu 1450   <<<<

If I flush the cached FIB, then ping starts to work:

[root@master-0 etc]# ip route flush cache
[root@master-0 etc]# ip -d route get to 9.47.88.234
unicast 9.47.88.234 dev env32 table main src 9.47.88.239 uid 0
    cache
[root@master-0 etc]# ping -c1 -s2000 10.129.0.1
PING 10.129.0.1 (10.129.0.1) 2000(2028) bytes of data.
2008 bytes from 10.129.0.1: icmp_seq=1 ttl=64 time=1.17 ms

--- 10.129.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.174/1.174/1.174/0.000 ms

After a bit the cached FIB entry returns and ping stops working again. The cache entry must be the result of PMTU discovery. But path MTU should be disabled. Is this correct?

[root@master-0 core]# sysctl -a | grep pmtu
net.ipv4.ip_forward_use_pmtu = 0
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.route.min_pmtu = 552

Next questions: Why is the PMTU sent from the other LPAR? Why is this not a problem on other platforms?

Possible solution:

ip route flush cache
ip route add 9.47.88.234 dev env32 mtu lock 1500

This appears to avoid the problem: the vxlan tx_drop counter stops increasing and I don't see the cached FIB entry returning. Will this break anything else??
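In case someone wants to try that route-pinning mitigation against more than one peer, a rough sketch follows. The interface name and node addresses are just this environment's examples and would need to be adjusted per node.

#!/bin/bash
# Sketch: pin the route MTU toward each peer node so a cached PMTU of 1450
# cannot take effect for it. IFACE and PEERS are assumptions for this cluster.
IFACE=env32
PEERS="9.47.88.234 9.47.88.239"

ip route flush cache
for ip in $PEERS; do
    # skip this node's own address
    ip addr show dev "$IFACE" | grep -qwF "$ip" && continue
    ip route replace "$ip" dev "$IFACE" mtu lock 1500
done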
Thank you, Dave. Can this bug be assigned to you, since you are investigating the similar issue? Or will this bug need RH input?
(In reply to Dan Li from comment #15)
> Thank you, Dave. Can this bug be assigned to you, since you are
> investigating the similar issue? Or will this bug need RH input?

You can assign it to me. However, I would appreciate input on my findings from OpenShift development as I continue to investigate.
Sure. Let's see if there are any responses on this thread. If not, you can either raise this during our weekly RH/IBM call or let me know so that I can help with finding the right person for an answer.
Created attachment 1712099 [details] MustFrag-worker0-env32-Filter-TCP.pcap
We know that vxlan dropping tx packets is caused by the creation of a cached route entry with MTU=1450. This cache entry must be created due to the receipt of an ICMP Unreachable/Must-Fragment error. Using tcpdump I captured several of these ICMP packets sent from worker0. tcpdump was run on worker-0 (tcpdump -n -i env32 -w env32-startup.pcap).

Reference trace: MustFrag-worker0-env32-Filter-TCP.pcap
master0 - 9.47.88.239
worker0 - 9.47.88.234

The pod interfaces (eth0) have MTU=1450. The external interface (ibmveth) has MTU=1500. The pcap was filtered to follow the TCP connection showing the ICMP error. The following packets are of interest.

1) Connection setup:
PK1 9.47.88.234:sun-sr-https > SYN MSS:1410
PK5 9.47.88.234:5439 > SYN/ACK MSS:1410
<....>

2) A large packet is sent from master0 to worker0. Packet length=3080, MSS=3014 (DF is set); worker0 ACKs this in PK10. MSS=3014 is larger than the negotiated MSS of 1410, which is OK due to TSO being enabled on all interfaces.
PK9: 9.47.88.239:sun-sr-https > 9.47.88.234:54390 Frame length: 3080 TCP-Len (MSS): 3014

3) master0 sends a 1489-byte packet (MSS=1423/DF) to worker0, and worker0 responds with an ICMP Must-Fragment. This packet is still larger than the negotiated MSS but smaller than PK9. The MTU of the interface is 1450. Why did this packet result in an ICMP Must-Fragment when PK9 did not?
PK27 9.47.88.239:sun-sr-https > 9.47.88.234:54390 Frame length: 1489 TCP-Len: 1423
PK28 9.47.88.234 > 9.47.88.239 ICMP Destination unreachable/Fragmentation needed. Points to PK27 as the packet that needs fragmentation.

My guess is that the size of PK9 was compared to the gso_size, and PK27's size was erroneously compared to the MTU of 1450.
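For live debugging (rather than a saved pcap), the same ICMP errors can be watched with a capture filter along these lines; this is just a sketch, and env32 is this environment's interface name.

# capture only ICMP "destination unreachable / fragmentation needed" (type 3, code 4)
tcpdump -n -i env32 'icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 4'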
I have identified the bug in the ibmveth driver causing the generation of the ICMP Must-Fragment error. I tested a change to the ibmveth driver and it successfully resolved the issue. My change was reviewed by Power Virtualization development. A further discussion of the change with PHYP and VIOS development is pending.
Hi Dave, it seems that there was an action pending on the bug. Do you think this bug will be resolved before the sprint ends this week? If not, I would like to add an "UpcomingSprint" tag to this bug so that it can be evaluated in a future sprint.
No, it will not be resolved this week. I don't have an estimate at this time.
Thank you, Dave! Adding "UpcomingSprint" tag to this bug
Hi Dave, I am doing this as a part of Sprint close-off activity every 3 weeks. Do you think this bug will be resolved before end of this sprint (before October 3rd)? If not, I would like to add an "UpcomingSprint" tag for evaluation in the future.
(In reply to Dan Li from comment #22)
> Hi Dave, it seems that there was an action pending on the bug. Do you think
> this bug will be resolved before the sprint ends this week? If not, I would
> like to add an "UpcomingSprint" tag to this bug so that it can be evaluated
> in a future sprint.

Hi Dan, please mark as "UpcomingSprint". Here is a status; sorry I have not updated sooner.

The issue is with the ibmveth driver and how it decides whether an ingress frame is a GSO frame or not. We understand the problem and a fix has been proposed. The firmware and the VIOS version also affect the problem, therefore we need to validate the fix against older firmware. We are in the process of this validation. Once we are happy with the fix I will submit the patch to lkml and then ask for it to be incorporated into the RHCOS kernel.

A workaround exists that can be used until the new kernel is available. This only affects configurations using "interpartition logical LAN" (using the ibmveth driver). The workaround is to set:

net.ipv4.route.min_pmtu = 1450
net.ipv4.ip_no_pmtu_disc = 1
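For completeness, applying and checking the workaround on a single node might look like this. It is only a sketch: the peer address is this environment's example, and on RHCOS the settings would normally be delivered via a MachineConfig as discussed earlier rather than set by hand.

# apply the two settings from the workaround
sysctl -w net.ipv4.route.min_pmtu=1450
sysctl -w net.ipv4.ip_no_pmtu_disc=1

# confirm the running values
sysctl net.ipv4.route.min_pmtu net.ipv4.ip_no_pmtu_disc

# inspect what PMTU, if any, is now cached toward a peer node
ip -d route get to 9.47.88.234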
Thank you for the update, Dave. Adding the "UpcomingSprint" label to this bug.
*** Bug 1827376 has been marked as a duplicate of this bug. ***
------- Comment From zhangcho.com 2020-04-27 16:28 EDT-------
1. Create 88-sysctl.conf with:
   net.ipv4.tcp_mtu_probing = 1
   net.ipv4.tcp_base_mss = 1024
2. Add the 88-sysctl.conf to the ignition files.
3. Install OCP4 with the new ignition from step 2.

Now the OCP4 install can complete without error.
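For anyone reproducing this, the fragment added to the node ignition files in step 2 might look roughly like the following. This is a sketch that assumes Ignition spec 2.2.0 and the path /etc/sysctl.d/88-sysctl.conf; the files entry would be merged into the storage.files array of the installer-generated ignition config rather than used standalone.

{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/sysctl.d/88-sysctl.conf",
        "mode": 420,
        "contents": {
          "source": "data:,net.ipv4.tcp_mtu_probing%20%3D%201%0Anet.ipv4.tcp_base_mss%20%3D%201024%0A"
        }
      }
    ]
  }
}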
The following patch series has been submitted to netdev; the two patches in it correct the problem with ibmveth described in this bug.
https://www.spinics.net/lists/netdev/msg690799.html
https://www.spinics.net/lists/netdev/msg690800.html
https://www.spinics.net/lists/netdev/msg690801.html
(In reply to David J. Wilder from comment #31)
> The following patch series has been submitted to netdev; the two patches in
> it correct the problem with ibmveth described in this bug.
> https://www.spinics.net/lists/netdev/msg690799.html
> https://www.spinics.net/lists/netdev/msg690800.html
> https://www.spinics.net/lists/netdev/msg690801.html

FYI ... the above-mentioned patches are now requested for RHEL 8.4/8.3/8.2 through LTC bug 185730 - RH1887038 - "RHEL8.1 - ibmveth is producing TX errors over VXLAN when large send (TSO) is enabled" ...
Hi @Dave, an FYI that after OCP 4.6 releases next week, OCP 4.3 will reach end of maintenance support. If your patches fix this bug, then we should try to get the fix in and close out this bug. Otherwise, we should consider either closing this bug if it won't be fixed, or re-targeting it to a later reported version (4.4, 4.5, 4.6) if it affects other OCP versions.
------- Comment From wilder.com 2020-10-13 19:25 EDT-------
(In reply to comment #25)
> Hi @Dave, an FYI that after OCP 4.6 releases next week, OCP 4.3 will reach
> end of maintenance support. If your patches fix this bug, then we should try
> to get the fix in and close out this bug. Otherwise, we should consider
> either closing this bug if it won't be fixed, or re-targeting it to a later
> reported version (4.4, 4.5, 4.6) if it affects other OCP versions.

The fix is in the kernel, so it will affect all versions of OCP on Power. I tried to update the Version field on the bug but the pull-down only lists 4.3(x). Can you update it on your end?
Yes, I just bumped the "Version" to 4.4. Let me know if you would like to change it to any other release.
Hi @Dave, I'm going through this exercise once again. Do you think this bug will be resolved before the end of the current sprint (Oct 24th)? If not, I would like to add "UpcomingSprint" tag
Dave's patches are being added to the RHEL kernel: https://bugzilla.redhat.com/show_bug.cgi?id=1887038#c10. These patches should also be included in CoreOS. Should I create a new bug for the RHEL 8.3 z-stream to make sure this is included in OpenShift 4.7? Or will the fix in CoreOS be tracked here?
I believe we need to create a BZ for every release the fix will go into. Dennis, is that right?
I'm adding an "UpcomingSprint" tag as per BZ 1887038 this bug is still at "POST" and seems unlikely to be resolved before the end of the current sprint (Oct 24th).
Hi @Dave, do you see this bug as a blocker for OCP 4.7? If not, I would like to set the blocker flag to indicate that this bug is not a blocker.
Hi Dave, I see that BZ 1887038 is currently ON_QA. Will this bug reach ON_QA before the end of this sprint (November 14th)?
We need to file a bug for master, then clone that bug to each release we need to backport to.
------- Comment From wilder.com 2020-11-09 14:41 EDT------- (In reply to comment #32) > Hi @Dave, do you see this bug as a blocker for OCP 4.7? If not, I would like > to set the blocker flag to indicate that this bug is not a blocker. No, not a blocker.
Setting "blocker-" per Dave's comment and adding UpcomingSprint label as this may not be resolved before the end of this sprint
Hi Dave, is this bug still at "Post" state? Seems like the related bug BZ 1887038 is currently at Verified.
Adding "UpcomingSprint" as this bug will not be resolved before end of this sprint.
Hi Dave, a couple of questions, 1. Will this bug be resolved before the end of this sprint (Dec 26th)? If not, I'd like to add "UpcomingSprint" 2. The current "Target Release" for this bug is 4.7? Do we know if this bug will be resolved during 4.7? If not, we should set it to "blank" or "4.8"
(In reply to Dan Li from comment #47)
> Hi Dave, a couple of questions,
>
> 1. Will this bug be resolved before the end of this sprint (Dec 26th)? If
> not, I'd like to add "UpcomingSprint"
> 2. The current "Target Release" for this bug is 4.7? Do we know if this bug
> will be resolved during 4.7? If not, we should set it to "blank" or "4.8"

Dan, I presume Dave is "Dave Wilder" from IBM. He is out till Jan 4th, so he is unlikely to respond before then.

That said, Dave Wilder's patches were accepted upstream several months ago (see comment #31, which describes the submission upstream). In comment #37, Mick Trasel from IBM asked if these patches will be included in CoreOS as well. I don't understand the nuances of what is gating acceptance into OCP 4.x. An explanation would really help us understand the process that RH follows (so we don't need to keep bugging folks at RH).
Hi Pradeep, thank you for the response. As a part of the bug triage team, the OpenShift team requires us to check in with each bug owner every sprint (3 weeks) to confirm the status of each bug; hence my ask in Comment 47.

Since Dave Wilder's patch has been accepted, does it indicate that we can move this bug to a further state for the creator to test the fix?

In regards to Mick's comment 37, I am not the best person to answer this question, unfortunately. Does Dennis' answer in Comment 42 help with the question? Based on his answer, I believe the best practice is to create a bug for master, then create a "Clone" of the bug and point the target fix of the clone to each OCP release (4.7, 4.6, 4.5, etc.)
(In reply to Dan Li from comment #49)
> Hi Pradeep, thank you for the response. As a part of the bug triage team,
> the OpenShift team requires us to check in with each bug owner every sprint
> (3 weeks) to confirm the status of each bug; hence my ask in Comment 47.

Thanks for the explanation, Dan. Understood.

> Since Dave Wilder's patch has been accepted, does it indicate that we can
> move this bug to a further state for the creator to test the fix?

Yes, please move this bug to the next stage for testing the fix.

> In regards to Mick's comment 37, I am not the best person to answer this
> question, unfortunately. Does Dennis' answer in Comment 42 help with the
> question? Based on his answer, I believe the best practice is to create a
> bug for master, then create a "Clone" of the bug and point the target fix
> of the clone to each OCP release (4.7, 4.6, 4.5, etc.)

Looking at bug #1887038, it appears that has already been done. Do I understand that correctly?
Yes, it seems that bug 1887038 is already VERIFIED, which means that it is fixed and tested at least in the environment that the other bug is reported in.
Hi Archana, could you verify this bug from your side as the reporter? It seems that this bug has been fixed according to Comment 31 and bug 1887038.
(In reply to Dan Li from comment #52)
> Hi Archana, could you verify this bug from your side as the reporter? It
> seems that this bug has been fixed according to Comment 31 and bug 1887038.

FYI ... the related kernel patches are available in the RHEL 8 z-stream kernels:
- for 8.2.0.z in kernel-4.18.0-193.37.1.el8_2.src.rpm (as of RHBZ 1896300 - "RHEL8.1 - ibmveth is producing TX errors over VXLAN when large send (TSO) is enabled (-> related to Red Hat bug 1816254 - OCP 4.3 - Authentication clusteroperator is in unknown state on POWER 9 servers) [rhel-8.2.0.z]")
- for 8.3.0.z in kernel-4.18.0-240.8.1.el8_3.src.rpm (as of RHBZ 1896299 - "RHEL8.1 - ibmveth is producing TX errors over VXLAN when large send (TSO) is enabled (-> related to Red Hat bug 1816254 - OCP 4.3 - Authentication clusteroperator is in unknown state on POWER 9 servers) [rhel-8.3.0.z]")
Per Hanns-Joachim's comment 53, should this bug be moved to "ON_QA"?
Moving this bug to ON_QA as a part of the triage process per Comment 51 and Comment 53. Hi Archana, if you and the team have time, please verify this bug's fix, as the related bugs are now in Verified/Closed status.
We have verified this bug with an OCP 4.7 cluster and the install completed fine. We are deploying test applications on this cluster. We intend to monitor the cluster operator behavior for a few days to ensure that the auth CO doesn't get degraded. After that we will close this bug, sometime next week.
@aprabhak reminder
The cluster was stable and the COs were healthy. We can close this bug.
Closing this bug per Archana's successful verification results.