Bug 1832332

Summary: "[sig-network] Services should be rejected when no endpoints exist" test fails frequently on RHEL7 nodes
Product: Red Hat Enterprise Linux 7
Reporter: Vikas Laad <vlaad>
Component: kernel
Assignee: Paolo Abeni <pabeni>
kernel sub component: arp/icmp
QA Contact: Jianlin Shi <jishi>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: aconstan, atragler, bbennett, cglombek, danw, dcbw, dhoward, ecordell, gnault, jdesousa, jiji, jstancek, miabbott, mkumatag, nmurray, pabeni, periklis, ptalbert, ricarril, rteague, sdodson, skunkerk, sukulkar, vrutkovs, walters, weliang, wking, ykashtan, zzhao
Version: 7.8
Flags: pabeni: needinfo-
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: OKDBlocker
Fixed In Version: kernel-3.10.0-1148.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1781575
: 1834184 (view as bug list)
Environment:
Last Closed: 2020-09-29 21:14:38 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1781575
Bug Blocks: 1779811, 1825255, 1831684, 1834184

Comment 1 Juan Luis de Sousa-Valadas 2020-05-06 16:58:02 UTC
Reassigning to kernel, but most likely there isn't an issue in the kernel at all; we just need someone to figure out what is happening at the kernel level so that we can ultimately fix it in user space. This is also happening on RHCOS, which has a RHEL 8 kernel, but I'm assigning the RHEL 7 version because it will be easier for a kernel engineer to investigate than the RHCOS one, where they won't have access to some of their usual tools.

The problem we're seeing is that the OCP nodes (RHEL hosts) fail to send the ICMP rejection for traffic matching iptables REJECT rules, because it hits the ipv4 icmp_ratelimit at least 95% of the time. We verified this is the issue by changing the icmp_ratemask from 6168 to 6160; after that the problem goes away. icmp_ratelimit has the default value of 1000.
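(For reference, a hedged way to read those mask values: bit N of icmp_ratemask corresponds to ICMP type N, so the change above simply takes type 3, Destination Unreachable, out of the rate-limited set.)

```
# 6168 = 0x1818 -> bits 3,4,11,12 set: Dest Unreachable, Source Quench, Time Exceeded, Parameter Problem are rate limited
# 6160 = 0x1810 -> bit 3 cleared: Dest Unreachable is no longer rate limited
printf '%#x %#x\n' 6168 6160
sysctl net.ipv4.icmp_ratemask net.ipv4.icmp_ratelimit
```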

Currently we don't understand WHY ICMP is rate limited. I tried to find the ICMP packets, with no success at all, using this snippet:
# capture ICMP in every network namespace (column 4 of lsns is the PID of a process in that netns)
for i in $(lsns | cut -c -90 | grep net | awk '{print $4}'); do nsenter -n -t $i tcpdump -i any icmp -w ns_$i.pcap & done

The result is that I don't see any ICMP traffic except for the traffic I force to be created, even if I entirely disable the icmp_ratemask and wait several minutes. I also tried to use tracefs and look for ICMP traces there by doing:
# mount -t tracefs nodev /sys/kernel/tracing
# cd /sys/kernel/tracing
# echo function > current_tracer
# grep icmp trace_pipe

In this case I did see some ICMP but nothing interesting or that I can possibly correlate.

Also, it's important to say that OCP networking is more complex than your usual setup: we have a significant number of net namespaces (one per pod), and we do the switching on an OVS bridge; this needs to be taken into consideration. We also have a vast number of iptables rules.

So we need help figuring out:
1- What kind of ICMP traffic is being rate limited
2- Where is it coming from and going to?
3- If this is traffic inside the host, it would be extremely helpful to know information about the process, such as its cmdline and PPID.

OpenShift QA can provide a test environment with a RHEL node which you can SSH into and act as root on like a normal RHEL host, install packages using yum, etc. I can also assist with setting up a reproducer or with answering any questions you may have.

Thanks

Comment 3 Patrick Talbert 2020-05-07 12:17:43 UTC
I don't quite understand the question.

The system has many iptables rules with a REJECT "icmp-port-unreachable" action and the concern is that when traffic matches these rules the system does not always transmit the expected ICMP packet back to the sender unless the icmp_ratelimit is removed or the default ratemask adjusted to allow Type 3 Dest Unreachable?


Note that even if the ICMP messages are rate limited, the traffic which triggered it is still dropped.



Upstream and the RHEL kernel (as of RHEL 7) also include *global* limits controlled by icmp_msgs_per_sec and icmp_msgs_burst. This global limit is checked first, for ICMP types matching the icmp_ratemask.

Then the rate limit controlled by icmp_ratelimit is checked; this is a per-peer limit. So if a given remote host sends a flood of traffic which is all REJECT'd, the kernel will send back *at most* 1 ICMP Dest Unreachable message per second (assuming icmp_ratelimit is the default 1000ms). But it will only do that if the global limit hasn't been met.
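The knobs involved can be read straight from sysctl; a minimal sketch (the defaults I would expect are 1000 msgs/sec, a burst of 50, and a 1000 ms per-peer limit, but check on the affected node):

```
# global token bucket, applied to ICMP types covered by icmp_ratemask:
sysctl net.ipv4.icmp_msgs_per_sec net.ipv4.icmp_msgs_burst
# per-destination ("peer") limit, in milliseconds:
sysctl net.ipv4.icmp_ratelimit
```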


 567 void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 568 {
 569         struct iphdr *iph;
 570         int room;
 571         struct icmp_bxm icmp_param;
 572         struct rtable *rt = skb_rtable(skb_in);
 573         struct ipcm_cookie ipc;
 574         struct flowi4 fl4;
 575         __be32 saddr;
 576         u8  tos;
 577         u32 mark;
 578         struct net *net;
 579         struct sock *sk;
......
 657         /* Check global sysctl_icmp_msgs_per_sec ratelimit, unless
 658          * incoming dev is loopback.  If outgoing dev change to not be
 659          * loopback, then peer ratelimit still work (in icmpv4_xrlim_allow)
 660          */
 661         if (!(skb_in->dev && (skb_in->dev->flags&IFF_LOOPBACK)) &&
 662               !icmpv4_global_allow(net, type, code))
 663                 goto out_bh_enable;
......
 721         /* peer icmp_ratelimit */
 722         if (!icmpv4_xrlim_allow(net, rt, &fl4, type, code))
 723                 goto ende;
......
 739 ende:
 740         ip_rt_put(rt);
 741 out_unlock:
 742         icmp_xmit_unlock(sk);
 743 out_bh_enable:
 744         local_bh_enable();
 745 out:;
 746 }
 747 EXPORT_SYMBOL(icmp_send);


 298 static bool icmpv4_global_allow(struct net *net, int type, int code)
 299 {
 300         if (icmpv4_mask_allow(net, type, code))
 301                 return true;
 302 
 303         if (icmp_global_allow())
 304                 return true;
 305 
 306         return false;
 307 }

 282 static bool icmpv4_mask_allow(struct net *net, int type, int code)
 283 {
 284         if (type > NR_ICMP_TYPES)
 285                 return true;
 286 
 287         /* Don't limit PMTU discovery. */
 288         if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED)
 289                 return true;
 290 
 291         /* Limit if icmp type is enabled in ratemask. */
 292         if (!((1 << type) & net->ipv4.sysctl_icmp_ratemask))
 293                 return true;
 294 
 295         return false;
 296 }

 243 /**
 244  * icmp_global_allow - Are we allowed to send one more ICMP message ?
 245  *
 246  * Uses a token bucket to limit our ICMP messages to sysctl_icmp_msgs_per_sec.
 247  * Returns false if we reached the limit and can not send another packet.
 248  * Note: called with BH disabled
 249  */
 250 bool icmp_global_allow(void)
 251 {
 252         u32 credit, delta, incr = 0, now = (u32)jiffies;
 253         bool rc = false;
 254 
 255         /* Check if token bucket is empty and cannot be refilled
 256          * without taking the spinlock.
 257          */
 258         if (!icmp_global.credit) {
 259                 delta = min_t(u32, now - icmp_global.stamp, HZ);
 260                 if (delta < HZ / 50)
 261                         return false;
 262         }
 263 
 264         spin_lock(&icmp_global.lock);
 265         delta = min_t(u32, now - icmp_global.stamp, HZ);
 266         if (delta >= HZ / 50) {
 267                 incr = sysctl_icmp_msgs_per_sec * delta / HZ ;
 268                 if (incr)
 269                         icmp_global.stamp = now;
 270         }
 271         credit = min_t(u32, icmp_global.credit + incr, sysctl_icmp_msgs_burst);
 272         if (credit) {
 273                 credit--;
 274                 rc = true;
 275         }
 276         icmp_global.credit = credit;
 277         spin_unlock(&icmp_global.lock);
 278         return rc;
 279 }
 280 EXPORT_SYMBOL(icmp_global_allow);


 313 static bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 314                                struct flowi4 *fl4, int type, int code)
 315 {
 316         struct dst_entry *dst = &rt->dst;
 317         struct inet_peer *peer;
 318         bool rc = true;
 319 
 320         if (icmpv4_mask_allow(net, type, code))
 321                 goto out;
 322 
 323         /* No rate limit on loopback */
 324         if (dst->dev && (dst->dev->flags&IFF_LOOPBACK))
 325                 goto out;
 326 
 327         peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, 1);
 328         rc = inet_peer_xrlim_allow(peer, net->ipv4.sysctl_icmp_ratelimit);
 329         if (peer)
 330                 inet_putpeer(peer);
 331 out:
 332         return rc;
 333 }

519 /*
520  *      Check transmit rate limitation for given message.
521  *      The rate information is held in the inet_peer entries now.
522  *      This function is generic and could be used for other purposes
523  *      too. It uses a Token bucket filter as suggested by Alexey Kuznetsov.
524  *
525  *      Note that the same inet_peer fields are modified by functions in
526  *      route.c too, but these work for packet destinations while xrlim_allow
527  *      works for icmp destinations. This means the rate limiting information
528  *      for one "ip object" is shared - and these ICMPs are twice limited:
529  *      by source and by destination.
530  *
531  *      RFC 1812: 4.3.2.8 SHOULD be able to limit error message rate
532  *                        SHOULD allow setting of rate limits 
533  *
534  *      Shared between ICMPv4 and ICMPv6.
535  */
536 #define XRLIM_BURST_FACTOR 6
537 bool inet_peer_xrlim_allow(struct inet_peer *peer, int timeout)
538 {
539         unsigned long now, token;
540         bool rc = false;
541 
542         if (!peer)
543                 return true;
544 
545         token = peer->rate_tokens;
546         now = jiffies;
547         token += now - peer->rate_last;
548         peer->rate_last = now;
549         if (token > XRLIM_BURST_FACTOR * timeout)
550                 token = XRLIM_BURST_FACTOR * timeout;
551         if (token >= timeout) {
552                 token -= timeout;
553                 rc = true;
554         }
555         peer->rate_tokens = token;
556         return rc;
557 }
558 EXPORT_SYMBOL(inet_peer_xrlim_allow);



The kernel does not log any details or statistics about this activity; you'd have to do something like perf or stap to track specific instances when icmp_send() was "blocked" by one of these limits.
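One low-overhead way to do that, sketched as a variation on the tracefs approach from comment 1 (this assumes the helpers named in the code above are traceable, i.e. not inlined, on the running kernel):

```
cd /sys/kernel/tracing
# trace only the ICMP rate-limit helpers instead of every kernel function:
echo icmpv4_global_allow icmpv4_xrlim_allow inet_peer_xrlim_allow > set_ftrace_filter
echo function > current_tracer
cat trace_pipe
```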

Comment 4 Dan Winship 2020-05-07 12:41:16 UTC
(In reply to Patrick Talbert from comment #3)
> I don't quite understand the question.
> 
> The system has many iptables rules with a REJECT "icmp-port-unreachable"
> action and the concern is that when traffic matches these rules the system
> does not always transmit the expected ICMP packet back to the sender unless
> the icmp_ratelimit is removed or the default ratemask adjusted to allow Type
> 3 Dest Unreachable?

It's not that it "does not always transmit the expected ICMP packet". It's that it almost 100% reliably fails to transmit the expected ICMP packet, even when the network is otherwise almost completely idle and so no rate limiting should be occurring.

The behavior also seems to vary a lot between releases. It never worked right in RHEL 7; it started working right in RHEL 8, but is now broken again in RHEL 8.2, which has been bisected to a particular commit: https://bugzilla.redhat.com/show_bug.cgi?id=1781575#c37.

Comment 5 Patrick Talbert 2020-05-07 13:19:06 UTC
(In reply to Dan Winship from comment #4)
> (In reply to Patrick Talbert from comment #3)
> > I don't quite understand the question.
> > 
> > The system has many iptables rules with a REJECT "icmp-port-unreachable"
> > action and the concern is that when traffic matches these rules the system
> > does not always transmit the expected ICMP packet back to the sender unless
> > the icmp_ratelimit is removed or the default ratemask adjusted to allow Type
> > 3 Dest Unreachable?
> 
> It's not that it "does not always transmit the expected ICMP packet". It's
> that it almost 100% reliably fails to transmit the expected ICMP packet,
> even when the network is otherwise almost completely idle and so no rate
> limiting should be occurring.

Do you have steps to reproduce this that do not involve an entire Openshift deployment?


> 
> The behavior also seems to vary a lot between releases. It never worked
> right in RHEL 7; it started working right in RHEL 8, but is now broken again
> in RHEL 8.2, which has been bisected to a particular commit:
> https://bugzilla.redhat.com/show_bug.cgi?id=1781575#c37.

Ah that's great. Has anyone gone the next step to see which of the several commits from that BZ is causing the condition?

Comment 6 Alexander Constantinescu 2020-05-07 13:23:38 UTC
*** Bug 1829961 has been marked as a duplicate of this bug. ***

Comment 7 Alexander Constantinescu 2020-05-07 14:09:29 UTC
*** Bug 1831684 has been marked as a duplicate of this bug. ***

Comment 8 Patrick Talbert 2020-05-07 14:21:37 UTC
Staring at the commits from BZ1765639, I really do not immediately see how those would impact this issue.

There is a lot going on here so it's possible some other change had an unexpected knock-on effect:

$ git log --oneline kernel-4.18.0-151.el8..kernel-4.18.0-152.el8 net/ include/net/ | wc -l
462


But definitely if there is an existing reproducer then this is ripe for a further bisect.
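For the record, a hypothetical sketch of what that bisect could look like (the tags mirror the git log range above; each step needs a kernel build and a run of the reproducer on a test node):

```
git bisect start kernel-4.18.0-152.el8 kernel-4.18.0-151.el8 -- net/ include/net/
# at each step: build this kernel, boot the test node with it, run the reproducer,
# then mark the result and let git pick the next candidate commit:
git bisect good    # or: git bisect bad
```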

Comment 9 Colin Walters 2020-05-07 15:41:46 UTC
> Do you have steps to reproduce this that do not involve an entire Openshift deployment?

If it helps we can easily get anyone who needs it a cluster already spun up and a kubeconfig.

Also I'm happy to join a Bluejeans and help in realtime.

FWIW I have a setup now w/Systemtap to help me debug this and am capturing live notes here:

https://hackmd.io/B3IVIiQeTei6TFz0kXp6Ng

But I don't know this code (and only know rudimentary SystemTap).

Comment 10 Colin Walters 2020-05-07 17:56:44 UTC
I've updated the hackmd but just to post a checkpoint of my findings, using this systemtap script:

```
#! /usr/bin/env stap

probe begin {
    println("Watching ICMP, Ctrl-C to exit")
}

probe kernel.function("inet_peer_xrlim_allow") {
    if ($peer == NULL) {
        println("inet_peer_xrlim_allow(NULL peer)")
    }
}

probe kernel.function("inet_peer_xrlim_allow").return {
    printf("inet_peer_xrlim_allow last=%s tokens=%s ret=%s\n", $peer->rate_last$, $peer->rate_tokens$, $$return)
}

probe kernel.function("icmpv4_xrlim_allow").return {
    printf("icmpv4_xrlim_allow(type=%d code=%d) %s\n", $type, $code, $$return)
}

probe kernel.function("icmpv4_global_allow").return {
    if ($return == 0) {
        println("icmpv4_global_allow: denied")
    }
}

probe module("nf_reject_ipv4").function("nf_send_unreach") {
    printf("nf_send_unreach: %s\n", $skb_in->dev->name$)
}
```
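A hedged aside on the mechanics: the staprun invocations below load a pre-built module; a script like the one above could be compiled into such a module with something along these lines (the script and module names here are only illustrative):

```
stap -v -p4 -m inetpeer81 inetpeer.stp   # produces inetpeer81.ko in the current directory
staprun inetpeer81.ko
```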

Here's what I see from the RHEL 8.1 kernel and a successful test:

# staprun inetpeer81.ko 
Watching ICMP, Ctrl-C to exit
icmpv4_xrlim_allow(type=5 code=1) return=0x1
nf_send_unreach: "tun0"
inet_peer_xrlim_allow last=4361979389 tokens=1 ret=return=0x0
icmpv4_xrlim_allow(type=3 code=3) return=0x0
icmpv4_xrlim_allow(type=5 code=1) return=0x1
nf_send_unreach: "tun0"
inet_peer_xrlim_allow last=4361980416 tokens=6 ret=return=0x0
icmpv4_xrlim_allow(type=3 code=3) return=0x0
icmpv4_xrlim_allow(type=5 code=1) return=0x1
nf_send_unreach: "tun0"
inet_peer_xrlim_allow last=4361984622 tokens=7 ret=return=0x0
icmpv4_xrlim_allow(type=3 code=3) return=0x0
nf_send_unreach: "tun0"
inet_peer_xrlim_allow last=4361984628 tokens=13 ret=return=0x1
icmpv4_xrlim_allow(type=3 code=3) return=0x1

Using kernel-4.18.0-193.el8.x86_64 what I see is:

icmpv4_xrlim_allow(type=5 code=1) return=0x1
icmpv4_xrlim_allow(type=5 code=1) return=0x1
icmpv4_xrlim_allow(type=5 code=1) return=0x1
icmpv4_xrlim_allow(type=5 code=1) return=0x1
icmpv4_xrlim_allow(type=5 code=1) return=0x1
nf_send_unreach: "tun0"
inet_peer_xrlim_allow last=4347872279 tokens=0 ret=return=0x0
icmpv4_xrlim_allow(type=3 code=3) return=0x0
icmpv4_xrlim_allow(type=5 code=1) return=0x1
nf_send_unreach: "tun0"
inet_peer_xrlim_allow last=4347873280 tokens=0 ret=return=0x0
icmpv4_xrlim_allow(type=3 code=3) return=0x0
icmpv4_xrlim_allow(type=5 code=1) return=0x1
icmpv4_xrlim_allow(type=5 code=1) return=0x1
icmpv4_xrlim_allow(type=5 code=1) return=0x1
nf_send_unreach: "tun0"
inet_peer_xrlim_allow last=4347877380 tokens=0 ret=return=0x0
icmpv4_xrlim_allow(type=3 code=3) return=0x0
nf_send_unreach: "tun0"
inet_peer_xrlim_allow last=4347878400 tokens=0 ret=return=0x0
icmpv4_xrlim_allow(type=3 code=3) return=0x0
nf_send_unreach: "tun0"

Notice that tokens is always zero here.

Comment 11 Colin Walters 2020-05-07 18:44:08 UTC
Hum...so I finally read the comment

"Note that the same inet_peer fields are modified by functions in route.c too"

And looking there, I notice we lost an increment of rate_tokens:

diff -u kernel-4.18.0-117.el8/linux-4.18.0-117.el8_1_0.x86_64/net/ipv4/route.c kernel-4.18.0-193.el8/linux-4.18.0-193.el8_2_0.x86_64/net/ipv4/route.c
--- kernel-4.18.0-117.el8/linux-4.18.0-117.el8_1_0.x86_64/net/ipv4/route.c      2019-07-16 13:21:04.000000000 +0000
+++ kernel-4.18.0-193.el8/linux-4.18.0-193.el8_2_0.x86_64/net/ipv4/route.c      2020-03-27 13:57:18.000000000 +0000
@@ -908,16 +908,15 @@
        if (peer->rate_tokens == 0 ||
            time_after(jiffies,
                       (peer->rate_last +
-                       (ip_rt_redirect_load << peer->rate_tokens)))) {
+                       (ip_rt_redirect_load << peer->n_redirects)))) {
                __be32 gw = rt_nexthop(rt, ip_hdr(skb)->daddr);
 
                icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, gw);
                peer->rate_last = jiffies;
-               ++peer->rate_tokens;
                ++peer->n_redirects;

which seems to have come from https://github.com/torvalds/linux/commit/b406472b5ad79ede8d10077f0c8f05505ace8b6d

Not certain this is it but certainly the token values in the previous kernel are small and possibly were just incremented there.

Comment 12 Patrick Talbert 2020-05-08 07:16:47 UTC
(In reply to Colin Walters from comment #11)
> Hum...so I finally read the comment
> 
> "Note that the same inet_peer fields are modified by functions in route.c
> too"
> 
> And looking there, I notice we lost an increment of rate_tokens:
> 
> diff -u
> kernel-4.18.0-117.el8/linux-4.18.0-117.el8_1_0.x86_64/net/ipv4/route.c
> kernel-4.18.0-193.el8/linux-4.18.0-193.el8_2_0.x86_64/net/ipv4/route.c
> --- kernel-4.18.0-117.el8/linux-4.18.0-117.el8_1_0.x86_64/net/ipv4/route.c  
> 2019-07-16 13:21:04.000000000 +0000
> +++ kernel-4.18.0-193.el8/linux-4.18.0-193.el8_2_0.x86_64/net/ipv4/route.c  
> 2020-03-27 13:57:18.000000000 +0000
> @@ -908,16 +908,15 @@
>         if (peer->rate_tokens == 0 ||
>             time_after(jiffies,
>                        (peer->rate_last +
> -                       (ip_rt_redirect_load << peer->rate_tokens)))) {
> +                       (ip_rt_redirect_load << peer->n_redirects)))) {
>                 __be32 gw = rt_nexthop(rt, ip_hdr(skb)->daddr);
>  
>                 icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, gw);
>                 peer->rate_last = jiffies;
> -               ++peer->rate_tokens;
>                 ++peer->n_redirects;
> 
> which seems to have come from
> https://github.com/torvalds/linux/commit/
> b406472b5ad79ede8d10077f0c8f05505ace8b6d
> 
> Not certain this is it but certainly the token values in the previous kernel
> are small and possibly were just incremented there.

Nice STAP. Thank you for looking at this.

I saw that commit as well, but it only touches the ip_rt_send_redirect() function. Is this environment generating a lot of redirects? A simple netstat -s would tell you. And/or stap ip_rt_send_redirect() to see if it is being called.
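A concrete (hedged) form of that netstat check; counter names can differ slightly between tool versions:

```
# outgoing redirects show up in the ICMP counters:
netstat -s | grep -i redirect
# or with iproute2's nstat (IcmpOutRedirects etc.):
nstat -az | grep -i redirect
```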


I will make kernels with and without 5f1b3b571c08 and post links here when they are finished.

Comment 13 Patrick Talbert 2020-05-08 07:28:49 UTC
This is a test build of the RHEL 8.2 GA kernel-4.18.0-193.el8 with commit 5f1b3b571c08 reverted:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28447004

Comment 14 Paolo Abeni 2020-05-08 10:20:42 UTC
I think the root cause is the change below. (In reply to Colin Walters from comment #11)
> kernel-4.18.0-117.el8/linux-4.18.0-117.el8_1_0.x86_64/net/ipv4/route.c
> kernel-4.18.0-193.el8/linux-4.18.0-193.el8_2_0.x86_64/net/ipv4/route.c
> --- kernel-4.18.0-117.el8/linux-4.18.0-117.el8_1_0.x86_64/net/ipv4/route.c  
> 2019-07-16 13:21:04.000000000 +0000
> +++ kernel-4.18.0-193.el8/linux-4.18.0-193.el8_2_0.x86_64/net/ipv4/route.c  
> 2020-03-27 13:57:18.000000000 +0000
> @@ -908,16 +908,15 @@
>         if (peer->rate_tokens == 0 ||
>             time_after(jiffies,
>                        (peer->rate_last +
> -                       (ip_rt_redirect_load << peer->rate_tokens)))) {
> +                       (ip_rt_redirect_load << peer->n_redirects)))) {
>                 __be32 gw = rt_nexthop(rt, ip_hdr(skb)->daddr);
>  
>                 icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, gw);
>                 peer->rate_last = jiffies;
> -               ++peer->rate_tokens;
>                 ++peer->n_redirects;
> 
> which seems to have come from
> https://github.com/torvalds/linux/commit/
> b406472b5ad79ede8d10077f0c8f05505ace8b6d

I think the missing increment here is really the root cause. If the redirect rate is high enough - and looking at the STAP traces there are quite a few redirects - the above test will always succeed: 'rate_tokens' starts at 0, the redirects refresh the 'rate_last' value at quite a high frequency while keeping 'rate_tokens' at 0, and inet_peer_xrlim_allow() never gets a chance to succeed, because the delta between 'now' and 'rate_last' stays small (well below the icmp_ratelimit timeout), so the token bucket for Destination Unreachable never fills.

Reverting the above commit will bring back bz#1753092 - which is likely less critical. I think:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c28ce1b84dd2..9fc9297d4080 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -905,7 +905,7 @@ void ip_rt_send_redirect(struct sk_buff *skb)
        /* Check for load limit; set rate_last to the latest sent
         * redirect.
         */
-       if (peer->rate_tokens == 0 ||
+       if (peer->n_redirects == 0 ||
            time_after(jiffies,
                       (peer->rate_last +
                        (ip_rt_redirect_load << peer->n_redirects)))) {


will address the issue in a possibly safer way. A scratch build with the above change will soon be available at:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28448170

Could you please give the above a spin in your testbed?

Comment 16 Guillaume Nault 2020-05-08 15:16:55 UTC
(In reply to Dan Winship from comment #4)
> (In reply to Patrick Talbert from comment #3)
> > I don't quite understand the question.
> > 
> > The system has many iptables rules with a REJECT "icmp-port-unreachable"
> > action and the concern is that when traffic matches these rules the system
> > does not always transmit the expected ICMP packet back to the sender unless
> > the icmp_ratelimit is removed or the default ratemask adjusted to allow Type
> > 3 Dest Unreachable?
> 
> It's not that it "does not always transmit the expected ICMP packet". It's
> that it almost 100% reliably fails to transmit the expected ICMP packet,
> even when the network is otherwise almost completely idle and so no rate
> limiting should be occurring.
> 
According to comment 10, that's not true. ICMP Redirects are sent and that likely is the problem.

> The behavior also seems to vary a lot between releases. It never worked
> right in RHEL 7; it started working right in RHEL 8, but is now broken again
> in RHEL 8.2,
> 
As Paolo found out in comment 14, the problem likely comes from the fact that the router has to send ICMP Redirect messages. Those are rate limited, which is also the case for ICMP Destination Unreachable messages.
The problem is that one can influence the other, and those interactions have changed over time. That's probably what made you think that "It never worked right in RHEL 7". Just start sending packets to the right gateway and you should start seeing the Destination Unreachable messages you're expecting.

> which has been bisected to a particular commit:
> https://bugzilla.redhat.com/show_bug.cgi?id=1781575#c37.
>
I'm sorry, but the message pointed to by this link is wrong. It points to a completely unrelated bz, just because that bz has "icmp" in one of its commit messages.
At least it got me to look at this problem...

To make it short, ICMP Redirects might rate limit your ICMP Destination Unreachable messages. This has always been the case and is still the case even with the patch of comment 14 (just to a lesser extent). That patch is probably a step in the right direction, but be prepared for more evolutions in this area.

ICMP Redirect messages are a common symptom of a bad network design or bad configuration on a router or an end host.
Someone should figure out why the peer doesn't use the right gateway. Assuming that's fixable, and that you can do without ICMP Redirects, then your test should work with any kernel version.
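A couple of hedged starting points for that investigation (illustrative commands and addresses, not taken from this cluster):

```
# on the peer that keeps triggering redirects: which next hop does it actually use?
ip route get 172.30.0.1          # illustrative destination; substitute the rejected service IP
# on the node acting as the router: is it configured to emit redirects at all?
sysctl net.ipv4.conf.all.send_redirects net.ipv4.conf.default.send_redirects
# if the SDN can live without redirects, they can be suppressed, e.g.:
# sysctl -w net.ipv4.conf.all.send_redirects=0
```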

That doesn't prevent improving the rate limiting algorithms in the kernel, but let's do that for good reasons.

Comment 18 Colin Walters 2020-05-08 20:04:05 UTC
This bug seems again to have the "redhat" private flag - anyone mind if I de-restrict it?

> patch from comment#14 posted upstream:

Presuming that all goes through, I'm trying to think about next steps here.  

For OpenShift/RHEL CoreOS we have a rule that we can "cherry pick" things from RHEL but only after they've been attached to an errata: https://gitlab.cee.redhat.com/coreos/redhat-coreos/blob/master/README.md#overridingusing-specific-package-versions

I think then what we'd need to decide is whether to queue this for an 8.2.X update and cherry-pick it along with the rest of RHEL 8.2, or whether it would need to wait for 8.3.

The severity of this issue is a bit tricky; I personally wouldn't call it *critical* but we have the Kubernetes test here for a reason, and there are customer cases around this.

> To make it short, ICMP Redirects might rate limit your ICMP Destination Unreachable messages. This has always been the case and is still the case even with the patch of comment 14 (just to a lesser extent). That patch is probably a step in the right direction, but be prepared for more evolutions in this area.

Fair enough, it seems not unlikely to me that something needs to be fixed in the OpenShift SDN too.  But I know very little about that and will let one of those engineers comment.

Comment 19 Paolo Abeni 2020-05-08 20:53:42 UTC
(In reply to Colin Walters from comment #18)
> I think then what we'd need to decide is whether to queue this for for an
> 8.2.X update and cherry-pick it along with the rest of RHEL 8.2, or would it
> need to wait for 8.3?

I think we can start filing the 8.3 clone for this bz, and I think this could deserve a z-stream backport, so I would ask for the z-stream flag. Then OpenShift may pick whatever course of action is more suitable, WDYT?

Comment 20 Colin Walters 2020-05-11 12:52:00 UTC
> I think we can start filing the 8.3 clone for this bz, and I think this could deserve a z-stream backport, so I would ask for the z-stream flag. Then OpenShift may pick whatever course of action is more suitable, WDYT?

Sounds good - thanks for taking care of this.  I think we'll start a discussion on next steps probably on aos-devel@ after we have the official builds going and queued into Errata Tool.

Comment 21 Dan Winship 2020-05-11 13:33:08 UTC
(In reply to Colin Walters from comment #18)
> > To make it short, ICMP Redirects might rate limit your ICMP Destination Unreachable messages. This has always been the case and is still the case even with the patch of comment 14 (just to a lesser extent). That patch is probably a step in the right direction, but be prepared for more evolutions in this area.
> 
> Fair enough, it seems not unlikely to me that something needs to be fixed in
> the OpenShift SDN too.  But I know very little about that and will let one
> of those engineers comment.

Yes, it sounds like it's doing something wrong and we should figure out what. It's annoying that seemingly the only sign that it's doing something wrong is that something unrelated fails. :-/

Comment 36 Paolo Abeni 2020-05-26 13:37:50 UTC
Can QA please provide an ack here? Beyond the integration test in the OpenShift scenario, there is a limited-scope reproducer attached to the cloned RHEL 8 bz#1834184.

Comment 37 zhaozhanqi 2020-05-27 02:30:43 UTC
Hi, sorry, I'm OpenShift QE; I think weliang has already done the testing in comment 34 on the OpenShift side.

I am assigning this bug to kernel QE. Please let me know if you dislike this. Thanks.

Comment 38 Jan Stancek 2020-06-04 06:41:39 UTC
Patch(es) committed on kernel-3.10.0-1148.el7

Comment 42 Jianlin Shi 2020-06-05 00:25:40 UTC
Verified on 3.10.0-1148:

:: [ 20:23:47 ] :: [  BEGIN   ] :: Running 'tcpdump -r ping_plus_redir.pcap'
reading from file ping_plus_redir.pcap, link-type EN10MB (Ethernet)              
20:23:31.666882 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:31.666944 IP 192.168.1.101 > 192.168.1.2: ICMP redirect 192.168.2.2 to host 192.168.1.102, length 92
20:23:31.666956 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:32.852687 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:32.852714 IP 192.168.1.101 > 192.168.1.2: ICMP redirect 192.168.2.2 to host 192.168.1.102, length 92
20:23:32.852716 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:34.043654 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64                                                                                                                 
20:23:34.043678 IP 192.168.1.101 > 192.168.1.2: ICMP redirect 192.168.2.2 to host 192.168.1.102, length 92                                                                                                 
20:23:34.043680 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64                                                                                                                 
20:23:35.230761 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64                                                                                                                 
20:23:35.230787 IP 192.168.1.101 > 192.168.1.2: ICMP redirect 192.168.2.2 to host 192.168.1.102, length 92                                                                                                 
20:23:35.230790 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64                                                                                                                 
20:23:36.419676 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64                                                                                                                 
20:23:36.419703 IP 192.168.1.101 > 192.168.1.2: ICMP redirect 192.168.2.2 to host 192.168.1.102, length 92
20:23:36.419705 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:37.606692 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:37.606718 IP 192.168.1.101 > 192.168.1.2: ICMP redirect 192.168.2.2 to host 192.168.1.102, length 92
20:23:37.606720 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:38.791720 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:38.791737 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:39.978665 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:39.978690 IP 192.168.1.101 > 192.168.1.2: ICMP redirect 192.168.2.2 to host 192.168.1.102, length 92
20:23:39.978692 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:41.163679 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:41.163696 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:42.347693 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
20:23:42.347710 IP 192.168.1.2 > 192.168.2.2: ICMP echo request, id 5621, seq 1, length 64
:: [ 20:23:47 ] :: [   PASS   ] :: Command 'tcpdump -r ping_plus_redir.pcap' (Expected 0, got 0)


[root@kvm-06-guest02 bz1834184_redirect_rate]# uname -a
Linux kvm-06-guest02.hv2.lab.eng.bos.redhat.com 3.10.0-1148.el7.x86_64 #1 SMP Wed Jun 3 15:04:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Comment 43 Russell Teague 2020-07-28 15:08:23 UTC
Was there a reason this was not added to the upcoming 7.8.z errata? https://errata.devel.redhat.com/advisory/56015

Comment 44 Jan Stancek 2020-07-28 15:31:31 UTC
(In reply to Russell Teague from comment #43)
> Was there a reason this was not added to the upcoming 7.8.z errata?
> https://errata.devel.redhat.com/advisory/56015

There doesn't appear to be any 7.8.z BZ created/approved for it.
zstream+ and ZTR are the RHEL 8 way of requesting zstream. I'm adding the 7.8.z? flag; once PMApproved or GSSApproved is added, PM tooling should create the 7.8.z clone.

Comment 45 Russell Teague 2020-07-28 15:47:32 UTC
I thought this was the 7.8.z BZ based on the Version field, however this bug has the 7.9 errata attached.

Comment 54 errata-xmlrpc 2020-09-29 21:14:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4060

Comment 55 Red Hat Bugzilla 2023-09-14 05:57:25 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days