Bug 2007581 - Too many haproxy processes in default-router pod causing high load average after upgrade from v4.8.3 to v4.8.10
Summary: Too many haproxy processes in default-router pod causing high load average after upgrade from v4.8.3 to v4.8.10
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Andrew McDermott
QA Contact: Arvind iyengar
URL:
Whiteboard: EmergencyRequest
Depends On:
Blocks: 2015829
Reported: 2021-09-24 10:07 UTC by Jitendra Pradhan
Modified: 2022-08-04 22:35 UTC
CC List: 16 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:38:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 663 0 None Merged Bug 2007581: Change default balancing algorithm to "leastconn" 2022-01-28 17:47:52 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:38:46 UTC

Comment 2 Michal Fojtik 2021-09-24 10:33:50 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold.
Please have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree that this BZ is urgent. Keep in mind that urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 4 Andrew McDermott 2021-09-24 11:24:21 UTC
Another instance: https://bugzilla.redhat.com/show_bug.cgi?id=2006548.

Is the customer using the hard-stop-after annotation and, if so, with
what value?
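
One way to check (illustrative; the annotation can be set on either
the ingresscontroller or the cluster ingress config):

    $ oc -n openshift-ingress-operator get ingresscontroller/default -o yaml | grep -i hard-stop-after
    $ oc get ingresses.config/cluster -o yaml | grep -i hard-stop-after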

I am currently investigating 2006548 and may mark this as a duplicate
later in the day if the behaviour of OCP 4.6 and 4.8 are similar,
though I do note the description explicitly talks about a behavioural
change from 4.8.3 to 4.8.10.

Comment 6 Andrew McDermott 2021-09-24 12:03:44 UTC
I am currently running:

$ oc version
Client Version: 4.8.0
Server Version: 4.8.11
Kubernetes Version: v1.21.1+9807387

I have a version of the router that is not currently bound to
privileged port numbers - this is so I can run lsof in the router pod
and identify the established connections.

My cluster is quiet. 
NAME                              READY   STATUS    RESTARTS   AGE
router-copy-c99657fc9-8r84z       2/2     Running   0          23m
router-default-6d77dbc8b7-n5ljl   2/2     Running   0          42m

$ oc rsh router-copy-c99657fc9-8r84z
Defaulted container "router" out of: router, logs
sh-4.4$ pgrep -a haproxy
57 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 20

A single haproxy instance with the following established TCP connections:

sh-4.4$ lsof -L -n -p 57 | grep TCP
lsof: WARNING: can't stat() cgroup file system /sys/fs/cgroup/systemd
      Output information may be incomplete.
lsof: WARNING: can't stat() cgroup file system /sys/fs/cgroup/rdma
      Output information may be incomplete.
[snipped repeated WARNINGs]

haproxy  57 1000620000    5u     IPv4            2631687      0t0       TCP *:webcache (LISTEN)
haproxy  57 1000620000    6u     IPv4            2631688      0t0       TCP *:pcsync-https (LISTEN)
haproxy  57 1000620000   27u     IPv4            2994165      0t0       TCP 10.128.2.1:40678->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   29u     IPv4            2922580      0t0       TCP 10.128.2.1:47318->10.130.0.4:pcsync-https (ESTABLISHED)

If I now connect to the console via Chrome and wait ~30s, I see:

sh-4.4$ lsof -L -n -p 57 | grep TCP

lsof: WARNING: can't stat() cgroup file system /sys/fs/cgroup/systemd
      Output information may be incomplete.
lsof: WARNING: can't stat() cgroup file system /sys/fs/cgroup/rdma
      Output information may be incomplete.
[snipped repeated WARNINGs]

haproxy  57 1000620000    5u     IPv4            2631687      0t0       TCP *:webcache (LISTEN)
haproxy  57 1000620000    6u     IPv4            2631688      0t0       TCP *:pcsync-https (LISTEN)
haproxy  57 1000620000   23u     IPv4            3025322      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40434 (ESTABLISHED)
haproxy  57 1000620000   27u     IPv4            3024525      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40866 (ESTABLISHED)
haproxy  57 1000620000   28u     IPv4            3029018      0t0       TCP 10.128.2.1:42546->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   29u     IPv4            3024343      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40446 (ESTABLISHED)
haproxy  57 1000620000   30u     IPv4            3024345      0t0       TCP 10.128.2.1:51296->10.130.0.23:sun-sr-https (ESTABLISHED)
haproxy  57 1000620000   31u     IPv4            3025342      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40518 (ESTABLISHED)
haproxy  57 1000620000   34u     IPv4            3024358      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40522 (ESTABLISHED)
haproxy  57 1000620000   35u     IPv4            3024359      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40524 (ESTABLISHED)
haproxy  57 1000620000   37u     IPv4            3024362      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40526 (ESTABLISHED)
haproxy  57 1000620000   43u     IPv4            3026348      0t0       TCP 10.128.2.1:42556->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   44u     IPv4            3024385      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40536 (ESTABLISHED)
haproxy  57 1000620000   45u     IPv4            3024443      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40610 (ESTABLISHED)
haproxy  57 1000620000   46u     IPv4            3024472      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40654 (ESTABLISHED)
haproxy  57 1000620000   49u     IPv4            3024390      0t0       TCP 10.128.2.1:42404->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   50u     IPv4            3022806      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40598 (ESTABLISHED)
haproxy  57 1000620000   53u     IPv4            3029021      0t0       TCP 10.128.2.1:42558->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   54u     IPv4            3024432      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40602 (ESTABLISHED)
haproxy  57 1000620000   57u     IPv4            3024437      0t0       TCP 10.128.2.1:42412->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   60u     IPv4            3025415      0t0       TCP 10.128.2.1:42414->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   61u     IPv4            3024450      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40638 (ESTABLISHED)
haproxy  57 1000620000   64u     IPv4            3024454      0t0       TCP 10.128.2.1:42420->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   65u     IPv4            3024459      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40648 (ESTABLISHED)
haproxy  57 1000620000   68u     IPv4            3024462      0t0       TCP 10.128.2.1:42422->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   71u     IPv4            3024477      0t0       TCP 10.128.2.1:42424->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   72u     IPv4            3024478      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40656 (ESTABLISHED)
haproxy  57 1000620000   75u     IPv4            3025418      0t0       TCP 10.128.2.1:42426->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   76u     IPv4            3029608      0t0       TCP 10.128.2.1:42698->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   79u     IPv4            3025449      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40786 (ESTABLISHED)
haproxy  57 1000620000   81u     IPv4            3024501      0t0       TCP 10.128.2.1:42446->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   82u     IPv4            3025458      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40796 (ESTABLISHED)
haproxy  57 1000620000   85u     IPv4            3024505      0t0       TCP 10.128.2.1:42450->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   86u     IPv4            3025465      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40820 (ESTABLISHED)
haproxy  57 1000620000   89u     IPv4            3024514      0t0       TCP 10.128.2.1:42454->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   92u     IPv4            3024530      0t0       TCP 10.128.2.1:42466->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   93u     IPv4            3024541      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40898 (ESTABLISHED)
haproxy  57 1000620000   96u     IPv4            3024546      0t0       TCP 10.128.2.1:42468->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   97u     IPv4            3024555      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40906 (ESTABLISHED)
haproxy  57 1000620000  100u     IPv4            3024560      0t0       TCP 10.128.2.1:42470->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000  101u     IPv4            3024561      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40908 (ESTABLISHED)
haproxy  57 1000620000  104u     IPv4            3024566      0t0       TCP 10.128.2.1:42472->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000  105u     IPv4            3024577      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40944 (ESTABLISHED)
haproxy  57 1000620000  108u     IPv4            3024582      0t0       TCP 10.128.2.1:42476->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000  109u     IPv4            3024583      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40960 (ESTABLISHED)
haproxy  57 1000620000  112u     IPv4            3024588      0t0       TCP 10.128.2.1:42478->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000  113u     IPv4            3024594      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:41000 (ESTABLISHED)
haproxy  57 1000620000  116u     IPv4            3024599      0t0       TCP 10.128.2.1:42480->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000  117u     IPv4            3024606      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:41024 (ESTABLISHED)
haproxy  57 1000620000  120u     IPv4            3024611      0t0       TCP 10.128.2.1:42486->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000  124u     IPv4            3024686      0t0       TCP 10.128.2.1:53512->10.130.0.4:pcsync-https (ESTABLISHED)

Is this number of connections expected for a single connection via my browser?

If I close the browser tab and repeatedly run the following, we see
the number of established connections diminish (to almost zero):

sh-4.4$ lsof -L -n -p 57 | grep TCP
haproxy  57 1000620000    5u     IPv4            2631687      0t0       TCP *:webcache (LISTEN)
haproxy  57 1000620000    6u     IPv4            2631688      0t0       TCP *:pcsync-https (LISTEN)
haproxy  57 1000620000   23u     IPv4            3033253      0t0       TCP 10.128.2.1:42912->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   25u     IPv4            3047195      0t0       TCP 10.128.2.1:43660->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   26u     IPv4            3046032      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:56424 (ESTABLISHED)
haproxy  57 1000620000   28u     IPv4            3029018      0t0       TCP 10.128.2.1:42546->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   31u     IPv4            3025342      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40518 (ESTABLISHED)
haproxy  57 1000620000   34u     IPv4            3024358      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40522 (ESTABLISHED)
haproxy  57 1000620000   35u     IPv4            3024359      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40524 (ESTABLISHED)
haproxy  57 1000620000   37u     IPv4            3024362      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:40526 (ESTABLISHED)
haproxy  57 1000620000   43u     IPv4            3048643      0t0       TCP 10.128.2.1:43744->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   50u     IPv4            3049603      0t0       TCP 10.128.2.1:43742->10.129.0.22:pcsync-https (ESTABLISHED)
haproxy  57 1000620000   53u     IPv4            3048639      0t0       TCP 192.168.7.181:pcsync-https->192.168.7.164:57400 (ESTABLISHED)
haproxy  57 1000620000   76u     IPv4            3029608      0t0       TCP 10.128.2.1:42698->10.129.0.22:pcsync-https (ESTABLISHED)


Approximately 1 minute later (maybe less) there are zero established connections:

sh-4.4$ lsof -L -n -p 57 | grep TCP
haproxy  57 1000620000    5u     IPv4            2631687      0t0       TCP *:webcache (LISTEN)
haproxy  57 1000620000    6u     IPv4            2631688      0t0       TCP *:pcsync-https (LISTEN)

So it would seem that most, if not all, of those connections are
associated with the console. Multiply that up by many users, add the
known issues we have with websocket connections and haproxy reloads,
and we end up with many outstanding haproxy processes that cannot be
terminated.
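
A rough way to watch for the leaked-process symptom on a live cluster
(illustrative; substitute a real router pod name):

    $ oc -n openshift-ingress get pods
    $ oc -n openshift-ingress exec router-default-<pod-suffix> -c router -- pgrep -c haproxy

A count that keeps growing across reloads, rather than settling back
down, would point at old haproxy processes being held open by
long-lived connections.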

Next steps: I will stand up OCP v4.8.3 and repeat the experiment. We
have not bumped the version of haproxy in any .z release of 4.8.

Comment 22 Andrew McDermott 2021-09-30 13:33:45 UTC
If adjusting the router's reload interval helps operationally then the
next steps are to revert that change and apply a new change to
understand why we are reloading so often.

Let's change the router's default logging level from 2 to 5; the goal
is to capture the additional debug output that openshift-router can
emit and, with some post-processing, understand which changes to
routes, endpoints, or services are necessitating reloads.

Steps:

  - Set a CVO override so CVO stops managing the ingress operator.
  $ oc patch clusterversions/version --type=json --patch='[{"op":"add","path":"/spec/overrides","value":[{"kind":"Deployment","group":"apps/v1","name":"ingress-operator","namespace":"openshift-ingress-operator","unmanaged":true}]}]'

  - Scale down the ingress-operator so that we can alter the router-default deployment
  $ oc scale --replicas 0 -n openshift-ingress-operator deployments ingress-operator

  - patch the logging level from 2->5 for the router container
  $ oc patch deployment/router-default -n openshift-ingress --patch='{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"router"},{"name":"logs"}],"containers":[{"command":["/usr/bin/openshift-router","--v=5"],"name":"router"}]}}}}'

This will now log a significant amount.

If we could collect the output from all the router pods (i.e., the
"router" container) then we can post-process the information to try
and understand what is driving frequent reloads.

It would also be helpful if access logging is enabled. Please attach
both the 'router' and 'logs' container output from all router pods.
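
For example, one way to capture both containers from every router pod
(a sketch; adjust the pod selection to your environment):

    $ for pod in $(oc -n openshift-ingress get pods -o name | grep router-default); do
          oc -n openshift-ingress logs "$pod" -c router > "${pod##*/}-router.log"
          oc -n openshift-ingress logs "$pod" -c logs   > "${pod##*/}-logs.log"
      done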

If we could let this run at this log level for, say, 15 mins (or more)
then that would be helpful. If you've since altered the
hard-stop-after period then let's collect at log level 5 for the
hard-stop-after duration plus an additional 5 minutes. Equally, we
should not leave this patch enabled indefinitely as it may affect
overall performance.
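
Once the data is collected, the patch can be rolled back by restoring
operator management (a sketch; this assumes the overrides list
contains only the entry added above):

    $ oc scale --replicas 1 -n openshift-ingress-operator deployments ingress-operator
    $ oc patch clusterversions/version --type=json --patch='[{"op":"remove","path":"/spec/overrides"}]'

With the ingress operator running again, it will reconcile
router-default back to the default log level.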

Comment 35 Jitendra Pradhan 2021-10-15 15:44:58 UTC
Hi Andrew,

In another case (https://access.redhat.com/support/cases/internal/#/case/03058396) of mine, one of my customers is also facing the same issue, and the OCP version is likewise 4.8.

Comment 36 Miciah Dashiel Butler Masters 2021-10-15 15:47:53 UTC
(In reply to Jitendra Pradhan from comment #35)
> In another case
> (https://access.redhat.com/support/cases/internal/#/case/03058396) of mine,
> one of my customers is also facing the same issue, and the OCP version is
> likewise 4.8.

Please try the workaround:

    oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"unsupportedConfigOverrides":{"loadBalancingAlgorithm":"leastconn"}}}'

Let us know whether that works for this customer.
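
One way to confirm the override propagated (illustrative; these are
the same checks used later in this bug to verify the change):

    $ oc -n openshift-ingress get deployment router-default -o yaml | grep -A1 ROUTER_LOAD_BALANCE_ALGORITHM
    $ oc -n openshift-ingress rsh deploy/router-default grep 'balance ' /var/lib/haproxy/conf/haproxy.config | sort | uniq -c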

Comment 37 Miciah Dashiel Butler Masters 2021-10-15 15:50:17 UTC
If possible, could you also provide details about the number of HAProxy processes, the memory per HAProxy process, and the haproxy.config file from one of the router pods for this new case?
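
For example (illustrative; substitute a real router pod name, and note
these commands assume procps tools are present in the router image):

    $ oc -n openshift-ingress exec <router-pod> -c router -- pgrep -c haproxy
    $ oc -n openshift-ingress exec <router-pod> -c router -- ps -o pid,rss,args -C haproxy
    $ oc -n openshift-ingress exec <router-pod> -c router -- cat /var/lib/haproxy/conf/haproxy.config

The first gives the process count, the RSS column gives resident
memory per process, and the last dumps the generated haproxy.config.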

Comment 40 Andrew McDermott 2021-10-19 12:07:16 UTC
I built haproxy-2.2 with https://pagure.io/glibc-malloc-trace-utils

In my sample config I have 4004 backends:

    $ grep -c -e  '^backend ' haproxy.cfg
    4004

Run with the configuration but don't fork:

    $ ../../haproxy-2.2/haproxy -f ./haproxy.cfg -d -V

Inspect the allocations made:

    $ ~/glibc-malloc-trace-utils/trace_allocs /tmp/mtrace.mtr.27404 | sort -n | uniq -c > /tmp/random

I then swapped "balance random" for "balance leastconn"

    $ ../../haproxy-2.2/haproxy -f ./haproxy.cfg -d -V
    $ ~/glibc-malloc-trace-utils/trace_allocs /tmp/mtrace.mtr.27727 | sort -n | uniq -c > /tmp/leastconn

Looking at the diff (fully expanded later):

    $ diff -y -w132 /tmp/leastconn /tmp/random
      1 160112							      1 160112
							      >	   4004 196608
      2 640448							      2 640448

All other allocations appear to be identical apart from 4004 at 196608
bytes. Those 4004 allocations come from chash_init_server_tree(),
which appears to be called only when "balance random" is chosen:

void chash_init_server_tree(struct proxy *p)
{
	struct server *srv;
	struct eb_root init_head = EB_ROOT;
	int node;

	p->lbprm.set_server_status_up = chash_set_server_status_up;
	p->lbprm.set_server_status_down = chash_set_server_status_down;
	p->lbprm.update_server_eweight = chash_update_server_weight;
	p->lbprm.server_take_conn = NULL;
	p->lbprm.server_drop_conn = NULL;
	p->lbprm.wdiv = BE_WEIGHT_SCALE;

	for (srv = p->srv; srv; srv = srv->next) {
		srv->next_eweight = (srv->uweight * p->lbprm.wdiv + p->lbprm.wmult - 1) / p->lbprm.wmult;
		srv_lb_commit_status(srv);
	}

	recount_servers(p);
	update_backend_weight(p);

	p->lbprm.chash.act = init_head;
	p->lbprm.chash.bck = init_head;
	p->lbprm.chash.last = NULL;

	/* queue active and backup servers in two distinct groups */
	for (srv = p->srv; srv; srv = srv->next) {
		srv->lb_tree = (srv->flags & SRV_F_BACKUP) ? &p->lbprm.chash.bck : &p->lbprm.chash.act;
		srv->lb_nodes_tot = srv->uweight * BE_WEIGHT_SCALE;

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The weight in our backends is 256 multiplied by BE_WEIGHT_SCALE (i.e., 16).
The huge memory growth with "random" versus "leastconn" comes from this single
allocation point as it is called for each backend.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		srv->lb_nodes_now = 0;
		srv->lb_nodes = calloc(srv->lb_nodes_tot, sizeof(struct tree_occ));
		for (node = 0; node < srv->lb_nodes_tot; node++) {
			srv->lb_nodes[node].server = srv;
			srv->lb_nodes[node].node.key = full_hash(srv->puid * SRV_EWGHT_RANGE + node);
		}
		if (srv_currently_usable(srv))
			chash_queue_dequeue_srv(srv);
	}
}
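
Back-of-the-envelope scale for that allocation (my arithmetic, not
taken from the trace itself): each backend gets 256 (weight) * 16
(BE_WEIGHT_SCALE) = 4096 tree_occ entries, and 196608 bytes / 4096
entries = 48 bytes per entry. Across 4004 backends that is roughly
4004 * 196608 bytes, i.e. on the order of 750 MiB of additional
allocation per haproxy process when "balance random" is in effect.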

-diff-------------------------------

  10055 1							  10055 1
     62 2							     62 2
     32 3							     32 3
   4018 4							   4018 4
     12 5							     12 5
     16 6							     16 6
     18 7							     18 7
     52 8							     52 8
     15 9							     15 9
      7 10							      7 10
     13 11							     13 11
     60 12							     60 12
     23 13							     23 13
  12070 14							  12070 14
     15 15							     15 15
  26198 16							  26198 16
     13 17							     13 17
   4011 18							   4011 18
      5 19							      5 19
     29 20							     29 20
     17 21							     17 21
      3 22							      3 22
     32 23							     32 23
   1257 24							   1257 24
    105 25							    105 25
    900 26							    900 26
   2007 27							   2007 27
     10 28							     10 28
      9 29							      9 29
     90 30							     90 30
    901 31							    901 31
   2825 32							   2825 32
   8026 33							   8026 33
    122 34							    122 34
   1170 35							   1170 35
   2713 36							   2713 36
     21 37							     21 37
    189 38							    189 38
   1923 39							   1923 39
   1465 40							   1465 40
   2701 41							   2701 41
      6 42							      6 42
      7 44							      7 44
      3 46							      3 46
     20 47							     20 47
  10271 48							  10271 48
   1804 49							   1804 49
     11 50							     11 50
      1 51							      1 51
      6 52							      6 52
      3 54							      3 54
  10153 56							  10153 56
      5 60							      5 60
      2 63							      2 63
     30 64							     30 64
      1 66							      1 66
      2 68							      2 68
      3 70							      3 70
   4053 72							   4053 72
     18 73							     18 73
    182 74							    182 74
   1804 75							   1804 75
      4 76							      4 76
     20 78							     20 78
    184 79							    184 79
  11830 80							  11830 80
     10 81							     10 81
      1 84							      1 84
      1 86							      1 86
      1 87							      1 87
     44 88							     44 88
      4 91							      4 91
      9 93							      9 93
     94 94							     94 94
    900 95							    900 95
  12057 96							  12057 96
      1 97							      1 97
     11 98							     11 98
     91 99							     91 99
    900 100							    900 100
   2002 101							   2002 101
     48 104							     48 104
      1 106							      1 106
      3 107							      3 107
  10024 108							  10024 108
     24 112							     24 112
      4 115							      4 115
      4 119							      4 119
     70 120							     70 120
      1 122							      1 122
      1 124							      1 124
      1 125							      1 125
      1 126							      1 126
     24 128							     24 128
      1 132							      1 132
      1 138							      1 138
      1 142							      1 142
     11 151							     11 151
  10059 152							  10059 152
   4013 160							   4013 160
      1 166							      1 166
     11 168							     11 168
      1 170							      1 170
     54 176							     54 176
      1 180							      1 180
      4 199							      4 199
      4 200							      4 200
      1 204							      1 204
      1 205							      1 205
      1 213							      1 213
     20 224							     20 224
     18 225							     18 225
    181 227							    181 227
   1804 229							   1804 229
      2 231							      2 231
   4011 232							   4011 232
     18 235							     18 235
    180 237							    180 237
   1800 239							   1800 239
      1 240							      1 240
      6 241							      6 241
      2 245							      2 245
      1 248							      1 248
     68 256							     68 256
      4 261							      4 261
     34 264							     34 264
      9 265							      9 265
     94 267							     94 267
    900 269							    900 269
      4 270							      4 270
   2001 271							   2001 271
      9 275							      9 275
     90 277							     90 277
    900 279							    900 279
   2001 281							   2001 281
      1 304							      1 304
     20 336							     20 336
     11 352							     11 352
      3 376							      3 376
     24 436							     24 436
     23 472							     23 472
      1 480							      1 480
      1 488							      1 488
      1 496							      1 496
     60 504							     60 504
      6 512							      6 512
      2 526							      2 526
     14 536							     14 536
     29 608							     29 608
     18 640							     18 640
    180 648							    180 648
   1804 656							   1804 656
      2 664							      2 664
     18 680							     18 680
      1 684							      1 684
    180 688							    180 688
   1800 696							   1800 696
      6 704							      6 704
      2 770							      2 770
      4 784							      4 784
      9 800							      9 800
     94 808							     94 808
    900 816							    900 816
   2001 824							   2001 824
      9 840							      9 840
     90 848							     90 848
    900 856							    900 856
   2001 864							   2001 864
     24 868							     24 868
      1 1016							      1 1016
     12 1024							     12 1024
      8 1025							      8 1025
     20 1028							     20 1028
      1 1053							      1 1053
      1 1055							      1 1055
      2 1100							      2 1100
      1 1192							      1 1192
      1 1194							      1 1194
     24 1216							     24 1216
      3 1217							      3 1217
      3 1219							      3 1219
      8 1306							      8 1306
      4 1333							      4 1333
      4 1335							      4 1335
      8 1380							      8 1380
     24 1648							     24 1648
      1 2048							      1 2048
     20 2104							     20 2104
      3 2204							      3 2204
      3 2208							      3 2208
     18 2256							     18 2256
      1 2276							      1 2276
      1 2400							      1 2400
      1 3056							      1 3056
   4007 3448							   4007 3448
  29887 4096							  29887 4096
     20 5952							     20 5952
   4009 6296							   4009 6296
      3 16384							      3 16384
     29 32768							     29 32768
      1 160112							      1 160112
							      >	   4004 196608
      2 640448							      2 640448
      1 1048576							      1 1048576
      1 2561792							      1 2561792

Comment 44 Arvind iyengar 2021-10-25 10:12:03 UTC
Verified in "4.10.0-0.nightly-2021-10-23-225921" release. With this payload, it is observed that the LB algorithm now defaults to "leastconn":
-------
oc get clusterversion                         
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-23-225921   True        False         52m     Cluster version is 4.10.0-0.nightly-2021-10-23-225921


oc -n openshift-ingress get deployment router-default -o yaml  | grep -i ROUTER_LOAD_BALANCE_ALGORITHM -A2
        - name: ROUTER_LOAD_BALANCE_ALGORITHM
          value: leastconn

Inside router pods:
sh-4.4$ env | grep -i ROUTER_LOAD_BALANCE_ALGORITHM
ROUTER_LOAD_BALANCE_ALGORITHM=leastconn


Backend route algorithm after the change:
backend be_http:test1:service-unsecure
  mode http
  option redispatch
  option forwardfor
  balance leastconn  <----

  timeout check 5000ms
  http-request add-header X-Forwarded-Host %[req.hdr(host)]
  http-request add-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  http-request add-header X-Forwarded-Proto-Version h2 if { ssl_fc_alpn -i h2 }
  http-request add-header Forwarded for=%[src];host=%[req.hdr(host)];proto=%[req.hdr(X-Forwarded-Proto)]
  cookie e96c07fa08f2609cadf847f019750244 insert indirect nocache httponly
  server pod:web-server-rc-fpdlb:service-unsecure:http:10.128.2.28:8080 10.128.2.28:8080 cookie 6eb972d07c3f4b2d51696ce52cfa2115 weight 256 check inter 5000ms
  server pod:web-server-rc-7wd5w:service-unsecure:http:10.129.2.31:8080 10.129.2.31:8080 cookie 0e8a34e402536207cbae3af56924532e weight 256 check inter 5000ms
-------

Comment 47 Eswar Vadla 2021-12-03 14:56:45 UTC
Hello Arvind,

Currently this bug is verified for version 4.10; is it possible to backport the fix to 4.7?

Regards,
Eswar.

Comment 52 Brandi Munilla 2022-02-10 20:36:55 UTC
Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!

Comment 53 Miciah Dashiel Butler Masters 2022-02-23 22:53:25 UTC
No doc update is required for the 4.10.0 BZ because the change was already backported to 4.9.z.  The 4.9.z BZ has a release note that clearly describes the issue, and with the change already backported to 4.9.z, there is effectively no change from 4.9.z to 4.10.0 to warrant a release note.

Comment 55 errata-xmlrpc 2022-03-12 04:38:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

