Bug 1697561 - [OSP15] Too much memory (userspace/kernel) consumption brings nodes to a halt and has OOM go on a killing spree
Summary: [OSP15] Too much memory (userspace/kernel) consumption brings nodes to a halt and has OOM go on a killing spree
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: podman
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Brent Baude
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Duplicates: 1690510 (view as bug list)
Depends On: 1699202 1727099 1730325
Blocks: 1623890
 
Reported: 2019-04-08 16:56 UTC by Pavel Sedlák
Modified: 2021-02-01 07:40 UTC
CC List: 21 users

Fixed In Version: podman-1.0.3-1.git9d78c0c.module+el8.0.0.z+3717+fdd07b7c
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-01 07:40:02 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID: OpenStack gerrit 654448 | Private: 0 | Priority: None | Status: MERGED | Summary: Introduce paunch apply --healthcheck-disabled | Last Updated: 2021-02-11 14:27:25 UTC

Description Pavel Sedlák 2019-04-08 16:56:28 UTC
Description of problem:

The compute-0 host is unable to reach controller-0 on its management IP, in a small CI deployment (1 undercloud, 1 controller, 1 compute) with OSPd, in a virt environment.

Nova-compute cannot reach rabbitmq on controller:
> /var/log/containers/nova/nova-compute.log.1:2019-04-07 23:59:57.497 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [6b358d33-da08-46ec-a779-d6cdc45e4e89] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 32 seconds.: OSError: [Errno 113] EHOSTUNREACH
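
As a quick cross-check of that symptom, the AMQP port can be probed directly from compute-0; this is only a generic sketch (nc is assumed to be available via nmap-ncat, and the exact error text will vary):
> [root@compute-0 heat-admin]# nc -vz controller-0.internalapi.localdomain 5672
> # while the issue is present this is expected to fail with "No route to host" (EHOSTUNREACH)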

Ping does not work either:
> [root@compute-0 heat-admin]# ping controller-0.internalapi.localdomain
> PING controller-0.internalapi.localdomain (172.17.1.65) 56(84) bytes of data.
> From compute-0.localdomain (172.17.1.22) icmp_seq=1 Destination Host Unreachable

IPs and firewall rules seem correct to me, unless I overlooked something:
> [root@compute-0 heat-admin]# ip a s vlan20
> 7: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
>     link/ether d6:1b:b6:11:f0:c3 brd ff:ff:ff:ff:ff:ff
>     inet 172.17.1.22/24 brd 172.17.1.255 scope global vlan20
>        valid_lft forever preferred_lft forever
>     inet6 fe80::d41b:b6ff:fe11:f0c3/64 scope link
>        valid_lft forever preferred_lft forever
>
> [heat-admin@controller-0 ~]$ ip a s vlan20
> 8: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
>     link/ether f2:5f:cd:e1:8c:2c brd ff:ff:ff:ff:ff:ff
>     inet 172.17.1.65/24 brd 172.17.1.255 scope global vlan20
>        valid_lft forever preferred_lft forever
>     inet 172.17.1.111/32 scope global vlan20
>        valid_lft forever preferred_lft forever
>     inet 172.17.1.86/32 scope global vlan20
>        valid_lft forever preferred_lft forever
>     inet6 fe80::f05f:cdff:fee1:8c2c/64 scope link
>        valid_lft forever preferred_lft forever
>
> [heat-admin@controller-0 ~]$ sudo iptables -S
> -P INPUT ACCEPT
> -P FORWARD ACCEPT
> -P OUTPUT ACCEPT
> -A INPUT -m state --state RELATED,ESTABLISHED -m comment --comment "000 accept related established rules ipv4" -j ACCEPT
> -A INPUT -p icmp -m state --state NEW -m comment --comment "001 accept all icmp ipv4" -j ACCEPT
> ...
> -A INPUT -p tcp -m multiport --dports 4369,5672,25672 -m state --state NEW -m comment --comment "109 rabbitmq ipv4" -j ACCEPT

Not sure what other parts are actually involved, but the ovs-vsctl setup seems OK (vlan20 on both with tag 20, both in br-isolated with interface eth1); a short verification sketch follows after the listing below.
The eth1 interfaces of both VMs (controller-0 and compute-0) are on the outer virt level, in the same network:
> [root@seal53 ~]# virsh domiflist controller-0 | grep manage
> vnet1      network    management virtio      52:54:00:75:0c:c5
> [heat-admin@controller-0 ~]$ ip a s eth1
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
>     link/ether 52:54:00:75:0c:c5 brd ff:ff:ff:ff:ff:ff
>
> [root@compute-0 heat-admin]# ip a s eth1
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
>     link/ether 52:54:00:c7:b3:a1 brd ff:ff:ff:ff:ff:ff
> [root@seal53 ~]# virsh domiflist compute-0|grep manag
> vnet7      network    management virtio      52:54:00:c7:b3:a1
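
For reference, a minimal sketch of the OVS checks summarized above (bridge and port names are the ones already shown; exact output will vary):
> [root@compute-0 heat-admin]# ovs-vsctl list-ports br-isolated   # should include eth1 and vlan20
> [root@compute-0 heat-admin]# ovs-vsctl get Port vlan20 tag      # should print 20
> [root@compute-0 heat-admin]# ovs-vsctl show                     # full bridge/port/interface layout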

What seems a bit strange is that on controller-0 tcpdump shows incoming ARP who-has requests,
but I'm not seeing any responses.
The same is observed with tcpdump -i management on the virt host itself (the who-has requests repeat on and on):
> [root@seal53 ~]# tcpdump -i management -n
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on management, link-type EN10MB (Ethernet), capture size 262144 bytes
> 16:38:24.754875 ARP, Request who-has 172.17.1.86 tell 172.17.1.22, length 28
> 16:38:25.648649 IP 172.16.0.1.36130 > 172.16.0.86.ssh: Flags [P.], seq 2906720216:2906720252, ack 1311528671, win 3917, options [nop,nop,TS val 441971287 ecr 2609996849], length 36
> ...
> 16:38:26.407392 IP 172.16.0.1.37764 > 172.16.0.86.ssh: Flags [.], ack 29, win 954, options [nop,nop,TS val 441972046 ecr 2610001764], length 0
> 16:38:26.675023 ARP, Request who-has 172.17.1.65 tell 172.17.1.22, length 28
> 16:38:26.802889 ARP, Request who-has 172.17.1.86 tell 172.17.1.22, length 28
> 16:38:27.698873 ARP, Request who-has 172.17.1.65 tell 172.17.1.22, length 28
> 16:38:27.826858 ARP, Request who-has 172.17.1.86 tell 172.17.1.22, length 28
(172.17.1.65 here is the controller, .86 the undercloud and .22 the compute)
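
To watch just the ARP exchange for these addresses, a narrower capture filter can be used; a generic sketch, with the interface name as above:
> [root@seal53 ~]# tcpdump -i management -n arp and host 172.17.1.65
> # healthy ARP shows who-has requests followed by is-at replies; here only the requests appear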

It also seems to be a bi-directional issue, as the controller cannot ping the compute on its management IP either,
and it also has an incomplete ARP entry for it:

> [root@compute-0 heat-admin]# arp -a
> overcloud.ctlplane.localdomain (192.168.24.8) at 52:54:00:55:89:4b [ether] on eth0
> ? (172.17.1.111) at <incomplete> on vlan20
> overcloud.localdomain (10.0.0.111) at 52:54:00:5e:4d:9b [ether] on eth2
> controller-0.tenant.localdomain (172.17.2.107) at 56:20:db:9a:d4:4c [ether] on vlan50
> controller-0.storage.localdomain (172.17.3.129) at 7e:d2:f9:71:1c:9a [ether] on vlan30
> controller-0.localdomain (172.17.1.65) at <incomplete> on vlan20
> overcloud.internalapi.localdomain (172.17.1.86) at <incomplete> on vlan20
> controller-0.ctlplane.localdomain (192.168.24.20) at 52:54:00:55:89:4b [ether] on eth0
> _gateway (192.168.24.1) at 52:54:00:71:69:64 [ether] on eth0
> overcloud.storage.localdomain (172.17.3.101) at 7e:d2:f9:71:1c:9a [ether] on vlan30
>
> [root@controller-0 heat-admin]# time arp -a -n
> ? (10.0.0.78) at 52:54:00:65:7a:d5 [ether] on br-ex
> ? (192.168.24.1) at 52:54:00:71:69:64 [ether] on eth0
> ? (192.168.24.19) at 52:54:00:06:ab:fe [ether] on eth0
> ? (10.0.0.1) at <incomplete> on br-ex
> ? (172.17.2.58) at 5e:55:18:24:b6:16 [ether] on vlan50
> ? (192.168.24.254) at 52:54:00:d2:dd:3a [ether] on eth0
> ? (172.17.1.22) at <incomplete> on vlan20
> ? (172.17.3.145) at 56:f9:bd:e4:c7:ff [ether] on vlan30
(-n was used on controller-0 because some names are not resolvable and the lookups take quite long here)
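
The same neighbour state can also be read via iproute2, which avoids the slow reverse-DNS lookups; a generic sketch:
> [root@controller-0 heat-admin]# ip neigh show dev vlan20   # entries stuck in INCOMPLETE/FAILED confirm the ARP problem
> [root@compute-0 heat-admin]# ip neigh show dev vlan20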





Version-Release number of selected component (if applicable):

Same kernel/OVS on controller-0 and compute-0:
> kernel-4.18.0-80.el8.x86_64
> network-scripts-openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
> openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
> openvswitch-selinux-extra-policy-1.0-10.el8fdb.noarch
> rhosp-openvswitch-2.11-0.1.el8ost.noarch

undercloud-0
> ansible-role-tripleo-modify-image-1.0.1-0.20190402220346.012209a.el8ost.noarch
> ansible-tripleo-ipsec-9.0.1-0.20190220162047.f60ad6c.el8ost.noarch
> kernel-4.18.0-80.el8.x86_64
> network-scripts-openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
> openstack-tripleo-common-10.6.1-0.20190404000356.3398bec.el8ost.noarch
> openstack-tripleo-common-containers-10.6.1-0.20190404000356.3398bec.el8ost.noarch
> openstack-tripleo-heat-templates-10.4.1-0.20190403221322.0d98720.el8ost.noarch
> openstack-tripleo-image-elements-10.3.1-0.20190325204940.253fe88.el8ost.noarch
> openstack-tripleo-puppet-elements-10.2.1-0.20190327211339.0f6cacb.el8ost.noarch
> openstack-tripleo-validations-10.3.1-0.20190403171315.a4c40f2.el8ost.noarch
> openvswitch-selinux-extra-policy-1.0-10.el8fdb.noarch
> openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
> puppet-tripleo-10.3.1-0.20190403180925.81d7714.el8ost.noarch
> python3-tripleo-common-10.6.1-0.20190404000356.3398bec.el8ost.noarch
> python3-tripleoclient-11.3.1-0.20190403170353.73cc438.el8ost.noarch
> python3-tripleoclient-heat-installer-11.3.1-0.20190403170353.73cc438.el8ost.noarch
> rhosp-openvswitch-2.11-0.1.el8ost.noarch




How reproducible/steps:
Always, appears in OSP15 Phase1 CI.

Comment 6 Bob Fournier 2019-04-10 13:09:15 UTC
Based on Michele's comment https://bugzilla.redhat.com/show_bug.cgi?id=1697561#c4 and its similarity to https://bugzilla.redhat.com/show_bug.cgi?id=1690510, moving this to DF, as they have much more context on the issue.

Comment 8 Michele Baldessari 2019-04-11 20:51:10 UTC
*** Bug 1690510 has been marked as a duplicate of this bug. ***

Comment 11 Bogdan Dobrelya 2019-04-18 09:09:27 UTC
Highly likely a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1699202

Comment 12 Alex Schultz 2019-04-22 20:24:02 UTC
Leaving this open until Bug 1699202 is resolved. We've added functionality to allow for the disabling of health checks as a possible workaround, but that is not a recommended long-term remediation for the reported issue.

Comment 14 Alex Schultz 2019-05-15 13:40:11 UTC
Dropping the blocker, as this can be worked around by disabling the health checks.

Comment 15 Emilien Macchi 2019-05-29 19:35:23 UTC
Note that healthchecks can be disabled on the overcloud & standalone with ContainerHealthcheckDisabled: true, and on the undercloud with container_healthcheck_disabled: true in undercloud.conf.
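
A minimal sketch of where those settings go (the environment file name below is illustrative, not mandated):
> # overcloud / standalone: extra Heat environment file passed to the deploy command,
> # e.g. -e disable-healthchecks.yaml (illustrative name)
> parameter_defaults:
>   ContainerHealthcheckDisabled: true
>
> # undercloud: undercloud.conf, [DEFAULT] section
> container_healthcheck_disabled = true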

Comment 16 Omri Hochman 2019-06-16 13:16:15 UTC
(In reply to Alex Schultz from comment #14)
> Dropping the blocker as this can be worked around by disabling the health
> checks

From the bad-experience point of view, I would like to get a PM decision. I would add the blocker flag unless we decide to disable health checks by default in the deployment.

Comment 26 RHEL Program Management 2021-02-01 07:40:02 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

