Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1635242

Summary: neutron-dhcp-agent consumed 50% memory and got SIGHUP
Product: Red Hat OpenStack
Reporter: Noam Manos <nmanos>
Component: openstack-neutron
Assignee: Nate Johnston <njohnston>
Status: CLOSED CURRENTRELEASE
QA Contact: Roee Agiman <ragiman>
Severity: high
Priority: medium
Version: 14.0 (Rocky)
CC: amuller, bcafarel, bhaley, chrisw, nmanos, nyechiel
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-02 08:01:57 UTC
Type: Bug
Regression: ---
Embargoed:
Attachments: neutron-dhcp-agent consume 50% memory.txt

Description Noam Manos 2018-10-02 12:58:53 UTC
Description of problem:

On a fresh OSP14 HA deployment with OpenDaylight, the system worked for a few days and tempest tests passed. However, after leaving it idle for ten days, controller-0 died because neutron-dhcp-agent had consumed 50% of memory (16 of 32 GB).
Restarting controller-0 did not resolve it.

Version-Release number of selected component (if applicable):
OSP14 2018-09-06.1


How reproducible:
Sometimes.

Steps to Reproduce:
1. Install OSP14 HA with ODL.
2. Run tempest tests.
3. Check system health after a few days.
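For step 3, a quick health check is to compare the largest resident-set consumers against half of the host's MemTotal, the level neutron-dhcp-agent reached here. This is only a sketch; the 50% threshold and the five-process cutoff are illustrative assumptions, not part of the report.

```shell
#!/bin/sh
# Sketch: warn about any of the top RSS consumers whose resident set
# exceeds half of MemTotal. The 50% threshold is illustrative.
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
threshold_kb=$((mem_total_kb / 2))

# List the five largest processes by RSS (kB) and flag offenders.
ps -eo rss=,comm= --sort -rss | head -5 | while read -r rss comm; do
    if [ "$rss" -gt "$threshold_kb" ]; then
        echo "WARNING: ${comm} RSS ${rss} kB exceeds 50% of ${mem_total_kb} kB"
    fi
done
```

On a healthy controller this prints nothing; in the state described above it would flag the dhcp-agent process.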

Actual results:
[stack@undercloud-0 ~]$ urc
+--------------------------------------+--------------+---------+------------------------+----------------+------------+
| ID                                   | Name         | Status  | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+---------+------------------------+----------------+------------+
| d2ed189d-542d-47c8-b6de-551bfed8a0cf | controller-2 | ACTIVE  | ctlplane=192.168.24.21 | overcloud-full | controller |
| edf3cda3-04b2-4ca5-b31a-77142e85363d | controller-0 | SHUTOFF | ctlplane=192.168.24.18 | overcloud-full | controller |
| 88e572bd-abbd-42c7-b1c9-4a9efa7cc585 | controller-1 | ACTIVE  | ctlplane=192.168.24.13 | overcloud-full | controller |
| e11d237f-eebb-4a3e-a005-d4956ae6c6ef | compute-1    | ACTIVE  | ctlplane=192.168.24.14 | overcloud-full | compute    |
| 9f41767a-b8e8-4c6b-a84a-5decccfe1cb2 | compute-0    | ACTIVE  | ctlplane=192.168.24.8  | overcloud-full | compute    |
+--------------------------------------+--------------+---------+------------------------+----------------+------------+

[heat-admin@controller-1 ~]$ tail /var/log/containers/neutron/dhcp-agent.log

WARNING oslo.service.loopingcall Function 'neutron.agent.dhcp.agent.DhcpAgentWithStateReport._report_state' run outlasted interval by 30.00 sec
DEBUG oslo_concurrency.lockutils  Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process._check_child_processes" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
DEBUG oslo_concurrency.lockutils  Lock "_check_child_processes" released by "neutron.agent.linux.external_process._check_child_processes" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
ERROR neutron.agent.dhcp.agent Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID 14fb44a23a8c4f97b049c5337a5f20d4

[heat-admin@controller-1 ~]$ ps aux --sort -rss | head -11

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
qemu     14363  408 25.2 34645588 33201776 ?   Sl   Sep20 70600:48 /usr/libexec/qemu-kvm -name controller-0 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off,dump-guest-core=off -cpu host -m 32768 -realtime mlock=off -smp 8,sockets=8,cores=1,threads=1 -uuid abb428b2-0232-4396-b2c5-4bbf801ddcf4 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-17-controller-0/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -drive file=/var/lib/libvirt/images/controller-0-disk1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=unsafe -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:02:db:e0,bus=pci.0,addr=0x3 -netdev tap,fd=31,id=hostnet1,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:92:f4:81,bus=pci.0,addr=0x4 -netdev tap,fd=33,id=hostnet2,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:0f:5b:c8,bus=pci.0,addr=0x5 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-17-controller-0/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0,bus=usb.0,port=1 
-vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0xa -msg timestamp=on


Expected results:
All controllers should be up and running.


Additional info:
Attaching logs from controller-1 (controller-0 was down, and controller-2 showed the same behaviour: neutron-dhcp-agent consumed 50% memory).

Comment 1 Noam Manos 2018-10-02 13:01:13 UTC
Created attachment 1489474 [details]
neutron-dhcp-agent consume 50% memory.txt

Comment 2 Noam Manos 2018-10-02 13:39:22 UTC
SOS-REPORTs:
http://rhos-release.virt.bos.redhat.com/log/bz1635242

Comment 8 Nate Johnston 2018-10-10 13:43:59 UTC
Noam, were you running tempest tests continuously over that time period, or was the system just left sitting there doing nothing after a single tempest run?  The rate of increase I am seeing would not correspond to the level you got to - it would need to be about 6 times worse in order to hit that target.

Currently trying to see if this is an OSP 14 blocker.

Comment 9 Noam Manos 2018-10-16 09:29:49 UTC
(In reply to Nate Johnston from comment #8)
> Noam, were you running tempest tests continuously over that time period, or
> was the system just left sitting there doing nothing after a single tempest
> run?  The rate of increase I am seeing would not correspond to the level you
> got to - it would need to be about 6 times worse in order to hit that target.
> 
> Currently trying to see if this is an OSP 14 blocker.

It was during the TLV office shutdown - the system was just left sitting there doing nothing after a single tempest run.

Comment 10 Nate Johnston 2018-10-16 15:02:21 UTC
Noam, is it still available?  Could I log in to it, or could you reproduce?

Comment 11 Noam Manos 2018-10-17 14:55:47 UTC
(In reply to Nate Johnston from comment #10)
> Noam, is it still available?  Could I log in to it, or could you reproduce?


Had to reuse the environment. Once reproduced, I will update.

Comment 12 Nate Johnston 2018-11-30 17:12:03 UTC
Pinging to see if there is an ETA for a reproducer environment.

Comment 14 Noam Manos 2018-12-02 08:01:57 UTC
The described problem, a controller shutoff after one week of run time, was not reproduced on later puddles (currently 2018-11-21.2).

I will close the bug.


(undercloud) [stack@undercloud-0 ~]$ uptime
 02:57:26 up 9 days, 11:53,  1 user,  load average: 1.14, 1.22, 1.25

(undercloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed 
14   -p 2018-11-21.2

openstack server list --all
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 4195d562-848f-4b82-b619-0cfee47ca539 | controller-2 | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | controller |
| b460ece2-b835-4cd3-a173-29c29ba7e8b5 | controller-0 | ACTIVE | ctlplane=192.168.24.14 | overcloud-full | controller |
| f2934ee6-987b-4c78-8d9d-c929462d1cab | compute-0    | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | compute    |
| a7b31bc5-fb8b-4332-bd86-913a6f242d8a | controller-1 | ACTIVE | ctlplane=192.168.24.7  | overcloud-full | controller |
| 4a2422ea-2a9f-478e-84b0-392df5efb70b | compute-1    | ACTIVE | ctlplane=192.168.24.6  | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
(undercloud) [stack@undercloud-0 ~]$