Bug 2179140

Summary: neutron-haproxy-ovnmeta stopped
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 16.2 (Train)
Status: NEW
Severity: medium
Priority: unspecified
Hardware: x86_64
OS: Linux
Target Milestone: ---
Target Release: ---
Reporter: Cyril Lopez <cylopez>
Assignee: Miro Tomaska <mtomaska>
QA Contact: Eran Kuris <ekuris>
CC: astupnik, chrisw, egarciar, jlibosva, mlavalle, mtomaska, scohen, svigan, tdoucet
Flags: cylopez: needinfo? (mtomaska)
Type: Bug

Description Cyril Lopez 2023-03-16 17:32:38 UTC
Description of problem:
The ovn_metadata_agent container is still up, but all neutron-haproxy-ovnmeta containers are down, so no metadata is served to new VMs.

After restarting the "ovn_metadata_agent" container, all "neutron-haproxy-ovnmeta" containers come back and instances get their metadata again.
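
For reference, a minimal way to confirm the symptom and apply the workaround (the metadata URL is the standard OpenStack endpoint, and the service name is the one referenced later in this BZ; treat this as a sketch, not output from this environment):

# From inside an affected instance: the metadata endpoint does not answer while haproxy is down
$ curl --max-time 10 http://169.254.169.254/openstack/latest/meta_data.json

# On the compute node: the agent container is still Up while the haproxy sidecars are gone
$ sudo podman ps --filter name=ovn_metadata_agent
$ sudo podman ps -a --filter name=neutron-haproxy-ovnmeta

# Workaround: restart the metadata agent, which respawns the per-network haproxy containers
$ sudo systemctl restart tripleo_ovn_metadata_agent.service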

Version-Release number of selected component (if applicable): 16.2.3

The only log found is:
/var/log/containers/stdouts/neutron-haproxy-ovnmeta-

2023-03-16T13:05:48.306094087+01:00 stderr F [WARNING] 074/113738 (91677) : Exiting Master process...
2023-03-16T13:05:48.306330908+01:00 stderr F [ALERT] 074/113738 (91677) : Current worker 91680 exited with code 143
2023-03-16T13:05:48.306336518+01:00 stderr F [WARNING] 074/113738 (91677) : All workers exited. Exiting... (143)
2023-03-16T13:09:23.524106739+01:00 stderr F [WARNING] 074/114341 (97109) : Exiting Master process...
2023-03-16T13:09:23.524403206+01:00 stderr F [ALERT] 074/114341 (97109) : Current worker 97112 exited with code 143
2023-03-16T13:09:23.524408679+01:00 stderr F [WARNING] 074/114341 (97109) : All workers exited. Exiting... (143)
2023-03-16T13:33:33.474678351+01:00 stderr F [WARNING] 074/121350 (118921) : Exiting Master process...
2023-03-16T13:33:33.474903230+01:00 stderr F [ALERT] 074/121350 (118921) : Current worker 118924 exited with code 143
2023-03-16T13:33:33.474908538+01:00 stderr F [WARNING] 074/121350 (118921) : All workers exited. Exiting... (143)

Comment 2 Cyril Lopez 2023-03-16 17:38:34 UTC
Also, in the ovn-controller logs on the compute:

2023-03-16T10:52:21.446Z|01310|binding|INFO|Setting lport 59b29103-aac1-4847-b14e-14f19689d832 ovn-installed in OVS 
2023-03-16T10:52:22.178Z|01311|lflow|WARN|Dropped 15 log messages in last 1149 seconds (most recently, 1147 seconds ago) due to excessive rate
2023-03-16T10:52:22.178Z|01312|lflow|WARN|error parsing match "reg0[8] == 1 && (outport == @pg_9d1a5f15_35f4_43ad_84ef_b2c25fbf3468 && ip4 && ip4.src == $pg_5bb0cf4c_6819_4568_9aa6_41985c5b2d21_ip4 && tcp && tcp.dst == 8300)": Syntax error at `$pg_5bb0cf4c_6819_4568_9aa6_41985c5b2d21_ip4' expecting address set name.
2023-03-16T10:52:23.455Z|01313|binding|INFO|Setting lport 59b29103-aac1-4847-b14e-14f19689d832 up in Southbound     
2023-03-16T10:52:23.455Z|01314|timeval|WARN|Unreasonably long 1790ms poll interval (1781ms user, 1ms system)        
2023-03-16T10:52:23.455Z|01315|timeval|WARN|faults: 288 minor, 0 major                                                                                                                        
2023-03-16T10:52:23.455Z|01316|timeval|WARN|disk: 0 reads, 8 writes                                                                                                                           
2023-03-16T10:52:23.455Z|01317|timeval|WARN|context switches: 0 voluntary, 19 involuntary                                                                                                     
2023-03-16T10:52:23.455Z|01318|coverage|INFO|Skipping details of duplicate event coverage for hash=38a07f41

Comment 5 Miro Tomaska 2023-03-17 17:09:46 UTC
Hi Cyril,

Can you give us more background on this problem? Specifically:

1. When did this problem start occurring? A precise timestamp of when the customer was experiencing the issue would be nice to have, so we can look at what the system was doing at that time. The haproxy logs you posted in the original description look fine; they just indicate that haproxy is being taken down by the metadata agent container because the logical port (most likely a VM port) was removed from the chassis. Here are the logs showing this:

ovn-controller.log - the port is being released from the chassis (aka the compute):
2023-03-16T12:33:33.066Z|01392|binding|INFO|Releasing lport aebbacb2-b188-46a2-875f-6a60ed5140ce from this chassis.

ovn-metadata-agent.log:
2023-03-16 13:33:33.091 90125 INFO networking_ovn.agent.metadata.agent [-] Port aebbacb2-b188-46a2-875f-6a60ed5140ce in datapath c1525463-23f3-45a7-9511-95c25fa40ae3 unbound from our chassis
2023-03-16 13:33:33.169 90125 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-c1525463-23f3-45a7-9511-95c25fa40ae3 namespace which is not needed anymore

# It would appear that this port was the last VM port on this compute, so the metadata agent starts the network namespace cleanup.
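
# (For reference, a quick way to see which ovnmeta namespaces and haproxy sidecars remain on the compute; commands are a sketch, not taken from the sos report:)
$ sudo ip netns list | grep ovnmeta-
$ sudo podman ps -a --filter name=neutron-haproxy-ovnmeta-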

neutron-haproxy-ovnmeta-c1525463-23f3-45a7-9511-95c25fa40ae3.log 
2023-03-16T13:33:33.474678351+01:00 stderr F [WARNING] 074/121350 (118921) : Exiting Master process...
2023-03-16T13:33:33.474903230+01:00 stderr F [ALERT] 074/121350 (118921) : Current worker 118924 exited with code 143
2023-03-16T13:33:33.474908538+01:00 stderr F [WARNING] 074/121350 (118921) : All workers exited. Exiting... (143)
# Exit code 143 means the process was terminated by SIGTERM, so that checks out since the ovn-metadata-agent is the one instructing haproxy to shut down.
# All of that looks normal to me.
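
# (For reference, 143 = 128 + 15, and signal 15 is SIGTERM; on a bash shell the mapping can be checked with:)
$ kill -l 143
TERM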

2. What is the customer trying to do when the issue happens? Starting a new VM, or rebooting a VM? Details like this would be helpful.

3. The sos report only contains logs for one compute node. Is this the only compute node with the problem, or does it happen on all compute nodes?

Comment 6 Stephane Vigan 2023-03-20 13:21:22 UTC
Hi Miro,

1. We don't have a specific starting date for the issue. At the beginning of last week we noticed it was related to a missing metadata namespace, but we already had the issue 3 weeks ago and managed to work around it by moving the instance to another host.

2. Usually we notice the issue when a customer is trying to spawn a new instance.

3. This has happened on various compute nodes, not at the same time.

Comment 15 Miro Tomaska 2023-04-14 16:59:05 UTC
Hi Cyril 

Are there any updates from the customer? If I understand correctly, the customer enabled debug logs as we asked in comment#11. Have they run into the issue again yet? Just trying to push this BZ forward...

Thanks!

Comment 16 Miro Tomaska 2023-04-25 20:48:05 UTC
Hi Tommy,

I noticed you were the last one to comment on the CU case. I see that the customer uploaded some new files on 4/21 (21042023.tar.gz). Based on its content, I am assuming it's the information I asked for in comment#11?
If that's the case, I have two questions.

1. Can the customer give me a timestamp of when they ran into the issue? That would help me focus on specific parts of the log files.
2. I am not sure if this is just me, but the neutron logs in 03463839/0150-21042023.tar.gz/21042023/sosreport-cell1-compute-22-2023-04-21-snxdxfu/var/log/containers/neutron are empty. Can you confirm? Or perhaps the customer made a mistake?
   Also, I noticed that debug is set to debug=false for the metadata agent on compute cell-1 in var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini. To enable debug logs it should be set to true and the metadata container restarted, i.e. systemctl restart tripleo_ovn_metadata_agent.service (a sketch of the change is below).
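
A minimal sketch of the change, assuming the standard oslo.config [DEFAULT] section in that file (the path and restart command are the ones mentioned above):

# var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini
[DEFAULT]
debug = true

# Restart the metadata agent so the container picks up the new setting
$ sudo systemctl restart tripleo_ovn_metadata_agent.service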

Maybe I am jumping the gun, since the customer did not provide details on why they uploaded those files on 4/21 :). I am just trying to be proactive, since this case has been open for quite some time without any progress.

Comment 17 Tommy Doucet 2023-05-01 11:10:23 UTC
Hi Miro,

Some of the requested information was missing, so we re-requested it from the customer. If you have any other questions related to the case, please check with Cyril, who opened this BZ; he should know more than I do about this issue!