Bug 2179140

Summary: neutron-haproxy-ovnmeta stopped
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 16.2 (Train)
Status: NEW
Severity: medium
Priority: unspecified
Hardware: x86_64
OS: Linux
Target Milestone: ---
Target Release: ---
Reporter: Cyril Lopez <cylopez>
Assignee: Miro Tomaska <mtomaska>
QA Contact: Eran Kuris <ekuris>
CC: astupnik, chrisw, egarciar, jlibosva, mlavalle, mtomaska, scohen, svigan, tdoucet
Flags: cylopez: needinfo? (mtomaska)
Type: Bug

Description Cyril Lopez 2023-03-16 17:32:38 UTC
Description of problem:
The ovn_metadata_agent container is still up, but all neutron-haproxy-ovnmeta containers are down, so no metadata is served to new VMs.

After restarting the "ovn_metadata_agent" container, all "neutron-haproxy-ovnmeta" containers come back and instances get their metadata again.
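
For reference, a minimal way to confirm the symptom and apply the workaround (the metadata URL is the standard OpenStack endpoint, and the service name is the one referenced later in this BZ; treat this as a sketch, not output from this environment):

# From inside an affected instance: the metadata endpoint does not answer while haproxy is down
$ curl --max-time 10 http://169.254.169.254/openstack/latest/meta_data.json

# On the compute node: the agent container is still Up while the haproxy sidecars are gone
$ sudo podman ps --filter name=ovn_metadata_agent
$ sudo podman ps -a --filter name=neutron-haproxy-ovnmeta

# Workaround: restart the metadata agent, which respawns the per-network haproxy containers
$ sudo systemctl restart tripleo_ovn_metadata_agent.service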

Version-Release number of selected component (if applicable): 16.2.3

The only log found is:
/var/log/containers/stdouts/neutron-haproxy-ovnmeta-

2023-03-16T13:05:48.306094087+01:00 stderr F [WARNING] 074/113738 (91677) : Exiting Master process...
2023-03-16T13:05:48.306330908+01:00 stderr F [ALERT] 074/113738 (91677) : Current worker 91680 exited with code 143
2023-03-16T13:05:48.306336518+01:00 stderr F [WARNING] 074/113738 (91677) : All workers exited. Exiting... (143)
2023-03-16T13:09:23.524106739+01:00 stderr F [WARNING] 074/114341 (97109) : Exiting Master process...
2023-03-16T13:09:23.524403206+01:00 stderr F [ALERT] 074/114341 (97109) : Current worker 97112 exited with code 143
2023-03-16T13:09:23.524408679+01:00 stderr F [WARNING] 074/114341 (97109) : All workers exited. Exiting... (143)
2023-03-16T13:33:33.474678351+01:00 stderr F [WARNING] 074/121350 (118921) : Exiting Master process...
2023-03-16T13:33:33.474903230+01:00 stderr F [ALERT] 074/121350 (118921) : Current worker 118924 exited with code 143
2023-03-16T13:33:33.474908538+01:00 stderr F [WARNING] 074/121350 (118921) : All workers exited. Exiting... (143)

Comment 2 Cyril Lopez 2023-03-16 17:38:34 UTC
Also, in the ovn-controller logs on the compute:

2023-03-16T10:52:21.446Z|01310|binding|INFO|Setting lport 59b29103-aac1-4847-b14e-14f19689d832 ovn-installed in OVS 
2023-03-16T10:52:22.178Z|01311|lflow|WARN|Dropped 15 log messages in last 1149 seconds (most recently, 1147 seconds ago) due to excessive rate
2023-03-16T10:52:22.178Z|01312|lflow|WARN|error parsing match "reg0[8] == 1 && (outport == @pg_9d1a5f15_35f4_43ad_84ef_b2c25fbf3468 && ip4 && ip4.src == $pg_5bb0cf4c_6819_4568_9aa6_41985c5b2d21_ip4 && tcp && tcp.dst == 8300)": Syntax error at `$pg_5bb0cf4c_6819_4568_9aa6_41985c5b2d21_ip4' expecting address set name.
2023-03-16T10:52:23.455Z|01313|binding|INFO|Setting lport 59b29103-aac1-4847-b14e-14f19689d832 up in Southbound     
2023-03-16T10:52:23.455Z|01314|timeval|WARN|Unreasonably long 1790ms poll interval (1781ms user, 1ms system)        
2023-03-16T10:52:23.455Z|01315|timeval|WARN|faults: 288 minor, 0 major                                                                                                                        
2023-03-16T10:52:23.455Z|01316|timeval|WARN|disk: 0 reads, 8 writes                                                                                                                           
2023-03-16T10:52:23.455Z|01317|timeval|WARN|context switches: 0 voluntary, 19 involuntary                                                                                                     
2023-03-16T10:52:23.455Z|01318|coverage|INFO|Skipping details of duplicate event coverage for hash=38a07f41

Comment 5 Miro Tomaska 2023-03-17 17:09:46 UTC
Hi Cyril,

Can you give us more background on this problem? Specifically:

1. When did this problem start occurring? A precise timestamp of when the customer was experiencing the issue would be nice to have, so we can look at what the system was doing at that time. The haproxy logs you posted in the original description look fine; they just indicate that haproxy is being taken down by the metadata agent container because the logical port (most likely a VM port) was removed from the chassis. Here are the logs showing this:

ovn-controller.log - the port is being released from the chassis (aka the compute):
2023-03-16T12:33:33.066Z|01392|binding|INFO|Releasing lport aebbacb2-b188-46a2-875f-6a60ed5140ce from this chassis.

ovn-metadata-agent.log:
2023-03-16 13:33:33.091 90125 INFO networking_ovn.agent.metadata.agent [-] Port aebbacb2-b188-46a2-875f-6a60ed5140ce in datapath c1525463-23f3-45a7-9511-95c25fa40ae3 unbound from our chassis
2023-03-16 13:33:33.169 90125 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-c1525463-23f3-45a7-9511-95c25fa40ae3 namespace which is not needed anymore

# It would appear that this port was the last VM port on this compute, so the metadata agent starts the network namespace cleanup.
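
# (For reference, a quick way to see which ovnmeta namespaces and haproxy sidecars remain on the compute; commands are a sketch, not taken from the sos report:)
$ sudo ip netns list | grep ovnmeta-
$ sudo podman ps -a --filter name=neutron-haproxy-ovnmeta-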

neutron-haproxy-ovnmeta-c1525463-23f3-45a7-9511-95c25fa40ae3.log 
2023-03-16T13:33:33.474678351+01:00 stderr F [WARNING] 074/121350 (118921) : Exiting Master process...
2023-03-16T13:33:33.474903230+01:00 stderr F [ALERT] 074/121350 (118921) : Current worker 118924 exited with code 143
2023-03-16T13:33:33.474908538+01:00 stderr F [WARNING] 074/121350 (118921) : All workers exited. Exiting... (143)
# Exit code 143 means the process was terminated by SIGTERM, so that checks out since the ovn-metadata-agent is the one instructing haproxy to shut down.
# All of that looks normal to me.
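
# (For reference, 143 = 128 + 15, and signal 15 is SIGTERM; on a bash shell the mapping can be checked with:)
$ kill -l 143
TERM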

2. What is the customer trying to do when the issue happens? Starting a new VM, or rebooting a VM? Details like this would be helpful.

3. The sos report only contains logs for one compute node. Is this the only compute node with the problem, or does it happen on all compute nodes?

Comment 6 Stephane Vigan 2023-03-20 13:21:22 UTC
Hi Miro,

1. We don't have a specific starting date for the issue. At the beginning of last week we noticed it was related to a missing metadata namespace, but we already had the issue 3 weeks ago and managed to work around it by moving the instance to another host.

2. Usually we notice the issue when a customer is trying to spawn a new instance.

3. This has happened on various compute nodes, not at the same time.

Comment 15 Miro Tomaska 2023-04-14 16:59:05 UTC
Hi Cyril 

Are there any updates from the customer? If I understand correctly, the customer enabled debug logs as we asked in comment#11. Have they run into the issue again yet? Just trying to push this BZ forward...

Thanks!

Comment 16 Miro Tomaska 2023-04-25 20:48:05 UTC
Hi Tommy,

I noticed you were the last one to comment on the CU case. I see that the customer uploaded some new files on 4/21 (21042023.tar.gz). Based on its content, I am assuming it's the information I asked for in comment#11?
If that's the case, I have two questions.

1. Can the customer give me a timestamp of when they ran into the issue? That would help me focus on specific parts of the log files.
2. I am not sure if this is just me, but the neutron logs in 03463839/0150-21042023.tar.gz/21042023/sosreport-cell1-compute-22-2023-04-21-snxdxfu/var/log/containers/neutron are empty. Can you confirm? Or perhaps the customer made a mistake?
   Also, I noticed that debug is set to debug=false for the metadata agent on compute cell-1 in var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini. To enable debug logs it should be set to true and the metadata container restarted, i.e. systemctl restart tripleo_ovn_metadata_agent.service (a sketch of the change is below).
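
A minimal sketch of the change, assuming the standard oslo.config [DEFAULT] section in that file (the path and restart command are the ones mentioned above):

# var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini
[DEFAULT]
debug = true

# Restart the metadata agent so the container picks up the new setting
$ sudo systemctl restart tripleo_ovn_metadata_agent.service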

Maybe I am jumping the gun, since the customer did not provide details on why they uploaded those files on 4/21 :). I am just trying to be proactive, since this case has been open for quite some time without any progress.

Comment 17 Tommy Doucet 2023-05-01 11:10:23 UTC
Hi Miro,

Some of the requested information was missing, so we re-requested it from the customer. If you have any other questions related to the case, please check with Cyril, who opened this BZ; he should know more than I do about this issue!