Description of problem:

The ovn_metadata_agent container is still up, but all neutron-haproxy-ovnmeta containers are down, so no metadata is served to new VMs. After restarting the "ovn_metadata_agent" container, all "neutron-haproxy-ovnmeta" containers come back and instances get their metadata again.

Version-Release number of selected component (if applicable): 16.2.3

The only log found is in /var/log/containers/stdouts/neutron-haproxy-ovnmeta-:

2023-03-16T13:05:48.306094087+01:00 stderr F [WARNING] 074/113738 (91677) : Exiting Master process...
2023-03-16T13:05:48.306330908+01:00 stderr F [ALERT] 074/113738 (91677) : Current worker 91680 exited with code 143
2023-03-16T13:05:48.306336518+01:00 stderr F [WARNING] 074/113738 (91677) : All workers exited. Exiting... (143)
2023-03-16T13:09:23.524106739+01:00 stderr F [WARNING] 074/114341 (97109) : Exiting Master process...
2023-03-16T13:09:23.524403206+01:00 stderr F [ALERT] 074/114341 (97109) : Current worker 97112 exited with code 143
2023-03-16T13:09:23.524408679+01:00 stderr F [WARNING] 074/114341 (97109) : All workers exited. Exiting... (143)
2023-03-16T13:33:33.474678351+01:00 stderr F [WARNING] 074/121350 (118921) : Exiting Master process...
2023-03-16T13:33:33.474903230+01:00 stderr F [ALERT] 074/121350 (118921) : Current worker 118924 exited with code 143
2023-03-16T13:33:33.474908538+01:00 stderr F [WARNING] 074/121350 (118921) : All workers exited. Exiting... (143)
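For reference, a minimal diagnostic/recovery sketch on an affected compute node, assuming the podman-based container names and the tripleo_ovn_metadata_agent service seen in this deployment (adjust names if they differ):

    # Check whether the haproxy side-car containers for metadata are running
    sudo podman ps -a --filter name=neutron-haproxy-ovnmeta

    # Check whether the per-network metadata namespaces still exist
    sudo ip netns | grep ovnmeta

    # Workaround observed to restore metadata: restart the metadata agent
    sudo systemctl restart tripleo_ovn_metadata_agent.service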
Also, in the ovn-controller logs on the compute node:

2023-03-16T10:52:21.446Z|01310|binding|INFO|Setting lport 59b29103-aac1-4847-b14e-14f19689d832 ovn-installed in OVS
2023-03-16T10:52:22.178Z|01311|lflow|WARN|Dropped 15 log messages in last 1149 seconds (most recently, 1147 seconds ago) due to excessive rate
2023-03-16T10:52:22.178Z|01312|lflow|WARN|error parsing match "reg0[8] == 1 && (outport == @pg_9d1a5f15_35f4_43ad_84ef_b2c25fbf3468 && ip4 && ip4.src == $pg_5bb0cf4c_6819_4568_9aa6_41985c5b2d21_ip4 && tcp && tcp.dst == 8300)": Syntax error at `$pg_5bb0cf4c_6819_4568_9aa6_41985c5b2d21_ip4' expecting address set name.
2023-03-16T10:52:23.455Z|01313|binding|INFO|Setting lport 59b29103-aac1-4847-b14e-14f19689d832 up in Southbound
2023-03-16T10:52:23.455Z|01314|timeval|WARN|Unreasonably long 1790ms poll interval (1781ms user, 1ms system)
2023-03-16T10:52:23.455Z|01315|timeval|WARN|faults: 288 minor, 0 major
2023-03-16T10:52:23.455Z|01316|timeval|WARN|disk: 0 reads, 8 writes
2023-03-16T10:52:23.455Z|01317|timeval|WARN|context switches: 0 voluntary, 19 involuntary
2023-03-16T10:52:23.455Z|01318|coverage|INFO|Skipping details of duplicate event coverage for hash=38a07f41
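The lflow warning above says ovn-controller does not know the address set $pg_5bb0cf4c_6819_4568_9aa6_41985c5b2d21_ip4 referenced by the ACL match. Whether or not this is related to the metadata issue, a quick way to check if that address set exists in the Southbound DB would be something like the following (a sketch, assuming it is run from a node that can reach the OVN databases):

    # Look for the address set named in the parse error
    ovn-sbctl list Address_Set | grep -B1 -A3 pg_5bb0cf4c_6819_4568_9aa6_41985c5b2d21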
Hi Cyril,

Can you give us more background on this problem? I.e.:

1. When did this problem start occurring? A precise timestamp of when the customer was experiencing issues would be nice to have, so we can look at what the system was doing at that time. The haproxy logs you posted in the original description look fine. They just indicate that haproxy is being taken down by the metadata agent container because the logical port (most likely a VM port) was removed from the chassis. Here are the logs showing this:

ovn-controller.log - the port is being released from the chassis (aka the compute node):
2023-03-16T12:33:33.066Z|01392|binding|INFO|Releasing lport aebbacb2-b188-46a2-875f-6a60ed5140ce from this chassis.

ovn-metadata-agent.log:
2023-03-16 13:33:33.091 90125 INFO networking_ovn.agent.metadata.agent [-] Port aebbacb2-b188-46a2-875f-6a60ed5140ce in datapath c1525463-23f3-45a7-9511-95c25fa40ae3 unbound from our chassis
2023-03-16 13:33:33.169 90125 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-c1525463-23f3-45a7-9511-95c25fa40ae3 namespace which is not needed anymore
# It would appear that this port belonged to the last VM in that datapath on this compute, so the metadata agent starts cleaning up the network namespace.

neutron-haproxy-ovnmeta-c1525463-23f3-45a7-9511-95c25fa40ae3.log:
2023-03-16T13:33:33.474678351+01:00 stderr F [WARNING] 074/121350 (118921) : Exiting Master process...
2023-03-16T13:33:33.474903230+01:00 stderr F [ALERT] 074/121350 (118921) : Current worker 118924 exited with code 143
2023-03-16T13:33:33.474908538+01:00 stderr F [WARNING] 074/121350 (118921) : All workers exited. Exiting... (143)
# Exit code 143 is a polite SIGTERM (128 + 15), so that checks out, since the ovn-metadata-agent instructed haproxy to shut down.
# All of that looks normal to me.

2. What is the customer trying to do when the issue happens? Start a new VM or reboot a VM? Details like this would be helpful.

3. The sos report only contains logs for one compute node. Is this the only compute node with the problem, or does it happen on all compute nodes?
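To confirm this sequence on a compute that is actually in the broken state (agent up, haproxy side-cars gone while VMs still need metadata), one could correlate the agent's bind/unbind events with ovn-controller's port claims. A rough sketch, assuming the default RHOSP 16.2 container log locations seen in the sos report:

    # Bind/unbind and namespace cleanup events recorded by the metadata agent
    sudo grep -E 'our chassis|Cleaning up' /var/log/containers/neutron/ovn-metadata-agent.log

    # Corresponding port claim/release events from ovn-controller
    sudo grep -E 'Claiming lport|Releasing lport' /var/log/containers/openvswitch/ovn-controller.log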
Hi Miro,

1. We don't have a specific starting date for the issue. At the beginning of last week we noticed it was related to the missing metadata namespace, but we already had the issue 3 weeks ago and managed to work around it by moving the instances to another host.

2. Usually we notice the issue when a customer is trying to spawn a new instance.

3. This happened on various compute nodes, not at the same time.
Hi Cyril,

Are there any updates from the customer? If I understand correctly, the customer enabled debug logs as we asked in comment#11. Have they run into the issue again yet? Just trying to push this BZ forward... Thanks!
Hi Tommy,

I noticed you were the last one to comment on the CU case. I see that the customer uploaded some new files on 4/21 (21042023.tar.gz). Based on its content, I am assuming it is the information I asked for in comment#11? If that's the case, I have two questions:

1. Can the customer give me a timestamp of when they ran into the issue? That would help me focus on a specific part of the log files.

2. I am not sure if this is just me, but the neutron logs in 03463839/0150-21042023.tar.gz/21042023/sosreport-cell1-compute-22-2023-04-21-snxdxfu/var/log/containers/neutron are empty. Can you confirm? Or perhaps the customer made a mistake?

Also, I noticed that debug=false is still set for the metadata agent on the cell-1 compute in var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini. In order to enable debug logs it should be set to true and the metadata container restarted, i.e. systemctl restart tripleo_ovn_metadata_agent.service.

Maybe I am jumping the gun, since the customer did not provide details on why they uploaded those files on 4/21 :). I am just trying to be proactive since this case has been open for quite some time without any progress.
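For clarity, this is a minimal sketch of the change being asked for. The debug option lives in the [DEFAULT] section of the networking-ovn-metadata-agent.ini referenced above:

    [DEFAULT]
    debug = True

Then restart the agent so the new setting is picked up:

    sudo systemctl restart tripleo_ovn_metadata_agent.service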
Hi Miro,

Some of the requested information was missing, and we have re-requested it from the customer. If you have any other questions related to that case, please check with Cyril, who opened this BZ; he should know more about this issue than I do!