Description of problem:
If Controller-0 is stopped, the administrative/tenant flow logs are not offloaded; they are simply gone (they do not appear on any other controller).

Version-Release number of selected component (if applicable):
(overcloud) [stack@undercloud-0 ~]$ cat /var/lib/rhos-release/latest-installed
16.1 -p RHOS-16.1-RHEL-8-20200813.n.0

How reproducible:
100%; I reproduced it in my environment.

Steps to Reproduce:
1. Deploy OSP 16.1 in HA.
2. Set OctaviaLogOffload: true in /home/stack/virt/extra_templates.yaml.
3. Due to bug https://bugzilla.redhat.com/show_bug.cgi?id=1856835, copy the templates folder as follows:
   sudo cp -r /usr/share/ansible/roles/octavia_controller_post_config/templates /usr/share/ansible/roles/octavia-controller-post-config/templates
4. Run overcloud_deploy.sh.
5. Stop controller-0.
6. Create a LB.
7. Check the other controllers (1, 2): there is no octavia-amphora.log file (the file that contains the offloaded logs).

Actual results:
No octavia-amphora.log file on Controller-1 or Controller-2.

Expected results:
An octavia-amphora.log file with the LB creation details on one of the other controllers.

Additional info:
Yes, this is expected. The rsyslog infrastructure and containers are set up for rsyslog over UDP. This feature was implemented to be low overhead and high volume (a single load balancer can produce tens of thousands of messages per second with tenant traffic logging enabled). Because the important log issues that occur inside the amphora are already logged on the controllers, this was set up as a lowest-overhead, best-effort implementation. It will queue messages until the target server is available again, and in some situations it will eventually switch to one of the secondary servers.

There are many levels of reliability for logging. As you go up this chain you add overhead and increase the amount of system resources (CPU, RAM, and disk) required.

Level 1, as implemented, is UDP transport and minimal queuing. Lowest CPU, RAM, and disk space requirements.

Level 2 would be switching the transport over to TCP. Octavia supports this, but TripleO is not set up for it. This increases the CPU and RAM overhead on both the amphora and the controllers. This can still drop messages.

Level 3 is full bidirectional confirmation. This requires switching to RELP. It requires significant queuing resources on the amphora and significantly increases the RAM and CPU requirements on the controller.

Since these logs were considered "nice to have", we implemented level 1 reliability. This meets the criteria requested in BZ 1623977 with little impact to the cloud. If there is a need for a higher level of reliability for these logs, we can move up levels, but that will require work and may be best entered as an additional RFE.
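
To illustrate why a stopped target loses messages silently at level 1 but not at level 2, here is a minimal Python sketch. It is not how rsyslog or Octavia is implemented; the helper names (free_port, send_udp, send_tcp) and the sample message are made up for illustration, and it only talks to localhost ports that have no listener, standing in for a stopped controller.

```python
import socket


def free_port(kind):
    """Return a localhost port that very likely has no listener (hypothetical helper)."""
    with socket.socket(socket.AF_INET, kind) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def send_udp(message, host, port):
    """Level 1 style: fire-and-forget datagram. The sender gets no
    indication of whether anything received it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        return s.sendto(message, (host, port))


def send_tcp(message, host, port, timeout=1.0):
    """Level 2 style: the connection attempt fails if nothing is listening,
    so the sender at least learns that the target is down."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(message)
        return True
    except OSError:
        return False


# Illustrative syslog-style line; <134> is facility local0, severity info.
MSG = b"<134>amphora-haproxy: example flow log line\n"

udp_sent = send_udp(MSG, "127.0.0.1", free_port(socket.SOCK_DGRAM))
tcp_sent = send_tcp(MSG, "127.0.0.1", free_port(socket.SOCK_STREAM))

print("UDP bytes handed to the kernel:", udp_sent)  # send "succeeds" with no receiver
print("TCP delivered:", tcp_sent)                   # refused: sender can queue or fail over
```

The UDP send reports success even though nothing received the message, which is what the sender sees when the target controller is stopped. The TCP sender can at least detect the failure, although, as noted above, that alone does not guarantee delivery; confirming each message is what level 3 (RELP) adds.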
Makes sense to me. Thanks for the information, Michael; I am closing this bug.