Bug 1877689

Summary: [Octavia][16.1] Amphora Log Offloading, logs are gone if controller-0 stops
Product: Red Hat OpenStack Reporter: Omer Schwartz <oschwart>
Component: openstack-octavia Assignee: Nate Johnston <njohnston>
Status: CLOSED NOTABUG QA Contact: Omer Schwartz <oschwart>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 16.1 (Train) CC: ihrachys, lpeer, majopela, michjohn, scohen, tfreger
Target Milestone: --- Keywords: UserExperience
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-09-13 07:32:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1623977    

Description Omer Schwartz 2020-09-10 08:04:10 UTC
Description of problem:
If controller-0 is stopped, the administrative and tenant flow logs are not offloaded; they are simply gone (they do not appear on any other controller).

Version-Release number of selected component (if applicable):
(overcloud) [stack@undercloud-0 ~]$ cat /var/lib/rhos-release/latest-installed
16.1  -p RHOS-16.1-RHEL-8-20200813.n.0

How reproducible:
100%. I reproduced it in my environment.

Steps to Reproduce:
1. Deploy OSP 16.1 in HA
2. Set OctaviaLogOffload: true in /home/stack/virt/extra_templates.yaml
3. To work around bug https://bugzilla.redhat.com/show_bug.cgi?id=1856835, copy the templates folder as follows:
sudo cp -r /usr/share/ansible/roles/octavia_controller_post_config/templates /usr/share/ansible/roles/octavia-controller-post-config/templates
4. Run overcloud_deploy.sh.
5. Stop controller-0.
6. Create a LB.
7. Check the other controllers (1 and 2): there is no octavia-amphora.log file (the file that contains the offloaded logs).

Actual results:
No octavia-amphora.log file in Controller-1 or Controller-2.

Expected results:
We expect to see an octavia-amphora.log file with the LB creation details on at least one of the other controllers.

Additional info:

Comment 1 Michael Johnson 2020-09-11 23:36:35 UTC
Yes, this is expected.

The rsyslog infrastructure and containers are set up for the UDP rsyslog protocol.

This feature was implemented to be low overhead and high volume (a single load balancer can produce tens of thousands of messages per second with tenant traffic logging enabled). Because the important log issues that occur inside the amphora are already logged on the controllers, this was set up as a lowest-overhead, best-effort implementation.

It will queue messages until the target server is available again, and in some situations it will eventually switch to one of the secondary servers.

There are many levels of reliability for logging. As you go up this chain you add overhead and increase the amount of system resources (CPU, RAM, and disk) required.

Level 1, as implemented, is UDP transport with minimal queuing. It has the lowest CPU, RAM, and disk space requirements.
Level 2 would switch the transport to TCP. Octavia supports this, but TripleO is not set up for it. This increases the CPU and RAM overhead on both the amphora and the controllers, and it can still drop messages.
Level 3 is full bidirectional confirmation. This requires switching to RELP. It requires significant queuing resources on the amphora and significantly increases the RAM and CPU requirements on the controller.
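
As a rough illustration of why the level 1 transport can lose messages silently while the level 2 transport at least notices a dead receiver, here is a minimal Python sketch. It is not Octavia or rsyslog code: the helper names (free_port, send_udp, send_tcp) and the sample log line are made up for the example, and closed localhost ports stand in for a stopped controller. It only shows the raw socket behavior and leaves out the queuing and failover that rsyslog layers on top.

```python
import socket


def free_port(sock_type: int) -> int:
    """Ask the OS for a localhost port that is unused once this probe closes."""
    with socket.socket(socket.AF_INET, sock_type) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


def send_udp(message: bytes, port: int) -> int:
    """Send one datagram; returns the byte count even if nobody is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        return sender.sendto(message, ("127.0.0.1", port))


def send_tcp(message: bytes, port: int) -> bool:
    """Try to deliver over TCP; returns False when the receiver is down."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=2) as sender:
            sender.sendall(message)
            return True
    except OSError:
        return False


# Hypothetical syslog-style line; <134> encodes facility local0, severity info.
LOG_LINE = b"<134>amphora-haproxy: example flow log entry\n"

# Both ports are closed, standing in for a stopped controller.
udp_result = send_udp(LOG_LINE, free_port(socket.SOCK_DGRAM))
tcp_result = send_tcp(LOG_LINE, free_port(socket.SOCK_STREAM))

print(f"UDP send returned {udp_result} bytes, no error reported")
print(f"TCP delivery succeeded: {tcp_result}")
```

The UDP send reports success although nothing received the message, which matches the missing octavia-amphora.log in this report. The TCP attempt fails visibly, so a sender could queue or retry, at the cost of the extra connection state noted for level 2.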

Since these logs were considered "nice to have", we implemented level 1 reliability. This meets the criteria requested in BZ 1623977 with little impact to the cloud.

If there is a need for a higher level of reliability for these logs, we can move up levels, but that will require work and may be best entered as an additional RFE.

Comment 2 Omer Schwartz 2020-09-13 07:17:14 UTC
Makes sense to me. Thanks for the information, Michael. I am closing this bug.