2084950 – RHUIv4 does not function when RHUA is unavailable

Summary: RHUIv4 does not function when RHUA is unavailable

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Update Infrastructure for Cloud Providers
Classification:	Red Hat
Component:	CDS
Sub Component:
Version:	4.0.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	4.4.0
Target Release:	4.x
Assignee:	RHUI Bug List
QA Contact:	Radek Bíba
Docs Contact:
URL:
Whiteboard:
Depends On:	2027521
Blocks:

Reported:	2022-05-12 17:11 UTC by Liam Hopkins
Modified:	2023-05-03 14:56 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2027521
Environment:
Last Closed:	2023-05-03 14:56:19 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2023:2101	0	None	None	None	2023-05-03 14:56:50 UTC

Description Liam Hopkins 2022-05-12 17:11:01 UTC

+++ This bug was initially created as a clone of Bug #2027521 +++

Description of problem:

CDS->RHUA fallback does not function for yum/dnf clients.

Version-Release number of selected component (if applicable):

4.1


How reproducible:

Consistently


Steps to Reproduce:

1. Launch a RHUIv4 cluster, create entitlement certs, packages, subscribe client
2. Shut down the RHUA node
3. Attempt to perform any yum/dnf action on the client

Alternatively:

1. Launch a RHUIv4 cluster etc...
2. Attempt to access any content which would return a 404 and thus trigger a fallback to RHUA.

Actual results:

Failure to perform any yum/dnf action e.g. yum makecache

Output from client:
```
Red Hat Enterprise Linux 8 for x86_64 - BaseOS from RHUI (RPMs)                                       34  B/s | 4.1 kB     02:02    
Errors during downloading metadata for repository 'rhui-rhel-8-for-x86_64-baseos-rhui-rpms':
  - Curl error (28): Timeout was reached for https://cds.example.com/pulp/content/content/dist/rhel8/rhui/8/x86_64/baseos/os/repodata/repomd.xml [Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds]
Error: Failed to download metadata for repo 'rhui-rhel-8-for-x86_64-baseos-rhui-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
```

Expected results:

Successful yum/dnf operation

Additional info:

The nginx configuration on the CDS nodes first tries NFS and falls back to rhua-fetcher (really, an API call to Pulp on the RHUA node) for some content, and first tries rhua-fetcher and falls back to NFS for other content. However, this introduces a delay of more than 30 seconds before the fallback completes, which yum clients do not appear to tolerate. Using 'curl', we can request any content and it will *eventually* succeed, after a lengthy delay. Yum/dnf clients do not function.

We have worked around this by disabling this fallback behavior altogether:
https://github.com/GoogleCloudPlatform/compute-image-tools/pull/1924
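
The client error quoted above ("Less than 1000 bytes/sec transferred the last 30 seconds") matches libcurl's low-speed abort rule. The following is a minimal Python sketch of that rule, not libcurl's actual implementation. The function name `would_abort` and the one-sample-per-second model are assumptions made for illustration; the two thresholds are taken from the error message.

```python
from typing import Iterable

LOW_SPEED_LIMIT = 1000  # bytes/sec, value taken from the client error message
LOW_SPEED_TIME = 30     # seconds, value taken from the client error message


def would_abort(bytes_per_second: Iterable[int],
                limit: int = LOW_SPEED_LIMIT,
                window: int = LOW_SPEED_TIME) -> bool:
    """Return True if the rate stays below `limit` for `window` consecutive seconds.

    Simplified, hypothetical model: one rate sample per second of transfer time.
    """
    slow_seconds = 0
    for rate in bytes_per_second:
        if rate < limit:
            slow_seconds += 1
            if slow_seconds >= window:
                return True
        else:
            slow_seconds = 0  # a fast second resets the counter
    return False
```

Under this model, a response that stalls for 30 seconds without sending any data is aborted, while a plain curl call with no such limit configured simply keeps waiting. That would be consistent with curl eventually succeeding where yum/dnf gives up.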

Comment 1 Martin Minar 2022-05-13 08:36:21 UTC

Hello Liam,
can you confirm that you have symlinks generated for that path? E.g.:

ls -la /var/lib/rhui/remote_share/symlinks/pulp/content/content/dist/rhel8/rhui/8/x86_64/baseos/os/repodata/repomd.xml
lrwxrwxrwx. 1 root root 107 May 11 22:27 /var/lib/rhui/remote_share/symlinks/pulp/content/content/dist/rhel8/rhui/8/x86_64/baseos/os/repodata/repomd.xml -> /var/lib/rhui/remote_share/pulp3/artifact/0e/259277bb33a11457ee2ac45499147fbd4007fa1e0e737f86e34e49b5f7852b

Because I can reproduce the behaviour you are describing only if I delete that symlink and then disable the RHUA. When the symlink exists, the RHUA can be disabled.

The thing is, symlink generation can lag behind a repo sync. A cron job runs every 5 minutes to generate symlinks, and that generation can take some time if there are a lot of repos without symlinks (the same as when you run it manually). It skips repos that are already exported and those whose export is currently running, but big repos still take some time: judging by the logs, the rhel8 baseos rpm repo alone takes about 2 minutes.
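
The manual `ls -la` check above could also be scripted. This is a minimal sketch, assuming the symlink root shown in the listing; the function name `symlink_state` and the extra "dangling" state (a symlink whose target is gone) are illustrative additions, not part of RHUI.

```python
import os

# Root directory taken from the ls output above
SYMLINK_ROOT = "/var/lib/rhui/remote_share/symlinks"


def symlink_state(content_path: str, root: str = SYMLINK_ROOT) -> str:
    """Classify a content path as 'exported', 'dangling', or 'missing'."""
    full = os.path.join(root, content_path.lstrip("/"))
    if not os.path.islink(full):
        return "missing"
    # os.path.exists follows the link, so False means the target is gone
    return "exported" if os.path.exists(full) else "dangling"
```

For example, `symlink_state("pulp/content/content/dist/rhel8/rhui/8/x86_64/baseos/os/repodata/repomd.xml")` run on a CDS node would report whether that path has been exported yet.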

Comment 2 Liam Hopkins 2022-05-13 17:51:15 UTC

this symlink does exist. the issue here is that encountering 404s is a normal part of yum operations, but that any 404 with the cds->rhua fallback becomes a 30s+ timeout. what testing have you done on your end? are other people using rhuiv4 successfully?

Comment 3 Radek Bíba 2022-05-16 11:22:29 UTC

Test scenarios:

1) Repository is added, synced, but not exported
1.1) Nginx on the RHUA is inaccessible and HTTP requests are immediately rejected (e.g. with "connection refused")

-> yum/dnf on the client fails immediately, with HTTP 404

1.2) HTTP requests on the RHUA are dropped (IOW, no reply):

-> yum/dnf on the client times out

2) Repository is added, synced, exported
2.1) Nginx on the RHUA is inaccessible and HTTP requests are immediately rejected (e.g. with "connection refused")

-> content is taken from the CDS directly from the shared file system

2.2) HTTP requests are dropped (IOW, no reply):

-> ditto, content is served


I don't understand how the unavailability of nginx on the RHUA would affect the ability of the CDS to serve exported content. Only if the content isn't exported does the CDS contact the RHUA asking it for help. If the RHUA can't help, there's nothing the CDS can do, and has to issue HTTP 404 or give up after some time.
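
The four test scenarios above can be summarized as a small decision function. This is only a model of the behaviour described in this comment, not the CDS code; the names `Rhua` and `cds_outcome` are hypothetical, and the `OK` and `NOT_FOUND` cases are added for completeness as an assumption about a reachable RHUA.

```python
from enum import Enum


class Rhua(Enum):
    OK = "replies with the content"        # assumption: reachable RHUA, content found
    NOT_FOUND = "replies with 404"         # assumption: reachable RHUA, content unknown
    REFUSED = "connection refused"         # scenarios 1.1 and 2.1
    DROPPED = "no reply"                   # scenarios 1.2 and 2.2


def cds_outcome(exported: bool, rhua: Rhua) -> str:
    """Model what the client sees for one request."""
    if exported:
        # Exported content is served from the shared file system;
        # the RHUA is not involved.
        return "served from shared file system"
    if rhua is Rhua.OK:
        return "served via RHUA fallback"
    if rhua is Rhua.DROPPED:
        return "client times out"
    return "HTTP 404"
```

In this model the state of the RHUA only matters for content that has not been exported.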

To answer your last question, I'm not aware of cloud providers who shut down the RHUA node. It's meant to be up, and sync (and re-export) repos regularly.

Comment 4 Radek Bíba 2022-05-16 11:42:58 UTC

To clarify, when I write "content is taken.../served", it's what happens under the hood. From the perspective of the client VM, this doesn't matter; what matters is it's able to retrieve the content and pass it to the user.

Comment 5 Liam Hopkins 2022-05-16 16:20:25 UTC

I can retitle the bug if it is helpful. This is not about unavailability of nginx on the RHUA, this issue occurs even if RHUA is available. Any use of the RHUA fallback induces timeouts in yum/dnf clients: 'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds'.

Comment 6 Radek Bíba 2022-05-17 05:54:25 UTC

Is your CDS generally unable to connect to RHUA:443 due to some network configuration?

The use of the RHUA fallback is supposed to work just fine if this communication is possible. With the DEBUG log level on the CDS, one would then see something like:

2022-05-17 05:11:53,120 [26133] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:11:53,164 [26133] [DEBUG] https://rhua.example.com:443 "GET /pulp/content/content/dist/rhel8/rhui/8/x86_64/sap-solutions/os/Packages/c/compat-sap-c++-10-10.2.1-1.el8_2.x86_64.rpm HTTP/1.1" 200 541508
2022-05-17 05:11:53,164 [26133] [INFO] Retrieved from RHUA: https://rhua.example.com/pulp/content/content/dist/rhel8/rhui/8/x86_64/sap-solutions/os/Packages/c/compat-sap-c++-10-10.2.1-1.el8_2.x86_64.rpm
2022-05-17 05:11:53,178 [26133] [INFO] Created symlink: pulp/content/content/dist/rhel8/rhui/8/x86_64/sap-solutions/os/Packages/c/compat-sap-c++-10-10.2.1-1.el8_2.x86_64.rpm
    to: /var/lib/rhui/remote_share/pulp3/artifact/8a/7d3c1256b6b6a8917dd17ab677a6a9ceccb4fef5d55fa5143ca3bbe7546642

in /var/log/nginx/gunicorn-content_manager.log. It all takes a fraction of a second, so the delay is unnoticeable [*]. And this only happens when the content isn't symlinked. The RHUA isn't contacted at all if the symlinks exist. Or are you observing something else?

As for 404, I agree they aren't uncommon. I mean, they shouldn't normally happen if the repository configuration on the client is correct, but to simulate such an error, I could set the wrong (or any) version in /etc/yum/vars/releasever on the client VM, but still, as long as the CDS can contact the RHUA, the result is 404, *immediately*.

2022-05-17 05:46:11,776 [25958] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:46:11,794 [25958] [DEBUG] https://rhua.example.com:443 "GET /pulp/content/content/dist/rhel8/rhui/8.6/x86_64/sap/os/repodata/repomd.xml HTTP/1.1" 404 14

(repeated four times because dnf tries this four times)

If the RHUA doesn't reply, then and only then the timeout appears. Again, from the CDS' perspective:

2022-05-17 05:49:12,983 [31083] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:49:43,993 [31149] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:50:14,999 [25958] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:50:45,119 [31151] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
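
To make the spacing of these attempts explicit, the timestamps can be diffed. A minimal sketch using the four log lines above; the helper name `attempt_gaps` is an illustrative assumption.

```python
from datetime import datetime

# The four log lines quoted above
LOG_LINES = """\
2022-05-17 05:49:12,983 [31083] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:49:43,993 [31149] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:50:14,999 [25958] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:50:45,119 [31151] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
""".splitlines()


def attempt_gaps(lines):
    """Return the seconds elapsed between consecutive log entries."""
    # The first 23 characters hold the timestamp, e.g. "2022-05-17 05:49:12,983"
    stamps = [datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f") for line in lines]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    print(attempt_gaps(LOG_LINES))
```

The gaps come out at roughly 30 to 31 seconds each, which lines up with the 30-second window in the client's "Operation too slow" message.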

If curl is used instead of dnf, then the operation eventually completes, but the result is far from what could be called success. It's HTTP 502, and the body contains:

            <h3>The page you are looking for is temporarily unavailable.  Please try again later.</h3>

            <div class="alert">
                <h2>Website Administrator</h2>
                <div class="content">
                    <p>Something has triggered missing webpage on your
                    website. This is the default error page for
                    <strong>nginx</strong> that is distributed with
                    Red Hat Enterprise Linux.  It is located
                    <tt>/usr/share/nginx/html/50x.html</tt></p>

                    <p>You should customize this error page for your own
                    site or edit the <tt>error_page</tt> directive in
                    the <strong>nginx</strong> configuration file
                    <tt>/etc/nginx/nginx.conf</tt>.</p>

                    <p>For information on Red Hat Enterprise Linux, please visit the <a href="http://www.redhat.com/">Red Hat, Inc. website</a>. The documentation for Red Hat Enterprise Linux is <a href="http://www.redhat.com/docs/manuals/enterprise/">available on the Red Hat, Inc. website</a>.</p>

                </div>
            </div>




[*] Having said that, in our performance testing, we actually saw a major difference between the time required to retrieve about 1000 already symlinked RPMs and the time required when these symlinks had to be created on the fly.

Comment 7 Liam Hopkins 2022-05-17 15:33:43 UTC

> Is your CDS generally unable to connect to RHUA:443 due to some network configuration?

the rhua is available. it does eventually return results, either content or a 404, depending on what is requested

> the use of the RHUA fallback is supposed to work just fine if this communication is possible.

it functions, but the delay it introduces causes clients to complain: Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds

> If curl is used instead of dnf, then the operation eventually completes, but the result is far from what could be called success. It's HTTP 502, and the body contains:

i can get this issue to occur even with content that does exist. the curl will hang without outputting any data for some number of seconds, then eventually it will successfully complete with 200 code.

i am happy to get on a call about this if it will help clarify the issue we're seeing. the issue is that the fallback does not work under any circumstance because it introduces a long delay during which no data is returned, and clients do not tolerate that. clients consider it a timeout.

Comment 8 Radek Bíba 2022-05-19 14:41:09 UTC

Do you happen to know why the RHUA returns results eventually rather than immediately? That's the part we haven't observed yet. In our test and production environments, it replies immediately, unless access to it is blocked somehow.

Comment 9 Liam Hopkins 2022-05-19 17:13:03 UTC

Ah, this is useful information. I don't think we've done anything special with our setup - the CDS and RHUA nodes are in the same virtual network, they have direct connectivity. If this is something specific to us, is there a repro command to simulate the CDS->RHUA connectivity? I imagine something like a curl command I could run from a CDS node against the RHUA node. If curl is similarly slow, we'd know it was environmental. If not, we'd know it was something about nginx or resolution configuration.

Comment 10 Radek Bíba 2022-05-20 05:12:18 UTC

For details about the fallback, please see /usr/lib/python3.6/site-packages/rhui_cds_plugin/content_manager.py on the CDS. It's all a simple HTTP request for the given content path, just with the RHUA hostname. Here are the key lines from that script:

    uri = f'https://{rhua_hostname}/{filename}'
...
        r = requests.get(uri, stream=True, verify=ssl_ca_file)

Here's what I see:

[root@cds01 ~]# curl --cacert /etc/pki/rhui/certs/ca.crt https://rhua.example.com/pulp/content/content/dist/rhel8/rhui/8/x86_64/sap-solutions/os/repodata/repomd.xml

### absolutely no delay here ###

<?xml version="1.0" encoding="UTF-8"?>
...

Feel free to use any content path. For example, with /pulp/content/content/dist/rhel8/rhui/8/x86_64/baseos/os/repodata/repomd.xml, I get 404 because the BaseOS repo isn't available in this test environment.

It should only be necessary for the CDS to be able to resolve the RHUA hostname and connect to its port 443.
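
That requirement (resolve the RHUA hostname, connect to port 443) can be checked in isolation with a bounded timeout, so a firewall that silently drops packets shows up as a quick failure rather than a long hang. A minimal sketch using only the Python standard library; the function name `port_reachable` and the 5-second default are assumptions, not part of RHUI.

```python
import socket


def port_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts (dropped packets)
        return False
```

Running `port_reachable("rhua.example.com")` on a CDS node would exercise the same network path as the fallback, without involving nginx or TLS.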

Comment 11 Liam Hopkins 2022-05-20 05:35:19 UTC

yes, with this information we were able to discover the cause of the issue - it was actually a network connection failure from CDS to RHUA, but it was presenting as a long timeout when going over the fallback path. the firewall configuration was not permitting CDS->RHUA traffic for port 443, but was for other ports, leading us to believe there was good connectivity. this resolves the 'bug' experience where fallback doesn't work at all, and leaves us only with wanting to not have the fallback for reliability reasons. what is the risk if we proceed with the cds->rhua fallback disabled? is it only that there may be more 404s in the few minutes prior to symlink creation, or are there other risks?

Comment 12 Radek Bíba 2022-05-23 06:35:22 UTC

Glad to hear you were able to find the cause!

If/when the CDS is added on the RHUA, then yes - you'll only get 404s for content that hasn't been exported yet.

However, the registration of a new CDS in rhui-manager contains a connectivity check, which would fail like this:

~~~
The CDS will now be configured:

Checking that the RHUA services are reachable from the instance...

An unexpected error has occurred during the last operation.

Port 443 on the host rhua.example.com is not accessible from cds01.example.com.
~~~

So if this connection is blocked on purpose, it will be necessary to remove this check. It's in /usr/lib/python3.6/site-packages/rhui/tools/screens/instances.py, which uses /usr/share/rhui-tools/playbooks/check-port-connectivity.yml. Of course, if you deploy your CDS nodes outside rhui-manager, you needn't worry about this.

Comment 13 Liam Hopkins 2022-05-23 16:33:47 UTC

We don't use rhui-manager to add nodes; we would never have an engineer manually run a command on our production nodes to make a change to the service configuration such as adding a new CDS node. So if that's the main risk, we can proceed without the fallback snippet. Happy to close the bug, then.

Comment 14 Radek Bíba 2022-05-24 04:53:18 UTC

Yes, it's the only other risk, and it doesn't affect your environment.

Comment 15 Radek Bíba 2023-04-24 13:53:55 UTC

RHUI 4.4 will bring a configuration option to prevent the use of the RHUA fetcher.

Comment 20 errata-xmlrpc 2023-05-03 14:56:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHUI 4.4.0 release - Security Fixes, Bug Fixes, and Enhancements Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2101
