Bug 2084950
| Summary: | RHUIv4 does not function when RHUA is unavailable | | |
|---|---|---|---|
| Product: | Red Hat Update Infrastructure for Cloud Providers | Reporter: | Liam Hopkins <liamh> |
| Component: | CDS | Assignee: | RHUI Bug List <rhui-bugs> |
| Status: | CLOSED ERRATA | QA Contact: | Radek Bíba <rbiba> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.0.0 | CC: | ahumbe, gtanzill, jwarne, mminar, rbiba, rhui-bugs, sskracic, zmarano |
| Target Milestone: | 4.4.0 | Keywords: | Triaged |
| Target Release: | 4.x | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2027521 | Environment: | |
| Last Closed: | 2023-05-03 14:56:19 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2027521 | | |
| Bug Blocks: | | | |
Description
Liam Hopkins
2022-05-12 17:11:01 UTC
Hello Liam, can you confirm that you have symlinks generated for that path? E.g.:

~~~
ls -la /var/lib/rhui/remote_share/symlinks/pulp/content/content/dist/rhel8/rhui/8/x86_64/baseos/os/repodata/repomd.xml
lrwxrwxrwx. 1 root root 107 May 11 22:27 /var/lib/rhui/remote_share/symlinks/pulp/content/content/dist/rhel8/rhui/8/x86_64/baseos/os/repodata/repomd.xml -> /var/lib/rhui/remote_share/pulp3/artifact/0e/259277bb33a11457ee2ac45499147fbd4007fa1e0e737f86e34e49b5f7852b
~~~

I can reproduce the behaviour you are describing only if I delete that symlink and then disable the RHUA. When the symlink exists, the RHUA can be disabled. The thing is, symlink generation can lag behind a repo sync: a cron job runs every 5 minutes to generate symlinks, and that generation can take some time if there are many repos without symlinks yet (the same as when you run it manually). It skips repos that are already exported and those currently being processed, but big repos still take a while; looking at the logs, the RHEL 8 BaseOS RPM repo alone takes about 2 minutes.

This symlink does exist. The issue here is that encountering 404s is a normal part of yum operations, but with the CDS->RHUA fallback, any 404 becomes a 30+ second timeout. What testing have you done on your end? Are other people using RHUIv4 successfully?

Test scenarios:
1) Repository is added, synced, but not exported
1.1) Nginx on the RHUA is inaccessible and HTTP requests are immediately rejected (e.g. with "connection refused")
-> yum/dnf on the client fails immediately, with HTTP 404
1.2) HTTP requests on the RHUA are dropped (IOW, no reply):
-> yum/dnf on the client times out
2) Repository is added, synced, exported
2.1) Nginx on the RHUA is inaccessible and HTTP requests are immediately rejected (e.g. with "connection refused")
-> content is taken from the CDS directly from the shared file system
2.2) HTTP requests are dropped (IOW, no reply):
-> ditto, content is served
I don't understand how the unavailability of nginx on the RHUA would affect the ability of the CDS to serve exported content. Only if the content isn't exported does the CDS contact the RHUA for help. If the RHUA can't help, there's nothing the CDS can do; it has to issue HTTP 404 or give up after some time.
To answer your last question, I'm not aware of cloud providers who shut down the RHUA node. It's meant to be up, and sync (and re-export) repos regularly.
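For illustration, the serving logic described above amounts to a symlink check with an HTTP fallback to the RHUA. Below is a minimal sketch, not the actual rhui_cds_plugin code: the paths and hostname are the ones used in this thread, and the explicit 30-second timeout is an assumption added for clarity.

~~~
# Illustrative sketch of the serving decision described above -- not the actual
# rhui_cds_plugin code. Paths and the RHUA hostname follow the examples in this thread.
import os
import requests

SYMLINK_ROOT = "/var/lib/rhui/remote_share/symlinks"  # exported content lives here as symlinks
RHUA_HOST = "rhua.example.com"                        # RHUA hostname used in this thread
SSL_CA_FILE = "/etc/pki/rhui/certs/ca.crt"            # CA file used elsewhere in this thread

def serve(content_path: str):
    """Serve exported content from the shared file system, else fall back to the RHUA."""
    local = os.path.join(SYMLINK_ROOT, content_path.lstrip("/"))
    if os.path.exists(local):
        # Exported repo: the symlink points at a pulp3 artifact on the shared FS,
        # so the RHUA is never contacted.
        with open(local, "rb") as f:
            return 200, f.read()
    # Not exported yet: ask the RHUA for help. A miss is an immediate 404 if the
    # RHUA answers; if it does not answer, the client only sees a long stall.
    r = requests.get(f"https://{RHUA_HOST}/{content_path.lstrip('/')}",
                     stream=True, verify=SSL_CA_FILE, timeout=30)  # timeout is assumed here
    return r.status_code, r.raw
~~~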
To clarify, when I write "content is taken.../served", that's what happens under the hood. From the perspective of the client VM it doesn't matter; what matters is that it's able to retrieve the content and pass it to the user. I can retitle the bug if that's helpful.

This is not about unavailability of nginx on the RHUA; the issue occurs even when the RHUA is available. Any use of the RHUA fallback induces timeouts in yum/dnf clients: "Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds".

Is your CDS generally unable to connect to RHUA:443 due to some network configuration? The use of the RHUA fallback is supposed to work just fine if this communication is possible. With the DEBUG log level on the CDS, one would then see something like this in /var/log/nginx/gunicorn-content_manager.log:

~~~
2022-05-17 05:11:53,120 [26133] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:11:53,164 [26133] [DEBUG] https://rhua.example.com:443 "GET /pulp/content/content/dist/rhel8/rhui/8/x86_64/sap-solutions/os/Packages/c/compat-sap-c++-10-10.2.1-1.el8_2.x86_64.rpm HTTP/1.1" 200 541508
2022-05-17 05:11:53,164 [26133] [INFO] Retrieved from RHUA: https://rhua.example.com/pulp/content/content/dist/rhel8/rhui/8/x86_64/sap-solutions/os/Packages/c/compat-sap-c++-10-10.2.1-1.el8_2.x86_64.rpm
2022-05-17 05:11:53,178 [26133] [INFO] Created symlink: pulp/content/content/dist/rhel8/rhui/8/x86_64/sap-solutions/os/Packages/c/compat-sap-c++-10-10.2.1-1.el8_2.x86_64.rpm to: /var/lib/rhui/remote_share/pulp3/artifact/8a/7d3c1256b6b6a8917dd17ab677a6a9ceccb4fef5d55fa5143ca3bbe7546642
~~~

It all takes a fraction of a second, so the delay is unnoticeable [*]. And this only happens when the content isn't symlinked; the RHUA isn't contacted at all if the symlinks exist. Or are you observing something else?

As for 404s, I agree they aren't uncommon. They shouldn't normally happen if the repository configuration on the client is correct, but to simulate such an error I can set a wrong (or any) version in /etc/yum/vars/releasever on the client VM. Still, as long as the CDS can contact the RHUA, the result is a 404, *immediately*:

~~~
2022-05-17 05:46:11,776 [25958] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:46:11,794 [25958] [DEBUG] https://rhua.example.com:443 "GET /pulp/content/content/dist/rhel8/rhui/8.6/x86_64/sap/os/repodata/repomd.xml HTTP/1.1" 404 14
~~~

(repeated four times because dnf tries this four times)

If the RHUA doesn't reply, then and only then does the timeout appear. Again, from the CDS' perspective:

~~~
2022-05-17 05:49:12,983 [31083] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:49:43,993 [31149] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:50:14,999 [25958] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
2022-05-17 05:50:45,119 [31151] [DEBUG] Starting new HTTPS connection (1): rhua.example.com:443
~~~
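The difference between the two cases comes down to whether a plain HTTPS request to the RHUA returns at once or has to wait for a timeout. A minimal sketch of that check follows; the hostname, content path, and CA file are taken from this thread, and the explicit 30-second timeout is an assumption rather than the plugin's actual setting.

~~~
# Minimal sketch of the two failure modes described above -- illustrative only.
import requests

URI = ("https://rhua.example.com/pulp/content/content/dist/"
       "rhel8/rhui/8.6/x86_64/sap/os/repodata/repomd.xml")

try:
    # If the RHUA answers, a missing path comes back as an immediate 404.
    r = requests.get(URI, verify="/etc/pki/rhui/certs/ca.crt", timeout=30)
    print(r.status_code)  # e.g. 404, with no noticeable delay
except requests.exceptions.Timeout:
    # If requests to the RHUA are silently dropped, nothing is returned until the
    # timeout fires -- which is what the dnf client perceives as a stalled download.
    print("no reply from the RHUA within 30 seconds")
~~~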
If curl is used instead of dnf, then the operation eventually completes, but the result is far from what could be called success. It's HTTP 502, and the body contains:

~~~
<h3>The page you are looking for is temporarily unavailable. Please try again later.</h3>
<div class="alert">
<h2>Website Administrator</h2>
<div class="content">
<p>Something has triggered missing webpage on your website. This is the default error page for <strong>nginx</strong> that is distributed with Red Hat Enterprise Linux. It is located <tt>/usr/share/nginx/html/50x.html</tt></p>
<p>You should customize this error page for your own site or edit the <tt>error_page</tt> directive in the <strong>nginx</strong> configuration file <tt>/etc/nginx/nginx.conf</tt>.</p>
<p>For information on Red Hat Enterprise Linux, please visit the <a href="http://www.redhat.com/">Red Hat, Inc. website</a>. The documentation for Red Hat Enterprise Linux is <a href="http://www.redhat.com/docs/manuals/enterprise/">available on the Red Hat, Inc. website</a>.</p>
</div>
</div>
~~~

[*] Having said that, in our performance testing we actually saw a major difference between the time required to retrieve about 1,000 already-symlinked RPMs vs. the state where those symlinks had to be created on the fly.

> Is your CDS generally unable to connect to RHUA:443 due to some network configuration?

The RHUA is available. It does eventually return results, either content or a 404, depending on what is requested.

> The use of the RHUA fallback is supposed to work just fine if this communication is possible.

It functions, but the delay it introduces causes clients to complain: "Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds".

> If curl is used instead of dnf, then the operation eventually completes, but the result is far from what could be called success. It's HTTP 502, and the body contains:

I can get this issue to occur even with content that does exist. The curl will hang without outputting any data for some number of seconds, then eventually completes successfully with a 200 code. I am happy to get on a call about this if it will help clarify the issue we're seeing. The issue is that the fallback does not work under any circumstance, because clients do not tolerate the long delay without returned data that it introduces; they consider it a timeout.

Do you happen to know why the RHUA returns results eventually rather than immediately? That's the part we haven't observed yet. In our test and production environments, it replies immediately unless access to it is blocked somehow.

Ah, this is useful information. I don't think we've done anything special with our setup: the CDS and RHUA nodes are in the same virtual network and have direct connectivity. If this is something specific to us, is there a repro command to simulate the CDS->RHUA connectivity? I imagine a curl command I could run from a CDS node against the RHUA node. If curl is similarly slow, we'd know it was environmental. If not, we'd know it was something about the nginx or resolution configuration.

For details about the fallback, please see /usr/lib/python3.6/site-packages/rhui_cds_plugin/content_manager.py on the CDS. It's all a simple HTTP request for the given content path, just with the RHUA hostname. Here are the key lines from that script:

~~~
uri = f'https://{rhua_hostname}/{filename}'
...
r = requests.get(uri, stream=True, verify=ssl_ca_file)
~~~

Here's what I see:

~~~
[root@cds01 ~]# curl --cacert /etc/pki/rhui/certs/ca.crt https://rhua.example.com/pulp/content/content/dist/rhel8/rhui/8/x86_64/sap-solutions/os/repodata/repomd.xml
### absolutely no delay here ###
<?xml version="1.0" encoding="UTF-8"?>
...
~~~

Feel free to use any content path. For example, with /pulp/content/content/dist/rhel8/rhui/8/x86_64/baseos/os/repodata/repomd.xml, I get a 404 because the BaseOS repo isn't available in this test environment. It should only be necessary for the CDS to be able to resolve the RHUA hostname and connect to its port 443.
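To answer the request for a repro command more directly, the plugin's request can also be mirrored in a few lines of Python. This is an illustrative stand-alone sketch, not part of RHUI; the hostname, content path, and CA file are the ones from the curl example above, and the timing wrapper is an addition for diagnosis.

~~~
# Stand-alone repro that mirrors the requests.get() call quoted above -- illustrative,
# not part of RHUI. Run it from a CDS node; a slow first byte points at the environment.
import time
import requests

rhua_hostname = "rhua.example.com"          # RHUA host from this thread
ssl_ca_file = "/etc/pki/rhui/certs/ca.crt"  # CA file from the curl example above
filename = ("pulp/content/content/dist/rhel8/rhui/8/"
            "x86_64/sap-solutions/os/repodata/repomd.xml")

uri = f"https://{rhua_hostname}/{filename}"
start = time.monotonic()
r = requests.get(uri, stream=True, verify=ssl_ca_file)  # same call as content_manager.py
first_chunk = next(r.iter_content(chunk_size=8192), b"")
print(f"{r.status_code} after {time.monotonic() - start:.2f}s, "
      f"{len(first_chunk)} bytes in the first chunk")
~~~

If this stalls the same way curl does, the slowness is environmental rather than anything in the CDS plugin.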
Yes, with this information we were able to discover the cause of the issue: it was actually a network connection failure from the CDS to the RHUA, but it presented as a long timeout when going over the fallback path. The firewall configuration was not permitting CDS->RHUA traffic on port 443, but was permitting it on other ports, leading us to believe there was good connectivity. This resolves the "bug" experience where the fallback doesn't work at all, and leaves us only with wanting to disable the fallback for reliability reasons. What is the risk if we proceed with the CDS->RHUA fallback disabled? Is it only that there may be more 404s in the few minutes before symlink creation, or are there other risks?

Glad to hear you were able to find the cause! If/when the CDS is added on the RHUA, then yes: you'll only get 404s for content that hasn't been exported yet. However, the registration of a new CDS in rhui-manager contains a connectivity check, which would fail like this:

~~~
The CDS will now be configured:
Checking that the RHUA services are reachable from the instance...
An unexpected error has occurred during the last operation.
Port 443 on the host rhua.example.com is not accessible from cds01.example.com.
~~~

So if this connection is blocked on purpose, it will be necessary to remove this check. It's in /usr/lib/python3.6/site-packages/rhui/tools/screens/instances.py, which uses /usr/share/rhui-tools/playbooks/check-port-connectivity.yml. Of course, if you deploy your CDS nodes outside rhui-manager, you needn't worry about this.

We don't use rhui-manager to add nodes; we would never have an engineer manually run a command on our production nodes to change the service configuration, such as adding a new CDS node. So if that's the main risk, we can proceed without the fallback snippet.

Happy to close the bug, then. Yes, it's the only other risk, and it doesn't affect your environment. RHUI 4.4 will bring a configuration option to prevent the use of the RHUA fetcher.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHUI 4.4.0 release - Security Fixes, Bug Fixes, and Enhancements Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2101
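For completeness, the connectivity check mentioned above boils down to confirming that TCP port 443 on the RHUA is reachable from the CDS. A minimal equivalent sketch is shown below; it is illustrative only, not the actual check-port-connectivity.yml playbook, with hostnames taken from this thread.

~~~
# Equivalent of the port-443 reachability test described above -- illustrative only,
# not the actual check-port-connectivity.yml playbook used by rhui-manager.
import socket

RHUA_HOST = "rhua.example.com"  # hostname from this thread
PORT = 443

try:
    with socket.create_connection((RHUA_HOST, PORT), timeout=10):
        print(f"Port {PORT} on {RHUA_HOST} is reachable")
except OSError as exc:
    # A firewall that drops the traffic shows up here as a timeout; a reject shows
    # up as "connection refused" -- the distinction that hid the problem above.
    print(f"Port {PORT} on {RHUA_HOST} is NOT reachable: {exc}")
~~~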