Bug 1850779
Summary: | azure-lb: Resource fails intermittently due to nc output redirection to pidfile [rhel-7.9.z] | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Reid Wahl <nwahl> | ||||
Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> | ||||
Status: | CLOSED ERRATA | QA Contact: | Brandon Perkins <bperkins> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 7.8 | CC: | agk, bperkins, cfeist, cluster-maint, cluster-qe, cnewsom, fdinitto, jreznik, oalbrigt, phfox | ||||
Target Milestone: | rc | Keywords: | ZStream | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | resource-agents-4.1.1-61.el7_9.2 | Doc Type: | If docs needed, set a value | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | 1850778 | ||||||
: | 1876964 1876965 1876966 | Environment: | |||||
Last Closed: | 2020-11-10 12:56:47 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 1850778 | ||||||
Bug Blocks: | 1876964, 1876965, 1876966 | ||||||
Attachments: |
Description
Reid Wahl
2020-06-24 21:06:05 UTC
Created attachment 1698711
Example binary pid file
I forgot to mention a pretty important point. When this happens, the start fails afterward. That's because lb_start()'s call to lb_monitor() returns failure (for the same reason the monitor failed in the first place), and so lb_start() spawns a new `nc` process. But the old `nc` process is still running and using the configured port.

~~~
Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]: notice: Result of stop operation for nc_PF2_02 on ADZW2RHNAF400: 0 (ok)
Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]: notice: nc_PF2_02_start_0:777:stderr [ Ncat: bind to :::62502: Address already in use. QUITTING. ]
Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]: notice: nc_PF2_02_start_0:777:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 91: kill: (818) - No such process ]
Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]: notice: Result of start operation for nc_PF2_02 on ADZW2RHNAF400: 1 (unknown error)
~~~

If the issue has already happened on one node and then happens on another node before the failure is manually cleaned up, the resource won't recover automatically (due to start failures on each node).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Low: resource-agents security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5004
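The failure sequence in the description (the monitor cannot trust the pid file, so the start spawns a second `nc` while the first still holds the port) can be illustrated with a small sanity check on the pid file. This is only a sketch under stated assumptions: `pidfile_is_valid` is a hypothetical helper name, it is not taken from the azure-lb agent, and it is not the fix shipped in the advisory. It assumes the pid file should contain exactly one decimal PID, and it treats anything else (for example, binary `nc` output) as invalid.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the azure-lb resource agent.
# Returns 0 only if the file holds a single positive decimal PID
# that belongs to a live process; returns non-zero otherwise.
pidfile_is_valid() {
    pidfile=$1

    # A missing pid file can never be valid.
    [ -f "$pidfile" ] || return 1

    # Drop whitespace and NUL bytes; any other stray byte (such as
    # redirected nc output) stays in $pid and is rejected below.
    pid=$(tr -d '[:space:]\000' < "$pidfile" 2>/dev/null)

    # Reject empty content and anything containing a non-digit.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Reject 0, which kill would interpret as "the whole process group".
    [ "$pid" -gt 0 ] 2>/dev/null || return 1

    # Signal 0 only tests whether the process exists and is signalable.
    kill -0 "$pid" 2>/dev/null
}
```

A caller could run `pidfile_is_valid /path/to/pidfile` before deciding whether to spawn a new `nc` process; the path here is a placeholder. A check like this only detects a corrupted pid file. It does not locate the old `nc` process that is still bound to the port, which is why the start in the log above still fails.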