Bug 2155453
| Summary: | fence_ibm_powervs fencing agent performance enhancements needed (RHEL8) |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | fence-agents |
| Version: | 8.4 |
| Hardware: | ppc64le |
| OS: | Linux |
| Status: | MODIFIED |
| Severity: | low |
| Priority: | unspecified |
| Target Milestone: | rc |
| Target Release: | 8.9 |
| Fixed In Version: | fence-agents-4.2.1-119.el8 |
| Doc Type: | If docs needed, set a value |
| Reporter: | Andreas Schauberer <andreas.schauberer> |
| Assignee: | Oyvind Albrigtsen <oalbrigt> |
| QA Contact: | Brandon Perkins <bperkins> |
| CC: | bperkins, cfeist, cluster-maint, fdanapfe, ksatarin |
| Keywords: | Triaged |
| Type: | Bug |
| Cloned To: | 2221643 |
| Bug Blocks: | 2221643 |
Description (Andreas Schauberer, 2022-12-21 10:00:55 UTC)
We strongly recommend against using cycle because we've had problems in the past with the cycle/reboot call failing to succeed but not notifying the cluster. With off/on calls the cluster runs a check to confirm the node is powered off, giving us higher confidence that we won't have to deal with a split-brain situation. That being said, I don't think using the 'onoff' method should double recovery times (assuming the actual API calls happen quickly). How long does it take for the fencing agent to just power off the node? Is there a large delay between the API call and when the node actually powers off? Also, when using the cycle method, does the API return immediately, before the node is actually restarted? That would make things appear to speed up, but could be dangerous: the cluster thinks the node has been powered off while it is actually still running.

Hi, I just returned from vacation. Please give me another week to provide the data showing the doubled recovery time of "onoff" compared to "cycle", since each API call takes from 10 seconds up to 1 minute.

I executed 36 HANA failover tests and all failovers were successful. This leads me to the conclusion that this is a rock-solid HANA HA solution. Nevertheless, my tests show that the fencing agent can be improved to achieve faster failover times. The biggest difference is the number of REST API calls in the GA agent compared to the new proposed agent. Below are the failover times for the GA fence_ibm_powervs (extension .org) compared to the failover times for the new proposed fence_ibm_powervs (extension .mod).

Change history for the proposed fence_ibm_powervs.mod:
- For action=status, one PowerVS REST API call less is used, so execution time is half of the original time.
- For action=reboot (cycle), now only the PowerVS REST API action HARD_REBOOT is called. When this call returns OK, it is safe to say that the node was stopped.
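The difference between the two reboot strategies can be sketched as follows. This is an illustrative simulation only, not the attached agent code: the endpoint paths and the `post()`/`get()` helpers are hypothetical stand-ins for the PowerVS REST calls. It shows why a single HARD_REBOOT round trip replaces the stop/verify/start sequence of the onoff method:

```python
# Illustrative sketch (not the actual fence_ibm_powervs code): compare the
# number of simulated REST round trips needed by cycle vs. onoff reboots.

API_LOG = []  # records each simulated REST call

def post(path, body):
    API_LOG.append(("POST", path))
    return {"ok": True}

def get(path):
    API_LOG.append(("GET", path))
    return {"status": "SHUTOFF"}  # simulation: the stop always succeeds

def reboot_cycle(instance):
    # One call: HARD_REBOOT power-cycles the LPAR; per the change above,
    # a successful return means the node was stopped before restarting.
    return post(f"/pvm-instances/{instance}/action", {"action": "hard-reboot"})["ok"]

def reboot_onoff(instance):
    # Classic off/on: stop, confirm the node reports SHUTOFF, start again.
    post(f"/pvm-instances/{instance}/action", {"action": "immediate-shutdown"})
    while get(f"/pvm-instances/{instance}")["status"] != "SHUTOFF":
        pass  # in the simulation the first poll already reports SHUTOFF
    post(f"/pvm-instances/{instance}/action", {"action": "start"})
    return True

reboot_cycle("lpar-1")
cycle_calls = len(API_LOG)
API_LOG.clear()
reboot_onoff("lpar-1")
onoff_calls = len(API_LOG)
print(cycle_calls, onoff_calls)  # cycle needs far fewer REST round trips
```

With each PowerVS call taking 10 seconds to a minute, cutting the number of round trips directly cuts fencing time, which matches the measured reboot numbers below.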
Three failover scenarios were tested, measuring recovery times for a HANA client connected to the current HANA primary node. Time to recover (in seconds) is calculated as the time between the last successful client call to the old primary node and the first successful client call to the new primary node.

Failover scenarios (time to recover avg/min/max):
- Scenario: HDB primary cmd "HDB kill -9"
  - fence_ibm_powervs.mod: avg 61, min 24, max 81 sec
  - fence_ibm_powervs.org: avg 68, min 51, max 83 sec
- Scenario: HDB primary "LPAR immediate shutdown"
  - fence_ibm_powervs.mod: avg 60, min 41, max 66 sec
  - fence_ibm_powervs.org: avg 98, min 64, max 135 sec
- Scenario: HDB primary "OS kernel panic"
  - fence_ibm_powervs.mod: avg 54, min 30, max 89 sec
  - fence_ibm_powervs.org: avg 120, min 89, max 128 sec

Four different fencing actions were tested. Execution time (in seconds) is calculated as the time the fencing agent needs for all REST API calls required by one specific fencing action.

Fencing action scenarios (number of PowerVS REST API calls) (execution time avg/min/max):
- Scenario: action=status
  - "fence_ibm_powervs.mod -o status" (2 API calls): avg 12, min 6, max 28 sec
  - "fence_ibm_powervs.org -o status" (3 API calls): avg 24, min 10, max 39 sec
- Scenario: action=off
  - "fence_ibm_powervs.mod -o off" (4 API calls): avg 44, min 28, max 57 sec
  - "fence_ibm_powervs.org -o off" (6 API calls): avg 81, min 74, max 89 sec
- Scenario: action=on
  - "fence_ibm_powervs.mod -o on" (4 API calls): avg 50, min 24, max 73 sec
  - "fence_ibm_powervs.org -o on" (6 API calls): avg 74, min 48, max 94 sec
- Scenario: action=reboot
  - "fence_ibm_powervs.mod -o reboot" (3 API calls): avg 12, min 9, max 19 sec
  - "fence_ibm_powervs.org -o reboot" (9 API calls): avg 95, min 71, max 132 sec

Test setup:
- 2-node Pacemaker cluster using the RHEL HA Add-On SAP HANA HSR policy
- Third system running an SQL client to measure failover times with the script:

```shell
while [ 1=1 ]; do date; /usr/sap/hdbclient/hdbsql -n <IP-Address> -i 00 -u SYSTEM -p <pw> 'select HARDWARE_KEY,SYSTEM_ID,EXPIRATION_DATE,PERMANENT,VALID from M_LICENSE'; sleep 2; done
```

Nice. Can you add the modified agent to the bz, so I can see your changes?

Created attachment 1942698 [details]
enhanced fence_ibm_powervs.py
enhanced fence_ibm_powervs.py:
- For action=status, one PowerVS REST API call less is now used, so execution time is half of the original time. The original code always triggered a second status call to get the list of all LPAR instances in the workspace; this is now done only if the first status call fails.
- For action=reboot (method=cycle), now only the PowerVS REST API action HARD_REBOOT is called. When this call returns OK, it is safe to say that the node was stopped. The original code had no cycle method implemented.
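The status fast path described above can be sketched like this. It is a minimal illustration, not the attached agent code: `get_instance()` and `list_instances()` are hypothetical helpers standing in for the real REST calls, and the in-memory `INSTANCES` dict replaces the PowerVS workspace:

```python
# Sketch of the status optimization: query the single LPAR directly and
# fall back to listing the whole workspace only when that call fails.

CALLS = []  # records each simulated REST call

INSTANCES = {"lpar-1": "ACTIVE", "lpar-2": "SHUTOFF"}

def get_instance(instance_id):
    CALLS.append("get")
    if instance_id not in INSTANCES:
        raise KeyError(instance_id)
    return {"status": INSTANCES[instance_id]}

def list_instances():
    CALLS.append("list")
    return [{"id": k, "status": v} for k, v in INSTANCES.items()]

def get_power_status(instance_id):
    # Fast path: one REST call instead of an unconditional two.
    try:
        return get_instance(instance_id)["status"]
    except KeyError:
        # Fallback: scan all LPAR instances only if the direct call fails.
        for inst in list_instances():
            if inst["id"] == instance_id:
                return inst["status"]
        return "unknown"

print(get_power_status("lpar-1"), len(CALLS))  # fast path: a single call
```

In the common case the fallback never runs, which is why the measured status time halves.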
Looks good to me, but we should keep the cycle method optional, as the onoff method ensures the device has been fully rebooted. Feel free to make an upstream PR at https://github.com/ClusterLabs/fence-agents unless you want me to make one for you.

I plan to open the upstream PR this week.

Any updates on this?

Opened pull request #542 for these changes: https://github.com/ClusterLabs/fence-agents/pull/542

All review comments resolved in PR #542.
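Keeping cycle opt-in, as the review comment suggests, can be sketched as follows. This is a minimal illustration assuming a `--method` option that mirrors the standard fence-agents `method` parameter; the `do_cycle`/`do_onoff` functions are hypothetical stand-ins for the real agent actions:

```python
# Sketch: default to the safer onoff reboot unless the administrator
# explicitly selects method=cycle. Not the actual agent option parsing.
import argparse

def do_cycle(instance):
    # stand-in for issuing a single HARD_REBOOT
    return f"HARD_REBOOT {instance}"

def do_onoff(instance):
    # stand-in for the stop / verify-off / start sequence
    return f"stop+verify+start {instance}"

def reboot(instance, argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--method", choices=["onoff", "cycle"],
                        default="onoff")  # onoff stays the safe default
    opts = parser.parse_args(argv)
    if opts.method == "cycle":
        return do_cycle(instance)
    return do_onoff(instance)

print(reboot("lpar-1", []))                       # safe default
print(reboot("lpar-1", ["--method", "cycle"]))    # opt-in fast path
```

The default preserves the split-brain protection of off/on verification, while clusters that accept the trade-off can opt into the faster cycle path.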