Bug 2008298

Summary:

Invalid values of some hacluster metrics on s390x

Product:

Red Hat Enterprise Linux 8

Reporter:

Jan Kurik <jkurik>

Component:

pcp

Assignee:

Nathan Scott <nathans>

Status:

CLOSED ERRATA

QA Contact:

Jan Kurik <jkurik>

Severity:

unspecified

Docs Contact:

Apurva Bhide <abhide>

Priority:

unspecified

Version:

8.6

CC:

agerstmayr, jkurik, nathans, pevans

Target Milestone:

Keywords:

Bugfix, Triaged

Target Release:

8.6

Flags:

pm-rhel: mirror+

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

pcp-5.3.4-1.el8

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2022-05-10 13:31:13 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Output of the failed test	none

Description Jan Kurik 2021-09-27 20:07:26 UTC

Created attachment 1826771
Output of the failed test

Description of problem:
Testing the hacluster PMDA on the s390x platform shows unexpected values in the following metrics:

* ha_cluster.corosync.member_votes.node_id
* ha_cluster.drbd.connections_sent


Version-Release number of selected component (if applicable):
* pcp-5.3.3-1.el8


How reproducible:
* Always on s390x
* Possibly on other arches

Steps to Reproduce:
1. Install pcp-5.3.3-1.el8 on the latest RHEL-8.6
2. Run test #1897 from the pcp testsuite

Actual results:
The test fails with the following output:

1897 - output mismatch (see 1897.out.bad)
19,20c19,20
<     inst [0 or "node-1"] value 1
<     inst [1 or "node-2"] value 2
---
>     inst [0 or "node-1"] value 0
>     inst [1 or "node-2"] value 0
196c196
<     inst [0 or "drbd1:1"] value 1888160
---
>     inst [0 or "drbd1:1"] value 8109585449615360


Expected results:
The test passes.

Additional info:
Converting the valid and invalid values of the ha_cluster.drbd.connections_sent metric (see 'Actual results' above) to hexadecimal indicates a memory alignment (or big/little endian) issue, caused by an improper conversion between 32-bit and 64-bit values:

8109585449615360 = 0x1CCFA000000000
1888160 = 0x1CCFA0
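
The relationship between the two values can be checked mechanically: the invalid value is the valid one moved into the upper 32 bits of a 64-bit word. A minimal sketch of that check follows; the function names are made up for illustration and are not part of pcp:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when `bad` is exactly `good` shifted
 * into the upper 32 bits of a 64-bit word, 0 otherwise. */
int is_shifted_by_32(uint64_t good, uint64_t bad)
{
    return good <= UINT32_MAX && bad == (good << 32);
}

/* The two values of ha_cluster.drbd.connections_sent from test 1897. */
void check_reported_values(void)
{
    assert(is_shifted_by_32(UINT64_C(0x1CCFA0), UINT64_C(0x1CCFA000000000)));
    assert(is_shifted_by_32(UINT64_C(1888160), UINT64_C(8109585449615360)));
}
```

A shift of exactly 32 bits is what one would expect if a 32-bit quantity ends up in the most significant half of a 64-bit slot, which is where the first four bytes sit on a big-endian machine.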

Comment 1 Nathan Scott 2021-09-28 04:51:19 UTC

Paul, any ideas as to what might be causing these issues on big endian platforms?  Thanks!

Comment 2 Jan Kurik 2021-09-28 17:59:31 UTC

I was looking into the source code of the PMDA and I found some mismatches in the type definitions.

The file "pmda.c" contains the following code:

<snip pmda.c>
        { .m_desc = {
                PMDA_PMID(CLUSTER_DRBD_PEER_DEVICE, DRBD_PEER_DEVICE_CONNECTIONS_RECEIVED),
                PM_TYPE_U64, DRBD_PEER_DEVICE_INDOM, PM_SEM_INSTANT,
                PMDA_PMUNITS(0,0,1,PM_SPACE_KBYTE,0,PM_COUNT_ONE) } },
        { .m_desc = {
                PMDA_PMID(CLUSTER_DRBD_PEER_DEVICE, DRBD_PEER_DEVICE_CONNECTIONS_SENT),
                PM_TYPE_U64, DRBD_PEER_DEVICE_INDOM, PM_SEM_INSTANT,
                PMDA_PMUNITS(0,0,1,PM_SPACE_KBYTE,0,PM_COUNT_ONE) } },
        { .m_desc = {
                PMDA_PMID(CLUSTER_DRBD_PEER_DEVICE, DRBD_PEER_DEVICE_CONNECTIONS_PENDING),
                PM_TYPE_U32, DRBD_PEER_DEVICE_INDOM, PM_SEM_INSTANT,
                PMDA_PMUNITS(0,0,1,0,0,PM_COUNT_ONE) } },
        { .m_desc = {
                PMDA_PMID(CLUSTER_DRBD_PEER_DEVICE, DRBD_PEER_DEVICE_CONNECTIONS_UNACKED),
                PM_TYPE_U32, DRBD_PEER_DEVICE_INDOM, PM_SEM_INSTANT,
                PMDA_PMUNITS(0,0,1,0,0,PM_COUNT_ONE) } },
</snip>

However, in "drbd.h" the data types are defined differently (the 32-bit and 64-bit widths are swapped relative to "pmda.c"):
<snip drbd.h>
        uint32_t connections_received;
        uint32_t connections_sent;
        uint64_t connections_pending;
        uint64_t connections_unacked;
</snip>



Similarly, for the "ha_cluster.corosync.member_votes.node_id" metric, the data types differ between the "pmda.c" and "corosync.h" files:

<snip pmda.c>
        { .m_desc = {
                PMDA_PMID(CLUSTER_COROSYNC_NODE, COROSYNC_MEMBER_VOTES_NODE_ID),
                PM_TYPE_U32, COROSYNC_NODE_INDOM, PM_SEM_INSTANT,
                PMDA_PMUNITS(0,0,1,0,0,PM_COUNT_ONE) } },
</snip>

<snip corosync.h>
struct member_votes {
        uint32_t        votes;
        uint8_t         local;
        uint64_t        node_id;
};
</snip>


Due to time constraints I have not yet tried modifying the code and testing it with matching data types; however, IMO this is the core of the issue.
If I am mistaken, then feel free to correct me :-)

Comment 3 Nathan Scott 2021-09-29 00:44:20 UTC

> If I am mistaken, then feel free to correct me :-)

Those are exactly the sorts of places to look at, Jan.  The other place where things can go wrong is in the fetchCallback routine, where we copy into the pmAtomValue union field of each type. If the wrong field (ll, ull, l, ul) is used, truncation or sign extension can result.
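
To illustrate how a width mismatch could produce the values seen in test 1897, here is a minimal sketch that simulates a big-endian 8-byte value slot with an explicit byte array, so it behaves the same on any host. The be_slot type and all function names are hypothetical stand-ins, not pcp code, and the sketch assumes the unused bytes of the slot are zero. I have not checked which union member the PMDA's fetch code actually writes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte value slot, laid out the way a big-endian host lays
 * out a union whose 32-bit and 64-bit members both start at offset 0. */
typedef struct { unsigned char bytes[8]; } be_slot;

/* Store a 32-bit value big-endian into the first four bytes. */
static void be_store_u32(be_slot *s, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        s->bytes[i] = (unsigned char)(v >> (8 * (3 - i)));
}

/* Store a 64-bit value big-endian into all eight bytes. */
static void be_store_u64(be_slot *s, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        s->bytes[i] = (unsigned char)(v >> (8 * (7 - i)));
}

/* Read the first four bytes as a big-endian 32-bit value. */
static uint32_t be_load_u32(const be_slot *s)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v = (v << 8) | s->bytes[i];
    return v;
}

/* Read all eight bytes as a big-endian 64-bit value. */
static uint64_t be_load_u64(const be_slot *s)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | s->bytes[i];
    return v;
}

/* A 32-bit field is written, but the metric is read back as 64-bit. */
uint64_t widen_mismatch(uint32_t field)
{
    be_slot s;
    memset(&s, 0, sizeof s);   /* assumption: remaining bytes are zero */
    be_store_u32(&s, field);
    return be_load_u64(&s);
}

/* A 64-bit field is written, but the metric is read back as 32-bit. */
uint32_t narrow_mismatch(uint64_t field)
{
    be_slot s;
    be_store_u64(&s, field);
    return be_load_u32(&s);
}

/* Compare against the values reported for test 1897. */
void check_reported_values(void)
{
    assert(widen_mismatch(1888160u) == UINT64_C(8109585449615360));
    assert(narrow_mismatch(1) == 0);   /* node_id 1 reads back as 0 */
    assert(narrow_mismatch(2) == 0);   /* node_id 2 reads back as 0 */
}
```

Under these assumptions both symptoms fall out of the same layout: connections_sent lands in the upper half of the 64-bit value, and node_id is read from the upper (zero) half of a small 64-bit value. On a little-endian host the low-order bytes come first, which would explain why the same mismatch can go unnoticed there.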

Comment 5 Nathan Scott 2021-09-30 03:15:53 UTC

Merged upstream (Paul, can you also review?  LGTM).

commit cf5aefe663ba48ef0848290a1d5b51850c336702
Author: Jan Kurik <jkurik>
Date:   Wed Sep 29 18:26:42 2021 +0200

    Fix of bz2008298
    
    Fix of datatypes for ha_cluster.corosync.member_votes.node_id and
    ha_cluster.drbd.connections_* metrics.

Comment 6 Paul Evans 2021-09-30 09:33:13 UTC

Hi,

I can confirm the changes look good to me as well (ACK). I am not sure how the mismatches happened there. I have double-checked every other type definition and they look correct.

Thanks Jan for the fix!

Cheers,

Paul

Comment 11 errata-xmlrpc 2022-05-10 13:31:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1765