Bug 2257395

Summary: [RHOSP 16.2] "ipmi/main" plugin read error in collectd container
Product: Red Hat OpenStack
Reporter: Sukhendu Kar <sukar>
Component: openstack-tripleo-heat-templates
Assignee: OSP Team <rhos-maint>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Priority: low
Version: 16.2 (Train)
CC: abhijadh, dhill, ebarrera, jbadiapa, jjoyce, jveiraca, lars, mariel, mburns, mmagr, mrunge, pweeks, ykulkarn
Target Milestone: async
Keywords: Triaged, ZStream
Target Release: 16.2 (Train on RHEL 8.4)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20230808225222.el8ost python-paunch-5.5.2-2.20240522085137.a0ae7ec.el8ost
Doc Type: If docs needed, set a value
Clones: 2274010
Last Closed: 2025-01-09 14:55:20 UTC
Type: Bug
Bug Blocks: 2274010

Comment 5 David Hill 2024-01-11 22:02:47 UTC
[2024-01-09 14:09:37] ipmi plugin: Legacy configuration found! Please update your config file.
[2024-01-09 14:09:37] plugin_load: plugin "load" successfully loaded.
[2024-01-09 14:09:37] plugin_load: plugin "memory" successfully loaded.
[2024-01-09 14:09:37] plugin_load: plugin "python" successfully loaded.
[2024-01-09 14:09:37] plugin_load: plugin "thermal" successfully loaded.
[2024-01-09 14:09:37] plugin_load: plugin "unixsock" successfully loaded.
[2024-01-09 14:09:37] plugin_load: plugin "uptime" successfully loaded.
[2024-01-09 14:09:37] plugin_load: plugin "virt" successfully loaded.
[2024-01-09 14:09:37] plugin_load: plugin "vmem" successfully loaded.
[2024-01-09 14:09:37] UNKNOWN plugin: plugin_get_interval: Unable to determine Interval from context.
[2024-01-09 14:09:37] plugin_load: plugin "libpodstats" successfully loaded.
[2024-01-09 14:09:37] ipmi plugin: ipmi_smi_setup_con failed for `main`: OS: Operation not permitted <============================================================================
[2024-01-09 14:09:37] ipmi plugin: c_ipmi_thread_init failed.
[2024-01-09 14:09:37] virt plugin: reader virt-0 initialized
[2024-01-09 14:09:37] Initialization complete, entering read-loop.
[2024-01-09 14:09:37] ipmi plugin: c_ipmi_read: I'm not active, returning false.
[2024-01-09 14:09:37] read-function of plugin `ipmi/main' failed. Will suspend it for 120.000 seconds.
[2024-01-09 14:11:37] ipmi plugin: c_ipmi_read: I'm not active, returning false.
[2024-01-09 14:11:37] read-function of plugin `ipmi/main' failed. Will suspend it for 240.000 seconds.
[2024-01-09 14:15:37] ipmi plugin: c_ipmi_read: I'm not active, returning false.
[2024-01-09 14:15:37] read-function of plugin `ipmi/main' failed. Will suspend it for 480.000 seconds.

Could SELinux be blocking this?
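
The suspend times in the log above (120, 240, then 480 seconds) double after each failed read of `ipmi/main`. A minimal sketch of that pattern follows. The function name and the `cap_s` ceiling are hypothetical and only illustrate the doubling; they are not taken from the collectd source.

```python
def suspend_schedule(first_suspend_s, failures, cap_s=86400.0):
    """Return the suspend time after each consecutive read failure.

    Assumption: the suspend time doubles after every failure and is
    clamped at cap_s. cap_s is a hypothetical ceiling, not a value
    taken from the log.
    """
    schedule = []
    current = float(first_suspend_s)
    for _ in range(failures):
        schedule.append(min(current, cap_s))
        current *= 2
    return schedule


if __name__ == "__main__":
    # 120 seconds is the first suspend time reported in the log.
    for n, seconds in enumerate(suspend_schedule(120.0, 5), start=1):
        print(f"failure {n}: suspend for {seconds:.3f} seconds")
```

If the pattern holds, the plugin is retried less and less often, so after the underlying permission problem is fixed a container restart is probably the quickest way to get readings again.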

Comment 7 David Hill 2024-01-12 13:07:54 UTC
The call that is failing is in OpenIPMI:
~~~
ipmi_smi_setup_con(int          if_num,
                   os_handler_t *handlers,
                   void         *user_data,
                   ipmi_con_t   **new_con)
{
    int err;

    if (!handlers->add_fd_to_wait_for
        || !handlers->remove_fd_to_wait_for
        || !handlers->alloc_timer
        || !handlers->free_timer)
        return ENOSYS;

    err = setup(if_num, handlers, user_data, new_con);
    return err;
}
~~~

after being called from collectd:
~~~
static int c_ipmi_thread_init(c_ipmi_instance_t *st) {
  ipmi_domain_id_t domain_id;
  int status;

  if (st->connaddr != NULL) {
    status = ipmi_ip_setup_con(
        &st->connaddr, &(char *){IPMI_LAN_STD_PORT_STR}, 1, st->authtype,
        (unsigned int)IPMI_PRIVILEGE_USER, st->username, strlen(st->username),
        st->password, strlen(st->password), os_handler,
        /* user data = */ NULL, &st->connection);
    if (status != 0) {
      c_ipmi_error(st, "ipmi_ip_setup_con", status);
      return -1;
    }
  } else {
    status = ipmi_smi_setup_con(/* if_num = */ 0, os_handler, /* <== this call fails */
                                /* user data = */ NULL, &st->connection);
    if (status != 0) {
      c_ipmi_error(st, "ipmi_smi_setup_con", status);
      return -1;
    }
  }
~~~


[dhill@supportshell-1 kernel]$ cat lsmod  | grep ipmi
ipmi_ssif              32768  0
acpi_ipmi              16384  0
ipmi_si                65536  2
ipmi_devintf           20480  2
ipmi_msghandler       110592  4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
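
The IPMI modules are loaded on the host, so a quick check from inside the collectd container may help separate "device missing" from "access denied". The sketch below only tries to open the device node and reports the errno name. It assumes the node is `/dev/ipmi0` and that a failed open is comparable to what `ipmi_smi_setup_con` runs into; neither is confirmed here. `probe_device` is a hypothetical helper name.

```python
import errno
import os


def probe_device(path="/dev/ipmi0"):
    """Try to open the device read/write and report the outcome.

    Returns "OK" on success, otherwise the symbolic errno name
    (for example "ENOENT", "EACCES" or "EPERM").
    """
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Fall back to the numeric value if the errno has no known name.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "OK"


if __name__ == "__main__":
    print("/dev/ipmi0:", probe_device())
```

Roughly, ENOENT would suggest the device is not passed into the container at all, while EPERM or EACCES would point at missing privileges or a security policy.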

Comment 9 David Hill 2024-01-12 13:45:29 UTC
Yeah, that's what I was considering doing in a remote session: either --privileged or "--cap-add all". Is it the same?

Comment 10 Matthias Runge 2024-01-15 12:05:23 UTC
A patch was merged that allows adding capabilities to the container through a config option.
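
To see which capabilities the container process actually ends up with, the effective capability mask can be read from `/proc/self/status` inside the container. This is a minimal sketch; the two capabilities it checks are only illustrative, since this thread does not say which capability the ipmi plugin needs. `effective_caps` and `has_cap` are hypothetical helper names.

```python
# Bit positions from linux/capability.h; an illustrative subset only.
CAP_BITS = {"CAP_SYS_RAWIO": 17, "CAP_SYS_ADMIN": 21}


def effective_caps(status_text):
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, name):
    """Return True if the named capability bit is set in the mask."""
    return bool((mask >> CAP_BITS[name]) & 1)


if __name__ == "__main__":
    with open("/proc/self/status") as fh:
        mask = effective_caps(fh.read())
    print(f"CapEff = {mask:#018x}")
    for name in CAP_BITS:
        print(f"{name}: {'yes' if has_cap(mask, name) else 'no'}")
```

Comparing the output before and after changing the container configuration shows whether the added capabilities were applied.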

Comment 12 Matthias Runge 2024-01-15 12:14:21 UTC
Or see, e.g., https://bugzilla.redhat.com/show_bug.cgi?id=1984556

Comment 32 Leonid Natapov 2024-04-18 14:04:25 UTC
Step-by-step verification:

1. Have a bare-metal (BM) environment with OSP 16.2 deployed.
2. Install the THT RPM with the fix on the undercloud. The RPM can be found here: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2989979
3. Install the paunch RPM with the fix on all overcloud nodes. The RPM can be found here: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2975960
4. Create a custom template with the following content:

resource_registry:
  OS::TripleO::Services::Collectd: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml

parameter_defaults:
  ComputeSriovOffloadParameters:
    IpmiMonitor: '/dev/ipmi0'

5. Include the path to this YAML file in overcloud_deploy.sh.
6. Run ./overcloud_deploy.sh
7. After the update finishes successfully, connect to one of the overcloud nodes and check that the collectd container is up and running.
8. Connect to the collectd container by running: podman exec -it collectd /bin/sh
9. Executed the ipmitool sensor command inside the collectd container and got output.
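
Steps 8 and 9 could also be scripted on an overcloud node. The sketch below runs `ipmitool sensor` through `podman exec` without an interactive shell and treats a zero exit code plus non-empty output as a pass. The helper names and the pass criterion are assumptions for illustration, not part of the verification that was performed.

```python
import subprocess


def ipmitool_check_command(container="collectd"):
    """Build the non-interactive equivalent of steps 8 and 9."""
    return ["podman", "exec", container, "ipmitool", "sensor"]


def collectd_can_read_sensors(container="collectd"):
    """Return True if ipmitool inside the container prints sensor data."""
    try:
        result = subprocess.run(
            ipmitool_check_command(container),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # podman is not installed on this machine.
        return False
    return result.returncode == 0 and bool(result.stdout.strip())


if __name__ == "__main__":
    ok = collectd_can_read_sensors()
    print("ipmitool sensor inside collectd:", "OK" if ok else "FAILED")
```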

Comment 53 errata-xmlrpc 2025-01-09 14:55:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.2 bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:0200