Bug 1138851 - NTP triggers Infiniband multicast warning
Summary: NTP triggers Infiniband multicast warning
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: ntp
Version: 6.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Miroslav Lichvar
QA Contact: qe-baseos-daemons
URL: https://lc.llnl.gov/jira/browse/TOSS-...
Whiteboard:
Depends On:
Blocks: 1249180
 
Reported: 2014-09-05 19:20 UTC by Ben Woodard
Modified: 2017-10-30 18:42 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-27 14:57:32 UTC
Target Upstream Version:
Embargoed:


Attachments
Test Attachment (20 bytes, text/plain), 2016-01-10 21:22 UTC, Travis Gummels

Description Ben Woodard 2014-09-05 19:20:57 UTC
Description of problem:
We are seeing approximately 2600 messages like the following over the past 4 hours, from a variety of nodes.
	
Jul 23 14:28:22 063739 [ACCF6700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B11: Port 0x0002c903000b64c5 (rzrelici mlx4_0) failed to join non-existing multicast group with MGID ff12:401b:ffff::101, insufficient components specified for implicit create (comp_mask 0x10083)

The ff12:401b prefix means it's an IPv4 multicast address, with the lower 28 bits (0x101) being the group address. That corresponds to 224.0.1.1, which is registered to (and being used by) NTP. I'm not sure at this point what may have changed that would cause previously-functional joins to break.
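
For reference, a minimal stand-alone sketch of that decoding; this is an illustration, not code from ntp or the kernel. The low 28 bits of an IPv4-mapped MGID carry the low 28 bits of the IPv4 group address, and every IPv4 multicast address starts with the fixed 1110 prefix of 224.0.0.0/4.

/* Illustration only: recover the IPv4 multicast address embedded in an
 * IPoIB MGID such as ff12:401b:ffff::101. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t mgid_low28 = 0x101;                      /* low 28 bits of the MGID */
    uint32_t ipv4 = 0xE0000000u | (mgid_low28 & 0x0FFFFFFFu);

    printf("%u.%u.%u.%u\n",                           /* prints 224.0.1.1, the NTP group */
           (ipv4 >> 24) & 0xFF, (ipv4 >> 16) & 0xFF,
           (ipv4 >> 8) & 0xFF, ipv4 & 0xFF);
    return 0;
}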

This started happening after we upgraded from 6.4 to 6.5. The ntp-4.2.6p5-1.el6.x86_64 package, which comes with RHEL 6.5, causes 55 of the ERR 1B11 messages to appear in the log every hour on hype.

ntp otherwise seems to work fine.  Clients are staying in sync, at least, and I don't recall seeing any weird log messages.  Note that we have a route on all of our nodes for 224.0.1.0/24 on the management ethernet interface.

Version-Release number of selected component (if applicable):
ntp-4.2.6p5-1.el6.x86_64 causes the problem. Reverting to ntp-4.2.4p8-3.el6.x86_64 makes the problem go away.

Comment 2 Ben Woodard 2014-09-09 15:21:34 UTC
I want to clearly point out that something regressed between 4.2.4p8-3 and 4.2.6p5-1.

Comment 3 Ben Woodard 2014-09-09 15:25:20 UTC
Doug,
Can you help me understand the error message coming from IB. It looks to me like it is something coming from the SA layer of IB but I don't understand enough about how that works to understand what may be going wrong here. 

Jul 23 14:28:22 063739 [ACCF6700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B11: Port 0x0002c903000b64c5 (rzrelici mlx4_0) failed to join non-existing multicast group with MGID ff12:401b:ffff::101, insufficient components specified for implicit create (comp_mask 0x10083)

1) The "failed to join non-existing multicast group" part surprises me because the multicast group for NTP should have been created already.

2) Assuming it doesn't exist because no other node has created it (per the error in #1), why can't it create a new multicast group?

Comment 4 Miroslav Lichvar 2014-09-09 15:26:14 UTC
Does this happen with the default ntp.conf? Are there any broadcast or multicast directives?
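
For context, multicast in ntpd is normally enabled by explicit directives in /etc/ntp.conf; an illustrative fragment of the kind of directives being asked about (not the reporter's actual configuration):

# illustrative only -- not the reporter's actual ntp.conf
broadcast 224.0.1.1 ttl 4        # serve time to the registered IPv4 NTP multicast group
multicastclient 224.0.1.1        # listen for multicast NTP on that group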

Comment 5 Doug Ledford 2014-09-09 15:37:37 UTC
The InfiniBand subnet manager will autocreate a multicast group if the join request includes enough information that the group can be created implicitly.  In this instance, the SM is pretty certain that the join request does not contain all the required information for implicit creation, and no one has done an explicit creation of the group.  As a result, all of the joins are failing.

You can try to fix this in a couple ways.

1) Explicit group creation.  Something like this in /etc/rdma/partitions.conf on the opensm machine:

Default=0x7fff, rate=3 mtu=4 scope=2, defmember=full:
	ALL, ALL_SWITCHES=full;
Default=0x7fff, ipoib, rate=3 mtu=4 scope=2:
	mgid=ff12:401b::ffff:ffff	# IPv4 Broadcast address
	mgid=ff12:401b::1		# IPv4 All Hosts group
	mgid=ff12:401b::2		# IPv4 All Routers group
	mgid=ff12:401b::16		# IPv4 IGMP group
	mgid=ff12:401b::fb		# IPv4 mDNS group
	mgid=ff12:401b::fc		# IPv4 Multicast Link Local Name Resolution group
	mgid=ff12:401b::101		# IPv4 NTP group
	mgid=ff12:401b::202		# IPv4 Sun RPC
	mgid=ff12:601b::1		# IPv6 All Hosts group
	mgid=ff12:601b::2		# IPv6 All Routers group
	mgid=ff12:601b::16		# IPv6 MLDv2-capable Routers group
	mgid=ff12:601b::fb		# IPv6 mDNS group
	mgid=ff12:601b::101		# IPv6 NTP group
	mgid=ff12:601b::202		# IPv6 Sun RPC group
	mgid=ff12:601b::1:3		# IPv6 Multicast Link Local Name Resolution group
	ALL=full, ALL_SWITCHES=full;

Notice I have both an IPv4 and IPv6 NTP group by default.

2) Find out what bits are missing from the join request and try to modify either the ntp program or IPoIB to add the right bits so that implicit creation works.  I don't know the creation mask myself, but if you look at the code in drivers/infiniband/ulp/ipoib/ipoib_multicast.c and follow the code flow in ipoib_mcast_join() when create==1, you will see all the bits that need to be initialized.  You can then compare that to what happens when IPoIB gets a join request from user space and attempts to join on user space's behalf, and see what's missing.

Comment 6 Ben Woodard 2014-09-09 18:09:59 UTC
Doug, 
The options set appear to be:
 comp_mask =
   IB_SA_MCMEMBER_REC_MGID  |
   IB_SA_MCMEMBER_REC_PORT_GID |
   IB_SA_MCMEMBER_REC_PKEY  |
   IB_SA_MCMEMBER_REC_JOIN_STATE;

This suggests that the create flag, which adds the missing options, is not set. OpenSM requires: 

#define REQUIRED_MC_CREATE_COMP_MASK (IB_MCR_COMPMASK_MGID | \
					IB_MCR_COMPMASK_PORT_GID | \
					IB_MCR_COMPMASK_JOIN_STATE | \
					IB_MCR_COMPMASK_QKEY | \
					IB_MCR_COMPMASK_TCLASS | \
					IB_MCR_COMPMASK_PKEY | \
					IB_MCR_COMPMASK_FLOW | \
					IB_MCR_COMPMASK_SL)

So IB_MCR_COMPMASK_QKEY, IB_MCR_COMPMASK_TCLASS, IB_MCR_COMPMASK_FLOW, and IB_MCR_COMPMASK_SL are missing.
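
As a cross-check, the comp_mask 0x10083 from the log decodes to exactly the four flags that are present, and the missing set works out to QKEY, TCLASS, FLOW and SL. A small sketch, using literal MCMemberRecord component-mask bit positions rather than the kernel or OpenSM headers:

/* Sketch only: literal component-mask bit positions, not the real headers. */
#include <stdint.h>
#include <stdio.h>

#define CM_MGID       (1ull << 0)
#define CM_PORT_GID   (1ull << 1)
#define CM_QKEY       (1ull << 2)
#define CM_TCLASS     (1ull << 6)
#define CM_PKEY       (1ull << 7)
#define CM_SL         (1ull << 12)
#define CM_FLOW       (1ull << 13)
#define CM_JOIN_STATE (1ull << 16)

int main(void)
{
    uint64_t sent = CM_MGID | CM_PORT_GID | CM_PKEY | CM_JOIN_STATE;   /* 0x10083, as logged */
    uint64_t required = sent | CM_QKEY | CM_TCLASS | CM_FLOW | CM_SL;  /* OpenSM create mask */

    printf("sent = 0x%llx, missing = 0x%llx\n",
           (unsigned long long)sent,
           (unsigned long long)(required & ~sent));  /* missing = QKEY|TCLASS|FLOW|SL = 0x3044 */
    return 0;
}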

ntpd's socket code is pretty normal. AIUI there isn't a way using the socket interface to explicitly say, "I'd like to join this group if it already exists, but if it doesn't exist, don't bother creating it." The socket interface just has IP_ADD_MEMBERSHIP and IPV6_JOIN_GROUP, so it seems to me that the create semantics must be added somewhere between the socket programming done in NTP and ipoib_mcast_join(). I have yet to spot where the create flag is not passed to ipoib_mcast_join(), and so I have yet to ascertain why it is not getting passed. Since the code in ntpd is very generic, this is looking increasingly like a problem in IPoIB which just happens to be triggered by the new ntpd.
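
To make the point about the socket interface concrete, here is a generic sketch of the only IPv4 join primitive user space has; this is an illustration, not ntpd's actual code. There is no way to express "join only if the group already exists"; the create-or-not decision is made further down, in IPoIB and the SM.

/* Generic user-space multicast join -- illustration only, not ntpd code. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int join_ntp_group(int sock, const char *local_if_addr)
{
    struct ip_mreq mreq;

    memset(&mreq, 0, sizeof(mreq));
    inet_pton(AF_INET, "224.0.1.1", &mreq.imr_multiaddr);    /* registered NTP group */
    inet_pton(AF_INET, local_if_addr, &mreq.imr_interface);  /* local interface address */

    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP");
        return -1;
    }
    return 0;
}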

At the moment, multiple workarounds exist but no root cause has been identified:
a) run 6.4's version of NTP
b) explicit creation of the multicast group on the OpenSM host
c) (untested) excluding NTP from the InfiniBand network.

Comment 7 Doug Ledford 2015-02-02 16:09:08 UTC
Hi Ben,

The issue here is the difference between an explicit join of a multicast group, and an implicit join by an attempt to send a packet to a multicast group without joining the group as a full member.

In the IPoIB code, whenever we are asked to join a group as a full member, with the exception of the broadcast group that must already be defined and exist, we will always attempt to join with sufficient information to create the group if it does not exist.  This happens when we get a call to set our multicast list.  When we get that call from the core network code, we call ipoib_mcast_restart_task and it scans the list of multicast groups we are already a member of, and compares it to the list that the core network code says we should be subscribed to, and it removes ones that are no longer present and adds ones that are missing.  In so doing, it calls ipoib_mcast_join with the create option set to 1.

If an application attempts to send a multicast packet without first joining the multicast group, then we get to the ipoib_mcast_send routine and the multicast group doesn't exist.  In that circumstance, we autocreate a sendonly multicast join request.  It is these sendonly join requests, created on behalf of sends to a group we don't belong to, that are being passed on to the SM without sufficient elements to create the group.

I view this as a matter of policy, really.  I can see it being valid both ways.  In some cases you want the SM to deny these implicit group requests, and in others you want them to succeed.  But since the SM can't tell the difference between an implicit join with create options set and an explicit join with create options set, the policy must be administered from the joining machine.  I could see adding an ipoib module option, something to the effect of "create_sendonly_mcast_groups", where the default of 0 behaves as it does now, and setting it to 1 makes our implicitly joined sendonly mcast groups pass the proper create information to the SM.
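
A rough sketch of what that knob might look like; this is hypothetical code, not anything that exists in the IPoIB driver. The parameter name and the idea that ipoib_mcast_join() takes a create argument come from this discussion; the surrounding details are placeholders.

/* Hypothetical sketch of the proposed module option -- not existing code. */
#include <linux/module.h>
#include <linux/moduleparam.h>

static bool create_sendonly_mcast_groups;    /* default 0: behave as today */
module_param(create_sendonly_mcast_groups, bool, 0444);
MODULE_PARM_DESC(create_sendonly_mcast_groups,
                 "Pass full create information on implicit sendonly multicast joins");

/* In the ipoib_mcast_send() path, where a sendonly join is issued today
 * without create information, the policy would simply gate the flag: */
static void sendonly_join(struct net_device *dev, struct ipoib_mcast *mcast)
{
        /* create == 1 would make the SA request also carry QKey, TClass,
         * FlowLabel and SL, so the SM could create the group implicitly. */
        ipoib_mcast_join(dev, mcast, create_sendonly_mcast_groups ? 1 : 0);
}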

The other option, of course, is to simply define the groups you want in the SM partitions.conf file.

As for why it happens with the latest ntp and not the older one, I'm guessing that the newer ntp is attempting to send out multicast packets on more than just the interface you have them routed to.  Maybe it's opening up all of the interfaces and forcing a multicast send on them all in order to probe the possibility of using interfaces other than the default.  I'm just not sure on that one.  If you'd like me to add an ipoib module option to create sendonly groups, let me know.

Comment 8 Ben Woodard 2015-02-02 20:28:25 UTC
Doug, 
Based on the principle of least surprise, why would you not make the default to create an implicit group? In the Ethernet world, implicitly creating a mcast group to which there are no subscribers would not create an event or a warning. The switch would just drop the packets until it got the IGMP request which informed it that such and such MAC address was interested in that multicast. Is the IB switch so limited in resources that it can't handle the few send-only mcast groups that are likely to be implicitly created? If users do run the IB switch out of these resources, the fabric administrator could turn off the capability to resolve the issue. Part of the problem here is one of sequencing: as we bring up a cluster, we might have senders wanting to create the multicasts before the receivers appear, and requiring the fabric administrator to statically predefine the expected groups seems like an unnecessary burden.

Comment 9 Doug Ledford 2015-02-10 01:59:12 UTC
(In reply to Ben Woodard from comment #8)
> Doug, 
> Based on the principle of least surprise,

The principle of least surprise doesn't apply here.  This is InfiniBand, not Ethernet, and design decisions were made in the InfiniBand link layer and switch routing areas that mean a computer can't just be sending packets to be dropped.  These design decisions were made to meet the goal of lowest, predictable latency and high speed.  We can't have it both ways.

> why would you not make the default
> to create an implicit group?

It's written into the spec that a group must either be statically defined, or the attempt to implicitly create it must contain sufficient information to create the group properly.  The sendonly joins you are seeing do not have that information.  We can make them have it, but then we would be guessing at things we ought not to be guessing at.

> In the ethernet world, implicitly creating a
> mcast group to which their are no subscribers, would not create an event or
> a warning. The switch would just drop the packets until it got the igmp
> request which informed it that thus and such mac addr was interested in that
> multicast. Is the IB switch so limited in resources that it can't handle
> having the likely few implicitly created send only mcast groups that are
> likely created.

The switch doesn't have any say in this.  The switch has a linear forwarding table and a multicast forwarding table, and they are both programmed by the subnet manager.  The switch does not run its own routing protocol engine to generate these things, only the subnet manager does (although many switches have an embedded subnet manager and can do these things, it is not actually part of the switch per se).  And the spec says that a client may not send to a group until it has contacted the SM and joined the group at a minimum as a send only member.  That is the spec.

> If users do run the IB switch out of these resources, the
> fabric administrator could turn off the capability to resolve the issue.

This is a bad idea for determinism, just as NMIs are a bad idea for real-time jitter determinism.

> Part of the problem here is one of sequencing, as we bring up a cluster we
> might have senders wanting to create the multicasts before the receivers
> appear

This is moot.  The first time you send, if the receivers aren't ready, then the packet will be lost whether the group is defined or not; there are no receivers to listen for it.  Once the receivers create the group, future sends will succeed in their join and send their packets.  In other words, it will do exactly as you expect, except that instead of the switch silently dropping packets sent to an unknown group, we are dropping the packets on the host and alerting you to that fact.  One could argue this is much better than silent drops.

> and requiring the fabric administrator to statically predefine the
> expected groups seems like an unnecessary burden.

OpenSM used to create a set of default multicast groups in the default pkey.  It still defines the default pkey (the spec actually requires all machines to be full members of the default group, so it's easy to know how to create it), but it does not define a bunch of multicast groups any more.  The latest versions of opensm that we're shipping have a set of predefined common multicast groups in the default pkey definition in our default partitions.conf file.

But there are admin decisions that *should* be made about that default pkey definition.  The one I ship must cater to the lowest common denominator.  So, for instance, the group rate of the default groups is set at 10 Gbit/s.  This speed is ancient, but if you have a mix of speeds on your fabric and you designate the rate at, say, 40 Gbit/s, then anything that can't support at *least* 40 Gbit/s will not be allowed to join the pkey these groups are defined in.  Likewise for the MTU of the pkey/mcast groups.  Admins should be looking at these things, as they control how the hardware interacts with the group, including both the packet size and the rate members are allowed to attempt to use in that group.  If the admin is actually paying attention to these setup details, then the principle of least surprise is not an issue, nor is any additional burden, as the burden already exists outside of just defining the groups.

So, going back to what I said in my earlier post: we could add a module option or something like that to change whether or not we send enough information to create sendonly groups by default, but I can't change the spec and I can't make admin decisions for people.  There is a burden and that won't change; making it work without putting any effort in only hides what can end up being important details the admin should be taking care of.

Comment 11 Orion Poplawski 2015-09-30 16:24:57 UTC
Does running in IB connected mode have any bearing on this issue?

Sep 30 10:13:52 csdisk1 kernel: ib0: enabling connected mode will cause multicast packet drops

Comment 12 Doug Ledford 2015-09-30 16:34:58 UTC
Nah, the mcast packets sent by NTP are small enough to fit into the underlying IB MTU.  The source of that message is that in connected mode we allow the TCP/IP MTU to be set all the way up to 65520.  However, the actual IB MTU will be lower (usually either 2K or 4K); we simply use an RC connection to enable large sends like this.  But all UD sends are limited to at most the IB MTU minus UD overhead in size, and multicast packets are sent as UD sends, so even though the TCP/IP interface MTU is >4K, we can't actually send >4K mcast packets.  This message is warning people about that fact.  If the program knows to check the underlying IB MTU, it can adjust its mcast sends to always fit in a UD-minus-header sized chunk and they won't get dropped (unless the fabric is overloaded, but that's another issue).
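
For a program that wants to do that check, the active port MTU is available through libibverbs; a minimal sketch, assuming the first device and port 1, with error handling reduced to early returns:

/* Read the active IB port MTU so multicast (UD) payloads can be kept below it. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0])
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;

    struct ibv_port_attr attr;
    if (ibv_query_port(ctx, 1, &attr) == 0) {
        int mtu_bytes = 128 << attr.active_mtu;   /* IBV_MTU_2048 -> 2048, etc. */
        printf("active IB MTU: %d bytes\n", mtu_bytes);
        /* Multicast payloads must also leave room for UD/IPoIB header overhead. */
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}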

Comment 13 Jeff Hanson 2015-10-28 21:06:42 UTC
The opensm rpms in RHEL 6.7 and 7.1 do not contain a partitions.conf file.  There is discussion here
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_the_Subnet_Manager.html
on what the configuration could be.  Is this what you mean?

We have a customer that is seeing up to 5000 1B11 messages per minute.

Comment 14 Doug Ledford 2015-10-29 01:24:10 UTC
(In reply to Jeff Hanson from comment #13)
> The opensm rpms in RHEL 6.7 and 7.1 do not contain a partitions.conf file. 
> There is discussion here
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/
> html/Networking_Guide/sec-Configuring_the_Subnet_Manager.html
> on what the configuration could be.  Is this what you mean?
> 
> We have a customer that is seeing up to 5000 1B11 messages per minute.

Jeff, you're right. I mentioned in my comment that we ship the partitions.conf file, but it is in fact missing from the rpms.  The link you posted above is what the sample partitions.conf file *should* look like, although the sample partitions.conf in the rpm would need to comment out the ib0.2 partition and include only the default pkey as uncommented.

Comment 15 Jeff Hanson 2015-10-29 03:03:29 UTC
Editing for QDR (rate of 40) would then be -

Default=0x7fff, rate=7 mtu=4 scope=2, defmember=full:
        ALL, ALL_SWITCHES=full;
Default=0x7fff, ipoib, rate=7 mtu=4 scope=2:
        mgid=ff12:401b::ffff:ffff       # IPv4 Broadcast address
        mgid=ff12:401b::1               # IPv4 All Hosts group
        mgid=ff12:401b::2               # IPv4 All Routers group
        mgid=ff12:401b::16              # IPv4 IGMP group
        mgid=ff12:401b::fb              # IPv4 mDNS group
        mgid=ff12:401b::fc              # IPv4 Multicast Link Local Name Resolution group
        mgid=ff12:401b::101             # IPv4 NTP group
        mgid=ff12:401b::202             # IPv4 Sun RPC
        mgid=ff12:601b::1               # IPv6 All Hosts group
        mgid=ff12:601b::2               # IPv6 All Routers group
        mgid=ff12:601b::16              # IPv6 MLDv2-capable Routers group
        mgid=ff12:601b::fb              # IPv6 mDNS group
        mgid=ff12:601b::101             # IPv6 NTP group
        mgid=ff12:601b::202             # IPv6 Sun RPC group
        mgid=ff12:601b::1:3             # IPv6 Multicast Link Local Name Resolution group
        ALL=full, ALL_SWITCHES=full;

Having run with this and rebooted nodes a few times in a much smaller cluster than our customer's, I don't see any 1B11 messages.

Comment 16 Travis Gummels 2016-01-10 21:22:12 UTC
Created attachment 1113380 [details]
Test Attachment

Test attachment comment.

Comment 18 George Beshers 2016-12-22 16:16:09 UTC
Getting ready for OnBoarding first week of January.

Comment 19 Tomáš Hozza 2017-10-27 14:57:32 UTC
Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017.  During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:
http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for the Production Phase 3 and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification.  Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com

