Bug 1392565 - Incorrect calibration of Mellanox IB/Ethernet card PHC clock
Summary: Incorrect calibration of Mellanox IB/Ethernet card PHC clock
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 2.5
Hardware: Unspecified
OS: Linux
unspecified
high
Target Milestone: ---
: ---
Assignee: Clark Williams
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-11-07 19:06 UTC by Mikhail Kulemin
Modified: 2020-02-14 18:06 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-25 21:13:36 UTC
Target Upstream Version:


Attachments (Terms of Use)
Code to compare mlx4 PHC and CLOCK_REALTIME (1.48 KB, text/x-csrc)
2016-11-07 19:06 UTC, Mikhail Kulemin
no flags Details
PHC and CLOCK_REALTIME on non rt kernel (11.98 KB, image/png)
2016-11-07 19:08 UTC, Mikhail Kulemin
no flags Details
PHC and CLOCK_REALTIME on rt kernel (11.51 KB, image/png)
2016-11-07 19:09 UTC, Mikhail Kulemin
no flags Details
Dmesg with debug printf info about PHC init (61.74 KB, text/plain)
2016-11-07 19:16 UTC, Mikhail Kulemin
no flags Details
Test kernel module to check calculations (709 bytes, text/x-csrc)
2016-11-07 19:17 UTC, Mikhail Kulemin
no flags Details
Patch that fix issue and add debug printf (2.67 KB, patch)
2016-11-07 19:18 UTC, Mikhail Kulemin
no flags Details | Diff
Dmesg from patched kernel - issue fixed (76.57 KB, text/plain)
2016-11-07 19:19 UTC, Mikhail Kulemin
no flags Details

Description Mikhail Kulemin 2016-11-07 19:06:59 UTC
Created attachment 1218177 [details]
Code to compare mlx4 PHC and CLOCK_REALTIME

Description of problem:

We coincide with some difficulties when configuring PTP on our servers and after investigation discover strange behavior of PHC timer on Mellanox card when working with realtime kernel. On realtime kernel phc timer on Mellanox card ticks significantly faster then system CLOCK_REALTIME clock. Moreover Mellanox timer move faster with any frequency adjustment configuration available. 

To check timer behavior on rt and non-rt kernels we wrote a simple program  ptp_clock_check.c 
To compile program: gcc -o ptp_clock_check -lrt ptp_clock_check.c 
To run program:  sudo chrt -f 10 ./ptp_clock_check 100000 
I start program with fifo rt proirity for more accurate result. Program just read system CLOCK_REALTIME and phc /dev/ptp0 with clock_gettime() n times (n=100000 in example). Program print pairs of numbers - time (in ns) from starting point of CLOCK_REALTIME and /dev/ptp0. If /dev/ptp0 is absolutely in sync with /dev/ptp0 number in pairs should be equal in other cases we can determine is /dev/ptp0 move faster or slower then CLOCK_REALTIME.

I run this program with two frequency configurations for /dev/ptp0:
phc_ctl /dev/ptp0 freq -- 100000000
phc_ctl /dev/ptp0 freq -- -100000000
So we can discover all possible frequency ajustments for this device. We can not change frequency more then 100000000. I made test for rt and non rt kernel and discover strange behavior on with rt kernel. When using rt kernel we can not made /dev/ptp0  in sync with CLOCK_REALTIME with ptp0 freq adjustment - even with  -100000000 ptp0 move faster. 

There is no such problem on no rt kernel - with freq adjustment we can make ptp0 faster or slower then CLOCK_REALTIME. This situation is obvious from plots of  ptp_clock_check data.  Green line - ptp clock in sync with CLOCK_REALTIME. If line below green - ptp0 go slower, above green - ptp0 go faster.

Such behavior of Mellanox timer on rt kernel create a number of difficulties when configuring ptp on this host.


Version-Release number of selected component (if applicable):

Red Hat Enterprise Linux Server release 6.8 (Santiago)
Red Hat MRG Realtime release 2.5.29 (RHSA-2016:1883)
Linux 3.10.0-327.rt56.197.el6rt.x86_64


How reproducible:
100%

Comment 1 Mikhail Kulemin 2016-11-07 19:08:26 UTC
Created attachment 1218178 [details]
PHC and CLOCK_REALTIME on non rt kernel

Comment 2 Mikhail Kulemin 2016-11-07 19:09:02 UTC
Created attachment 1218179 [details]
PHC and CLOCK_REALTIME on rt kernel

Comment 3 Mikhail Kulemin 2016-11-07 19:15:41 UTC
Analysis of this issue below:

1) Mellanox ethernet card timer calibrated in mlx4_en_init_timestamp() (linux-3.10.0.x86_64/drivers/net/ethernet/mellanox/mlx4/en_clock.c):

    mdev->cycles.read = mlx4_en_read_clock;
    mdev->cycles.mask = CLOCKSOURCE_MASK(48);
    /* Using shift to make calculation more accurate. Since current HW
     * clock frequency is 427 MHz, and cycles are given using a 48 bits
     * register, the biggest shift when calculating using u64, is 14
     * (max_cycles * multiplier < 2^64)
     */
    mdev->cycles.shift = 14; 
    mdev->cycles.mult =
        clocksource_khz2mult(1000 * dev->caps.hca_core_clock, mdev->cycles.shift);
    mdev->nominal_c_mult = mdev->cycles.mult;

clocksource_khz2mult() is just simple math, so we us only dev->caps.hca_core_clock and mdev->cycles.shift to calculate mult variable. It should be noted that mult is used for freq adjustment (for example with phc_ctl), but it could be changed only in some rather small interval. dev->caps.hca_core_clock is a frequency of internal oscillator

2) We can use crash to examine actual values of mult(nominal_c_mult) after system boot. (I am not expert in crash, maybe there is more simple way to do it)
	crash> struct net_device
	<...>
	SIZE: 2368
	crash> mod -s mlx4_en
		 MODULE       NAME                    SIZE  OBJECT FILE
	ffffffffa03ae060  mlx4_en               103858  /lib/modules/3.10.0-327.rt56.198.mlx_debug.el6rt.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko 
	crash> net
	   NET_DEVICE     NAME   IP ADDRESS(ES)
	ffff880bbc5fc000  lo     127.0.0.1
	ffff8817b0640000  eth4   10.50.2.12
	ffff8817afe60000  eth5   10.50.3.12
	ffff8817bb9b8000  eth0   10.50.1.12
	ffff8817bbaa0000  eth1   
	ffff8817bb5ac000  eth2   
	ffff8817bbd2c000  eth3   
	ffff8817bb5f6000  ib0    
	ffff8817b9c8a000  ib1    
	ffff8817aec0e000  bond0  10.22.9.2
	crash> eval ffff8817b0640000 + 2368
	hexadecimal: ffff8817b0640940  
		decimal: 18446612234057812288  (-131839651739328)
		  octal: 1777774201366031004500
		 binary: 1111111111111111100010000001011110110000011001000000100101000000
	crash> struct mlx4_en_priv.mdev ffff8817b0640940
	  mdev = 0xffff880bb849fc00
	crash> struct mlx4_en_dev.cycles 0xffff880bb849fc00
	  cycles = {
		read = 0xffffffffa03a87f0 <mlx4_en_read_clock>, 
		mask = 281474976710655, 
		mult = 51685,        <----------------- Actual mult!
		shift = 14           <----------------- Shift is constant!
	  }
	crash> struct mlx4_en_dev.dev 0xffff880bb849fc00
	  dev = 0xffff8817b8f60000
	crash> struct mlx4_dev.caps.hca_core_clock 0xffff8817b8f60000
	  caps.hca_core_clock = 427,   <----------- frequency in MHz 

3) Let`s check correctness of clocksource_khz2mult(). It can be done manually, but I wrote a kernel module for it to be absolutely sure. (Full code in attachment.)
<...>
    u32 shift, result;
    u16 mhz;
    mhz = 427;
    shift = 14;
    result = clocksource_khz2mult(1000 * mhz, shift);
    pr_info("clocksource_khz2mult() result: %i\n", result);
        return 0;
<...>

Result of this test is very suprising (from dmesg):
	<...>
	Module for testing clocksource_khz2mult() started!
        clocksource_khz2mult() result: 38370
	<...>

So actual mult is 51685, but calculated - 38370!

4) Let`s suppose that "constants" are not constant. And after examine source code of mlx4 driver one can find that dev->caps.hca_core_clock саn be changed many times:
  - in function mlx4_QUERY_HCA() (linux-3.10.0.x86_64/drivers/net/ethernet/mellanox/mlx4/fw.c)
  - mlx4_hca_core_clock_update() (linux-3.10.0.x86_64/drivers/net/ethernet/mellanox/mlx4/fw.c)
I wrote a simple patch  that only add debug print at any point where dev->caps.hca_core_clock can changed. (Patch in attachment).

Results are interesting too (only some lines from dmesg, full dmesg in attachment):

   <...>
   mlx4_core 0000:0b:00.0: Bus: 11, Device: 0, Function: 0, Physical function: 0, Max EQs: 156, Reserved EQs: 28, Reserved UARs: 7
   hca_core_clock debug: 317        <--------------- Initial value of hca_core_clock
   mlx4_core 0000:0b:00.0: DMFS high rate steer mode is: disabled performance optimized steering
   mlx4_core 0000:0b:00.0: PCIe BW is different than device's capability
   mlx4_core 0000:0b:00.0: PCIe link speed is 5.0GT/s, device supports 8.0GT/s
   mlx4_core 0000:0b:00.0: PCIe link width is x4, device supports x8
   <...>
   mlx4_core 0000:0e:00.0: Bus: 14, Device: 0, Function: 0, Physical function: 0, Max EQs: 156, Reserved EQs: 28, Reserved UARs: 7
   hca_core_clock debug: 317
   mlx4_core 0000:0e:00.0: DMFS high rate steer mode is: disabled performance optimized steering
   mlx4_core 0000:0e:00.0: PCIe BW is different than device's capability
   mlx4_core 0000:0e:00.0: PCIe link speed is 5.0GT/s, device supports 8.0GT/s
   mlx4_core 0000:0e:00.0: PCIe link width is x4, device supports x8
  <...>

Great! We have two cards and both of them report 317 MHz at start!

    <...>
    mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014)
    clocksource_khz2mult debug: shift - 14
    clocksource_khz2mult debug: hca_core_clock - 317
    clocksource_khz2mult debug: mult - 51685
    mlx4_en 0000:0b:00.0: registered PHC clock
    mlx4_en 0000:0b:00.0: Activating port:2
    mlx4_core 0000:0b:00.0: Mapped 1 chunks/256 KB at 120040000 for ICM
    mlx4_en: 0000:0b:00.0: Port 2: Using 96 TX rings
    mlx4_en: 0000:0b:00.0: Port 2: Using 8 RX rings
    mlx4_en: 0000:0b:00.0: Port 2:   frag:0 - size:1518 prefix:0 stride:1536
    mlx4_en: 0000:0b:00.0: Port 2: Initializing port
    hca_core_clock update debug: 427
   <...>

4) We can see that PHC clock registered at a very early stage when ethernet driver loading with hca_core_clock = 317MHz. After that ethernet port initialized and value of hca_core_clock updated to 427MHz. Looks like we initialized PHC earlier than needed. we can fixed it with simple patch - just move PHC initialization to the last stage of eth port init (patch in mlx_fix_debug.patch)

It helps to fix this issue, in attached dmesg one can see that now PHC initialized with correct frequency. I also find upstream patch that can fix it https://patchwork.ozlabs.org/patch/592920/

I hope it can help.

Comment 4 Mikhail Kulemin 2016-11-07 19:16:47 UTC
Created attachment 1218193 [details]
Dmesg with debug printf info about PHC init

Comment 5 Mikhail Kulemin 2016-11-07 19:17:41 UTC
Created attachment 1218194 [details]
Test kernel module to check calculations

Comment 6 Mikhail Kulemin 2016-11-07 19:18:32 UTC
Created attachment 1218195 [details]
Patch that fix issue and add debug printf

Comment 7 Mikhail Kulemin 2016-11-07 19:19:28 UTC
Created attachment 1218196 [details]
Dmesg from patched kernel  - issue fixed

Comment 8 Doug Ledford 2016-12-23 03:18:09 UTC
This needs to have Mellanox look at it.  Adding on-site engineers to this bug.

Comment 12 Beth Uptagrafft 2017-01-25 21:13:36 UTC
Closing this BZ. It has been fixed in our current release, kernel-rt-3.10.0-514.rt56.210.el6rt.


Note You need to log in before you can comment on or make changes to this bug.