Hide Forgot
Created attachment 1218177 [details] Code to compare mlx4 PHC and CLOCK_REALTIME Description of problem: We coincide with some difficulties when configuring PTP on our servers and after investigation discover strange behavior of PHC timer on Mellanox card when working with realtime kernel. On realtime kernel phc timer on Mellanox card ticks significantly faster then system CLOCK_REALTIME clock. Moreover Mellanox timer move faster with any frequency adjustment configuration available. To check timer behavior on rt and non-rt kernels we wrote a simple program ptp_clock_check.c To compile program: gcc -o ptp_clock_check -lrt ptp_clock_check.c To run program: sudo chrt -f 10 ./ptp_clock_check 100000 I start program with fifo rt proirity for more accurate result. Program just read system CLOCK_REALTIME and phc /dev/ptp0 with clock_gettime() n times (n=100000 in example). Program print pairs of numbers - time (in ns) from starting point of CLOCK_REALTIME and /dev/ptp0. If /dev/ptp0 is absolutely in sync with /dev/ptp0 number in pairs should be equal in other cases we can determine is /dev/ptp0 move faster or slower then CLOCK_REALTIME. I run this program with two frequency configurations for /dev/ptp0: phc_ctl /dev/ptp0 freq -- 100000000 phc_ctl /dev/ptp0 freq -- -100000000 So we can discover all possible frequency ajustments for this device. We can not change frequency more then 100000000. I made test for rt and non rt kernel and discover strange behavior on with rt kernel. When using rt kernel we can not made /dev/ptp0 in sync with CLOCK_REALTIME with ptp0 freq adjustment - even with -100000000 ptp0 move faster. There is no such problem on no rt kernel - with freq adjustment we can make ptp0 faster or slower then CLOCK_REALTIME. This situation is obvious from plots of ptp_clock_check data. Green line - ptp clock in sync with CLOCK_REALTIME. If line below green - ptp0 go slower, above green - ptp0 go faster. Such behavior of Mellanox timer on rt kernel create a number of difficulties when configuring ptp on this host. Version-Release number of selected component (if applicable): Red Hat Enterprise Linux Server release 6.8 (Santiago) Red Hat MRG Realtime release 2.5.29 (RHSA-2016:1883) Linux 3.10.0-327.rt56.197.el6rt.x86_64 How reproducible: 100%
Created attachment 1218178 [details] PHC and CLOCK_REALTIME on non rt kernel
Created attachment 1218179 [details] PHC and CLOCK_REALTIME on rt kernel
Analysis of this issue below: 1) Mellanox ethernet card timer calibrated in mlx4_en_init_timestamp() (linux-3.10.0.x86_64/drivers/net/ethernet/mellanox/mlx4/en_clock.c): mdev->cycles.read = mlx4_en_read_clock; mdev->cycles.mask = CLOCKSOURCE_MASK(48); /* Using shift to make calculation more accurate. Since current HW * clock frequency is 427 MHz, and cycles are given using a 48 bits * register, the biggest shift when calculating using u64, is 14 * (max_cycles * multiplier < 2^64) */ mdev->cycles.shift = 14; mdev->cycles.mult = clocksource_khz2mult(1000 * dev->caps.hca_core_clock, mdev->cycles.shift); mdev->nominal_c_mult = mdev->cycles.mult; clocksource_khz2mult() is just simple math, so we us only dev->caps.hca_core_clock and mdev->cycles.shift to calculate mult variable. It should be noted that mult is used for freq adjustment (for example with phc_ctl), but it could be changed only in some rather small interval. dev->caps.hca_core_clock is a frequency of internal oscillator 2) We can use crash to examine actual values of mult(nominal_c_mult) after system boot. (I am not expert in crash, maybe there is more simple way to do it) crash> struct net_device <...> SIZE: 2368 crash> mod -s mlx4_en MODULE NAME SIZE OBJECT FILE ffffffffa03ae060 mlx4_en 103858 /lib/modules/3.10.0-327.rt56.198.mlx_debug.el6rt.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko crash> net NET_DEVICE NAME IP ADDRESS(ES) ffff880bbc5fc000 lo 127.0.0.1 ffff8817b0640000 eth4 10.50.2.12 ffff8817afe60000 eth5 10.50.3.12 ffff8817bb9b8000 eth0 10.50.1.12 ffff8817bbaa0000 eth1 ffff8817bb5ac000 eth2 ffff8817bbd2c000 eth3 ffff8817bb5f6000 ib0 ffff8817b9c8a000 ib1 ffff8817aec0e000 bond0 10.22.9.2 crash> eval ffff8817b0640000 + 2368 hexadecimal: ffff8817b0640940 decimal: 18446612234057812288 (-131839651739328) octal: 1777774201366031004500 binary: 1111111111111111100010000001011110110000011001000000100101000000 crash> struct mlx4_en_priv.mdev ffff8817b0640940 mdev = 0xffff880bb849fc00 crash> struct mlx4_en_dev.cycles 0xffff880bb849fc00 cycles = { read = 0xffffffffa03a87f0 <mlx4_en_read_clock>, mask = 281474976710655, mult = 51685, <----------------- Actual mult! shift = 14 <----------------- Shift is constant! } crash> struct mlx4_en_dev.dev 0xffff880bb849fc00 dev = 0xffff8817b8f60000 crash> struct mlx4_dev.caps.hca_core_clock 0xffff8817b8f60000 caps.hca_core_clock = 427, <----------- frequency in MHz 3) Let`s check correctness of clocksource_khz2mult(). It can be done manually, but I wrote a kernel module for it to be absolutely sure. (Full code in attachment.) <...> u32 shift, result; u16 mhz; mhz = 427; shift = 14; result = clocksource_khz2mult(1000 * mhz, shift); pr_info("clocksource_khz2mult() result: %i\n", result); return 0; <...> Result of this test is very suprising (from dmesg): <...> Module for testing clocksource_khz2mult() started! clocksource_khz2mult() result: 38370 <...> So actual mult is 51685, but calculated - 38370! 4) Let`s suppose that "constants" are not constant. And after examine source code of mlx4 driver one can find that dev->caps.hca_core_clock саn be changed many times: - in function mlx4_QUERY_HCA() (linux-3.10.0.x86_64/drivers/net/ethernet/mellanox/mlx4/fw.c) - mlx4_hca_core_clock_update() (linux-3.10.0.x86_64/drivers/net/ethernet/mellanox/mlx4/fw.c) I wrote a simple patch that only add debug print at any point where dev->caps.hca_core_clock can changed. (Patch in attachment). Results are interesting too (only some lines from dmesg, full dmesg in attachment): <...> mlx4_core 0000:0b:00.0: Bus: 11, Device: 0, Function: 0, Physical function: 0, Max EQs: 156, Reserved EQs: 28, Reserved UARs: 7 hca_core_clock debug: 317 <--------------- Initial value of hca_core_clock mlx4_core 0000:0b:00.0: DMFS high rate steer mode is: disabled performance optimized steering mlx4_core 0000:0b:00.0: PCIe BW is different than device's capability mlx4_core 0000:0b:00.0: PCIe link speed is 5.0GT/s, device supports 8.0GT/s mlx4_core 0000:0b:00.0: PCIe link width is x4, device supports x8 <...> mlx4_core 0000:0e:00.0: Bus: 14, Device: 0, Function: 0, Physical function: 0, Max EQs: 156, Reserved EQs: 28, Reserved UARs: 7 hca_core_clock debug: 317 mlx4_core 0000:0e:00.0: DMFS high rate steer mode is: disabled performance optimized steering mlx4_core 0000:0e:00.0: PCIe BW is different than device's capability mlx4_core 0000:0e:00.0: PCIe link speed is 5.0GT/s, device supports 8.0GT/s mlx4_core 0000:0e:00.0: PCIe link width is x4, device supports x8 <...> Great! We have two cards and both of them report 317 MHz at start! <...> mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014) clocksource_khz2mult debug: shift - 14 clocksource_khz2mult debug: hca_core_clock - 317 clocksource_khz2mult debug: mult - 51685 mlx4_en 0000:0b:00.0: registered PHC clock mlx4_en 0000:0b:00.0: Activating port:2 mlx4_core 0000:0b:00.0: Mapped 1 chunks/256 KB at 120040000 for ICM mlx4_en: 0000:0b:00.0: Port 2: Using 96 TX rings mlx4_en: 0000:0b:00.0: Port 2: Using 8 RX rings mlx4_en: 0000:0b:00.0: Port 2: frag:0 - size:1518 prefix:0 stride:1536 mlx4_en: 0000:0b:00.0: Port 2: Initializing port hca_core_clock update debug: 427 <...> 4) We can see that PHC clock registered at a very early stage when ethernet driver loading with hca_core_clock = 317MHz. After that ethernet port initialized and value of hca_core_clock updated to 427MHz. Looks like we initialized PHC earlier than needed. we can fixed it with simple patch - just move PHC initialization to the last stage of eth port init (patch in mlx_fix_debug.patch) It helps to fix this issue, in attached dmesg one can see that now PHC initialized with correct frequency. I also find upstream patch that can fix it https://patchwork.ozlabs.org/patch/592920/ I hope it can help.
Created attachment 1218193 [details] Dmesg with debug printf info about PHC init
Created attachment 1218194 [details] Test kernel module to check calculations
Created attachment 1218195 [details] Patch that fix issue and add debug printf
Created attachment 1218196 [details] Dmesg from patched kernel - issue fixed
This needs to have Mellanox look at it. Adding on-site engineers to this bug.
Closing this BZ. It has been fixed in our current release, kernel-rt-3.10.0-514.rt56.210.el6rt.