Bug 1598752

Summary: Vhost-user: missing increment of log cache count (Regression)
Product: Red Hat Enterprise Linux 7 Reporter: Maxime Coquelin <maxime.coquelin>
Component: openvswitch Assignee: Timothy Redaelli <tredaelli>
Status: CLOSED ERRATA QA Contact: Pei Zhang <pezhang>
Severity: high Docs Contact:
Priority: high    
Version: 7.5 CC: ailan, atragler, ctrautma, kfida, ktraynor, pezhang, pvauter, qding, tredaelli
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openvswitch-2.9.0-52.el7fdn Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-08-15 13:53:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Maxime Coquelin 2018-07-06 11:51:04 UTC
Description of problem:

A regression was introduced by the patch "vhost: improve dirty pages logging performance".

The patch introduces a cache whose element count is never incremented.
As a result, live migration is broken: pages dirtied by the vhost-user backend
aren't logged, which is likely to result in packet loss during migration.
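
To illustrate why the missing increment matters, here is a minimal, self-contained sketch of such a cache. It is a simplified model, not the DPDK code: the toy_* names, TOY_CACHE_NR and the "fixed" flag are hypothetical and exist only for this illustration. The sync step trusts the element count, so if the count stays at zero, nothing is ever flushed to the dirty log.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_NR 32                       /* hypothetical cache capacity */
#define TOY_BITS (sizeof(unsigned long) * 8)  /* pages tracked per log word */

struct toy_cache_entry {
    uint32_t offset;       /* index of the word in the dirty log */
    unsigned long val;     /* dirty bits accumulated for that word */
};

struct toy_vq {
    struct toy_cache_entry log_cache[TOY_CACHE_NR];
    uint16_t log_cache_nb_elem;   /* number of valid cache entries */
};

/* Record a dirty page in the cache. "fixed" is an illustration-only
 * switch: non-zero models the fixed code (count incremented), zero
 * models the regression (count never incremented). */
static void toy_log_cache_page(struct toy_vq *vq, unsigned long *log,
                               uint64_t page, int fixed)
{
    uint32_t bit_nr = (uint32_t)(page % TOY_BITS);
    uint32_t offset = (uint32_t)(page / TOY_BITS);
    uint16_t i;

    /* Merge into an existing entry for the same log word, if any. */
    for (i = 0; i < vq->log_cache_nb_elem; i++) {
        if (vq->log_cache[i].offset == offset) {
            vq->log_cache[i].val |= 1UL << bit_nr;
            return;
        }
    }

    /* Cache full: write the dirty bit to the log directly. */
    if (i >= TOY_CACHE_NR) {
        log[offset] |= 1UL << bit_nr;
        return;
    }

    vq->log_cache[i].offset = offset;
    vq->log_cache[i].val = 1UL << bit_nr;
    if (fixed)
        vq->log_cache_nb_elem++;
}

/* Flush cached dirty bits into the log. Only the first
 * log_cache_nb_elem entries are treated as valid. */
static void toy_log_cache_sync(struct toy_vq *vq, unsigned long *log)
{
    uint16_t i;

    for (i = 0; i < vq->log_cache_nb_elem; i++)
        log[vq->log_cache[i].offset] |= vq->log_cache[i].val;
    vq->log_cache_nb_elem = 0;
}

/* Count the pages marked dirty in a log of "words" words. */
static unsigned toy_count_dirty(const unsigned long *log, size_t words)
{
    unsigned n = 0;

    for (size_t w = 0; w < words; w++)
        for (unsigned long v = log[w]; v != 0; v &= v - 1)
            n++;
    return n;
}
```

In this model, caching two pages with fixed == 0 and then syncing leaves the log empty, because every new page overwrites entry 0 and the sync loop runs zero times. With fixed != 0, both pages reach the log.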

Version-Release number of selected component (if applicable):
2.9.0-38

How reproducible:

This was caught by code review; it is difficult to
detect the problem by running live-migration test cases.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

The regression has been fixed upstream, but the fix is not yet backported to v17.11 LTS
at the time of this writing.

This is the DPDK patch that fixes the regression:

commit 511b413bbcbd711dcf582485a72213fa90595cbe
Author: Maxime Coquelin <maxime.coquelin>
Date:   Fri Jun 15 15:48:46 2018 +0200

    vhost: fix missing increment of log cache count
    
    The log_cache_nb_elem was never incremented, resulting
    in all dirty pages to be missed during live migration.
    
    Fixes: c16915b87109 ("vhost: improve dirty pages logging performance")
    Cc: stable
    
    Reported-by: Peng He <xnhp0320>
    Signed-off-by: Maxime Coquelin <maxime.coquelin>
    Acked-by: Ilya Maximets <i.maximets>
    Reviewed-by: Tiwei Bie <tiwei.bie>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 528e01c8f..786a74f64 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -429,6 +429,7 @@ vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
        vq->log_cache[i].offset = offset;
        vq->log_cache[i].val = (1UL << bit_nr);
+       vq->log_cache_nb_elem++;
 }
 
 static __rte_always_inline void

Comment 5 Pei Zhang 2018-07-27 01:25:06 UTC
==Verification==

Tested live migration on RHEL7.6 and RHEL7.5.z systems. Both work well.

(1) Test openvswitch-2.9.0-55.el7fdp.x86_64 over RHEL7.6: PASS

Versions:

    qemu-kvm-rhev-2.12.0-8.el7.x86_64
    libvirt-4.5.0-4.el7.x86_64
    tuned-2.9.0-1.el7.noarch
    dpdk-17.11-10.el7fdb.x86_64
    openvswitch-2.9.0-55.el7fdp.x86_64

Results:

    Scenario 1: live migration with vhost-user 2 queues: PASS

    =======================Stream Rate: 1Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       1Mpps      158     19813        15     451189.0
     1       1Mpps      158     22763        19     477425.0
     2       1Mpps      163     19838        17     482764.0
     3       1Mpps      158     21749        15     472611.0
     4       1Mpps      131     64385        13     106887.0
     5       1Mpps      164     20071        16     485374.0
     6       1Mpps      130     64380        15     102214.0
     7       1Mpps      155     19751        15     467876.0
     8       1Mpps      155     20121        16     465566.0
     9       1Mpps      166     19468        16     487194.0
    |------------------------Statistic------------------------|
       Max   1Mpps      166     64385        19       487194
       Min   1Mpps      130     19468        13       102214
      Mean   1Mpps      153     29233        15       399910
    Median   1Mpps      158     20096        15       470243
     Stdev       0    12.82  18553.47      1.57    156036.46
    
    =======================Stream Rate: 2Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       2Mpps      165     16224        16     991056.0
     1       2Mpps      163     17201        17    1000683.0
     2       2Mpps      168     21262        16    1158717.0
     3       2Mpps      167     16585        17    1011464.0
     4       2Mpps      166     16228        16     993732.0
     5       2Mpps      164     16541        17    1004944.0
     6       2Mpps      153     15825        18     955247.0
     7       2Mpps      161     15833        15     972007.0
     8       2Mpps      153     15436        16     945746.0
     9       2Mpps      161     19428        17    1021792.0
    |------------------------Statistic------------------------|
       Max   2Mpps      168     21262        18      1158717
       Min   2Mpps      153     15436        15       945746
      Mean   2Mpps      162     17056        16      1005538
    Median   2Mpps      163     16384        16       997207
     Stdev       0     5.32   1851.07      0.85     59033.65
    
    =======================Stream Rate: 3Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       3Mpps      127     64209        13    2269768.0
     1       3Mpps      157     18554        15    1550979.0
     2       3Mpps      177     17353        18    1580577.0
     3       3Mpps      170     17429        17    1551081.0
     4       3Mpps      165     17262        16    1511554.0
     5       3Mpps      150     17069        17    1443302.0
     6       3Mpps      123     64349        12    2220502.0
     7       3Mpps      131     64238        15    2166204.0
     8       3Mpps      162     16466        16    1475725.0
     9       3Mpps      151     19991        18    1517579.0
    |------------------------Statistic------------------------|
       Max   3Mpps      177     64349        18      2269768
       Min   3Mpps      123     16466        12      1443302
      Mean   3Mpps      151     31692        15      1728727
    Median   3Mpps      154     17991        16      1551030
     Stdev       0    18.71  22498.21       2.0    341285.16
    
    =======================Stream Rate: 4Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       4Mpps      164     18768        19    2113516.0
     1       4Mpps      163     17917        16    2046435.0
     2       4Mpps      163     18205        18    2079404.0
     3       4Mpps      162     22842        16    2389273.0
     4       4Mpps      161     17272        19    2039292.0
     5       4Mpps      171     17328        17    2089298.0
     6       4Mpps      166     16743        16    2010342.0
     7       4Mpps      167     16725        17    2040245.0
     8       4Mpps      161     16781        15    1977642.0
     9       4Mpps      164     18567        17    2093860.0
    |------------------------Statistic------------------------|
       Max   4Mpps      171     22842        19      2389273
       Min   4Mpps      161     16725        15      1977642
      Mean   4Mpps      164     18114        17      2087930
    Median   4Mpps      163     17622        17      2062919
     Stdev       0     3.08   1824.12      1.33    113586.32
    
    =======================Stream Rate: 5Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       5Mpps      127     64369        12    5211278.0
     1       5Mpps      124     64338        12    5173191.0
     2       5Mpps      163     16599        17    2478128.0
     3       5Mpps      127     64352        12    5145579.0
     4       5Mpps      131     64247        12    5214652.0
     5       5Mpps      135     64699        14    5166328.0
     6       5Mpps      139     64261        14    5172530.0
     7       5Mpps      133     64133        13    5146551.0
     8       5Mpps      125     64086        14    5180036.0
     9       5Mpps      136     64246        13    5234906.0
    |------------------------Statistic------------------------|
       Max   5Mpps      163     64699        17      5234906
       Min   5Mpps      124     16599        12      2478128
      Mean   5Mpps      134     59533        13      4912317
    Median   5Mpps      132     64254        13      5172860
     Stdev       0    11.35  15086.39      1.57    855788.25
    
    
    
    Scenario 2: live migration with vhost-user 1 queue: PASS
    
    =======================Stream Rate: 1Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       1Mpps      146     15641        17     493348.0
     1       1Mpps      143     13976        16     482615.0
     2       1Mpps      147     14018        16     495862.0
     3       1Mpps      138     13818        16     477795.0
     4       1Mpps      140     13762        16     480692.0
     5       1Mpps      127     13849        15     453224.0
     6       1Mpps      142     13836        14     485654.0
     7       1Mpps      133     13687        14     467472.0
     8       1Mpps      138     13656        16     478171.0
     9       1Mpps      143     13650        15     486969.0
    |------------------------Statistic------------------------|
       Max   1Mpps      147     15641        17       495862
       Min   1Mpps      127     13650        14       453224
      Mean   1Mpps      139     13989        15       480180
    Median   1Mpps      141     13827        16       481653
     Stdev       0     6.07    593.57      0.97     12469.57
    
    =======================Stream Rate: 2Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       2Mpps      142     13806        15     978357.0
     1       2Mpps      145     13694        16     982945.0
     2       2Mpps      135     13756        16     948356.0
     3       2Mpps      137     13638        16     952975.0
     4       2Mpps      144     13693        17     988426.0
     5       2Mpps      130     13633        14     927550.0
     6       2Mpps      136     14588        16     949244.0
     7       2Mpps      127     14459        15     916520.0
     8       2Mpps      129     13592        15     920487.0
     9       2Mpps      129     14542        14     924674.0
    |------------------------Statistic------------------------|
       Max   2Mpps      145     14588        17       988426
       Min   2Mpps      127     13592        14       916520
      Mean   2Mpps      135     13940        15       948953
    Median   2Mpps      135     13725        15       948800
     Stdev       0     6.62    412.52      0.97     26883.61
    
    =======================Stream Rate: 3Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       3Mpps      146     14412        17    1487473.0
     1       3Mpps      140     14103        16    1447916.0
     2       3Mpps      132     14177        15    1401315.0
     3       3Mpps      140     14214        16    1450662.0
     4       3Mpps      143     14067        16    1471002.0
     5       3Mpps      136     14114        16    1419882.0
     6       3Mpps      138     14045        16    1443870.0
     7       3Mpps      148     13834        17    1501742.0
     8       3Mpps      138     13741        16    1439716.0
     9       3Mpps      141     13742        16    1448684.0
    |------------------------Statistic------------------------|
       Max   3Mpps      148     14412        17      1501742
       Min   3Mpps      132     13741        15      1401315
      Mean   3Mpps      140     14044        16      1451226
    Median   3Mpps      140     14085        16      1448300
     Stdev       0     4.69    215.52      0.57     29692.28
    
    =======================Stream Rate: 4Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       4Mpps      141     13609        16    1949419.0
     1       4Mpps      142     13639        16    1971410.0
     2       4Mpps      145     13717        17    1973021.0
     3       4Mpps      131     14495        15    1875741.0
     4       4Mpps      139     13673        16    1941009.0
     5       4Mpps      129     14425        15    1856364.0
     6       4Mpps      132     14567        15    1873700.0
     7       4Mpps      135     14480        16    1914211.0
     8       4Mpps      136     14414        16    1910466.0
     9       4Mpps      134     14372        15    1895856.0
    |------------------------Statistic------------------------|
       Max   4Mpps      145     14567        17      1973021
       Min   4Mpps      129     13609        15      1856364
      Mean   4Mpps      136     14139        15      1916119
    Median   4Mpps      135     14393        16      1912338
     Stdev       0     5.21    416.87      0.67      41459.4
    
    =======================Stream Rate: 5Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       5Mpps      142     14399        15   12322273.0
     1       5Mpps      142     14350        16   12419495.0
     2       5Mpps      136     14126        15   10583626.0
     3       5Mpps      136     13788        16   12459350.0
     4       5Mpps      131     13840        15   12293454.0
     5       5Mpps      137     13762        16    6264477.0
     6       5Mpps      136     13791        16   10688540.0
     7       5Mpps      142     13790        16   12025734.0
     8       5Mpps      148     13842        17   10769242.0
     9       5Mpps      137     14774        15   12217261.0
    |------------------------Statistic------------------------|
       Max   5Mpps      148     14774        17     12459350
       Min   5Mpps      131     13762        15      6264477
      Mean   5Mpps      138     14046        15     11204345
    Median   5Mpps      137     13841        16     12121497
     Stdev       0     4.79    352.02      0.67   1898279.47




(2) Test openvswitch-2.9.0-55.el7fdp.x86_64 over RHEL7.5.z: PASS

Versions:

    3.10.0-862.11.1.el7.x86_64
    libvirt-3.9.0-14.el7_5.6.x86_64
    openvswitch-2.9.0-55.el7fdp.x86_64
    tuned-2.9.0-1.el7.noarch
    dpdk-17.11-10.el7fdb.x86_64
    qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

Results:

    Scenario 1: live migration with vhost-user 2 queues: PASS
    =======================Stream Rate: 1Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       1Mpps      144     27265        18     357671.0
     1       1Mpps      152     17641        15     359873.0
     2       1Mpps      153     21895        15     363786.0
     3       1Mpps      150     21890        14     522820.0
     4       1Mpps      139     32116        17     369904.0
     5       1Mpps      159     19643        19     376134.0
     6       1Mpps      155     22023        15     367286.0
     7       1Mpps      164     19412        17     380800.0
     8       1Mpps      160     24038        16     378965.0
     9       1Mpps      115     64264        12        569.0
    |------------------------Statistic------------------------|
       Max   1Mpps      164     64264        19       522820
       Min   1Mpps      115     17641        12          569
      Mean   1Mpps      149     27018        15       347780
    Median   1Mpps      152     21959        15       368595
     Stdev       0     14.1  13743.11      2.04     131416.0
    
    =======================Stream Rate: 2Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       2Mpps      155     14612        16     748606.0
     1       2Mpps      156     14854        15    1080685.0
     2       2Mpps      148     16990        16     740497.0
     3       2Mpps      150     15983        16     734332.0
     4       2Mpps      153     16111        16     741944.0
     5       2Mpps      151     20197        14     776075.0
     6       2Mpps      152     15672        16     732666.0
     7       2Mpps      147     15165        15     712528.0
     8       2Mpps      153     19077        16     771798.0
     9       2Mpps      148     15517        16     724650.0
    |------------------------Statistic------------------------|
       Max   2Mpps      156     20197        16      1080685
       Min   2Mpps      147     14612        14       712528
      Mean   2Mpps      151     16417        15       776378
    Median   2Mpps      151     15827        16       741220
     Stdev       0     3.06   1844.15       0.7    108678.66
    
    =======================Stream Rate: 3Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       3Mpps      147     15393        16    1600504.0
     1       3Mpps      156     15048        16    1610687.0
     2       3Mpps      148     14861        16    1122905.0
     3       3Mpps      152     15075        19    1695779.0
     4       3Mpps      158     15380        16    1188020.0
     5       3Mpps      132     15750        14    1590752.0
     6       3Mpps      132     14882        16    1114096.0
     7       3Mpps      152     14774        16    1137946.0
     8       3Mpps      147     14713        14    1581406.0
     9       3Mpps      154     16707        16    1692790.0
    |------------------------Statistic------------------------|
       Max   3Mpps      158     16707        19      1695779
       Min   3Mpps      132     14713        14      1114096
      Mean   3Mpps      147     15258        15      1433488
    Median   3Mpps      150     15061        16      1586079
     Stdev       0      9.1    603.93      1.37    255606.53
    
    =======================Stream Rate: 4Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       4Mpps      149     15965        17    1621926.0
     1       4Mpps      156     17400        17    1738364.0
     2       4Mpps      147     16126        15    2191332.0
     3       4Mpps      157     20683        18    1900610.0
     4       4Mpps      154     15384        16    2213498.0
     5       4Mpps      161     15015        17    2245792.0
     6       4Mpps      152     15384        15    1576313.0
     7       4Mpps      156     14989        15    1711173.0
     8       4Mpps      147     15699        15    1560115.0
     9       4Mpps      155     15507        18    1594805.0
    |------------------------Statistic------------------------|
       Max   4Mpps      161     20683        18      2245792
       Min   4Mpps      147     14989        15      1560115
      Mean   4Mpps      153     16215        16      1835392
    Median   4Mpps      154     15603        16      1724768
     Stdev       0      4.6   1716.89      1.25     281569.6
    
    =======================Stream Rate: 5Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       5Mpps      156     14371        15    1960618.0
     1       5Mpps      145     16697        14    2045776.0
     2       5Mpps      149     21116        17    2474624.0
     3       5Mpps      157     14248        15    1960252.0
     4       5Mpps      148     14980        18    1957918.0
     5       5Mpps      156     14606        17    1992246.0
     6       5Mpps      151     15013        15    1908936.0
     7       5Mpps      163     14654        15    2046271.0
     8       5Mpps      159     15878        17    2873091.0
     9       5Mpps      147     14634        17    2111070.0
    |------------------------Statistic------------------------|
       Max   5Mpps      163     21116        18      2873091
       Min   5Mpps      145     14248        14      1908936
      Mean   5Mpps      153     15619        16      2133080
    Median   5Mpps      153     14817        16      2019011
     Stdev       0     5.92    2070.6      1.33    305555.95




    Scenario 2: live migration with vhost-user 1 queue: PASS
    =======================Stream Rate: 1Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       1Mpps      124     13399        14     287705.0
     1       1Mpps      127     13507        15     294062.0
     2       1Mpps      133     13954        14     374276.0
     3       1Mpps      131     13224        14     301607.0
     4       1Mpps      126     13748        15     314343.0
     5       1Mpps      128     13092        14     293806.0
     6       1Mpps      130     14259        15     459060.0
     7       1Mpps      128     13447        15     428637.0
     8       1Mpps      127     13632        15     295462.0
     9       1Mpps      127     13757        14     293573.0
    |------------------------Statistic------------------------|
       Max   1Mpps      133     14259        15       459060
       Min   1Mpps      124     13092        14       287705
      Mean   1Mpps      128     13601        14       334253
    Median   1Mpps      127     13569        14       298534
     Stdev       0      2.6    346.26      0.53     63356.78
    
    =======================Stream Rate: 2Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       2Mpps      125     13441        14     587439.0
     1       2Mpps      124     13722        14     822653.0
     2       2Mpps      135     13267        15     633917.0
     3       2Mpps      124     13507        14     594731.0
     4       2Mpps      137     13112        16     648588.0
     5       2Mpps      124     13319        15     590963.0
     6       2Mpps      120     12994        14     572861.0
     7       2Mpps      126     13312        15     600018.0
     8       2Mpps      137     13015        16     639411.0
     9       2Mpps      130     13268        14     615823.0
    |------------------------Statistic------------------------|
       Max   2Mpps      137     13722        16       822653
       Min   2Mpps      120     12994        14       572861
      Mean   2Mpps      128     13295        14       630640
    Median   2Mpps      125     13290        14       607920
     Stdev       0     6.14    224.46      0.82     71883.01
    
    =======================Stream Rate: 3Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       3Mpps      133     13422        15    1399797.0
     1       3Mpps      134     13221        15     960617.0
     2       3Mpps      120     12953        14     877478.0
     3       3Mpps      121     14332        14    1333289.0
     4       3Mpps      134     13470        15     960206.0
     5       3Mpps      119     13543        14     875418.0
     6       3Mpps      126     13391        14     908054.0
     7       3Mpps      123     13710        15     891404.0
     8       3Mpps      128     13327        15     920344.0
     9       3Mpps      126     13488        15     916400.0
    |------------------------Statistic------------------------|
       Max   3Mpps      134     14332        15      1399797
       Min   3Mpps      119     12953        14       875418
      Mean   3Mpps      126     13485        14      1004300
    Median   3Mpps      126     13446        15       918372
     Stdev       0     5.76    359.34      0.52    193787.49
    
    =======================Stream Rate: 4Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       4Mpps      131     13169        15    1272490.0
     1       4Mpps      135     13461        15    1313776.0
     2       4Mpps      128     13141        15    1249004.0
     3       4Mpps      130     13223        15    1263447.0
     4       4Mpps      123     13838        14    1765944.0
     5       4Mpps      128     13545        15    1852889.0
     6       4Mpps      128     13643        14    1253172.0
     7       4Mpps      127     14456        14    1864338.0
     8       4Mpps      134     13528        15    1298461.0
     9       4Mpps      131     14292        14    1869694.0
    |------------------------Statistic------------------------|
       Max   4Mpps      135     14456        15      1869694
       Min   4Mpps      123     13141        14      1249004
      Mean   4Mpps      129     13629        14      1500321
    Median   4Mpps      129     13536        15      1306118
     Stdev       0      3.5    450.61      0.52    292804.48
    
    =======================Stream Rate: 5Mpps=========================
    No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
     0       5Mpps      124     12988        15    4112925.0
     1       5Mpps      129     13213        15    6736262.0
     2       5Mpps      133     12889        16    4543194.0
     3       5Mpps      119     13981        13   12241181.0
     4       5Mpps      122     13404        13    6211496.0
     5       5Mpps      122     14401        15   12206880.0
     6       5Mpps      129     14110        15   10317434.0
     7       5Mpps      128     14379        14   12140129.0
     8       5Mpps      131     14095        15   11325710.0
     9       5Mpps      133     13313        15    6698748.0
    |------------------------Statistic------------------------|
       Max   5Mpps      133     14401        16     12241181
       Min   5Mpps      119     12889        13      4112925
      Mean   5Mpps      127     13677        14      8653395
    Median   5Mpps      128     13692        15      8526848
     Stdev       0     4.94    576.35      0.97   3308836.75

Comment 6 Pei Zhang 2018-07-27 01:41:01 UTC
As Maxime mentioned, it's not easy to trigger this issue by running test cases, so we did live migration sanity testing. In Comment 5, live migration with vhost-user 2 queues and with 1 queue both work as expected.

We noticed that the T-Rex packet loss is high (twice the expected value) when the stream rate is large (>= 5Mpps). I think this might be a new issue; we'll double-check and file a new bz to track it.


Moving this bug to 'VERIFIED'.

Maxime, Timothy, please correct me if anything is wrong with the verification. Thanks.

Comment 7 Timothy Redaelli 2018-08-10 13:45:33 UTC
The openvswitch component is delivered through the fast datapath channel, so it is not documented in release notes.

Comment 9 errata-xmlrpc 2018-08-15 13:53:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2432