I've run into a problem and would like to ask everyone for advice. I wanted to improve cache-read performance in a project, so I replaced the original Mutex with an RwLock, only to find that even with no writes at all, just concurrent reads, performance barely improved. So I wrote the benchmark below, and the results really surprised me:
use std::collections::HashMap;
use std::sync::Arc;
use std::sync::RwLock;
use std::thread;
use std::time;

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;

    // Pre-populate the map with `total` entries.
    let mut m = HashMap::new();
    for i in 0..total {
        m.insert(i, i);
    }
    let m = Arc::new(RwLock::new(m));

    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                // Each thread looks up every key, taking the read
                // lock once per lookup.
                for i in 0..total {
                    let _x = m.read().unwrap().get(&i);
                }
            })
        })
        .collect();
    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}
cargo run --release
The output is as follows:
threads: 1; time used: 77.838377ms; ips: 12847133.23352053
threads: 2; time used: 205.569367ms; ips: 9729076.025223155
threads: 3; time used: 328.003797ms; ips: 9146235.584583797
threads: 4; time used: 415.737089ms; ips: 9621465.358362578
threads: 5; time used: 508.222261ms; ips: 9838215.252834035
threads: 6; time used: 586.550472ms; ips: 10229298.732880399
threads: 7; time used: 720.991697ms; ips: 9708849.67070571
threads: 8; time used: 856.792181ms; ips: 9337153.369750464
Each thread performs an identical workload, yet adding threads clearly only made things worse.
I translated the code into Go and found the situation is much better:
threads: 1; time used: 156.012685ms; ips: 6409735.208390
threads: 2; time used: 163.830266ms; ips: 12207756.532606
threads: 3; time used: 189.644867ms; ips: 15819041.387500
threads: 4; time used: 209.123695ms; ips: 19127435.559132
threads: 5; time used: 225.407194ms; ips: 22182078.181586
threads: 6; time used: 261.852325ms; ips: 22913678.539994
threads: 7; time used: 296.061541ms; ips: 23643732.908895
threads: 8; time used: 322.794129ms; ips: 24783598.217178
The speedup isn't linear, but compared with my Rust version it is much closer to what I expected.
I suspect the problem is in my Rust code. Could anyone point out where it went wrong, or tell me the right way to get good concurrent read performance out of a HashMap?
EDIT 1
Actually, I think the key point is not how long RwLock takes to lock and unlock, but why performance fails to increase, and even drops, as the thread count grows. With threads that only read and never write, suppose a single thread takes 100ms; if the reads truly run in parallel and there are enough cores, then ideally N threads should still take about 100ms, for N times the single-thread throughput. Of course, in practice efficiency can't scale linearly with the thread count, since thread switching and state-synchronization overhead are unavoidable, but single-threaded shouldn't be the fastest configuration either; after all, the whole point of RwLock is to support concurrent reads. Something still feels off.
EDIT 2
@hr567 suggested an optimization: move the lock outside the loop (see the sketch below). Performance is indeed much better that way, and the results match expectations. For a caching scenario, though, keeping the lock inside the loop is probably the more accurate simulation: the loop models a stream of incoming requests, each of which needs one cache access, and you can't hold the read lock indefinitely, because a write lock is occasionally needed to refresh the cache. So does this mean that in Rust, for a read-heavy, rarely-written cache, Mutex is still the better choice? (I had written a parameter wrong in my earlier test; in fact, except for the single-threaded case, RwLock still performs better than Mutex.)
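For reference, here is a minimal sketch of hr567's suggestion as I understand it, written as a drop-in replacement for the thread-spawning section of the benchmark above (the exact shape is my interpretation, not hr567's original code):

    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                // Take the read lock once per thread instead of once per
                // lookup; the hot loop then runs with no lock traffic at all.
                let map = m.read().unwrap();
                for i in 0..total {
                    let _x = map.get(&i);
                }
            })
        })
        .collect();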
Below are Mutex and RwLock results from the same machine; apart from swapping the lock type, the code is completely identical:
Mutex:
threads: 1; time used: 78.146596ms; ips: 12796462.689174587
threads: 2; time used: 554.152693ms; ips: 3609113.562496032
threads: 3; time used: 417.027343ms; ips: 7193772.90327939
threads: 4; time used: 717.132682ms; ips: 5577768.38289459
threads: 5; time used: 1.701271272s; ips: 2938978.6815844146
threads: 6; time used: 1.817029184s; ips: 3302093.3581218696
threads: 7; time used: 2.372727488s; ips: 2950191.30321636
threads: 8; time used: 2.505103477s; ips: 3193480.857557406
RwLock:
threads: 1; time used: 107.624433ms; ips: 9291570.437355986
threads: 2; time used: 278.304096ms; ips: 7186383.631234806
threads: 3; time used: 406.974556ms; ips: 7371468.205496365
threads: 4; time used: 527.331438ms; ips: 7585362.282155459
threads: 5; time used: 618.426131ms; ips: 8085039.990006503
threads: 6; time used: 767.771963ms; ips: 7814820.401301891
threads: 7; time used: 830.143264ms; ips: 8432279.46736673
threads: 8; time used: 908.431399ms; ips: 8806388.692427836
Comments
I tried your code and found that it actually does improve throughput. But past 3 threads, there's almost no further gain.
With the total workload held constant, increasing the thread count does reduce the elapsed time.
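That presumably means splitting the fixed total across threads rather than having every thread scan all keys. A minimal sketch of such a variant (the even key-range split is my assumption, not the commenter's actual code):

    // Hypothetical fixed-total variant: the `total` lookups are divided among
    // the threads instead of being repeated by each one, so adding threads
    // should shorten wall-clock time rather than multiply the work.
    let chunk = total / concurrency;
    let threads: Vec<_> = (0..concurrency)
        .map(|t| {
            let m = m.clone();
            thread::spawn(move || {
                // Each thread scans only its own slice of the key range.
                for i in t * chunk..(t + 1) * chunk {
                    let _x = m.read().unwrap().get(&i);
                }
            })
        })
        .collect();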