I've run into a problem and would like to ask everyone for advice. I wanted to improve cache-read performance in a project, so I replaced the original Mutex with an RwLock, only to find that even with no writes at all, just concurrent reads, performance barely improved. So I wrote the benchmark below, and the results really surprised me:
use std::collections::HashMap;
use std::sync::Arc;
use std::sync::RwLock;
use std::thread;
use std::time;

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;

    // Pre-populate the map with `total` entries.
    let mut m = HashMap::new();
    for i in 0..total {
        m.insert(i, i);
    }
    let m = Arc::new(RwLock::new(m));

    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                // Each thread looks up every key, taking the read
                // lock once per lookup.
                for i in 0..total {
                    let _x = m.read().unwrap().get(&i);
                }
            })
        })
        .collect();
    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}
cargo run --release
The output is as follows:
threads: 1; time used: 77.838377ms; ips: 12847133.23352053
threads: 2; time used: 205.569367ms; ips: 9729076.025223155
threads: 3; time used: 328.003797ms; ips: 9146235.584583797
threads: 4; time used: 415.737089ms; ips: 9621465.358362578
threads: 5; time used: 508.222261ms; ips: 9838215.252834035
threads: 6; time used: 586.550472ms; ips: 10229298.732880399
threads: 7; time used: 720.991697ms; ips: 9708849.67070571
threads: 8; time used: 856.792181ms; ips: 9337153.369750464
Each thread performs an identical workload, yet adding threads clearly only made things worse.
I translated the code into Go and found the situation is much better:
threads: 1; time used: 156.012685ms; ips: 6409735.208390
threads: 2; time used: 163.830266ms; ips: 12207756.532606
threads: 3; time used: 189.644867ms; ips: 15819041.387500
threads: 4; time used: 209.123695ms; ips: 19127435.559132
threads: 5; time used: 225.407194ms; ips: 22182078.181586
threads: 6; time used: 261.852325ms; ips: 22913678.539994
threads: 7; time used: 296.061541ms; ips: 23643732.908895
threads: 8; time used: 322.794129ms; ips: 24783598.217178
The speedup isn't linear, but compared with my Rust version it is much closer to what I expected.
I suspect the problem is in my Rust code. Could anyone point out where it went wrong, or tell me the right way to get good concurrent read performance out of a HashMap?
EDIT 1
Actually, I think the key point is not how long RwLock takes to lock and unlock, but why performance fails to increase, and even drops, as the thread count grows. With threads that only read and never write, suppose a single thread takes 100ms; if the reads truly run in parallel and there are enough cores, then ideally N threads should still take about 100ms, for N times the single-thread throughput. Of course, in practice efficiency can't scale linearly with the thread count, since thread switching and state-synchronization overhead are unavoidable, but single-threaded shouldn't be the fastest configuration either; after all, the whole point of RwLock is to support concurrent reads. Something still feels off.
EDIT 2
@hr567 suggested an optimization: move the lock outside the loop (see the sketch below). Performance is indeed much better that way, and the results match expectations. For a caching scenario, though, keeping the lock inside the loop is probably the more accurate simulation: the loop models a stream of incoming requests, each of which needs one cache access, and you can't hold the read lock indefinitely, because a write lock is occasionally needed to refresh the cache. So does this mean that in Rust, for a read-heavy, rarely-written cache, Mutex is still the better choice? (I had written a parameter wrong in my earlier test; in fact, except for the single-threaded case, RwLock still performs better than Mutex.)
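For reference, here is a minimal sketch of hr567's suggestion as I understand it, written as a drop-in replacement for the thread-spawning section of the benchmark above (the exact shape is my interpretation, not hr567's original code):

    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                // Take the read lock once per thread instead of once per
                // lookup; the hot loop then runs with no lock traffic at all.
                let map = m.read().unwrap();
                for i in 0..total {
                    let _x = map.get(&i);
                }
            })
        })
        .collect();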
Below are Mutex and RwLock results from the same machine; apart from swapping the lock type, the code is completely identical:
Mutex:
threads: 1; time used: 78.146596ms; ips: 12796462.689174587
threads: 2; time used: 554.152693ms; ips: 3609113.562496032
threads: 3; time used: 417.027343ms; ips: 7193772.90327939
threads: 4; time used: 717.132682ms; ips: 5577768.38289459
threads: 5; time used: 1.701271272s; ips: 2938978.6815844146
threads: 6; time used: 1.817029184s; ips: 3302093.3581218696
threads: 7; time used: 2.372727488s; ips: 2950191.30321636
threads: 8; time used: 2.505103477s; ips: 3193480.857557406
RwLock:
threads: 1; time used: 107.624433ms; ips: 9291570.437355986
threads: 2; time used: 278.304096ms; ips: 7186383.631234806
threads: 3; time used: 406.974556ms; ips: 7371468.205496365
threads: 4; time used: 527.331438ms; ips: 7585362.282155459
threads: 5; time used: 618.426131ms; ips: 8085039.990006503
threads: 6; time used: 767.771963ms; ips: 7814820.401301891
threads: 7; time used: 830.143264ms; ips: 8432279.46736673
threads: 8; time used: 908.431399ms; ips: 8806388.692427836
Comments
I tried your code and found that it actually does improve throughput. But past 3 threads, there's almost no further gain.
With the total workload held constant, increasing the thread count does reduce the elapsed time.
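That presumably means splitting the fixed total across threads rather than having every thread scan all keys. A minimal sketch of such a variant (the even key-range split is my assumption, not the commenter's actual code):

    // Hypothetical fixed-total variant: the `total` lookups are divided among
    // the threads instead of being repeated by each one, so adding threads
    // should shorten wall-clock time rather than multiply the work.
    let chunk = total / concurrency;
    let threads: Vec<_> = (0..concurrency)
        .map(|t| {
            let m = m.clone();
            thread::spawn(move || {
                // Each thread scans only its own slice of the key range.
                for i in t * chunk..(t + 1) * chunk {
                    let _x = m.read().unwrap().get(&i);
                }
            })
        })
        .collect();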