When I first started learning Rust I had a shallow brush with atomic operations, and at the time I really didn't understand what the Ordering enum parameter meant. Later I read plenty of articles, but unfortunately most of them just recite the standard library documentation like scripture, leaving beginners as lost as before. The gist is always the same: Ordering::Relaxed imposes the weakest ordering constraints; Ordering::Acquire and Ordering::Release are generally used as a pair, like a locking API, roughly pinning down ordering between operations; and when in doubt use Ordering::SeqCst, which guarantees a total order of atomic operations at some cost in performance.
That still left me with plenty of questions. For example: in the vast majority of cases, as long as the Ordering argument is legal, nothing unexpected happens — so when exactly does Ordering::Relaxed actually misbehave? Ordering::Acquire and Ordering::Release show up in different threads in some articles, but these are atomic operations; unlike with a lock, one thread never stops and waits for another thread's atomic operation, so what is the point of putting them in different threads? And can anyone give a concrete example of Ordering::SeqCst being slower?
A couple of days ago I noticed a book about atomic operations and locks in the daily newsletter, skimmed through it, read the code on GitHub, and realized that if you want to understand atomic operations, spin locks are the place to start.
Rust Atomics and Locks: https://marabos.nl/atomics/
First, let's implement the simplest possible boolean spin lock. The idea is straightforward: for now, store the number to be modified, R, in a static global variable and modify R inside an unsafe block; use an AtomicBool atomic variable as the lock flag and a while loop to block threads. Whichever thread grabs false immediately sets the lock to true, meaning locked; every other thread spins in a long loop because the lock is taken; the winning thread modifies the data and then sets the lock back to false, and all threads contend for the lock again.
We spawn 10 threads to process the data in parallel, each performing 100,000 addition operations. Then we set up a standard-library Arc<Mutex<usize>> for comparison on both time and correctness, and wrap the whole thing in an outer loop to repeat the run several times.
use std::thread;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::sync::Mutex;
use std::time::Instant;

// 10 threads, 100000 operations each
const N_THREADS: usize = 10;
const N_TIMES: usize = 100000;

// R is the variable to modify; SPIN_LOCK is the lock flag
static mut R: usize = 0;
static SPIN_LOCK: AtomicBool = AtomicBool::new(false);
fn main() {
    for t in 1..=10 {
        unsafe {
            R = 0;
        }
        let start_of_spin = Instant::now();
        let handles = (0..N_THREADS).map(|i| {
            thread::spawn(move || {
                unsafe {
                    for j in i * N_TIMES..(i + 1) * N_TIMES {
                        // Block the thread with a while loop; swap keeps the check and the
                        // write atomic. Relaxed, the weakest ordering, is used here
                        while SPIN_LOCK.swap(true, Ordering::Relaxed) {}
                        // Modify the data; this on its own is not thread-safe
                        R += j;
                        // Set the lock back to false so all threads contend again
                        SPIN_LOCK.store(false, Ordering::Relaxed);
                    }
                }
            })
        }).collect::<Vec<_>>();
        for handle in handles {
            handle.join().unwrap();
        }
        let time_of_spin = start_of_spin.elapsed();
        let r = Arc::new(Mutex::new(0));
        let start_of_mutex = Instant::now();
        // The standard multithreaded way to modify data
        let handles = (0..N_THREADS).map(|i| {
            let r = r.clone();
            thread::spawn(move || {
                for j in i * N_TIMES..(i + 1) * N_TIMES {
                    *r.lock().unwrap() += j;
                }
            })
        }).collect::<Vec<_>>();
        for handle in handles {
            handle.join().unwrap();
        }
        let time_of_mutex = start_of_mutex.elapsed();
        println!("{t:3}: R = {}, r = {}, spin: {time_of_spin:?}, mutex: {time_of_mutex:?}", unsafe { R }, r.lock().unwrap());
    }
}
Running this gives:
1: R = 84999950000, r = 499999500000, spin: 116.8523ms, mutex: 20.406ms
2: R = 84999950000, r = 499999500000, spin: 115.8753ms, mutex: 23.8912ms
3: R = 84999950000, r = 499999500000, spin: 104.4092ms, mutex: 20.5238ms
4: R = 44999950000, r = 499999500000, spin: 115.2894ms, mutex: 22.1232ms
5: R = 14999950000, r = 499999500000, spin: 116.6161ms, mutex: 20.3407ms
6: R = 34999950000, r = 499999500000, spin: 110.5409ms, mutex: 20.7766ms
7: R = 14999950000, r = 499999500000, spin: 106.7984ms, mutex: 20.2581ms
8: R = 64999950000, r = 499999500000, spin: 116.2075ms, mutex: 22.8965ms
9: R = 94999950000, r = 499999500000, spin: 122.5616ms, mutex: 20.8712ms
10: R = 94999950000, r = 499999500000, spin: 132.7401ms, mutex: 25.2957ms
Here R is the spin-lock result and r is the result of the standard multithreaded approach. R's final value differs from r, and it's clearly R that is wrong: the expected total is 0 + 1 + … + 999999 = 999999 × 1000000 / 2 = 499999500000, which r hits every time.
So the first question is now reproduced: Ordering::Relaxed really does cause problems, at least for this spin lock, even if it's still hard to say exactly what the problem is. Given that, let's follow the usual spin-lock tutorials and replace Ordering::Relaxed with Ordering::Acquire and Ordering::Release.
// The core section now uses Acquire and Release
while SPIN_LOCK.swap(true, Ordering::Acquire) {}
R += j;
SPIN_LOCK.store(false, Ordering::Release);
Running again gives:
1: R = 499999500000, r = 499999500000, spin: 247.6544ms, mutex: 20.9144ms
2: R = 499999500000, r = 499999500000, spin: 227.6381ms, mutex: 20.6151ms
3: R = 499999500000, r = 499999500000, spin: 227.2818ms, mutex: 25.3846ms
4: R = 499999500000, r = 499999500000, spin: 245.684ms, mutex: 20.8567ms
5: R = 499999500000, r = 499999500000, spin: 227.0974ms, mutex: 20.0019ms
6: R = 499999500000, r = 499999500000, spin: 228.9819ms, mutex: 20.8273ms
7: R = 499999500000, r = 499999500000, spin: 276.0922ms, mutex: 21.8605ms
8: R = 499999500000, r = 499999500000, spin: 233.6243ms, mutex: 25.1547ms
9: R = 499999500000, r = 499999500000, spin: 232.5839ms, mutex: 20.2137ms
10: R = 499999500000, r = 499999500000, spin: 218.4376ms, mutex: 23.0648ms
This time the results are correct: Ordering::Acquire and Ordering::Release did their job. Now let's try Ordering::SeqCst.
// The core section now uses SeqCst
while SPIN_LOCK.swap(true, Ordering::SeqCst) {}
R += j;
SPIN_LOCK.store(false, Ordering::SeqCst);
The results:
1: R = 499999500000, r = 499999500000, spin: 353.6353ms, mutex: 24.4979ms
2: R = 499999500000, r = 499999500000, spin: 334.631ms, mutex: 23.3446ms
3: R = 499999500000, r = 499999500000, spin: 348.2878ms, mutex: 23.5033ms
4: R = 499999500000, r = 499999500000, spin: 343.114ms, mutex: 21.7937ms
5: R = 499999500000, r = 499999500000, spin: 374.2087ms, mutex: 22.905ms
6: R = 499999500000, r = 499999500000, spin: 350.4543ms, mutex: 24.7428ms
7: R = 499999500000, r = 499999500000, spin: 358.6213ms, mutex: 23.8559ms
8: R = 499999500000, r = 499999500000, spin: 368.8898ms, mutex: 21.666ms
9: R = 499999500000, r = 499999500000, spin: 359.1947ms, mutex: 22.59ms
10: R = 499999500000, r = 499999500000, spin: 377.3886ms, mutex: 21.306ms
The results are correct again. In fact, in this example the result came out correct whenever the two Ordering flags weren't both Relaxed (strictly speaking, the memory model only guarantees a correct lock with Acquire on the locking operation and Release on the unlock; the other combinations just happened to pass here). The performance ranking also shows up clearly: Ordering::Relaxed is fastest, then Ordering::Acquire/Ordering::Release, then Ordering::SeqCst.
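As an aside, Rust Atomics and Locks wraps this same pattern into a reusable type. Here is a minimal sketch in that spirit — simplified and without the guard type the book eventually builds, so the unsafe unlock contract is left to the caller:

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safety: the lock flag guarantees exclusive access to the inner value
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub const fn new(value: T) -> Self {
        Self {
            locked: AtomicBool::new(false),
            value: UnsafeCell::new(value),
        }
    }

    pub fn lock(&self) -> &mut T {
        // Acquire here pairs with the Release in unlock()
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // Safety: we hold the lock, so no other reference exists
        unsafe { &mut *self.value.get() }
    }

    /// Safety: the &mut T returned by lock() must no longer be in use
    pub unsafe fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}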
But wait: this is the simplest possible spin lock, yet it performs far worse than the default Mutex. Reading the source shows that Mutex calls into OS APIs when it blocks, which is rather a black box for a beginner. So my guess was that the cost comes from the while loop: from the OS scheduler's point of view, a tight loop like this can peg the CPU and hog thread resources, so other threads don't get switched in promptly.
Based on that assumption, we can spice up the core loop in one of two ways: make thread switching more prompt, or stop the thread entirely and have it wait for another thread to wake it before contending for the lock again.
The first thing I found was std::hint::spin_loop in the standard library — a spin-loop hint, as the name suggests. The example code in Rust Atomics and Locks uses exactly this API, so let's modify the core code:
while SPIN_LOCK.swap(true, Ordering::Relaxed) {
    // Insert the standard library's spin-loop hint into the loop body
    std::hint::spin_loop();
}
R += j;
SPIN_LOCK.store(false, Ordering::Relaxed);
The results:
1: R = 499999500000, r = 499999500000, spin: 207.5686ms, mutex: 22.0559ms
2: R = 499999500000, r = 499999500000, spin: 209.1491ms, mutex: 21.1985ms
3: R = 499999500000, r = 499999500000, spin: 207.2068ms, mutex: 21.6455ms
4: R = 499999500000, r = 499999500000, spin: 208.4946ms, mutex: 20.6212ms
5: R = 499999500000, r = 499999500000, spin: 181.2423ms, mutex: 23.6995ms
6: R = 499999500000, r = 499999500000, spin: 190.0333ms, mutex: 23.6426ms
7: R = 499999500000, r = 499999500000, spin: 224.0493ms, mutex: 20.0067ms
8: R = 499999500000, r = 499999500000, spin: 248.7846ms, mutex: 21.4922ms
9: R = 499999500000, r = 499999500000, spin: 207.7718ms, mutex: 20.822ms
10: R = 499999500000, r = 499999500000, spin: 225.4686ms, mutex: 20.8056ms
Here Ordering is back to Relaxed, the fastest option. In general, once you put more complex operations between the atomic operations — especially ones that touch thread scheduling — it becomes much harder in practice for reordering to produce an unexpected result, so Relaxed often appears to be enough (though a correct spin lock still calls for Acquire/Release; that this test passes is partly a gift of x86's strong hardware memory model). Still, the locked performance is underwhelming: roughly on par with the empty loop under Acquire/Release, and only slightly better than SeqCst.
If you watch the CPU while it runs, though, utilization is noticeably lower than with the empty loop, so std::hint::spin_loop really does ease the busy-wait to some degree. Looking at its implementation, it compiles down to a dedicated CPU hint instruction (such as pause on x86) rather than an OS scheduling call.
Then I found another family of APIs: std::thread::park, whose usage is documented in the standard library. Roughly: calling thread::park() stops the current thread; some other thread must then get hold of this thread's thread::Thread handle and call std::thread::Thread::unpark to restart it. To keep a thread from sleeping forever, it's safer to use thread::park_timeout, which wakes the thread automatically after the timeout so it can loop again.
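As a quick side note, here is a minimal demo of the park/unpark handshake itself, independent of the spin lock:

use std::thread;
use std::time::Duration;

fn main() {
    let handle = thread::spawn(|| {
        println!("worker: parking");
        // Blocks this thread until another thread unparks it;
        // real code loops, since park() may also wake spuriously
        thread::park();
        println!("worker: woken up");
    });

    thread::sleep(Duration::from_millis(100));
    // Wake the parked thread through its Thread handle
    handle.thread().unpark();
    handle.join().unwrap();
}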
Around this API the program gets more complicated. We need an extra array of thread objects: collect the JoinHandles returned by spawn, then obtain each handle's corresponding thread. Since this array is touched from multiple threads, it too has to be a global static modified through unsafe, and after every data update we have to walk it and wake the other threads.
The full code:
use std::thread;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::sync::Mutex;
use std::time::{Instant, Duration};

// 10 threads, 100000 operations each
const N_THREADS: usize = 10;
const N_TIMES: usize = 100000;

// R is the variable to modify; SPIN_LOCK is the lock flag
static mut R: usize = 0;
static SPIN_LOCK: AtomicBool = AtomicBool::new(false);

// Global storage for the thread objects
static mut THREADS: Vec<thread::Thread> = Vec::new();
fn main() {
    for t in 1..=10 {
        unsafe {
            R = 0;
        }
        let start_of_spin = Instant::now();
        let handles = (0..N_THREADS).map(|i| {
            thread::spawn(move || {
                unsafe {
                    for j in i * N_TIMES..(i + 1) * N_TIMES {
                        while SPIN_LOCK.swap(true, Ordering::Relaxed) {
                            // Park the thread; if not woken within 1 microsecond, loop again
                            thread::park_timeout(Duration::from_micros(1));
                        }
                        R += j;
                        SPIN_LOCK.store(false, Ordering::Relaxed);
                        // Wake all threads after the update
                        for thread in THREADS.iter() {
                            thread.unpark();
                        }
                    }
                }
            })
        }).collect::<Vec<_>>();
        unsafe {
            // Clone each handle's Thread object into the global array
            THREADS = handles.iter().map(|handle| handle.thread().clone()).collect::<Vec<_>>();
            // Wake every thread once initialization is done
            for thread in THREADS.iter() {
                thread.unpark();
            }
        }
        for handle in handles {
            handle.join().unwrap();
        }
        let time_of_spin = start_of_spin.elapsed();
        let r = Arc::new(Mutex::new(0));
        let start_of_mutex = Instant::now();
        // The standard multithreaded way to modify data
        let handles = (0..N_THREADS).map(|i| {
            let r = r.clone();
            thread::spawn(move || {
                for j in i * N_TIMES..(i + 1) * N_TIMES {
                    *r.lock().unwrap() += j;
                }
            })
        }).collect::<Vec<_>>();
        for handle in handles {
            handle.join().unwrap();
        }
        let time_of_mutex = start_of_mutex.elapsed();
        println!("{t:3}: R = {}, r = {}, spin: {time_of_spin:?}, mutex: {time_of_mutex:?}", unsafe { R }, r.lock().unwrap());
    }
}
The final results:
1: R = 499999500000, r = 499999500000, spin: 96.835ms, mutex: 23.4733ms
2: R = 499999500000, r = 499999500000, spin: 106.3914ms, mutex: 19.5431ms
3: R = 499999500000, r = 499999500000, spin: 97.4153ms, mutex: 24.657ms
4: R = 499999500000, r = 499999500000, spin: 99.7176ms, mutex: 20.3693ms
5: R = 499999500000, r = 499999500000, spin: 102.2228ms, mutex: 24.1728ms
6: R = 499999500000, r = 499999500000, spin: 101.3587ms, mutex: 23.8349ms
7: R = 499999500000, r = 499999500000, spin: 104.0861ms, mutex: 22.4304ms
8: R = 499999500000, r = 499999500000, spin: 99.5858ms, mutex: 21.9831ms
9: R = 499999500000, r = 499999500000, spin: 101.6998ms, mutex: 20.7136ms
10: R = 499999500000, r = 499999500000, spin: 104.7081ms, mutex: 19.793ms
As you can see, this is quite a bit faster than the version using the standard library's spin-loop hint, and CPU usage drops considerably too; with threads fully stopped, CPU cost can even match an ordinary blocking wait.
But the optimization isn't over. Since the bottleneck ultimately lies in how efficiently threads get switched, it turns out the standard library has an API dedicated to exactly that: thread::yield_now, which, as the name says, yields the thread. So we can drop the convoluted thread::park machinery and just tweak the core of the very first version:
while SPIN_LOCK.swap(true, Ordering::Relaxed) {
    // Yield the thread once per loop iteration
    thread::yield_now();
}
R += j;
SPIN_LOCK.store(false, Ordering::Relaxed);
The final results:
1: R = 499999500000, r = 499999500000, spin: 19.0725ms, mutex: 21.4468ms
2: R = 499999500000, r = 499999500000, spin: 19.1969ms, mutex: 24.8842ms
3: R = 499999500000, r = 499999500000, spin: 18.9377ms, mutex: 24.6436ms
4: R = 499999500000, r = 499999500000, spin: 17.1335ms, mutex: 20.1235ms
5: R = 499999500000, r = 499999500000, spin: 17.7659ms, mutex: 21.0874ms
6: R = 499999500000, r = 499999500000, spin: 18.8314ms, mutex: 24.8805ms
7: R = 499999500000, r = 499999500000, spin: 19.9078ms, mutex: 22.75ms
8: R = 499999500000, r = 499999500000, spin: 18.755ms, mutex: 21.2979ms
9: R = 499999500000, r = 499999500000, spin: 19.7315ms, mutex: 23.4285ms
10: R = 499999500000, r = 499999500000, spin: 18.1363ms, mutex: 24.1529ms
At long last, we've beaten the standard library's Mutex. yield_now really does work, and the timings barely fluctuate. The downside is just as obvious: the high-frequency loop and constant thread switching instantly max out every CPU core.
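A common middle ground — not something tested in this post, just a pattern many lock implementations use — is to spin briefly with the CPU hint and only then start yielding. A hypothetical helper:

use std::sync::atomic::{AtomicBool, Ordering};

// Spin a bounded number of times with the CPU hint, then fall
// back to yielding to the OS scheduler
fn acquire(lock: &AtomicBool) {
    let mut spins = 0;
    while lock.swap(true, Ordering::Acquire) {
        if spins < 100 {
            std::hint::spin_loop();
            spins += 1;
        } else {
            std::thread::yield_now();
        }
    }
}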
Some further testing shows that a plain thread::sleep can also achieve the thread-switching effect:
while SPIN_LOCK.swap(true, Ordering::Relaxed) {
    // Yield by sleeping for 1 microsecond in the loop body
    thread::sleep(Duration::from_micros(1));
}
R += j;
SPIN_LOCK.store(false, Ordering::Relaxed);
1: R = 499999500000, r = 499999500000, spin: 117.4703ms, mutex: 23.7323ms
2: R = 499999500000, r = 499999500000, spin: 133.6754ms, mutex: 21.4141ms
3: R = 499999500000, r = 499999500000, spin: 136.0562ms, mutex: 43.2387ms
4: R = 499999500000, r = 499999500000, spin: 128.3459ms, mutex: 24.3047ms
5: R = 499999500000, r = 499999500000, spin: 130.6175ms, mutex: 25.2669ms
6: R = 499999500000, r = 499999500000, spin: 113.7597ms, mutex: 32.0094ms
7: R = 499999500000, r = 499999500000, spin: 136.6121ms, mutex: 21.7237ms
8: R = 499999500000, r = 499999500000, spin: 136.7076ms, mutex: 62.4027ms
9: R = 499999500000, r = 499999500000, spin: 140.1101ms, mutex: 32.1292ms
10: R = 499999500000, r = 499999500000, spin: 123.1714ms, mutex: 20.7013ms
At this order of magnitude, sleep performs only slightly worse than thread::park. However, after scaling up to 1,000,000 operations per thread, sleep gives:
1: R = 49999995000000, r = 49999995000000, spin: 152.4242ms, mutex: 206.6386ms
2: R = 49999995000000, r = 49999995000000, spin: 152.5491ms, mutex: 208.9147ms
3: R = 49999995000000, r = 49999995000000, spin: 197.8593ms, mutex: 241.7115ms
4: R = 49999995000000, r = 49999995000000, spin: 149.9495ms, mutex: 252.3702ms
5: R = 49999995000000, r = 49999995000000, spin: 193.2647ms, mutex: 238.9686ms
6: R = 49999995000000, r = 49999995000000, spin: 157.2534ms, mutex: 232.8807ms
7: R = 49999995000000, r = 49999995000000, spin: 152.7905ms, mutex: 239.2438ms
8: R = 49999995000000, r = 49999995000000, spin: 154.7708ms, mutex: 234.9019ms
9: R = 49999995000000, r = 49999995000000, spin: 145.1315ms, mutex: 225.6531ms
10: R = 49999995000000, r = 49999995000000, spin: 186.7429ms, mutex: 258.4023ms
And yield_now at 1,000,000 operations:
1: R = 49999995000000, r = 49999995000000, spin: 161.9026ms, mutex: 233.4809ms
2: R = 49999995000000, r = 49999995000000, spin: 154.3257ms, mutex: 241.573ms
3: R = 49999995000000, r = 49999995000000, spin: 190.7096ms, mutex: 218.3715ms
4: R = 49999995000000, r = 49999995000000, spin: 193.1071ms, mutex: 250.1671ms
5: R = 49999995000000, r = 49999995000000, spin: 188.5109ms, mutex: 224.2478ms
6: R = 49999995000000, r = 49999995000000, spin: 189.241ms, mutex: 252.0153ms
7: R = 49999995000000, r = 49999995000000, spin: 188.9498ms, mutex: 244.9358ms
8: R = 49999995000000, r = 49999995000000, spin: 190.654ms, mutex: 219.5048ms
9: R = 49999995000000, r = 49999995000000, spin: 194.7326ms, mutex: 209.2761ms
10: R = 49999995000000, r = 49999995000000, spin: 188.9025ms, mutex: 251.199ms
So as the operation count scales up by another order of magnitude, sleep actually pulls ahead — not only beating Mutex but even edging out thread::yield_now. thread::park_timeout achieves a similar effect if you drop the step that wakes the other threads, and both of these sleep/timeout variants use noticeably less CPU than thread::yield_now.
Summary:
About the tests: my machine is a 2.6 GHz, 12-thread Windows 10 box; other configurations and operating systems may well give different results. For instance, the initial Relaxed addition failure doesn't reproduce at all on the official playground — there the sum comes out right no matter what — and on the playground yield_now is a good 3-4x faster than Mutex.
This series of experiments covers the basic behavior of atomic operations and spin locks. thread::yield_now is more effective on small-to-medium workloads; thread::sleep keeps a performance edge even on larger ones; and thread::park's thread-suspension primitive makes it much easier to build the standard library's various locks and channels yourself (a small sketch follows), even if outperforming the standard library remains hard. As for hint::spin_loop, I see no advantage to it at all here.
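To give a taste of that, here is a minimal sketch of my own — in the spirit of the book's one-shot channel, not code from it — of a signal built from park/unpark plus Release/Acquire:

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static READY: AtomicBool = AtomicBool::new(false);
static mut DATA: usize = 0;

fn main() {
    let consumer = thread::spawn(|| {
        // Park until the producer signals; loop because park()
        // can return spuriously
        while !READY.load(Ordering::Acquire) {
            thread::park();
        }
        // The Release store / Acquire load pair on READY guarantees
        // the write to DATA is visible here
        println!("got {}", unsafe { DATA });
    });

    unsafe { DATA = 42 };
    READY.store(true, Ordering::Release);
    consumer.thread().unpark();
    consumer.join().unwrap();
}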
Finally, to be clear: this post is nothing more than a set of measurements. As for why these performance differences exist, I'll have to leave that to the experts.
Comments
(Author, in the comments) I added a pure thread::park version. It hardly counts as a spin lock anymore — after each operation a thread just wakes its neighbor to do the next one. With a million thread switches, performance is abysmal, though CPU usage stays fairly low. R is just a stand-in for generic data; it could be a Vec or HashMap instead — with an atomic variable and self-addition it would undoubtedly be fastest. I also tested vec.push and hashmap.insert: at the same scale the Vec version with yield_now still beats Mutex, but the HashMap version is slower than Mutex.
--
HC97: My understanding is that Ordering constrains the order of instructions within a single thread, and the Relaxed ordering only guarantees the relative order of reads and writes to the atomic itself, so in the first example the reads and writes to R can get reordered outside the region guarded by SPIN_LOCK. I suspect the fastest approach here is no lock at all: define R as an atomic variable and call fetch_add with Relaxed ordering. But my tests split two ways: on my laptop (x86), fetch_add wins in Debug mode while mutex wins in release mode; on my phone (ARM), fetch_add wins in both. This is strongly platform-dependent — you'd probably have to read the implementation to know exactly why.
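For reference, a minimal sketch of the lock-free variant HC97 describes (same workload as the post, no lock at all):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

const N_THREADS: usize = 10;
const N_TIMES: usize = 100_000;

static R: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles = (0..N_THREADS).map(|i| {
        thread::spawn(move || {
            for j in i * N_TIMES..(i + 1) * N_TIMES {
                // A single atomic read-modify-write; no lock needed, and
                // Relaxed suffices because only the final total matters
                R.fetch_add(j, Ordering::Relaxed);
            }
        })
    }).collect::<Vec<_>>();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(R.load(Ordering::Relaxed), 499_999_500_000);
}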