
Posted by lithbitren on 2023-01-08 14:15

Tags: spin lock, spinlock, atomic operations, atomic, ordering

When I first learned Rust I only skimmed atomic operations, and at the time I couldn't make sense of what the Ordering enum parameter on atomic operations actually meant.

Later I read quite a few articles, but unfortunately most of them just recite the standard library documentation like scripture, and beginners come away as confused as before.

It always boils down to: Ordering::Relaxed imposes the weakest ordering constraints; Ordering::Acquire and Ordering::Release are generally used as a pair, reading like a locking API, and roughly pin down the before/after ordering; and when in doubt, use Ordering::SeqCst, which guarantees a global order of atomic operations at a relative cost in performance.

Yet the questions pile up. For instance: in the vast majority of cases, any legal Ordering argument produces the expected result, so when exactly does Ordering::Relaxed actually misbehave? Some articles put Ordering::Acquire and Ordering::Release in different threads, but these are atomic operations; threads don't stop and wait for another thread's atomic the way they wait on a lock, so what does placing them in different threads accomplish? And can someone show a concrete example of Ordering::SeqCst being slower?

A couple of days ago I noticed a book about atomics and locks in the daily newsletter. I skimmed through it, read some of the code on GitHub, and realized that to understand atomic operations you really have to start with spin locks.

Rust Atomics and Locks: https://marabos.nl/atomics/

First, let's implement the simplest possible boolean spin lock.

The idea is simple. For now, the number to modify, R, lives in a static global variable and is modified inside an unsafe block. An AtomicBool atomic variable marks the lock, and a while loop blocks each thread: whichever thread wins the race and reads false immediately sets the lock to true (locked), while the other threads spin in a long loop because the lock is held. After the winning thread finishes modifying the data, it sets the lock back to false and all threads race for the lock again.

We spawn 10 threads to process the data in parallel, each performing 100000 additions.

Then we set up a standard-library Arc<Mutex<usize>> to compare timing and correctness, and wrap everything in an outer loop to repeat the experiment several times.

use std::thread;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::sync::Mutex;
use std::time::Instant;
 
// 10 threads, 100000 operations per thread
const N_THREADS: usize = 10;
const N_TIMES: usize = 100000;

// R is the variable to modify; SPIN_LOCK is the lock flag
static mut R: usize = 0;
static SPIN_LOCK: AtomicBool = AtomicBool::new(false);
 
 
fn main() {

    for t in 1..=10 {

        unsafe {
            R = 0;
        }
        
        let start_of_spin = Instant::now();

        let handles = (0..N_THREADS).map(|i| {
            thread::spawn(move || {
                unsafe {
                    for j in i * N_TIMES..(i + 1) * N_TIMES {
                        // Block the thread with a while loop; swap keeps the test-and-set atomic. Here we use the most relaxed ordering, Relaxed
                        while SPIN_LOCK.swap(true, Ordering::Relaxed) { }
                        // Modify the data -- this by itself is not thread-safe
                        R += j;
                        // Set the lock back to false so all threads can race for it again
                        SPIN_LOCK.store(false, Ordering::Relaxed);
                    }
                }
            })
        }).collect::<Vec<_>>();
 
        for handle in handles {
            handle.join().unwrap();
        }

        let time_of_spin = start_of_spin.elapsed();

        let r = Arc::new(Mutex::new(0));
        
        let start_of_mutex = Instant::now();
        
        // The standard multithreaded way to modify data
        let handles = (0..N_THREADS).map(|i| {
            let r = r.clone();
            thread::spawn(move || {
                for j in i * N_TIMES..(i + 1) * N_TIMES {
                    *r.lock().unwrap() += j;
                }
            })
        }).collect::<Vec<_>>();
 
        for handle in handles {
            handle.join().unwrap();
        }

        let time_of_mutex = start_of_mutex.elapsed();
 
        println!("{t:3}: R = {}, r = {}, spin: {time_of_spin:?}, mutex: {time_of_mutex:?}", unsafe { R }, r.lock().unwrap());
    }
}

Running this gives:

  1: R = 84999950000, r = 499999500000, spin: 116.8523ms, mutex: 20.406ms
  2: R = 84999950000, r = 499999500000, spin: 115.8753ms, mutex: 23.8912ms
  3: R = 84999950000, r = 499999500000, spin: 104.4092ms, mutex: 20.5238ms
  4: R = 44999950000, r = 499999500000, spin: 115.2894ms, mutex: 22.1232ms
  5: R = 14999950000, r = 499999500000, spin: 116.6161ms, mutex: 20.3407ms
  6: R = 34999950000, r = 499999500000, spin: 110.5409ms, mutex: 20.7766ms
  7: R = 14999950000, r = 499999500000, spin: 106.7984ms, mutex: 20.2581ms
  8: R = 64999950000, r = 499999500000, spin: 116.2075ms, mutex: 22.8965ms
  9: R = 94999950000, r = 499999500000, spin: 122.5616ms, mutex: 20.8712ms
 10: R = 94999950000, r = 499999500000, spin: 132.7401ms, mutex: 25.2957ms

Here R is the spin-lock sum and r is the result of the standard multithreaded version. R's final value differs from r, so R is clearly wrong.

So the first puzzle is reproduced: Ordering::Relaxed really does have a problem, at least in this spin lock, even if it's still hard to pin down exactly what the problem is. Following the usual spin-lock tutorials, then, let's swap Ordering::Relaxed for Ordering::Acquire and Ordering::Release:

// The core section, switched to Acquire and Release
while SPIN_LOCK.swap(true, Ordering::Acquire) { }
R += j;
SPIN_LOCK.store(false, Ordering::Release);

Running again gives:

  1: R = 499999500000, r = 499999500000, spin: 247.6544ms, mutex: 20.9144ms
  2: R = 499999500000, r = 499999500000, spin: 227.6381ms, mutex: 20.6151ms
  3: R = 499999500000, r = 499999500000, spin: 227.2818ms, mutex: 25.3846ms
  4: R = 499999500000, r = 499999500000, spin: 245.684ms, mutex: 20.8567ms
  5: R = 499999500000, r = 499999500000, spin: 227.0974ms, mutex: 20.0019ms
  6: R = 499999500000, r = 499999500000, spin: 228.9819ms, mutex: 20.8273ms
  7: R = 499999500000, r = 499999500000, spin: 276.0922ms, mutex: 21.8605ms
  8: R = 499999500000, r = 499999500000, spin: 233.6243ms, mutex: 25.1547ms
  9: R = 499999500000, r = 499999500000, spin: 232.5839ms, mutex: 20.2137ms
 10: R = 499999500000, r = 499999500000, spin: 218.4376ms, mutex: 23.0648ms

This time the results are indeed correct; Ordering::Acquire and Ordering::Release did their job. Now let's try Ordering::SeqCst:

// The core section, switched to SeqCst
while SPIN_LOCK.swap(true, Ordering::SeqCst) { }
R += j;
SPIN_LOCK.store(false, Ordering::SeqCst);

Running this gives:

  1: R = 499999500000, r = 499999500000, spin: 353.6353ms, mutex: 24.4979ms
  2: R = 499999500000, r = 499999500000, spin: 334.631ms, mutex: 23.3446ms
  3: R = 499999500000, r = 499999500000, spin: 348.2878ms, mutex: 23.5033ms
  4: R = 499999500000, r = 499999500000, spin: 343.114ms, mutex: 21.7937ms
  5: R = 499999500000, r = 499999500000, spin: 374.2087ms, mutex: 22.905ms
  6: R = 499999500000, r = 499999500000, spin: 350.4543ms, mutex: 24.7428ms
  7: R = 499999500000, r = 499999500000, spin: 358.6213ms, mutex: 23.8559ms
  8: R = 499999500000, r = 499999500000, spin: 368.8898ms, mutex: 21.666ms
  9: R = 499999500000, r = 499999500000, spin: 359.1947ms, mutex: 22.59ms
 10: R = 499999500000, r = 499999500000, spin: 377.3886ms, mutex: 21.306ms

The results are correct again. In fact, in this example, any combination where the two Ordering flags are not both Relaxed produced correct results, at least in my tests.

The performance ranking is also confirmed: Ordering::Relaxed > Ordering::Acquire/Ordering::Release > Ordering::SeqCst.

Still: this is the simplest possible spin lock, yet it loses to the system's default Mutex by this much. Reading the source shows that Mutex calls OS APIs when it blocks, which is far too much of a black box for a beginner. So I guessed the overhead lies in the while loop: from the OS scheduler's point of view, a hot loop like this can peg the CPU and hog execution resources, so other threads aren't switched in promptly.

Based on that assumption, we can add something to the core loop: either make thread switching more prompt, or park the thread outright and have it wait for another thread's wakeup before retrying the lock.

The first find is std::hint::spin_loop in the standard library, a spin-loop hint as the name suggests. The example code in Rust Atomics and Locks uses exactly this API, so let's modify the core code:

while SPIN_LOCK.swap(true, Ordering::Relaxed) {
    // Insert the standard library's spin-loop hint inside the loop body
    std::hint::spin_loop();
}
R += j;
SPIN_LOCK.store(false, Ordering::Relaxed);

Running this gives:

  1: R = 499999500000, r = 499999500000, spin: 207.5686ms, mutex: 22.0559ms
  2: R = 499999500000, r = 499999500000, spin: 209.1491ms, mutex: 21.1985ms
  3: R = 499999500000, r = 499999500000, spin: 207.2068ms, mutex: 21.6455ms
  4: R = 499999500000, r = 499999500000, spin: 208.4946ms, mutex: 20.6212ms
  5: R = 499999500000, r = 499999500000, spin: 181.2423ms, mutex: 23.6995ms
  6: R = 499999500000, r = 499999500000, spin: 190.0333ms, mutex: 23.6426ms
  7: R = 499999500000, r = 499999500000, spin: 224.0493ms, mutex: 20.0067ms
  8: R = 499999500000, r = 499999500000, spin: 248.7846ms, mutex: 21.4922ms
  9: R = 499999500000, r = 499999500000, spin: 207.7718ms, mutex: 20.822ms
 10: R = 499999500000, r = 499999500000, spin: 225.4686ms, mutex: 20.8056ms

Here the Ordering goes back to the fastest option, Relaxed. Generally, once you put complex operations between the atomic operations, especially operations involving thread scheduling, it is hard in practice for compilation to reorder things into an unexpected result, so in most cases Relaxed is good enough. But the performance of this lock is still unsatisfying: roughly on par with the empty loop under Acquire/Release, and only slightly better than SeqCst.

Watching the CPU while it runs, though, utilization is noticeably lower than with the empty loop, so std::hint::spin_loop does help. Its source reveals that it does not actually call any OS scheduling API: it emits an architecture-specific spin-wait hint instruction (PAUSE on x86), which tells the processor it is busy-waiting so it can save power and back off the pipeline.
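As a rough illustration (my own sketch, not code from the book or the standard library; spin_hint is a hypothetical name): on x86_64 the hint essentially boils down to the PAUSE instruction, which is also exposed directly as an intrinsic.

// A minimal sketch, assuming x86_64: roughly what std::hint::spin_loop
// lowers to on that architecture. It only hints the CPU; it never calls
// into the OS scheduler.
#[cfg(target_arch = "x86_64")]
fn spin_hint() {
    unsafe { std::arch::x86_64::_mm_pause() };
}

#[cfg(not(target_arch = "x86_64"))]
fn spin_hint() {
    // Portable fallback: the standard hint itself
    std::hint::spin_loop();
}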

Then I found another family of APIs: std::thread::park, documented in the standard library. Roughly, calling thread::park() stops the current thread; some other thread then needs a handle to it (a thread::Thread object) and calls std::thread::Thread::unpark to restart it. To keep a thread from sleeping forever, it's safer to use thread::park_timeout, which wakes the thread automatically after the timeout so it can loop again. (A minimal sketch of the handshake follows, before the full program.)
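Here is that handshake in miniature (my own toy example, not code from this post): one thread parks with a timeout, and the spawner wakes it through its Thread handle.

use std::thread;
use std::time::Duration;

fn main() {
    let handle = thread::spawn(|| {
        // Park until unparked, or give up after 10ms so we never sleep forever
        thread::park_timeout(Duration::from_millis(10));
        println!("worker woke up (unparked or timed out)");
    });
    // Wake the worker through its Thread handle. If unpark() runs before
    // park_timeout(), the wakeup token is saved and park returns immediately.
    handle.thread().unpark();
    handle.join().unwrap();
}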

Given how this API works, the program gets more complicated. We need an extra array of threads: we take the JoinHandles returned by spawn, get each handle's corresponding Thread, and since the array is touched from multiple threads it also gets the global static + unsafe treatment. Then, after each modification of the variable, we walk the array and wake the other threads.

The complete code:

use std::thread;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::sync::Mutex;
use std::time::{Instant, Duration};
 
// 10 threads, 100000 operations per thread
const N_THREADS: usize = 10;
const N_TIMES: usize = 100000;

// R is the variable to modify; SPIN_LOCK is the lock flag
static mut R: usize = 0;
static SPIN_LOCK: AtomicBool = AtomicBool::new(false);

// Global storage for the Thread objects
static mut THREADS: Vec<thread::Thread> = Vec::new();
 
 
fn main() {

    for t in 1..=10 {

        unsafe {
            R = 0;
        }
        
        let start_of_spin = Instant::now();

        let handles = (0..N_THREADS).map(|i| {
            thread::spawn(move || {
                unsafe {
                    for j in i * N_TIMES..(i + 1) * N_TIMES {
                        while SPIN_LOCK.swap(true, Ordering::Relaxed) {
                            // Park this thread; if nothing unparks it within 1 microsecond, loop again
                            thread::park_timeout(Duration::from_micros(1));
                        }
                        R += j;
                        SPIN_LOCK.store(false, Ordering::Relaxed);
                        // After modifying, wake all the threads
                        for thread in THREADS.iter() {
                            thread.unpark();
                        }
                    }
                }
            })
        }).collect::<Vec<_>>();

        unsafe {
            // Clone each handle's corresponding Thread object into the global array
            THREADS = handles.iter().map(|handle| handle.thread().clone()).collect::<Vec<_>>();    
            // After initialization, wake every thread
            for thread in THREADS.iter() {
                thread.unpark();
            }
        }

        for handle in handles {
            handle.join().unwrap();
        }

        let time_of_spin = start_of_spin.elapsed();

        let r = Arc::new(Mutex::new(0));
        
        let start_of_mutex = Instant::now();
        
        // The standard multithreaded way to modify data
        let handles = (0..N_THREADS).map(|i| {
            let r = r.clone();
            thread::spawn(move || {
                for j in i * N_TIMES..(i + 1) * N_TIMES {
                    *r.lock().unwrap() += j;
                }
            })
        }).collect::<Vec<_>>();
 
        for handle in handles {
            handle.join().unwrap();
        }

        let time_of_mutex = start_of_mutex.elapsed();
 
        println!("{t:3}: R = {}, r = {}, spin: {time_of_spin:?}, mutex: {time_of_mutex:?}", unsafe { R }, r.lock().unwrap());
    }
}

The results:

  1: R = 499999500000, r = 499999500000, spin: 96.835ms, mutex: 23.4733ms
  2: R = 499999500000, r = 499999500000, spin: 106.3914ms, mutex: 19.5431ms
  3: R = 499999500000, r = 499999500000, spin: 97.4153ms, mutex: 24.657ms
  4: R = 499999500000, r = 499999500000, spin: 99.7176ms, mutex: 20.3693ms
  5: R = 499999500000, r = 499999500000, spin: 102.2228ms, mutex: 24.1728ms
  6: R = 499999500000, r = 499999500000, spin: 101.3587ms, mutex: 23.8349ms
  7: R = 499999500000, r = 499999500000, spin: 104.0861ms, mutex: 22.4304ms
  8: R = 499999500000, r = 499999500000, spin: 99.5858ms, mutex: 21.9831ms
  9: R = 499999500000, r = 499999500000, spin: 101.6998ms, mutex: 20.7136ms
 10: R = 499999500000, r = 499999500000, spin: 104.7081ms, mutex: 19.793ms

As you can see, this is quite a bit faster than the standard library's spin-loop hint, and CPU usage drops a lot too; a fully parked thread costs no CPU at all, just like an ordinarily blocked one.

But the optimizing isn't over. Since everything ultimately comes down to the efficiency of thread switching, it turns out the standard library has a dedicated API for exactly that: thread::yield_now, which yields the current thread. So we can drop the convoluted thread::park machinery; a small tweak to the core of the very first version is all it takes.

while SPIN_LOCK.swap(true, Ordering::Relaxed) {
    // Yield the thread inside the loop body
    thread::yield_now();
}
R += j;
SPIN_LOCK.store(false, Ordering::Relaxed);

The results:

  1: R = 499999500000, r = 499999500000, spin: 19.0725ms, mutex: 21.4468ms
  2: R = 499999500000, r = 499999500000, spin: 19.1969ms, mutex: 24.8842ms
  3: R = 499999500000, r = 499999500000, spin: 18.9377ms, mutex: 24.6436ms
  4: R = 499999500000, r = 499999500000, spin: 17.1335ms, mutex: 20.1235ms
  5: R = 499999500000, r = 499999500000, spin: 17.7659ms, mutex: 21.0874ms
  6: R = 499999500000, r = 499999500000, spin: 18.8314ms, mutex: 24.8805ms
  7: R = 499999500000, r = 499999500000, spin: 19.9078ms, mutex: 22.75ms
  8: R = 499999500000, r = 499999500000, spin: 18.755ms, mutex: 21.2979ms
  9: R = 499999500000, r = 499999500000, spin: 19.7315ms, mutex: 23.4285ms
 10: R = 499999500000, r = 499999500000, spin: 18.1363ms, mutex: 24.1529ms

At long last we've beaten the standard library's Mutex. yield_now clearly works, and the timing variance is small. The downside is just as clear, though: the high-frequency looping and thread switching instantly max out every CPU core.

Some further tests show that plain sleeping with thread::sleep also achieves the thread-switching effect:

while SPIN_LOCK.swap(true, Ordering::Relaxed) {
    // Sleep 1 microsecond inside the loop body to yield the thread
    thread::sleep(Duration::from_micros(1));
}
R += j;
SPIN_LOCK.store(false, Ordering::Relaxed);

Running this gives:

  1: R = 499999500000, r = 499999500000, spin: 117.4703ms, mutex: 23.7323ms
  2: R = 499999500000, r = 499999500000, spin: 133.6754ms, mutex: 21.4141ms
  3: R = 499999500000, r = 499999500000, spin: 136.0562ms, mutex: 43.2387ms
  4: R = 499999500000, r = 499999500000, spin: 128.3459ms, mutex: 24.3047ms
  5: R = 499999500000, r = 499999500000, spin: 130.6175ms, mutex: 25.2669ms
  6: R = 499999500000, r = 499999500000, spin: 113.7597ms, mutex: 32.0094ms
  7: R = 499999500000, r = 499999500000, spin: 136.6121ms, mutex: 21.7237ms
  8: R = 499999500000, r = 499999500000, spin: 136.7076ms, mutex: 62.4027ms
  9: R = 499999500000, r = 499999500000, spin: 140.1101ms, mutex: 32.1292ms
 10: R = 499999500000, r = 499999500000, spin: 123.1714ms, mutex: 20.7013ms

At this order of magnitude, performance is only slightly worse than thread::park.

However, scaled up to 1000000 operations per thread, sleep gives:

  1: R = 49999995000000, r = 49999995000000, spin: 152.4242ms, mutex: 206.6386ms
  2: R = 49999995000000, r = 49999995000000, spin: 152.5491ms, mutex: 208.9147ms
  3: R = 49999995000000, r = 49999995000000, spin: 197.8593ms, mutex: 241.7115ms
  4: R = 49999995000000, r = 49999995000000, spin: 149.9495ms, mutex: 252.3702ms
  5: R = 49999995000000, r = 49999995000000, spin: 193.2647ms, mutex: 238.9686ms
  6: R = 49999995000000, r = 49999995000000, spin: 157.2534ms, mutex: 232.8807ms
  7: R = 49999995000000, r = 49999995000000, spin: 152.7905ms, mutex: 239.2438ms
  8: R = 49999995000000, r = 49999995000000, spin: 154.7708ms, mutex: 234.9019ms
  9: R = 49999995000000, r = 49999995000000, spin: 145.1315ms, mutex: 225.6531ms
 10: R = 49999995000000, r = 49999995000000, spin: 186.7429ms, mutex: 258.4023ms

At 1000000 operations, yield_now gives:

  1: R = 49999995000000, r = 49999995000000, spin: 161.9026ms, mutex: 233.4809ms
  2: R = 49999995000000, r = 49999995000000, spin: 154.3257ms, mutex: 241.573ms
  3: R = 49999995000000, r = 49999995000000, spin: 190.7096ms, mutex: 218.3715ms
  4: R = 49999995000000, r = 49999995000000, spin: 193.1071ms, mutex: 250.1671ms
  5: R = 49999995000000, r = 49999995000000, spin: 188.5109ms, mutex: 224.2478ms
  6: R = 49999995000000, r = 49999995000000, spin: 189.241ms, mutex: 252.0153ms
  7: R = 49999995000000, r = 49999995000000, spin: 188.9498ms, mutex: 244.9358ms
  8: R = 49999995000000, r = 49999995000000, spin: 190.654ms, mutex: 219.5048ms
  9: R = 49999995000000, r = 49999995000000, spin: 194.7326ms, mutex: 209.2761ms
 10: R = 49999995000000, r = 49999995000000, spin: 188.9025ms, mutex: 251.199ms

As the operation count grows by another order of magnitude, sleep actually shows an advantage: it beats not only Mutex but even thread::yield_now. thread::park_timeout achieves a similar effect even without the step that wakes the other threads, and both of these sleep/timeout variants use noticeably less CPU than thread::yield_now.
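A natural follow-up (my own sketch, not something benchmarked in this post; the thresholds are arbitrary assumptions) is to combine the strategies above into a tiered backoff: spin briefly with the CPU hint, then yield, then sleep.

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Tiered-backoff lock acquisition combining the three techniques tested above
fn lock(flag: &AtomicBool) {
    let mut attempts: u32 = 0;
    while flag.swap(true, Ordering::Acquire) {
        attempts += 1;
        if attempts < 10 {
            // Stage 1: cheap CPU hint while contention is short-lived
            std::hint::spin_loop();
        } else if attempts < 100 {
            // Stage 2: give up the rest of this time slice
            thread::yield_now();
        } else {
            // Stage 3: actually sleep and let other threads run
            thread::sleep(Duration::from_micros(1));
        }
    }
}

fn unlock(flag: &AtomicBool) {
    flag.store(false, Ordering::Release);
}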

Summary:

About the test setup: my machine is a 2.6GHz, 12-thread Windows 10 box; different hardware and operating systems may well give different results. For example, I couldn't reproduce the initial Relaxed addition error on the official playground at all: there the sum came out correct no matter what, and yield_now was a good 3-4x faster than Mutex.

This series of experiments gives a basic picture of atomics and spin locks. thread::yield_now works best on small-to-medium workloads; thread::sleep keeps a performance edge even on larger ones; and thread::park's ability to suspend threads makes it fairly easy to reimplement the standard library's various locks and channels yourself, though outperforming the standard library is still not easy. As for hint::spin_loop, it seems to offer no advantage here at all.

To close: this post is just a set of experiments. As for why these performance differences exist, I'll have to leave that to the experts.

Comments

OP lithbitren 2023-01-09 18:59

I've added a pure thread::park version. It hardly counts as a spin lock anymore: after each operation a thread simply unparks the next thread in the ring. With a million thread switches the performance is, frankly, dreadful, though CPU usage stays modest.

use std::thread;
use std::sync::Arc;
use std::sync::Mutex;
use std::time::Instant;
 
const N_TIMES: usize = 100000;
const N_THREADS: usize = 10;
    
static mut R: usize = 0;
static mut THREADS: Vec<thread::Thread> = Vec::new();
 
 
fn main() {

    for t in 1..=10 {

        unsafe {
            R = 0;
        }
        
        let start_of_spin = Instant::now();

        let handles = (0..N_THREADS).map(|i| {
            thread::spawn(move || {
                unsafe {
                    for j in i * N_TIMES..(i + 1) * N_TIMES {
                        thread::park();
                        R += j;
                        THREADS[(i + 1) % N_THREADS].unpark();
                    }
                }
            })
        }).collect::<Vec<_>>();

        unsafe {
            THREADS = handles.iter().map(|handle| handle.thread().clone()).collect::<Vec<_>>();
            THREADS[0].unpark();
        }
 
        for handle in handles {
            handle.join().unwrap();
        }

        let time_of_spin = start_of_spin.elapsed();

        let r = Arc::new(Mutex::new(0));
        
        let start_of_mutex = Instant::now();

        let handles = (0..N_THREADS).map(|i| {
            let r = r.clone();
            thread::spawn(move || {
                for j in i * N_TIMES..(i + 1) * N_TIMES {
                    *r.lock().unwrap() += j;
                }
            })
        }).collect::<Vec<_>>();
 
        for handle in handles {
            handle.join().unwrap();
        }

        let time_of_mutex = start_of_mutex.elapsed();
 
        println!("{t:3}: R = {}, r = {}, spin: {time_of_spin:?}, mutex: {time_of_mutex:?}", unsafe { R }, r.lock().unwrap());
    }
}
The results:

  1: R = 499999500000, r = 499999500000, spin: 5.9935465s, mutex: 23.3459ms
  2: R = 499999500000, r = 499999500000, spin: 5.9843572s, mutex: 22.7821ms
  3: R = 499999500000, r = 499999500000, spin: 5.9588765s, mutex: 24.255ms
  4: R = 499999500000, r = 499999500000, spin: 5.874003s, mutex: 22.4214ms
  5: R = 499999500000, r = 499999500000, spin: 5.8963121s, mutex: 23.077ms
  6: R = 499999500000, r = 499999500000, spin: 5.9309485s, mutex: 26.7566ms
  7: R = 499999500000, r = 499999500000, spin: 5.9089393s, mutex: 26.3796ms
  8: R = 499999500000, r = 499999500000, spin: 5.8640329s, mutex: 25.303ms
  9: R = 499999500000, r = 499999500000, spin: 5.8959607s, mutex: 25.8509ms
 10: R = 499999500000, r = 499999500000, spin: 5.9625304s, mutex: 24.8613ms
OP lithbitren 2023-01-09 17:33

R merely stands in for ordinary data; it could just as well be a Vec or HashMap. Making it an atomic variable and incrementing it directly would without doubt be fastest. I also tested vec.push and hashmap.insert: at the same scale the Vec version with yield_now still beats mutex, but the HashMap version is slower than mutex.

--
👇
HC97: (comment quoted in full below)

HC97 2023-01-09 13:26

My understanding is that Ordering constrains the order of instruction execution within a single thread. The Relaxed memory ordering only guarantees the relative order of reads and writes to the atomic itself, so in the first example the reads and writes of R can be reordered outside the region protected by SPIN_LOCK. It feels like the fastest approach here is no lock at all: define R as an atomic variable and call fetch_add with Relaxed ordering. But my tests split two ways: on my laptop (x86), fetch_add is faster in debug mode while mutex is faster in release mode; on my phone (ARM), fetch_add is faster in both. This is strongly platform-dependent; you'd probably have to read the implementation source to know why.
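For reference, here is a minimal sketch of the lock-free variant HC97 describes (my own rendering, not code from the thread):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

const N_THREADS: usize = 10;
const N_TIMES: usize = 100000;

// R itself is atomic, so no spin lock is needed; each fetch_add is a single
// indivisible read-modify-write, and Relaxed is enough for a plain counter
static R: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles = (0..N_THREADS).map(|i| {
        thread::spawn(move || {
            for j in i * N_TIMES..(i + 1) * N_TIMES {
                R.fetch_add(j, Ordering::Relaxed);
            }
        })
    }).collect::<Vec<_>>();
    for handle in handles {
        handle.join().unwrap();
    }
    println!("R = {}", R.load(Ordering::Relaxed));
}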

‘static 2023-01-08 20:27

6
