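Bug review process

The bug this time was fairly subtle. Without disciplined git management, which let me bisect the history, it would probably have been very hard to track down. The reason is that the implementation logic was correct, but every thread ended up using exactly the same seed, so the threads' results "resonated" with each other and the final result came out wrong. The debugging procedure was simply a binary search over the git history until the offending commit turned up.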
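To make the "resonance" concrete, here is a minimal sketch that is not taken from the original project. It assumes each thread drives a `std::mt19937` from its seed; the generator type and the helper names `draw` and `same_stream` are assumptions for illustration only. Generators started from the same seed produce exactly the same stream, so threads sharing a seed contribute copies of one sample set rather than independent samples.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: the first n outputs of a generator started from `seed`.
std::vector<std::uint32_t> draw(std::uint32_t seed, int n) {
    assert(n >= 0);
    std::mt19937 gen(seed);
    std::vector<std::uint32_t> out;
    out.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        out.push_back(static_cast<std::uint32_t>(gen()));
    }
    return out;
}

// True when two seeds lead to identical streams over the first n draws.
bool same_stream(std::uint32_t seed_a, std::uint32_t seed_b, int n) {
    return draw(seed_a, n) == draw(seed_b, n);
}
```

With this sketch, `same_stream(0, 0, 8)` holds, which is the situation of several threads all left at seed 0, while distinct seeds such as 12345 and 13345 give different streams.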

A minimal example of the bug

The following minimal example illustrates the bug.

#include <bits/stdc++.h>
#include <omp.h>
using namespace std;
class Rng {
  static thread_local int seed;
public:
  static void init(int tid){
    seed = tid * 1000 + 12345; // Example initialization
  }
  static int get_seed()  {
    return seed;
  }
};
thread_local int Rng::seed = 0; // Initialize the thread-local variable
int main(){
    int threads = 4;
    omp_set_num_threads(threads);
    // First parallel region
    #pragma omp parallel num_threads(threads)
    {
      #pragma omp critical
      cout << "Thread " << omp_get_thread_num() 
           << " no initialized with seed: "
           << Rng::get_seed() << endl;
      Rng::init(omp_get_thread_num());
      #pragma omp critical
      cout << "Thread " << omp_get_thread_num() 
           << " initialized with seed: "
           << Rng::get_seed() << endl;
    }
    // Second parallel region
    #pragma omp parallel num_threads(threads)
    {
      #pragma omp critical
      cout << "Thread " << omp_get_thread_num() 
           << " has seed: "
           << Rng::get_seed() << endl;
    }
    // Third parallel region: a parallel region that contains a nested parallel region
    #pragma omp parallel num_threads(1)
    {
      #pragma omp parallel num_threads(threads)
      {
        #pragma omp critical
        cout << "Thread " << omp_get_thread_num() 
             << " in nested parallel region has seed: "
             << Rng::get_seed() << endl;
      }
    }
}

The OpenMP implementation used here is the one that ships with g++. Now take a guess: how many of the printed seeds are 0?

Compiler: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
g++ test.cpp -fopenmp -o test

Console output

Output from one run:

Thread 0 no initialized with seed: 0
Thread 0 initialized with seed: 12345
Thread 2 no initialized with seed: 0
Thread 2 initialized with seed: 14345
Thread 3 no initialized with seed: 0
Thread 3 initialized with seed: 15345
Thread 1 no initialized with seed: 0
Thread 1 initialized with seed: 13345
Thread 0 has seed: 12345
Thread 3 has seed: 15345
Thread 2 has seed: 14345
Thread 1 has seed: 13345
Thread 0 in nested parallel region has seed: 12345
Thread 1 in nested parallel region has seed: 0
Thread 2 in nested parallel region has seed: 0
Thread 3 in nested parallel region has seed: 0

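The answer is seven: the first 4 lines, plus 3 from the nested parallel region.

The next sections first introduce the relevant background as groundwork for the final analysis. If you would rather see the analysis first, jump ahead to that part, and come back to the following two sections if it is unclear.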

thread_local

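C++11 introduced the thread_local keyword. The lifetime of a thread_local variable is bound to the lifetime of its owning thread. One point deserves extra attention: a thread_local variable is initialized only once per thread, so from the thread's point of view it behaves much like a static variable. So what about a variable declared static thread_local? What is its lifetime? Let us first look at this passage from cppreference: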

Storage duration

Every object has a property called storage duration, which limits the object lifetime. There are four kinds of storage duration in C:
1. automatic storage duration. The storage is allocated when the block in which the object was declared is entered and deallocated when it is exited by any means (goto, return, reaching the end). One exception is the VLAs; their storage is allocated when the declaration is executed, not on block entry, and deallocated when the declaration goes out of scope, not when the block is exited (since C99). If the block is entered recursively, a new allocation is performed for every recursion level. All function parameters and non-static block-scope objects have this storage duration, as well as compound literals used at block scope (until C23)
2. static storage duration. The storage duration is the entire execution of the program, and the value stored in the object is initialized only once, prior to main function. All objects declared static and all objects with either internal or external linkage that aren't declared _Thread_local (until C23) / thread_local (since C23) have this storage duration.
3. thread storage duration. The storage duration is the entire execution of the thread in which it was created, and the value stored in the object is initialized when the thread is started. Each thread has its own, distinct, object. If the thread that executes the expression that accesses this object is not the thread that executed its initialization, the behavior is implementation-defined. All objects declared _Thread_local (until C23) / thread_local (since C23) have this storage duration. (since C11)
4. allocated storage duration. The storage is allocated and deallocated on request, using dynamic memory allocation functions.

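The description of static storage duration explicitly excludes objects declared thread_local, so we can infer that static thread_local is managed, as far as lifetime goes, exactly like thread_local. If you are still not convinced, the same question has also been asked on Stack Overflow.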
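The per-thread initialization can be checked without OpenMP. The following is a minimal sketch using `std::thread`; the names `Counter`, `bump`, and `bump_on_new_thread` are hypothetical and exist only for this illustration. It shows a `static thread_local` class member keeping its value across calls on one thread, while every newly created thread starts again from the initializer.

```cpp
#include <cassert>
#include <thread>

struct Counter {
    // `static` makes this a class member; `thread_local` gives every
    // thread its own copy with thread storage duration.
    static thread_local int value;
};
thread_local int Counter::value = 0;

// Hypothetical helper: increments the calling thread's copy and returns it.
int bump() {
    return ++Counter::value;
}

// Runs bump() `times` times on a brand-new thread and returns the last result.
int bump_on_new_thread(int times) {
    assert(times > 0);
    int last = 0;
    std::thread worker([&] {
        for (int i = 0; i < times; ++i) {
            last = bump();  // touches only the worker's own copy
        }
    });
    worker.join();
    return last;
}
```

Calling `bump()` twice on the main thread yields 1 and then 2. `bump_on_new_thread(3)` returns 3 regardless of what the caller's copy holds, and returns 3 again on a second call, because each new thread gets a fresh copy initialized to 0.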

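Nested parallelism in libgomp

libgomp is GCC's implementation of OpenMP. For details, see here. What matters for this post is libgomp's thread-reuse mechanism: for the outermost parallel region it reuses threads to cut the cost of creating and destroying them, but for inner parallel regions it creates and destroys threads dynamically. For the specifics, see the source: 6441eb6:team.c:gomp_thread_start.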

A small exercise

Compile the following program with g++ -fopenmp. What does it print? Try compiling it yourself to check.

#include <bits/stdc++.h>
#include <omp.h>
using namespace std;
void foo(){
  thread_local int x = 0;
  x++;
  #pragma omp critical
  cout << "Thread " << omp_get_thread_num() 
       << " has x = " << x << endl;
}
int main(){
  #pragma omp parallel num_threads(1)
  {
    foo();
  }
  #pragma omp parallel num_threads(1)
  {
    foo();
  }
}

Why do we get this output?

TODO

What if I need thread reuse for nested parallelism as well?

Use libiomp, the implementation provided by Intel.

Miscellaneous

On lifetimes in C++

TODO

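On nested parallelism

Nested parallelism is in fact a rather thorny problem, because many third-party libraries may themselves use all sorts of threading libraries. If you call such libraries from inside your own parallel region, the parallelism ends up nested like matryoshka dolls, and it is quite likely that a very large number of threads gets created, which hurts performance. Here is an interesting reference on the subject:

Summary

1. Project management: git did the heavy lifting this time by making it possible to bisect the history. That said, there should be a better way: for example, run correctness tests before committing, and commit only when they pass. Such tests should also be automatable, which is something I need to look into.

2. The lifetime of static variables needs careful thought, and combinations such as static thread_local call for extra caution.

3. The OpenMP implementation that ships with g++ reuses threads for outer parallel regions but not for inner ones. More importantly, because inner regions do not reuse threads, creating and destroying those threads carries a real cost, which matters especially when writing high-performance computing code. In my tests this overhead was on the order of 1 to 10 ms. If you need thread reuse for inner parallel regions, Intel's OpenMP can be used; it should be part of Intel's oneAPI toolkit, though I have not looked into it myself. The compiler option is -qopenmp.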
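One possible way to guard against points 2 and 3 is to seed lazily, so that a thread the runtime creates later can never observe the default value. The following is only a sketch under assumptions, not the fix used in the original project: the class `LazyRng`, the shared counter `next_id`, and the helper `seed_on_new_thread` are hypothetical names, and `std::thread` stands in for a freshly created OpenMP worker so the sketch stays independent of libgomp.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class LazyRng {
    static std::atomic<int> next_id;       // shared by all threads
    static thread_local bool initialized;  // one flag per thread
    static thread_local int seed;          // one seed per thread
public:
    // Seeds the calling thread on first use instead of relying on an
    // earlier init() call that a newly created thread never executed.
    static int get_seed() {
        if (!initialized) {
            int id = next_id.fetch_add(1);  // distinct for every thread
            seed = id * 1000 + 12345;       // same formula as Rng::init
            initialized = true;
        }
        return seed;
    }
};
std::atomic<int> LazyRng::next_id{0};
thread_local bool LazyRng::initialized = false;
thread_local int LazyRng::seed = 0;

// Hypothetical helper: the seed observed by a freshly created thread.
int seed_on_new_thread() {
    int s = 0;
    std::thread worker([&] { s = LazyRng::get_seed(); });
    worker.join();
    assert(s != 0);  // a lazily seeded thread never sees the default 0
    return s;
}
```

The trade-off is that seeds are distinct per thread but are handed out in the order threads first ask for one, so they no longer map to `omp_get_thread_num()`. If that mapping matters, an alternative is to call the initialization at the top of every parallel region, including nested ones.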

[Bug Meditations] 1 - When thread_local meets libgomp's nested parallelism