parallel algorithmが使えるようになったと聞いてubuntu18.04(WSL)にgcc9.2.0とIntelTBBを導入

parallel algorithmが実装された！

GCC 9 Release Series — Changes, New Features, and Fixes - GNU Project - Free Software Foundation (FSF)

より

Runtime Library (libstdc++)
Improved support for C++17, including:
The C++17 implementation is no longer experimental.
Parallel algorithms and <execution> (requires Thread Building Blocks 2018 or newer).
<memory_resource>.
Using the types and functions in <filesystem> does not require linking with -lstdc++fs now.

並列アルゴリズムはこれまでintel コンパイラでしか使えなかったが、gccで使えるようになった！

結論

f:id:yuji-dis:20190826012938p:plain

ソートは、vectorが程度大きければ3倍程度に速くなる。(PCのコアは4つ)

gcc9.2.0のビルド

makeにすごく時間かかるので、時間があるときに行うこと。

$ wget http://ftp.tsukuba.wide.ad.jp/software/gcc/releases/gcc-9.2.0/gcc-9.2.0.tar.gz
$ tar xvzf gcc-9.2.0.tar.gz
$ cd gcc-9.2.0
$ ./contrib/download_prerequisites
$ ./configure --enable-languages=c,c++ --prefix=/usr/local/lib/gcc-9.2.0 --disable-bootstrap --disable-multilib
$ make
$ make install

make installされたものを確認する。

$ /usr/local/lib/gcc-9.2.0/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/usr/local/lib/gcc-9.2.0/bin/gcc
COLLECT_LTO_WRAPPER=/usr/local/lib/gcc-9.2.0/libexec/gcc/x86_64-pc-linux-gnu/9.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --enable-languages=c,c++ --prefix=/usr/local/lib/gcc-9.2.0 --disable-bootstrap --disable-multilib
Thread model: posix
gcc version 9.2.0 (GCC)


$ /usr/local/lib/gcc-9.2.0/bin/g++ -v

Using built-in specs.
COLLECT_GCC=/usr/local/lib/gcc-9.2.0/bin/g++
COLLECT_LTO_WRAPPER=/usr/local/lib/gcc-9.2.0/libexec/gcc/x86_64-pc-linux-gnu/9.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --enable-languages=c,c++ --prefix=/usr/local/lib/gcc-9.2.0 --disable-bootstrap --disable-multilib
Thread model: posix
gcc version 9.2.0 (GCC)

自分の場合は~/bin/を作って後入れのものはそこにシンボリックリンクを張るようにしている。 .bashrcで

export PATH=~/bin:$PATH

とかしてあればおｋ

$ ln -s /usr/local/lib/gcc-9.2.0/bin/g++ ~/bin/g++   
$ ln -s /usr/local/lib/gcc-9.2.0/bin/gcc ~/bin/gcc

リンクが生きているか確認。

~$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/local/lib/gcc-9.2.0/libexec/gcc/x86_64-pc-linux-gnu/9.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --enable-languages=c,c++ --prefix=/usr/local/lib/gcc-9.2.0 --disable-bootstrap --disable-multilib
Thread model: posix
gcc version 9.2.0 (GCC)
~$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/lib/gcc-9.2.0/libexec/gcc/x86_64-pc-linux-gnu/9.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --enable-languages=c,c++ --prefix=/usr/local/lib/gcc-9.2.0 --disable-bootstrap --disable-multilib
Thread model: posix
gcc version 9.2.0 (GCC)

libの入れかえ。

$ ls -l /usr/local/lib/gcc-9.2.0/lib64/libstd*
-rw-r--r-- 1 root root 58776946 Aug 25 22:53 /usr/local/lib/gcc-9.2.0/lib64/libstdc++.a
-rwxr-xr-x 1 root root      979 Aug 25 22:53 /usr/local/lib/gcc-9.2.0/lib64/libstdc++.la
lrwxrwxrwx 1 root root       19 Aug 25 22:53 /usr/local/lib/gcc-9.2.0/lib64/libstdc++.so -> libstdc++.so.6.0.27
lrwxrwxrwx 1 root root       19 Aug 25 22:53 /usr/local/lib/gcc-9.2.0/lib64/libstdc++.so.6 -> libstdc++.so.6.0.27
-rwxr-xr-x 1 root root 22553280 Aug 25 22:53 /usr/local/lib/gcc-9.2.0/lib64/libstdc++.so.6.0.27
-rw-r--r-- 1 root root     2425 Aug 25 22:53 /usr/local/lib/gcc-9.2.0/lib64/libstdc++.so.6.0.27-gdb.py
-rw-r--r-- 1 root root 18290658 Aug 25 22:53 /usr/local/lib/gcc-9.2.0/lib64/libstdc++fs.a
-rwxr-xr-x 1 root root      919 Aug 25 22:53 /usr/local/lib/gcc-9.2.0/lib64/libstdc++fs.la

$ cd /usr/lib32/
## ここlib32なのはWSL上のUbuntuだからだと思うので、実環境に合わせて調整が必要
## 以下必要に応じてsudo
/usr/lib32$ sudo mv libstdc++.so.6 libstdc++.so.6.bak


/usr/lib32$ sudo ln -s libstdc++.so.6.0.27 libstdc++.so.6
/usr/lib32$ sudo cp /usr/local/lib/gcc-9.2.0/lib64/libstdc++.so.6.0.27 ./
/usr/lib32$ sudo ln -s libstdc++.so.6.0.27 libstdc++.so.6
/usr/lib32$ ls -l
total 24064
drwxr-xr-x 1 root root      512 Jun  2  2018 gconv
-rw-r--r-- 1 root root   116256 Apr 17  2018 libgcc_s.so.1
lrwxrwxrwx 1 root root       19 Aug 25 23:24 libstdc++.so.6 -> libstdc++.so.6.0.27
-rw-r--r-- 1 root root  1582060 Apr 17  2018 libstdc++.so.6.0.25
-rwxr-xr-x 1 root root 22553280 Aug 25 23:24 libstdc++.so.6.0.27
lrwxrwxrwx 1 root root       19 Apr 17  2018 libstdc++.so.6.bak -> libstdc++.so.6.0.25

intel TBBの導入

tbbのビルド

$ cd temp
$ git clone https://github.com/intel/tbb
$ cd tbb
$ make
$ cd build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.4.0_release/
$ ls
arena.d                    frontend.d               proxy.d               task_v2.d
arena.o                    frontend.o               proxy.o               task_v2.o
backend.d                  governor.d               queuing_mutex.d       tbb.def
backend.o                  governor.o               queuing_mutex.o       tbb_function_replacement.d
backref.d                  itt_notify.d             queuing_rw_mutex.d    tbb_function_replacement.o
backref.o                  itt_notify.o             queuing_rw_mutex.o    tbb_main.d
cache_aligned_allocator.d  itt_notify_malloc.d      reader_writer_lock.d  tbb_main.o
cache_aligned_allocator.o  itt_notify_malloc.o      reader_writer_lock.o  tbb_misc.d
concurrent_hash_map.d      large_objects.d          recursive_mutex.d     tbb_misc.o
concurrent_hash_map.o      large_objects.o          recursive_mutex.o     tbb_misc_ex.d
concurrent_monitor.d       libtbb.so                rml_tbb.d             tbb_misc_ex.o
concurrent_monitor.o       libtbb.so.2              rml_tbb.o             tbb_statistics.d
concurrent_queue.d         libtbbmalloc.so          scheduler.d           tbb_statistics.o
concurrent_queue.o         libtbbmalloc.so.2        scheduler.o           tbb_thread.d
concurrent_queue_v2.d      libtbbmalloc_proxy.so    semaphore.d           tbb_thread.o
concurrent_queue_v2.o      libtbbmalloc_proxy.so.2  semaphore.o           tbbmalloc.d
concurrent_vector.d        market.d                 spin_mutex.d          tbbmalloc.def
concurrent_vector.o        market.o                 spin_mutex.o          tbbmalloc.o
concurrent_vector_v2.d     mutex.d                  spin_rw_mutex.d       tbbmallocproxy.def
concurrent_vector_v2.o     mutex.o                  spin_rw_mutex.o       tbbvars.csh
condition_variable.d       observer_proxy.d         spin_rw_mutex_v2.d    tbbvars.sh
condition_variable.o       observer_proxy.o         spin_rw_mutex_v2.o    version_string.ver
critical_section.d         pipeline.d               task.d                x86_rtm_rw_mutex.d
critical_section.o         pipeline.o               task.o                x86_rtm_rw_mutex.o
dynamic_link.d             private_server.d         task_group_context.d
dynamic_link.o             private_server.o         task_group_context.o

tbbのインスト―ル

$ cd /usr/lib/x86_64-linux-gnu
$ ls -l |grep tbb
lrwxrwxrwx 1 root root       11 Oct  4  2017 libtbb.so -> libtbb.so.2
-rw-r--r-- 1 root root   235056 Oct  4  2017 libtbb.so.2
lrwxrwxrwx 1 root root       17 Oct  4  2017 libtbbmalloc.so -> libtbbmalloc.so.2
-rw-r--r-- 1 root root   112120 Oct  4  2017 libtbbmalloc.so.2
lrwxrwxrwx 1 root root       23 Oct  4  2017 libtbbmalloc_proxy.so -> libtbbmalloc_proxy.so.2
-rw-r--r-- 1 root root    10368 Oct  4  2017 libtbbmalloc_proxy.so.2

$ sudo mv libtbb.so.2 libtbb.so.2.bak
$ sudo mv libtbbmalloc.so.2 libtbbmalloc.so.2.bak
$ sudo mv libtbbmalloc_proxy.so.2 libtbbmalloc_proxy.so.2.bak

$ sudo cp ~/temp/tbb/build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.4.0_release/libtbb.so.2 ./
$ sudo cp ~/temp/tbb/build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.4.0_release/libtbbmalloc.so.2 ./
$ sudo cp ~/temp/tbb/build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.4.0_release/libtbbmalloc_proxy.so.2 ./

$ ls -l |grep tbb
lrwxrwxrwx 1 root root       11 Oct  4  2017 libtbb.so -> libtbb.so.2
-rwxr-xr-x 1 root root  2765432 Aug 26 00:43 libtbb.so.2
-rw-r--r-- 1 root root   235056 Oct  4  2017 libtbb.so.2.bak
lrwxrwxrwx 1 root root       17 Oct  4  2017 libtbbmalloc.so -> libtbbmalloc.so.2
-rwxr-xr-x 1 root root   856888 Aug 26 00:43 libtbbmalloc.so.2
-rw-r--r-- 1 root root   112120 Oct  4  2017 libtbbmalloc.so.2.bak
lrwxrwxrwx 1 root root       23 Oct  4  2017 libtbbmalloc_proxy.so -> libtbbmalloc_proxy.so.2
-rwxr-xr-x 1 root root    56768 Aug 26 00:43 libtbbmalloc_proxy.so.2
-rw-r--r-- 1 root root    10368 Oct  4  2017 libtbbmalloc_proxy.so.2.bak

$ cd /usr/include/tbb
$ sudo mv tbb tbb_bak
$ sudo cp -r ~/temp/tbb/include/tbb ./

parallel algorithm入りプログラムのコンパイル＆run

とりあえずsortで見てみる。ノーマル、seq, par, par_unseqの5つで確認

#include <algorithm>
#include <deque>
#include <execution>
#include <iostream>
#include <list>
#include <map>
#include <random>
#include <unordered_map>
#include <vector>

#include "../timer/timer_class.hpp"

// 10^7
const uint64_t vector_num = 1000 * 1000 * 10;

std::random_device seed_gen;
auto engine = std::mt19937_64(seed_gen());
auto int_rand = std::uniform_int_distribution<>(0, vector_num);

int main() {
  timer t;
  for (uint64_t size = 100; size <= vector_num; size *= 10) {
    std::cout << "===" << size << "=== " << std::endl;
    std::vector<int> vec;
    std::vector<int> vec_for_seq;
    std::vector<int> vec_for_par;
    std::vector<int> vec_for_par_unseq;

    // 中身が同じ4つのvectorを作る
    // memcpyとかでもおｋ
    for (size_t s = 0; s < size; ++s) {
      int r = int_rand(engine);
      vec.push_back(r);
      vec_for_seq.push_back(r);
      vec_for_par.push_back(r);
      vec_for_par_unseq.push_back(r);
    }

    t.restart();
    std::sort(vec.begin(), vec.end());

    t.print("vector_sort");
    t.restart();

    std::sort(std::execution::seq, vec_for_seq.begin(), vec_for_seq.end());

    t.print("vector_seq_sort");
    t.restart();

    std::sort(std::execution::par, vec_for_par.begin(), vec_for_par.end());

    t.print("vector_par_sort");
    t.restart();

    std::sort(std::execution::par_unseq, vec_for_par_unseq.begin(),
              vec_for_par_unseq.end());

    t.print("vector_par_unseq_sort");
  }
  return 0;
}

コンパイルのコマンド

$ g++ -std=c++17 -ltbb -O3 parallel_sort.cpp

スペック

f:id:yuji-dis:20190826013403p:plain

ソートの計測時間　単位はns

vectorのsize	100	1000	10000	100000	1000000	10000000
vector_sort	3800	37600	472000	5823300	64813800	758304100
vector_seq_sort	3900	37200	463000	6000900	63663000	743859300
vector_par_sort	1398300	262200	252400	1904600	20200300	245639800
vector_par_unseq_sort	12700	30800	222400	1701900	19388900	239521300

f:id:yuji-dis:20190826012933p:plain

f:id:yuji-dis:20190826012938p:plain

サイズが小さいときのparがクソでかなので無視するとして…。ある程度vectorのサイズが大きければ3分の1程度になる。

par_unseqの効果がほぼ無いので、sortでは意味ないかも。別アルゴリズムで確認したい。

という内容を技術書典で出す本に書きたい。