perf
listing#
It is pretty painful to guess what is being called, and stepping through via
gdb
is a bit timestaking. The simplest approach is to use perf
.
To ensure the calls are traced, we will need to temporarily disable some of kernel restrictions.
sudo sh -c 'echo 0 >/proc/sys/kernel/kptr_restrict'
sudo sh -c 'echo 0 >/proc/sys/kernel/perf_event_paranoid'
sudo sh -c 'echo 0 >/proc/sys/kernel/yama/ptrace_scope'
Which we can then use with the standard CPython instructions.
make
output#
For make
(or system numpy
) the output should be something like:
➜ perf record -F 9999 -g -o perf.data \
python -c "import numpy as np; np.random.seed(128); C_float = np.random.rand(4,4).astype(np.float32); np.dot(C_float, C_float.T)"
➜ perf report -n -g | grep openblas | sed 's/---//'
2.93% 0.00% 0 python libopenblas.so.0.3 [.] gotoblas_init
2.93% 0.00% 0 python libopenblas.so.0.3 [.] blas_thread_init
2.93% 0.00% 0 python libopenblas.so.0.3 [.] 0x000071c858f34167
2.93% 0.00% 0 python libopenblas.so.0.3 [.] blas_memory_alloc
2.93% 0.00% 0 python libopenblas.so.0.3 [.] 0x000071c858f320f7
0.53% 0.00% 0 python libopenblas.so.0.3 [.] ssyrk_LT
0.35% 0.00% 0 python libopenblas.so.0.3 [.] cblas_ssyrk
0.35% 0.00% 0 python libopenblas.so.0.3 [.] sgemm_incopy_HASWELL
Where the columns are basically:
the overall percentage of samples,
the time taken within each function not including child calls,
the number of samples collected directly,
the executable,
the library or location of the definition, and,
finally the symbol iteself, with the spaces showing where in the call stack it is supposed to be.
Perhaps more illuminating are the calls to OpenBLAS itself, while also filtering out the noise to leave only the overall percentage and the symbol itself:
➜ perf report --stdio -n -g | grep openblas | awk '{printf "%-10s %s\n", $1, $7}'
2.93% gotoblas_init
2.93% blas_thread_init
2.93% 0x000071c858f34167
2.93% blas_memory_alloc
2.93% 0x000071c858f320f7
0.53% ssyrk_LT
0.35% cblas_ssyrk
0.35% sgemm_incopy_HASWELL
This makes it pretty easy to figure out what is called and when:
➜ perf record -F 9999 -g -o perf.data python -c "import numpy as np; np.random.seed(128); C_float = np.random.rand(4,4).astype(np.float32); np.dot(C_float, C_float)"
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.132 MB perf.data (714 samples) ]
Which will then correspond to sgemm_nn
:
➜ perf report --stdio -n -g | grep openblas | awk '{printf "%-10s %s\n", $1, $7}'
0.38% cblas_sgemm
0.38% sgemm_nn
0.38% sgemm_itcopy_HASWELL
0.19% gotoblas_init
0.19% blas_thread_init
0.19% 0x000072c4f4134167
0.19% blas_memory_alloc
0.19% gotoblas_quit
0.19% blas_shutdown