`perf` listing#

It is pretty painful to guess what is being called, and stepping through via gdb is a bit timestaking. The simplest approach is to use perf.

To ensure the calls are traced, we will need to temporarily disable some of kernel restrictions.

sudo sh -c 'echo 0 >/proc/sys/kernel/kptr_restrict'
sudo sh -c 'echo 0 >/proc/sys/kernel/perf_event_paranoid'
sudo sh -c 'echo 0 >/proc/sys/kernel/yama/ptrace_scope'

Which we can then use with the standard CPython instructions.

`make` output#

For make (or system numpy) the output should be something like:

➜ perf record -F 9999 -g -o perf.data \
    python -c "import numpy as np; np.random.seed(128); C_float = np.random.rand(4,4).astype(np.float32); np.dot(C_float, C_float.T)"
➜ perf report -n -g | grep openblas | sed 's/---//'

93%     0.00%             0  python   libopenblas.so.0.3    [.] gotoblas_init
93%     0.00%             0  python   libopenblas.so.0.3    [.] blas_thread_init
93%     0.00%             0  python   libopenblas.so.0.3    [.] 0x000071c858f34167
93%     0.00%             0  python   libopenblas.so.0.3    [.] blas_memory_alloc
93%     0.00%             0  python   libopenblas.so.0.3    [.] 0x000071c858f320f7
53%     0.00%             0  python   libopenblas.so.0.3                                 [.] ssyrk_LT
35%     0.00%             0  python   libopenblas.so.0.3                                 [.] cblas_ssyrk
35%     0.00%             0  python   libopenblas.so.0.3                                 [.] sgemm_incopy_HASWELL

Where the columns are basically:

the overall percentage of samples,
the time taken within each function not including child calls,
the number of samples collected directly,
the executable,
the library or location of the definition, and,
finally the symbol iteself, with the spaces showing where in the call stack it is supposed to be.

Perhaps more illuminating are the calls to OpenBLAS itself, while also filtering out the noise to leave only the overall percentage and the symbol itself:

➜ perf report --stdio -n -g | grep openblas | awk '{printf "%-10s %s\n", $1, $7}'
93%      gotoblas_init
93%      blas_thread_init
93%      0x000071c858f34167
93%      blas_memory_alloc
93%      0x000071c858f320f7
53%      ssyrk_LT
35%      cblas_ssyrk
35%      sgemm_incopy_HASWELL

This makes it pretty easy to figure out what is called and when:

➜ perf record -F 9999 -g -o perf.data python -c "import numpy as np; np.random.seed(128); C_float = np.random.rand(4,4).astype(np.float32); np.dot(C_float, C_float)"
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.132 MB perf.data (714 samples) ]

Which will then correspond to sgemm_nn:

➜ perf report --stdio -n -g | grep openblas | awk '{printf "%-10s %s\n", $1, $7}'
38%      cblas_sgemm
38%      sgemm_nn
38%      sgemm_itcopy_HASWELL
19%      gotoblas_init
19%      blas_thread_init
19%      0x000072c4f4134167
19%      blas_memory_alloc
19%      gotoblas_quit
19%      blas_shutdown

Remember to wipe the older perf.data and perf.data.old files between new calls to record, i.e. one call to record can be coupled with any number of report calls, but wipe the files between changes to the functions being called.

perf listing#

make output#

`perf` listing#

`make` output#