systemtap是一个基于linux的性能诊断工具,能够对linux内核函数和linux应用的运行细节进行诊断的工具,最近遇到一个函数调用延迟比较大的问题,针对此问题,如果从庞大的代码中按照行分析,可能花费时间成本较大。我建议可以使用systemtap来定位进程的任意代码的运行耗时。这样就很轻松的定位性能问题了。下面就来介绍如何在arm64下使用systemtap定位程序的函数执行性能问题
apt install systemtap-sdt-dev libdw-dev
这里注意,默认系统的systemtap版本过旧,我们需要从systemtap官网下载,当前最新的是systemtap-5.3。
git clone git://sourceware.org/git/systemtap.git
如果sourceware速度慢,可以换清华源镜像站
wget https://mirrors.tuna.tsinghua.edu.cn/sourceware/systemtap/releases/systemtap-5.3.tar.gz
默认最新的代码是基于6.15-rc的内核,我们实际上内核使用的5.10。所以有一些代码需要稍微适配一下。
arm64内核从5之后都开启了vfs的namespace,如果不做额外修改,在arm64上,如下两个函数会报命名空间问题
kernel_read filp_open
报错信息如下
ERROR: modpost: module stap_ce5dbcb79b603543094c4f68fb16a1a8_943 uses symbol kernel_read from namespace VFS_internal_I_am_really_a_filesystem_and_am_NOT_a_driver, but does not import it. ERROR: modpost: module stap_ce5dbcb79b603543094c4f68fb16a1a8_943 uses symbol filp_open from namespace VFS_internal_I_am_really_a_filesystem_and_am_NOT_a_driver, but does not import it.
对于此问题,我们需要为systemtap在runtime的代码上声明一下命名空间,如下
# vim runtime/transport/symbols.c MODULE_IMPORT_NS(VFS_internal_I_am_really_a_filesystem_and_am_NOT_a_driver);
对于6的内核默认实现timer_delete_sync函数,但是我们还是在5.10的内核,我们使用的是del_timer_sync函数,所以需要针对就内核修改systemtap的代码,位置在runtime/transport/relay_compat.h。
未修改代码如下
#ifdef STAPCONF_DEL_TIMER_SYNC #define STP_TIMER_DELETE_SYNC(a) del_timer_sync(a) #else #define STP_TIMER_DELETE_SYNC(a) timer_delete_sync(a) #endif
这里宏STAPCONF_DEL_TIMER_SYNC会决定具体的函数实现,我们为了代码修改最小化,如下修改
#ifndef STAPCONF_DEL_TIMER_SYNC #define STP_TIMER_DELETE_SYNC(a) del_timer_sync(a) #else #define STP_TIMER_DELETE_SYNC(a) timer_delete_sync(a) #endif
这里简单的修改了宏定义的作用范围
编译方法之前提过,如下
./configure make all -j8 make install -j8
systemtap代码量不算很大,可以之间在机器里面编译。这样make install就直接安装成功了
如果安装成功,那么我们可以看到如下信息
# stap --version Systemtap translator/driver (version 5.3/0.176, non-git sources) Copyright (C) 2005-2025 Red Hat, Inc. and others This is free software; see the source for copying conditions. tested kernel versions: 3.10 ... 6.15-rc enabled features: BPF LIBSQLITE3 LIBXML2 NLS JSON_C
我们需要使用systemtap,那么内核需要打开调试功能,整理如下
CONFIG_DEBUG_INFO=y CONFIG_KPROBES=y CONFIG_UPROBES=y CONFIG_RELAY=y CONFIG_DEBUG_FS=y CONFIG_MODULES=y CONFIG_TRACEPOINTS=y CONFIG_FUNCTION_TRACER=y
systemtap的原理是通过在系统中安插一个ko,通过此ko获取系统的详细信息,所以我们需要在内核中预置头文件。
手动预置的办法如下
make headers_install INSTALL_HDR_PATH=/tmp/kernel-header/ make firmware_install INSTALL_MOD_PATH=/tmp/kernel-header/ make modules_install INSTALL_MOD_PATH=/tmp/kernel-header/ cp --parents `find -type f -name "Makefile*" -o -name "Kconfig*"` /tmp/kernel-header/ cp Module.symvers /tmp/kernel-header/ cp System.map /tmp/kernel-header/ cp -rf scripts/ /tmp/kernel-header/ # arm bin cp -rf include/ /tmp/kernel-header/ cp -rf --parents arch/arm64/include /tmp/kernel-header cp -rf --parents arch/arm/include /tmp/kernel-header cp .config /tmp/kernel-header/ tar cvzf /tmp/kernel-header.tar.gz /tmp/kernel-header
此时我们将/tmp/kernel-header.tar.gz解压为目录/usr/src/linux-headers-$(uname -r)
然后我们建立头文件链接如下
mkdir /lib/modules/${uname -r}/ ln -sf /usr/src/linux-headers-$(uname -r) build
此时,我们可以在机器中编译ko文件了
如果觉得上述手动预置不方便,那么可以自己从内核打包headers的deb,如下
make bindeb-pkg -j256
注意,上面是在已经构建过内核的情况下,这里只打包。如果没构建过内核,建议从头开始
make deb-pkg -j256
此时我们获得如下文件安装
dpkg -i linux-headers-5.10.198_5.10.198-69_arm64.deb linux-image-5.10.198_5.10.198-69_arm64.deb linux-image-5.10.198-dbg_5.10.198-69_arm64.deb linux-libc-dev_5.10.198-69_arm64.deb
为了能够获取应用程序的符号用于调试,我们需要安装对应应用程序的符号包,如下
# dpkg -i kylin-nm-dbgsym_3.20.1.7_arm64.ddeb
此时我们stap可以获取函数的符号如下
# stap -l 'process("/usr/bin/kylin-nm").function("*")' process("/usr/bin/kylin-nm").function("onShowControlCenter@frontend/tab-pages/lanpage.cpp:723") process("/usr/bin/kylin-nm").function("onSwithGsettingsChanged@frontend/tab-pages/lanpage.cpp:173") process("/usr/bin/kylin-nm").function("onUpdateConnection@frontend/tab-pages/lanpage.cpp:1150") process("/usr/bin/kylin-nm").function("onWiredEnabledChanged@frontend/tab-pages/lanpage.cpp:1233") ......
至此,我们可以开始调试了。
内核头文件完成之后,我们可以编写stap文件来进行调试。对于当前的需求是
所以代码如下:
# cat kylin.stp global entry_times probe process("/usr/bin/kylin-nm").function("LanPage::onWiredEnabledChanged") { entry_time = gettimeofday_us() entry_times[pid()] = entry_time } probe process("/usr/bin/kylin-nm").function("LanPage::onWiredEnabledChanged").return { if (entry_times[pid()] != 0) { exit_time = gettimeofday_us() elapsed = exit_time - entry_times[pid()] printf("[PID %d] [%s]: Took %ld us \n", pid(), "LanPage::onWiredEnabledChanged", elapsed) delete entry_times[pid()] } }
根据上面内容我们可以知道,这里我想定位/usr/bin/kylin-nm的LanPage::onWiredEnabledChanged函数的耗时。
我们如下方式运行
# stap -v ./kylin.stp Pass 1: parsed user script and 467 library scripts using 590948virt/103640res/5892shr/130664data kb, in 530usr/200sys/247real ms. Pass 2: analyzed script: 2 probes, 3 functions, 1 embed, 1 global using 598732virt/112916res/7120shr/138448data kb, in 330usr/10sys/339real ms. Pass 3: translated to C into "/tmp/stapbjEzxz/stap_44d2d295e641497a8b3e84b98ae25516_2499_src.c" using 598732virt/113108res/7312shr/138448data kb, in 10usr/240sys/252real ms. Pass 4: compiled C into "stap_44d2d295e641497a8b3e84b98ae25516_2499.ko" in 43320usr/8190sys/12830real ms. Pass 5: starting run.
可以看到有[PID 89425] [LanPage::onWiredEnabledChanged]: Took 174 us 的日志,这里可以看到按钮的响应时间是174us。
我们再点击关闭网络,如下

再点击打开网络,如下

反复10次,此时日志如下
[PID 89425] [LanPage::onWiredEnabledChanged]: Took 279 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 502 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 632 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 617 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 616 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 656 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 146 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 280 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 640 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 576 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 501 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 642 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 647 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 670 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 289 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 603 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 619 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 453 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 594 us [PID 89425] [LanPage::onWiredEnabledChanged]: Took 276 us
至此,我们可以监控任意的函数的执行时间。
如果需要定位其他的程序,就打上其他程序符号,找到需要定位的函数,修改kylin.stp即可
本文演示了在arm64上使用systemtap定位任意函数的耗时问题。systemtap还可以定位内核和其他问题。这里就不额外解释了。有兴趣可以自己研究,参与开源。
值得注意的是,如果systemtap在你的内核环境上运行不起来,你可能需要根据其分析兼容性问题。本文为了让systemtap-5.3在linux 5.10上运行,修改了两处代码的兼容问题。实际需要修复的内容通常是版本演变带来的问题,不会太困难。
asan提供了定位全局对象的构造顺序相关的方法,本文详细了解一下关于c/c++全局变量的构造顺序带来的问题
为了实施这个bug,我们需要两个cpp文件,其代码如下。
# cat initialization_order_fiasco_1.cpp #include <stdio.h> extern int extern_global; int __attribute__((noinline)) read_extern_global() { return extern_global; } int x = read_extern_global() + 1; int main() { printf("%d\n", x); return 0; } # cat initialization_order_fiasco_2.cpp int foo() { return 2; } int extern_global = foo();
根据上面的代码,我们知道有两个全局变量extern_global和x
此时我们通过修改编译顺序来复现问题
g++ -g initialization_order_fiasco_1.cpp initialization_order_fiasco_2.cpp -o initialization_order_fiasco_1_2 g++ -g initialization_order_fiasco_2.cpp initialization_order_fiasco_1.cpp -o initialization_order_fiasco_2_1
此时运行initialization_order_fiasco_1_2,如下
# ./initialization_order_fiasco_1_2 1
如果运行initialization_order_fiasco_2_1,如下
# ./initialization_order_fiasco_2_1 3
可以发现,因为我们g++传入文件的顺序不一致,而两个cpp中关于全局变量extern_global存在依赖,所以就导致了问题产生,按照我们理想的想法,这里的值应该是3,但是有可能是1
我们将得出正常结论的编译方式加入asan来检测,如下
g++ -fsanitize=address -g initialization_order_fiasco_2.cpp initialization_order_fiasco_1.cpp -o asan_2_1
注意,这里先加入initialization_order_fiasco_2.cpp后加入initialization_order_fiasco_1.cpp,也就是说先声明extern_global,再使用extern_global和声明x。
此时运行asan检测,没有上报问题。
现在我们将编译顺序调换,先声明x和使用extern_global,再声明extern_global
g++ -fsanitize=address -g initialization_order_fiasco_1.cpp initialization_order_fiasco_2.cpp -o asan_1_2
此时运行程序
# LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libasan.so.5.0.0 ASAN_OPTIONS=check_initialization_order=true ./asan_1_2 ================================================================= ==133524==ERROR: AddressSanitizer: initialization-order-fiasco on address 0x0000004121e0 at pc 0x000000400948 bp 0x007fc3aff370 sp 0x007fc3aff390 READ of size 4 at 0x0000004121e0 thread T0 #0 0x400944 in read_extern_global() /root/asan/initialization_order/initialization_order_fiasco_1.cpp:4 #1 0x400a10 in __static_initialization_and_destruction_0 /root/asan/initialization_order/initialization_order_fiasco_1.cpp:6 #2 0x400a90 in _GLOBAL__sub_I__Z18read_extern_globalv /root/asan/initialization_order/initialization_order_fiasco_1.cpp:10 #3 0x400c54 in __libc_csu_init (/root/asan/initialization_order/asan_1_2+0x400c54) #4 0x7fa15efd34 in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x20d34) #5 0x400828 (/root/asan/initialization_order/asan_1_2+0x400828) 0x0000004121e0 is located 0 bytes inside of global variable 'extern_global' defined in 'initialization_order_fiasco_2.cpp:2:5' (0x4121e0) of size 4 registered at: #0 0x7fa17b7a10 (/usr/lib/aarch64-linux-gnu/libasan.so.5.0.0+0x3aa10) #1 0x400bec in _sub_I_00099_1 (/root/asan/initialization_order/asan_1_2+0x400bec) #2 0x400c54 in __libc_csu_init (/root/asan/initialization_order/asan_1_2+0x400c54) #3 0x7fa15efd34 in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x20d34) #4 0x400828 (/root/asan/initialization_order/asan_1_2+0x400828) SUMMARY: AddressSanitizer: initialization-order-fiasco /root/asan/initialization_order/initialization_order_fiasco_1.cpp:4 in read_extern_global() Shadow bytes around the buggy address: 0x0010000823e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0010000823f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x001000082400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x001000082410: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 0x001000082420: 00 00 f9 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 =>0x001000082430: 04 f9 f9 f9 f9 f9 f9 f9 00 00 00 00[f6]f6 f6 f6 0x001000082440: f6 f6 f6 f6 00 00 00 00 00 00 00 00 00 00 00 00 0x001000082450: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x001000082460: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x001000082470: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x001000082480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Shadow byte legend (one shadow byte represents 8 application bytes):
根据上面的信息,得出如下结论
根据上面的信息,我们返回代码补充一个信息
int extern_global = foo();默认情况下其值是2所以我们根据上面已有信息,我们就清晰的定位出来extern_global的初始化顺序存在问题。
补充一下,关于0x0000004121e0,因为是全局变量,所以我们在运行前就可以获取验证一下, 确定asan报错是正常的。
# objdump -d -j .bss ./asan_1_2 00000000004121e0 <extern_global>:
通过上面的内容,我们定位了extern_global的初始化顺序存在问题。其问题出现的原因在于g++的编译顺序。
我们先拆解g++的编译步骤,我们先编译.o文件
g++ -g -c initialization_order_fiasco_1.cpp -o 1.o g++ -g -c initialization_order_fiasco_2.cpp -o 2.o
此时针对1.o和2.o,我们读取其bss端的值如下
# objdump -d -j .bss 1.o 0000000000000000 <x>: 0: 00 00 00 00 # objdump -d -j .bss 2.o 0000000000000000 <extern_global>: 0: 00 00 00 00
可以看到,在编译阶段,全局变量加载地址还是0,它需要在链接阶段由ld填充实际地址。问题不出在编译阶段,那么接下来我们链接两个o文件
g++ 1.o 2.o -g -o 1_2 g++ 2.o 1.o -g -o 2_1
此时我们得到两个文件1_2/2_1。 对于libc函数调用的流程可以查看文章《程序的启动过程浅析》,静态变量会运行__static_initialization_and_destruction_0函数来初始化静态全局变量的值,对于glibc的流程如下
_start __libc_start_main __libc_csu_init _GLOBAL__sub_I_XXX (gcc) __static_initialization_and_destruction_0
这里需要注意的是_GLOBAL__sub_I_XXX是gcc为每个cpp文件生成的用于构造静态全局变量的构造函数,它的运行顺序就是每个cpp的添加顺序。
对于1_2程序,根据上面的推论,我们可以猜测其运行顺序如下
__libc_csu_init _GLOBAL__sub_I__Z18read_extern_globalv __static_initialization_and_destruction_0 _GLOBAL__sub_I__Z3foov __static_initialization_and_destruction_0
那么就是
int x = read_extern_global() + 1;完成了x的赋值int extern_global = foo();完成extern_global的赋值。那么这个BUG就出现了。
而对于2_1程序,其正常的原因是如下,我们照常推理其运行顺序
__libc_csu_init _GLOBAL__sub_I__Z3foov __static_initialization_and_destruction_0 _GLOBAL__sub_I__Z18read_extern_globalv __static_initialization_and_destruction_0
int extern_global = foo();完成extern_global的赋值int x = read_extern_global() + 1;完成了x的赋值。此时代码正常运行。
根据上面的信息,我们重点在于__libc_csu_init调用CRT的顺序问题,我们可以对1_2程序打印断点。如下
# gdb ./1_2 (gdb) b _GLOBAL__sub_I__Z3foov (gdb) b _GLOBAL__sub_I__Z18read_extern_globalv (gdb) b __static_initialization_and_destruction_0
为了更好的判断extern_global的变量,这里挂上awatch,如下
(gdb) awatch extern_global Hardware access (read/write) watchpoint 4: extern_global
此时运行结果如下
(gdb) r Starting program: /root/asan/initialization_order/1_2 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1". Breakpoint 2, _GLOBAL__sub_I__Z18read_extern_globalv () at initialization_order_fiasco_1.cpp:10 10 } (gdb) c Continuing. Breakpoint 3, __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at initialization_order_fiasco_1.cpp:10 10 } (gdb) c Continuing. Hardware access (read/write) watchpoint 4: extern_global Value = 0 read_extern_global () at initialization_order_fiasco_1.cpp:5 5 } (gdb) c Continuing. Breakpoint 1, 0x00000000004006a8 in _GLOBAL__sub_I__Z3foov () at initialization_order_fiasco_2.cpp:2 2 int extern_global = foo(); (gdb) c Continuing. Breakpoint 3, __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at initialization_order_fiasco_2.cpp:2 2 int extern_global = foo(); (gdb) c Continuing. Hardware access (read/write) watchpoint 4: extern_global Old value = 0 New value = 2 0x000000000040068c in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at initialization_order_fiasco_2.cpp:2 2 int extern_global = foo(); (gdb) c Continuing. 1 [Inferior 1 (process 137076) exited normally]
可以看到,出问题的程序1_2,在_GLOBAL__sub_I__Z18read_extern_globalv中extern_global的值默认是0,由运行时默认赋值,然后在_GLOBAL__sub_I__Z3foov中修改值为2,由代码初始化赋值,但是此时x的值已经在GLOBAL__sub_I__Z18read_extern_globalv中取到为1了,所以问题出现了。
本文讨论了全局变量的顺序问题导致的BUG,此问题的原因主要在于gcc的链接过程中cpp的顺序会导致__libc_csu_init运行_GLOBAL__sub_I_XXX的运行顺序问题,从而导致问题的出现。
关于此问题的官方解释,有兴趣的可以翻阅,本文做简单的介绍:
"static initialization order problem":这里描述了多个静态类初始化如果存在依赖问题,则程序50%概率崩溃,解决办法是:"Construct On First Use"
"Why doesn’t the Construct On First Use Idiom use a static object instead of a static pointer":这里描述首次构造时应该使用对象而不是指针,因为指针会内存泄漏,但是如果使用对象,需要注意静态对象的销毁顺序问题。
https://isocpp.org/wiki/faq/ctors#static-init-order-on-first-use
"What is a technique to guarantee both static initialization and static deinitialization? ":静态对象构造和析构的技术要点
"How do I prevent the “static initialization order problem” for my static data members?":通过static Fred& x = X::x();防止初始化问题
简单来说就说,如果非要使用静态全局对象,那么需要注意上述的每个要点,且都有利弊,否则不推荐频繁使用静态全局对象。