2025-01-22
Knowledge notes

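In "vDSO: Example of Adding __do_sys_kylin to the vDSO" we called a custom syscall through the vDSO. The drawback was that we still linked against vdso.so through ld, which falls short of what libc does with the vDSO, because ordinary programs are never built with -lvdso. Based on my understanding of how libc hooks up the vDSO, this article implements a program that uses the vDSO directly, with no -lvdso at link time.

1. How libc calls the vDSO

According to the documentation, we know the following:

[Figure: image.png]

What matters is the libc init code. You could download the libc source and analyze it directly, but this article walks through the sample shipped with the kernel instead of digging through libc; the idea is the same.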

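2. The kernel test program

The code lives in:

tools/testing/selftests/vDSO

We care about two files:

parse_vdso.c
vdso_test_gettimeofday.c

Running make in that directory produces the binary vdso_test_gettimeofday.

Copy it to the target system and run it.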

3. Modifying the kernel test program

The kernel's vdso_test_gettimeofday.c is not what we are after. We want to call our own function, "__kernel_kylin", so we create a new file:

vdso_test_kylin.c

with the following contents:

    // SPDX-License-Identifier: GPL-2.0-only
    /*
     * vdso_test_kylin.c: Sample code to test parse_vdso.c and vDSO kylin()
     */
    #include <stdint.h>
    #include <elf.h>
    #include <stdio.h>
    #include <sys/auxv.h>

    #include "../kselftest.h"
    #include "parse_vdso.h"

    const char *version = "LINUX_2.6.39";
    const char *name = "__kernel_kylin";

    typedef long (*kylin_t)(char* words);

    int main(int argc, char **argv)
    {
        unsigned long sysinfo_ehdr;
        char* words = "Userspace say:hello kylin!";
        long ret;

        /* The kernel publishes the vDSO base address in the auxiliary vector. */
        sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
        if (!sysinfo_ehdr) {
            printf("AT_SYSINFO_EHDR is not present!\n");
            return KSFT_SKIP;
        }

        /* Parse the vDSO ELF image, then look up our versioned symbol. */
        vdso_init_from_sysinfo_ehdr(sysinfo_ehdr);

        kylin_t kylin = (kylin_t)vdso_sym(version, name);
        if (!kylin) {
            printf("Could not find %s\n", name);
            return KSFT_SKIP;
        }

        /* Call straight into the vDSO function. */
        ret = kylin(words);
        (void)ret;

        return 0;
    }

Then modify the Makefile as follows:

    # git diff Makefile
    diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile
    index 0069f2f83f86..9ffeaded2168 100644
    --- a/tools/testing/selftests/vDSO/Makefile
    +++ b/tools/testing/selftests/vDSO/Makefile
    @@ -4,7 +4,7 @@ include ../lib.mk
     uname_M := $(shell uname -m 2>/dev/null || echo not)
     ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
     
    -TEST_GEN_PROGS := $(OUTPUT)/vdso_test_gettimeofday $(OUTPUT)/vdso_test_getcpu
    +TEST_GEN_PROGS := $(OUTPUT)/vdso_test_gettimeofday $(OUTPUT)/vdso_test_getcpu $(OUTPUT)/vdso_test_kylin
     ifeq ($(ARCH),x86)
     TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
     endif
    @@ -19,6 +19,7 @@ endif
     all: $(TEST_GEN_PROGS)
     $(OUTPUT)/vdso_test_gettimeofday: parse_vdso.c vdso_test_gettimeofday.c
     $(OUTPUT)/vdso_test_getcpu: parse_vdso.c vdso_test_getcpu.c
    +$(OUTPUT)/vdso_test_kylin: parse_vdso.c vdso_test_kylin.c
     $(OUTPUT)/vdso_standalone_test_x86: vdso_standalone_test_x86.c parse_vdso.c
         $(CC) $(CFLAGS) $(CFLAGS_vdso_standalone_test_x86) \
             vdso_standalone_test_x86.c parse_vdso.c \

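Running make now produces vdso_test_kylin.

Run vdso_test_kylin to verify that the call goes through:

    # ./vdso_test_kylin
    root@kylin:~/vdso# dmesg
    [72296.088102] kylin: Get sys_kylin call:[Userspace say:hello kylin!]. ret=0

The code reached the syscall as expected. We obtained the address of the vDSO image with getauxval(AT_SYSINFO_EHDR), looked up the address of the function "__kernel_kylin" with vdso_sym, and then called kylin(words) directly. That is a complete vDSO call, and nothing was compiled with -lvdso, so this reproduces the way the vDSO is normally consumed.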

4. How it works

Understanding the vDSO takes some ELF knowledge, which is not repeated here.

First, we use command-line tools to look at the following sections:

.dynamic .dynstr .dynsym

We also need one more fact:

The name of a symbol is found by reading st_name from its entry in .dynsym and adding it to the start address of .dynstr. The rest of this section verifies exactly that.

4.1 Getting the information with commands

First get the address of the dynamic segment, which is 0x0000000000000860:

    # readelf -l vdso.so | grep DYNAMIC
      DYNAMIC 0x0000000000000860 0x0000000000000860 0x0000000000000860

Then get the address of .dynstr, which is 0x00000000000001f8:

    # readelf -S vdso.so | grep dynstr -A 1
      [ 3] .dynstr STRTAB 00000000000001f8 000001f8
           0000000000000086 0000000000000000 A 0 0 1

Then get the address of .dynsym, which is 0x0000000000000150:

    # readelf -S vdso.so | grep dynsym -A 1
      [ 2] .dynsym DYNSYM 0000000000000150 00000150
           00000000000000a8 0000000000000018 A 3 1 8

Now dump the symbol table:

    # readelf -s vdso.so

    Symbol table '.dynsym' contains 7 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
         1: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.6.39
         2: 0000000000000780   108 FUNC    GLOBAL DEFAULT    7 __kernel_clock_getres@@LINUX_2.6.39
         3: 00000000000007f0     8 NOTYPE  GLOBAL DEFAULT    7 __kernel_rt_sigreturn@@LINUX_2.6.39
         4: 00000000000005c0   424 FUNC    GLOBAL DEFAULT    7 __kernel_gettimeofday@@LINUX_2.6.39
         5: 0000000000000770    12 FUNC    GLOBAL DEFAULT    7 __kernel_kylin@@LINUX_2.6.39
         6: 0000000000000320   664 FUNC    GLOBAL DEFAULT    7 __kernel_clock_gettime@@LINUX_2.6.39

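The target is entry 5, the function __kernel_kylin.

Next, get the entry size of .dynsym:

    # readelf -S vdso.so | grep dynsym -A 1
      [ 2] .dynsym DYNSYM 0000000000000150 00000150
           00000000000000a8 0000000000000018 A 3 1 8

Note that the entry size (the EntSize column) is 0x0000000000000018, while 0xa8 is the size of the whole section (7 entries of 0x18 bytes). The offset of an entry is computed as:

offset = entsize * index

which gives

0x18 * 5 = 0x78

Add this to the start address of .dynsym: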

0x0000000000000150 + 0x78 = 0x00000000000001c8

This is where the Elf64_Sym structure for the symbol __kernel_kylin lives inside .dynsym. Its definition can be found in elf-64-gen.pdf:

[Figure: image.png]

We need the value of st_name, so we use hexdump:

    # hexdump -s $((0x150)) -n $((0xa8)) vdso.so -C
    00000150  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    00000160  00 00 00 00 00 00 00 00  79 00 00 00 11 00 f1 ff  |........y.......|
    00000170  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    00000180  3d 00 00 00 12 00 07 00  80 07 00 00 00 00 00 00  |=...............|
    00000190  6c 00 00 00 00 00 00 00  53 00 00 00 10 00 07 00  |l.......S.......|
    000001a0  f0 07 00 00 00 00 00 00  08 00 00 00 00 00 00 00  |................|
    000001b0  18 00 00 00 12 00 07 00  c0 05 00 00 00 00 00 00  |................|
    000001c0  a8 01 00 00 00 00 00 00  2e 00 00 00 12 00 07 00  |................|
    000001d0  70 07 00 00 00 00 00 00  0c 00 00 00 00 00 00 00  |p...............|
    000001e0  01 00 00 00 12 00 07 00  20 03 00 00 00 00 00 00  |........ .......|
    000001f0  98 02 00 00 00 00 00 00                           |........|
    000001f8

The 4 bytes at 0x1c8 are 2e 00 00 00, i.e. 0x0000002e in little-endian order, so st_name is 0x2e.

Now we can compute where the symbol name is stored:

name of __kernel_kylin = dynstr + st_name

That is:

0x00000000000001f8 + 0x2e = 0x0000000000000226

That gives the address 0x226. Check it against the contents of .dynstr:

    # readelf -x .dynstr vdso.so

    Hex dump of section '.dynstr':
      0x000001f8 005f5f6b 65726e65 6c5f636c 6f636b5f .__kernel_clock_
      0x00000208 67657474 696d6500 5f5f6b65 726e656c gettime.__kernel
      0x00000218 5f676574 74696d65 6f666461 79005f5f _gettimeofday.__
      0x00000228 6b65726e 656c5f6b 796c696e 005f5f6b kernel_kylin.__k
      0x00000238 65726e65 6c5f636c 6f636b5f 67657472 ernel_clock_getr
      0x00000248 6573005f 5f6b6572 6e656c5f 72745f73 es.__kernel_rt_s
      0x00000258 69677265 7475726e 006c696e 75782d76 igreturn.linux-v
      0x00000268 64736f2e 736f2e31 004c494e 55585f32 dso.so.1.LINUX_2
      0x00000278 2e362e33 3900                       .6.39.

The string that starts at 0x226 (the last two bytes of the 0x218 row) is indeed __kernel_kylin.

That completes the calculation with command-line tools.

4.2 Printing from code

To follow the code more easily, add some prints to parse_vdso.c:

    # git diff parse_vdso.c
    diff --git a/tools/testing/selftests/vDSO/parse_vdso.c b/tools/testing/selftests/vDSO/parse_vdso.c
    index 413f75620a35..a327a85879dc 100644
    --- a/tools/testing/selftests/vDSO/parse_vdso.c
    +++ b/tools/testing/selftests/vDSO/parse_vdso.c
    @@ -20,6 +20,7 @@
     #include <string.h>
     #include <limits.h>
     #include <elf.h>
    +#include <stdio.h>
     
     #include "parse_vdso.h"
     
    @@ -98,8 +99,10 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
                 vdso_info.load_offset = base
                     + (uintptr_t)pt[i].p_offset
                     - (uintptr_t)pt[i].p_vaddr;
    +            printf("kylin: vdso load_offset=%p\n", (void *)vdso_info.load_offset);
             } else if (pt[i].p_type == PT_DYNAMIC) {
                 dyn = (ELF(Dyn)*)(base + pt[i].p_offset);
    +            printf("kylin: dynamic=%p\n", dyn);
             }
         }
     
    @@ -120,11 +123,13 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
                 vdso_info.symstrings = (const char *)
                     ((uintptr_t)dyn[i].d_un.d_ptr
                      + vdso_info.load_offset);
    +            printf("kylin: dynstr=%p\n", vdso_info.symstrings);
                 break;
             case DT_SYMTAB:
                 vdso_info.symtab = (ELF(Sym) *)
                     ((uintptr_t)dyn[i].d_un.d_ptr
                      + vdso_info.load_offset);
    +            printf("kylin: dynsym=%p\n", vdso_info.symtab);
                 break;
             case DT_HASH:
                 hash = (ELF(Word) *)
    @@ -217,6 +222,7 @@ void *vdso_sym(const char *version, const char *name)
                 continue;
             if (sym->st_shndx == SHN_UNDEF)
                 continue;
    +        printf("kylin: dynsym[%d]=%p dynsym_name=%s[%p]\n", chain, sym, vdso_info.symstrings + sym->st_name, vdso_info.symstrings + sym->st_name);
             if (strcmp(name, vdso_info.symstrings + sym->st_name))
                 continue;
     

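Running the program now prints:

    # ./vdso_test_kylin
    kylin: vdso load_offset=0x7f94e97000
    kylin: dynamic=0x7f94e97860
    kylin: dynstr=0x7f94e971f8
    kylin: dynsym=0x7f94e97150
    kylin: dynsym[5]=0x7f94e971c8 dynsym_name=__kernel_kylin[0x7f94e97226]

From this we can read off:

  • the vDSO image is loaded at 0x7f94e97000
  • the dynamic section is loaded at 0x7f94e97860
  • .dynstr is loaded at 0x7f94e971f8
  • .dynsym is loaded at 0x7f94e97150
  • entry 5 of .dynsym is at 0x7f94e971c8
  • the name string of __kernel_kylin is at 0x7f94e97226

These match the command-line calculation exactly: every runtime address is the load offset 0x7f94e97000 plus the file offset computed in 4.1 (for example 0x7f94e97000 + 0x1c8 = 0x7f94e971c8).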

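5. Conclusion

This article imitated how libc makes use of the vDSO. Because the vDSO is mapped into every process automatically, any program can find and call it this way without linking against it. I hope this deepens your understanding of the vDSO.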


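Following "vDSO: Kernel Internals", a syscall can be optimized with the vDSO, and in "vDSO: Example of Implementing a System Call" we implemented our own system call, __do_sys_kylin. Here we add __do_sys_kylin to the vDSO.

1. vDSO symbols

The linker script (lds) shows that symbols have to be exported explicitly, so we add a __kernel_kylin symbol for applications: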

    # git diff vdso/vdso.lds.S
    diff --git a/arch/arm64/kernel/vdso/vdso.lds.S b/arch/arm64/kernel/vdso/vdso.lds.S
    index b840ab1b705c..c766afed0ec8 100644
    --- a/arch/arm64/kernel/vdso/vdso.lds.S
    +++ b/arch/arm64/kernel/vdso/vdso.lds.S
    @@ -84,6 +84,7 @@ VERSION
         global:
             __kernel_rt_sigreturn;
             __kernel_gettimeofday;
    +        __kernel_kylin;
             __kernel_clock_gettime;
             __kernel_clock_getres;
         local: *;

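2. Implementing the vDSO call

To keep the experiment simple, I did not set up a value in vvar and read it back. Instead, the vDSO function just wraps a syscall, which is easier to follow: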

    # git diff vdso/vgettimeofday.c
    diff --git a/arch/arm64/kernel/vdso/vgettimeofday.c b/arch/arm64/kernel/vdso/vgettimeofday.c
    index 4236cf34d7d9..0005f42565d9 100644
    --- a/arch/arm64/kernel/vdso/vgettimeofday.c
    +++ b/arch/arm64/kernel/vdso/vgettimeofday.c
    @@ -18,6 +18,22 @@ int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
         return __cvdso_gettimeofday(tv, tz);
     }
     
    +int __kernel_kylin(char* words)
    +{
    +    register char *word asm("x0") = words;
    +    register long ret asm("x0");
    +#define __NR_sys_kylin 449
    +    register long nr asm("x8") = __NR_sys_kylin;
    +
    +    asm volatile(
    +    "       svc #0\n"
    +    : "=r" (ret)
    +    : "r" (word), "r" (nr)
    +    : "memory");
    +
    +    return ret;
    +}
    +

3. Building vdso.so

The vDSO function is now implemented, so all that is left is to build vdso.so. The output file is:

arch/arm64/kernel/vdso/vdso.so

Check that the symbol was added:

    # readelf -s arch/arm64/kernel/vdso/vdso.so

    Symbol table '.dynsym' contains 7 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
         1: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.6.39
         2: 0000000000000780   108 FUNC    GLOBAL DEFAULT    7 __kernel_clock_getres@@LINUX_2.6.39
         3: 00000000000007f0     8 NOTYPE  GLOBAL DEFAULT    7 __kernel_rt_sigreturn@@LINUX_2.6.39
         4: 00000000000005c0   424 FUNC    GLOBAL DEFAULT    7 __kernel_gettimeofday@@LINUX_2.6.39
         5: 0000000000000770    12 FUNC    GLOBAL DEFAULT    7 __kernel_kylin@@LINUX_2.6.39
         6: 0000000000000320   664 FUNC    GLOBAL DEFAULT    7 __kernel_clock_gettime@@LINUX_2.6.39

Now look at the code behind the symbol:

    # objdump --disassemble=__kernel_kylin arch/arm64/kernel/vdso/vdso.so

    arch/arm64/kernel/vdso/vdso.so:     file format elf64-littleaarch64

    Disassembly of section .text:

    0000000000000770 <__kernel_kylin@@LINUX_2.6.39>:
     770:   d2803828    mov x8, #0x1c1    // #449
     774:   d4000001    svc #0x0
     778:   d65f03c0    ret

This is exactly what we wrote: load syscall number 449 into x8, issue svc #0, and return. The three instructions account for the 12-byte size shown in the symbol table.

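4. Testing the vDSO call

Treat arch/arm64/kernel/vdso/vdso.so as an ordinary shared library: copy it to the target machine and write the following test:

    #include <sys/syscall.h>
    #include <stdio.h>

    extern int __kernel_kylin(char* words);

    int main(int argc, char *argv[])
    {
        int ret = 0;
        char* words = "Userspace say:hello kylin!";

        ret = __kernel_kylin(words);
        printf("vdso ret=%d \n", ret);
        return 0;
    }

Build it with the vDSO treated as an ordinary shared library. For -lvdso to resolve, the copied file must be available as libvdso.so in the current directory: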

gcc test_kylin_vdso.c -o test_kylin_vdso -L. -lvdso

Run it:

    # ./test_kylin_vdso
    vdso ret=0

Then check the kernel log:

    # dmesg
    [ 5586.379813] kylin: Get sys_kylin call:[Userspace say:hello kylin!]. ret=0

The call reaches my system call as expected. Now look at it with ltrace:

    # ltrace ./test_kylin_vdso
    __libc_start_main(0x5569b3083c, 1, 0x7fdfd074f8, 0x5569b30880 <unfinished ...>
    __kernel_kylin(0x5569b30920, 0x7fdfd074f8, 0x7fdfd07508, 0x5569b3083c) = 0
    printf("vdso ret=%d \n", 0vdso ret=0
    ) = 12
    exit(0 <unfinished ...>
    __cxa_finalize(0x5569b41008, 0x5569b307f0, 0x10d68, 1) = 0x7fba4bec70
    +++ exited (status 0) +++

__kernel_kylin is called as a normal library function, so the syscall can now be reached through the vDSO.


Now that we know what the vDSO is for, this article deepens that understanding step by step, starting with an experiment: writing a system call.

1. Defining the syscall number

First locate the syscall header:

include/uapi/asm-generic/unistd.h

It currently defines syscall numbers up to 448, so the table size is 449:

#define __NR_syscalls 449

To add our own system call, extend the table by one number:

    # git diff include/uapi/asm-generic/unistd.h
    diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
    index f7b735dabf35..ae1050e62691 100644
    --- a/include/uapi/asm-generic/unistd.h
    +++ b/include/uapi/asm-generic/unistd.h
    @@ -862,8 +862,11 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
     #define __NR_process_mrelease 448
     __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
     
    +#define __NR_kylin 449
    + __SYSCALL(__NR_kylin, sys_kylin)
    +
     #undef __NR_syscalls
    -#define __NR_syscalls 449
    +#define __NR_syscalls 450
     
     /*
      * 32 bit systems traditionally used different

2. Declaring the syscall function

Each syscall number needs a kernel function that implements it, and that function has to be declared. The declarations live in:

include/linux/syscalls.h

Add a declaration for our system call. It takes a single argument, char* words:

    # git diff include/linux/syscalls.h
    diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
    index 1c170be3f746..64ff37b08e9d 100644
    --- a/include/linux/syscalls.h
    +++ b/include/linux/syscalls.h
    @@ -309,6 +309,7 @@ static inline void addr_limit_user_check(void)
      * include the prototypes if CONFIG_ARCH_HAS_SYSCALL_WRAPPER is enabled.
      */
     #ifndef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
    +asmlinkage long sys_kylin(char* words);
     asmlinkage long sys_io_setup(unsigned nr_reqs, aio_context_t __user *ctx);
     asmlinkage long sys_io_destroy(aio_context_t ctx);
     asmlinkage long sys_io_submit(aio_context_t, long,

3. Implementing the syscall function

With the declaration in place, we need the implementation. I put it in the usual file for generic system calls:

kernel/sys.c

Linux syscalls are all defined through macros, so first take a quick look at how SYSCALL_DEFINEx works.

Still in the file:

include/linux/syscalls.h

we find:

    #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)

    #define SYSCALL_DEFINEx(x, sname, ...) \
        SYSCALL_METADATA(sname, x, __VA_ARGS__) \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

    #define SYSCALL_METADATA(sname, nb, ...) \
        static const char *types_##sname[] = { \
            __MAP(nb,__SC_STR_TDECL,__VA_ARGS__) \
        }; \
        static const char *args_##sname[] = { \
            __MAP(nb,__SC_STR_ADECL,__VA_ARGS__) \
        }; \
        SYSCALL_TRACE_ENTER_EVENT(sname); \
        SYSCALL_TRACE_EXIT_EVENT(sname); \
        static struct syscall_metadata __used \
          __syscall_meta_##sname = { \
            .name = "sys"#sname, \
            .syscall_nr = -1, /* Filled in at boot */ \
            .nb_args = nb, \
            .types = nb ? types_##sname : NULL, \
            .args = nb ? args_##sname : NULL, \
            .enter_event = &event_enter_##sname, \
            .exit_event = &event_exit_##sname, \
            .enter_fields = LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
        }; \
        static struct syscall_metadata __used \
          __section("__syscalls_metadata") \
         *__p_syscall_meta_##sname = &__syscall_meta_##sname;

    #ifndef __SYSCALL_DEFINEx
    #define __SYSCALL_DEFINEx(x, name, ...) \
        __diag_push(); \
        __diag_ignore(GCC, 8, "-Wattribute-alias", \
                  "Type aliasing is used to sanitize syscall arguments");\
        asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
            __attribute__((alias(__stringify(__se_sys##name)))); \
        ALLOW_ERROR_INJECTION(sys##name, ERRNO); \
        static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));\
        asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
        asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
        { \
            long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__));\
            __MAP(x,__SC_TEST,__VA_ARGS__); \
            __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
            return ret; \
        } \
        __diag_pop(); \
        static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
    #endif /* __SYSCALL_DEFINEx */

From this we can basically see how a syscall gets defined. Rather than expanding the macros by hand, we can let the preprocessor do the work.

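3.1 Preprocessing the C file

After kernel/sys.c is compiled, Kbuild leaves behind kernel/.sys.o.cmd, which records the exact compiler command line and is handy for debugging: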

cmd_kernel/sys.o := /root/kernel/roc-rk3588s-pc/kernel/scripts/gcc-wrapper.py gcc -Wp,-MMD,kernel/.sys.o.d -nostdinc -isystem /opt/kpgcc_release/bin/../lib/gcc/aarch64-unknown-linux-gnu/9.3.1/include -I./arch/arm64/include -I./arch/arm64/include/generated -I./include -I./arch/arm64/include/uapi -I./arch/arm64/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -mlittle-endian -DCC_USING_PATCHABLE_FUNCTION_ENTRY -DKASAN_SHADOW_SCALE_SHIFT= -fmacro-prefix-map=./= -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Wno-format-security -std=gnu89 -mgeneral-regs-only -DCONFIG_CC_HAS_K_CONSTRAINT=1 -Wno-psabi -mabi=lp64 -fno-asynchronous-unwind-tables -fno-unwind-tables -mbranch-protection=none -Wa,-march=armv8.5-a -DARM64_ASM_ARCH='"armv8.5-a"' -DKASAN_SHADOW_SCALE_SHIFT= -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -fno-allow-store-data-races -Wframe-larger-than=2048 -fstack-protector-strong -Werror -Wno-unused-but-set-variable -Wno-unused-const-variable -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -fpatchable-function-entry=2 -Wdeclaration-after-statement -Wno-pointer-sign -Wno-stringop-truncation -Wno-array-bounds -Wno-stringop-overflow -Wno-restrict -Wno-maybe-uninitialized -fno-strict-overflow -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -Wno-packed-not-aligned -mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=1344 -DKBUILD_MODFILE='"kernel/sys"' -DKBUILD_BASENAME='"sys"' -DKBUILD_MODNAME='"sys"' -D__KBUILD_MODNAME=kmod_sys -c -o kernel/sys.o kernel/sys.c

We can reuse this command to run only the preprocessor, replacing -c -o kernel/sys.o with -E -o kernel/sys.i:

/root/kernel/roc-rk3588s-pc/kernel/scripts/gcc-wrapper.py gcc -Wp,-MMD,kernel/.sys.o.d -nostdinc -isystem /opt/kpgcc_release/bin/../lib/gcc/aarch64-unknown-linux-gnu/9.3.1/include -I./arch/arm64/include -I./arch/arm64/include/generated -I./include -I./arch/arm64/include/uapi -I./arch/arm64/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -mlittle-endian -DCC_USING_PATCHABLE_FUNCTION_ENTRY -DKASAN_SHADOW_SCALE_SHIFT= -fmacro-prefix-map=./= -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Wno-format-security -std=gnu89 -mgeneral-regs-only -DCONFIG_CC_HAS_K_CONSTRAINT=1 -Wno-psabi -mabi=lp64 -fno-asynchronous-unwind-tables -fno-unwind-tables -mbranch-protection=none -Wa,-march=armv8.5-a -DARM64_ASM_ARCH='"armv8.5-a"' -DKASAN_SHADOW_SCALE_SHIFT= -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -fno-allow-store-data-races -Wframe-larger-than=2048 -fstack-protector-strong -Werror -Wno-unused-but-set-variable -Wno-unused-const-variable -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -fpatchable-function-entry=2 -Wdeclaration-after-statement -Wno-pointer-sign -Wno-stringop-truncation -Wno-array-bounds -Wno-stringop-overflow -Wno-restrict -Wno-maybe-uninitialized -fno-strict-overflow -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -Wno-packed-not-aligned -mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=1344 -DKBUILD_MODFILE='"kernel/sys"' -DKBUILD_BASENAME='"sys"' -DKBUILD_MODNAME='"sys"' -D__KBUILD_MODNAME=kmod_sys -E -o kernel/sys.i kernel/sys.c

Running this command produces the preprocessed file kernel/sys.i. Open it and find the expansion of our macro:

    long __arm64_sys_kylin(const struct pt_regs *regs);
    static struct error_injection_entry __attribute__((__used__)) __attribute__((__section__("_error_injection_whitelist"))) _eil_addr___arm64_sys_kylin = { .addr = (unsigned long)__arm64_sys_kylin, .etype = EI_ETYPE_ERRNO, };;
    static long __se_sys_kylin(__typeof(__builtin_choose_expr((__builtin_types_compatible_p(typeof(( char*)0), typeof(0LL)) || __builtin_types_compatible_p(typeof(( char*)0), typeof(0ULL))), 0LL, 0L)) words);
    static inline __attribute__((__gnu_inline__)) __attribute__((__unused__)) __attribute__((patchable_function_entry(0, 0))) long __do_sys_kylin(char* words);
    long __arm64_sys_kylin(const struct pt_regs *regs)
    {
        return __se_sys_kylin(regs->regs[0]);
    }
    static long __se_sys_kylin(__typeof(__builtin_choose_expr((__builtin_types_compatible_p(typeof(( char*)0), typeof(0LL)) || __builtin_types_compatible_p(typeof(( char*)0), typeof(0ULL))), 0LL, 0L)) words)
    {
        long ret = __do_sys_kylin(( char*) words);
        (void)((int)(sizeof(struct { int:(-!!(!(__builtin_types_compatible_p(typeof(( char*)0), typeof(0LL)) || __builtin_types_compatible_p(typeof(( char*)0), typeof(0ULL))) && sizeof(char*) > sizeof(long))); })));
        do { } while (0);
        return ret;
    }
    static inline __attribute__((__gnu_inline__)) __attribute__((__unused__)) __attribute__((patchable_function_entry(0, 0))) long __do_sys_kylin(char* words)
    {
        char buffer[1024];
        int ret;

        ret = copy_from_user(buffer, words, 1024);

        printk("kylin: Get sys_kylin call:[%s]. ret=%d \n", buffer, ret);
        return 0;
    }

Simplified, this boils down to:

    long __do_sys_kylin(char* words)
    {
        char buffer[1024];
        int ret;

        ret = copy_from_user(buffer, words, 1024);

        printk("kylin: Get sys_kylin call:[%s]. ret=%d \n", buffer, ret);
        return 0;
    }

The code that implements the system call is:

    # git diff kernel/sys.c
    diff --git a/kernel/sys.c b/kernel/sys.c
    index 4b0232713a90..e0cc65d5500c 100644
    --- a/kernel/sys.c
    +++ b/kernel/sys.c
    @@ -2794,6 +2794,17 @@ static int do_sysinfo(struct sysinfo *info)
         return 0;
     }
     
    +SYSCALL_DEFINE1(kylin, char*, words)
    +{
    +    char buffer[1024];
    +    int ret;
    +
    +    ret = copy_from_user(buffer, words, 1024);
    +
    +    printk("kylin: Get sys_kylin call:[%s]. ret=%d \n", buffer, ret);
    +    return 0;
    +}
    +
     SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
     {
         struct sysinfo val;

Note that this is demo code: it always copies a fixed 1024 bytes regardless of the actual string length and does not guarantee that buffer is NUL-terminated before printing it with %s. Code meant for real use would copy the string with strncpy_from_user() instead.

4. Running it

With the above we have implemented a system call, __do_sys_kylin. Now test it from user space with the following code:

    #include <sys/syscall.h>
    #include <stdio.h>
    #include <unistd.h>

    #define __NR_kylin 449

    int main(int argc, char *argv[])
    {
        int ret = 0;
        char* words = "Userspace say:hello kylin!";

        ret = syscall(__NR_kylin, words);
        printf("syscall ret=%d \n", ret);
        return 0;
    }

Run it:

    # ./test_kylin_syscall
    syscall ret=0

dmesg shows:

    # dmesg
    [ 4404.069267] kylin: Get sys_kylin call:[Userspace say:hello kylin!]. ret=0

Then use strace to confirm that the syscall was really issued (0x1c1 is 449):

    # strace ./test_kylin_syscall 2>&1 | grep syscall_
    syscall_0x1c1(0x555ad40898, 0x7fe8349d68, 0x555ad407ac, 0, 0x501e2032a017357a, 0) = 0

We now have a basic syscall working on arm64. The vDSO call is built on top of this syscall later.


The experiment in "vDSO: What Is the vDSO" showed what the vDSO is. Based on the kernel implementation, this article gives a brief introduction to how the vDSO works inside the kernel.

1. vDSO initialization

In arch/arm64/kernel/vdso.c, the vDSO is initialized as follows:

    static struct vm_special_mapping aarch64_vdso_maps[] __ro_after_init = {
        [AA64_MAP_VVAR] = {
            .name = "[vvar]",
            .fault = vvar_fault,
            .mremap = vvar_mremap,
        },
        [AA64_MAP_VDSO] = {
            .name = "[vdso]",
            .mremap = vdso_mremap,
        },
    };

    static int __init vdso_init(void)
    {
        vdso_info[VDSO_ABI_AA64].dm = &aarch64_vdso_maps[AA64_MAP_VVAR];
        vdso_info[VDSO_ABI_AA64].cm = &aarch64_vdso_maps[AA64_MAP_VDSO];

        return __vdso_init(VDSO_ABI_AA64);
    }
    arch_initcall(vdso_init);

vdso_init is run through arch_initcall, and it wires up the two special-mapping descriptors in aarch64_vdso_maps. Next, look at __vdso_init:

    static int __vdso_init(enum vdso_abi abi)
    {
        int i;
        struct page **vdso_pagelist;
        unsigned long pfn;

        if (memcmp(vdso_info[abi].vdso_code_start, "\177ELF", 4)) {
            pr_err("vDSO is not a valid ELF object!\n");
            return -EINVAL;
        }

        vdso_info[abi].vdso_pages = (
            vdso_info[abi].vdso_code_end -
            vdso_info[abi].vdso_code_start) >>
            PAGE_SHIFT;

        vdso_pagelist = kcalloc(vdso_info[abi].vdso_pages,
                    sizeof(struct page *),
                    GFP_KERNEL);
        if (vdso_pagelist == NULL)
            return -ENOMEM;

        /* Grab the vDSO code pages. */
        pfn = sym_to_pfn(vdso_info[abi].vdso_code_start);

        for (i = 0; i < vdso_info[abi].vdso_pages; i++)
            vdso_pagelist[i] = pfn_to_page(pfn + i);

        vdso_info[abi].cm->pages = vdso_pagelist;

        return 0;
    }

This computes the number of pages occupied by the vDSO code, allocates an array of struct page pointers with kcalloc (the code pages themselves are already part of the kernel image), converts the start of the vDSO code to a page frame number with sym_to_pfn, and fills the array with the struct page of each consecutive frame via pfn_to_page.

That completes vDSO initialization.

2. Inserting the vDSO into the user address space

First, note this function:

arch_setup_additional_pages

Then look at the following function in fs/binfmt_elf.c:

static int load_elf_binary(struct linux_binprm *bprm)

It contains this code:

    #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
        retval = arch_setup_additional_pages(bprm, !!interpreter);
        if (retval < 0)
            goto out;
    #endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */

This makes things clear: when we execute an ELF file, load_elf_binary parses it, and along the way it calls arch_setup_additional_pages to place the vDSO into the layout of the user address space.

The main operations are the two mappings below. The vvar pages are mapped first; the kernel then advances vdso_base past them before mapping the vDSO text:

    ret = _install_special_mapping(mm, vdso_base, VVAR_NR_PAGES * PAGE_SIZE,
                       VM_READ|VM_MAYREAD|VM_PFNMAP,
                       vdso_info[abi].dm);
    /* ... */
    ret = _install_special_mapping(mm, vdso_base, vdso_text_len,
                       VM_READ|VM_EXEC|gp_flags|
                       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
                       vdso_info[abi].cm);

This corresponds to what we see in the process maps:

    7f93766000-7f93768000 r--p 00000000 00:00 0 [vvar]
    7f93768000-7f93769000 r-xp 00000000 00:00 0 [vdso]

3. How programs use the vDSO

So far we know how the vDSO is initialized and that it is mapped into the address space whenever an ELF is loaded. What remains is how the vDSO actually speeds up a syscall. Start with the following figure:

[Figure: image.png]

Taking gettimeofday as the example, first look at the linker script vdso.lds.S:

    SECTIONS
    {
        PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
    #ifdef CONFIG_TIME_NS
        PROVIDE(_timens_data = _vdso_data + PAGE_SIZE);
    #endif
        . = VDSO_LBASE + SIZEOF_HEADERS;

        .hash           : { *(.hash) }          :text
        .gnu.hash       : { *(.gnu.hash) }
        .dynsym         : { *(.dynsym) }
        .dynstr         : { *(.dynstr) }
        .gnu.version    : { *(.gnu.version) }
        .gnu.version_d  : { *(.gnu.version_d) }
        .gnu.version_r  : { *(.gnu.version_r) }

        /*
         * Discard .note.gnu.property sections which are unused and have
         * different alignment requirement from vDSO note sections.
         */
        /DISCARD/       : {
            *(.note.GNU-stack .note.gnu.property)
        }
        .note           : { *(.note.*) }        :text   :note

        . = ALIGN(16);

        .text           : { *(.text*) }         :text   =0xd503201f
        PROVIDE (__etext = .);
        PROVIDE (_etext = .);
        PROVIDE (etext = .);

        .eh_frame_hdr   : { *(.eh_frame_hdr) }  :text   :eh_frame_hdr
        .eh_frame       : { KEEP (*(.eh_frame)) }       :text

        .dynamic        : { *(.dynamic) }       :text   :dynamic

        .rodata         : { *(.rodata*) }       :text

        _end = .;
        PROVIDE(end = .);

        /DISCARD/       : {
            *(.data .data.* .gnu.linkonce.d.* .sdata*)
            *(.bss .sbss .dynbss .dynsbss)
        }
    }

Now look at the exported symbols:

    VERSION
    {
        LINUX_2.6.39 {
        global:
            __kernel_rt_sigreturn;
            __kernel_gettimeofday;
            __kernel_clock_gettime;
            __kernel_clock_getres;
        local: *;
        };
    }

So when a user program calls gettimeofday, what the vDSO actually runs is __kernel_gettimeofday. Trace its implementation:

arch/arm64/kernel/vdso/vgettimeofday.c

    int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
                  struct timezone *tz)
    {
        return __cvdso_gettimeofday(tv, tz);
    }

`__cvdso_gettimeofday` is implemented in:

lib/vdso/gettimeofday.c

    static __maybe_unused int
    __cvdso_gettimeofday_data(const struct vdso_data *vd,
                  struct __kernel_old_timeval *tv, struct timezone *tz)
    {
        if (likely(tv != NULL)) {
            struct __kernel_timespec ts;

            if (do_hres(&vd[CS_HRES_COARSE], CLOCK_REALTIME, &ts))
                return gettimeofday_fallback(tv, tz);

            tv->tv_sec = ts.tv_sec;
            tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC;
        }

        if (unlikely(tz != NULL)) {
            if (IS_ENABLED(CONFIG_TIME_NS) &&
                vd->clock_mode == VDSO_CLOCKMODE_TIMENS)
                vd = __arch_get_timens_vdso_data();

            tz->tz_minuteswest = vd[CS_HRES_COARSE].tz_minuteswest;
            tz->tz_dsttime = vd[CS_HRES_COARSE].tz_dsttime;
        }

        return 0;
    }

    static __maybe_unused int
    __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
    {
        return __cvdso_gettimeofday_data(__arch_get_vdso_data(), tv, tz);
    }

The function to focus on here is do_hres. The relevant part of it is:

Path: lib/vdso/gettimeofday.c

    ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
    ts->tv_nsec = ns;

ts is simply filled in here. If the vDSO path cannot be used, the code falls back to the real syscall:

arch/arm64/include/asm/vdso/gettimeofday.h

    static __always_inline
    int gettimeofday_fallback(struct __kernel_old_timeval *_tv,
                  struct timezone *_tz)
    {
        register struct timezone *tz asm("x1") = _tz;
        register struct __kernel_old_timeval *tv asm("x0") = _tv;
        register long ret asm ("x0");
        register long nr asm("x8") = __NR_gettimeofday;

        asm volatile(
        "       svc #0\n"
        : "=r" (ret)
        : "r" (tv), "r" (tz), "r" (nr)
        : "memory");

        return ret;
    }

This raises a question: where does the data that gets assigned come from?

3.1 The vDSO data: vvar

We have seen that the vDSO code reads its data directly instead of entering the kernel. A closer look shows where that data comes from:

__cvdso_gettimeofday_data(__arch_get_vdso_data(), tv, tz);

that is,

    static __always_inline const struct vdso_data *__arch_get_vdso_data(void)
    {
        return _vdso_data;
    }

_vdso_data is the symbol that the linker script places just below the vDSO image, i.e. the vvar pages. On the kernel side those pages are backed by:

    /*
     * The vDSO data page.
     */
    static union {
        struct vdso_data    data[CS_BASES];
        u8                  page[PAGE_SIZE];
    } vdso_data_store __page_aligned_data;
    struct vdso_data *vdso_data = vdso_data_store.data;

So the data comes from vvar. But how does it get updated?

3.2 Updating the vvar data

For gettimeofday, the function to look at is timekeeping_update, a core function of the timekeeping code, located in:

kernel/time/timekeeping.c

The line we care about is:

update_vsyscall(tk);

which is implemented as follows:

    void update_vsyscall(struct timekeeper *tk)
    {
        struct vdso_data *vdata = __arch_get_k_vdso_data();
        struct vdso_timestamp *vdso_ts;
        s32 clock_mode;
        u64 nsec;

        /* copy vsyscall data */
        vdso_write_begin(vdata);

        clock_mode = tk->tkr_mono.clock->vdso_clock_mode;
        vdata[CS_HRES_COARSE].clock_mode = clock_mode;
        vdata[CS_RAW].clock_mode = clock_mode;

        /* CLOCK_REALTIME also required for time() */
        vdso_ts = &vdata[CS_HRES_COARSE].basetime[CLOCK_REALTIME];
        vdso_ts->sec = tk->xtime_sec;
        vdso_ts->nsec = tk->tkr_mono.xtime_nsec;

        /* CLOCK_REALTIME_COARSE */
        vdso_ts = &vdata[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE];
        vdso_ts->sec = tk->xtime_sec;
        vdso_ts->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;

        /* CLOCK_MONOTONIC_COARSE */
        vdso_ts = &vdata[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE];
        vdso_ts->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
        nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
        nsec = nsec + tk->wall_to_monotonic.tv_nsec;
        vdso_ts->sec += __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);

        /*
         * Read without the seqlock held by clock_getres().
         * Note: No need to have a second copy.
         */
        WRITE_ONCE(vdata[CS_HRES_COARSE].hrtimer_res, hrtimer_resolution);

        /*
         * If the current clocksource is not VDSO capable, then spare the
         * update of the high reolution parts.
         */
        if (clock_mode != VDSO_CLOCKMODE_NONE)
            update_vdso_data(vdata, tk);

        __arch_update_vsyscall(vdata, tk);

        vdso_write_end(vdata);

        __arch_sync_vdso_data(vdata);
    }

It is clear that vdso_ts points at a member of vdata. The structure is:

    struct vdso_data {
        u32                 seq;

        s32                 clock_mode;
        u64                 cycle_last;
        u64                 mask;
        u32                 mult;
        u32                 shift;

        union {
            struct vdso_timestamp   basetime[VDSO_BASES];
            struct timens_offset    offset[VDSO_BASES];
        };

        s32                 tz_minuteswest;
        s32                 tz_dsttime;
        u32                 hrtimer_res;
        u32                 __unused;

        struct arch_vdso_data   arch_data;
    };

So the data lives in the vvar area. The kernel defines a data structure there and keeps it up to date, and the vDSO code reads it directly from user space, which avoids the system call.

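4. Summary

We have now walked through how the kernel implements the vDSO. In effect, the kernel ships a piece of code that is mapped into every program as a shared object and runs in user space, which avoids the performance cost of a syscall.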


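Linux has an interesting shared library called linux-vdso.so.1. There is no such file anywhere in the rootfs, yet every dynamically linked ELF program lists it as a dependency. When a colleague wanted to understand how ELF programs run, I brought up the vDSO in passing. ELF itself needs no introduction, so this article aims to explain what the vDSO is and then look at it in more depth.

1. What is the vDSO

vDSO stands for virtual dynamic shared object, i.e. a virtual shared library.

I first read about the vDSO in the following document, which explains it carefully and is worth a look:

https://www.kernel.org/doc/Documentation/ABI/stable/vdso

For a more detailed article, see:

https://lwn.net/Articles/615809/

According to these, the kernel automatically maps the vDSO into the address space of every application to provide a highly optimized path for certain syscalls, which speeds those calls up.

For the arm implementation, see the following slide:

[Figure: image.png]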

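2. The vDSO on a running system

There are two places where the vDSO can be seen on a system. Take systemd as the example.

2.1 Viewing with ldd

    # ldd /usr/bin/systemd
        linux-vdso.so.1 (0x0000007f8888a000)

Even for a binary that is not otherwise running, ldd reports a load address for linux-vdso.so.1. ldd works by launching the dynamic loader on the binary, so the address is the one used in that short-lived process.

2.2 Viewing with maps

    # cat /proc/1/maps | grep "vdso\|vvar"
    7f93766000-7f93768000 r--p 00000000 00:00 0 [vvar]
    7f93768000-7f93769000 r-xp 00000000 00:00 0 [vdso]

In the running systemd process the vDSO is actually mapped at 0x7f93768000. This address lies between the text and data mappings of ld-2.31.so and above the other shared libraries:

    7f93736000-7f93737000 rw-p 0000b000 b3:04 143061 /usr/lib/aarch64-linux-gnu/libdrm-cursor.so.1.0.0
    7f93737000-7f93738000 rw-p 00000000 00:00 0
    7f93738000-7f93759000 r-xp 00000000 b3:04 141832 /usr/lib/aarch64-linux-gnu/ld-2.31.so
    7f93759000-7f93765000 rw-p 00000000 00:00 0
    7f93766000-7f93768000 r--p 00000000 00:00 0 [vvar]
    7f93768000-7f93769000 r-xp 00000000 00:00 0 [vdso]
    7f93769000-7f9376a000 r--p 00021000 b3:04 141832 /usr/lib/aarch64-linux-gnu/ld-2.31.so
    7f9376a000-7f9376c000 rw-p 00022000 b3:04 141832 /usr/lib/aarch64-linux-gnu/ld-2.31.so

This gives us two observations:

  • to the process, the vDSO looks like an ordinary shared library
  • the vDSO is mapped automatically, right next to ld, and has no backing file (device 00:00, inode 0)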

3. Where the vDSO lives

Given these observations, the file has to come from the kernel. Its location in the kernel build tree is:

arch/arm64/kernel/vdso/vdso.so

Because it is a shared object, it is a standard ELF file and can be inspected as usual:

    # readelf -l vdso.so

    Elf file type is DYN (Shared object file)
    Entry point 0x320
    There are 4 program headers, starting at offset 64

    Program Headers:
      Type           Offset             VirtAddr           PhysAddr
                     FileSiz            MemSiz              Flags  Align
      LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                     0x0000000000000990 0x0000000000000990  R E    0x10
      DYNAMIC        0x0000000000000860 0x0000000000000860 0x0000000000000860
                     0x0000000000000110 0x0000000000000110  R      0x8
      NOTE           0x00000000000002c8 0x00000000000002c8 0x00000000000002c8
                     0x0000000000000054 0x0000000000000054  R      0x4
      GNU_EH_FRAME   0x0000000000000000 0x0000000000000000 0x0000000000000000
                     0x0000000000000000 0x0000000000000000         0x8

     Section to Segment mapping:
      Segment Sections...
       00     .hash .dynsym .dynstr .gnu.version .gnu.version_d .note .text .eh_frame .dynamic .got .got.plt
       01     .dynamic
       02     .note
       03

Now look at its symbols:

# readelf -s vdso.so

Symbol table '.dynsym' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.6.39
     2: 0000000000000780   108 FUNC    GLOBAL DEFAULT    7 __kernel_clock_getres@@LINUX_2.6.39
     3: 00000000000007f0     8 NOTYPE  GLOBAL DEFAULT    7 __kernel_rt_sigreturn@@LINUX_2.6.39
     4: 00000000000005c0   424 FUNC    GLOBAL DEFAULT    7 __kernel_gettimeofday@@LINUX_2.6.39
     5: 0000000000000320   664 FUNC    GLOBAL DEFAULT    7 __kernel_clock_gettime@@LINUX_2.6.39

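The listing shows that this .so exports four usable symbols: three of type FUNC, plus __kernel_rt_sigreturn, which is listed as NOTYPE.

In other words, when a program uses the functionality behind these four symbols, the vDSO implementation is preferred by default over a direct system call.

4. Experiment

To verify how the vDSO behaves, we take gettimeofday as the example and write a test program. The code is as follows: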

#include <sys/syscall.h>
#include <sys/time.h>
#include <sys/auxv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    struct timeval tv;
    unsigned int i;
    unsigned int loop;
    /* Base address of the vDSO image, passed by the kernel in the auxiliary vector. */
    unsigned long sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);

    if (argc == 1 || strcmp(argv[1], "--help") == 0) {
        printf("Usage:\n");
        printf("\t %s %s %s %s\n", argv[0], "vdso|syscall", "count", "loop");
        return 0;
    }

    /* argv[2] is the iteration count; default to 1000 when it is omitted.
     * Using >= 3 keeps the count when the optional "loop" argument is given. */
    if (argc >= 3)
        loop = atoi(argv[2]);
    else
        loop = 1000;

    printf("pid=%d sysinfo_ehdr(vdso_addr)=%#lx \n", getpid(), sysinfo_ehdr);

    if (strcmp(argv[1], "vdso") == 0) {
        /* Call through libc and print the address the call resolves to. */
        int (*ptr)(struct timeval *, void *) = gettimeofday;

        printf("gettimeofday addr=%p \n", (void *)ptr);
        for (i = 0; i < loop; i++)
            gettimeofday(&tv, NULL);
    } else if (strcmp(argv[1], "syscall") == 0) {
        /* Bypass libc's gettimeofday and enter the kernel on every call. */
        for (i = 0; i < loop; i++)
            syscall(__NR_gettimeofday, &tv, NULL);
    }

    /* Optional 4th argument "loop": stay alive so /proc/<pid>/maps can be inspected. */
    if (argc == 4 && strcmp(argv[3], "loop") == 0) {
        while (1)
            sleep(60);
    }

    return 0;
}

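Run without arguments, the program prints its usage:

# ./test_vdso
Usage:
         ./test_vdso vdso|syscall count loop

Two benchmarks are available:

  • vdso
  • syscall

count is the number of iterations. Passing loop as the last argument makes the program sleep forever after the test, so that its maps can be inspected.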

4.1 vdso test

We run the vdso test with a count of 1:

# ./test_vdso vdso 1 loop
pid=53329 sysinfo_ehdr(vdso_addr)=0x7f96689000
gettimeofday addr=0x7f966895c0

This gives two addresses: vdso_addr=0x7f96689000 and the function symbol address gettimeofday=0x7f966895c0.

Now check maps:

# cat /proc/$(pidof test_vdso)/maps | grep "\[vdso\]"
7f96689000-7f9668a000 r-xp 00000000 00:00 0

The start address 0x7f96689000 matches AT_SYSINFO_EHDR.

Next, look up the symbol address of gettimeofday in the readelf output:

00000000000005c0 424 FUNC GLOBAL DEFAULT 7 __kernel_gettimeofday@@LINUX_2.6.39

The runtime address is the load base plus the symbol value:

0x7f966895c0 = 0x7f96689000 + 00000000000005c0
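The arithmetic above can be written as a small helper. This is only a sketch: the function names are hypothetical, and the plain sum is valid here because the LOAD segment of this vdso.so has VirtAddr 0 in the readelf -l output.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: runtime address of a vDSO symbol, given the load
 * base (AT_SYSINFO_EHDR) and the symbol value from readelf -s.
 * Assumes the image's first LOAD segment has VirtAddr 0. */
static uint64_t vdso_symbol_address(uint64_t base, uint64_t st_value)
{
    assert(base != 0);
    return base + st_value;
}

/* Print the address computed from the values quoted in this article. */
static void print_gettimeofday_address(void)
{
    uint64_t addr = vdso_symbol_address(0x7f96689000ULL, 0x5c0ULL);

    printf("%#llx\n", (unsigned long long)addr);
}
```

The same helper applies to every run shown in this article, since only the base changes between runs while the symbol value 0x5c0 stays fixed.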

We trace the library calls with ltrace:

# ltrace ./test_vdso vdso 1
__libc_start_main(0x55756a0a9c, 3, 0x7fddd9a658, 0x55756a0cc0 <unfinished ...>
getauxval(33, 0, 0x7fddd9a678, 0x55756a0a9c) = 0x7f80bba000
strcmp("vdso", "--help") = 73
atoi(0x7fddd9b4f9, 0x55756a0d61, 118, 45) = 1
getpid() = 59791
printf("pid=%d sysinfo_ehdr(vdso_addr)=%"..., 59791, 0x7f80bba000pid=59791 sysinfo_ehdr(vdso_addr)=0x7f80bba000
) = 48
strcmp("vdso", "vdso") = 0
printf("gettimeofday addr=%p \n", 0x7f80bba5c0gettimeofday addr=0x7f80bba5c0
) = 32
gettimeofday(0x7fddd9a4e8, 0) = 0
__cxa_finalize(0x55756b2008, 0x55756a0a50, 0x11d20, 1) = 0x7f80b4ec70
+++ exited (status 0) +++

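ltrace shows that the program calls gettimeofday as a dynamic library function. Now check the system calls with strace:

# strace ./test_vdso vdso 1 2>&1 | grep "gettimeofday("

No gettimeofday line is printed, so on the vdso path the call to gettimeofday does not produce a system call.

This tells us that gettimeofday(&tv, NULL); in the code ends up in __kernel_gettimeofday@@LINUX_2.6.39 inside vdso.so.

Now we raise count to 100 million calls and measure the time:

# time ./test_vdso vdso 100000000
pid=58465 sysinfo_ehdr(vdso_addr)=0x7f8e1b8000
gettimeofday addr=0x7f8e1b85c0

real 0m3.946s
user 0m3.940s
sys 0m0.007s

The run takes 3.946s, almost all of it in user space.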

4.2 syscall test

We run the syscall test with a count of 1 and check the system calls with strace:

# strace ./test_vdso syscall 1 2>&1 | grep "gettimeofday("
gettimeofday({tv_sec=1734333991, tv_usec=512630}, NULL) = 0

This time gettimeofday is issued as a real system call.

Let's also look at the ltrace output:

# ltrace ./test_vdso syscall 1 2>&1
__libc_start_main(0x558d780a9c, 3, 0x7fe207c958, 0x558d780cc0 <unfinished ...>
getauxval(33, 0, 0x7fe207c978, 0x558d780a9c) = 0x7fb95b3000
strcmp("syscall", "--help") = 70
atoi(0x7fe207d4f9, 0x558d780d61, 115, 45) = 1
getpid() = 63596
printf("pid=%d sysinfo_ehdr(vdso_addr)=%"..., 63596, 0x7fb95b3000pid=63596 sysinfo_ehdr(vdso_addr)=0x7fb95b3000
) = 48
strcmp("syscall", "vdso") = -3
strcmp("syscall", "syscall") = 0
syscall(169, 0x7fe207c7e8, 0, 0x11b033b440000) = 0
__cxa_finalize(0x558d792008, 0x558d780a50, 0x11d20, 1) = 0x7fb9547c70
+++ exited (status 0) +++

ltrace shows no gettimeofday here, only the generic syscall() wrapper called with number 169, which is the __NR_gettimeofday value the program passes.

Now we raise count to 100 million calls and measure the time:

# time ./test_vdso syscall 100000000
pid=64134 sysinfo_ehdr(vdso_addr)=0x7fb55c0000

real 0m16.279s
user 0m3.927s
sys 0m12.352s

The run takes 16.279s, of which 12.352s is sys time, so most of the cost is in the system call itself.
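The time command measures the whole process. As an alternative, both paths can be timed inside one program. The sketch below is an assumption-laden illustration, not the author's method: it uses clock_gettime(CLOCK_MONOTONIC) as the stopwatch, and the helper names now_ns, bench_libc_ns and bench_syscall_ns are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost in ns of gettimeofday() called through libc
 * (the vDSO path when the system provides one). */
static double bench_libc_ns(unsigned int count)
{
    struct timeval tv;
    double start;
    unsigned int i;

    assert(count > 0);
    start = now_ns();
    for (i = 0; i < count; i++)
        gettimeofday(&tv, NULL);
    return (now_ns() - start) / count;
}

/* Average cost in ns of gettimeofday issued as a raw system call. */
static double bench_syscall_ns(unsigned int count)
{
    struct timeval tv;
    double start;
    unsigned int i;

    assert(count > 0);
    start = now_ns();
    for (i = 0; i < count; i++)
        syscall(__NR_gettimeofday, &tv, NULL);
    return (now_ns() - start) / count;
}
```

Calling both functions with the same count from main and printing the two averages gives a per-call comparison. The absolute numbers depend on the CPU, the kernel and its configuration, so they will differ from the figures in this article.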

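5. Conclusion

For the calls it covers, Linux provides the vDSO as a shared library mapped into every process, so that by default these functions run from the shared mapping instead of entering the kernel through a system call. In the 100-million-call benchmark above, the vDSO path (3.946s) is roughly 4 times faster than the syscall path (16.279s).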
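As a worked check of the measurements, the per-call cost and the ratio between the two runs can be derived from the two real times reported above (3.946s and 16.279s for 100 million calls each). The helper names below are hypothetical and the sketch only restates that arithmetic.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: average cost of one call in nanoseconds. */
static double per_call_ns(double total_seconds, double calls)
{
    assert(calls > 0.0);
    return total_seconds * 1e9 / calls;
}

/* Hypothetical helper: how many times faster the fast run is. */
static double speedup(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}

/* Print the figures derived from the two measured runs. */
static void print_summary(void)
{
    const double calls = 100000000.0;  /* count used in both runs */
    const double vdso_s = 3.946;       /* real time of the vdso run */
    const double syscall_s = 16.279;   /* real time of the syscall run */

    printf("vdso:    %.2f ns/call\n", per_call_ns(vdso_s, calls));
    printf("syscall: %.2f ns/call\n", per_call_ns(syscall_s, calls));
    printf("speedup: %.2fx\n", speedup(syscall_s, vdso_s));
}
```

These figures include process startup and loop overhead, so they are an upper bound on the cost of the call itself rather than an exact measurement.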