aarch64架构有多个异常方式,例如:中断,软件异常,同步异常等等
这些异常在代码中需要对应一个内存地址用作跳转,在aarch64中,每个异常都可以设置入口函数,从而实现异常的处理。与armv7不同的是,v8上默认向量表的空间更大了,支持0x80大小
在linux中,异常向量表通过如下定义
SYM_CODE_START(vectors) kernel_ventry 1, t, 64, sync // Synchronous EL1t kernel_ventry 1, t, 64, irq // IRQ EL1t kernel_ventry 1, t, 64, fiq // FIQ EL1t kernel_ventry 1, t, 64, error // Error EL1t kernel_ventry 1, h, 64, sync // Synchronous EL1h kernel_ventry 1, h, 64, irq // IRQ EL1h kernel_ventry 1, h, 64, fiq // FIQ EL1h kernel_ventry 1, h, 64, error // Error EL1h kernel_ventry 0, t, 64, sync // Synchronous 64-bit EL0 kernel_ventry 0, t, 64, irq // IRQ 64-bit EL0 kernel_ventry 0, t, 64, fiq // FIQ 64-bit EL0 kernel_ventry 0, t, 64, error // Error 64-bit EL0 kernel_ventry 0, t, 32, sync // Synchronous 32-bit EL0 kernel_ventry 0, t, 32, irq // IRQ 32-bit EL0 kernel_ventry 0, t, 32, fiq // FIQ 32-bit EL0 kernel_ventry 0, t, 32, error // Error 32-bit EL0 SYM_CODE_END(vectors)
通过上面可以看到,linux定义了这些异常的实现,右边有注释,就不解释了。
代码如下:
.macro kernel_ventry, el:req, ht:req, regsize:req, label:req .align 7 .Lventry_start\@: .if \el == 0 /* * This must be the first instruction of the EL0 vector entries. It is * skipped by the trampoline vectors, to trigger the cleanup. */ b .Lskip_tramp_vectors_cleanup\@ .if \regsize == 64 mrs x30, tpidrro_el0 msr tpidrro_el0, xzr .else mov x30, xzr .endif .Lskip_tramp_vectors_cleanup\@: .endif sub sp, sp, #PT_REGS_SIZE #ifdef CONFIG_VMAP_STACK /* * Test whether the SP has overflowed, without corrupting a GPR. * Task and IRQ stacks are aligned so that SP & (1 << THREAD_SHIFT) * should always be zero. */ add sp, sp, x0 // sp' = sp + x0 sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp tbnz x0, #THREAD_SHIFT, 0f sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0 sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp b el\el\ht\()_\regsize\()_\label 0: /* * Either we've just detected an overflow, or we've taken an exception * while on the overflow stack. Either way, we won't return to * userspace, and can clobber EL0 registers to free up GPRs. */ /* Stash the original SP (minus PT_REGS_SIZE) in tpidr_el0. */ msr tpidr_el0, x0 /* Recover the original x0 value and stash it in tpidrro_el0 */ sub x0, sp, x0 msr tpidrro_el0, x0 /* Switch to the overflow stack */ adr_this_cpu sp, overflow_stack + OVERFLOW_STACK_SIZE, x0 /* * Check whether we were already on the overflow stack. This may happen * after panic() re-enables interrupts. */ mrs x0, tpidr_el0 // sp of interrupted context sub x0, sp, x0 // delta with top of overflow stack tst x0, #~(OVERFLOW_STACK_SIZE - 1) // within range? b.ne __bad_stack // no? -> bad stack pointer /* We were already on the overflow stack. Restore sp/x0 and carry on. */ sub sp, sp, x0 mrs x0, tpidrro_el0 #endif b el\el\ht\()_\regsize\()_\label .org .Lventry_start\@ + 128 // Did we overflow the ventry slot? .endm .macro tramp_alias, dst, sym, tmp mov_q \dst, TRAMP_VALIAS adr_l \tmp, \sym add \dst, \dst, \tmp adr_l \tmp, .entry.tramp.text sub \dst, \dst, \tmp .endm
这段汇编有点长,我跳过一下清理和栈必要操作,下面开始解析关键部分。
首先查看宏定义如下
.macro kernel_ventry, el:req, ht:req, regsize:req, label:req
这里宏有四个参数,其中req指的是required,四个参数含义如下:
b el\el\ht\()_\regsize\()_\label
这里对应c是如下
el##${el}${ht}_${regsize}_${label}
以fiq举例如下:
kernel_ventry 1, t, 64, fi
则跳转函数el1t_64_fiq
.org .Lventry_start\@ + 128
这里确保每个异常向量入口占用恰好 128 字节的空间。
为了确定汇编宏扩展是否正确,可以通过nm查看,如下
# nm entry.o | awk '{print $3}' | grep "^el" el0t_32_error el0t_32_fiq el0t_32_irq el0t_32_sync el0t_64_error el0t_64_fiq el0t_64_irq el0t_64_sync el1h_64_error el1h_64_fiq el1h_64_irq el1h_64_sync el1t_64_error el1t_64_fiq el1t_64_irq el1t_64_sync
我们原来最早调试51的时候,碰到程序跑飞了就不管了,那个时候认为程序跑飞是概率事件,所以没有考虑其跑飞的真正原因。最近发现aarch64架构的bl指令,非常容易导致程序跑飞。本文复现这种跑飞的情况,用来说明一个非常常见的跑飞现象,并且给出解决思路。 同样,思考以前调试51的时候,跑飞很可能是芯片的一些特性导致的问题。只怪当时知识尚浅,无法理解。
本文以rtems上的汇编为例,当然可以在linux中复现,本文就不在linux重复实验了,问题一样会出现,如下是代码diff
diff --git a/bsps/aarch64/shared/start/start.S b/bsps/aarch64/shared/start/start.S index 0a89d85035..04e2b1a8fa 100644 --- a/bsps/aarch64/shared/start/start.S +++ b/bsps/aarch64/shared/start/start.S @@ -43,6 +43,18 @@ .globl _start .section ".bsp_start_text", "ax" +kylin_call: + mov x0, xzr + ret + +kylin_out_of_control: + mov x0, xzr + bl kylin_call + mov x0, xzr + ret + /* Start entry */ _start: @@ -341,6 +353,8 @@ _start: /* Branch to start hook 1 */ bl bsp_start_hook_1 + bl kylin_out_of_control + /* Branch to boot card */ mov x0, #0 bl boot_card
上面的代码我定义了一个kylin_out_of_contrl标签,通过bl来跳转此标签,然后在kylin_out_of_control中,添加了一个没有意义的指令来代替其他指令,没有意义的指令是将x0清空。然后调用kylin_call标签,在kylin_call中继续将x0清空后返回到kylin_out_of_control,然后kylin_out_of_control内继续调用x0清空来代替其他指令,最后返回。
根据我们对汇编的理解,上面的代码完全正常,bl是跳转指令,ret返回指令,mov移动指令。每个指令简单易懂,这些代码为什么组合会跑飞呢。可以思考一下
为了理解这个问题,我们需要了解x30寄存器。在aapcs中我们可以知道x30作为lr寄存器,这个寄存器保存的是上一个函数调用pc值的+4,也就是 last_pc + 4 的值。这里出现跑飞的原因是x30寄存器不正常了。
那么我们可以思考,为什么x30会不正常?
我们知道ret指令是返回指令,那么有一个问题,ret指令它会操作系统寄存器吗(x0-x30寄存器)?
aarch64的定义上,ret没有提到会操作寄存器,我们看看ret的说明如下:
根据上面的信息,我们知道ret默认返回x30寄存器。完全没问题啊,不会操作任何寄存器
那么问题就出来了,ret不会操作寄存器,但是我们bl进入的时候,会根据aapce记录x30寄存器为原pc+4。也就是会改写x30寄存器。
发现没有,x30寄存器会被aapcs改写,但是不会在ret还原。所以真相是:
如果bl出现嵌套,那么两次ret时,最外层的函数的x30和内层函数的x30相等,那么程序就一直在这里打转。于是出现了跑飞现象
上面文字介绍不是很直观,下面开始演示
如下
qemu-system-aarch64 -no-reboot -nographic -serial mon:stdio -machine xlnx-zcu102 -m 4096 -smp 4 -kernel build/aarch64/zynqmp_qemu/testsuites/libtests/kylinos.exe -s -S
gdb开始监测
aarch64-rtems6-gdb build/aarch64/zynqmp_qemu/testsuites/libtests/malloctest.exe
挂上断点
(gdb) b kylin_out_of_control Breakpoint 1 at 0x18008: file ../../../bsps/aarch64/shared/start/start.S, line 51. (gdb) c Continuing. Thread 1 hit Breakpoint 1, kylin_out_of_control () at ../../../bsps/aarch64/shared/start/start.S:51 51 mov x0, xzr (gdb) l 46 kylin_call: 47 mov x0, xzr 48 ret 49 50 kylin_out_of_control: 51 mov x0, xzr 52 //mov x19, x30 53 bl kylin_call 54 //mov x30, x19 55 mov x0, xzr
此时x30寄存器为
(gdb) x/x $x30 0x180e4 <_start+204>
这里还是一切正常,x30正好是_start+200 + 4
此时进入kylin_call,如下
(gdb) x/x $x30 0x18010 <kylin_out_of_control+8>:
然后我们第一个ret返回,如下
(gdb) s kylin_out_of_control () at ../../../bsps/aarch64/shared/start/start.S:55 55 mov x0, xzr (gdb) x/x $x30 0x18010 <kylin_out_of_control+8>: 0xaa1f03e0 (gdb) disassemble Dump of assembler code for function kylin_out_of_control: 0x0000000000018008 <+0>: mov x0, xzr 0x000000000001800c <+4>: bl 0x18000 <bsp_vector_table_begin> => 0x0000000000018010 <+8>: mov x0, xzr 0x0000000000018014 <+12>: ret End of assembler dump.
可以看到在kylin_out_of_control中了,但是异常出现了,kylin_out_of_control函数中的x30寄存器并不是我们认为的_start+204,而是kylin_out_of_control+8
按照代码逻辑,即将在0x0000000000018014处调用ret,而这个ret默认找的x30是kylin_out_of_control+8。程序会死循环了。也就是我们说的跑飞了。
为了说明这个问题,我们需要看看两次bl和ret是否是bl修改了x30,而ret不会修改任何寄存器。信息如下
第一个bl,进入kylin_out_of_control时,寄存器如下
x1 0x0 0 x2 0xffffffffffffffe8 -24 x3 0x1030c0 1061056 x4 0x103128 1061160 x5 0x4 4 x6 0x4 4 x7 0xf 15 x8 0x1c 28 x9 0x4 4 x10 0x3c0 960 x11 0x3fffffff 1073741823 x12 0x0 0 x13 0x8000000000 549755813888 x14 0x40000000 1073741824 x15 0x200000 2097152 x16 0x0 0 x17 0x0 0 x18 0x0 0 x19 0x0 0 x20 0x0 0 x21 0x0 0 x22 0x0 0 x23 0x0 0 x24 0x0 0 x25 0x0 0 x26 0x0 0 x27 0x0 0 x28 0x0 0 x29 0x0 0 x30 0x180e4 98532
然后再一个bl进入kylin_call函数,我们查看x0-x30如下
x0 0x0 0 x1 0x0 0 x2 0xffffffffffffffe8 -24 x3 0x1030c0 1061056 x4 0x103128 1061160 x5 0x4 4 x6 0x4 4 x7 0xf 15 x8 0x1c 28 x9 0x4 4 x10 0x3c0 960 x11 0x3fffffff 1073741823 x12 0x0 0 x13 0x8000000000 549755813888 x14 0x40000000 1073741824 x15 0x200000 2097152 x16 0x0 0 x17 0x0 0 x18 0x0 0 x19 0x0 0 x20 0x0 0 x21 0x0 0 x22 0x0 0 x23 0x0 0 x24 0x0 0 x25 0x0 0 x26 0x0 0 x27 0x0 0 x28 0x0 0 x29 0x0 0 x30 0x18010 98320
发现没有,bl寄存器会修改x30寄存器。
然后我们看第一个ret后的寄存器,此时应该在函数kylin_call的ret后,也就是kylin_out_of_control中
x0 0x0 0 x1 0x0 0 x2 0xffffffffffffffe8 -24 x3 0x1030c0 1061056 x4 0x103128 1061160 x5 0x4 4 x6 0x4 4 x7 0xf 15 x8 0x1c 28 x9 0x4 4 x10 0x3c0 960 x11 0x3fffffff 1073741823 x12 0x0 0 x13 0x8000000000 549755813888 x14 0x40000000 1073741824 x15 0x200000 2097152 x16 0x0 0 x17 0x0 0 x18 0x0 0 x19 0x0 0 x20 0x0 0 x21 0x0 0 x22 0x0 0 x23 0x0 0 x24 0x0 0 x25 0x0 0 x26 0x0 0 x27 0x0 0 x28 0x0 0 x29 0x0 0 x30 0x18010 98320
可以发现,此时寄存器和bl后的寄存器完全一致,x30没有发生改变
然后查看第二次ret的寄存器信息
x0 0x0 0 x1 0x0 0 x2 0xffffffffffffffe8 -24 x3 0x1030c0 1061056 x4 0x103128 1061160 x5 0x4 4 x6 0x4 4 x7 0xf 15 x8 0x1c 28 x9 0x4 4 x10 0x3c0 960 x11 0x3fffffff 1073741823 x12 0x0 0 x13 0x8000000000 549755813888 x14 0x40000000 1073741824 x15 0x200000 2097152 x16 0x0 0 x17 0x0 0 x18 0x0 0 x19 0x0 0 x20 0x0 0 x21 0x0 0 x22 0x0 0 x23 0x0 0 x24 0x0 0 x25 0x0 0 x26 0x0 0 x27 0x0 0 x28 0x0 0 x29 0x0 0 x30 0x18010 98320
完全没有变化,而且代码回到了bl后的地址上。详细如下
(gdb) disassemble Dump of assembler code for function kylin_out_of_control: 0x0000000000018008 <+0>: mov x0, xzr 0x000000000001800c <+4>: bl 0x18000 <bsp_vector_table_begin> => 0x0000000000018010 <+8>: mov x0, xzr 0x0000000000018014 <+12>: ret End of assembler dump. (gdb) s 56 ret (gdb) disassemble Dump of assembler code for function kylin_out_of_control: 0x0000000000018008 <+0>: mov x0, xzr 0x000000000001800c <+4>: bl 0x18000 <bsp_vector_table_begin> 0x0000000000018010 <+8>: mov x0, xzr => 0x0000000000018014 <+12>: ret End of assembler dump. (gdb) s 55 mov x0, xzr (gdb) disassemble Dump of assembler code for function kylin_out_of_control: 0x0000000000018008 <+0>: mov x0, xzr 0x000000000001800c <+4>: bl 0x18000 <bsp_vector_table_begin> => 0x0000000000018010 <+8>: mov x0, xzr 0x0000000000018014 <+12>: ret End of assembler dump.
修复这个问题很简单,我们可以利用x19这种aapcs约定临时存储寄存器来保存x30,如下
diff --git a/bsps/aarch64/shared/start/start.S b/bsps/aarch64/shared/start/start.S index 0a89d85035..f284cf2baa 100644 --- a/bsps/aarch64/shared/start/start.S +++ b/bsps/aarch64/shared/start/start.S @@ -43,6 +43,17 @@ .globl _start .section ".bsp_start_text", "ax" +kylin_call: + mov x0, xzr + ret + +kylin_out_of_control: + mov x19, x30 + bl kylin_call + mov x30, x19 + mov x0, xzr + ret + /* Start entry */ _start: @@ -341,6 +352,8 @@ _start: /* Branch to start hook 1 */ bl bsp_start_hook_1 + bl kylin_out_of_control + /* Branch to boot card */ mov x0, #0 bl boot_card
此时运行代码,系统不再跑飞了。
apt install docker docker.io docker-compose
/etc/docker/daemon.json
{ "dns": ["8.8.8.8", "8.8.4.4"], "registry-mirrors": [ "https://docker.m.daocloud.io/", "https://huecker.io/", "https://dockerhub.timeweb.cloud", "https://noohub.ru/", "https://dockerproxy.com", "https://docker.mirrors.ustc.edu.cn", "https://docker.nju.edu.cn", "https://xx4bwyg2.mirror.aliyuncs.com", "http://f1361db2.m.daocloud.io", "https://registry.docker-cn.com", "http://hub-mirror.c.163.com" ] }
curl -L https://vanblog.mereith.com/vanblog.sh -o vanblog.sh && chmod +x vanblog.sh && ./vanblog.sh
按下10备份即可
10. 备份 VanBlog
此时获得文件 vanblog-backup-20250311153742.tar.gz
按下11 恢复即可
11. 恢复 VanBlog
至此,全部完成
Waydorid是从anbox演进的新的容器安卓的一种方案,我们的客户想要运行安卓程序,这里描述如何使用waydroid来兼容安卓10的系统
git clone https://github.com/waydroid/libglibutil.git git clone https://github.com/waydroid/libgbinder.git git clone https://github.com/waydroid/gbinder-python git clone https://github.com/waydroid/waydroid.git
apt build-dep . debuild -us -uc
这里 gbinder-python没用changlog,所有需要额外处理一下
dch --create --package "gbinder-python" --newversion "1.0.0~git20210909-1" foo bar dh_make --createorig -p "gbinder-python_1.0.0~git20210909"
因为waydroid依赖容器,所以需要安装lxc
apt install lxc dpkg -i bridge-utils_1.6-2kylin1_arm64.deb liblxc1_1%3a4.0.2-0kylin1_arm64.deb liblxc-common_1%3a4.0.2-0kylin1_arm64.deb lxc_1%3a4.0.2-0kylin1_all.deb lxc-utils_1%3a4.0.2-0kylin1_arm64.deb
然后安装编译出来的包
dpkg -i libgbinder_1.1.25_arm64.deb libgbinder-tools_1.1.25_arm64.deb libglibutil_1.0.66_arm64.deb python3-gbinder_1.1_arm64.deb waydroid_1.3.0_all.deb
安卓系统需要binder和ashmem,如果系统安装了linux-module包,可以直接加载,如下
modprobe ashmem_linux modprobe binder_linux
如果linux是可编译的,可以修改config增加如下
CONFIG_ASHMEM=y CONFIG_ANDROID=y CONFIG_ANDROID_BINDER_IPC=y CONFIG_ANDROID_BINDERFS=y
CONFIG_BRIDGE / "802.1d Ethernet Bridging" ('Networking support -> Networking options -> 802.1d Ethernet Bridging') CONFIG_VETH / "Veth pair device" CONFIG_MACVLAN / "Macvlan" CONFIG_VLAN_8021Q / "Vlan" CONFIG_NETFILTER CONFIG_NF_NAT CONFIG_NF_TABLES CONFIG_NFT_NAT CONFIG_NETFILTER_XT_TARGET_CHECKSUM CONFIG_IP_NF_TARGET_MASQUERADE CONFIG_NF_CONNTRACK CONFIG_NETFILTER_XTABLES CONFIG_IP_NF_IPTABLES CONFIG_IP_NF_FILTER CONFIG_IP_NF_MANGLE
如下是kmre启动过程中需要的iptable_nat模块添加的宏:
CONFIG_PACKET=y CONFIG_NETFILTER=y CONFIG_IP_NF_CONNTRACK=y CONFIG_IP_NF_FTP=y CONFIG_IP_NF_IRC=y CONFIG_IP_NF_IPTABLES=y CONFIG_IP_NF_FILTER=y CONFIG_IP_NF_NAT=y CONFIG_IP_NF_MATCH_STATE=y CONFIG_IP_NF_TARGET_LOG=y CONFIG_IP_NF_MATCH_LIMIT=y CONFIG_IP_NF_TARGET_MASQUERADE=y CONFIG_NETFILTER_ADVANCED=y CONFIG_NF_CONNTRACK_IPV4=y
CONFIG_NAMESPACES / "Namespaces" ('General Setup -> Namespaces support') CONFIG_UTS_NS / "Utsname namespace" ('General Setup -> Namespaces Support / UTS namespace') CONFIG_IPC_NS / "Ipc namespace" ('General Setup -> Namespaces Support / IPC namesapce') CONFIG_USER_NS / "User namespace" ('General Setup -> Namespaces Support / User namespace (EXPERIMENTAL)') CONFIG_PID_NS / "Pid namespace" ('General Setup -> Namespaces Support / PID Namespaces') CONFIG_NET_NS / "Network namespace" ('General Setup -> Namespaces Support -> Network namespace')
CONFIG_CGROUP_SCHED / "Cgroup sched" ('General Setup -> Control Group support -> Group CPU scheduler') CONFIG_FAIR_GROUP_SCHED / "Group scheduling for SCHED_OTHER" ('General Setup -> Control Group support -> Group CPU scheduler -> Group scheduling for SCHED_OTHER') CONFIG_BLK_CGROUP / "Block IO controller" ('General Setup -> Control Group support -> Block IO controller') CONFIG_CFQ_GROUP_IOSCHED / "CFQ Group Scheduling support" ('Enable the block layer -> IO Schedulers -> CFQ I/O scheduler -> CFQ Group Scheduling support') CONFIG_CGROUP_CPUACCT / "Cgroup cpu account" ('General Setup -> Control Group support -> Simple CPU accounting cgroup subsystem') CONFIG_MEMCG_SWAP / Memory Resource Controller Swap Extension
内核配置可以按照如下脚本检测docker启动必须的脚本:check-config.sh;使用命令如下:
./check-config.sh rockchip_linux_defconfig
配置如下
General setup ---> [*] Control Group support ---> [*] Freezer cgroup subsystem [*] Device controller for cgroups [*] Cpuset support [*] Include legacy /proc/<pid>/cpuset file [*] Simple CPU accounting cgroup subsystem [*] Resource counters [*] Memory Resource Controller for Control Groups [*] Memory Resource Controller Swap Extension [*] Memory Resource Controller Swap Extension enabled by default [*] Enable perf_event per-cpu per-container group (cgroup) monitoring [*] Group CPU scheduler ---> [*] Group scheduling for SCHED_OTHER [*] Group scheduling for SCHED_RR/FIFO <*> Block IO controller -*- Namespaces support [*] UTS namespace [*] IPC namespace [*] User namespace (EXPERIMENTAL) [*] PID Namespaces [*] Network namespace [*] Configure standard kernel features (expert users) [*] Networking support ---> Networking options ---> <M> 802.1d Ethernet Bridging <M> 802.1Q VLAN Support Device Drivers ---> [*] Network device support ---> <M> MAC-VLAN support (EXPERIMENTAL) <M> Virtual ethernet pair device Character devices ---> -*- Unix98 PTY support [*] Support multiple instances of devpts
在cmdline,也就是chosen的bootarge下追加 psi=1,可以降低oom的风险
默认需要下载system.img和vendor.img
如果已存在上述文件,可以直接使用
mkdir -p /usr/share/waydroid-extra/images mv system.img vendor.img /usr/share/waydroid-extra/images/ waydroid init
如果没有system vendor,则会自动下载
下载地址为:https://sourceforge.net/projects/waydroid/files/images/
使用之前,建议重启一下容器服务,如下:
systemctl restart waydroid-container.service
重启之后,在机器环境下直接启动安卓UI如下
waydroid show-full-ui
安卓容器可以获取/设置属性,通过如下命令
waydroid prop get ro.hardware.egl waydroid prop set ro.hardware.egl=swiftshader
waydroid app install *.apk waydroid app launch com.android.calculator2
waydroid logcat
waydroid shell
ro.hardware.gralloc和ro.hardware.egl是选择显卡驱动的。
如果没有选择,默认走gbm和mesa。如果为了兼容,可以走
ro.hardware.gralloc=default ro.hardware.egl=swiftshader
当然不同机器按照实际情况走即可
waydroid是基于wayland后端,如果要在X11平台启动,则需要启动一个wayland后端,这里可以是weston
然后在weston里面启动waydroid ui即可