rts-cpu-isolator may have some non-standard requirements for being stable
This issue tracks the module unload issue (and any others we find) until we know what rts-cpu-isolator requires beyond Debian 11's default configuration.
Completed Tests (See working configs)
- Load/Unload Test
- The benchmark kernel module was repeatedly loaded and unloaded by a script for ~10 hours
- Kernel was still functioning as expected after the test, and a quick glance at the stats suggested timing stability
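The load/unload test above could be driven by a loop along these lines. This is only a sketch, not the script that was actually used: the module name `rtKernelBench` is taken from the "Modules linked in" line of the log in this issue, while `MODULE_PATH`, the `RUN` wrapper, and the function name are hypothetical.

```shell
#!/bin/sh
# Sketch of a load/unload stress loop (not the original test script).
# MODULE_PATH is an assumed location for the benchmark module.
MODULE_PATH="${MODULE_PATH:-./rtKernelBench.ko}"
MODULE_NAME="${MODULE_NAME:-rtKernelBench}"
# RUN prefixes every privileged command; set RUN=echo to print the
# commands instead of executing them.
RUN="${RUN:-sudo}"

load_unload_cycles() {
    # $1: number of load/unload cycles to run
    i=0
    while [ "$i" -lt "$1" ]; do
        $RUN insmod "$MODULE_PATH" || return 1
        $RUN rmmod "$MODULE_NAME" || return 1
        i=$((i + 1))
    done
    echo "completed $i cycles"
}
```

Calling `RUN=echo load_unload_cycles 2` prints the commands without touching the kernel; dropping the `RUN=echo` prefix runs them through `sudo`.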
Upcoming Tests
- Long-term timing test with the most vanilla Debian 11 configuration we can find
Configurations
Working
- Isolate CPUs (BOOT_IMAGE=/boot/vmlinuz-5.10.0-10-amd64 ... ro isolcpus=2-7 quiet)
- This is currently the most vanilla configuration that appears to fix the module unload issue
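To confirm that a machine actually booted with the working configuration, the `isolcpus=` value can be read back from the kernel command line. The helper below is a hypothetical sketch (the function name is not part of any existing tooling); it only parses a string, so it can be tried on any system.

```shell
#!/bin/sh
# Print the value of the isolcpus= parameter in a kernel command line.
# Returns non-zero when the parameter is absent.
isolcpus_from_cmdline() {
    # $1: kernel command line, e.g. the contents of /proc/cmdline
    value=$(printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^isolcpus=//p' | head -n 1)
    [ -n "$value" ] && echo "$value"
}

# On a live system:
#   isolcpus_from_cmdline "$(cat /proc/cmdline)"
#   cat /sys/devices/system/cpu/isolated   # CPUs the kernel reports as isolated
```

With the working command line above, the function prints `2-7`; with the default command line it prints nothing and returns a non-zero status.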
Broken
- Default Everything (BOOT_IMAGE=/boot/vmlinuz-5.10.0-10-amd64 ... ro quiet) (DFE)
- The stock Debian 11 configuration; it appears to have the module unload issue
- DFE, plus bringing the CPU down and back up before running the benchmark on it
- DFE, plus isolating the CPU with tuna and moving the kernel workqueue affinity
sudo tuna --cpus=7 --isolate
find /sys/devices/virtual/workqueue -name cpumask -exec sh -c 'echo 1 > {}' ';'
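For reproducing the "bring the CPU down and back up" variant, the CPU can be cycled through the hotplug `online` file in sysfs. This is a sketch under assumptions: the issue does not record the exact commands used, the function name and the `SYSFS_CPU` override are hypothetical, and this configuration is listed as broken, so it is a reproduction aid rather than a fix.

```shell
#!/bin/sh
# Root of the per-CPU sysfs directories; overridable so the function can
# be exercised against a scratch directory.
SYSFS_CPU="${SYSFS_CPU:-/sys/devices/system/cpu}"

cycle_cpu() {
    # $1: number of the CPU to take offline and bring back online
    ctl="$SYSFS_CPU/cpu$1/online"
    [ -w "$ctl" ] || { echo "cannot write $ctl (need root?)" >&2; return 1; }
    echo 0 > "$ctl" || return 1   # take the CPU offline
    sleep 1                       # give the hotplug operation time to settle
    echo 1 > "$ctl"               # bring it back online
}
```

On the test machine this would be run as root, for example `cycle_cpu 7`, before loading the benchmark module.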
Known Issues
Module Unload Issue
This happens when we try to unload the benchmark module. If we ignore the warning and keep loading and unloading the module, the kernel eventually crashes.
[ 173.169782] LIGO code is done, calling regular shutdown code
[ 175.231750] smpboot: Booting Node 0 Processor 7 APIC 0xe
[ 175.232871] ------------[ cut here ]------------
[ 175.232879] WARNING: CPU: 7 PID: 0 at arch/x86/mm/tlb.c:625 initialize_tlbstate_and_flush+0x16e/0x180
[ 175.232880] Modules linked in: rtKernelBench(OE-) rts_cpu_isolator(OE) intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ast drm_vram_helper drm_ttm_helper irqbypass ttm drm_kms_helper ghash_clmulni_intel cec aesni_intel libaes crypto_simd cryptd glue_helper iTCO_wdt intel_pmc_bxt pcspkr iTCO_vendor_support evdev joydev rapl intel_cstate ipmi_ssif mei_me watchdog mei intel_uncore acpi_ipmi ipmi_si sg ipmi_devintf ipmi_msghandler ioatdma acpi_power_meter acpi_pad button drm fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic hid_generic usbhid hid sd_mod sr_mod cdrom t10_pi crc_t10dif crct10dif_generic xhci_pci ahci xhci_hcd libahci ehci_pci ehci_hcd crct10dif_pclmul libata crct10dif_common crc32_pclmul igb usbcore crc32c_intel scsi_mod i2c_algo_bit i2c_i801 dca lpc_ich i2c_smbus ptp usb_common pps_core wmi
[ 175.232922] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G OE 5.10.0-10-amd64 #1 Debian 5.10.84-1
[ 175.232922] Hardware name: Racklive Super Server/X10SRi-F, BIOS 3.1 06/06/2018
[ 175.232924] RIP: 0010:initialize_tlbstate_and_flush+0x16e/0x180
[ 175.232926] Code: 02 00 75 14 0f 0b 48 8b 73 50 bf 00 00 00 80 48 8d 14 3e e9 12 ff ff ff 48 8b 73 50 bf 00 00 00 80 48 8d 14 3e e9 00 ff ff ff <0f> 0b e9 e7 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[ 175.232927] RSP: 0000:ffffb45840137ee0 EFLAGS: 00010006
[ 175.232928] RAX: 000000080520a000 RBX: ffff8997c4946e80 RCX: 00000001085ba000
[ 175.232929] RDX: ffff8991485ba000 RSI: ffff8990c85ba000 RDI: 0000766fc0000000
[ 175.232930] RBP: 0000000000000001 R08: 000000000000000f R09: fffffffface08aa0
[ 175.232930] R10: 000000000000000f R11: ffffb45840137e16 R12: 0000000000000007
[ 175.232931] R13: ffff8990c09f2f80 R14: 0000000000000000 R15: 000000000000b000
[ 175.232932] FS: 0000000000000000(0000) GS:ffff89981fdc0000(0000) knlGS:0000000000000000
[ 175.232932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 175.232933] CR2: 0000000000000000 CR3: 000000080520a000 CR4: 00000000003300a0
[ 175.232933] Call Trace:
[ 175.232940] cpu_init+0x1d4/0x3b0
[ 175.232944] start_secondary+0x23/0x150
[ 175.232946] ? set_cpu_sibling_map+0x570/0x570
[ 175.232948] secondary_startup_64_no_verify+0xb0/0xbb
[ 175.232950] ---[ end trace 630c4a88f58c57df ]---
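When testing a configuration, the kernel log can be scanned for this specific warning after each unload. The filter below is a minimal sketch; the function name is hypothetical and the pattern is based only on the WARNING line shown in the trace above.

```shell
#!/bin/sh
# Count occurrences of the TLB-state warning in kernel log text read
# from stdin. Prints 0 when the warning is absent.
count_tlb_warnings() {
    # grep -c exits non-zero on zero matches, so mask that status.
    grep -c 'WARNING: .*initialize_tlbstate_and_flush' || true
}

# On a live system:
#   sudo dmesg | count_tlb_warnings
```

A non-zero count after an unload indicates the configuration under test still has the module unload issue.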
Edited by Ezekiel Dohmen