<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Khepu</title>
        <link>https://khepu.com</link>
        <description>Where I technologically vent</description>
        <lastBuildDate>Sat, 11 Apr 2026 11:08:13 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>Feed | Nigiri</generator>
        <language>en</language>
        <image>
            <title>Khepu</title>
            <url>https://khepu.com/favicon/apple-touch-icon.png</url>
            <link>https://khepu.com</link>
        </image>
        <copyright>All rights reserved 2026</copyright>
        <item>
            <title><![CDATA[Battleplans: Befriending the cache]]></title>
            <link>https://khepu.com/posts/2025-07-06</link>
            <guid>https://khepu.com/posts/2025-07-06</guid>
            <pubDate>Sun, 06 Jul 2025 00:23:00 GMT</pubDate>
            <content:encoded><![CDATA[
After having a working method for generating Dijkstra maps, it's time to understand whether they are viable for our needs here. They are not generated every frame, but generation needs to be fast enough for things to feel responsive.

Let's first talk about when they get computed and when we can do hacky things to avoid an expensive computation.

When a user is directly commanding units, we definitely need to produce a fresh map; we have to pay this upfront cost, as reusing another one would not really save us any work. The longer it takes for a command to complete, though, the more stale the map becomes, and we cannot let the behavior map deviate too much from what the player sees.

When buildings get added or deleted, it is easy enough to patch that area of the map, and those events should not be frequent enough to make this problematic. What about units, though? You can hardly expect them to sit in one place for long, and keeping track of their movement in multiple behavior maps seems like too much work. The easy solution here is to exclude them completely from this computation: we simply do not care about unit positioning here, and it will be handled elsewhere, by an occupancy map or similar.

Another problem appears when we consider multiple goals being present in a single behavior map: once one of the goals is completed, then what? We could recompute, but that also sounds expensive, depending on how long-lived goals are. To deal with this, once a goal has been completed we will simply replace it with a "reverse-goal": essentially, a goal with very low priority which has a repulsive effect on units, pushing them away. The good thing about this approach is that it is easy to patch onto an existing map. The reverse-goal acts as a flood fill up until it merges with the "borders" of the other goals. This reuses the existing generation method, keeping things simple.

We now have a good sense of when a full computation is required, though my original implementation is quite far from usable even for what we described above. On a 1000x1000 grid, this is how long it takes to populate the behavior map:

```lisp
Evaluation took:
  1.465 seconds of real time
  0.671875 seconds of total run time (0.515625 user, 0.156250 system)
  [ Real times consist of 1.045 seconds GC time, and 0.420 seconds non-GC time. ]
  [ Run times consist of 0.421 seconds GC time, and 0.251 seconds non-GC time. ]
  45.87% CPU
  3,095,285,390 processor cycles
  896,070,112 bytes consed
```
and here is the implementation:

```lisp
(defun populate-behavior-grid (field goals)
  (let* ((width (field-width field))
         (height (field-height field))
         (behavior (make-array (list (field-width field)
                                     (field-height field))
                               :initial-element (1+ (max width height))))
         (grid (field-grid field))
         (queue (make-queue)))
    (dolist (goal goals)
      (enqueue goal queue))

    (loop for goal = (dequeue queue)
          for coords = (goal-coords goal)
          for x = (truncate (vx coords))
          for y = (truncate (vy coords))
          for priority = (goal-priority goal)
          unless (or
                  (let ((current-value (aref behavior x y)))
                    (<= current-value priority))
                  (let ((map-value (aref grid x y)))
                    (eql map-value +occupied+)))
            do (let ((new-priority (1+ priority)))
                 (setf (aref behavior x y) priority)
                 (dolist (neighbor (neighbors coords width height))
                   (enqueue (make-goal :coords neighbor
                                       :priority new-priority)
                            queue)))
          until (queue-empty-p queue))
    behavior))
```

It is all quite simple, which makes you wonder: where the hell is all that time spent? Notice the queue: it's backed by a doubly-linked list. The `enqueue`, `dequeue`, and `queue-empty-p` operations are all O(1). The devil is in the details! All those small allocations, scattered all over memory, are an absolute nightmare for the CPU. It cannot predict shit.

Let's give it some help: we will throw away the queue, preallocate an array that is large enough, and use a fill pointer to keep track of how many elements we have "queued up". Here is what that would look like:

```lisp
(defun populate-behavior-grid (field goals)
  (let* ((width (field-width field))
         (height (field-height field))
         (behavior (make-array (list width height)
                               :initial-element (1+ (max width height))
                               :element-type '(signed-byte 16)))
         (grid (field-grid field))
         (queue (make-array (list (* width height))
                            :initial-element nil
                            :element-type '(or null goal)
                            :fill-pointer 0)))
    (dolist (goal goals)
      (vector-push goal queue))

    ;; Incrementing INDEX essentially acts as removing the first
    ;; element of the queue.
    (loop for index from 0
          until (eql index (length queue))
          for goal of-type goal = (aref queue index)
          for coords = (goal-coords goal)
          for x of-type fixnum = (floor (vx coords))
          for y of-type fixnum = (floor (vy coords))
          for priority = (goal-priority goal)
          unless (or
                  (let ((current-value (aref behavior x y)))
                    (<= current-value priority))
                  (let ((map-value (aref grid x y)))
                    (eql map-value +occupied+)))
            do (setf (aref behavior x y) priority)
               (dolist (neighbor (neighbors coords width height))
                 (let ((new-goal (make-goal :coords neighbor
                                            :priority (1+ priority))))
                   (vector-push new-goal queue))))
    behavior))
```

By being more explicit about our memory needs and using a more cache-friendly structure, we can speed things up quite a bit. Our access pattern is dead simple too, which makes this entire setup pretty ideal:

```lisp
Evaluation took:
  0.036 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  76,524,229 processor cycles
  90,568,672 bytes consed
```

~36ms is by no means the best we can do, but it is certainly good enough to move on with building the rest of the game and ignore this until it becomes an issue again. I am sure it will, as we will be adding more and more complexity to the generation of a behavior map, but this is necessary.

While we are using a high-level language and operate many abstraction layers above the CPU, sometimes understanding what happens at that level gives you a whole new perspective on how to design your code. I am sure more challenges like this will come up, and this was definitely a fun one. I knew that arrays were more cache-friendly than lists, but I had never really grasped the difference this can make in a tight loop. Along with that, we have removed several levels of indirection and allowed the compiler to do its absolute best to help us in what we are doing. Take a moment out of your day to thank your compiler for its hard work.

3 tricks to remember:
- preallocate the memory you are going to use in a tight loop
- use a more cache-friendly structure where it makes sense
- keep access patterns predictable
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Battleplans: Pathfinding (part 1)]]></title>
            <link>https://khepu.com/posts/2025-03-30</link>
            <guid>https://khepu.com/posts/2025-03-30</guid>
            <pubDate>Sun, 30 Mar 2025 13:00:00 GMT</pubDate>
            <content:encoded><![CDATA[
Recently, I've been trying my hand at game development. The goal is to make a classic RTS but with minimal graphics; I am sticking to geometric shapes because I couldn't make things look nice to save my life. Using Common Lisp and cl-raylib, I managed to cover quite a bit of ground quite fast. I am at the point where I can render circles that represent units and rectangles that represent buildings. Centering the text in either of those was more mental gymnastics than I thought it would be, but still easier than CSS.

# Pathfinding in dynamic environments

In RTS games the field constantly changes: new buildings are added, new units are created, or existing ones simply move around; there really isn't much that stays constant. Pathfinding was never a light topic, but a dynamic environment, along with the constraint of solving these problems in real-time for the game to be playable, makes it especially challenging.

The first problem I had here was changing my perception of space. If I wanted to apply an off-the-shelf pathfinding algorithm, I could not keep thinking about continuous space; it's all about graphs in that domain... It took me a while to internalize that essentially everything is placed on an imaginary grid, treated like a graph where the value of each slot in the grid is the cost to traverse it. The catch here is that things don't magically teleport from one square to the other, they can be "in-transit", which creates just as many problems as it solves. One thing at a time though.

## A*

I decided that having something to build around and use to understand the problem would be better than nothing. A simple A* implementation should get things rolling. It didn't take much to get it working, and it even handled cases where there was no path between the unit and the destination, which is great.

The point of A* is to avoid mapping out the entire grid and assigning values: it always picks up searching from the square closest (by Manhattan distance) to the goal. Regardless of how big the grid is, the cost scales only with the distance between the start and the goal. At this point I have zero clue how granular the grid should be, so I am just winging it and making up numbers, to be adjusted later.
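To make the idea concrete, here is a minimal A* sketch, written in Python just to keep the illustration compact. The grid encoding (`1` for blocked), the 4-way movement, and all names are my own assumptions for the sketch, not the game's actual code:

```python
import heapq
import itertools

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(grid, start, goal):
    """Cells are (x, y) tuples; grid[y][x] == 1 means blocked."""
    width, height = len(grid[0]), len(grid)
    tie = itertools.count()  # tie-breaker so the heap never compares cells
    frontier = [(manhattan(start, goal), next(tie), 0, start, None)]
    came_from = {}
    while frontier:
        _, _, cost, current, parent = heapq.heappop(frontier)
        if current in came_from:      # already expanded via a cheaper route
            continue
        came_from[current] = parent
        if current == goal:           # walk parent links back to the start
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        x, y = current
        for dx, dy in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] == 0:
                heapq.heappush(frontier,
                               (cost + 1 + manhattan((nx, ny), goal),
                                next(tie), cost + 1, (nx, ny), current))
    return None  # no path between start and goal
```

On an open grid the search expands roughly along the straight line between start and goal, which is why the cost scales with distance rather than grid size.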

Well, a couple of issues are already apparent. Since we map out just one path, the moment at least one square on that path changes, the path becomes useless and we need a new one... We really cannot afford to do this for every change. We would also need a path per unit: if we wanted 100 units to move at once, that's 100 A* calls, ouch! Lastly, A* cannot handle multiple goals. This is important for positioning and targeting: if 2 targets (or more!) are available to a unit, we'd need to get it to pick the closest one and navigate there.

## Dijkstra Maps

Since I now knew what was missing, all I had to do was look for it. Dijkstra maps turned out to be a wonderful and very simple tool. It's a terrible name though, so I prefer to call them behavior maps. The idea here is that we create a second grid, the behavior grid, and on it we mark all the goals for a unit (or a group of them) with `0`. All adjacent squares are marked `1`, their adjacent squares `2`, and so on. This should take a bit longer to calculate than A*, but now, no matter where something is on the grid, if it always moves to the neighboring square with the lowest number, it will reach one of the goals. Effectively, we have pathfinding for all units, immediately, to all goals.

|   |   |   |   |
|---|---|---|---|
| 4 | 4 | 4 | 4 |
| 3 | # | 3 | 3 |
| 2 | # | 2 | 2 |
| 2 | 1 | 1 | 1 |
| 2 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 |

_Where `#` means blocked_
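The fill described above can be sketched in a few lines of Python. This is illustrative only: the names and the blocked-cell encoding are my own, and diagonal steps cost 1, matching the table:

```python
from collections import deque

DIRECTIONS = [(0, 1), (1, 0), (1, 1), (0, -1),
              (-1, 0), (-1, -1), (-1, 1), (1, -1)]

def behavior_map(width, height, blocked, goals):
    """Flood fill from all goals at once; diagonal steps also cost 1.

    `blocked` is a set of (x, y) cells, `goals` a list of (x, y) cells.
    """
    unreachable = max(width, height) + 1       # sentinel for unvisited cells
    grid = [[unreachable] * width for _ in range(height)]
    queue = deque((x, y, 0) for x, y in goals)
    while queue:
        x, y, value = queue.popleft()
        if (x, y) in blocked or grid[y][x] <= value:
            continue                           # blocked, or a shorter route won
        grid[y][x] = value
        for dx, dy in DIRECTIONS:
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                queue.append((nx, ny, value + 1))
    return grid
```

Seeding the queue with every goal at value `0` is what makes a single pass serve all goals at once.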

Due to how the behavior map is laid out, a unit will always gravitate towards the closest goal. More than that, the behavior grid is quite easy to manipulate: even if on the first pass we need to map out everything, we can accommodate smaller changes by "patching" it. New building got placed? Let me just mark those squares as closed and adjust the numbers on the adjacent ones, done!

This is more than pathfinding though, because you can encode all sorts of behavior on that map depending on what a unit needs to do. Whether it's a passive unit that needs to flee from enemies, or a ranged unit that needs to be X squares away from the target, all of that can be encoded there. It doesn't directly solve the issue of multiple units not stepping on each other, but that is something that can be handled at a different level.

The implementation for it is quite clean and simple, nothing fancy, though I have not bothered optimizing it yet. My guess is that this is a very SIMD-able problem, and maybe I can get a massive speedup that way.


```lisp
(defstruct (behavior (:constructor %make-behavior))
  grid
  width
  height
  goals)

(defstruct goal
  (coords   (error "No COORDS supplied for goal!")   :type vec2)
  (priority (error "No PRIORITY supplied for goal!") :type fixnum))

(defun make-behavior (field goals)
  (%make-behavior :grid (populate-behavior-grid field goals)
                  :width (field-width field)
                  :height (field-height field)
                  :goals goals))

(defun out-of-bounds-p (x y width height)
  (not (and (<= 0 x (1- width))
            (<= 0 y (1- height)))))

(defun neighbors (center width height)
  (loop for direction in (list (vec  0  1)
                               (vec  1  0)
                               (vec  1  1)
                               (vec  0 -1)
                               (vec -1  0)
                               (vec -1 -1)
                               (vec -1  1)
                               (vec  1 -1))
        for coord = (v+ center direction)
        unless (out-of-bounds-p (vx coord) (vy coord) width height)
          collect coord))

(defun populate-behavior-grid (field goals)
  (let* ((width (field-width field))
         (height (field-height field))
         (behavior (make-array (list (field-width field)
                                     (field-height field))
                               :initial-element (1+ (max width height))))
         (grid (field-grid field))
         (queue (make-queue)))
    (dolist (goal goals)
      (enqueue goal queue))

    (loop for goal = (dequeue queue)
          for coords = (goal-coords goal)
          for x = (truncate (vx coords))
          for y = (truncate (vy coords))
          for priority = (goal-priority goal)
          unless (or (<= (aref behavior x y) priority)
                     (eql (aref grid x y) +occupied+))
            do (let ((new-priority (1+ priority)))
                 (setf (aref behavior x y) priority)
                 (dolist (neighbor (neighbors coords width height))
                   (enqueue (make-goal :coords neighbor
                                       :priority new-priority)
                            queue)))
          until (queue-empty-p queue))
    behavior))

(defun find-closest-neighbor (coords behavior)
  (let ((grid (behavior-grid behavior))
        (width (behavior-width behavior))
        (height (behavior-height behavior)))
    (loop with min = coords
          with min-value = (aref grid
                                 (truncate (vx min))
                                 (truncate (vy min)))
          for neighbor in (neighbors coords width height)
          for value = (aref grid
                            (truncate (vx neighbor))
                            (truncate (vy neighbor)))
          when (< value min-value)
            do (setf min       neighbor
                     min-value value)
          finally (return min))))

```

and here is what the unit actually does:

```lisp
(defvar *units* nil)

(defparameter *unit-size* 16.0)
(defparameter *unit-font-size* 12)

(defstruct (unit (:constructor %make-unit))
  (kind   (error "KIND not supplied for unit!")   :type keyword)
  (coords (error "COORDS not supplied for unit!") :type vec2)
  (owner  (error "OWNER not supplied for unit!")  :type player)
  (speed  (error "SPEED not supplied for unit!")  :type single-float)
  (behavior nil                                   :type (or null behavior)))

(defun make-unit (&key kind coords owner (field *field*))
  (let ((unit (%make-unit :kind kind
                          :coords coords
                          :speed 0.5
                          :owner owner)))
    (push unit *units*)
    (place unit field)
    unit))

(defun move (unit offset field)
  (let ((new-coords (v+ (unit-coords unit) offset)))
    (unless (or (out-of-bounds-p (vx new-coords) (vy new-coords)
                                 (field-width field) (field-height field))
                (occupied-p (vx new-coords) (vy new-coords) field))
      (setf (unit-coords unit) new-coords)
      (place unit field))
    unit))

(defun offset (unit source target)
  (let* ((speed (unit-speed unit))
         (distance (v- target source))
         (x (vx distance))
         (y (vy distance)))
    (vec (if (minusp x)
             (max (- speed) x)
             (min speed     x))
         (if (minusp y)
             (max (- speed) y)
             (min speed     y)))))

(defmethod act ((unit unit) (field field))
  (when-let (behavior (unit-behavior unit))
    (let* ((normalized-coords (v/ (unit-coords unit) 8.0))
           (target (find-closest-neighbor normalized-coords behavior)))
      (if (v/= normalized-coords target)
          (move unit (offset unit normalized-coords target) field)
          (setf (unit-behavior unit) nil)))))
```


In the future this will probably get a bit more convoluted, as I would like to be able to pass in a function that better defines the behavior of a unit, but as a starting point, I am really happy with how simple it was to get this working.

Nothing interesting to show at this point, but I am having fun. Once I get those dumb units to stop running on top of each other, I will try to share a video in a following post, along with other problems that came up along the way!

# Resources

- [cl-raylib](https://github.com/longlene/cl-raylib)
- [Dijkstra maps](https://www.roguebasin.com/index.php/Dijkstra_Maps_Visualized)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: Handling raw data]]></title>
            <link>https://khepu.com/posts/2024-11-24</link>
            <guid>https://khepu.com/posts/2024-11-24</guid>
            <pubDate>Sun, 24 Nov 2024 23:22:00 GMT</pubDate>
            <content:encoded><![CDATA[
The goal for today is to get to the point where we can receive events from the device in our driver and basically understand what we are getting.

In its current state, even though the driver does execute the `probe` function, it does not seem to be receiving any raw events. The reason is that `probe` does not even come close to handling everything it should. It needs to first receive and parse a [report descriptor](https://docs.kernel.org/hid/hidintro.html#id5), which describes the data that the device is going to send. Remember how `evtest` could give us a list of all the input options? It can do that by reading the report descriptor and analyzing it. Thankfully, we are not doing anything special with it so far, so we can just use the default parsing function offered by the kernel.

# Probe

```c
static int probe(struct hid_device *hdev, const struct hid_device_id *id) {
  printk(KERN_INFO "winwing_ufc1_devdrv: probed\n");

  int ret = hid_parse(hdev);

  if (ret) {
    printk(KERN_ERR "winwing_ufc1_devdrv: parse failed");
    return ret;
  }

  ret = hid_hw_start(hdev, HID_CONNECT_HIDRAW);

  if (ret) {
    printk(KERN_ERR "winwing_ufc1_devdrv: hw start failed");
    return ret;
  }

  ret = hid_hw_open(hdev);
  if (ret) {
    printk(KERN_ERR "winwing_ufc1_devdrv: hw open failed");
    hid_hw_stop(hdev);
    return ret;
  }

  return 0;
}

static void remove(struct hid_device *hdev) {
  hid_hw_close(hdev);
  hid_hw_stop(hdev);
}
```

The descriptor is parsed with `hid_parse`. After that we need to "start" the device, which initializes hardware buffers and connects to the device. `HID_CONNECT_HIDRAW` is specified as the mode, instead of the standard `HID_CONNECT_DEFAULT`, in order to avoid having events sent to the default HID driver. The purpose of `hid_hw_open` is to tell the device that we are finally ready to receive events.

To go along with that, I've also added the `remove` function, which takes care of signaling to the device that we will stop receiving events (`hid_hw_close`) and of cleaning up the buffers (`hid_hw_stop`). It is passed to the driver struct so the kernel can call it when appropriate.

# Raw events

At this point, our `raw_event` function is being called hundreds of times per second and I honestly have no clue why. So I modified it to print the data, hoping that will give us an idea. Here is the updated `raw_event` function:

```c
static int raw_event(struct hid_device *hdev,
                     struct hid_report *report,
                     u8 *raw_data,
                     int size) {
  printk(KERN_INFO "winwing_ufc1_devdrv: data:");

  for (int i = 0; i < size; i++) {
    printk(KERN_CONT " %02x", raw_data[i]);
  }
  printk(KERN_CONT "\n");

  return 0;
}
```

Nothing really to explain here so let's look at the data:

```bash
$ dmesg -w
[22210.301743] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
[22210.311749] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
[22210.322045] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
[22210.331743] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
[22210.341884] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
```

We are looking at things from a different perspective than before, when we were using `evtest`. This array is essentially the state of all the buttons, given to us at once. When I press a button, a single bit changes in that array, and when I release it, the change is reverted. We don't have to care about handling sync packets or parsing the USB data ourselves. Switches and knobs are also tracked in that same array. We are looking at a higher-level representation because HID still does some of the heavy lifting for us, and I am now very thankful for it.
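To find which bit a given button flips, it's enough to diff consecutive reports. Here is a hypothetical helper in Python (an offline analysis aid, not part of the driver) that XORs two captured reports and lists the changed bit positions:

```python
def changed_bits(previous, current):
    """Return (byte_index, bit_index) pairs that differ between two reports."""
    changes = []
    for i, (a, b) in enumerate(zip(previous, current)):
        diff = a ^ b                  # a set bit marks a changed position
        for bit in range(8):
            if diff & (1 << bit):
                changes.append((i, bit))
    return changes
```

Comparing the idle report against one captured while holding a button down would reveal that button's (byte, bit) position.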

# Next steps
From here on out it's a matter of understanding the mapping between buttons and the bits that change in this array. Once that is done and properly handled, half the work for the kernel-level driver is done. The other half is going to be LEDs, which I still have little clue how to do properly...

We didn't cover a lot of ground today, as I am slowly getting back into this project, but the path now seems clearer and I am excited to keep working on this!

# Resources

- [hid-core.c](https://elixir.bootlin.com/linux/v6.12.1/source/drivers/hid/hid-core.c)
- [another driver to look at for reference](https://elixir.bootlin.com/linux/v4.4.274/source/drivers/hid/wacom_wac.c)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: Structure of a basic driver]]></title>
            <link>https://khepu.com/posts/2024-08-25</link>
            <guid>https://khepu.com/posts/2024-08-25</guid>
            <pubDate>Sun, 25 Aug 2024 08:25:00 GMT</pubDate>
            <content:encoded><![CDATA[
Last time, we left things off pondering whether a kernelspace driver is the way to go, as there already is a basic one. I no longer care what the right answer is here, only about which is the most fun way to go about this. So yes, we are building a driver. In this post we are going to look through the entire setup and basic structure of a driver, without really implementing anything useful yet. Think of it as the "Hello, world!" of drivers.

Despite this being the simplest driver we can write, there are still a lot of details to get right. First of all, the system dependencies:

```bash
sudo apt-get install gcc-12 \
                     flex \
                     bison \
                     linux-headers-$(uname -r)
```

# A basic driver

There are 2 parts to a basic driver: the driver definition, and fitting it into a Linux module.

```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/usb.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Giorgos Makris");
MODULE_DESCRIPTION("Driver for the WinWing UFC1");

#define VENDOR_ID 0x4098
#define PRODUCT_ID 0xbed0

static struct usb_device_id usb_table[] = {
  {USB_DEVICE(VENDOR_ID, PRODUCT_ID)},
  {},
};

MODULE_DEVICE_TABLE(usb, usb_table);

static int probe(struct usb_interface *intf, const struct usb_device_id *id) {
  printk("winwing_ufc1_devdrv - Probed\n");
  return 0;
}

static void disconnect(struct usb_interface *intf){
  printk("winwing_ufc1_devdrv - Disconnected\n");
}

static struct usb_driver driver = {
  .name = "winwing_ufc1_devdrv",
  .id_table = usb_table,
  .probe = probe,
  .disconnect = disconnect
};

static int __init ww_ufc_init(void) {
  int registered = usb_register(&driver);

  if (registered) {
    printk("winwing_ufc1_devdrv - Error: could not register driver!\n");
    return -registered;
  }
  printk("winwing_ufc1_devdrv - Initialized driver\n");
  return 0;
}

static void __exit ww_ufc_exit(void) {
  usb_deregister(&driver);
  printk("winwing_ufc1_devdrv - Unloaded driver\n");
}

module_init(ww_ufc_init);
module_exit(ww_ufc_exit);
```

## Licenses

I am gonna start with `MODULE_LICENSE`, because that's the one that was the weirdest for me. Originally I had it set to `MIT`. Tough luck: it turns out that if you don't have a compatible license specified, compilation breaks.

Here are the logs when you have the wrong license:

```bash
$ make

make -C /lib/modules/6.5.0-41-generic/build M=/home/gmakris/project/winwing-ufc1-driver modules
make[1]: Entering directory '/usr/src/linux-headers-6.5.0-41-generic'
  CC [M]  /home/gmakris/project/winwing-ufc1-driver/winwing_ufc1_devdrv.o
  MODPOST /home/gmakris/project/winwing-ufc1-driver/Module.symvers
ERROR: modpost: GPL-incompatible module winwing_ufc1_devdrv.ko uses GPL-only symbol 'usb_deregister'
ERROR: modpost: GPL-incompatible module winwing_ufc1_devdrv.ko uses GPL-only symbol 'usb_register_driver'
make[3]: *** [scripts/Makefile.modpost:144: /home/gmakris/project/winwing-ufc1-driver/Module.symvers] Error 1
make[2]: *** [/usr/src/linux-headers-6.5.0-41-generic/Makefile:1991: modpost] Error 2
make[1]: *** [Makefile:234: __sub-make] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-6.5.0-41-generic'
make: *** [Makefile:4: all] Error 2
```

`usb_deregister` and `usb_register_driver` are GPL-only symbols. Unlike the kernel, I couldn't care less about the license, so I just switched it...

## Driver layout

We've looked at how to find the `vendor_id` and `product_id` in a previous post. They are important here because the `usb_table` we hand them to is what the kernel uses to match a driver to a connected device. There are ways to be more generic and write a driver that the kernel can match to multiple devices, but we don't need that: we have one device, and our driver is tailored to it.

The `usb_table`, along with the `probe` and `disconnect` functions, is what we need to define a driver. We pack all that into the `driver` struct and then make use of it in the module init and exit functions (`ww_ufc_init` and `ww_ufc_exit`).

## Installing the driver

Linux actually does a great job of being modular and letting you easily plug in new modules, such as a driver. After building with `make`, it is just a matter of running `sudo insmod winwing_ufc1_devdrv.ko`. If it fails, a more descriptive error message should be in `dmesg`.

I found out the ugly way that `uname -r` can lie to you if your system has updated since it booted. In that case you will get an error like this:
```bash
$ dmesg

[  177.492936] module winwing_ufc1_devdrv: .gnu.linkonce.this_module section size must match the kernel's built struct module size at run time
```

If this happens then reboot and `apt-get install --reinstall linux-headers-$(uname -r)`.

In any case, if it all works, this is what `dmesg` should show:

```bash
$ dmesg

[  305.695775] usbcore: registered new interface driver winwing_ufc1_devdrv
[  305.695784] winwing_ufc1_devdrv - Initialized driver
```

Even though the driver is initialized at the kernel level, we do not see the probe message, which means it was not assigned to the device. There could be a couple of reasons for this, but the most probable one I found is that the device declares itself as an HID device, so the USB driver above would not even be considered.

Time to convert it to an HID driver, then, and look at the difference.

## Converting to HID

```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/hid.h>
#include <linux/hidraw.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Giorgos Makris");
MODULE_DESCRIPTION("Driver for the WinWing UFC1");

#define VENDOR_ID 0x4098
#define PRODUCT_ID 0xbed0

static struct hid_device_id device_table[] = {
  {HID_USB_DEVICE(VENDOR_ID, PRODUCT_ID)},
  {},
};

MODULE_DEVICE_TABLE(hid, device_table);

static int probe(struct hid_device *hdev, const struct hid_device_id *id) {
  printk("winwing_ufc1_devdrv: probed\n");
  return 0;
}

static int input_configured(struct hid_device *hdev, struct hid_input *hidinput) {
  return 0;
}

static int raw_event(struct hid_device *hdev, struct hid_report *report, u8 *raw_data, int size) {
  return 0;
}

static struct hid_driver ww_ufc1_driver = {
  .name = "winwing_ufc1_devdrv",
  .id_table = device_table,
  .probe = probe,
  .input_configured = input_configured,
  .raw_event = raw_event
};

module_hid_driver(ww_ufc1_driver);
```

Not many changes, mostly removing stuff and changing names. HID drivers are more standardized than generic USB drivers so we no longer need to have the explicit module declaration and pass it the `__init` and `__exit` functions. No more registering and deregistering, the HID module will handle that for us.

Overall, this seems like what I should have used in the first place; just some things you have to figure out as you go. We will take a better look at what the newly added functions are for when they become necessary.

Replugging the device now shows this:
```bash
$ dmesg

[ 3451.393737] winwing_ufc1_devdrv: module verification failed: signature and/or required key missing - tainting kernel
[ 3451.445871] winwing_ufc1_devdrv - probed
```

I am not even going to care about `module verification failed` until it becomes a problem. Just glad that it was picked up.

# Thoughts so far

It's not that I've done anything special so far; I wanted to explain the basic structure so any work that follows is easier to comprehend, both for me and for others. I learned a bit about drivers and how the kernel goes about selecting the right one, and it has been fun!

## Resources
Here is what has helped me go through this so far:

- [Johannes 4GNU_Linux](https://www.youtube.com/@johannes4gnu_linux96) has some awesome material, definitely worth watching
- [Linux Device Drivers](https://github.com/lancetw/ebook-1/blob/master/03_operating_system/Linux%20Device%20Drivers.3rd.Edition.pdf), I have not read through the entire thing but what I have read has been great to get my mind thinking the right way about this
- [Another Winwing driver](https://github.com/igorinov/linux-winwing/tree/main) has also been interesting to look at when doing the HID conversion

_code is available [here](https://github.com/Khepu/winwing-ufc1-driver)._
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: Unpacked]]></title>
            <link>https://khepu.com/posts/2024-07-21</link>
            <guid>https://khepu.com/posts/2024-07-21</guid>
            <pubDate>Sun, 21 Jul 2024 11:45:00 GMT</pubDate>
            <content:encoded><![CDATA[
Looking at raw bytes is no way to live, not when you have other choices. So this is about exploring what tools are out there for those that have gone down the same path. But before that, I've made some progress in understanding what we are dealing with.

# Tying loose ends

```bash
$ lsusb -t

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/12p, 480M
    |__ Port 1: Dev 5, If 0, Class=Human Interface Device, Driver=usbhid, 12M
    |__ Port 6: Dev 2, If 0, Class=Video, Driver=uvcvideo, 480M
    |__ Port 6: Dev 2, If 1, Class=Video, Driver=uvcvideo, 480M
    |__ Port 8: Dev 3, If 1, Class=Wireless, Driver=btusb, 12M
    |__ Port 8: Dev 3, If 0, Class=Wireless, Driver=btusb, 12M
```

Linux does recognize this as an input device and it has assigned a generic driver to it. It's a "Human Interface Device" and the driver assigned is `usbhid`. I'll let you guess what `hid` stands for. That is why we are even able to peek into the handler and read the input. That's great but I am still unsure if I should be writing my own to replace this or build on top of it...

You might also remember the handlers from the previous post of this series: while we were using `event14` to listen to all the input, there is a second one attached, `js0`, which turns out to be short for "joystick". Looking back, things are now painfully obvious:

```bash
$ dmesg | grep usb

...
[   93.725773] input: Winwing WINWING UFC1 as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:4098:BED0.0002/input/input16
[   93.726406] hid-generic 0003:4098:BED0.0002: input,hidraw1: USB HID v1.11 Joystick [Winwing WINWING UFC1] on usb-0000:00:14.0-1/input0
[   93.726602] usbcore: registered new interface driver usbhid
[   93.726612] usbhid: USB HID core driver
```

It's clearly shown that the `usbhid` driver is selected and `js0` lines up with the fact that this is marked as a Joystick.

# Untangling the mess

## evtest

Given that there is a driver, we now have a way of decoding the packets with `evtest`. `evtest` is described as an "Input device event monitor and query tool" and it lives up to its name:

```bash
$ sudo evtest /dev/input/event14

Input driver version is 1.0.1
Input device ID: bus 0x3 vendor 0x4098 product 0xbed0 version 0x111
Input device name: "Winwing WINWING UFC1"
Supported events:
  Event type 0 (EV_SYN)
  Event type 1 (EV_KEY)
    Event code 288 (BTN_TRIGGER)
    Event code 289 (BTN_THUMB)
    Event code 290 (BTN_THUMB2)
    Event code 291 (BTN_TOP)
    Event code 292 (BTN_TOP2)
    Event code 293 (BTN_PINKIE)
    Event code 294 (BTN_BASE)
    Event code 295 (BTN_BASE2)
    Event code 296 (BTN_BASE3)
    Event code 297 (BTN_BASE4)
    Event code 298 (BTN_BASE5)
    Event code 299 (BTN_BASE6)
    Event code 300 (?)
    Event code 301 (?)
    Event code 302 (?)
    Event code 303 (BTN_DEAD)
    Event code 704 (BTN_TRIGGER_HAPPY1)
    Event code 705 (BTN_TRIGGER_HAPPY2)
    Event code 706 (BTN_TRIGGER_HAPPY3)
    Event code 707 (BTN_TRIGGER_HAPPY4)
    Event code 708 (BTN_TRIGGER_HAPPY5)
    Event code 709 (BTN_TRIGGER_HAPPY6)
    Event code 710 (BTN_TRIGGER_HAPPY7)
    Event code 711 (BTN_TRIGGER_HAPPY8)
    Event code 712 (BTN_TRIGGER_HAPPY9)
    Event code 713 (BTN_TRIGGER_HAPPY10)
    Event code 714 (BTN_TRIGGER_HAPPY11)
    Event code 715 (BTN_TRIGGER_HAPPY12)
    Event code 716 (BTN_TRIGGER_HAPPY13)
    Event code 717 (BTN_TRIGGER_HAPPY14)
    Event code 718 (BTN_TRIGGER_HAPPY15)
    Event code 719 (BTN_TRIGGER_HAPPY16)
    Event code 720 (BTN_TRIGGER_HAPPY17)
    Event code 721 (BTN_TRIGGER_HAPPY18)
    Event code 722 (BTN_TRIGGER_HAPPY19)
    Event code 723 (BTN_TRIGGER_HAPPY20)
    Event code 724 (BTN_TRIGGER_HAPPY21)
    Event code 725 (BTN_TRIGGER_HAPPY22)
    Event code 726 (BTN_TRIGGER_HAPPY23)
    Event code 727 (BTN_TRIGGER_HAPPY24)
    Event code 728 (BTN_TRIGGER_HAPPY25)
  Event type 3 (EV_ABS)
    Event code 3 (ABS_RX)
      Value      0
      Min        0
      Max     4095
      Fuzz      15
      Flat     255
    Event code 4 (ABS_RY)
      Value   1457
      Min        0
      Max     4095
      Fuzz      15
      Flat     255
    Event code 5 (ABS_RZ)
      Value    822
      Min        0
      Max     4095
      Fuzz      15
      Flat     255
    Event code 6 (ABS_THROTTLE)
      Value      0
      Min        0
      Max    65535
      Fuzz     255
      Flat    4095
    Event code 7 (ABS_RUDDER)
      Value      0
      Min        0
      Max    65535
      Fuzz     255
      Flat    4095
  Event type 4 (EV_MSC)
    Event code 4 (MSC_SCAN)
Key repeat handling:
  Repeat type 20 (EV_REP)
    Repeat code 0 (REP_DELAY)
      Value    250
    Repeat code 1 (REP_PERIOD)
      Value     33
Properties:
Testing ... (interrupt to exit)
Event: time 1721556084.837383, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721556084.837383, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 1
Event: time 1721556084.837383, -------------- SYN_REPORT ------------
Event: time 1721556084.967497, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721556084.967497, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 0
Event: time 1721556084.967497, -------------- SYN_REPORT ------------
Event: time 1721556086.557322, type 3 (EV_ABS), code 4 (ABS_RY), value 1459
Event: time 1721556086.557322, -------------- SYN_REPORT ------------
Event: time 1721556086.567535, type 3 (EV_ABS), code 4 (ABS_RY), value 1461
Event: time 1721556086.567535, -------------- SYN_REPORT ------------
Event: time 1721556086.577537, type 3 (EV_ABS), code 4 (ABS_RY), value 1463
Event: time 1721556086.577537, -------------- SYN_REPORT ------------
```

A whole lot of information, some of which raises more questions than it answers. We get a list of supported events, along with their codes, which seem to be mapped to specific inputs. The buttons sort of make sense, except that there are more buttons listed than I can count on the device, and then there are these:

```bash
    Event code 300 (?)
    Event code 301 (?)
    Event code 302 (?)
```

We can also see joystick-related event codes, with `ABS_RY` being triggered below. Turns out those are mapped to the knobs; yes, I have 3 knobs but 5 event codes. I can only assume that the company might have started out making joysticks and, when they expanded their range, wanted to keep a uniform interface to make things easier for them to deal with.

This is a regular button press:
```bash
Event: time 1721556084.967497, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721556084.967497, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 0
Event: time 1721556084.967497, -------------- SYN_REPORT ------------
```

Still unsure what the `MSC_SCAN` part means, but I know that this event is not emitted for switches (which are also buttons). You can also see that a pressed button has a value of `1`, which turns to `0` when it gets released. What I had originally thought to be 3 events are actually 2; I was tricked by the `MSC_SCAN` packets being sent.

## Shifting perspective

Given all that, I can correlate what we see here with the raw bytes we were looking at, just to fully bridge the gap. What follows is going to be a bizarre dance of hex values, so let me explain some `xxd` flags first. `-c` sets the number of bytes to be printed per row and `-g` defines how many bytes should be in the same group (the default is 2, or 4 in little-endian mode).

```bash
$ cat /dev/input/event14 | xxd -c 24

00000000: f7fc 9c66 0000 0000 877a 0a00 0000 0000 0400 0400 1a00 0900  ...f.....z..............
00000018: f7fc 9c66 0000 0000 877a 0a00 0000 0000 0100 c902 0100 0000  ...f.....z..............
00000030: f7fc 9c66 0000 0000 877a 0a00 0000 0000 0000 0000 0000 0000  ...f.....z..............
00000048: f7fc 9c66 0000 0000 ee64 0b00 0000 0000 0400 0400 1a00 0900  ...f.....d..............
00000060: f7fc 9c66 0000 0000 ee64 0b00 0000 0000 0100 c902 0000 0000  ...f.....d..............
00000078: f7fc 9c66 0000 0000 ee64 0b00 0000 0000 0000 0000 0000 0000  ...f.....d..............
```

which corresponds to:
```bash
Event: time 1721564407.686727, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721564407.686727, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 1
Event: time 1721564407.686727, -------------- SYN_REPORT ------------
Event: time 1721564407.746734, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721564407.746734, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 0
Event: time 1721564407.746734, -------------- SYN_REPORT ------------
```

and all of a sudden things are a bit clearer. The first 2 bytes act as a clock: from my testing, the first byte looks like it corresponds to seconds and the second to minutes. Not in terms of their values, but in terms of how they change between button presses.

```bash
byte    : 0 1  2 3  4 5  6 7  8 9  1011 1213 1415 1617 1819 2021 2223  manually added
00001d28: fefd 9c66 0000 0000 d36e 0000 0000 0000 0000 0000 0000 0000  ...f.....n..............
00001d40: fffd 9c66 0000 0000 7c43 0200 0000 0000 0400 0400 1a00 0900  ...f....|C..............
00001d58: fffd 9c66 0000 0000 7c43 0200 0000 0000 0100 c902 0100 0000  ...f....|C..............
00001d70: fffd 9c66 0000 0000 7c43 0200 0000 0000 0000 0000 0000 0000  ...f....|C..............
00001d88: fffd 9c66 0000 0000 6c16 0400 0000 0000 0400 0400 1a00 0900  ...f....l...............
00001da0: fffd 9c66 0000 0000 6c16 0400 0000 0000 0100 c902 0000 0000  ...f....l...............
00001db8: fffd 9c66 0000 0000 6c16 0400 0000 0000 0000 0000 0000 0000  ...f....l...............
00001dd0: 00fe 9c66 0000 0000 af77 0500 0000 0000 0400 0400 1a00 0900  ...f.....w..............
00001de8: 00fe 9c66 0000 0000 af77 0500 0000 0000 0100 c902 0100 0000  ...f.....w..............
00001e00: 00fe 9c66 0000 0000 af77 0500 0000 0000 0000 0000 0000 0000  ...f.....w..............
00001e18: 00fe 9c66 0000 0000 9073 0700 0000 0000 0400 0400 1a00 0900  ...f.....s..............
00001e30: 00fe 9c66 0000 0000 9073 0700 0000 0000 0100 c902 0000 0000  ...f.....s..............
00001e48: 00fe 9c66 0000 0000 9073 0700 0000 0000 0000 0000 0000 0000  ...f.....s..............
```

The third and fourth bytes so far seem to always be the same: `0x9C66`. They are always followed by a run of zero-bytes. I can only speculate as to why, but for now let's just make a note of it.

Skipping right past the second lump of zero-bytes, we get to what `evtest` shows. Bytes 16-17 make up the event type and bytes 18-19 the event code. Well, how is `0xC902` equal to `713`? Do you see the pattern yet? No? What if I told you that `0x02C9` is `713`? So far we have been assuming a big-endian representation, but the event code has been a dead giveaway that this should be treated as little-endian. Let's rewrite our `xxd` command:

```bash
$ cat /dev/input/event14 | xxd -c 24 -e -g 4

00000000: 669d0446 00000000 00092d91 00000000 00040004 0009001a  F..f.....-..............
00000018: 669d0446 00000000 00092d91 00000000 02c90001 00000001  F..f.....-..............
00000030: 669d0446 00000000 00092d91 00000000 00000000 00000000  F..f.....-..............
00000048: 669d0446 00000000 000a667a 00000000 00040004 0009001a  F..f....zf..............
00000060: 669d0446 00000000 000a667a 00000000 02c90001 00000000  F..f....zf..............
00000078: 669d0446 00000000 000a667a 00000000 00000000 00000000  F..f....zf..............
```

Now you can take the first _integer_, convert it to decimal, and you have a unix timestamp, while the third integer counts the microseconds within that second. Perspective is indeed worth 80 IQ points... Now, I am willing to bet that those are not integers but longs, so the zero-bytes go to the front and not the back, like so:

```bash
$ cat /dev/input/event14 | xxd -c 24 -e -g 8

00000000: 00000000669d0583 00000000000ad3d9 0009001a00040004  ...f....................
00000018: 00000000669d0583 00000000000ad3d9 0000000102c90001  ...f....................
00000030: 00000000669d0583 00000000000ad3d9 0000000000000000  ...f....................
00000048: 00000000669d0583 00000000000c7ffc 0009001a00040004  ...f....................
00000060: 00000000669d0583 00000000000c7ffc 0000000002c90001  ...f....................
00000078: 00000000669d0583 00000000000c7ffc 0000000000000000  ...f....................
```

The first 2 _longs_ make sense, but the rest is made up of smaller types; let me add some spacing:

```bash
00000000: 00000000669d0583 00000000000ad3d9 0009001a 0004 0004  ...f....................
00000018: 00000000669d0583 00000000000ad3d9 00000001 02c9 0001  ...f....................
00000030: 00000000669d0583 00000000000ad3d9 00000000 0000 0000  ...f....................
00000048: 00000000669d0583 00000000000c7ffc 0009001a 0004 0004  ...f....................
00000060: 00000000669d0583 00000000000c7ffc 00000000 02c9 0001  ...f....................
00000078: 00000000669d0583 00000000000c7ffc 00000000 0000 0000  ...f....................
```

I can make out an integer, which is the `value` field shown by `evtest`, and 2 half-integer/short values, which are the event code and the event type. And with that, we have fully understood what's on the wire.

<div align="center">
    <img src="/images/cockpit/packet.png" width="100%"/>
</div>

# What's next?

We have made a huge leap in understanding how the device communicates with the system and what gets sent back and forth. While this marks the end of the investigation for now, we will be coming back to figure out the LEDs and how to actually toggle them.

The next step is to finally start writing code for this. I have yet to understand if we do need an actual driver or if we can build on top of `usbhid` so that will also be explained in the next episode.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: Assessing the situation]]></title>
            <link>https://khepu.com/posts/2024-07-13</link>
            <guid>https://khepu.com/posts/2024-07-13</guid>
            <pubDate>Sat, 13 Jul 2024 19:39:00 GMT</pubDate>
            <content:encoded><![CDATA[
Right, time to get this started, I am going to document my steps and thoughts so I can call myself out once I know more. This is going to be about getting to know the device and its behavior. I expect to feel like I know less at the end than when I started.

# Diagnostics

Baby's first steps: I can see the lights turn on when I plug it in, but I have no idea what the device looks like from the OS perspective. Let's look at what the kernel sees.

```bash
$ dmesg | grep usb
...
[   93.534289] usb 1-1: new full-speed USB device number 4 using xhci_hcd
[   93.688292] usb 1-1: New USB device found, idVendor=4098, idProduct=bed0, bcdDevice= 1.05
[   93.688309] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   93.688316] usb 1-1: Product: WINWING UFC1
[   93.688321] usb 1-1: Manufacturer: Winwing
[   93.688326] usb 1-1: SerialNumber: 412340B2A416D321A3160002
[   93.725773] input: Winwing WINWING UFC1 as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:4098:BED0.0002/input/input16
[   93.726406] hid-generic 0003:4098:BED0.0002: input,hidraw1: USB HID v1.11 Joystick [Winwing WINWING UFC1] on usb-0000:00:14.0-1/input0
[   93.726602] usbcore: registered new interface driver usbhid
[   93.726612] usbhid: USB HID core driver
```

It's great that the manufacturer has properly named the device; I am not sure why they call it a Joystick, but hey, that's a detail. What you need to take note of here is this part: `idVendor=4098, idProduct=bed0`. This will let us look for the device under the input devices and find its handlers.


```bash
$ cat /proc/bus/input/devices

I: Bus=0003 Vendor=4098 Product=bed0 Version=0111
N: Name="Winwing WINWING UFC1"
P: Phys=usb-0000:00:14.0-1/input0
S: Sysfs=/devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:4098:BED0.0002/input/input16
U: Uniq=412340B2A416D321A3160002
H: Handlers=event14 js0
B: PROP=0
B: EV=10001b
B: KEY=1ffffff 0 0 0 0 0 0 ffff00000000 0 0 0 0
B: ABS=f8
B: MSC=10
```

The handler where device events are sent is `event14`, and we can take a peek inside it with: `cat /dev/input/event14 | xxd`. I am still not sure what the other handler is, but I am sure we are going to find out in the future. I am piping everything through `xxd` because the output is binary data.

# Mindless button-pressing

## Standard button
So this is a single button press:

```
00000000: 35be 9266 0000 0000 b34d 0000 0000 0000  5..f.....M......
00000010: 0400 0400 1a00 0900 35be 9266 0000 0000  ........5..f....
00000020: b34d 0000 0000 0000 0100 c902 0100 0000  .M..............
00000030: 35be 9266 0000 0000 b34d 0000 0000 0000  5..f.....M......
00000040: 0000 0000 0000 0000 35be 9266 0000 0000  ........5..f....
00000050: 2dad 0100 0000 0000 0400 0400 1a00 0900  -...............
00000060: 35be 9266 0000 0000 2dad 0100 0000 0000  5..f....-.......
00000070: 0100 c902 0000 0000 35be 9266 0000 0000  ........5..f....
00000080: 2dad 0100 0000 0000 0000 0000 0000 0000  -...............
```

It's hard to show in an article, but those lines arrived at different times, 3 lines at a time. My suspicion is that they correspond to the events "key down", "key pressed", and "key up", in that order. You can also see that these packets do seem to have a standard format: there are just 2 words changing each time to signal the different events, and most of them end up being 0. That's cool, let's press it again; it's obviously going to have some very deterministic behavior and the exact same packets are gonna show up, this is easy. Here are 2 presses of the same button, because I forgot which one was first:

```
00000000: a5c0 9266 0000 0000 cce8 0200 0000 0000  ...f............
00000010: 0400 0400 1a00 0900 a5c0 9266 0000 0000  ...........f....
00000020: cce8 0200 0000 0000 0100 c902 0100 0000  ................

00000030: a5c0 9266 0000 0000 cce8 0200 0000 0000  ...f............
00000040: 0000 0000 0000 0000 a5c0 9266 0000 0000  ...........f....
00000050: a50b 0500 0000 0000 0400 0400 1a00 0900  ................

00000060: a5c0 9266 0000 0000 a50b 0500 0000 0000  ...f............
00000070: 0100 c902 0000 0000 a5c0 9266 0000 0000  ...........f....
00000080: a50b 0500 0000 0000 0000 0000 0000 0000  ................

00000090: a6c0 9266 0000 0000 776f 0d00 0000 0000  ...f....wo......
000000a0: 0400 0400 1a00 0900 a6c0 9266 0000 0000  ...........f....
000000b0: 776f 0d00 0000 0000 0100 c902 0100 0000  wo..............

000000c0: a6c0 9266 0000 0000 776f 0d00 0000 0000  ...f....wo......
000000d0: 0000 0000 0000 0000 a7c0 9266 0000 0000  ...........f....
000000e0: ab77 0000 0000 0000 0400 0400 1a00 0900  .w..............

000000f0: a7c0 9266 0000 0000 ab77 0000 0000 0000  ...f.....w......
00000100: 0100 c902 0000 0000 a7c0 9266 0000 0000  ...........f....
00000110: ab77 0000 0000 0000 0000 0000 0000 0000  .w..............
```

Uh... those are not the same at all. Let's do 2 more of the same:

```
00000000: f9c0 9266 0000 0000 0120 0600 0000 0000  ...f..... ......
00000010: 0400 0400 1a00 0900 f9c0 9266 0000 0000  ...........f....
00000020: 0120 0600 0000 0000 0100 c902 0100 0000  . ..............

00000030: f9c0 9266 0000 0000 0120 0600 0000 0000  ...f..... ......
00000040: 0000 0000 0000 0000 f9c0 9266 0000 0000  ...........f....
00000050: df42 0800 0000 0000 0400 0400 1a00 0900  .B..............

00000060: f9c0 9266 0000 0000 df42 0800 0000 0000  ...f.....B......
00000070: 0100 c902 0000 0000 f9c0 9266 0000 0000  ...........f....
00000080: df42 0800 0000 0000 0000 0000 0000 0000  .B..............

00000090: fac0 9266 0000 0000 6b58 0700 0000 0000  ...f....kX......
000000a0: 0400 0400 1a00 0900 fac0 9266 0000 0000  ...........f....
000000b0: 6b58 0700 0000 0000 0100 c902 0100 0000  kX..............

000000c0: fac0 9266 0000 0000 6b58 0700 0000 0000  ...f....kX......
000000d0: 0000 0000 0000 0000 fac0 9266 0000 0000  ...........f....
000000e0: 527b 0900 0000 0000 0400 0400 1a00 0900  R{..............

000000f0: fac0 9266 0000 0000 527b 0900 0000 0000  ...f....R{......
00000100: 0100 c902 0000 0000 fac0 9266 0000 0000  ...........f....
00000110: 527b 0900 0000 0000 0000 0000 0000 0000  R{..............
```

Crap, every time it's different. There must be some state affecting it... Right, well that's a puzzle for later, certainly an interesting one! Let's do some more experimenting, time to try one of the switches and one of the knobs.

## Knobs

The knobs are actually pretty sensible: at specific points in their rotation they will emit a single packet. Those packets also seem to carry some state and differ at each step, but that's sort of expected. Here are 2 packets:

```
00000000: fdc1 9266 0000 0000 c2a7 0a00 0000 0000  ...f............
00000010: 0300 0300 1e00 0000 fdc1 9266 0000 0000  ...........f....
00000020: c2a7 0a00 0000 0000 0000 0000 0000 0000  ................

00000030: fdc1 9266 0000 0000 7c55 0c00 0000 0000  ...f....|U......
00000040: 0300 0300 1f00 0000 fdc1 9266 0000 0000  ...........f....
00000050: 7c55 0c00 0000 0000 0000 0000 0000 0000  |U..............
```

Again, very specific words are changing; what's odd is that the same ones are changing in the button packets... Another puzzle.

## Switches

Switches are going to be like buttons though, I mean, how much can they differ? Apparently quite a bit... When a switch is in the neutral position at plug-in, it emits no signal; so far so good. But once you flick it either way, it keeps emitting packets, and yes, every time they are also slightly different. When you flick it back to neutral it still emits packets at a rapid rate until you press another button... So in any position a switch will keep emitting packets until another button is pressed, and if a knob is turned in the meantime the packets get multiplexed and you get a stream of both.

## LEDs

I will be ignoring the fact that there are LEDs I can control until I have input sorted out. There is already a lot of stuff going on so we will be coming back to this later!

# What's next?

In theory we just pressed some buttons but this is actually a lot of information and quite a few short-term goals are becoming clearer. To recap, we learned that:

- there are structured packets, each with a length of 48 bytes
- identical actions do not produce identical data, there must be some state mixed in there
- buttons do seem to follow the web-standard for input events
- each type of button has its own behavior, sometimes affected by other buttons

The next thing I want to tackle is being able to identify which button is being pressed. If I can figure out why the packets change, that's even better, but I would be willing to settle for just identifying the button and ignoring the rest for now. There are 2 approaches in my mind: the first is to play a bit more with the buttons while looking at the packets, maybe at their binary representation, until I can identify some pattern. If that doesn't pan out, I have to assume that the developers followed some sort of good USB-related practice that could explain this. I have no clue if such a thing even exists, but it might be worth having a look just in case this is in fact very standard behavior that I don't understand because I lack the background.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: The plan]]></title>
            <link>https://khepu.com/posts/2024-07-09</link>
            <guid>https://khepu.com/posts/2024-07-09</guid>
            <pubDate>Tue, 09 Jul 2024 23:02:00 GMT</pubDate>
            <content:encoded><![CDATA[
I grew up around military aircraft, if I close my eyes I can hear them flying by and see their imposing figure right in front of me. They always take me back to my childhood, so I've decided to bring a part of them to my job.

There is an absolute ton of stuff I need to have on hand to be even remotely good at what I do, and I kept thinking about getting a macropad or a streamdeck, hooking up some bash scripts, and leaving it at that. But then I wouldn't have a story to tell, nor would I have had any fun...

This was going nowhere, up until I stumbled on this:

<div align="center">
    <img src="/images/cockpit/base.jpeg" width="100%"/>
</div>

It's a replica straight from an F-18 cockpit. Not only does it look cool but it looks far more versatile than any macropad I've seen. But there is an issue...

There is exactly 0 linux support for it. That's not a dealbreaker but it does mean that I will have to write the driver myself. No problem, you just read through the technical specification about how it transmits/receives data and implement it. Yeah, right... The manufacturer provides no such thing.

Not all hope is lost but things are getting increasingly difficult (or interesting). The first option is to decompile the windows driver they provide and try to learn something from there. Option two is to reverse engineer this by plugging in the device, pressing a bunch of buttons and looking at what comes out the other side of the cable.

Funnily enough, even though option 2 sounds more like a shot in the dark, I think it's going to get me some results faster. The only thing that might not come easy is switching the LEDs on the panels on/off. After the driver is working, there is still the problem of managing the functionality. Knobs, switches that likely affect everything around them, this thing has state and it's gonna get messy.

Anyway, this is what's to come, a deep dive into driver development in linux. I haven't touched C since I left university and it's time I brush up on it... Don't worry though, I'll find a place to fit some lisp in there!
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crafting a JSON parser with SIMD: Hidden costs]]></title>
            <link>https://khepu.com/posts/2024-05-14</link>
            <guid>https://khepu.com/posts/2024-05-14</guid>
            <pubDate>Tue, 14 May 2024 20:55:00 GMT</pubDate>
            <content:encoded><![CDATA[

When I originally designed this parser, I decided that all JSON objects would be parsed into hash-maps. They seem like the right tool for the job and, in some sense, that's exactly what JSON objects are, mappings of keys to values.

# Time complexity

Hash-maps lure you in with their appealing time complexity for the average case, let's have a look at what they promise:

|Operation|Average Case|Worst Case|
|--|--|--|
|Insertion|O(1)|O(n)|
|Lookup|O(1)|O(n)|
|Deletion|O(1)|O(n)|

This table may differ slightly depending on the actual implementation, but it will have to do. Hashmaps are sick, right? Most of the time you are dealing with constant time complexity. It's not a lie, but it does distract from what is actually going on.

First, let's compare the hashmap to a humble list:

|Operation| Average/Worst Case|
|--|--|
|Insertion (front)|O(1)|
|Insertion (back)|O(n)|
|Lookup|O(n)|
|Deletion|O(n)|

Looks horrible, and yet, parsing objects into associative lists (alists) instead of hashmaps has sped up my parser by a significant amount...

## Bringing lists up to speed

Here is the object parsing code as was previously:

```lisp
(defun %parse-object (string index)
  (declare (type simple-string string)
           (type fixnum index))
  (let ((current-index (skip-to-next-character string (1+ index)))
        (parsed-object (make-hash-table :test 'equal
                                        :size 10)))
    (declare (type fixnum current-index))
    ;; empty object check
    (when (char= (char string current-index) #\})
      (return-from %parse-object (values parsed-object current-index)))
    (loop with raw-string-length = (length string)
          while (< current-index raw-string-length)
          when (char/= +double-quote+ (char string current-index))
            do (error "Key not of type string at  position ~a!" current-index)
          do (multiple-value-bind (parsed-key new-index)
                 (%parse-string string current-index)
               (setf current-index (skip-to-next-character string new-index))
               (if (char= #\: (char string current-index))
                   (incf current-index)
                   (error "Missing ':' after key '~a' at position ~a!" parsed-key current-index))
               (multiple-value-bind (parsed-value new-index)
                   (parse string current-index)
                 (setf current-index (skip-to-next-character string new-index)
                       (gethash parsed-key parsed-object) parsed-value)
                 (let ((character (char string current-index)))
                   (cond
                     ((char= #\} character)
                      (loop-finish))
                     ((char/= #\, character)
                      (error "Expected ',' after object value. Instead found ~a at position ~a!"
                             character current-index))
                     (t
                      (setf current-index (skip-to-next-character string (incf current-index)))))))))
    (values parsed-object (incf current-index))))
```

and here is the sped up version:

```lisp
(defun %parse-object (string index)
  (declare (type simple-string string)
           (type fixnum index))
  (let ((current-index (skip-to-next-character string (1+ index))))
    (declare (type fixnum current-index))
    ;; empty object check
    (when (char= (char string current-index) #\})
      (return-from %parse-object (values nil current-index)))
    (values
     (loop with raw-string-length = (length string)
           while (< current-index raw-string-length)
           when (char/= +double-quote+ (char string current-index))
             do (error "Key not of type string at position ~a!" current-index)
           collect (multiple-value-bind (parsed-key new-index)
                       (%parse-string string current-index)
                     (setf current-index (skip-to-next-character string new-index))
                     (if (char= #\: (char string current-index))
                         (incf current-index)
                         (error "Missing ':' after key '~a' at position ~a!" parsed-key current-index))
                     (multiple-value-bind (parsed-value new-index)
                         (parse string current-index)
                       (setf current-index (skip-to-next-character string new-index))
                       (cons parsed-key parsed-value)))
           do (let ((character (char string current-index)))
                (cond
                  ((char= #\} character)
                   (loop-finish))
                  ((char/= #\, character)
                   (error "Expected ',' after object value. Instead found ~a at position ~a!"
                          character current-index))
                  (t
                   (setf current-index (skip-to-next-character string (incf current-index)))))))
     (incf current-index))))
```

So we went from your standard hashmap to creating alists:

```lisp
(("key1" . "value1")
 ("key2" . (("key 3" . "value3"))))
```

but why is it faster?

### Cheap appends

Given a random list, if you were asked to append an element to it then yes, it would cost `O(n)`. When you are actually creating it, though, it's a different story. `loop` hides all the magic, so let's expand it:

```lisp
(BLOCK NIL
  (LET ((RAW-STRING-LENGTH (LENGTH STRING)))
    (DECLARE (IGNORABLE RAW-STRING-LENGTH))
    (LET* ((#:LOOP-LIST-HEAD-286 (LIST NIL))
           (#:LOOP-LIST-TAIL-287 #:LOOP-LIST-HEAD-286))
      (DECLARE (DYNAMIC-EXTENT #:LOOP-LIST-HEAD-286))
      (TAGBODY
       SB-LOOP::NEXT-LOOP
        (IF (< CURRENT-INDEX RAW-STRING-LENGTH)
            NIL
            (GO SB-LOOP::END-LOOP))
        (IF (CHAR/= +DOUBLE-QUOTE+ (CHAR STRING CURRENT-INDEX))
            (ERROR "Key not of type string at position ~a!" CURRENT-INDEX))
        (RPLACD #:LOOP-LIST-TAIL-287
                (SETQ #:LOOP-LIST-TAIL-287
                        (LIST
                         (MULTIPLE-VALUE-CALL
                             #'(LAMBDA
                                   (
                                    &OPTIONAL PARSED-KEY NEW-INDEX
                                    &REST #:IGNORE)
                                 (DECLARE (IGNORE #:IGNORE))
                                 (SETQ CURRENT-INDEX
                                         (SKIP-TO-NEXT-CHARACTER STRING
                                          NEW-INDEX))
                                 (IF (CHAR= #\: (CHAR STRING CURRENT-INDEX))
                                     (SETQ CURRENT-INDEX (+ 1 CURRENT-INDEX))
                                     (ERROR
                                      "Missing ':' after key '~a' at position ~a!"
                                      PARSED-KEY CURRENT-INDEX))
                                 (MULTIPLE-VALUE-CALL
                                     #'(LAMBDA
                                           (
                                            &OPTIONAL PARSED-VALUE NEW-INDEX
                                            &REST #:IGNORE)
                                         (DECLARE (IGNORE #:IGNORE))
                                         (SETQ CURRENT-INDEX
                                                 (SKIP-TO-NEXT-CHARACTER STRING
                                                  NEW-INDEX))
                                         (CONS PARSED-KEY PARSED-VALUE))
                                   (PARSE STRING CURRENT-INDEX)))
                           (%PARSE-STRING STRING CURRENT-INDEX)))))
        (LET ((CHARACTER (CHAR STRING CURRENT-INDEX)))
          (DECLARE (TYPE CHARACTER CHARACTER))
          (IF (CHAR= #\} CHARACTER)
              (GO SB-LOOP::END-LOOP)
              (IF (CHAR/= #\, CHARACTER)
                  (ERROR
                   "Expected ',' after object value. Instead found ~a at position ~a!"
                   CHARACTER CURRENT-INDEX)
                  (THE T
                       (SETQ CURRENT-INDEX
                               (SKIP-TO-NEXT-CHARACTER STRING
                                (SETQ CURRENT-INDEX (+ 1 CURRENT-INDEX))))))))
        (GO SB-LOOP::NEXT-LOOP)
       SB-LOOP::END-LOOP
        (RETURN-FROM NIL (TRULY-THE LIST (CDR #:LOOP-LIST-HEAD-286)))))))
```

It does exactly what you would expect. It keeps a reference to the last cons cell of the list, `#:LOOP-LIST-TAIL-287`, and uses that to append in `O(1)` time. But didn't we already have `O(1)` insertion with the hashmap?
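
The same tail-pointer trick can be sketched without the gensym noise; this is my own minimal version, not code from the parser:

```lisp
;; Keep a dummy head cell and a pointer to the last cell; every new element
;; is attached with RPLACD, so each "append" is O(1) and the result is
;; (cdr head), exactly like the LOOP expansion above.
(defun collect-squares (n)
  (let* ((head (list nil)) ; dummy cons, like #:LOOP-LIST-HEAD-286
         (tail head))      ; like #:LOOP-LIST-TAIL-287
    (dotimes (i n (cdr head))
      (rplacd tail (setf tail (list (* i i)))))))

(collect-squares 5) ;; => (0 1 4 9 16)
```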

### O(1) is relative

Just because 2 data structures define one operation as `O(1)` doesn't mean they cost the same; all it means is that each one's cost stays constant as the number of elements in the data structure increases.

Insertion into a hashmap is much more complicated than into a list. First of all, each insertion also bears the cost of the hashing algorithm; sure, we are talking about nanoseconds, but it all adds up. Notice the size set on the hashmap initialization, `:size 10`. When you exceed that size you have to pay the hashing tax again in order to rehash all the keys so they can be moved into a larger structure, while setting the size too high to begin with just wastes memory. It's hard to strike a balance when you don't really know much about the data that will end up in it.
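
To make the rehash point concrete, here is a throwaway sketch (my own, not the parser's code): a table sized for 10 keys that receives 100 entries still works, but only because the implementation grows and rehashes it behind the scenes.

```lisp
;; A table pre-sized for 10 keys; pushing 100 through it forces the
;; implementation to grow the table and rehash every key along the way.
(let ((table (make-hash-table :test #'equal :size 10)))
  (dotimes (i 100)
    (setf (gethash (format nil "key~a" i) table) i))
  (values (hash-table-count table) (gethash "key42" table)))
;; => 100, 42
```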

Then there is the test function: the hashmap actually cares whether 2 keys are identical, which I initially thought was a benefit. But I am just parsing the JSON; if 2 keys are the same then `assoc` will consistently retrieve the first one. Sure, the second one will still be in the structure, but at that point we have been given an "invalid" JSON and all bets are off.
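
As a quick illustration of that `assoc` behavior (hypothetical keys, not from my test data):

```lisp
;; assoc walks the alist front to back and stops at the first match, so a
;; duplicated key shadows the later entry; :test #'equal is needed because
;; the keys are strings.
(let ((parsed '(("id" . 1) ("name" . "Zocca") ("id" . 2))))
  (assoc "id" parsed :test #'equal)) ;; => ("id" . 1)
```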

So in general, creating the alist is going to be cheaper because we save on overhead and have ways to bridge the gap in time complexity.

Lots of little hidden costs to a hashmap. Wouldn't lookups be faster, though? As always, it depends: past a certain number of keys the hashmap will be faster, but below it the alist should win because of the reduced overhead.

# Benchmarks

## Before
```lisp
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  6.362 seconds of real time
  2.015625 seconds of total run time (1.640625 user, 0.375000 system)
  [ Real times consist of 2.084 seconds GC time, and 4.278 seconds non-GC time. ]
  [ Run times consist of 0.625 seconds GC time, and 1.391 seconds non-GC time. ]
  31.69% CPU
  13,438,524,761 processor cycles
  4,875,842,528 bytes consed
```

## After
```lisp
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  4.857 seconds of real time
  0.750000 seconds of total run time (0.671875 user, 0.078125 system)
  [ Real times consist of 0.993 seconds GC time, and 3.864 seconds non-GC time. ]
  [ Run times consist of 0.140 seconds GC time, and 0.610 seconds non-GC time. ]
  15.44% CPU
  10,258,086,921 processor cycles
  3,837,231,136 bytes consed
```

Just about 1.5s cut off, though the real win is how much less work the GC is doing, along with the reduction in bytes consed.

# What's next?

Probably looking into making string parsing a little better. The idea is to extract index ranges instead of substrings in hopes that this reduces memory and saves me some more time.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crafting a JSON parser with SIMD: Descent into madness]]></title>
            <link>https://khepu.com/posts/2024-04-21</link>
            <guid>https://khepu.com/posts/2024-04-21</guid>
            <pubDate>Sun, 21 Apr 2024 20:55:00 GMT</pubDate>
            <description><![CDATA[Let me guide you through how what I thought would take up 20mins ended up being more than I bargained for]]></description>
            <content:encoded><![CDATA[
In the first post of this series I mentioned how this function:

```lisp
(defun unset-rightmost-bit (n)
  (declare (type fixnum n))
  (the fixnum (logand n (- n 1))))
```

could probably be written as a simple assembly instruction. Let me guide you through how what I thought would take up 20mins ended up being more than I bargained for.

# It's just another VOP, right?

So the instruction we are looking for is `BLSR` which just unsets the rightmost set bit, exactly what we need. Let's just create a VOP to use it, much like we did for `BSF`:

```lisp
(in-package :cl-user)

(sb-c:defknown unset-rightmost-bit ((unsigned-byte 64)) (integer 0 64)
    (sb-c:foldable sb-c:flushable sb-c:movable)
  :overwrite-fndb-silently t)

(in-package :sb-vm)

(define-vop (cl-user::unset-rightmost-bit)
  (:policy :fast)
  (:translate cl-user::unset-rightmost-bit)
  (:args (x :scs (any-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst blsr x r)))

(in-package :cl-user)

(defun sth (a)
  (declare (type fixnum a)
           (optimize (speed 3) (safety 0)))
  (unset-rightmost-bit a))
```

which fails with:

```lisp
Undefined instruction: BLSR in
 (INST BLSR X R)
```

Huh!? Grepping through the SBCL repo for `BLSR` does not match anything... Turns out, the compiler does not know how to emit that instruction. But at this point I'm invested. I don't want to just give up and call it a day, which, in hindsight, would have been the wise thing to do... Let's look at how `BSF` was defined as an instruction and just imitate that.

# Teaching the compiler new words

I've already complained that SIMD and define-vop don't really come with any documentation beyond a few sources and examples. Well, `define-instruction` is not something I could find anything on outside the SBCL source code itself. The plan is to look at some examples, hopefully grab one that looks close, adjust it, and have it work.

## Looking at existing instructions

Here is how `BSF` was taught to the compiler:

```lisp
(define-instruction bsf (segment &prefix prefix dst src)
  (:printer ext-reg-reg/mem-no-width ((op #xBC)))
  (:emitter (emit* segment #xBC prefix dst src)))
```

For the compiler to be able to emit an assembly instruction, it must know the bytes of machine code that correspond to it. How do we know what bytes to emit? Open the [Intel](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)/[AMD](https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf) Developer Manual, find the instructions page and look at the opcode.

|Opcode|Instruction|Op/En|64-bit Mode |Compat/Leg Mode|Description|
| --- | --- | --- | --- | --- | --- |
|0F BC /r|BSF r16, r/m16|RM|Valid|Valid|Bit scan forward on r/m16.|
|0F BC /r|BSF r32, r/m32|RM|Valid|Valid|Bit scan forward on r/m32.|
|REX.W + 0F BC /r|BSF r64, r/m64|RM|Valid|N.E.|Bit scan forward on r/m64.|

Looking at the table in Intel's manual, the first 2 entries are pretty straightforward: the instruction corresponds to bytes `0x0F 0xBC`; that's the opcode. `/r` is part of the [ModR/M byte](https://en.wikipedia.org/wiki/ModR/M), which is emitted after the opcode and encodes information about the instruction operands.

The third one is where things get more complicated. Some instructions require a prefix before the opcode. To use `BSF` with 64-bit integers you need to first emit the [REX](https://en.wikipedia.org/wiki/VEX_prefix#REX) prefix with the `W` (width) flag set to `1`. I think in this case it would be `0b01001000`.

I haven't dived deep enough to be able to explain everything, but `r16` indicates a 16-bit register as the destination, while `r/m16` encodes the source: either a 16-bit register or a 16-bit memory operand.

This is a lot of information (it certainly was for me), so here is the recap of the machine-code instruction format:
```lisp
[PREFIX] OPCODE MOD_RM
```

_I am no expert and this might be lacking but it was good enough to guide me to a result and it should be enough to follow along._
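
As a sanity check on that recap, here is my own arithmetic for the REX byte the third table row calls for: a fixed `0100` high nibble plus the `W` flag.

```lisp
;; REX is 0b0100WRXB; setting only W (64-bit operand size) yields #x48,
;; i.e. the 0b01001000 guessed above.
(let ((rex-base #b01000000)   ; fixed 0100 high nibble, R/X/B clear
      (w-flag   #b00001000))  ; W = 1: 64-bit operand size
  (logior rex-base w-flag)) ;; => 72, which is #x48
```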

Time to pull up the same table for `BLSR` and see if it's making any sense.

|Opcode/Instruction|Op/En|64/32-bit Mode|CPUID Feature Flag|Description|
|---|---|---|---|---|
|VEX.LZ.0F38.W0 F3 /1 BLSR r32, r/m32|VM|V/V|BMI1|Reset lowest set bit of r/m32, keep all other bits of r/m32 and write result to r32.|
|VEX.LZ.0F38.W1 F3 /1 BLSR r64, r/m64|VM|V/N.E.|BMI1|Reset lowest set bit of r/m64, keep all other bits of r/m64 and write result to r64.|

This is starting to feel a lot like the difference between what's given for homework and what's on the final test...

## VEX Prefix

To my dismay, [VEX](https://en.wikipedia.org/wiki/VEX_prefix#VEX3) is far more complicated than REX... VEX was created to support instructions for SSE and AVX (among others); its encoding allows access to XMM (128-bit) and YMM (256-bit) registers. I did not realize it at first, but this is a big deal: it impacts not only the prefix emitted but also the ModR/M byte.

The register section of the ModR/M byte is repurposed for VEX instructions. Notice the `/1` in the opcode: this is the value to be placed in that register field, but what does it mean? Apparently, there is a whole set of instructions sharing the same prefix and opcode as `BLSR`, and the only way to indicate which one you mean is by setting that value. As an example, `/2` here would indicate `BLSMSK` instead.

The VEX prefix can be either 2 or 3 bytes long; for `BLSR` it's going to be 3, which you can tell by the `0x0F38` in the opcode (if it were just `0x0F` it would be 2). This already gives us the first byte of the prefix, 1/3 of the way there! Yeah, right...

<div align="center">
  <img src="/images/vex/blsr.png" width="100%"/>
</div>

The first byte is `0xC4`; treat it like a constant, a prefix prefix if you will. It all goes downhill from here. This is the second byte: `0b11100010`. The first three 1s are easy to guess: the opcode does not use `R`, `X`, or `B`, and their inverted values fill the first 3 bits. The next 5 bits select the opcode map to be used. This is where the Intel manual starts to disappoint, and I found a way forward through the AMD manual, which clearly defines `map_select` for `BLSR` as 2. Actually, the AMD manual is great because it almost gives us the third byte too!

Remember `W` from the REX prefix? It's pretty much the same here: it sets the top bit of the third byte. `vvvv` encodes the destination register, but we don't know which register will be used at this time... This is where we jump back into lisp land:

```lisp
(logior #b10000000
        (ash (lognot (reg-id-num (reg-id dst))) 3))
```

We can just ask the compiler for the register id of the destination provided, invert it, and stuff it into place. `L` (vector length) is 0; in the Intel manual this is indicated by `LZ`. `pp` is just 2 more prefix bits. While the Intel manual does not explicitly define them for this instruction, it does give a hint in Volume 2, Chapter 3.1.1.2.
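
Putting it together, here is my own sketch of the three prefix bytes for `BLSR r64` as derived above. `dest-id` is a hypothetical stand-in for the destination register id (0-15), which is only known at emit time; the function name is mine, not SBCL's.

```lisp
;; Hedged sketch: the 3-byte VEX prefix for BLSR r64, byte by byte.
(defun vex3-bytes-for-blsr (dest-id)
  (list #xC4                              ; byte 1: 3-byte VEX escape
        (logior (ash #b111 5) #b00010)    ; byte 2: ~R ~X ~B = 111, map_select = 2 (0F38)
        (logior #b10000000                ; byte 3: W = 1 (64-bit operands)
                ;; ~vvvv in bits 6-3 (inverted dest register), L = 0, pp = 00
                (ash (logand (lognot dest-id) #b1111) 3))))

(vex3-bytes-for-blsr 0) ;; => (196 226 248)
```

Unlike the snippet above, I mask the inverted register id down to 4 bits so it stays inside the `vvvv` field.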

While writing this, it all seems a lot simpler: the manuals make sense and hold all the information you need. But the first time you go in, it feels chaotic and incomprehensible. Every little character in those initial tables packs so much information, and unless you understand all of it you cannot move forward.

## A new instruction is born

We've already done all the hard work, now it's all about emitting the bytes:

```lisp
(in-package :sb-x86-64-asm)

(sb-assem::define-instruction blsr (segment dst src)
  (:printer ...)
  (:emitter
   (let ((size (matching-operand-size dst src)))
     (when (not (eq size :qword))
       (error "BLSR instruction only supported for 64-bit operand size")))
   (emit-byte segment #xC4)
   (emit-byte segment #b11100010)
   (emit-byte segment (logior #b10000000
                              (ash (lognot (reg-id-num (reg-id dst))) 3)))
   ;; opcode
   (emit-byte segment #xf3)
   ;; mod-r/m byte
   (emit-byte segment (logior (ash (logior #b11000 1) 3)
                              (reg-encoding src segment)))))
```

Keep in mind that `define-instruction` is *removed* after SBCL compiles itself. So I did what any sensible person would and just copied it over from the source code, along with `%def-inst-encoder`.

The only thing we haven't covered is the ModR/M byte. Bits 7 and 6 indicate "register addressing" mode. Bits 5, 4, and 3 are repurposed here to indicate the variant (`/1`), while what's left encodes the source register. It's been a while since I put this much effort into 7 lines of code; going through such a journey just to hack this together has really expanded my understanding.
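
Spelled out field by field, that ModR/M byte looks like this (my own sketch; `src-id` is a hypothetical stand-in for the source register's encoding):

```lisp
;; ModR/M for BLSR: mod = #b11 (register-direct), reg = 1 (the /1 opcode
;; extension selecting BLSR), rm = the source register's low 3 bits.
(defun blsr-modrm (src-id)
  (logior (ash #b11 6)             ; mod: register addressing
          (ash 1 3)                ; reg field repurposed as /1
          (logand src-id #b111)))  ; rm: source register

(format nil "~8,'0b" (blsr-modrm 0)) ;; => "11001000"
```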

Enough peacocking, let's get back to work:

```lisp
(define-vop (unset-rightmost-bit)
  (:policy :fast)
  (:translate unset-rightmost-bit)
  (:args (x :scs (unsigned-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst blsr x r)))
```

Do I honestly think that this is going to make a difference in speed? Not really, but I just had to give it a try.

# Benchmarks

No benchmarks this time: `unset-rightmost-bit` was called only when un-escaping characters. My sample JSON did not contain any, and escapes are probably not common enough for this to matter.

# What's next?

Since SBCL is lacking this and other similar instructions, I want to attempt an actual contribution that adds all of them. But it's going to take some polishing. Did you notice how I casually omitted the `:printer` part of the instruction? I didn't do it to save space; I hid it because it's not fully done yet.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crafting a JSON parser with SIMD: Slowing down to go faster]]></title>
            <link>https://khepu.com/posts/2024-03-29</link>
            <guid>https://khepu.com/posts/2024-03-29</guid>
            <pubDate>Fri, 29 Mar 2024 20:23:00 GMT</pubDate>
            <content:encoded><![CDATA[
When I started this project, I chose the AVX instruction set because it was the only one that could process 256-bit vectors. Every other one (excluding AVX512) was limited to 128 bits. This translates to processing 32 characters per instruction. Sounds way better than 16, right? In theory it should be faster, but calling out to AVX is not free: there is some overhead when we carve chunks out of the string being parsed in order to build the SIMD pack.

Since we are taking the time to create the SIMD pack, we want to squeeze as much information out of it as possible. In some cases, such as parsing strings, it's worth reusing the bitmask calculated from a chunk to look for both characters that need to be escaped and double-quotes. In every other case, though, we are only interested in the rightmost set bit of the bitmask, as it marks a pivot point after which we need to look for a completely different character, making the bitmask pretty much single-use.

That would make the ideal bitmap look like this:

```lisp
1000 0000 0000 0000 0000 0000 0000 0000
```

Skipping as many characters as possible within the chunk while still finding the anchor point. My thinking is that this is very unlikely; in fact, the opposite is more likely to be true. Unless someone creates a horribly formatted JSON on purpose, the anchor points are not going to be far apart. Even JSON keys are unlikely to be over 10 characters.

Let's look at an example, here is the API response from an [OpenWeather endpoint](https://openweathermap.org/current):
```javascript
{
  "coord": {
    "lon": 10.99,
    "lat": 44.34
  },
  "weather": [
    {
      "id": 501,
      "main": "Rain",
      "description": "moderate rain",
      "icon": "10d"
    }
  ],
  "base": "stations",
  "main": {
    "temp": 298.48,
    "feels_like": 298.74,
    "temp_min": 297.56,
    "temp_max": 300.05,
    "pressure": 1015,
    "humidity": 64,
    "sea_level": 1015,
    "grnd_level": 933
  },
  "visibility": 10000,
  "wind": {
    "speed": 0.62,
    "deg": 349,
    "gust": 1.18
  },
  "rain": {
    "1h": 3.16
  },
  "clouds": {
    "all": 100
  },
  "dt": 1661870592,
  "sys": {
    "type": 2,
    "id": 2075663,
    "country": "IT",
    "sunrise": 1661834187,
    "sunset": 1661882248
  },
  "timezone": 7200,
  "id": 3163858,
  "name": "Zocca",
  "cod": 200
}
```

Pay attention to the anchor points and how far apart they are. The end-of-string quote for keys is adjacent to `:` and the value begins after a single space. The `,` is also adjacent to the end of the value, and the next key begins after a linefeed and 2-6 spaces. Data is relatively closely packed, and this is the human-readable version; it's not uncommon to have the linefeeds and indentation removed to decrease size.

Now let's look at the keys: `description` is the longest at 11 characters, while the average is much shorter. Even the longest string value would comfortably fit in a 16-character vector. We could look at 100 more examples and collect more accurate statistics, but I feel confident that this is a fair assumption to draw and build on. Paying the cost to move 32 characters into a SIMD pack seems rather wasteful and unlikely to help. AVX does support 128-bit vector instructions as well, so it's worth a test just to get an idea of the difference it would make.
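
The 16-wide comparison would look just like the 32-wide one; note I am assuming here that `sb-simd-avx` mirrors the `sb-simd-avx2` names used elsewhere in this series (`u8.16-aref`, `u8.16=`, `u8.16-movemask`), so treat this as a sketch:

```lisp
;; Hedged sketch of a 16-character double-quote scan; bits 2 and 15 of the
;; resulting mask should come back set for this input.
(let ((chunk (make-array 16 :element-type '(unsigned-byte 8)
                            :initial-contents
                            (map 'list #'char-code "ab\"cdefghijklmn\""))))
  (sb-simd-avx:u8.16-movemask
   (sb-simd-avx:u8.16= (sb-simd-avx:u8.16-aref chunk 0)
                       (sb-simd-avx:u8.16 (char-code #\")))))
```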

# Benchmarks

```lisp
;; chunk size of 32
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  10.616 seconds of real time
  0.984375 seconds of total run time (0.890625 user, 0.093750 system)
  [ Real times consist of 2.251 seconds GC time, and 8.365 seconds non-GC time. ]
  [ Run times consist of 0.109 seconds GC time, and 0.876 seconds non-GC time. ]
  9.27% CPU
  22,422,362,573 processor cycles
  5,323,339,984 bytes consed

;; chunk size of 16
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  6.362 seconds of real time
  2.015625 seconds of total run time (1.640625 user, 0.375000 system)
  [ Real times consist of 2.084 seconds GC time, and 4.278 seconds non-GC time. ]
  [ Run times consist of 0.625 seconds GC time, and 1.391 seconds non-GC time. ]
  31.69% CPU
  13,438,524,761 processor cycles
  4,875,842,528 bytes consed
```

Good improvement, not only with respect to time but also space.

## Comparing to standard parsers

More importantly, this change marks the point where the SIMD implementation became faster than other CL parsers. Here are the benchmarks for [cl-json](https://github.com/hankhero/cl-json) and [jsown](https://github.com/madnificent/jsown).

```lisp
(time (loop for i from 0 to 100 do (cl-json:decode-json-from-string *data*)))
Evaluation took:
  29.079 seconds of real time
  9.562500 seconds of total run time (8.625000 user, 0.937500 system)
  [ Real times consist of 3.137 seconds GC time, and 25.942 seconds non-GC time. ]
  [ Run times consist of 0.906 seconds GC time, and 8.657 seconds non-GC time. ]
  32.88% CPU
  61,415,718,331 processor cycles
  25,724,477,840 bytes consed

(time (loop for i from 0 to 100 do (jsown::parse *data*)))
Evaluation took:
  9.593 seconds of real time
  1.750000 seconds of total run time (1.593750 user, 0.156250 system)
  [ Real times consist of 1.497 seconds GC time, and 8.096 seconds non-GC time. ]
  [ Run times consist of 0.296 seconds GC time, and 1.454 seconds non-GC time. ]
  18.24% CPU
  20,261,563,049 processor cycles
  5,149,047,872 bytes consed
```

To be completely fair, both of these libraries support a lot more features than mine which could be dragging down their performance. Shout out to `jsown` as it took a lot of effort to outperform.

# Closing thoughts

Using SIMD is not a free ticket to performance. Parsing becomes more complex and there are a lot of details that can drag down performance. It took a lot of iterations to get something that is even slightly better than what's already out there for lisp.

This improvement has probably been the easiest to implement, code-wise, and it only makes parsing faster when the assumption holds true. If JSON object keys were larger than 16 characters or anchor points were further apart then this would actually slow things down quite a bit. Speed improvement might translate quite differently depending on the contents but I think for most cases this is going to be better.

There is still the option of doing things in a hybrid way and utilizing both vector sizes though I feel like the complexity added outweighs the benefits.

## What's next?

You might have noticed that out of the `6.3s` of runtime, `2s` are spent GCing. If allocations were to be reduced we might see another speed up due to the GC running less. So that's where I will be focusing next, either trying to stack allocate things or pulling some other trick.

## What about SSE?

SSE could be a more portable option; it's been around longer. For the sake of completeness, I did give it a try, and it performed almost identically to AVX with 128-bit vectors. It does come with some downsides though.

The first is that it has fewer instructions available; even with my limited use of AVX I ran into this during the conversion, where `u8.16/=` was not available in SSE. SSE also has fewer registers available for use, and since we can do the same thing with AVX I don't see any reason to switch to it completely.

Here is the [commit](https://github.com/Khepu/jsoon/commit/77b5ca80209b5181870bdeb3c9e3d3ad985e1e8b) with SSE being used.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crafting a JSON parser with SIMD: Finding the rightmost set bit]]></title>
            <link>https://khepu.com/posts/2024-03-22</link>
            <guid>https://khepu.com/posts/2024-03-22</guid>
            <pubDate>Fri, 22 Mar 2024 20:24:00 GMT</pubDate>
            <content:encoded><![CDATA[
One of the more common uses of SIMD in this project is character comparison. Instead of comparing characters one by one, you load their numerical values into an unsigned-byte array of length 32 (`32 * 8 bits = 256 bits`, which is the size of the AVX SIMD pack) and then compare it (`sb-simd-avx2:u8.32=`) with the numeric value of another character. What you get back from this comparison is another SIMD pack where values are either `0`, for false, or `255`, for true.

```lisp
(sb-simd-avx2:u8.32= my-chunk
                     (sb-simd-avx2:u8.32 (char-code #\")))
```

Obviously, looping over that vector largely defeats the purpose of using SIMD. Thankfully, there is a better way, though a slightly more convoluted one... You can take the result of the previous comparison and turn it into a bitmap using `sb-simd-avx2:u8.32-movemask`. `movemask` is a tricky one; it took me a while to understand why the result is exactly what it's supposed to be. Let's look at an example:

```lisp
(defun why-is-my-bitmap-like-that ()
  (let ((chunk (make-array (list +chunk-length+)
                           :element-type '(unsigned-byte 8)
                           :initial-element 0))
        (pack-of-1s (sb-simd-avx2:u8.32 1)))
    (setf (aref chunk 1) 1)
    (let* ((chunk (sb-simd-avx2:u8.32-aref chunk 0)) ;; convert the array to a SIMD pack
           (pack-one-p (sb-simd-avx2:u8.32= chunk pack-of-1s))
           (bitmap (sb-simd-avx2:u8.32-movemask pack-one-p)))
      (format nil "~b" bitmap))))

(why-is-my-bitmap-like-that) ;; => "10"
```

I would expect my result to be: `"0100 0000 0000 0000 0000 0000 0000 0000"`. Well, it's "backwards". Not really: the first item of the array has been moved into the first bit of the bitmap, and all leading zeroes have been omitted. Makes a lot of sense when you think about it like that...
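
The mapping is easier to see spelled out in scalar code; this is my own illustration of what `movemask` computes, on a shortened 4-byte input:

```lisp
;; movemask takes the high bit of each byte and drops it into the bit
;; position matching that byte's index, so array order maps to ascending
;; bit positions.
(defun scalar-movemask (bytes)
  (loop for b across bytes
        for i from 0
        sum (ash (ldb (byte 1 7) b) i)))

;; A 255 at index 1 sets bit 1, matching the "10" printed above.
(format nil "~b" (scalar-movemask #(0 255 0 0))) ;; => "10"
```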

How does this help us? The 1s in the bitmap represent all the characters in the initial string that are of interest, and we can skip over anything that's a 0. To do that, my initial idea was a bithack:

```lisp
(defun rightmost-bit (n)
  (declare (type fixnum n))
  (the fixnum
    (logand (1+ (lognot n)) n)))

(defun rightmost-bit-index (n)
  "Returns a 0-based index of the location of the rightmost set bit of `n'."
  (the fixnum (truncate (log (rightmost-bit n) 2))))

(defun unset-rightmost-bit (n)
  (declare (type fixnum n))
  (logxor n (rightmost-bit n)))
```

You take a number `n`, flip the bits, add 1, and bitwise-and the result with the original number; you end up isolating the rightmost set bit. Almost there; now comes the ugly part. We need to convert that number into the 0-based index of the set bit in order to use it as an offset and locate the original character in the string we are parsing. There are really 2 ways to do this (that I know of):
1. use the `log` function with base 2 (what you see above)
2. basically brute force it by shifting a bit from index 0 to the left until it equals the original number.

Looking for something built into SBCL did not lead anywhere either. I hate both, for different reasons. It was good enough for a while but at some point `rightmost-bit-index` popped up in my profiler:

```lisp
  seconds  |     gc     |    consed   |    calls   |  sec/call  |  name
--------------------------------------------------------------
     0.172 |      0.000 |           0 |  1,579,605 |   0.000000 | RIGHTMOST-BIT-INDEX
     0.125 |      0.000 |           0 |  3,672,540 |   0.000000 | SKIP-TO-NEXT-CHARACTER
     0.125 |      0.000 |           0 |  5,326,737 |   0.000000 | NOT-WHITESPACE-P
     0.094 |      0.016 | 227,813,712 |  1,579,605 |   0.000000 | CHUNK
     0.078 |      0.000 |  44,410,608 |    748,191 |   0.000000 | %PARSE-STRING
     0.078 |      0.000 |  13,146,592 |    261,399 |   0.000000 | %PARSE-ARRAY
     0.078 |      0.000 |           0 |  1,285,566 |   0.000000 | PARSE
     0.078 |      0.000 |  57,041,792 |     90,003 |   0.000001 | %PARSE-OBJECT
     0.031 |      0.000 |           0 |    604,161 |   0.000000 | %PARSE-NUMBER
     0.031 |      0.000 |   6,520,832 |    402,774 |   0.000000 | %PARSE-DECIMAL-SEGMENT
     0.016 |      0.000 |           0 |  1,496,382 |   0.000000 | NEXT-OFFSET
     0.000 |      0.000 |           0 |      1,818 |   0.000000 | %PARSE-NULL
     0.000 |      0.000 |           0 |  1,579,605 |   0.000000 | RIGHTMOST-BIT
--------------------------------------------------------------
     0.906 |      0.016 | 348,933,536 | 18,628,386 |            | Total
```
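
In hindsight, there is a third, portable route: standard CL's `integer-length` applied to the isolated bit yields the same 0-based index without `log`'s float round-trip. My own sketch:

```lisp
;; (logand n (- n)) isolates the rightmost set bit, same as the bithack
;; above; INTEGER-LENGTH of a power of two is its bit index plus one.
(defun rightmost-bit-index* (n)
  (declare (type fixnum n))
  (1- (integer-length (logand n (- n)))))

(rightmost-bit-index* 12) ;; => 2
```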

Throughout my journey with SIMD my biggest resource was people playing with similar things in C/C++. Most of the time I could look into what they are doing and roughly translate that to CL. That's how I learned the basics of SIMD, documentation for it in SBCL is scarce to say the least.

This time I was less enthusiastic to see their solution. Modern CPUs have a specific set of [instructions](https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set) to do these kinds of operations. If they are cheating then so am I.

# Defining new intrinsics

`BSF`, or bit-scan forward, does almost the same job as `rightmost-bit-index`, only faster, because it's baked into your CPU and exposed as a single assembly instruction. The one difference is that here its result comes out 1 higher than the 0-based index we want, so it needs a small correction. The issue is that SBCL does not expose it in any way. The only option that remains is to expose it ourselves through a virtual operator.

If SIMD in SBCL feels like a dark room with no light switch, then defining your own virtual operators feels like you are in a dark forest. It's one of those things that you have to learn by reading through the existing examples, and there are many in SBCL's repo but I still felt lost. This is the only real [resource](https://pvk.ca/Blog/2014/08/16/how-to-define-new-intrinsics-in-sbcl/) that explains some of the things you see used in virtual operators and that's what I blindly followed:

```lisp
(in-package :cl-user)

(sb-c:defknown bit-scan-forward ((unsigned-byte 64)) (integer 0 64)
    (sb-c:foldable sb-c:flushable sb-c:movable)
  :overwrite-fndb-silently t)

(in-package :sb-vm)

(define-vop (cl-user::bit-scan-forward)
  (:policy :fast)
  (:translate cl-user::bit-scan-forward)
  (:args (x :scs (any-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst bsf x r)
    (inst dec r)))

(in-package :cl-user)

(bit-scan-forward 3) ;; => 0
```

Here is how the generator is read:
- apply `BSF` to `x` and store it in `r` (extracting the rightmost-set-bit)
- decrement `r` by 1 (converting it to 0-based)

Converting it to 0-based does imply that any value passed to `bit-scan-forward` must be a positive `fixnum`, so we have to check in advance.

## Downsides

You might be wondering, is there a downside in doing this? Always. I am not going to mention any portability concerns as those went out the window the moment I decided to use SIMD for this. The real downside I noticed is that the compiler will not generate the assembly we provided for `bit-scan-forward` unless it is absolutely certain that it is being given a `fixnum`. In any other case it will scream out `UNDEFINED FUNCTION`. So you do have to be extra careful when calling it.

If you do wrap it in its own function that ensures the argument is a `fixnum`, then you take a small performance hit: the function call overhead is significant given that we are only executing 2 lines of assembly inside it. And if you inline that function, you seem to lose the type guarantee/check. A nasty trap!

# Benchmarks

Let's look at the difference. First the 2 functions in isolation:

```lisp
> (time (loop for i from 1 to 100000000 do (rightmost-bit-index i)))
Evaluation took:
  1.284 seconds of real time
  0.375000 seconds of total run time (0.375000 user, 0.000000 system)
  29.21% CPU
  2,713,127,471 processor cycles
  0 bytes consed

> (time (loop for i from 1 to 100000000 do (bit-scan-forward i)))
Evaluation took:
  0.011 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  24,539,824 processor cycles
  0 bytes consed
```

A nice 11,572% improvement. Not too bad. But the real test is hooking it up in the parser and seeing how that was affected. Here we go!

```lisp
;; before (rightmost-bit-index)
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  17.192 seconds of real time
  1.796875 seconds of total run time (1.703125 user, 0.093750 system)
  [ Real times consist of 2.212 seconds GC time, and 14.980 seconds non-GC time. ]
  [ Run times consist of 0.171 seconds GC time, and 1.626 seconds non-GC time. ]
  10.45% CPU
  36,311,175,076 processor cycles
  5,323,226,416 bytes consed

;; after (bit-scan-forward)
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  10.616 seconds of real time
  0.984375 seconds of total run time (0.890625 user, 0.093750 system)
  [ Real times consist of 2.251 seconds GC time, and 8.365 seconds non-GC time. ]
  [ Run times consist of 0.109 seconds GC time, and 0.876 seconds non-GC time. ]
  9.27% CPU
  22,422,362,573 processor cycles
  5,323,339,984 bytes consed
```

_NOTE: `*data*` here is a 10 MB JSON document_

The speedup translates quite well! Now, all this work only let us get rid of `rightmost-bit-index`; we are still using `rightmost-bit` and `unset-rightmost-bit`. Those are not too bad, but we can repeat the same process with the equivalent BMI instructions and speed this up even more.
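
For context, `rightmost-bit` and `unset-rightmost-bit` presumably correspond to the classic bit tricks that the BMI instructions BLSI and BLSR implement in hardware; the portable definitions would look something like this (my reconstruction, not necessarily the post's code):

```lisp
(defun rightmost-bit (x)
  "Isolate the lowest set bit: x AND -x (what BLSI computes)."
  (logand x (- x)))

(defun unset-rightmost-bit (x)
  "Clear the lowest set bit: x AND (x - 1) (what BLSR computes)."
  (logand x (1- x)))
```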

Check the code [here](https://github.com/Khepu/jsoon).

*EDIT*: Of course, after I finished writing this it occurred to me to grep the SBCL source code for `bsf` and found a similar vop named `unsigned-word-find-first-bit`.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Merging cron expressions]]></title>
            <link>https://khepu.com/posts/2024-02-25</link>
            <guid>https://khepu.com/posts/2024-02-25</guid>
            <pubDate>Sun, 25 Feb 2024 11:00:00 GMT</pubDate>
            <content:encoded><![CDATA[
Have you ever wondered when 2 crons can be merged? Take some time to think about it for yourself, it's a fun problem.

Let's start by defining what merge means here. The cron expression that is produced by this operation would need to have all the trigger times of the first, plus all the trigger times of the second, without introducing any new ones.

Simple, right? It actually is. As long as the expressions differ in only one segment, they are easy to merge:
```
'0 1 * * *' + '0 2-4 * * *' = '0 1-4 * * *'
'0 1 JAN * *' + '0 1 JUL * *' = '0 1 * JAN,JUL *'
```

If they differ in more than one segment then merging them would introduce new trigger times, which is unwanted.
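
As a quick sketch, representing a cron as a 5-element list of segment strings (a representation I am assuming here, not one from the post), the check boils down to counting differing segments:

```lisp
;; Illustrative only: crons are 5-element lists like ("0" "1" "*" "*" "*").
(defun mergeable-p (a b)
  "T when crons A and B differ in exactly one segment."
  (= 1 (count nil (mapcar #'string= a b))))
```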

# Merging multiple cron expressions

Here is where it gets tricky. Consider the following 3 expressions:
1. `0 1 * * *`
2. `30 1 * * *`
3. `0 2 * * *`

Both 2 and 3 can be merged with 1 but, after doing so, the resulting cron would no longer be compatible with what is left. The order in which crons are merged matters! Thinking about crons as a list does not help here, so it's time to switch to graphs. The expressions will be nodes, and edges will connect expressions that can be merged. Edges are bidirectional and are labeled with the segment over which the merge is possible.

<div align="center">
  <img src="/images/cron-merging/image-1.png" width="100%"/>
</div>

Seeing the problem visualized, the first question that pops up is: how many edges can there be between 2 nodes? It turns out there are only 3 options: 0 if they cannot be merged, 1 if they can, and 5 if they are exactly the same. We can limit this to just the first 2 cases by considering only a unique set of crons, eliminating any duplicates.

The graph is great because it poses another very direct question: Which way do you merge? For this example it does not matter, you are going to end up with 2 expressions that cannot be further simplified, but let's look at a larger example to put things into perspective.

<div align="center">
  <img src="/images/cron-merging/image-2.png" width="100%"/>
</div>

## Coming up with an algorithm

Where do we even begin?

Before we start answering that, we need to know how a merge affects the graph. We noted earlier that merging over the `minute` segment produced a cron that was no longer able to merge over the `hour` segment. If you mentally go through the graph and do the merges, you will notice that every merge of type `A` results in a cron that precludes all merges of type `B`.

This is great: it lets us calculate the amount of "damage" we would be doing to the graph. I call it damage because, to perform the greatest number of merges, we want the graph to preserve as many edges as possible between operations; otherwise we will end up with a sub-optimal solution.

On the other hand, edges of the same type are preserved, so we can count on those to stay after the merge. Knowing that we can keep following merges of one type does let us visualize paths in the graph and you might even recognize some [Strongly Connected Components](https://en.wikipedia.org/wiki/Strongly_connected_component) of the same edge type. It's an interesting concept but it goes against our principle of causing the least damage possible to the graph.

### What if we started with the least harmful merge?

It's a simple idea, graph theory refers to this as a [Minimum Cut](https://en.wikipedia.org/wiki/Minimum_cut), though ours is a little special... In order for minimum cut to work we need to alter it so that it fits our problem.

Minimum cut allows you to arbitrarily split a graph in 2, but we want to limit this to just 2 nodes at a time. Our minimum cut would also need to take into account the edges that would not be lost because of the merge type; those don't really count as damage.

So, here is the process:
1. Isolate 2 nodes through minimum cut
2. Produce a new node by merging them
3. Insert the new node into the graph
4. Recalculate the edges
5. Repeat until there are no more edges
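
The steps above can be sketched in code, with some simplifications loudly flagged: crons are 5-element lists of segment strings (my representation, not the post's), merging comma-joins the differing segment (so `1` + `2-4` becomes the equivalent but unnormalized `1,2-4` rather than `1-4`), and the pair is picked greedily instead of via a true minimum cut:

```lisp
(defun one-diff (a b)
  "Index of the single differing segment between crons A and B, or NIL."
  (let ((diffs (loop for x in a for y in b for i from 0
                     unless (string= x y) collect i)))
    (when (= 1 (length diffs))
      (first diffs))))

(defun join-at (a b i)
  "Merge cron B into cron A by comma-joining segment I."
  (loop for x in a for y in b for j from 0
        collect (if (= i j) (concatenate 'string x "," y) x)))

(defun condense (crons)
  "Repeatedly merge mergeable pairs until no edges remain (greedy order)."
  (loop
    (let ((pair (loop named scan
                      for (a . rest) on crons
                      do (loop for b in rest
                               for i = (one-diff a b)
                               when i do (return-from scan (list a b i))))))
      (unless pair
        (return crons))
      (destructuring-bind (a b i) pair
        (setf crons (cons (join-at a b i)
                          (remove b (remove a crons :test #'equal)
                                  :test #'equal)))))))
```

With the three expressions from the earlier example, this reduces the set to 2 crons that cannot be simplified further.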

Example of a minimum cut in the above graph:
<div align="center">
  <img src="/images/cron-merging/image-3.png" width="100%"/>
</div>

Only 2 connections are permanently removed: `0 1 * * * -hour- 0 2 * * *` and `0 1 * * * -hour- 0 3 * * *`. There might be other pairs that would result in an equivalent minimum cut, and at that point order should not really matter. Choosing one over the other could lead to slightly different results, though everything should still be reduced down to the same number of expressions.

# Final thoughts

I wrote the [code](https://github.com/Khepu/cron-condenser) for this a while ago; while it is unoptimized, I probably will not spend more time improving it. If I ever return to this project, it will be to try my hand at a formal proof that the above algorithm works as well as I think it does.
]]></content:encoded>
        </item>
    </channel>
</rss>