<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Khepu</title>
        <link>https://khepu.com</link>
        <description>Where I technologically vent</description>
        <lastBuildDate>Sat, 11 Apr 2026 11:08:13 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>Feed | Nigiri</generator>
        <language>en</language>
        <image>
            <title>Khepu</title>
            <url>https://khepu.com/favicon/apple-touch-icon.png</url>
            <link>https://khepu.com</link>
        </image>
        <copyright>All rights reserved 2026</copyright>
        <item>
            <title><![CDATA[Battleplans: Befriending the cache]]></title>
            <link>https://khepu.com/posts/2025-07-06</link>
            <guid>https://khepu.com/posts/2025-07-06</guid>
            <pubDate>Sun, 06 Jul 2025 00:23:00 GMT</pubDate>
            <content:encoded><![CDATA[
After having a working method for generating Dijkstra maps, it's time to understand whether they are viable for our needs here. They are not generated every frame, but generation needs to be fast enough for things to feel responsive.

Let's first talk about when they get computed and when we can do hacky things to avoid an expensive computation.

When a user is directly commanding units, we definitely need to produce a fresh map; we have to pay this upfront cost, as reusing another one would not really save us any work. The longer it takes for a command to complete, though, the more stale the map becomes, and we cannot let the behavior map deviate too much from what the player sees.

When buildings get added or deleted, it is easy enough to patch that area of the map, and those events should not be frequent enough to make this problematic. What about units, though? You can hardly expect them to sit in one place for long, and keeping track of their movement in multiple behavior maps seems like too much work. The easy solution here is to exclude them completely from this computation: we simply do not care about unit positioning here, and it will be handled elsewhere, by an occupancy map or similar.

Another problem appears when we consider multiple goals being present in a single behavior map: once one of the goals is completed, then what? We could recompute, but that also sounds expensive, depending on how long-lived goals are. To deal with this, once a goal has been completed we will simply replace it with a "reverse-goal": essentially, a goal with very low priority which has a repulsive effect on units, pushing them away. The good thing about this approach is that it is easy to patch onto an existing map. The reverse-goal acts as a flood fill up until it merges with the "borders" of the other goals. This reuses the existing generation method, keeping things simple.

We now have a good sense of when a full computation is required, though my original implementation is quite far from usable even for what we described above. On a 1000x1000 grid, this is how long it takes to populate the behavior map:

```lisp
Evaluation took:
  1.465 seconds of real time
  0.671875 seconds of total run time (0.515625 user, 0.156250 system)
  [ Real times consist of 1.045 seconds GC time, and 0.420 seconds non-GC time. ]
  [ Run times consist of 0.421 seconds GC time, and 0.251 seconds non-GC time. ]
  45.87% CPU
  3,095,285,390 processor cycles
  896,070,112 bytes consed
```
and here is the implementation:

```lisp
(defun populate-behavior-grid (field goals)
  (let* ((width (field-width field))
         (height (field-height field))
         (behavior (make-array (list (field-width field)
                                     (field-height field))
                               :initial-element (1+ (max width height))))
         (grid (field-grid field))
         (queue (make-queue)))
    (dolist (goal goals)
      (enqueue goal queue))

    (loop for goal = (dequeue queue)
          for coords = (goal-coords goal)
          for x = (truncate (vx coords))
          for y = (truncate (vy coords))
          for priority = (goal-priority goal)
          unless (or
                  (let ((current-value (aref behavior x y)))
                    (<= current-value priority))
                  (let ((map-value (aref grid x y)))
                    (eql map-value +occupied+)))
            do (let ((new-priority (1+ priority)))
                 (setf (aref behavior x y) priority)
                 (dolist (neighbor (neighbors coords width height))
                   (enqueue (make-goal :coords neighbor
                                       :priority new-priority)
                            queue)))
          until (queue-empty-p queue))
    behavior))
```

It is all quite simple, which makes you wonder: where the hell is all that time spent? Notice the queue: it's backed by a doubly-linked list. The `enqueue`, `dequeue`, and `queue-empty-p` operations are all O(1). The devil is in the details! All those small allocations, scattered all over memory, are an absolute nightmare for the CPU. It cannot predict shit.

Let's give it some help: we will throw away the queue, preallocate an array that is large enough, and use a fill pointer to keep track of how many elements we have "queued up". Here is what that would look like:

```lisp
(defun populate-behavior-grid (field goals)
  (let* ((width (field-width field))
         (height (field-height field))
         (behavior (make-array (list width height)
                               :initial-element (1+ (max width height))
                               :element-type '(signed-byte 16)))
         (grid (field-grid field))
         (queue (make-array (list (* width height))
                            :initial-element nil
                            :element-type '(or null goal)
                            :fill-pointer 0)))
    (dolist (goal goals)
      (vector-push goal queue))

    ;; Incrementing INDEX essentially acts as removing the first
    ;; element of the queue.
    (loop for index from 0
          until (eql index (length queue))
          for goal of-type goal = (aref queue index)
          for coords = (goal-coords goal)
          for x of-type fixnum = (floor (vx coords))
          for y of-type fixnum = (floor (vy coords))
          for priority = (goal-priority goal)
          unless (or
                  (let ((current-value (aref behavior x y)))
                    (<= current-value priority))
                  (let ((map-value (aref grid x y)))
                    (eql map-value +occupied+)))
            do (setf (aref behavior x y) priority)
               (dolist (neighbor (neighbors coords width height))
                 (let ((new-goal (make-goal :coords neighbor
                                            :priority (1+ priority))))
                   (vector-push new-goal queue))))
    behavior))
```

By being more explicit about our memory needs and using a more cache-friendly structure, we can speed things up quite a bit. Our access pattern is dead simple too, which makes this entire setup pretty ideal:

```lisp
Evaluation took:
  0.036 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  76,524,229 processor cycles
  90,568,672 bytes consed
```

~36ms is by no means the best we can do, but it is certainly good enough to move on with building the rest of the game and ignore this until it becomes an issue again. I am sure it will, as we will be adding more and more complexity to the generation of a behavior map, but this is necessary.

While we are using a high-level language and operate many abstraction layers above the CPU, sometimes understanding what happens at that level gives you a whole new perspective on how to design your code. I am sure more challenges like this will come up, and this was definitely a fun one. I knew that arrays were more cache-friendly than lists, but I had never really grasped the difference this can make in a tight loop. Along with that, we have removed several levels of indirection and allowed the compiler to do its absolute best to help us in what we are doing. Take a moment out of your day to thank your compiler for its hard work.

3 tricks to remember:
- preallocate the memory you are going to use in a tight loop
- use a more cache-friendly structure where it makes sense
- keep access patterns predictable
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Battleplans: Pathfinding (part 1)]]></title>
            <link>https://khepu.com/posts/2025-03-30</link>
            <guid>https://khepu.com/posts/2025-03-30</guid>
            <pubDate>Sun, 30 Mar 2025 13:00:00 GMT</pubDate>
            <content:encoded><![CDATA[
Recently, I've been trying my hand at game development. The goal is to make a classic RTS but with minimal graphics; I am sticking to geometric shapes because I couldn't make things look nice to save my life. Using Common Lisp and cl-raylib, I managed to cover quite a bit of ground quite fast. I am at the point where I can render circles that represent units and rectangles that represent buildings. Centering the text in either of those was more mental gymnastics than I thought it would be, but still easier than CSS.

# Pathfinding in dynamic environments

In RTS games the field constantly changes: new buildings are added, new units are created, or existing ones simply move around; there really isn't much that stays constant. Pathfinding was never a light topic, but a dynamic environment, along with the constraint of solving these problems in real-time for the game to be playable, makes it especially challenging.

The first problem I had here was changing my perception of space. If I wanted to apply an off-the-shelf pathfinding algorithm, I could not keep thinking about continuous space; it's all about graphs in that domain... It took me a while to internalize that essentially everything is placed on an imaginary grid, treated like a graph where the value of each slot in the grid is the cost to traverse it. The catch here is that things don't magically teleport from one square to the other, they can be "in-transit", which creates just as many problems as it solves. One thing at a time though.

## A*

I decided that having something to build around and use to understand the problem would be better than nothing. A simple A* implementation should get things rolling. It didn't take much to get it working, and it even handled cases where there was no path between the unit and the destination, which is great.

The point of A* is to avoid mapping out the entire grid and assigning values: it always picks up searching from the square closest (by Manhattan distance) to the goal. Regardless of how big the grid is, the cost scales only with the distance between the start and the goal. At this point I have zero clue how granular the grid should be, so I am just winging it and making up numbers, to be adjusted later.
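To make the idea concrete, here is a minimal A* sketch, written in Python just to keep the illustration compact. The grid encoding (`1` for blocked), the 4-way movement, and all names are my own assumptions for the sketch, not the game's actual code:

```python
import heapq
import itertools

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(grid, start, goal):
    """Cells are (x, y) tuples; grid[y][x] == 1 means blocked."""
    width, height = len(grid[0]), len(grid)
    tie = itertools.count()  # tie-breaker so the heap never compares cells
    frontier = [(manhattan(start, goal), next(tie), 0, start, None)]
    came_from = {}
    while frontier:
        _, _, cost, current, parent = heapq.heappop(frontier)
        if current in came_from:      # already expanded via a cheaper route
            continue
        came_from[current] = parent
        if current == goal:           # walk parent links back to the start
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        x, y = current
        for dx, dy in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] == 0:
                heapq.heappush(frontier,
                               (cost + 1 + manhattan((nx, ny), goal),
                                next(tie), cost + 1, (nx, ny), current))
    return None  # no path between start and goal
```

On an open grid the search expands roughly along the straight line between start and goal, which is why the cost scales with distance rather than grid size.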

Well, a couple of issues are already apparent. Since we map out just one path, the moment at least one square on that path changes, the path becomes useless and we need a new one... We really cannot afford to do this for every change. We would also need a path per unit: if we wanted 100 units to move at once, that's 100 A* calls, ouch! Lastly, A* cannot handle multiple goals. This is important for positioning and targeting: if 2 targets (or more!) are available to a unit, we'd need to get it to pick the closest one and navigate there.

## Dijkstra Maps

Since I now knew what was missing, all I had to do was look for it. Dijkstra maps turned out to be a wonderful and very simple tool. It's a terrible name though, so I prefer to call them behavior maps. The idea here is that we create a second grid, the behavior grid, and on it we mark all the goals for a unit (or a group of them) with `0`. All adjacent squares are marked `1`, their adjacent squares `2`, and so on. This should take a bit longer to calculate than A*, but now, no matter where something is on the grid, if it always moves to the neighboring square with the lowest number, it will reach one of the goals. Effectively, we have pathfinding for all units, immediately, to all goals.

|   |   |   |   |
|---|---|---|---|
| 4 | 4 | 4 | 4 |
| 3 | # | 3 | 3 |
| 2 | # | 2 | 2 |
| 2 | 1 | 1 | 1 |
| 2 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 |

_Where `#` means blocked_
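The fill described above can be sketched in a few lines of Python. This is illustrative only: the names and the blocked-cell encoding are my own, and diagonal steps cost 1, matching the table:

```python
from collections import deque

DIRECTIONS = [(0, 1), (1, 0), (1, 1), (0, -1),
              (-1, 0), (-1, -1), (-1, 1), (1, -1)]

def behavior_map(width, height, blocked, goals):
    """Flood fill from all goals at once; diagonal steps also cost 1.

    `blocked` is a set of (x, y) cells, `goals` a list of (x, y) cells.
    """
    unreachable = max(width, height) + 1       # sentinel for unvisited cells
    grid = [[unreachable] * width for _ in range(height)]
    queue = deque((x, y, 0) for x, y in goals)
    while queue:
        x, y, value = queue.popleft()
        if (x, y) in blocked or grid[y][x] <= value:
            continue                           # blocked, or a shorter route won
        grid[y][x] = value
        for dx, dy in DIRECTIONS:
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                queue.append((nx, ny, value + 1))
    return grid
```

Seeding the queue with every goal at value `0` is what makes a single pass serve all goals at once.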

Due to how the behavior map is laid out, a unit will always gravitate towards the closest goal. More than that, the behavior grid is quite easy to manipulate: even if on the first pass we need to map out everything, we can accommodate smaller changes by "patching" it. New building got placed? Let me just mark those squares as closed and adjust the numbers on the adjacent ones, done!

This is more than pathfinding though, because you can encode all sorts of behavior on that map depending on what a unit needs to do. Whether it's a passive unit that needs to flee from enemies, or a ranged unit that needs to be X squares away from the target, all of that can be encoded there. It doesn't directly solve the issue of multiple units not stepping on each other, but that is something that can be handled at a different level.

The implementation for it is quite clean and simple, nothing fancy, though I have not bothered optimizing it yet. My guess is that this is a very SIMD-able problem, and maybe I can get a massive speedup that way.


```lisp
(defstruct (behavior (:constructor %make-behavior))
  grid
  width
  height
  goals)

(defstruct goal
  (coords   (error "No COORDS supplied for goal!")   :type vec2)
  (priority (error "No PRIORITY supplied for goal!") :type fixnum))

(defun make-behavior (field goals)
  (%make-behavior :grid (populate-behavior-grid field goals)
                  :width (field-width field)
                  :height (field-height field)
                  :goals goals))

(defun out-of-bounds-p (x y width height)
  (not (and (<= 0 x (1- width))
            (<= 0 y (1- height)))))

(defun neighbors (center width height)
  (loop for direction in (list (vec  0  1)
                               (vec  1  0)
                               (vec  1  1)
                               (vec  0 -1)
                               (vec -1  0)
                               (vec -1 -1)
                               (vec -1  1)
                               (vec  1 -1))
        for coord = (v+ center direction)
        unless (out-of-bounds-p (vx coord) (vy coord) width height)
          collect coord))

(defun populate-behavior-grid (field goals)
  (let* ((width (field-width field))
         (height (field-height field))
         (behavior (make-array (list (field-width field)
                                     (field-height field))
                               :initial-element (1+ (max width height))))
         (grid (field-grid field))
         (queue (make-queue)))
    (dolist (goal goals)
      (enqueue goal queue))

    (loop for goal = (dequeue queue)
          for coords = (goal-coords goal)
          for x = (truncate (vx coords))
          for y = (truncate (vy coords))
          for priority = (goal-priority goal)
          unless (or (<= (aref behavior x y) priority)
                     (eql (aref grid x y) +occupied+))
            do (let ((new-priority (1+ priority)))
                 (setf (aref behavior x y) priority)
                 (dolist (neighbor (neighbors coords width height))
                   (enqueue (make-goal :coords neighbor
                                       :priority new-priority)
                            queue)))
          until (queue-empty-p queue))
    behavior))

(defun find-closest-neighbor (coords behavior)
  (let ((grid (behavior-grid behavior))
        (width (behavior-width behavior))
        (height (behavior-height behavior)))
    (loop with min = coords
          with min-value = (aref grid
                                 (truncate (vx min))
                                 (truncate (vy min)))
          for neighbor in (neighbors coords width height)
          for value = (aref grid
                            (truncate (vx neighbor))
                            (truncate (vy neighbor)))
          when (< value min-value)
            do (setf min       neighbor
                     min-value value)
          finally (return min))))

```

and here is what the unit actually does:

```lisp
(defvar *units* nil)

(defparameter *unit-size* 16.0)
(defparameter *unit-font-size* 12)

(defstruct (unit (:constructor %make-unit))
  (kind   (error "KIND not supplied for unit!")   :type keyword)
  (coords (error "COORDS not supplied for unit!") :type vec2)
  (owner  (error "OWNER not supplied for unit!")  :type player)
  (speed  (error "SPEED not supplied for unit!")  :type single-float)
  (behavior nil                                   :type (or null behavior)))

(defun make-unit (&key kind coords owner (field *field*))
  (let ((unit (%make-unit :kind kind
                          :coords coords
                          :speed 0.5
                          :owner owner)))
    (push unit *units*)
    (place unit field)
    unit))

(defun move (unit offset field)
  (let ((new-coords (v+ (unit-coords unit) offset)))
    (unless (or (out-of-bounds-p (vx new-coords) (vy new-coords)
                                 (field-width field) (field-height field))
                (occupied-p (vx new-coords) (vy new-coords) field))
      (setf (unit-coords unit) new-coords)
      (place unit field))
    unit))

(defun offset (unit source target)
  (let* ((speed (unit-speed unit))
         (distance (v- target source))
         (x (vx distance))
         (y (vy distance)))
    (vec (if (minusp x)
             (max (- speed) x)
             (min speed     x))
         (if (minusp y)
             (max (- speed) y)
             (min speed     y)))))

(defmethod act ((unit unit) (field field))
  (when-let (behavior (unit-behavior unit))
    (let* ((normalized-coords (v/ (unit-coords unit) 8.0))
           (target (find-closest-neighbor normalized-coords behavior)))
      (if (v/= normalized-coords target)
          (move unit (offset unit normalized-coords target) field)
          (setf (unit-behavior unit) nil)))))
```


In the future this will probably get a bit more convoluted, as I would like to be able to pass in a function that better defines the behavior of a unit, but as a starting point, I am really happy with how simple it was to get this working.

Nothing interesting to show at this point, but I am having fun. Once I get those dumb units to stop running on top of each other, I will try to share a video in a following post, along with other problems that came up along the way!

# Resources

- [cl-raylib](https://github.com/longlene/cl-raylib)
- [Dijkstra maps](https://www.roguebasin.com/index.php/Dijkstra_Maps_Visualized)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: Handling raw data]]></title>
            <link>https://khepu.com/posts/2024-11-24</link>
            <guid>https://khepu.com/posts/2024-11-24</guid>
            <pubDate>Sun, 24 Nov 2024 23:22:00 GMT</pubDate>
            <content:encoded><![CDATA[
The goal for today is to get to the point where we can receive events from the device in our driver and basically understand what we are getting.

In its current state, even though the driver does execute the `probe` function, it does not seem to be receiving any raw events. The reason is that `probe` does not even come close to handling everything it should. It needs to first receive and parse a [report descriptor](https://docs.kernel.org/hid/hidintro.html#id5), which describes the data that the device is going to send. Remember how `evtest` could give us a list of all the input options? It can do that by reading the report descriptor and analyzing it. Thankfully, we are not doing anything special with it so far, so we can just use the default parsing function offered by the kernel.

# Probe

```c
static int probe(struct hid_device *hdev, const struct hid_device_id *id) {
  printk(KERN_INFO "winwing_ufc1_devdrv: probed\n");

  int ret = hid_parse(hdev);

  if (ret) {
    printk(KERN_ERR "winwing_ufc1_devdrv: parse failed");
    return ret;
  }

  ret = hid_hw_start(hdev, HID_CONNECT_HIDRAW);

  if (ret) {
    printk(KERN_ERR "winwing_ufc1_devdrv: hw start failed");
    return ret;
  }

  ret = hid_hw_open(hdev);
  if (ret) {
    printk(KERN_ERR "winwing_ufc1_devdrv: hw open failed");
    hid_hw_stop(hdev);
    return ret;
  }

  return 0;
}

static void remove(struct hid_device *hdev) {
  hid_hw_close(hdev);
  hid_hw_stop(hdev);
}
```

The descriptor is parsed with `hid_parse`. After that we need to "start" the device, which initializes hardware buffers and connects to the device. `HID_CONNECT_HIDRAW` is specified as the mode, instead of the standard `HID_CONNECT_DEFAULT`, in order to avoid having events sent to the default HID driver. The purpose of `hid_hw_open` is to tell the device that we are finally ready to receive events.

To go along with that, I've also added the `remove` function, which takes care of signaling to the device that we will stop receiving events (`hid_hw_close`) and of cleaning up the buffers (`hid_hw_stop`). It is passed to the driver struct so the kernel can call it when appropriate.

# Raw events

At this point, our `raw_event` function is being called hundreds of times per second and I honestly have no clue why. So I modified it to print the data, hoping that will give us an idea. Here is the updated `raw_event` function:

```c
static int raw_event(struct hid_device *hdev,
                     struct hid_report *report,
                     u8 *raw_data,
                     int size) {
  printk(KERN_INFO "winwing_ufc1_devdrv: data:");

  for (int i = 0; i < size; i++) {
    printk(KERN_CONT " %02x", raw_data[i]);
  }
  printk(KERN_CONT "\n");

  return 0;
}
```

Nothing really to explain here so let's look at the data:

```bash
$ dmesg -w
[22210.301743] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
[22210.311749] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
[22210.322045] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
[22210.331743] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
[22210.341884] winwing_ufc1_devdrv: data: 01 00 00 00 00 80 00 36 01 76 0d 00 00 00 00 00 00
```

We are looking at things from a different perspective than before, when we were using `evtest`. This array is essentially the state of all the buttons, given to us at once. When I press a button, a single bit changes in that array, and when I release it, the change is reverted. We don't have to care about handling sync packets or parsing the USB data ourselves. Switches and knobs are also tracked in that same array. We are looking at a higher-level representation because HID still does some of the heavy lifting for us, and I am now very thankful for it.
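To find which bit a given button flips, it's enough to diff consecutive reports. Here is a hypothetical helper in Python (an offline analysis aid, not part of the driver) that XORs two captured reports and lists the changed bit positions:

```python
def changed_bits(previous, current):
    """Return (byte_index, bit_index) pairs that differ between two reports."""
    changes = []
    for i, (a, b) in enumerate(zip(previous, current)):
        diff = a ^ b                  # a set bit marks a changed position
        for bit in range(8):
            if diff & (1 << bit):
                changes.append((i, bit))
    return changes
```

Comparing the idle report against one captured while holding a button down would reveal that button's (byte, bit) position.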

# Next steps
From here on out it's a matter of understanding the mapping between buttons and the bits that change in this array. Once that is done and properly handled, half the work for the kernel-level driver is done. The other half is going to be LEDs, which I still have little clue how to do properly...

We didn't cover a lot of ground today, as I am slowly getting back into this project, but the path now seems clearer and I am excited to keep working on this!

# Resources

- [hid-core.c](https://elixir.bootlin.com/linux/v6.12.1/source/drivers/hid/hid-core.c)
- [another driver to look at for reference](https://elixir.bootlin.com/linux/v4.4.274/source/drivers/hid/wacom_wac.c)
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: Structure of a basic driver]]></title>
            <link>https://khepu.com/posts/2024-08-25</link>
            <guid>https://khepu.com/posts/2024-08-25</guid>
            <pubDate>Sun, 25 Aug 2024 08:25:00 GMT</pubDate>
            <content:encoded><![CDATA[
Last time, we left things off pondering whether a kernelspace driver is the way to go, as there already is a basic one. I no longer care what the right answer is here, only about which is the most fun way to go about this. So yes, we are building a driver. In this post we are going to look through the entire setup and basic structure of a driver, without really implementing anything useful yet. Think of it as the "Hello, world!" of drivers.

Despite this being the simplest driver we can write, there are still a lot of details to get right. First of all, the system dependencies:

```bash
sudo apt-get install gcc-12 \
                     flex \
                     bison \
                     linux-headers-$(uname -r)
```

# A basic driver

There are 2 parts to a basic driver: the driver definition, and fitting it into a Linux module.

```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/usb.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Giorgos Makris");
MODULE_DESCRIPTION("Driver for the WinWing UFC1");

#define VENDOR_ID 0x4098
#define PRODUCT_ID 0xbed0

static struct usb_device_id usb_table[] = {
  {USB_DEVICE(VENDOR_ID, PRODUCT_ID)},
  {},
};

MODULE_DEVICE_TABLE(usb, usb_table);

static int probe(struct usb_interface *intf, const struct usb_device_id *id) {
  printk("winwing_ufc1_devdrv - Probed\n");
  return 0;
}

static void disconnect(struct usb_interface *intf){
  printk("winwing_ufc1_devdrv - Disconnected\n");
}

static struct usb_driver driver = {
  .name = "winwing_ufc1_devdrv",
  .id_table = usb_table,
  .probe = probe,
  .disconnect = disconnect
};

static int __init ww_ufc_init(void) {
  int registered = usb_register(&driver);

  if (registered) {
    printk("winwing_ufc1_devdrv - Error: could not register driver!\n");
    return -registered;
  }
  printk("winwing_ufc1_devdrv - Initialized driver\n");
  return 0;
}

static void __exit ww_ufc_exit(void) {
  usb_deregister(&driver);
  printk("winwing_ufc1_devdrv - Unloaded driver\n");
}

module_init(ww_ufc_init);
module_exit(ww_ufc_exit);
```

## Licenses

I am gonna start with `MODULE_LICENSE`, because that's the one that was the weirdest for me. Originally I had it set to `MIT`. Tough luck: it turns out that if you don't have a compatible license specified, compilation breaks.

Here are the logs when you have the wrong license:

```bash
$ make

make -C /lib/modules/6.5.0-41-generic/build M=/home/gmakris/project/winwing-ufc1-driver modules
make[1]: Entering directory '/usr/src/linux-headers-6.5.0-41-generic'
  CC [M]  /home/gmakris/project/winwing-ufc1-driver/winwing_ufc1_devdrv.o
  MODPOST /home/gmakris/project/winwing-ufc1-driver/Module.symvers
ERROR: modpost: GPL-incompatible module winwing_ufc1_devdrv.ko uses GPL-only symbol 'usb_deregister'
ERROR: modpost: GPL-incompatible module winwing_ufc1_devdrv.ko uses GPL-only symbol 'usb_register_driver'
make[3]: *** [scripts/Makefile.modpost:144: /home/gmakris/project/winwing-ufc1-driver/Module.symvers] Error 1
make[2]: *** [/usr/src/linux-headers-6.5.0-41-generic/Makefile:1991: modpost] Error 2
make[1]: *** [Makefile:234: __sub-make] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-6.5.0-41-generic'
make: *** [Makefile:4: all] Error 2
```

`usb_deregister` and `usb_register_driver` are GPL-only symbols. Unlike the kernel, I couldn't care less about the license, so I just switched it...

## Driver layout

We've looked at how to find the `vendor_id` and `product_id` in a previous post. They are important here because the `usb_table` we hand them to is what the kernel uses to match a driver to a connected device. There are ways to be more generic and write a driver that the kernel can match to multiple devices, but we don't need that: we have one device, and our driver is tailored to it.

The `usb_table`, along with the `probe` and `disconnect` functions, is what we need to define a driver. We pack all that into the `driver` struct and then make use of it in the module init and exit functions (`ww_ufc_init` and `ww_ufc_exit`).

## Installing the driver

Linux actually does a great job of being modular and letting you easily plug in new modules, such as a driver. After building with `make`, it is just a matter of running `sudo insmod winwing_ufc1_devdrv.ko`. If it fails, a more descriptive error message should be in `dmesg`.

I found out the ugly way that `uname -r` can lie to you if your system has updated since it booted. In that case you will get an error like this:
```bash
$ dmesg

[  177.492936] module winwing_ufc1_devdrv: .gnu.linkonce.this_module section size must match the kernel's built struct module size at run time
```

If this happens then reboot and `apt-get install --reinstall linux-headers-$(uname -r)`.

In any case, if it all works, this is what `dmesg` should show:

```bash
$ dmesg

[  305.695775] usbcore: registered new interface driver winwing_ufc1_devdrv
[  305.695784] winwing_ufc1_devdrv - Initialized driver
```

Even though the driver is initialized at the kernel level, we do not see the probe message, which means it was not assigned to the device. There could be a couple of reasons for this, but the most probable one I found is that the device declares itself as an HID device, so the USB driver above would not even be considered.

Time to convert it to an HID driver, then, and look at the difference.

## Converting to HID

```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/hid.h>
#include <linux/hidraw.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Giorgos Makris");
MODULE_DESCRIPTION("Driver for the WinWing UFC1");

#define VENDOR_ID 0x4098
#define PRODUCT_ID 0xbed0

static struct hid_device_id device_table[] = {
  {HID_USB_DEVICE(VENDOR_ID, PRODUCT_ID)},
  {},
};

MODULE_DEVICE_TABLE(hid, device_table);

static int probe(struct hid_device *hdev, const struct hid_device_id *id) {
  printk("winwing_ufc1_devdrv: probed\n");
  return 0;
}

static int input_configured(struct hid_device *hdev, struct hid_input *hidinput) {
  return 0;
}

static int raw_event(struct hid_device *hdev, struct hid_report *report, u8 *raw_data, int size) {
  return 0;
}

static struct hid_driver ww_ufc1_driver = {
  .name = "winwing_ufc1_devdrv",
  .id_table = device_table,
  .probe = probe,
  .input_configured = input_configured,
  .raw_event = raw_event
};

module_hid_driver(ww_ufc1_driver);
```

Not many changes, mostly removing stuff and changing names. HID drivers are more standardized than generic USB drivers so we no longer need to have the explicit module declaration and pass it the `__init` and `__exit` functions. No more registering and deregistering, the HID module will handle that for us.

Overall, this seems like what I should have used in the first place; just some things you have to figure out as you go. We will take a better look at what the newly added functions are for when they become necessary.

Replugging the device now shows this:
```bash
$ dmesg

[ 3451.393737] winwing_ufc1_devdrv: module verification failed: signature and/or required key missing - tainting kernel
[ 3451.445871] winwing_ufc1_devdrv - probed
```

I am not even going to care about `module verification failed` until it becomes a problem. Just glad that it was picked up.

# Thoughts so far

It's not that I've done anything special so far; I wanted to explain the basic structure so any work that follows is easier to comprehend, both for me and for others. I learned a bit about drivers and how the kernel goes about selecting the right one, and it has been fun!

## Resources
Here is what has helped me go through this so far:

- [Johannes 4GNU_Linux](https://www.youtube.com/@johannes4gnu_linux96) has some awesome material, definitely worth watching
- [Linux Device Drivers](https://github.com/lancetw/ebook-1/blob/master/03_operating_system/Linux%20Device%20Drivers.3rd.Edition.pdf), I have not read through the entire thing but what I have read has been great to get my mind thinking the right way about this
- [Another Winwing driver](https://github.com/igorinov/linux-winwing/tree/main) has also been interesting to look at when doing the HID conversion

_code is available [here](https://github.com/Khepu/winwing-ufc1-driver)._
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: Unpacked]]></title>
            <link>https://khepu.com/posts/2024-07-21</link>
            <guid>https://khepu.com/posts/2024-07-21</guid>
            <pubDate>Sun, 21 Jul 2024 11:45:00 GMT</pubDate>
            <content:encoded><![CDATA[
Looking at raw bytes is no way to live, not when you have other choices. So this is about exploring what tools are out there for those that have gone down the same path. But before that, I've made some progress in understanding what we are dealing with.

# Tying loose ends

```bash
$ lsusb -t

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/12p, 480M
    |__ Port 1: Dev 5, If 0, Class=Human Interface Device, Driver=usbhid, 12M
    |__ Port 6: Dev 2, If 0, Class=Video, Driver=uvcvideo, 480M
    |__ Port 6: Dev 2, If 1, Class=Video, Driver=uvcvideo, 480M
    |__ Port 8: Dev 3, If 1, Class=Wireless, Driver=btusb, 12M
    |__ Port 8: Dev 3, If 0, Class=Wireless, Driver=btusb, 12M
```

Linux does recognize this as an input device and it has assigned a generic driver to it. It's a "Human Interface Device" and the driver assigned is `usbhid`. I'll let you guess what `hid` stands for. That is why we are even able to peek into the handler and read the input. That's great but I am still unsure if I should be writing my own to replace this or build on top of it...

You might also remember the handlers from the previous post of this series: while we were using `event14` to listen to all the input, there is a second one attached, `js0`, which turns out to be short for "joystick". Looking back, things are now painfully obvious:

```bash
$ dmesg | grep usb

...
[   93.725773] input: Winwing WINWING UFC1 as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:4098:BED0.0002/input/input16
[   93.726406] hid-generic 0003:4098:BED0.0002: input,hidraw1: USB HID v1.11 Joystick [Winwing WINWING UFC1] on usb-0000:00:14.0-1/input0
[   93.726602] usbcore: registered new interface driver usbhid
[   93.726612] usbhid: USB HID core driver
```

It's clearly shown that the `usbhid` driver is selected and `js0` lines up with the fact that this is marked as a Joystick.

# Untangling the mess

## evtest

Given that there is a driver, we now have a way of decoding the packets with `evtest`. `evtest` is described as an "Input device event monitor and query tool" and it lives up to its name:

```bash
$ sudo evtest /dev/input/event14

Input driver version is 1.0.1
Input device ID: bus 0x3 vendor 0x4098 product 0xbed0 version 0x111
Input device name: "Winwing WINWING UFC1"
Supported events:
  Event type 0 (EV_SYN)
  Event type 1 (EV_KEY)
    Event code 288 (BTN_TRIGGER)
    Event code 289 (BTN_THUMB)
    Event code 290 (BTN_THUMB2)
    Event code 291 (BTN_TOP)
    Event code 292 (BTN_TOP2)
    Event code 293 (BTN_PINKIE)
    Event code 294 (BTN_BASE)
    Event code 295 (BTN_BASE2)
    Event code 296 (BTN_BASE3)
    Event code 297 (BTN_BASE4)
    Event code 298 (BTN_BASE5)
    Event code 299 (BTN_BASE6)
    Event code 300 (?)
    Event code 301 (?)
    Event code 302 (?)
    Event code 303 (BTN_DEAD)
    Event code 704 (BTN_TRIGGER_HAPPY1)
    Event code 705 (BTN_TRIGGER_HAPPY2)
    Event code 706 (BTN_TRIGGER_HAPPY3)
    Event code 707 (BTN_TRIGGER_HAPPY4)
    Event code 708 (BTN_TRIGGER_HAPPY5)
    Event code 709 (BTN_TRIGGER_HAPPY6)
    Event code 710 (BTN_TRIGGER_HAPPY7)
    Event code 711 (BTN_TRIGGER_HAPPY8)
    Event code 712 (BTN_TRIGGER_HAPPY9)
    Event code 713 (BTN_TRIGGER_HAPPY10)
    Event code 714 (BTN_TRIGGER_HAPPY11)
    Event code 715 (BTN_TRIGGER_HAPPY12)
    Event code 716 (BTN_TRIGGER_HAPPY13)
    Event code 717 (BTN_TRIGGER_HAPPY14)
    Event code 718 (BTN_TRIGGER_HAPPY15)
    Event code 719 (BTN_TRIGGER_HAPPY16)
    Event code 720 (BTN_TRIGGER_HAPPY17)
    Event code 721 (BTN_TRIGGER_HAPPY18)
    Event code 722 (BTN_TRIGGER_HAPPY19)
    Event code 723 (BTN_TRIGGER_HAPPY20)
    Event code 724 (BTN_TRIGGER_HAPPY21)
    Event code 725 (BTN_TRIGGER_HAPPY22)
    Event code 726 (BTN_TRIGGER_HAPPY23)
    Event code 727 (BTN_TRIGGER_HAPPY24)
    Event code 728 (BTN_TRIGGER_HAPPY25)
  Event type 3 (EV_ABS)
    Event code 3 (ABS_RX)
      Value      0
      Min        0
      Max     4095
      Fuzz      15
      Flat     255
    Event code 4 (ABS_RY)
      Value   1457
      Min        0
      Max     4095
      Fuzz      15
      Flat     255
    Event code 5 (ABS_RZ)
      Value    822
      Min        0
      Max     4095
      Fuzz      15
      Flat     255
    Event code 6 (ABS_THROTTLE)
      Value      0
      Min        0
      Max    65535
      Fuzz     255
      Flat    4095
    Event code 7 (ABS_RUDDER)
      Value      0
      Min        0
      Max    65535
      Fuzz     255
      Flat    4095
  Event type 4 (EV_MSC)
    Event code 4 (MSC_SCAN)
Key repeat handling:
  Repeat type 20 (EV_REP)
    Repeat code 0 (REP_DELAY)
      Value    250
    Repeat code 1 (REP_PERIOD)
      Value     33
Properties:
Testing ... (interrupt to exit)
Event: time 1721556084.837383, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721556084.837383, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 1
Event: time 1721556084.837383, -------------- SYN_REPORT ------------
Event: time 1721556084.967497, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721556084.967497, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 0
Event: time 1721556084.967497, -------------- SYN_REPORT ------------
Event: time 1721556086.557322, type 3 (EV_ABS), code 4 (ABS_RY), value 1459
Event: time 1721556086.557322, -------------- SYN_REPORT ------------
Event: time 1721556086.567535, type 3 (EV_ABS), code 4 (ABS_RY), value 1461
Event: time 1721556086.567535, -------------- SYN_REPORT ------------
Event: time 1721556086.577537, type 3 (EV_ABS), code 4 (ABS_RY), value 1463
Event: time 1721556086.577537, -------------- SYN_REPORT ------------
```

A whole lot of information, some of which raises more questions than it answers. We get a list of supported events, along with their codes, which seem to be mapped to specific inputs. The buttons sort of make sense, except that there are more buttons listed than I can count on the device, and then there are these:

```bash
    Event code 300 (?)
    Event code 301 (?)
    Event code 302 (?)
```

We can also see joystick-related event codes, with `ABS_RY` being triggered below. Turns out those are mapped to the knobs; yes, I have 3 knobs but 5 event codes. I can only assume that the company might have started out making joysticks and, when they expanded their range, wanted to keep a uniform interface to make things easier for them to deal with.

This is a regular button press:
```bash
Event: time 1721556084.967497, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721556084.967497, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 0
Event: time 1721556084.967497, -------------- SYN_REPORT ------------
```

Still unsure what the `MSC_SCAN` part means, but I know that this event is not emitted for switches (which are also buttons). You can also see that a pressed button has a value of `1`, which turns to `0` when it gets released. What I had originally thought to be 3 events are actually 2; I was tricked by the `MSC_SCAN` packets being sent.

## Shifting perspective

Given all that, I can correlate what we see here with the raw bytes we were looking at, just to fully bridge the gap. What follows is going to be a bizarre dance of hex values, so let me explain some `xxd` flags first. `-c` sets the number of bytes to be printed per row and `-g` defines how many bytes should be in the same group (the default is 2, or 4 in little-endian mode).

```bash
$ cat /dev/input/event14 | xxd -c 24

00000000: f7fc 9c66 0000 0000 877a 0a00 0000 0000 0400 0400 1a00 0900  ...f.....z..............
00000018: f7fc 9c66 0000 0000 877a 0a00 0000 0000 0100 c902 0100 0000  ...f.....z..............
00000030: f7fc 9c66 0000 0000 877a 0a00 0000 0000 0000 0000 0000 0000  ...f.....z..............
00000048: f7fc 9c66 0000 0000 ee64 0b00 0000 0000 0400 0400 1a00 0900  ...f.....d..............
00000060: f7fc 9c66 0000 0000 ee64 0b00 0000 0000 0100 c902 0000 0000  ...f.....d..............
00000078: f7fc 9c66 0000 0000 ee64 0b00 0000 0000 0000 0000 0000 0000  ...f.....d..............
```

which corresponds to:
```bash
Event: time 1721564407.686727, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721564407.686727, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 1
Event: time 1721564407.686727, -------------- SYN_REPORT ------------
Event: time 1721564407.746734, type 4 (EV_MSC), code 4 (MSC_SCAN), value 9001a
Event: time 1721564407.746734, type 1 (EV_KEY), code 713 (BTN_TRIGGER_HAPPY10), value 0
Event: time 1721564407.746734, -------------- SYN_REPORT ------------
```

and all of a sudden things are a bit clearer. The first 2 bytes act as a clock: from my testing, the first byte looks like it corresponds to seconds and the second to minutes. Not in terms of their values, but in terms of how they change between button presses.

```bash
byte    : 0 1  2 3  4 5  6 7  8 9  1011 1213 1415 1617 1819 2021 2223  manually added
00001d28: fefd 9c66 0000 0000 d36e 0000 0000 0000 0000 0000 0000 0000  ...f.....n..............
00001d40: fffd 9c66 0000 0000 7c43 0200 0000 0000 0400 0400 1a00 0900  ...f....|C..............
00001d58: fffd 9c66 0000 0000 7c43 0200 0000 0000 0100 c902 0100 0000  ...f....|C..............
00001d70: fffd 9c66 0000 0000 7c43 0200 0000 0000 0000 0000 0000 0000  ...f....|C..............
00001d88: fffd 9c66 0000 0000 6c16 0400 0000 0000 0400 0400 1a00 0900  ...f....l...............
00001da0: fffd 9c66 0000 0000 6c16 0400 0000 0000 0100 c902 0000 0000  ...f....l...............
00001db8: fffd 9c66 0000 0000 6c16 0400 0000 0000 0000 0000 0000 0000  ...f....l...............
00001dd0: 00fe 9c66 0000 0000 af77 0500 0000 0000 0400 0400 1a00 0900  ...f.....w..............
00001de8: 00fe 9c66 0000 0000 af77 0500 0000 0000 0100 c902 0100 0000  ...f.....w..............
00001e00: 00fe 9c66 0000 0000 af77 0500 0000 0000 0000 0000 0000 0000  ...f.....w..............
00001e18: 00fe 9c66 0000 0000 9073 0700 0000 0000 0400 0400 1a00 0900  ...f.....s..............
00001e30: 00fe 9c66 0000 0000 9073 0700 0000 0000 0100 c902 0000 0000  ...f.....s..............
00001e48: 00fe 9c66 0000 0000 9073 0700 0000 0000 0000 0000 0000 0000  ...f.....s..............
```

The third and fourth bytes so far seem to always be the same: `0x9C66`. They are always followed by a run of zero-bytes. I can only speculate as to why, but for now let's just make a note of it.

Skipping right past the second lump of zero-bytes, we get to what `evtest` shows. Bytes 16-17 make up the event type and bytes 18-19 the event code. Well, how is `0xC902` equal to `713`? Do you see the pattern yet? No? What if I told you that `0x02C9` is `713`? So far we have been assuming a big-endian representation, but the event code has been a dead giveaway that this should be treated as little-endian. Let's rewrite our `xxd` command:

```bash
$ cat /dev/input/event14 | xxd -c 24 -e -g 4

00000000: 669d0446 00000000 00092d91 00000000 00040004 0009001a  F..f.....-..............
00000018: 669d0446 00000000 00092d91 00000000 02c90001 00000001  F..f.....-..............
00000030: 669d0446 00000000 00092d91 00000000 00000000 00000000  F..f.....-..............
00000048: 669d0446 00000000 000a667a 00000000 00040004 0009001a  F..f....zf..............
00000060: 669d0446 00000000 000a667a 00000000 02c90001 00000000  F..f....zf..............
00000078: 669d0446 00000000 000a667a 00000000 00000000 00000000  F..f....zf..............
```

Now you can take the first _integer_, convert it to decimal, and you have a unix timestamp, while the third integer counts the microseconds within that second. Perspective is indeed worth 80 IQ points... Now, I am willing to bet that those are not integers but longs, so the zero-bytes go to the front and not the back, like so:

```bash
$ cat /dev/input/event14 | xxd -c 24 -e -g 8

00000000: 00000000669d0583 00000000000ad3d9 0009001a00040004  ...f....................
00000018: 00000000669d0583 00000000000ad3d9 0000000102c90001  ...f....................
00000030: 00000000669d0583 00000000000ad3d9 0000000000000000  ...f....................
00000048: 00000000669d0583 00000000000c7ffc 0009001a00040004  ...f....................
00000060: 00000000669d0583 00000000000c7ffc 0000000002c90001  ...f....................
00000078: 00000000669d0583 00000000000c7ffc 0000000000000000  ...f....................
```

The first 2 _longs_ make sense, but the rest is made up of smaller types; let me add some spacing:

```bash
00000000: 00000000669d0583 00000000000ad3d9 0009001a 0004 0004  ...f....................
00000018: 00000000669d0583 00000000000ad3d9 00000001 02c9 0001  ...f....................
00000030: 00000000669d0583 00000000000ad3d9 00000000 0000 0000  ...f....................
00000048: 00000000669d0583 00000000000c7ffc 0009001a 0004 0004  ...f....................
00000060: 00000000669d0583 00000000000c7ffc 00000000 02c9 0001  ...f....................
00000078: 00000000669d0583 00000000000c7ffc 00000000 0000 0000  ...f....................
```

I can make out an integer, which is the `value` field shown by `evtest`, and 2 half-integer/short values, which are the event code and the event type. And with that, we have fully understood what's on the wire.

<div align="center">
    <img src="/images/cockpit/packet.png" width="100%"/>
</div>

# What's next?

We have made a huge leap in understanding how the device communicates with the system and what gets sent back and forth. While this marks the end of the investigation for now, we will be coming back to figure out the LEDs and how to actually toggle them.

The next step is to finally start writing code for this. I have yet to understand if we do need an actual driver or if we can build on top of `usbhid` so that will also be explained in the next episode.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: Assessing the situation]]></title>
            <link>https://khepu.com/posts/2024-07-13</link>
            <guid>https://khepu.com/posts/2024-07-13</guid>
            <pubDate>Sat, 13 Jul 2024 19:39:00 GMT</pubDate>
            <content:encoded><![CDATA[
Right, time to get this started, I am going to document my steps and thoughts so I can call myself out once I know more. This is going to be about getting to know the device and its behavior. I expect to feel like I know less at the end than when I started.

# Diagnostics

Baby's first steps: I can see the lights turn on when I plug it in, but I have no idea what the device looks like from the OS perspective. Let's look at what the kernel sees.

```bash
$ dmesg | grep usb
...
[   93.534289] usb 1-1: new full-speed USB device number 4 using xhci_hcd
[   93.688292] usb 1-1: New USB device found, idVendor=4098, idProduct=bed0, bcdDevice= 1.05
[   93.688309] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   93.688316] usb 1-1: Product: WINWING UFC1
[   93.688321] usb 1-1: Manufacturer: Winwing
[   93.688326] usb 1-1: SerialNumber: 412340B2A416D321A3160002
[   93.725773] input: Winwing WINWING UFC1 as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:4098:BED0.0002/input/input16
[   93.726406] hid-generic 0003:4098:BED0.0002: input,hidraw1: USB HID v1.11 Joystick [Winwing WINWING UFC1] on usb-0000:00:14.0-1/input0
[   93.726602] usbcore: registered new interface driver usbhid
[   93.726612] usbhid: USB HID core driver
```

It's great that the manufacturer has properly named the device; I am not sure why they call it a Joystick, but hey, that's a detail. What you need to take note of here is this part: `idVendor=4098, idProduct=bed0`. This will let us look for the device under the input devices and find its handlers.


```bash
$ cat /proc/bus/input/devices

I: Bus=0003 Vendor=4098 Product=bed0 Version=0111
N: Name="Winwing WINWING UFC1"
P: Phys=usb-0000:00:14.0-1/input0
S: Sysfs=/devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:4098:BED0.0002/input/input16
U: Uniq=412340B2A416D321A3160002
H: Handlers=event14 js0
B: PROP=0
B: EV=10001b
B: KEY=1ffffff 0 0 0 0 0 0 ffff00000000 0 0 0 0
B: ABS=f8
B: MSC=10
```

The handler where device events are sent is `event14`, and we can take a peek inside it with: `cat /dev/input/event14 | xxd`. I am still not sure what the other handler is, but I am sure we are going to find out in the future. I am piping everything through `xxd` because the output is binary data.

# Mindless button-pressing

## Standard button
So this is a single button press:

```
00000000: 35be 9266 0000 0000 b34d 0000 0000 0000  5..f.....M......
00000010: 0400 0400 1a00 0900 35be 9266 0000 0000  ........5..f....
00000020: b34d 0000 0000 0000 0100 c902 0100 0000  .M..............
00000030: 35be 9266 0000 0000 b34d 0000 0000 0000  5..f.....M......
00000040: 0000 0000 0000 0000 35be 9266 0000 0000  ........5..f....
00000050: 2dad 0100 0000 0000 0400 0400 1a00 0900  -...............
00000060: 35be 9266 0000 0000 2dad 0100 0000 0000  5..f....-.......
00000070: 0100 c902 0000 0000 35be 9266 0000 0000  ........5..f....
00000080: 2dad 0100 0000 0000 0000 0000 0000 0000  -...............
```

It's hard to show in an article, but those lines arrived at different times, 3 lines at a time. My suspicion is that they correspond to the events "key down", "key pressed", and "key up", in that order. You can also see that these packets do seem to have a standard format: there are just 2 words changing each time to signal the different events, and most of them end up being 0. That's cool, let's press it again; it's obviously going to have some very deterministic behavior and the exact same packets are gonna show up, this is easy. Here are 2 presses of the same button, because I forgot which one was first:

```
00000000: a5c0 9266 0000 0000 cce8 0200 0000 0000  ...f............
00000010: 0400 0400 1a00 0900 a5c0 9266 0000 0000  ...........f....
00000020: cce8 0200 0000 0000 0100 c902 0100 0000  ................

00000030: a5c0 9266 0000 0000 cce8 0200 0000 0000  ...f............
00000040: 0000 0000 0000 0000 a5c0 9266 0000 0000  ...........f....
00000050: a50b 0500 0000 0000 0400 0400 1a00 0900  ................

00000060: a5c0 9266 0000 0000 a50b 0500 0000 0000  ...f............
00000070: 0100 c902 0000 0000 a5c0 9266 0000 0000  ...........f....
00000080: a50b 0500 0000 0000 0000 0000 0000 0000  ................

00000090: a6c0 9266 0000 0000 776f 0d00 0000 0000  ...f....wo......
000000a0: 0400 0400 1a00 0900 a6c0 9266 0000 0000  ...........f....
000000b0: 776f 0d00 0000 0000 0100 c902 0100 0000  wo..............

000000c0: a6c0 9266 0000 0000 776f 0d00 0000 0000  ...f....wo......
000000d0: 0000 0000 0000 0000 a7c0 9266 0000 0000  ...........f....
000000e0: ab77 0000 0000 0000 0400 0400 1a00 0900  .w..............

000000f0: a7c0 9266 0000 0000 ab77 0000 0000 0000  ...f.....w......
00000100: 0100 c902 0000 0000 a7c0 9266 0000 0000  ...........f....
00000110: ab77 0000 0000 0000 0000 0000 0000 0000  .w..............
```

Uh... those are not the same at all. Let's do 2 more of the same:

```
00000000: f9c0 9266 0000 0000 0120 0600 0000 0000  ...f..... ......
00000010: 0400 0400 1a00 0900 f9c0 9266 0000 0000  ...........f....
00000020: 0120 0600 0000 0000 0100 c902 0100 0000  . ..............

00000030: f9c0 9266 0000 0000 0120 0600 0000 0000  ...f..... ......
00000040: 0000 0000 0000 0000 f9c0 9266 0000 0000  ...........f....
00000050: df42 0800 0000 0000 0400 0400 1a00 0900  .B..............

00000060: f9c0 9266 0000 0000 df42 0800 0000 0000  ...f.....B......
00000070: 0100 c902 0000 0000 f9c0 9266 0000 0000  ...........f....
00000080: df42 0800 0000 0000 0000 0000 0000 0000  .B..............

00000090: fac0 9266 0000 0000 6b58 0700 0000 0000  ...f....kX......
000000a0: 0400 0400 1a00 0900 fac0 9266 0000 0000  ...........f....
000000b0: 6b58 0700 0000 0000 0100 c902 0100 0000  kX..............

000000c0: fac0 9266 0000 0000 6b58 0700 0000 0000  ...f....kX......
000000d0: 0000 0000 0000 0000 fac0 9266 0000 0000  ...........f....
000000e0: 527b 0900 0000 0000 0400 0400 1a00 0900  R{..............

000000f0: fac0 9266 0000 0000 527b 0900 0000 0000  ...f....R{......
00000100: 0100 c902 0000 0000 fac0 9266 0000 0000  ...........f....
00000110: 527b 0900 0000 0000 0000 0000 0000 0000  R{..............
```

Crap, every time it's different. There must be some state affecting it... Right, well that's a puzzle for later, certainly an interesting one! Let's do some more experimenting, time to try one of the switches and one of the knobs.

## Knobs

The knobs are actually pretty sensible: at specific points in their rotation they will emit a single packet. Those packets also seem to carry some state and differ at each step, but that's sort of expected. Here are 2 packets:

```
00000000: fdc1 9266 0000 0000 c2a7 0a00 0000 0000  ...f............
00000010: 0300 0300 1e00 0000 fdc1 9266 0000 0000  ...........f....
00000020: c2a7 0a00 0000 0000 0000 0000 0000 0000  ................

00000030: fdc1 9266 0000 0000 7c55 0c00 0000 0000  ...f....|U......
00000040: 0300 0300 1f00 0000 fdc1 9266 0000 0000  ...........f....
00000050: 7c55 0c00 0000 0000 0000 0000 0000 0000  |U..............
```

Again, very specific words are changing; what's odd is that the same ones are changing in the button packets... Another puzzle.

## Switches

Switches are going to be like buttons though, I mean, how much can they differ? Apparently quite a bit... When a switch is in the neutral position at plug-in, it emits no signal; so far so good. But once you flick it either way, it keeps emitting packets, and yes, every time they are also slightly different. When you flick it back to neutral it still emits packets at a rapid rate until you press another button... So in any position a switch will keep emitting packets until another button is pressed, and if a knob is turned in the meantime the packets get multiplexed and you get a stream of both.

## LEDs

I will be ignoring the fact that there are LEDs I can control until I have input sorted out. There is already a lot of stuff going on so we will be coming back to this later!

# What's next?

In theory we just pressed some buttons but this is actually a lot of information and quite a few short-term goals are becoming clearer. To recap, we learned that:

- there are structured packets, each with a length of 48 bytes
- identical actions do not produce identical data, there must be some state mixed in there
- buttons do seem to follow the web-standard for input events
- each type of button has its own behavior, sometimes affected by other buttons

The next thing I want to tackle is being able to identify which button is being pressed. If I can figure out why the packets change, that's even better, but I would be willing to settle for just identifying the button and ignoring the rest for now. There are 2 approaches in my mind: the first is to play a bit more with the buttons while looking at the packets, maybe at their binary representation, until I can identify some pattern. If that doesn't pan out, I have to assume that the developers followed some sort of good USB-related practice that could explain this. I have no clue if such a thing even exists, but it might be worth having a look just in case this is in fact very standard behavior that I don't understand because I lack the background.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cockpit: The plan]]></title>
            <link>https://khepu.com/posts/2024-07-09</link>
            <guid>https://khepu.com/posts/2024-07-09</guid>
            <pubDate>Tue, 09 Jul 2024 23:02:00 GMT</pubDate>
            <content:encoded><![CDATA[
I grew up around military aircraft, if I close my eyes I can hear them flying by and see their imposing figure right in front of me. They always take me back to my childhood, so I've decided to bring a part of them to my job.

There is an absolute ton of stuff I need to have on hand to be even remotely good at what I do, and I kept thinking about getting a macropad or a streamdeck, hooking up some bash scripts, and leaving it at that. But then I wouldn't have a story to tell, nor would I have had any fun...

This was going nowhere, up until I stumbled on this:

<div align="center">
    <img src="/images/cockpit/base.jpeg" width="100%"/>
</div>

It's a replica straight from an F-18 cockpit. Not only does it look cool but it looks far more versatile than any macropad I've seen. But there is an issue...

There is exactly 0 linux support for it. That's not a dealbreaker but it does mean that I will have to write the driver myself. No problem, you just read through the technical specification about how it transmits/receives data and implement it. Yeah, right... The manufacturer provides no such thing.

Not all hope is lost but things are getting increasingly difficult (or interesting). The first option is to decompile the windows driver they provide and try to learn something from there. Option two is to reverse engineer this by plugging in the device, pressing a bunch of buttons and looking at what comes out the other side of the cable.

Funnily enough, even though option 2 sounds more like a shot in the dark, I think it's going to get me some results faster. The only thing that might not come easy is switching the LEDs on the panels on/off. After the driver is working, there is still the problem of managing the functionality. Knobs, switches that likely affect everything around them, this thing has state and it's gonna get messy.

Anyway, this is what's to come, a deep dive into driver development in linux. I haven't touched C since I left university and it's time I brush up on it... Don't worry though, I'll find a place to fit some lisp in there!
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crafting a JSON parser with SIMD: Hidden costs]]></title>
            <link>https://khepu.com/posts/2024-05-14</link>
            <guid>https://khepu.com/posts/2024-05-14</guid>
            <pubDate>Tue, 14 May 2024 20:55:00 GMT</pubDate>
            <content:encoded><![CDATA[

When I originally designed this parser, I decided that all JSON objects would be parsed into hash-maps. They seem like the right tool for the job and, in some sense, that's exactly what JSON objects are, mappings of keys to values.

# Time complexity

Hash-maps lure you in with their appealing time complexity for the average case, let's have a look at what they promise:

|Operation|Average Case|Worst Case|
|--|--|--|
|Insertion|O(1)|O(n)|
|Lookup|O(1)|O(n)|
|Deletion|O(1)|O(n)|

This table may differ slightly depending on the actual implementation, but it will have to do. Hashmaps are sick, right? Most of the time you are dealing with constant time complexity. It's not a lie, but it does distract from what is actually going on.

First, let's compare the hashmap to a humble list:

|Operation| Average/Worst Case|
|--|--|
|Insertion (front)|O(1)|
|Insertion (back)|O(n)|
|Lookup|O(n)|
|Deletion|O(n)|

Looks horrible, and yet, parsing objects into associative lists (alists) instead of hashmaps has sped up my parser by a significant amount...

## Bringing lists up to speed

Here is the object parsing code as was previously:

```lisp
(defun %parse-object (string index)
  (declare (type simple-string string)
           (type fixnum index))
  (let ((current-index (skip-to-next-character string (1+ index)))
        (parsed-object (make-hash-table :test 'equal
                                        :size 10)))
    (declare (type fixnum current-index))
    ;; empty object check
    (when (char= (char string current-index) #\})
      (return-from %parse-object (values parsed-object current-index)))
    (loop with raw-string-length = (length string)
          while (< current-index raw-string-length)
          when (char/= +double-quote+ (char string current-index))
            do (error "Key not of type string at  position ~a!" current-index)
          do (multiple-value-bind (parsed-key new-index)
                 (%parse-string string current-index)
               (setf current-index (skip-to-next-character string new-index))
               (if (char= #\: (char string current-index))
                   (incf current-index)
                   (error "Missing ':' after key '~a' at position ~a!" parsed-key current-index))
               (multiple-value-bind (parsed-value new-index)
                   (parse string current-index)
                 (setf current-index (skip-to-next-character string new-index)
                       (gethash parsed-key parsed-object) parsed-value)
                 (let ((character (char string current-index)))
                   (cond
                     ((char= #\} character)
                      (loop-finish))
                     ((char/= #\, character)
                      (error "Expected ',' after object value. Instead found ~a at position ~a!"
                             character current-index))
                     (t
                      (setf current-index (skip-to-next-character string (incf current-index)))))))))
    (values parsed-object (incf current-index))))
```

and here is the sped up version:

```lisp
(defun %parse-object (string index)
  (declare (type simple-string string)
           (type fixnum index))
  (let ((current-index (skip-to-next-character string (1+ index))))
    (declare (type fixnum current-index))
    ;; empty object check
    (when (char= (char string current-index) #\})
      (return-from %parse-object (values nil current-index)))
    (values
     (loop with raw-string-length = (length string)
           while (< current-index raw-string-length)
           when (char/= +double-quote+ (char string current-index))
             do (error "Key not of type string at position ~a!" current-index)
           collect (multiple-value-bind (parsed-key new-index)
                       (%parse-string string current-index)
                     (setf current-index (skip-to-next-character string new-index))
                     (if (char= #\: (char string current-index))
                         (incf current-index)
                         (error "Missing ':' after key '~a' at position ~a!" parsed-key current-index))
                     (multiple-value-bind (parsed-value new-index)
                         (parse string current-index)
                       (setf current-index (skip-to-next-character string new-index))
                       (cons parsed-key parsed-value)))
           do (let ((character (char string current-index)))
                (cond
                  ((char= #\} character)
                   (loop-finish))
                  ((char/= #\, character)
                   (error "Expected ',' after object value. Instead found ~a at position ~a!"
                          character current-index))
                  (t
                   (setf current-index (skip-to-next-character string (incf current-index)))))))
     (incf current-index))))
```

So we went from your standard hashmap to creating alists:

```lisp
(("key1" . "value1")
 ("key2" . (("key 3" . "value3"))))
```

but why is it faster?

### Cheap appends

Given a random list, if you were asked to append an element to it then yes, it would cost `O(n)`. When you are actually creating it, though, it's a different story. `loop` hides all the magic, so let's expand it:

```lisp
(BLOCK NIL
  (LET ((RAW-STRING-LENGTH (LENGTH STRING)))
    (DECLARE (IGNORABLE RAW-STRING-LENGTH))
    (LET* ((#:LOOP-LIST-HEAD-286 (LIST NIL))
           (#:LOOP-LIST-TAIL-287 #:LOOP-LIST-HEAD-286))
      (DECLARE (DYNAMIC-EXTENT #:LOOP-LIST-HEAD-286))
      (TAGBODY
       SB-LOOP::NEXT-LOOP
        (IF (< CURRENT-INDEX RAW-STRING-LENGTH)
            NIL
            (GO SB-LOOP::END-LOOP))
        (IF (CHAR/= +DOUBLE-QUOTE+ (CHAR STRING CURRENT-INDEX))
            (ERROR "Key not of type string at position ~a!" CURRENT-INDEX))
        (RPLACD #:LOOP-LIST-TAIL-287
                (SETQ #:LOOP-LIST-TAIL-287
                        (LIST
                         (MULTIPLE-VALUE-CALL
                             #'(LAMBDA
                                   (
                                    &OPTIONAL PARSED-KEY NEW-INDEX
                                    &REST #:IGNORE)
                                 (DECLARE (IGNORE #:IGNORE))
                                 (SETQ CURRENT-INDEX
                                         (SKIP-TO-NEXT-CHARACTER STRING
                                          NEW-INDEX))
                                 (IF (CHAR= #\: (CHAR STRING CURRENT-INDEX))
                                     (SETQ CURRENT-INDEX (+ 1 CURRENT-INDEX))
                                     (ERROR
                                      "Missing ':' after key '~a' at position ~a!"
                                      PARSED-KEY CURRENT-INDEX))
                                 (MULTIPLE-VALUE-CALL
                                     #'(LAMBDA
                                           (
                                            &OPTIONAL PARSED-VALUE NEW-INDEX
                                            &REST #:IGNORE)
                                         (DECLARE (IGNORE #:IGNORE))
                                         (SETQ CURRENT-INDEX
                                                 (SKIP-TO-NEXT-CHARACTER STRING
                                                  NEW-INDEX))
                                         (CONS PARSED-KEY PARSED-VALUE))
                                   (PARSE STRING CURRENT-INDEX)))
                           (%PARSE-STRING STRING CURRENT-INDEX)))))
        (LET ((CHARACTER (CHAR STRING CURRENT-INDEX)))
          (DECLARE (TYPE CHARACTER CHARACTER))
          (IF (CHAR= #\} CHARACTER)
              (GO SB-LOOP::END-LOOP)
              (IF (CHAR/= #\, CHARACTER)
                  (ERROR
                   "Expected ',' after object value. Instead found ~a at position ~a!"
                   CHARACTER CURRENT-INDEX)
                  (THE T
                       (SETQ CURRENT-INDEX
                               (SKIP-TO-NEXT-CHARACTER STRING
                                (SETQ CURRENT-INDEX (+ 1 CURRENT-INDEX))))))))
        (GO SB-LOOP::NEXT-LOOP)
       SB-LOOP::END-LOOP
        (RETURN-FROM NIL (TRULY-THE LIST (CDR #:LOOP-LIST-HEAD-286)))))))
```

It does exactly what you would expect. It keeps a reference to the last cons cell of the list, `#:LOOP-LIST-TAIL-287`, and uses that to append in `O(1)` time. But didn't we already have `O(1)` insertion with the hashmap?
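
The same tail-pointer trick can be sketched without the gensym noise; this is my own minimal version, not code from the parser:

```lisp
;; Keep a dummy head cell and a pointer to the last cell; every new element
;; is attached with RPLACD, so each "append" is O(1) and the result is
;; (cdr head), exactly like the LOOP expansion above.
(defun collect-squares (n)
  (let* ((head (list nil)) ; dummy cons, like #:LOOP-LIST-HEAD-286
         (tail head))      ; like #:LOOP-LIST-TAIL-287
    (dotimes (i n (cdr head))
      (rplacd tail (setf tail (list (* i i)))))))

(collect-squares 5) ;; => (0 1 4 9 16)
```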

### O(1) is relative

Just because 2 data structures define one operation as `O(1)` doesn't mean they cost the same; all it means is that each one's cost stays constant as the number of elements in the data structure increases.

Insertion into a hashmap is much more complicated than into a list. First of all, each insertion also bears the cost of the hashing algorithm; sure, we are talking about nanoseconds, but it all adds up. Notice the size set on the hashmap initialization, `:size 10`. When you exceed that size you have to pay the hashing tax again in order to rehash all the keys so they can be moved into a larger structure, while setting the size too high to begin with just wastes memory. It's hard to strike a balance when you don't really know much about the data that will end up in it.
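
To make the rehash point concrete, here is a throwaway sketch (my own, not the parser's code): a table sized for 10 keys that receives 100 entries still works, but only because the implementation grows and rehashes it behind the scenes.

```lisp
;; A table pre-sized for 10 keys; pushing 100 through it forces the
;; implementation to grow the table and rehash every key along the way.
(let ((table (make-hash-table :test #'equal :size 10)))
  (dotimes (i 100)
    (setf (gethash (format nil "key~a" i) table) i))
  (values (hash-table-count table) (gethash "key42" table)))
;; => 100, 42
```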

Then there is the test function: the hashmap actually cares whether 2 keys are identical, which I initially thought was a benefit. But I am just parsing the JSON; if 2 keys are the same then `assoc` will consistently retrieve the first one. Sure, the second one will still be in the structure, but at that point we have been given an "invalid" JSON and all bets are off.
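
As a quick illustration of that `assoc` behavior (hypothetical keys, not from my test data):

```lisp
;; assoc walks the alist front to back and stops at the first match, so a
;; duplicated key shadows the later entry; :test #'equal is needed because
;; the keys are strings.
(let ((parsed '(("id" . 1) ("name" . "Zocca") ("id" . 2))))
  (assoc "id" parsed :test #'equal)) ;; => ("id" . 1)
```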

So in general, creating the alist is going to be cheaper because we save on overhead and have ways to bridge the gap in time complexity.

Lots of little hidden costs to a hashmap. Wouldn't lookups be faster, though? As always, it depends: past a certain number of keys the hashmap will be faster, but below it the alist should win because of the reduced overhead.

# Benchmarks

## Before
```lisp
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  6.362 seconds of real time
  2.015625 seconds of total run time (1.640625 user, 0.375000 system)
  [ Real times consist of 2.084 seconds GC time, and 4.278 seconds non-GC time. ]
  [ Run times consist of 0.625 seconds GC time, and 1.391 seconds non-GC time. ]
  31.69% CPU
  13,438,524,761 processor cycles
  4,875,842,528 bytes consed
```

## After
```lisp
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  4.857 seconds of real time
  0.750000 seconds of total run time (0.671875 user, 0.078125 system)
  [ Real times consist of 0.993 seconds GC time, and 3.864 seconds non-GC time. ]
  [ Run times consist of 0.140 seconds GC time, and 0.610 seconds non-GC time. ]
  15.44% CPU
  10,258,086,921 processor cycles
  3,837,231,136 bytes consed
```

Just about 1.5s cut off, though the real win is how much less work the GC is doing, along with the reduction in bytes consed.

# What's next?

Probably looking into making string parsing a little better. The idea is to extract index ranges instead of substrings in hopes that this reduces memory and saves me some more time.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crafting a JSON parser with SIMD: Descent into madness]]></title>
            <link>https://khepu.com/posts/2024-04-21</link>
            <guid>https://khepu.com/posts/2024-04-21</guid>
            <pubDate>Sun, 21 Apr 2024 20:55:00 GMT</pubDate>
            <description><![CDATA[Let me guide you through how what I thought would take up 20mins ended up being more than I bargained for]]></description>
            <content:encoded><![CDATA[
In the first post of this series I mentioned how this function:

```lisp
(defun unset-rightmost-bit (n)
  (declare (type fixnum n))
  (the fixnum (logand n (- n 1))))
```

could probably be written as a simple assembly instruction. Let me guide you through how what I thought would take up 20mins ended up being more than I bargained for.

# It's just another VOP, right?

So the instruction we are looking for is `BLSR` which just unsets the rightmost set bit, exactly what we need. Let's just create a VOP to use it, much like we did for `BSF`:

```lisp
(in-package :cl-user)

(sb-c:defknown unset-rightmost-bit ((unsigned-byte 64)) (integer 0 64)
    (sb-c:foldable sb-c:flushable sb-c:movable)
  :overwrite-fndb-silently t)

(in-package :sb-vm)

(define-vop (cl-user::unset-rightmost-bit)
  (:policy :fast)
  (:translate cl-user::unset-rightmost-bit)
  (:args (x :scs (any-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst blsr x r)))

(in-package :cl-user)

(defun sth (a)
  (declare (type fixnum a)
           (optimize (speed 3) (safety 0)))
  (unset-rightmost-bit a))
```

which fails with:

```lisp
Undefined instruction: BLSR in
 (INST BLSR X R)
```

Huh!? Grepping through the SBCL repo for `BLSR` does not match anything... Turns out, the compiler does not know how to emit that instruction. But at this point I'm invested. I don't want to just give up and call it a day, which, in hindsight, would have been the wise thing to do... Let's look at how `BSF` was defined as an instruction and just imitate that.

# Teaching the compiler new words

I've already complained that SIMD and define-vop don't really come with any documentation beyond a few sources and examples. Well, `define-instruction` is not something I could find anything on outside the SBCL source code itself. The plan is to look at some examples, hopefully grab one that looks close, adjust it, and have it work.

## Looking at existing instructions

Here is how `BSF` was taught to the compiler:

```lisp
(define-instruction bsf (segment &prefix prefix dst src)
  (:printer ext-reg-reg/mem-no-width ((op #xBC)))
  (:emitter (emit* segment #xBC prefix dst src)))
```

For the compiler to be able to emit an assembly instruction, it must know the bytes of machine code that correspond to it. How do we know what bytes to emit? Open the [Intel](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)/[AMD](https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf) Developer Manual, find the instructions page and look at the opcode.

|Opcode|Instruction|Op/En|64-bit Mode |Compat/Leg Mode|Description|
| --- | --- | --- | --- | --- | --- |
|0F BC /r|BSF r16, r/m16|RM|Valid|Valid|Bit scan forward on r/m16.|
|0F BC /r|BSF r32, r/m32|RM|Valid|Valid|Bit scan forward on r/m32.|
|REX.W + 0F BC /r|BSF r64, r/m64|RM|Valid|N.E.|Bit scan forward on r/m64.|

Looking at the table in Intel's manual, the first 2 entries are pretty straightforward: the instruction corresponds to bytes `0x0F 0xBC`; that's the opcode. `/r` is part of the [ModR/M byte](https://en.wikipedia.org/wiki/ModR/M), which is emitted after the opcode and encodes information about the instruction operands.

The third one is where things get more complicated. Some instructions require a prefix before the opcode. To use `BSF` with 64-bit integers you need to first emit the [REX](https://en.wikipedia.org/wiki/VEX_prefix#REX) prefix with the `W` (width) flag set to `1`. I think in this case it would be `0b01001000`.

I haven't dived deep enough to be able to explain everything, but `r16` indicates a 16-bit register as the destination, while `r/m16` encodes the source: either a 16-bit register or a 16-bit memory operand.

This is a lot of information (it certainly was for me), so here is the recap of the machine-code instruction format:
```lisp
[PREFIX] OPCODE MOD_RM
```

_I am no expert and this might be lacking but it was good enough to guide me to a result and it should be enough to follow along._
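
As a sanity check on that recap, here is my own arithmetic for the REX byte the third table row calls for: a fixed `0100` high nibble plus the `W` flag.

```lisp
;; REX is 0b0100WRXB; setting only W (64-bit operand size) yields #x48,
;; i.e. the 0b01001000 guessed above.
(let ((rex-base #b01000000)   ; fixed 0100 high nibble, R/X/B clear
      (w-flag   #b00001000))  ; W = 1: 64-bit operand size
  (logior rex-base w-flag)) ;; => 72, which is #x48
```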

Time to pull up the same table for `BLSR` and see if it's making any sense.

|Opcode/Instruction|Op/En|64/32-bit Mode|CPUID Feature Flag|Description|
|---|---|---|---|---|
|VEX.LZ.0F38.W0 F3 /1 BLSR r32, r/m32|VM|V/V|BMI1|Reset lowest set bit of r/m32, keep all other bits of r/m32 and write result to r32.|
|VEX.LZ.0F38.W1 F3 /1 BLSR r64, r/m64|VM|V/N.E.|BMI1|Reset lowest set bit of r/m64, keep all other bits of r/m64 and write result to r64.|

This is starting to feel a lot like the difference between what's given for homework and what's on the final test...

## VEX Prefix

To my dismay, [VEX](https://en.wikipedia.org/wiki/VEX_prefix#VEX3) is far more complicated than REX... VEX was created to support instructions for SSE and AVX (among others); its encoding allows access to XMM (128-bit) and YMM (256-bit) registers. I did not realize it at first, but this is a big deal: it impacts not only the prefix emitted but also the ModR/M byte.

The register section of the ModR/M byte is repurposed for VEX instructions. Notice the `/1` in the opcode: this is the value to be placed in that register field, but what does it mean? Apparently, there is a whole set of instructions sharing the same prefix and opcode as `BLSR`, and the only way to indicate which one you mean is by setting that value. As an example, `/2` here would indicate `BLSMSK` instead.

The VEX prefix can be either 2 or 3 bytes long; for `BLSR` it's going to be 3, which you can tell by the `0x0F38` in the opcode (if it were just `0x0F` it would be 2). This already gives us the first byte of the prefix, 1/3 of the way there! Yeah, right...

<div align="center">
  <img src="/images/vex/blsr.png" width="100%"/>
</div>

The first byte is `0xC4`; treat it like a constant, a prefix prefix if you will. It all goes downhill from here. This is the second byte: `0b11100010`. The first three 1s are easy to guess: the opcode does not use `R`, `X`, or `B`, and their inverted values fill the first 3 bits. The next 5 bits select the opcode map to be used. This is where the Intel manual starts to disappoint, and I found a way forward through the AMD manual, which clearly defines `map_select` for `BLSR` as 2. Actually, the AMD manual is great because it almost gives us the third byte too!

Remember `W` from the REX prefix? It's pretty much the same here: it sets the top bit of the third byte. `vvvv` encodes the destination register, but we don't know which register will be used at this time... This is where we jump back into lisp land:

```lisp
(logior #b10000000
        (ash (lognot (reg-id-num (reg-id dst))) 3))
```

We can just ask the compiler for the register id of the destination provided, invert it, and stuff it into place. `L` (vector length) is 0; in the Intel manual this is indicated by `LZ`. `pp` is just 2 more prefix bits. While the Intel manual does not explicitly define them for this instruction, it does give a hint in Volume 2, Chapter 3.1.1.2.
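
Putting it together, here is my own sketch of the three prefix bytes for `BLSR r64` as derived above. `dest-id` is a hypothetical stand-in for the destination register id (0-15), which is only known at emit time; the function name is mine, not SBCL's.

```lisp
;; Hedged sketch: the 3-byte VEX prefix for BLSR r64, byte by byte.
(defun vex3-bytes-for-blsr (dest-id)
  (list #xC4                              ; byte 1: 3-byte VEX escape
        (logior (ash #b111 5) #b00010)    ; byte 2: ~R ~X ~B = 111, map_select = 2 (0F38)
        (logior #b10000000                ; byte 3: W = 1 (64-bit operands)
                ;; ~vvvv in bits 6-3 (inverted dest register), L = 0, pp = 00
                (ash (logand (lognot dest-id) #b1111) 3))))

(vex3-bytes-for-blsr 0) ;; => (196 226 248)
```

Unlike the snippet above, I mask the inverted register id down to 4 bits so it stays inside the `vvvv` field.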

While writing this, it all seems a lot simpler: the manuals make sense and hold all the information you need. But the first time you go in, it feels chaotic and incomprehensible. Every little character in those initial tables packs so much information, and unless you understand all of it you cannot move forward.

## A new instruction is born

We've already done all the hard work, now it's all about emitting the bytes:

```lisp
(in-package :sb-x86-64-asm)

(sb-assem::define-instruction blsr (segment dst src)
  (:printer ...)
  (:emitter
   (let ((size (matching-operand-size dst src)))
     (when (not (eq size :qword))
       (error "BLSR instruction only supported for 64-bit operand size")))
   (emit-byte segment #xC4)
   (emit-byte segment #b11100010)
   (emit-byte segment (logior #b10000000
                              (ash (lognot (reg-id-num (reg-id dst))) 3)))
   ;; opcode
   (emit-byte segment #xf3)
   ;; mod-r/m byte
   (emit-byte segment (logior (ash (logior #b11000 1) 3)
                              (reg-encoding src segment)))))
```

Keep in mind that `define-instruction` is *removed* after SBCL compiles itself. So I did what any sensible person would and just copied it over from the source code, along with `%def-inst-encoder`.

The only thing we haven't covered is the ModR/M byte. Bits 7 and 6 indicate "register addressing" mode. Bits 5, 4, and 3 are repurposed here to indicate the variant (`/1`), while what's left encodes the source register. It's been a while since I put this much effort into 7 lines of code; going through such a journey just to hack this together has really expanded my understanding.
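
Spelled out field by field, that ModR/M byte looks like this (my own sketch; `src-id` is a hypothetical stand-in for the source register's encoding):

```lisp
;; ModR/M for BLSR: mod = #b11 (register-direct), reg = 1 (the /1 opcode
;; extension selecting BLSR), rm = the source register's low 3 bits.
(defun blsr-modrm (src-id)
  (logior (ash #b11 6)             ; mod: register addressing
          (ash 1 3)                ; reg field repurposed as /1
          (logand src-id #b111)))  ; rm: source register

(format nil "~8,'0b" (blsr-modrm 0)) ;; => "11001000"
```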

Enough peacocking, let's get back to work:

```lisp
(define-vop (unset-rightmost-bit)
  (:policy :fast)
  (:translate unset-rightmost-bit)
  (:args (x :scs (unsigned-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst blsr x r)))
```

Do I honestly think that this is going to make a difference in speed? Not really, but I just had to give it a try.

# Benchmarks

No benchmarks this time: `unset-rightmost-bit` was called only when un-escaping characters. My sample JSON did not contain any, and escapes are probably not common enough for this to matter.

# What's next?

Since SBCL is lacking this and other similar instructions, I want to attempt an actual contribution that adds all of them. But it's going to take some polishing. Did you notice how I casually omitted the `:printer` part of the instruction? I didn't do it to save space; I hid it because it's not fully done yet.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crafting a JSON parser with SIMD: Slowing down to go faster]]></title>
            <link>https://khepu.com/posts/2024-03-29</link>
            <guid>https://khepu.com/posts/2024-03-29</guid>
            <pubDate>Fri, 29 Mar 2024 20:23:00 GMT</pubDate>
            <content:encoded><![CDATA[
When I started this project, I chose the AVX instruction set because it was the only one that could process 256-bit vectors. Every other one (excluding AVX512) was limited to 128 bits. This translates to processing 32 characters per instruction. Sounds way better than 16, right? In theory it should be faster, but calling out to AVX is not free: there is some overhead when we carve chunks out of the string being parsed in order to build the SIMD pack.

Since we are taking the time to create the SIMD pack, we want to squeeze as much information out of it as possible. In some cases, such as parsing strings, it's worth reusing the bitmask calculated from a chunk to look for both characters that need to be escaped and double-quotes. In every other case, though, we are only interested in the rightmost set bit of the bitmask, as it marks a pivot point after which we need to look for a completely different character, making the bitmask pretty much single-use.

That would make the ideal bitmap look like this:

```lisp
1000 0000 0000 0000 0000 0000 0000 0000
```

Skipping as many characters as possible within the chunk while still finding the anchor point. My thinking is that this is very unlikely; in fact, the opposite is more likely to be true. Unless someone creates a horribly formatted JSON on purpose, the anchor points are not going to be far apart. Even JSON keys are unlikely to be over 10 characters.

Let's look at an example, here is the API response from an [OpenWeather endpoint](https://openweathermap.org/current):
```javascript
{
  "coord": {
    "lon": 10.99,
    "lat": 44.34
  },
  "weather": [
    {
      "id": 501,
      "main": "Rain",
      "description": "moderate rain",
      "icon": "10d"
    }
  ],
  "base": "stations",
  "main": {
    "temp": 298.48,
    "feels_like": 298.74,
    "temp_min": 297.56,
    "temp_max": 300.05,
    "pressure": 1015,
    "humidity": 64,
    "sea_level": 1015,
    "grnd_level": 933
  },
  "visibility": 10000,
  "wind": {
    "speed": 0.62,
    "deg": 349,
    "gust": 1.18
  },
  "rain": {
    "1h": 3.16
  },
  "clouds": {
    "all": 100
  },
  "dt": 1661870592,
  "sys": {
    "type": 2,
    "id": 2075663,
    "country": "IT",
    "sunrise": 1661834187,
    "sunset": 1661882248
  },
  "timezone": 7200,
  "id": 3163858,
  "name": "Zocca",
  "cod": 200
}
```

Pay attention to the anchor points and how far apart they are. The end-of-string quote for keys is adjacent to `:` and the value begins after a single space. The `,` is also adjacent to the end of the value, and the next key begins after a linefeed and 2-6 spaces. Data is relatively closely packed, and this is the human-readable version; it's not uncommon to have the linefeeds and indentation removed to decrease size.

Now let's look at the keys: `description` is the longest at 11 characters, while the average is much shorter. Even the longest string value would comfortably fit in a 16-character vector. We could look at 100 more examples and collect more accurate statistics, but I feel confident that this is a fair assumption to draw and build on. Paying the cost to move 32 characters into a SIMD pack seems rather wasteful and unlikely to help. AVX does support 128-bit vector instructions as well, so it's worth a test just to get an idea of the difference it would make.
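
The 16-wide comparison would look just like the 32-wide one; note I am assuming here that `sb-simd-avx` mirrors the `sb-simd-avx2` names used elsewhere in this series (`u8.16-aref`, `u8.16=`, `u8.16-movemask`), so treat this as a sketch:

```lisp
;; Hedged sketch of a 16-character double-quote scan; bits 2 and 15 of the
;; resulting mask should come back set for this input.
(let ((chunk (make-array 16 :element-type '(unsigned-byte 8)
                            :initial-contents
                            (map 'list #'char-code "ab\"cdefghijklmn\""))))
  (sb-simd-avx:u8.16-movemask
   (sb-simd-avx:u8.16= (sb-simd-avx:u8.16-aref chunk 0)
                       (sb-simd-avx:u8.16 (char-code #\")))))
```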

# Benchmarks

```lisp
;; chunk size of 32
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  10.616 seconds of real time
  0.984375 seconds of total run time (0.890625 user, 0.093750 system)
  [ Real times consist of 2.251 seconds GC time, and 8.365 seconds non-GC time. ]
  [ Run times consist of 0.109 seconds GC time, and 0.876 seconds non-GC time. ]
  9.27% CPU
  22,422,362,573 processor cycles
  5,323,339,984 bytes consed

;; chunk size of 16
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  6.362 seconds of real time
  2.015625 seconds of total run time (1.640625 user, 0.375000 system)
  [ Real times consist of 2.084 seconds GC time, and 4.278 seconds non-GC time. ]
  [ Run times consist of 0.625 seconds GC time, and 1.391 seconds non-GC time. ]
  31.69% CPU
  13,438,524,761 processor cycles
  4,875,842,528 bytes consed
```

Good improvement, not only with respect to time but also space.

## Comparing to standard parsers

More importantly, this change marks the point where the SIMD implementation became faster than other CL parsers. Here are the benchmarks for [cl-json](https://github.com/hankhero/cl-json) and [jsown](https://github.com/madnificent/jsown).

```lisp
(time (loop for i from 0 to 100 do (cl-json:decode-json-from-string *data*)))
Evaluation took:
  29.079 seconds of real time
  9.562500 seconds of total run time (8.625000 user, 0.937500 system)
  [ Real times consist of 3.137 seconds GC time, and 25.942 seconds non-GC time. ]
  [ Run times consist of 0.906 seconds GC time, and 8.657 seconds non-GC time. ]
  32.88% CPU
  61,415,718,331 processor cycles
  25,724,477,840 bytes consed

(time (loop for i from 0 to 100 do (jsown::parse *data*)))
Evaluation took:
  9.593 seconds of real time
  1.750000 seconds of total run time (1.593750 user, 0.156250 system)
  [ Real times consist of 1.497 seconds GC time, and 8.096 seconds non-GC time. ]
  [ Run times consist of 0.296 seconds GC time, and 1.454 seconds non-GC time. ]
  18.24% CPU
  20,261,563,049 processor cycles
  5,149,047,872 bytes consed
```

To be completely fair, both of these libraries support a lot more features than mine which could be dragging down their performance. Shout out to `jsown` as it took a lot of effort to outperform.

# Closing thoughts

Using SIMD is not a free ticket to performance. Parsing becomes more complex and there are a lot of details that can drag down performance. It took a lot of iterations to get something that is even slightly better than what's already out there for lisp.

This improvement has probably been the easiest to implement, code-wise, and it only makes parsing faster when the assumption holds true. If JSON object keys were larger than 16 characters or anchor points were further apart then this would actually slow things down quite a bit. Speed improvement might translate quite differently depending on the contents but I think for most cases this is going to be better.

There is still the option of doing things in a hybrid way and utilizing both vector sizes though I feel like the complexity added outweighs the benefits.

## What's next?

You might have noticed that out of the `6.3s` of runtime, `2s` are spent GCing. If allocations were to be reduced we might see another speed up due to the GC running less. So that's where I will be focusing next, either trying to stack allocate things or pulling some other trick.

## What about SSE?

SSE could be a more portable option; it's been around longer. For the sake of completeness, I did give it a try, and it performed almost identically to AVX with 128-bit vectors. It does come with some downsides though.

The first is that it has fewer instructions available; even with my limited use of AVX I ran into this during the conversion, where `u8.16/=` was not available in SSE. SSE also has fewer registers available for use, and since we can do the same thing with AVX I don't see any reason to switch to it completely.

Here is the [commit](https://github.com/Khepu/jsoon/commit/77b5ca80209b5181870bdeb3c9e3d3ad985e1e8b) with SSE being used.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crafting a JSON parser with SIMD: Finding the rightmost set bit]]></title>
            <link>https://khepu.com/posts/2024-03-22</link>
            <guid>https://khepu.com/posts/2024-03-22</guid>
            <pubDate>Fri, 22 Mar 2024 20:24:00 GMT</pubDate>
            <content:encoded><![CDATA[
One of the more common uses of SIMD in this project is character comparison. Instead of comparing characters one by one, you load their numerical values into an unsigned-byte array of length 32 (`32 * 8 bits = 256 bits`, which is the size of the AVX SIMD pack) and then compare it (`sb-simd-avx2:u8.32=`) with the numeric value of another character. What you get back from this comparison is another SIMD pack where values are either `0`, for false, or `255`, for true.

```lisp
(sb-simd-avx2:u8.32= my-chunk
                     (sb-simd-avx2:u8.32 (char-code #\")))
```

Obviously, looping over that vector largely defeats the purpose of using SIMD. Thankfully, there is a better way, though a slightly more convoluted one... You can take the result of the previous comparison and turn it into a bitmap using `sb-simd-avx2:u8.32-movemask`. `movemask` is a tricky one; it took me a while to understand why the result is exactly what it's supposed to be. Let's look at an example:

```lisp
(defun why-is-my-bitmap-like-that ()
  (let ((chunk (make-array (list +chunk-length+)
                           :element-type '(unsigned-byte 8)
                           :initial-element 0))
        (pack-of-1s (sb-simd-avx2:u8.32 1)))
    (setf (aref chunk 1) 1)
    (let* ((chunk (sb-simd-avx2:u8.32-aref chunk 0)) ;; convert the array to a SIMD pack
           (pack-one-p (sb-simd-avx2:u8.32= chunk pack-of-1s))
           (bitmap (sb-simd-avx2:u8.32-movemask pack-one-p)))
      (format nil "~b" bitmap))))

(why-is-my-bitmap-like-that) ;; => "10"
```

I would expect my result to be: `"0100 0000 0000 0000 0000 0000 0000 0000"`. Well, it's "backwards". Not really: the first item of the array has been moved into the first bit of the bitmap, and all leading zeroes have been omitted. Makes a lot of sense when you think about it like that...
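
The mapping is easier to see spelled out in scalar code; this is my own illustration of what `movemask` computes, on a shortened 4-byte input:

```lisp
;; movemask takes the high bit of each byte and drops it into the bit
;; position matching that byte's index, so array order maps to ascending
;; bit positions.
(defun scalar-movemask (bytes)
  (loop for b across bytes
        for i from 0
        sum (ash (ldb (byte 1 7) b) i)))

;; A 255 at index 1 sets bit 1, matching the "10" printed above.
(format nil "~b" (scalar-movemask #(0 255 0 0))) ;; => "10"
```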

How does this help us? The 1s in the bitmap represent all the characters in the initial string that are of interest, and we can skip over anything that's a 0. To do that, my initial idea was a bithack:

```lisp
(defun rightmost-bit (n)
  (declare (type fixnum n))
  (the fixnum
    (logand (1+ (lognot n)) n)))

(defun rightmost-bit-index (n)
  "Returns a 0-based index of the location of the rightmost set bit of `n'."
  (the fixnum (truncate (log (rightmost-bit n) 2))))

(defun unset-rightmost-bit (n)
  (declare (type fixnum n))
  (logxor n (rightmost-bit n)))
```

You take a number `n`, flip the bits, add 1, and bitwise-and the result with the original number; you end up isolating the rightmost set bit. Almost there; now comes the ugly part. We need to convert that number into the 0-based index of the set bit in order to use it as an offset and locate the original character in the string we are parsing. There are really 2 ways to do this (that I know of):
1. use the `log` function with base 2 (what you see above)
2. basically brute force it by shifting a bit from index 0 to the left until it equals the original number.

Looking for something built into SBCL did not lead anywhere either. I hate both, for different reasons. It was good enough for a while but at some point `rightmost-bit-index` popped up in my profiler:

```lisp
  seconds  |     gc     |    consed   |    calls   |  sec/call  |  name
--------------------------------------------------------------
     0.172 |      0.000 |           0 |  1,579,605 |   0.000000 | RIGHTMOST-BIT-INDEX
     0.125 |      0.000 |           0 |  3,672,540 |   0.000000 | SKIP-TO-NEXT-CHARACTER
     0.125 |      0.000 |           0 |  5,326,737 |   0.000000 | NOT-WHITESPACE-P
     0.094 |      0.016 | 227,813,712 |  1,579,605 |   0.000000 | CHUNK
     0.078 |      0.000 |  44,410,608 |    748,191 |   0.000000 | %PARSE-STRING
     0.078 |      0.000 |  13,146,592 |    261,399 |   0.000000 | %PARSE-ARRAY
     0.078 |      0.000 |           0 |  1,285,566 |   0.000000 | PARSE
     0.078 |      0.000 |  57,041,792 |     90,003 |   0.000001 | %PARSE-OBJECT
     0.031 |      0.000 |           0 |    604,161 |   0.000000 | %PARSE-NUMBER
     0.031 |      0.000 |   6,520,832 |    402,774 |   0.000000 | %PARSE-DECIMAL-SEGMENT
     0.016 |      0.000 |           0 |  1,496,382 |   0.000000 | NEXT-OFFSET
     0.000 |      0.000 |           0 |      1,818 |   0.000000 | %PARSE-NULL
     0.000 |      0.000 |           0 |  1,579,605 |   0.000000 | RIGHTMOST-BIT
--------------------------------------------------------------
     0.906 |      0.016 | 348,933,536 | 18,628,386 |            | Total
```
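
In hindsight, there is a third, portable route: standard CL's `integer-length` applied to the isolated bit yields the same 0-based index without `log`'s float round-trip. My own sketch:

```lisp
;; (logand n (- n)) isolates the rightmost set bit, same as the bithack
;; above; INTEGER-LENGTH of a power of two is its bit index plus one.
(defun rightmost-bit-index* (n)
  (declare (type fixnum n))
  (1- (integer-length (logand n (- n)))))

(rightmost-bit-index* 12) ;; => 2
```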

Throughout my journey with SIMD my biggest resource was people playing with similar things in C/C++. Most of the time I could look into what they are doing and roughly translate that to CL. That's how I learned the basics of SIMD, documentation for it in SBCL is scarce to say the least.

This time I was less enthusiastic to see their solution. Modern CPUs have a specific set of [instructions](https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set) to do these kinds of operations. If they are cheating then so am I.

# Defining new intrinsics

`BSF`, or bit-scan forward, does almost the same job as `rightmost-bit-index`, only faster, because it's baked into your CPU and exposed as a single assembly instruction. The one difference is that here its result comes out 1 higher than the 0-based index we want, so it needs a small correction. The issue is that SBCL does not expose it in any way. The only option that remains is to expose it ourselves through a virtual operator.

If SIMD in SBCL feels like a dark room with no light switch, then defining your own virtual operators feels like you are in a dark forest. It's one of those things that you have to learn by reading through the existing examples, and there are many in SBCL's repo but I still felt lost. This is the only real [resource](https://pvk.ca/Blog/2014/08/16/how-to-define-new-intrinsics-in-sbcl/) that explains some of the things you see used in virtual operators and that's what I blindly followed:

```lisp
(in-package :cl-user)

(sb-c:defknown bit-scan-forward ((unsigned-byte 64)) (integer 0 64)
    (sb-c:foldable sb-c:flushable sb-c:movable)
  :overwrite-fndb-silently t)

(in-package :sb-vm)

(define-vop (cl-user::bit-scan-forward)
  (:policy :fast)
  (:translate cl-user::bit-scan-forward)
  (:args (x :scs (any-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst bsf x r)
    (inst dec r)))

(in-package :cl-user)

(bit-scan-forward 3) ;; => 0
```

Here is how the generator is read:
- apply `BSF` to `x` and store it in `r` (extracting the rightmost-set-bit)
- decrement `r` by 1 (converting it to 0-based)

Converting it to 0-based does imply that any value passed to `bit-scan-forward` must be a positive `fixnum`, so we have to check in advance.

## Downsides

You might be wondering, is there a downside in doing this? Always. I am not going to mention any portability concerns as those went out the window the moment I decided to use SIMD for this. The real downside I noticed is that the compiler will not generate the assembly we provided for `bit-scan-forward` unless it is absolutely certain that it is being given a `fixnum`. In any other case it will scream out `UNDEFINED FUNCTION`. So you do have to be extra careful when calling it.

If you do wrap it in its own function that ensures the argument is a `fixnum`, then you take a small performance hit: the function call overhead is significant given that we are only executing 2 lines of assembly inside it. And if you inline that function, you seem to lose the type guarantee/check. A nasty trap!

# Benchmarks

Let's look at the difference. First the 2 functions in isolation:

```lisp
> (time (loop for i from 1 to 100000000 do (rightmost-bit-index i)))
Evaluation took:
  1.284 seconds of real time
  0.375000 seconds of total run time (0.375000 user, 0.000000 system)
  29.21% CPU
  2,713,127,471 processor cycles
  0 bytes consed

> (time (loop for i from 1 to 100000000 do (bit-scan-forward i)))
Evaluation took:
  0.011 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  24,539,824 processor cycles
  0 bytes consed
```

A nice 11,572% improvement. Not too bad. But the real test is hooking it up in the parser and seeing how that was affected. Here we go!

```lisp
;; before (rightmost-bit-index)
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  17.192 seconds of real time
  1.796875 seconds of total run time (1.703125 user, 0.093750 system)
  [ Real times consist of 2.212 seconds GC time, and 14.980 seconds non-GC time. ]
  [ Run times consist of 0.171 seconds GC time, and 1.626 seconds non-GC time. ]
  10.45% CPU
  36,311,175,076 processor cycles
  5,323,226,416 bytes consed

;; after (bit-scan-forward)
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  10.616 seconds of real time
  0.984375 seconds of total run time (0.890625 user, 0.093750 system)
  [ Real times consist of 2.251 seconds GC time, and 8.365 seconds non-GC time. ]
  [ Run times consist of 0.109 seconds GC time, and 0.876 seconds non-GC time. ]
  9.27% CPU
  22,422,362,573 processor cycles
  5,323,339,984 bytes consed
```

_NOTE: `*data*` here is a 10 MB JSON document_

The speedup translates quite well! Now, all this work only let us get rid of `rightmost-bit-index`; we are still using `rightmost-bit` and `unset-rightmost-bit`. Those are not too bad, but we can repeat the same process with the equivalent BMI instructions and speed this up even more.
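
For context, `rightmost-bit` and `unset-rightmost-bit` presumably correspond to the classic bit tricks that the BMI instructions BLSI and BLSR implement in hardware; the portable definitions would look something like this (my reconstruction, not necessarily the post's code):

```lisp
(defun rightmost-bit (x)
  "Isolate the lowest set bit: x AND -x (what BLSI computes)."
  (logand x (- x)))

(defun unset-rightmost-bit (x)
  "Clear the lowest set bit: x AND (x - 1) (what BLSR computes)."
  (logand x (1- x)))
```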

Check the code [here](https://github.com/Khepu/jsoon).

*EDIT*: Of course, after I finished writing this it occurred to me to grep the SBCL source code for `bsf` and found a similar vop named `unsigned-word-find-first-bit`.
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Merging cron expressions]]></title>
            <link>https://khepu.com/posts/2024-02-25</link>
            <guid>https://khepu.com/posts/2024-02-25</guid>
            <pubDate>Sun, 25 Feb 2024 11:00:00 GMT</pubDate>
            <content:encoded><![CDATA[
Have you ever wondered when 2 crons can be merged? Take some time to think about it for yourself, it's a fun problem.

Let's start by defining what merge means here. The cron expression that is produced by this operation would need to have all the trigger times of the first, plus all the trigger times of the second, without introducing any new ones.

Simple, right? It actually is. As long as the expressions differ in only one segment, they are easy to merge:
```
'0 1 * * *' + '0 2-4 * * *' = '0 1-4 * * *'
'0 1 JAN * *' + '0 1 JUL * *' = '0 1 * JAN,JUL *'
```

If they differ in more than one segment then merging them would introduce new trigger times, which is unwanted.
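
As a quick sketch, representing a cron as a 5-element list of segment strings (a representation I am assuming here, not one from the post), the check boils down to counting differing segments:

```lisp
;; Illustrative only: crons are 5-element lists like ("0" "1" "*" "*" "*").
(defun mergeable-p (a b)
  "T when crons A and B differ in exactly one segment."
  (= 1 (count nil (mapcar #'string= a b))))
```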

# Merging multiple cron expressions

Here is where it gets tricky. Consider the following 3 expressions:
1. `0 1 * * *`
2. `30 1 * * *`
3. `0 2 * * *`

Both 2 and 3 can be merged with 1 but, after doing so, the resulting cron would no longer be compatible with what is left. The order in which crons are merged matters! Thinking about crons as a list does not help here, so it's time to switch to graphs. The expressions will be nodes, and edges will connect expressions that can be merged. Edges are bidirectional and are labeled with the segment over which the merge is possible.

<div align="center">
  <img src="/images/cron-merging/image-1.png" width="100%"/>
</div>

Seeing the problem visualized, the first question that pops up is: how many edges can there be between 2 nodes? It turns out there are only 3 options: 0 if they cannot be merged, 1 if they can, and 5 if they are exactly the same. We can limit this to just the first 2 cases by considering only a unique set of crons, eliminating any duplicates.

The graph is great because it poses another very direct question: Which way do you merge? For this example it does not matter, you are going to end up with 2 expressions that cannot be further simplified, but let's look at a larger example to put things into perspective.

<div align="center">
  <img src="/images/cron-merging/image-2.png" width="100%"/>
</div>

## Coming up with an algorithm

Where do we even begin?

Before we start answering that, we need to know how a merge affects the graph. We noted earlier that merging over the `minute` segment produced a cron that was no longer able to merge over the `hour` segment. If you mentally go through the graph and do the merges, you will notice that every merge of type `A` results in a cron that precludes all merges of type `B`.

This is great: it lets us calculate the amount of "damage" we would be doing to the graph. I call it damage because, to perform the greatest number of merges, we want the graph to preserve as many edges as possible between operations; otherwise we will end up with a sub-optimal solution.

On the other hand, edges of the same type are preserved, so we can count on those to stay after the merge. Knowing that we can keep following merges of one type does let us visualize paths in the graph and you might even recognize some [Strongly Connected Components](https://en.wikipedia.org/wiki/Strongly_connected_component) of the same edge type. It's an interesting concept but it goes against our principle of causing the least damage possible to the graph.

### What if we started with the least harmful merge?

It's a simple idea, graph theory refers to this as a [Minimum Cut](https://en.wikipedia.org/wiki/Minimum_cut), though ours is a little special... In order for minimum cut to work we need to alter it so that it fits our problem.

Minimum cut allows you to arbitrarily split a graph in 2, but we want to limit this to just 2 nodes at a time. Our minimum cut would also need to take into account the edges that would not be lost because of the merge type; those don't really count as damage.

So, here is the process:
1. Isolate 2 nodes through minimum cut
2. Produce a new node by merging them
3. Insert the new node into the graph
4. Recalculate the edges
5. Repeat until there are no more edges
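
The steps above can be sketched in code, with some simplifications loudly flagged: crons are 5-element lists of segment strings (my representation, not the post's), merging comma-joins the differing segment (so `1` + `2-4` becomes the equivalent but unnormalized `1,2-4` rather than `1-4`), and the pair is picked greedily instead of via a true minimum cut:

```lisp
(defun one-diff (a b)
  "Index of the single differing segment between crons A and B, or NIL."
  (let ((diffs (loop for x in a for y in b for i from 0
                     unless (string= x y) collect i)))
    (when (= 1 (length diffs))
      (first diffs))))

(defun join-at (a b i)
  "Merge cron B into cron A by comma-joining segment I."
  (loop for x in a for y in b for j from 0
        collect (if (= i j) (concatenate 'string x "," y) x)))

(defun condense (crons)
  "Repeatedly merge mergeable pairs until no edges remain (greedy order)."
  (loop
    (let ((pair (loop named scan
                      for (a . rest) on crons
                      do (loop for b in rest
                               for i = (one-diff a b)
                               when i do (return-from scan (list a b i))))))
      (unless pair
        (return crons))
      (destructuring-bind (a b i) pair
        (setf crons (cons (join-at a b i)
                          (remove b (remove a crons :test #'equal)
                                  :test #'equal)))))))
```

With the three expressions from the earlier example, this reduces the set to 2 crons that cannot be simplified further.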

Example of a minimum cut in the above graph:
<div align="center">
  <img src="/images/cron-merging/image-3.png" width="100%"/>
</div>

Only 2 connections are permanently removed: `0 1 * * * -hour- 0 2 * * *` and `0 1 * * * -hour- 0 3 * * *`. There might be other pairs that would result in an equivalent minimum cut, and at that point order should not really matter. Choosing one over the other could lead to slightly different results, though everything should still be reduced down to the same number of expressions.

# Final thoughts

I wrote the [code](https://github.com/Khepu/cron-condenser) for this a while ago; while it is unoptimized, I probably will not spend more time improving it. If I ever return to this project, it will be to try my hand at a formal proof that the above algorithm works as well as I think it does.
]]></content:encoded>
        </item>
    </channel>
</rss>