Battleplans: Befriending the cache

After haveing a working method for generating Dijkstra maps, it's time to understand if they are viable for our needs here. They are not generated at every frame but it needs to be fast enough for things to feel responsive.

Let's first talk about when they get computed and when we can do hacky things to avoid an expensive computation.

When a user is directly commanding units then we definitely need to produce a fresh one, we need to pay this upfront cost as reusing another one would not really save us any work. The longer it takes, though, for a command to complete the more stale the map becomes and we cannot really have the behavior map deviate too much from what the player sees.

When buildings get added/deleted then it is easy enough to patch that area of the map and those should not be frequent enough to make this problematic. What about units though? You can hardly expect them to sit in one place for long and keeping track of their movement in multiple behavior maps seems like too much work. The easy solution here is to exclude them completely from this computation, we simply do not care about unit positioning here and it will be handled elsewhere, like an occupancy map or similar.

Another problem appears when we consider multiple goals being present in a single behavior map, once one of the goals is completed then what? We could recompute but that also sounds expensive, depending on how long-lived goals are. To deal with that, once a goal has been completed we will simply replace it with a "reverse-goal". Essentially, a goal with very low priority which will have a repulsive effect on units, pushing them away. The good thing about this approach is that it is easy to patch onto an existing map. The reverse-goal will act as a flood fill up until it merges with the "borders" of the other goals. This will reuse the existing generation method, keeping things simple.

We now have a good sense of when a full computation is required though my original implementation is quite far from usable even in what we described above. On a 1000x1000 grid this is how long it takes to populate the behavior map:

Evaluation took:
  1.465 seconds of real time
  0.671875 seconds of total run time (0.515625 user, 0.156250 system)
  [ Real times consist of 1.045 seconds GC time, and 0.420 seconds non-GC time. ]
  [ Run times consist of 0.421 seconds GC time, and 0.251 seconds non-GC time. ]
  45.87% CPU
  3,095,285,390 processor cycles
  896,070,112 bytes consed

and here is the implementation:

(defun populate-behavior-grid (field goals)
 (let* ((width (field-width field))
        (height (field-height field))
        (behavior (make-array (list (field-width field)
                                    (field-height field))
                              :initial-element (1+ (max width height))))
        (grid (field-grid field))
        (queue (make-queue)))
   (dolist (goal goals)
     (enqueue goal queue))

   (loop for goal = (dequeue queue)
         for coords = (goal-coords goal)
         for x = (truncate (vx coords))
         for y = (truncate (vy coords))
         for priority = (goal-priority goal)
         unless (or
                 (let ((current-value (aref behavior x y)))
                   (<= current-value priority))
                 (let ((map-value (aref grid x y)))
                   (eql map-value +occupied+)))
           do (let ((new-priority (1+ priority)))
                (setf (aref behavior x y) priority)
                (dolist (neighbor (neighbors coords width height))
                  (enqueue (make-goal :coords neighbor
                                      :priority new-priority)
                           queue)))
         until (queue-empty-p queue))
   behavior))

It is all quite simple so makes you wonder, where the hell is all that time spent? Notice the queue, it's backed by a doubly-linked list. The enqueue, dequeue, and queue-empty-p operations are O(1). The devil is in the details! All those additional allocations, spread all over memory, are an absolute nightmare for the CPU. Cannot predict shit.

Let's give it some help, we will throw away the queue and instead preallocate an array that is large enough, and use a fill pointer to keep track of how many elements we have "queued up". Here is what that would look like:

(defun populate-behavior-grid (field goals)
  (let* ((width (field-width field))
         (height (field-height field))
         (behavior (make-array (list width height)
                               :initial-element (1+ (max width height))
                               :element-type '(signed-byte 16)))
         (grid (field-grid field))
         (queue (make-array (list (* width height))
                            :initial-element nil
                            :element-type '(or null goal)
                            :fill-pointer 0)))
    (dolist (goal goals)
      (vector-push goal queue))

    ;; Incrementing INDEX essentially acts as removing the first
    ;; element of the queue.
    (loop for index from 0
          until (eql index (length queue))
          for goal of-type goal = (aref queue index)
          for coords = (goal-coords goal)
          for x of-type fixnum = (floor (vx coords))
          for y of-type fixnum = (floor (vy coords))
          for priority = (goal-priority goal)
          unless (or
                  (let ((current-value (aref behavior x y)))
                    (<= current-value priority))
                  (let ((map-value (aref grid x y)))
                    (eql map-value +occupied+)))
            do (setf (aref behavior x y) priority)
               (dolist (neighbor (neighbors coords width height))
                 (let ((new-goal (make-goal :coords neighbor
                                            :priority (1+ priority))))
                   (vector-push new-goal queue))))
    behavior))

by being more explicit about our memory needs and using a more cache-friendly structure we can speed things up quite a bit. Our access pattern is dead simple too which makes this entire setup pretty ideal:

Evaluation took:
  0.036 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  76,524,229 processor cycles
  90,568,672 bytes consed

~50ms is by no means the best we can do but it is certainly good enough to move on with building the rest of the game and ignore this until it becomes an issue again. I am sure it will as we will be adding more and more complexity to the generation of a behavior map but this is necessary.

While we are using a high level language and operate many abstraction layers above the CPU, sometimes understanding what happens at that level gives you a whole new perspective on how to design your code. I am sure more challenges like that will come up and this was definitely a fun one. I knew that arrays were more cache-friendly than lists but I had never really grasped the difference this can make in a tight loop. Along with that, we have removed a lot of levels of indirection and allowed the compiler to do its absolute best to help us in what we are doing. Take a moment out of your day to thank your compiler for its hard work.

3 tricks to remember:

preallocate the memory you are going to use in a tight loop
consider using a more cache-friendly structure where it makes sense
predictable access patterns