feat(epicardium): Add a proper mutex implementation

In the current firmware, different locking mechanisms a littered around the code-base. Among them are bare FreeRTOS mutexes and the hw-locks. The callers for these often specify timeouts but don't make much effort in A) picking robust values for the timeout and B) recovering gracefully from a timeout happening. Most of the time, we return -EBUSY to _Python code_. This is really really bad API design. The firmware needs to have enough integrity to ensure these situations can't ever occur. To combat this, add a new locking primitive: The `struct mutex`. The intention is to replace all other locking and synchronization APIs with this one. This will provide one central place to debug any sort of locking issues. The `struct mutex` API is based on a few assumptions about locking. Those are detailed in `Documentation/epicardium/mutex.rst`, which is part of this commit. The most important one is: Locking can **never** fail. By requiring this to be true, we eliminate the need for drivers to contain (often incorrect) logic for dealing with locking fails. This should drastically improve the stability of the firmware in regards to lock-related bugs. This commit does not introduce any functional changes yet. Signed-off-by: Rahix <rahix@rahix.de>

feat(epicardium): Add a proper mutex implementation
82fe0cd7 · rahix · ec71acfd · 82fe0cd7 · 82fe0cd7 · 82fe0cd7
Verified Commit 82fe0cd7 authored 5 years ago by rahix
--- a/Documentation/epicardium/mutex.rst
+++ b/Documentation/epicardium/mutex.rst
+Mutex
+=====
+Ontop of FreeRTOS, we have our own mutex implementation.  **Never use the
+FreeRTOS mutexes directly!  Always use this abstraction layer instead**.  This
+mutex implementation tries to make reasoning about program flow and locking
+behavior easier.  And most importantly tries to help with debugging possible
+dead-locks.
+
+Design
+------
+There are a few guiding design principles:
+
+- Mutexes can only be used from tasks, **never** from interrupts!
+- Timers can use mutexes, but only with :c:func:`mutex_trylock`, **never** with
+  :c:func:`mutex_lock` (Because they are not allowed to block).
+- Locking can *never* fail (if it does, we consider this a fatal error ⇒ panic).
+- No recursive locking.
+- An unlock can only occur from the task which previously acquired the mutex.
+- An unlock is only allowed if the mutex was previously acquired.
+
+For a more elaborate explanation of the rationale behind these rules take a
+look at the :ref:`mutex-design-reasons`.
+
+Definitions
+-----------
+.. c:autodoc:: epicardium/modules/mutex.h
+
+.. _mutex-design-reasons:
+
+Reasons for this Design
+-----------------------
+
+Only from tasks
+^^^^^^^^^^^^^^^
+Cases where it might make sense to acquire a lock from an IRQ are very hard to
+get right.  They often involve the lock and unlock being in different functions
+and thus the correctness of the code is resting on the assumption that those
+functions are called in the correct order, also in respect to the rest of the
+related code.  By disallowing this entirely, we force code-flow to be more
+linear and hopefully easier to reason about.
+
+An exception is with software timers:  While they technically *can* acquire a
+mutex (using :c:func:`mutex_trylock`), I strongly recommend **not** doing that.
+Because :c:func:`mutex_trylock` can fail, doing these kind of timers correctly
+is not trivial.  Instead, use a task and block it using :c:func:`mutex_lock`.
+
+Locking can *never* fail
+^^^^^^^^^^^^^^^^^^^^^^^^
+From a technical standpoint this might seem unintuitive.  But when looking at
+it from the semantics of what it means for a lock to fail, it makes total
+sense:  There are two cases which could be considered failure of a lock:
+
+1. **A dead-lock condition**.  Dead-locks should not be catched by code, but by
+   review of the code earlier on.  Tools like static- or runtime-analysis are
+   the proper way to detect possible dead-locks.  Going the windows-route of
+   trying to write the OS defensive against itself is not going to end well.
+   Fix the code which might dead-lock instead of slapping a band-aid ontop.
+2. **Another task blocking the lock for a long time**.  This is a bit more
+   tricky: FreeRTOS provides the ability to add a timeout to locking attempts
+   to catch these cases.  But this is also a bad idea.  Think about it:  What
+   would you even do after the locking attempt timed out?  You cannot access
+   the mutexed data/peripheral anyway because someone else is still modifying
+   it.  Even if you were to kill the offending task, the data/peripheral might
+   be in a totally unpredictable state.  It is also very hard to define timeout
+   values that are relaxed enough for all possible code-paths acquiring the
+   lock.  You would have to take situations like the task holding the mutex
+   being preempted by a long-running high priority task into account.  So
+   instead, we provide :c:func:`mutex_trylock`:  If the mutex cannot be
+   acquired immediately you have the chance to choose an alternative code-path
+   and try again later.
+
+No recursive locking
+^^^^^^^^^^^^^^^^^^^^
+While recursive locking might seem to make a few cases easier to deal with, it
+carries quite a few implications which make it not worthwhile.  It makes
+analysis of locking chains a *lot* harder and in some cases even impossible
+(for proving 100% correcteness).  We should strive for a design which is as
+simple as possible and as obvious to understand as possible.  Keeping locking
+hierarchy flat helps quite a lot with that.
+
+Only unlock from the acquiring task
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+This one should be obvious but FreeRTOS does not enforce this ... By stating
+that locking can never fail, we also won't ever have a situation where
+force-unlocking a forgein lock might be necessary.
+
+Only unlock once after acquisition
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+As a pseude-recursive locking implementation, sometimes the
+:c:func:`mutex_unlock` call is written in a way that allows unlocking an
+already unlocked mutex.  But this only encourages bad practices like unlocking
+everything again at the end of a function "just to be sure".  Code should be
+written in a way where the lock/unlock pair is immediately recognizable as
+belonging together.
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -52,6 +52,7 @@ Last but not least, if you want to start hacking the lower-level firmware, the
   debugger
   pycardium-guide
   memorymap
+   epicardium/mutex
   epicardium/sensor-streams

 .. toctree::

--- a/epicardium/FreeRTOSConfig.h
+++ b/epicardium/FreeRTOSConfig.h
@@ -52,6 +52,8 @@
 #define INCLUDE_vTaskDelay          1
 #define INCLUDE_uxTaskGetStackHighWaterMark 1
 #define INCLUDE_xTimerPendFunctionCall 1
+#define INCLUDE_xSemaphoreGetMutexHolder 1
+
 /* Allow static allocation of data structures */
 #define configSUPPORT_STATIC_ALLOCATION 1


--- a/epicardium/modules/meson.build
+++ b/epicardium/modules/meson.build
@@ -14,6 +14,7 @@ module_sources = files(
  'light_sensor.c',
  'log.c',
  'max30001.c',
+  'mutex.c',
  'panic.c',
  'personal_state.c',
  'pmic.c',

--- a/epicardium/modules/mutex.c
+++ b/epicardium/modules/mutex.c
+#include "modules/mutex.h"
+
+#include <assert.h>
+
+void _mutex_create(struct mutex *m, const char *name)
+{
+	/* Assert that the mutex has not been initialized already */
+	assert(m->name == NULL);
+
+	/*
+	 * The name is just the parameter stringified which is almost always a
+	 * pointer.  If it is, skip over the '&' because it adds no value as
+	 * part of the name.
+	 */
+	if (name[0] == '&') {
+		m->name = &name[1];
+	} else {
+		m->name = name;
+	}
+
+	m->_rtos_mutex = xSemaphoreCreateMutexStatic(&m->_rtos_mutex_data);
+}
+
+void mutex_lock(struct mutex *m)
+{
+	int ret = xSemaphoreTake(m->_rtos_mutex, portMAX_DELAY);
+
+	/* Ensure locking was actually successful */
+	assert(ret == pdTRUE);
+}
+
+bool mutex_trylock(struct mutex *m)
+{
+	int ret = xSemaphoreTake(m->_rtos_mutex, 0);
+	return ret == pdTRUE;
+}
+
+void mutex_unlock(struct mutex *m)
+{
+	/* Ensure that only the owner can unlock a mutex */
+	assert(mutex_get_owner(m) == xTaskGetCurrentTaskHandle());
+
+	int ret = xSemaphoreGive(m->_rtos_mutex);
+
+	/*
+	 * Ensure that unlocking was successful; that is, the mutex must have
+	 * been acquired previously (no multiple unlocks).
+	 */
+	assert(ret == pdTRUE);
+}
+
+TaskHandle_t mutex_get_owner(struct mutex *m)
+{
+	return xSemaphoreGetMutexHolder(m->_rtos_mutex);
+}
--- a/epicardium/modules/mutex.h
+++ b/epicardium/modules/mutex.h
+#ifndef _MUTEX_H
+#define _MUTEX_H
+
+#ifndef __SPHINX_DOC
+/* Some headers are not recognized by hawkmoth for some odd reason */
+#include <stddef.h>
+#include <stdbool.h>
+
+#include "FreeRTOS.h"
+#include "semphr.h"
+#else
+typedef unsigned int size_t;
+typedef _Bool bool;
+
+typedef void *TaskHandle_t;
+typedef void *SemaphoreHandle_t;
+typedef int StaticSemaphore_t;
+#endif /* __SPHINX_DOC */
+
+
+/**
+ * Mutex data type.
+ */
+struct mutex {
+	/* Name of this mutex, kept for debugging purposes.  */
+	const char *name;
+
+	/* FreeRTOS mutex data structures. */
+	SemaphoreHandle_t _rtos_mutex;
+	StaticSemaphore_t _rtos_mutex_data;
+};
+
+/**
+ * Create a new mutex.
+ *
+ * Call this function as early as possible, in an obvious location so it is easy
+ * to find.  Mutexes should be defined statically so they stay alive for the
+ * entire run-time of the firmware.
+ */
+#define mutex_create(mutex) _mutex_create(mutex, #mutex)
+void _mutex_create(struct mutex *m, const char *name);
+
+/**
+ * Lock a mutex.
+ *
+ * If the mutex is held by another task, :c:func:`mutex_lock` will block the
+ * current task until the mutex is unlocked.
+ *
+ * .. warning::
+ *
+ *    This function is **not** safe to use in a timer!
+ */
+void mutex_lock(struct mutex *m);
+
+/**
+ * Try locking a mutex.
+ *
+ * If the mutex is currently locked by another task, :c:func:`mutex_trylock`
+ * will return ``false`` immediately.  If the attmept to lock was successful, it
+ * will return ``true``.
+ *
+ * This funciton is safe for use in timers.
+ */
+bool mutex_trylock(struct mutex *m);
+
+/**
+ * Unlock a mutex.
+ *
+ * You **must** call this function from the same task which originally locked
+ * the mutex.
+ */
+void mutex_unlock(struct mutex *m);
+
+/**
+ * Get the current owner of the mutex.
+ *
+ * Returns the task-handle of the task currently holding the mutex.  If the
+ * mutex is unlocked, ``NULL`` is returned.
+ */
+TaskHandle_t mutex_get_owner(struct mutex *m);
+
+#endif /* _MUTEX_H */