Linking, Monitoring, and Supervising in Elixir

Written by: Leigh Halliday

One of the benefits of microservices is that part of the system can go down without bringing the entire system down.

With Elixir, each process is in essence a microservice. It's a small, isolated process that communicates with other processes via message passing, all orchestrated by the Erlang BEAM VM.

No memory is shared between processes, so the failure of one process does not affect other processes unless they have been explicitly linked. But the key ability of Elixir isn't just how processes work; it's how they can be linked together, monitor one another, and use supervising functionality to determine what to do if a process fails.
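To make the isolation concrete, here is a minimal sketch of a process that keeps its state private and only shares it by replying to a message. The IsolationSketch module and its message shapes are hypothetical names made up for this illustration.

```elixir
defmodule IsolationSketch do
  # Hypothetical helper: loops forever, keeping `count` private to this process.
  def counter(count) do
    receive do
      {:increment, by} ->
        counter(count + by)

      {:get, caller} ->
        # The only way to share state is to send a copy of it in a message
        send(caller, {:count, count})
        counter(count)
    end
  end
end

pid = spawn(IsolationSketch, :counter, [0])
send(pid, {:increment, 2})
send(pid, {:increment, 3})
send(pid, {:get, self()})

total =
  receive do
    {:count, value} -> value
  after
    1_000 -> :timeout
  end

IO.puts("Counter process holds #{total}")
```

Nothing outside the counter process can read or change count directly; the caller only ever sees the copy that arrives in the reply.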

In this article, we will touch on linking, monitoring, and supervisors, with an example of how to implement a simple caching GenServer that's supervised by a supervisor.

Linking Processes

In Part I of this series, we looked at how to spawn a process and execute some code:

spawn(fn ->
  IO.puts "In #{inspect self()} process"
end)

But if this process happens to fail for whatever reason, we'll never know about it. It is completely isolated and won't affect our current process at all.

spawn(fn ->
  IO.puts "Uh oh..."
  raise("I have failed you.")
end)
:timer.sleep(500)
IO.puts "I'm done."

The process that spawned this code wasn't affected because the two processes aren't linked. You'll notice that it still printed the "I'm done." message. Linking ties two or more processes together: if one process fails, so does any process that is linked to it. To begin a new linked process, you change the above code only slightly to call the spawn_link function instead.

spawn_link(fn ->
  IO.puts "Uh oh..."
  raise("I have failed you.")
end)
:timer.sleep(500)
IO.puts "I'm done."

Now that we have linked the process to our current (self()) process, it won't print the "I'm done." message. When the linked process went down, so did our current one. Links are bidirectional. It doesn't matter which process fails; by linking them, they are both affected.
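The example above shows a failing child taking down its parent. To see the other direction, here is a rough sketch where the parent is the one that crashes. The process roles and the :parent_crashed reason are made up for this illustration, and the outer process is started with plain spawn so the crash doesn't reach the process running the script.

```elixir
script_pid = self()

# The "parent" links to a child, reports the child's pid, then crashes.
spawn(fn ->
  child = spawn_link(fn -> Process.sleep(:infinity) end)
  send(script_pid, {:child, child})
  Process.sleep(100)
  exit(:parent_crashed)
end)

child =
  receive do
    {:child, pid} -> pid
  after
    1_000 -> raise "child pid never arrived"
  end

alive_before = Process.alive?(child)

# Wait long enough for the parent to crash and the exit signal to propagate
Process.sleep(300)
alive_after = Process.alive?(child)

IO.inspect({alive_before, alive_after})
```

The child would happily sleep forever on its own, but it goes down as soon as the process it is linked to exits abnormally.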

So what if we actually did want to recover from a failure in a linked process? To do this, we will have to do something called trapping exits. When a process traps exits, an exit signal from a linked process no longer brings it down; instead, the signal is delivered as an ordinary {:EXIT, pid, reason} message, which gives us an opportunity to recover. We can listen for these messages using a receive block, the typical way that messages are passed from one process to another.

# Tell our current process that we want to trap exits
Process.flag(:trap_exit, true)
# Spawn a linked process which will fail
spawn_link(fn ->
  IO.puts "Uh oh..."
  raise("I have failed you.")
end)
# Receive the trapped exit message
receive do
  {:EXIT, pid, :normal} ->
    IO.puts "Normal exit from #{inspect pid}"
  {:EXIT, pid, msg} ->
    IO.puts ":EXIT received from #{inspect pid}"
    IO.inspect msg
end
:timer.sleep(500)
IO.puts "I'm done."

You'll notice that in the receive block above, I am actually pattern matching for two different :EXIT messages. The first one is what happens when a process exits normally upon finishing its task. The second one will catch errors and in our case will output:

:EXIT received from #PID<0.73.0>
{%RuntimeError{message: "I have failed you."},
 [{:elixir_compiler_0, :"-__FILE__/1-fun-0-", 0,
   [file: 'error_linking_traps.exs', line: 5]}]}
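The two :EXIT patterns can be wrapped into a small helper that turns a linked process's fate into a return value. This is only a sketch: TrapSketch and run are hypothetical names, and the one-second timeout is an arbitrary choice.

```elixir
defmodule TrapSketch do
  # Hypothetical helper: runs `fun` in a linked process and reports how it ended.
  def run(fun) do
    # Process.flag/2 returns the previous value, so we can restore it afterwards
    previous = Process.flag(:trap_exit, true)
    pid = spawn_link(fun)

    result =
      receive do
        {:EXIT, ^pid, :normal} -> :ok
        {:EXIT, ^pid, reason} -> {:error, reason}
      after
        1_000 -> :timeout
      end

    Process.flag(:trap_exit, previous)
    result
  end
end

ok_result = TrapSketch.run(fn -> :done end)
error_result = TrapSketch.run(fn -> exit(:boom) end)

IO.inspect(ok_result)
IO.inspect(error_result)
```

Pinning the pid with ^pid matters here: it makes sure we only react to the exit of the process we just started, not to some other linked process that happens to exit at the same time.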

Monitoring Processes

Links are bidirectional, but monitoring on the other hand is unidirectional. It allows you to monitor (hence the name) the status of another process without linking yourself to it. You're observing it at a safe distance. Unlike linking, an error in a monitored process won't bring down your current one; you'll just be notified of it.

# Spawn a new process and grab its pid
pid = spawn(fn ->
  :timer.sleep 500
  raise("Sorry, my friend.")
end)
# Set up a monitor for this pid
ref = Process.monitor(pid)
# Wait for a down message for given ref/pid
receive do
  {:DOWN, ^ref, :process, ^pid, :normal} ->
    IO.puts "Normal exit from #{inspect pid}"
  {:DOWN, ^ref, :process, ^pid, msg} ->
    IO.puts "Received :DOWN from #{inspect pid}"
    IO.inspect msg
end

We'll see the following:

Received :DOWN from #PID<0.73.0>
{%RuntimeError{message: "Sorry, my friend."},
 [{:elixir_compiler_0, :"-__FILE__/1-fun-0-", 0,
   [file: 'error_monitoring.exs', line: 3]}]}
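Here is a small sketch of what "unidirectional" means in practice. The watcher process and the exit reasons are made up for illustration. It also shows what happens when you monitor a process that is already dead.

```elixir
# A long-running process that we will observe from two directions
target = spawn(fn -> Process.sleep(:infinity) end)

# 1. A watcher monitors the target and then crashes.
#    Nothing flows back to the target, so it keeps running.
spawn(fn ->
  Process.monitor(target)
  exit(:watcher_crashed)
end)

Process.sleep(100)
target_alive = Process.alive?(target)

# 2. We monitor the target ourselves and kill it.
#    We get a :DOWN message, but our own process is unaffected.
ref = Process.monitor(target)
Process.exit(target, :kill)

reason =
  receive do
    {:DOWN, ^ref, :process, ^target, reason} -> reason
  after
    1_000 -> :timeout
  end

# 3. Monitoring a pid that is already dead still produces a :DOWN message
late_ref = Process.monitor(target)

late_reason =
  receive do
    {:DOWN, ^late_ref, :process, ^target, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect({target_alive, reason, late_reason})
```

The last case is handy to know about: you never have to worry about a race between a process dying and you setting up the monitor, because you are told about it either way.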

Supervising

Linking and monitoring are available when you need them, but Elixir also comes with higher-level Supervisor functionality. This allows us to easily define what behavior should occur when the code that is being supervised fails. We'll use an example of a cache store, which fetches the cached value as long as it hasn't expired.

In the code below, we first fetch the total value, providing a function to call if it doesn't exist or if it has already expired. We then look up the pid of this named process, which is 0.109.0. After sending a :kill exit signal to the process, we look up the pid again and can see that it is now 0.115.0. The process has automatically been restarted by its supervisor and is able to fetch the total value again (which has to be recalculated, because all state was lost when the process was killed).

iex(1)> CashMan.Cache.fetch('total', fn -> 20 end)
20
iex(2)> pid = Process.whereis(CashMan.Cache)
#PID<0.109.0>
iex(3)> Process.exit(pid, :kill)
true
iex(4)> Process.whereis(CashMan.Cache)
#PID<0.115.0>
iex(5)> CashMan.Cache.fetch('total', fn -> 20 end)
20

Because this example runs as an application, we'll implement the start function, which is called automatically. Its job in this case is to start the Supervisor module for this application by calling the start_link function. Supervisors, like any other concurrent code in Elixir, are simply specialized processes.

defmodule CashMan do
  use Application
  def start(_type, _args) do
    CashMan.Supervisor.start_link()
  end
end

The implementation for the Supervisor module includes the use Supervisor statement. This gives us all of the functionality which comes built in to Elixir for this behavior.

We'll call the start_link function that comes with Supervisor, passing it the __MODULE__ (our current module, to use as the supervising module), an initial value, which in our case is simply :ok, and the name of this process.

The init function is then called automatically, which is where we can define the exact behavior for this specific supervisor: which children will it supervise, and which strategies should be used in case they fail.

Supervisors can supervise children (GenServers), but they can also supervise other supervisors, creating a supervision hierarchy or tree. Benjamin Tan Wei Hao produced an excellent cheatsheet detailing all of the different functions and options for supervisors.

defmodule CashMan.Supervisor do
  use Supervisor
  def start_link do
    Supervisor.start_link(__MODULE__, :ok, name: CashMan.Supervisor)
  end
  def init(:ok) do
    children = [
      worker(CashMan.Cache, [CashMan.Cache])
    ]
    supervise(children, [strategy: :one_for_one])
  end
end
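The worker and supervise helpers used above come from Supervisor.Spec, which has been deprecated since Elixir 1.5. As a rough sketch of how the same idea might look with child specs and Supervisor.init, here is a self-contained version. SketchCache and SketchSupervisor are hypothetical stand-ins so the snippet runs on its own; they are not the CashMan modules from this article.

```elixir
defmodule SketchCache do
  # A stand-in worker; `use GenServer` also defines child_spec/1 for us
  use GenServer

  def start_link(name) do
    GenServer.start_link(__MODULE__, %{}, name: name)
  end

  @impl true
  def init(state), do: {:ok, state}
end

defmodule SketchSupervisor do
  use Supervisor

  def start_link do
    Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  @impl true
  def init(:ok) do
    children = [
      # {module, arg} means: call module.child_spec(arg), which in turn
      # tells the supervisor to run module.start_link(arg)
      {SketchCache, SketchCache}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

{:ok, _sup} = SketchSupervisor.start_link()

first_pid = Process.whereis(SketchCache)
Process.exit(first_pid, :kill)

# Give the supervisor a moment to restart the child
Process.sleep(100)
second_pid = Process.whereis(SketchCache)

IO.inspect(first_pid != second_pid)
```

Just like in the iex session earlier, the named process comes back under a new pid after being killed.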

A good overview of the different strategies can be found in this article, and although it is speaking about Erlang, the restart strategies are identical in Elixir.

I chose to use :one_for_one in the example above because the supervisor is only supervising one child. You would also use this strategy when the child is an isolated process whose failure shouldn't affect any other children that the supervisor is supervising.
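To get a feel for the difference between strategies, here is a small experiment that compares :one_for_one with :one_for_all. StrategySketch, the child ids :a and :b, and the use of Agents as throwaway children are all assumptions made for this illustration.

```elixir
defmodule StrategySketch do
  # Hypothetical helper: starts two Agents (:a and :b) under a supervisor using
  # the given strategy, kills :a, and reports whether :b kept its original pid.
  def sibling_survives?(strategy) do
    children = [
      Supervisor.child_spec({Agent, fn -> 0 end}, id: :a),
      Supervisor.child_spec({Agent, fn -> 0 end}, id: :b)
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    b_before = pid_of(sup, :b)

    Process.exit(pid_of(sup, :a), :kill)
    # Give the supervisor a moment to react to the exit
    Process.sleep(100)

    survived = pid_of(sup, :b) == b_before
    Supervisor.stop(sup)
    survived
  end

  # Looks up the current pid of the child with the given id
  defp pid_of(sup, id) do
    {^id, pid, _type, _modules} =
      sup |> Supervisor.which_children() |> List.keyfind(id, 0)

    pid
  end
end

one_for_one = StrategySketch.sibling_survives?(:one_for_one)
one_for_all = StrategySketch.sibling_survives?(:one_for_all)

IO.inspect({one_for_one, one_for_all})
```

With :one_for_one only the killed child is restarted, so its sibling keeps the same pid. With :one_for_all the supervisor stops and restarts every child, so the sibling comes back under a new pid.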

Below we have the child, which implements the GenServer behavior. If you are looking for more details on how a GenServer works, please refer to my previous article on Concurrency Abstractions in Elixir.

defmodule CashMan.Cache do
  use GenServer
  @default_expiry 60
  def start_link(name) do
    GenServer.start_link(__MODULE__, %{}, name: name)
  end
  # Allow async fetching, which returns a `Task`,
  # allowing you to call `Task.await()` later.
  def async_fetch(key, func, expiry \\ @default_expiry) do
    Task.async(fn ->
      fetch(key, func, expiry)
    end)
  end
  # Fetch the fresh value for a given key
  # If missing or expired, re-generate a new value and store it in the cache.
  def fetch(key, func, expiry \\ @default_expiry) do
    case GenServer.call(__MODULE__, {:fetch, key, expiry}) do
      :missing ->
        value = Task.async(fn -> func.() end) |> Task.await()
        store(key, value, expiry)
        value
      value -> value
    end
  end
  # Store a given value in the cache, providing its expiry time in seconds
  def store(key, value, expiry) do
    GenServer.cast(__MODULE__, {:store, key, value, expiry})
  end
  # Remove all expired entries from the cache
  def prune do
    GenServer.cast(__MODULE__, :prune)
  end
  # Return the current state of the cache
  def current do
    GenServer.call(__MODULE__, :current)
  end
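  # Server
  # Explicit init callback; newer Elixir versions warn if it is missing
  def init(state) do
    {:ok, state}
  end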
  def handle_call({:fetch, key, _expiry}, _from, state) do
    {answer, new_state} = case Map.fetch(state, key) do
      {:ok, {expired_at, value}} ->
        case expired?(expired_at) do
          true -> {:missing, Map.delete(state, key)}
          false -> {value, state}
        end
      :error ->
        {:missing, state}
    end
    {:reply, answer, new_state}
  end
  def handle_call(:current, _from, state) do
    {:reply, state, state}
  end
  def handle_cast({:store, key, value, expiry}, state) do
    new_state = Map.put(state, key, {calc_expired_at(expiry), value})
    {:noreply, new_state}
  end
  def handle_cast(:prune, state) do
    # Rebuild the map, keeping only the entries that have not yet expired
    new_state = Enum.reduce(state, %{}, fn {key, {expired_at, value}}, new_state ->
      if expired?(expired_at) do
        new_state
      else
        Map.put(new_state, key, {expired_at, value})
      end
    end)
    {:noreply, new_state}
  end
  def calc_expired_at(expiry) do
    (DateTime.utc_now() |> DateTime.to_unix()) + expiry
  end
  def expired?(expired_at) do
    DateTime.to_unix(DateTime.utc_now()) > expired_at
  end
end

By calling :observer.start in any iex console, you will be able to see examples of supervisors. The Logger application contains one which you can explore! Ours from the example above looks like the following:

[Screenshot: the CashMan supervision tree as shown in Observer]

Conclusion

For a deeper dive into the world of Elixir and OTP, I recommend The Little Elixir and OTP Guidebook. It does a great job of exploring each of the subjects we touched on above in much more detail. The topic of supervisors in Elixir is deeper and more nuanced than I could have hoped to cover in a single article. As usual, the Elixir website also has an excellent guide on supervisors.

The cool thing about Elixir is that all of the more advanced/abstracted functionality is built on top of the building blocks of processes, which can link themselves to other processes and send and receive messages from one process to another. Everything else is an abstraction, including a supervisor which is just a specialized GenServer that comes with the language.
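To show how little is needed to get restart behavior out of those building blocks, here is a deliberately naive sketch of a restarting process built from nothing but spawn_link, trapped exits, and messages. TinySupervisor and its message shapes are hypothetical, and this is nowhere near a replacement for the real Supervisor, which handles strategies, restart intensity, and orderly shutdown.

```elixir
defmodule TinySupervisor do
  # Hypothetical sketch: runs `fun` in a linked process and restarts it on
  # abnormal exit, at most `max_restarts` times, reporting to `notify_pid`.
  def start(fun, notify_pid, max_restarts) do
    spawn(fn ->
      Process.flag(:trap_exit, true)
      loop(fun, notify_pid, max_restarts)
    end)
  end

  defp loop(fun, notify_pid, restarts_left) do
    pid = spawn_link(fun)
    send(notify_pid, {:started, pid})

    receive do
      {:EXIT, ^pid, :normal} ->
        send(notify_pid, :finished)

      {:EXIT, ^pid, _reason} when restarts_left > 0 ->
        loop(fun, notify_pid, restarts_left - 1)

      {:EXIT, ^pid, reason} ->
        send(notify_pid, {:gave_up, reason})
    end
  end

  # Counts :started messages until the supervisor reports a final outcome
  def await_outcome(starts \\ 0) do
    receive do
      {:started, _pid} -> await_outcome(starts + 1)
      :finished -> {starts, :finished}
      {:gave_up, reason} -> {starts, reason}
    after
      1_000 -> {starts, :timeout}
    end
  end
end

# A child that always crashes: one initial start plus two restarts
TinySupervisor.start(fn -> exit(:crash) end, self(), 2)
result = TinySupervisor.await_outcome()

IO.inspect(result)
```

The child is started three times in total before the loop gives up and reports the last exit reason.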
