Ephemeral self-hosted github runners on a dedicated server

There are many existing projects that try to solve the setup of ephemeral github runners. A curated list of such projects can be found here

However, none of these projects was both simple and suited to my needs.
I wanted to have self-hosted runners that meet the following requirements:

If the workflow can be run with a github runner, it can be run with the self-hosted runner.
It should be possible to define multiple runners in parallel.
It must not share the PAT with our dedicated work-server (this requirement was not obvious at the beginning and added some complexity)

Currently the runner is limited to the scope of repositories, but you can add it to as many repositories as you want.
If you want to set up self-hosted runners with an organisation scope, the scripts have to be adjusted to meet that need!

Jump to the final solution

The journey begins ... Trial and error

My goal was to run a dockerized testsuite for the nano currency that spins up multiple docker-containers and writes some files to disk inside a self-hosted runner... Let's go!

Attempt 1 : Setup a simple github runner on a dedicated server

TLDR; This does not meet my needs

I can't run multiple workflows in parallel without changing my current workflow
I have to create additional cleanup logic before a new runner starts
I can't use the default cleanup hooks provided by github runners, because my dockerized testsuite writes files to disk as root

Full Experience
Setting up your self-hosted runner on a dedicated server is as easy as following the github example :
1) Download

# Create a folder   
$ mkdir actions-runner && cd actions-runner 
# Download the latest runner package   
$ curl -o actions-runner-osx-x64-2.300.2.tar.gz -L https://github.com/actions/runner/releases/download/v2.300.2/actions-runner-osx-x64-2.300.2.tar.gz 
# Optional: Validate the hash   
$ echo "59814d103186d379123da8d2e7b002305a7b57f509fdd0cf34e4f86394dae9a4 actions-runner-osx-x64-2.300.2.tar.gz" | shasum -a 256 -c 
# Extract the installer   
$ tar xzf ./actions-runner-osx-x64-2.300.2.tar.gz

2) Configure

# Create the runner and start the configuration experience   
$ ./config.sh --url https://github.com/{owner}/{repo} --token AUN... 
# Last step, run it!   
$ ./run.sh

3) Using your self-hosted runner

# Use this YAML in your workflow file for each job   
runs-on: self-hosted

In my workflow all I have to do is to change from runs-on: ubuntu-22.04 to runs-on: self-hosted and the workflow is executed

jobs:
  my_job_name:
    #runs-on: ubuntu-22.04
    runs-on: self-hosted

On the first run, this works fine. However on the second run the workflow fails. The self-hosted runners don't clean up after the workflow has ended.
In my case this means :

docker-containers are still running
the _work folder used by default by the github runner is not emptied

So the major challenge with this approach lies in defining some cleanup logic per workflow. Even after I created all the cleanup logic, the workflow still failed to execute successfully on a second run. What happened ?

Self-hosted runners provide hooks that allow a script to be executed before a job starts or after it has completed.
The easiest way is to create a file named .env within the self-hosted runner application directory with the following content

ACTIONS_RUNNER_HOOK_JOB_STARTED=/cleanup_before_workflow_starts.sh
ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/cleanup_after_workflow_ends.sh

Inside cleanup_before_workflow_starts.sh you'd write all the actions required before a workflow runs.
Inside cleanup_after_workflow_ends.sh you'd write all the actions required after a workflow has ended.
These scripts can be located anywhere. In this case they are located within the self-hosted runner application directory

So what I did was :

Create a script that deletes all the files inside _work folder
Modify my workflow to stop all docker containers at the end
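
A minimal sketch of what such a cleanup script could look like. The runner location in RUNNER_DIR and the function name empty_dir are assumptions made for illustration, not taken from my actual setup.

```shell
#!/bin/sh
# Sketch of a cleanup hook. RUNNER_DIR is an assumed location of the runner application.
RUNNER_DIR="${RUNNER_DIR:-$HOME/actions-runner}"

# Delete everything inside a directory (including dotfiles) but keep the directory itself.
empty_dir() {
    dir="$1"
    # Nothing to do if the directory does not exist.
    [ -d "$dir" ] || return 0
    find "$dir" -mindepth 1 -maxdepth 1 -exec rm -rf -- {} +
}

empty_dir "$RUNNER_DIR/_work"
```

Keep in mind that rm can only delete what the user running the hook is allowed to delete.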

My self-hosted runners use a special user with restricted rights.
My dockerized testsuite executes as root and writes files to disk as root.
So the cleanup hooks performed by the github runner did not have the required permissions to remove all files inside the _work folder created by the testsuite.

I stopped investigating further. I could have modified the dockerized testsuite to run with the same user as my self-hosted github runner. However the workflow runs correctly on github, so there must be a better way to make it work on my self-hosted runner.

Attempt 2: Dockerized self-hosted github runners

TLDR; This does not work either

running docker containers inside a dockerized github runner doesn't work well
the docker containers of my testsuite are still visible on the host machine
files written to disk by the testsuite are still shared with the host

Full Experience
While on the surface it looked very easy to set up a dockerized runner, it only works well when you run non-dockerized workflows.
Running my dockerized testsuite inside a dockerized github-runner comes with a range of problems that are not trivial to work around.

One major limitation is mounting volumes. The mount path of the volume depends on the host machine's absolute path instead of the path relative to the github runner.
Another limitation is that the dockerized github runner and the dockerized testsuite don't share the same network by default. This means that we can't curl into our testsuite.
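
To illustrate the volume problem: when the runner container talks to the docker daemon of the host, the source path of a volume is resolved on the host. The sketch below assumes a hypothetical setup in which the host directory /srv/runner/_work is mounted at /runner/_work inside the runner container. Both paths and the function name to_host_path are made up for this example.

```shell
# Both prefixes are assumptions for illustration; adjust them to your own mounts.
HOST_WORKDIR="${HOST_WORKDIR:-/srv/runner/_work}"
RUNNER_WORKDIR="${RUNNER_WORKDIR:-/runner/_work}"

# Translate a path as seen inside the runner container into the path on the host.
to_host_path() {
    case "$1" in
        "$RUNNER_WORKDIR")
            printf '%s\n' "$HOST_WORKDIR"
            ;;
        "$RUNNER_WORKDIR"/*)
            # Strip the container prefix and prepend the host prefix.
            rest=${1#"$RUNNER_WORKDIR"}
            printf '%s\n' "$HOST_WORKDIR$rest"
            ;;
        *)
            echo "not below $RUNNER_WORKDIR: $1" >&2
            return 1
            ;;
    esac
}
```

Every volume in the testsuite would need this kind of translation, for example `docker run -v "$(to_host_path "$PWD/data"):/data" ...`, which is exactly the kind of workflow change I wanted to avoid.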

Attempt 3 : self-hosted github runner inside LXD - almost there

TLDR; Almost... If you are comfortable with sharing your PAT with your work-server, this solution is for you!

2 of 3 criteria are met :

the dockerized testsuite runs correctly on each run
we can set an arbitrary number of github runners
However the repo PAT is shared with our dedicated work-server

This is an overview of the architecture

Full experience
LXD, from the Linux containers project, is a next generation system container and virtual machine manager.
Linux containers can easily host docker containers.

I found an interesting project that creates github runners in LXD with 2 simple scripts.
The first script (prepare-instance.sh)

creates a base container with all the configuration needed for the self-hosted github runner.
adds a valid github registration-token to the base container

The second script (spawn.sh)

defines the number of parallel github runners
creates a new LXD container with a unique name by copying the base container
starts the container (the github runner is now active and waiting to receive a workflow)
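
The copy-and-start step can be sketched roughly as follows. This is not the project's actual spawn.sh, only an illustration of the idea. The function name spawn_runners and the naming scheme of the copies are assumptions. lxc copy and lxc start are the regular LXD client commands.

```shell
LXC="${LXC:-lxc}"   # overridable: LXC=echo prints the commands instead of running them

# Copy the base container COUNT times under unique names and start each copy.
spawn_runners() {
    base="$1"
    count="$2"
    stamp="$(date +%s)"   # timestamp keeps names unique across invocations
    i=1
    while [ "$i" -le "$count" ]; do
        name="$base-$stamp-$i"
        "$LXC" copy "$base" "$name" && "$LXC" start "$name"
        i=$((i + 1))
    done
}
```

Calling `spawn_runners gh-runner 3` would then create and start three copies of a base container named gh-runner.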

I encountered 2 issues with these scripts :

the disk space claimed by the workflow was not freed for new runners.
the registration-token expires after 60 minutes

Github offers an api to create a registration-token from a PAT.
So what I did was :

simplify the prepare_instance script and remove the registration of the self-hosted runner
merge the registration of the self-hosted runner with the spawn script into a respawn script.

The new respawn script takes among others a PAT as input argument and does the following :

convert the PAT into registration-token needed to register a container as self-hosted runner
update registration-token of the base container. So this container always has a valid registration-token
the base container is always stopped, so it never actually starts as a self-hosted github runner
respawn any missing self-hosted github runners based on the parallelism that is defined
add the possibility to run workers for multiple repos
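
The "respawn any missing runners" step boils down to a small piece of arithmetic. A minimal sketch, assuming the copies are named "<base>-<suffix>" (the function name missing_runners and this naming scheme are illustrative, not necessarily what the respawn script uses):

```shell
# Print how many runners are missing, given the desired number and the
# names of the existing containers on stdin (one name per line).
missing_runners() {
    base="$1"
    wanted="$2"
    # Count the copies only; the base container itself has no "-<suffix>".
    # grep -c exits non-zero when it counts 0, hence the "|| true".
    have="$(grep -c "^$base-" || true)"
    missing=$((wanted - have))
    [ "$missing" -gt 0 ] || missing=0
    echo "$missing"
}
```

The container names could be fed in with `lxc list --format csv -c n | missing_runners gh-runner 3`. Note that this simple prefix match would also count containers of another base whose name starts with the same prefix.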

So if you feel comfortable sharing the repo PAT with your dedicated work-server, this solution works very well.
Simply define a cronjob that runs each minute to spin up new runners if needed and to renew the registration-token.

$ crontab -e
## Add the following line as cronjob. Replace the path and RUNNER_COUNT (=parallel runners)
* * * * * /path/to/lxd-runner/respawn gh-runner RUNNER_COUNT https://github.com/ORG/REPO PAT

The requirement to keep the PAT safe only came after this solution was implemented.
If you can't share the PAT with your dedicated work-server the next attempt will show you how to do it.

Final solution

Success on attempt 4 : self-hosted github runner inside LXD without compromising your PAT

The final solution that meets all the above requirements is available as github-project here

it runs any workflow that runs on github runners
you can define the number of active runners per repository
your PAT never leaves github

This is an overview of the final architecture.
We will go through this picture step by step

TLDR;

Fork the project https://github.com/gr0vity-dev/lxd-github-runner or create your own github project to host the infrastructure of self-hosted runners
Install and configure your LXD environment
Create one LXD base-container per repo that needs self-hosted runners
Renew the registration-token needed to start a self-hosted runner via a scheduled github workflow

1) Fork the project

In our final solution, you'll have a github workflow that makes sure your self-hosted runners always have a valid registration token.
To run a self-hosted github runner, you need a registration token.
You can generate a valid registration-token via github api and your PAT.
By generating the registration-token inside a github actions workflow, you make sure that your PAT never leaves github.

2) Install and configure your LXD environment

Install LXD

sudo apt-get install lxd-client

Initialise LXD and keep the defaults

lxd init

Add a new storage pool called "docker" (for docker to run properly in a Linux container it needs btrfs). The scripts rely on that name, so keep it.
Allocate 50GB for all github runners (you might adjust it for your needs)

lxc storage create docker btrfs size=50GB
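
If you rerun your setup from time to time, the pool creation can be made repeatable. A minimal sketch, assuming a hypothetical helper named ensure_docker_pool; lxc storage show and lxc storage create are regular LXD client commands:

```shell
LXC="${LXC:-lxc}"   # overridable so the logic can be tried without LXD installed

# Create the "docker" btrfs pool only if it does not exist yet.
ensure_docker_pool() {
    size="${1:-50GB}"
    if "$LXC" storage show docker >/dev/null 2>&1; then
        echo "storage pool 'docker' already exists"
    else
        "$LXC" storage create docker btrfs size="$size"
    fi
}
```

`ensure_docker_pool 50GB` then behaves like the command above on the first run and does nothing on later runs.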

3) Create one LXD base-container for every repo that needs self-hosted runners

The prepare_instance script creates the base images with all the required dependencies.
You might want to adjust the following lines of the script for your needs :

# By default, we use ubuntu 20 as the base container
./prepare-instance gh-runner-{repo-name}

! Make sure to specify a unique name per base-container. I recommend using the following : gh-runner-{repo-name}
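
If you manage several repositories, the name can be derived from the repo URL. The helper below is a hypothetical convenience and not part of the project. It assumes that container names should only contain letters, digits and dashes, so every other character is replaced by a dash.

```shell
# Derive "gh-runner-{repo-name}" from a repository URL.
base_container_name() {
    repo="${1%/}"          # drop a trailing slash
    repo="${repo##*/}"     # keep the last path segment
    repo="${repo%.git}"    # drop an optional .git suffix
    # Replace every character that is not a letter, digit or dash with a dash.
    repo="$(printf '%s' "$repo" | tr -c 'A-Za-z0-9-' '-')"
    printf 'gh-runner-%s\n' "$repo"
}
```

For example, `./prepare-instance "$(base_container_name https://github.com/ORG/REPO)"` would use the name gh-runner-REPO.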

4) Renew the registration-token needed to start a self-hosted runner via a scheduled github workflow

On github, set up a PAT (personal access token) with access to the repos and orgs you want serviced.

Enable the scopes listed below for your PAT:

repo
workflow
admin:repo_hook

The following shows an example of the workflow that

renews our registration-token
ssh's into our work-server to update the base-container with the valid token.

name: Renew ephemeral self-hosted runner token
on:
  schedule:
    - cron: '*/30 * * * *'  # every 30 minutes
jobs:
  renew_gh-runner_token:
    name: renew reg token
    runs-on: ubuntu-latest
    steps:
      - name: Convert PAT into registration-token
        id: get_token
        run: |          
          GH_RUNNER_TOKEN=$(curl \
          --location --request POST 'https://api.github.com/repos/{OWNER}/{REPO}/actions/runners/registration-token' \
          --header 'Authorization: Bearer ${{ secrets.GH_PAT }}' \
          | jq -r '.token')
          echo "::add-mask::$GH_RUNNER_TOKEN"
          echo "GH_RUNNER_TOKEN=$GH_RUNNER_TOKEN" >> "$GITHUB_OUTPUT"

      - name: Execute remote ssh commands using user and private key
        uses: appleboy/ssh-action@v0.1.7
        with:
          host: ${{ secrets.GH_RUNNER_HOST }}
          username: ${{ secrets.GH_RUNNER_USER }}
          key: ${{ secrets.GH_RUNNER_PRV_KEY }}
          script: ./git/lxd-github-actions/renew_runner_token {work-server-gh-runner-name} https://github.com/{OWNER}/{REPO} ${{ steps.get_token.outputs.GH_RUNNER_TOKEN }}

github workflow to generate a registration token and update our self-hosted runner

The workflow that updates my self-hosted runners can be found here

! Make sure to create the following repo secrets

- secrets.GH_PAT (your personal access token used to create a registration-token for your repo)
- secrets.GH_RUNNER_HOST (work server ip address)
- secrets.GH_RUNNER_USER (work-server user)
- secrets.GH_RUNNER_PRV_KEY (private key to ssh into the work-server)

! Additionally make sure to replace the following variables in the example above with the relevant content:

- {OWNER} # your github user
- {REPO} # your repo that requires the self-hosted runner
- {work-server-gh-runner-name} # the gh-runner name you defined when running the 'prepare-instance' script.

Explanation of why this workflow is needed :

The registration-token expires after 60 minutes.
The workflow above runs every 30 minutes and updates the registration-token inside your base-container to make sure your token is always valid.
The workflow ssh's into your work-server to execute the renew_runner_token
A PAT is still required but in this case it never leaves github.
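
The renewal only helps if the API call actually returned a token. With jq -r '.token', a response without a token field (for example because the PAT is invalid) yields the literal string "null", which would otherwise be pushed to the base-container. A small guard, sketched here with a hypothetical function name is_valid_token, could catch that before the ssh step runs:

```shell
# Succeed only if the value looks like a usable token.
# jq -r '.token' prints the literal string "null" when the field is missing.
is_valid_token() {
    case "${1:-}" in
        "" | null) return 1 ;;
        *) return 0 ;;
    esac
}
```

Inside the workflow step this could be used as `is_valid_token "$GH_RUNNER_TOKEN" || { echo "could not obtain a registration-token" >&2; exit 1; }`.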

5) Create and renew self-hosted runners as soon as a workflow has finished

The respawn_runners script copies the configuration of the base-container and inherits its valid registration-token.

Run this script as a cronjob to make sure you always have available runners

$ crontab -e
## Add the following line as cronjob. Replace the path and RUNNER_COUNT (=parallel runners)
* * * * * /path/to/lxd-runner/respawn_runners gh-runner-{repo-name} RUNNER_COUNT https://github.com/ORG/REPO

! Make sure to replace the following variables in the above cronjob:

gh-runner-{repo-name} # the name of your base-container
ORG (your github user)
REPO (your github repo that needs the self-hosted runner)
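
Since the cronjob fires every minute, a run that takes longer than a minute could overlap with the next one. A minimal sketch of a guard against that, assuming a hypothetical wrapper function run_exclusive and a lock directory of your choice (neither is part of the project):

```shell
# Run a command only if no other instance holds the lock.
# mkdir is atomic, so only one caller can create the lock directory.
run_exclusive() {
    lock="$1"
    shift
    if mkdir "$lock" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$lock"
        return "$status"
    fi
    echo "another run is still in progress, skipping" >&2
    return 0
}
```

A wrapper script called from cron could then run `run_exclusive /tmp/respawn_runners.lock /path/to/lxd-runner/respawn_runners gh-runner-{repo-name} RUNNER_COUNT https://github.com/ORG/REPO`. If the process is killed while holding the lock, the lock directory has to be removed by hand.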
