Mastodon infrastructure/ops, technical
Had some time to work through the maintenance backlog for mastodon.at, so I thought now would be a good time to share some infrastructure/ops details, both for the sake of comparing notes with other admins and so users of this instance know how their data is stored and (hopefully) kept safe.
This'll be a long, technical thread and I might go off on tangents a time or ten. #MastoAdmin 1/n
Mastodon infrastructure/ops, technical
Let's start with servers. mastodon.at runs on five (virtual) servers.
The database server (Postgres) has two vCPUs, 4GB of RAM and 40GB of disk. The other servers all have one vCPU, 2GB of RAM and 20GB of disk.
There's one server each for Elasticsearch and Redis, plus two application servers (running Sidekiq, Puma and nginx).
All servers run on local NVMe SSD storage. Servers are hosted by Hetzner in Germany. 2/n
Mastodon infrastructure/ops, technical
The servers all run Debian stretch (soon to be buster) with a pretty straightforward configuration.
There's minimal hardening that I apply at the OS/system software level: configuring apt-transport-https, changing the OpenSSH configuration to roughly match Mozilla's guidelines (https://infosec.mozilla.org/guidelines/openssh), enabling unattended-upgrades and AppArmor, and installing sshguard (mostly to prevent log spam and slightly mitigate DoS) and auditd. 3/n
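If you want to sanity-check that hardening, a tiny script can compare a handful of sshd_config directives against what you expect. The directives and values below are illustrative examples in the spirit of Mozilla's guidelines, not a complete or authoritative checklist:

```python
#!/usr/bin/env python3
"""Rough check of a few sshd_config directives against hardening expectations.

The directives/values below are illustrative, not a full policy.
"""

# Directives we expect and the values we want to see (illustrative).
EXPECTED = {
    "passwordauthentication": "no",
    "permitrootlogin": "no",
    "x11forwarding": "no",
}

def parse_sshd_config(path="/etc/ssh/sshd_config"):
    """Return the last value seen for each directive, lower-cased."""
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                settings[parts[0].lower()] = parts[1].strip().lower()
    return settings

if __name__ == "__main__":
    actual = parse_sshd_config()
    for directive, wanted in EXPECTED.items():
        # Directives not present in the file fall back to sshd's built-in defaults.
        got = actual.get(directive, "<default>")
        status = "OK  " if got == wanted else "WARN"
        print(f"{status} {directive}: want {wanted}, got {got}")
```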
Mastodon infrastructure/ops, technical
I run fairly recent kernels from stretch-backports, mostly because I don't believe the "run an ancient kernel and backport fixes/security patches" approach works well.
Ideally I'd run something like CoreOS or another rolling-release distro and update aggressively, but alas there are only so many hours in a day, so my approach is generally to use a stable distribution and update the components in the critical path independently. 4/n
Mastodon infrastructure/ops, technical
Managing secrets for FDE is tricky. Do it manually and you'll need to be ready to enter the passphrase whenever your server reboots (not fun during a short outage at night). Since I didn't want to take that availability trade-off, I use Mandos (https://www.recompile.se/mandos/man/intro.8mandos).
With Mandos, servers can boot unattended, but there's a bit of a security and complexity trade-off (the Mandos documentation has a deep dive on this, good read). 6/n
Mastodon infrastructure/ops, technical
CoreDNS is used as the caching DNS resolver, forwarding to Cloudflare via DNS over TLS.
Main reasons for using CoreDNS: it's written in a memory-safe language, offers great Prometheus integration and provides detailed query logging.
DNS logs offer great insight in incident-response scenarios ("This server was hacked at 12:15 UTC; which DNS queries were sent around that time?").
Cloudflare was chosen as the upstream because of performance and DoT support. 7/n
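As a sketch of that incident-response use case: assuming CoreDNS runs as a systemd unit named "coredns" and its query log ends up in the journal (both assumptions; adjust to wherever your logs land), pulling the queries around a suspicious timestamp can look roughly like this:

```python
#!/usr/bin/env python3
"""Pull CoreDNS query-log lines around a suspected compromise time.

Unit name, time window and the line filter are assumptions; tweak them to
match the log format produced by CoreDNS's "log" plugin on your setup.
"""
import subprocess

def dns_queries_around(since, until, unit="coredns"):
    """Return journal lines for the unit between two timestamps."""
    out = subprocess.check_output(
        ["journalctl", "-u", unit, "--since", since, "--until", until,
         "--no-pager", "--output", "cat"],
        text=True,
    )
    # Crude filter for lines that look like query-log entries.
    return [line for line in out.splitlines() if " IN " in line]

if __name__ == "__main__":
    # "This server was hacked at 12:15 UTC" -> look at a window around it.
    for line in dns_queries_around("2019-08-01 12:00:00", "2019-08-01 12:30:00"):
        print(line)
```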
Mastodon infrastructure/ops, technical
Finally, networking. All servers run ufw (an iptables frontend) and block incoming traffic except for SSH and HTTP/HTTPS (on the application servers).
There's no blocking of outgoing traffic (it would be good as additional defense in depth, but it's tricky to get right with federation, so it's in my backlog).
When servers need to talk to each other (e.g. app servers to Redis or Postgres), that happens on a private, firewalled network encrypted via WireGuard. 8/n
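For illustration, the ufw policy boils down to something like the sketch below; the WireGuard interface name (wg0) and the split between base and app-server rules are assumptions, not a dump of the actual provisioning scripts:

```python
#!/usr/bin/env python3
"""Apply a baseline ufw policy: deny incoming, allow SSH everywhere,
allow HTTP/HTTPS only on the application servers, trust the WireGuard link.
"""
import subprocess

def ufw(*args):
    """Run a single ufw command and fail loudly if it errors."""
    subprocess.run(["ufw", *args], check=True)

def apply_base_rules():
    ufw("default", "deny", "incoming")
    ufw("default", "allow", "outgoing")
    ufw("allow", "22/tcp")           # SSH
    ufw("allow", "in", "on", "wg0")  # private WireGuard network between servers (assumed interface name)

def apply_app_rules():
    ufw("allow", "80/tcp")
    ufw("allow", "443/tcp")

if __name__ == "__main__":
    apply_base_rules()
    # apply_app_rules()  # only on the application servers
    ufw("--force", "enable")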
Mastodon infrastructure/ops, technical
Let's dive into the server roles, starting with Postgres, the database server.
I run Postgres 11, installed via the Postgres apt repository (https://wiki.postgresql.org/wiki/Apt). The server also runs an instance of pgbouncer, a connection pooler for Postgres.
I do some minimal tuning of Postgres, mostly around buffer sizes. The right values depend heavily on RAM, CPU and I/O constraints.
I also enable data checksums to detect silent database corruption. 9/n
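As a rough illustration of that tuning, the usual rules of thumb (shared_buffers around 25% of RAM, effective_cache_size around 75%) can be sketched like this; these are generic heuristics, not the exact values this server runs with:

```python
#!/usr/bin/env python3
"""Back-of-the-envelope Postgres memory settings for a small server.

Common rules of thumb only; always sanity-check against the actual workload.
"""

def suggest_settings(ram_mb, max_connections=100):
    return {
        "shared_buffers": f"{ram_mb // 4}MB",
        "effective_cache_size": f"{ram_mb * 3 // 4}MB",
        # Leave head-room: work_mem applies per sort/hash node, not per connection.
        "work_mem": f"{max(4, ram_mb // (max_connections * 4))}MB",
        "maintenance_work_mem": f"{min(1024, ram_mb // 16)}MB",
    }

if __name__ == "__main__":
    # The database server described above has 4GB of RAM.
    for key, value in suggest_settings(4096).items():
        print(f"{key} = {value}")
```

Data checksums, by contrast, aren't a runtime knob on Postgres 11: they have to be enabled when the cluster is initialized (initdb --data-checksums).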
Mastodon infrastructure/ops, technical
Postgres is Mastodonās main data store, so backing up that data regularly is crucial.
I use three approaches to cover different data loss scenarios.
First, a disk snapshot of the server is created at least once a month as well as before and after any maintenance work on the server. The main purpose of this backup is to reduce recovery time by providing a fast way to spin up a new instance of Postgres with the right configuration. 10/n
Mastodon infrastructure/ops, technical
I also create a logical backup of the database using pg_dump once a day. The backup is encrypted and stored on a separate volume on the server. It's also pulled by my home server once a day, where backup rotation keeps daily backups for a week as well as weekly backups for a month.
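The nightly dump roughly follows this pattern (a sketch, not the actual script; the database name, backup path and GPG key are placeholders):

```python
#!/usr/bin/env python3
"""Nightly logical backup: pg_dump in custom format, then GPG-encrypt it."""
import datetime
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")   # separate volume (placeholder path)
DB_NAME = "mastodon_production"              # assumed database name
GPG_RECIPIENT = "backup@example.invalid"     # hypothetical key ID

def nightly_backup():
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    dump = BACKUP_DIR / f"{DB_NAME}-{stamp}.pgdump"

    # Custom-format dump: compressed and restorable with pg_restore.
    subprocess.run(["pg_dump", "-Fc", "-f", str(dump), DB_NAME], check=True)

    # Encrypt to the backup key and drop the plaintext dump.
    subprocess.run(
        ["gpg", "--batch", "--yes", "--recipient", GPG_RECIPIENT,
         "--output", f"{dump}.gpg", "--encrypt", str(dump)],
        check=True,
    )
    dump.unlink()

if __name__ == "__main__":
    nightly_backup()
```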
Completion and pulling of backups is monitored using something like a dead man's switch: if no backup completes or is pulled within a 25-hour window, monitoring screams at me. 11/n
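The freshness check itself can be as dumb as comparing the newest backup's mtime against that 25-hour window; a minimal sketch (placeholder path, and in practice this kind of check lives in the monitoring system):

```python
#!/usr/bin/env python3
"""Dead man's switch for backups: alert if the newest backup is older than 25h."""
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")  # placeholder path
MAX_AGE_SECONDS = 25 * 3600                 # the 25-hour window mentioned above

def newest_backup_age(directory):
    """Age in seconds of the most recent encrypted dump, or infinity if none exist."""
    backups = list(directory.glob("*.pgdump.gpg"))
    if not backups:
        return float("inf")
    newest = max(p.stat().st_mtime for p in backups)
    return time.time() - newest

if __name__ == "__main__":
    age = newest_backup_age(BACKUP_DIR)
    if age > MAX_AGE_SECONDS:
        print(f"ALERT: newest backup is {age / 3600:.1f}h old")
        sys.exit(1)
    print("backups look fresh")
```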
Mastodon infrastructure/ops, technical
Finally, I use wal-g (https://github.com/wal-g/wal-g) to ship encrypted WAL files to Amazon's S3 every 5 minutes.
WAL (write-ahead log) files can be thought of as a record of the changes the database makes to its permanent data files.
By continuously archiving WAL files, you get point-in-time recovery for your data. In the configuration I use, the maximum window for data loss is roughly 5 minutes (unless the failure scenario involves WAL archive failure). 12/n
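One thing worth automating is a check that WAL segments actually keep arriving in the archive, otherwise that 5-minute window silently grows. A sketch using boto3, with a placeholder bucket and prefix (wal-g's real key layout may differ):

```python
#!/usr/bin/env python3
"""Sanity-check that WAL segments keep arriving in the S3 archive."""
import datetime

import boto3

BUCKET = "example-wal-archive"  # placeholder bucket name
PREFIX = "wal_005/"             # placeholder prefix for WAL segments

def newest_wal_age_minutes():
    s3 = boto3.client("s3")
    # Note: only looks at the first page of results (up to 1000 keys); fine as a sketch.
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    objects = resp.get("Contents", [])
    if not objects:
        return float("inf")
    newest = max(obj["LastModified"] for obj in objects)
    now = datetime.datetime.now(datetime.timezone.utc)
    return (now - newest).total_seconds() / 60

if __name__ == "__main__":
    age = newest_wal_age_minutes()
    # If nothing has arrived for a while, the ~5 minute data-loss window no longer holds.
    print(f"newest archived WAL segment is {age:.1f} minutes old")
```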
Mastodon infrastructure/ops, technical
Being paranoid about backups, monitoring their success and not putting all your eggs in one basket are important lessons, often learned the hard way.
Things you should think about:
Can I access my backups if my hosting provider throws me out?
Are my backups safe if my server is compromised?
Would I notice if my backups are silently failing?
Do I know restoration works? (see the restore-test sketch after this list)
Do I have a fallback if one of my backup tools has a bug that leads to corruption? 13/n
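For the restoration question, the only convincing answer is to actually restore. A sketch of such a test (placeholder paths and sanity query; run it somewhere other than the production host):

```python
#!/usr/bin/env python3
"""Restore the latest dump into a scratch database and run a basic sanity query."""
import subprocess

DUMP = "/var/backups/postgres/latest.pgdump"  # placeholder: already-decrypted dump
SCRATCH_DB = "restore_test"

def run(*cmd):
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run("dropdb", "--if-exists", SCRATCH_DB)
    run("createdb", SCRATCH_DB)
    run("pg_restore", "--no-owner", "-d", SCRATCH_DB, DUMP)
    # Minimal sanity check; the table name is just an example query against the restored schema.
    run("psql", "-d", SCRATCH_DB, "-c", "SELECT count(*) FROM accounts;")
```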
Mastodon infrastructure/ops, technical
I use full disk encryption via LUKS.
I have mixed feelings on this topic: whenever I encounter FDE in the wild, it's mostly implemented for compliance without any regard for threat modeling, often trying to make you believe that ticking the "encryption at rest" checkbox means your data is safe.
All that being said, there are some scenarios where encryption at rest might help: stolen disks, compromised disk snapshots and whatnot. 5/n
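If you want to verify that the "encryption at rest" checkbox corresponds to reality on a given machine, cryptsetup can tell you whether a device actually carries a LUKS header; the device paths below are placeholders:

```python
#!/usr/bin/env python3
"""Quick check that the data volumes really are LUKS containers."""
import subprocess

DEVICES = ["/dev/sda1", "/dev/sdb1"]  # placeholder device paths

def is_luks(device):
    # cryptsetup exits 0 if the device has a LUKS header, non-zero otherwise.
    result = subprocess.run(["cryptsetup", "isLuks", device],
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

if __name__ == "__main__":
    for dev in DEVICES:
        print(f"{dev}: {'LUKS' if is_luks(dev) else 'NOT encrypted'}")
```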