Lessons Learned from a Blog Post Error

Last month I wrote a blog post about how to set up Let’s Encrypt for a Docker-based Web application. However, it contained a major error which I only discovered later on. I corrected it in the original post, but as I thought about it some more, I realized that there were some lessons to be learned, both about the specific technologies and software in general.

The Error

The error stemmed from the fact that NGINX does not reload its configuration while it is running unless you tell it to. What is easy to overlook is that HTTPS certificates are loaded as part of that configuration. So even when the certificate files on disk are updated, NGINX does not detect the change and continues to serve the old certificate.
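One way to observe this symptom is to compare the certificate on disk with the one the server actually presents. The following is a minimal Python sketch under stated assumptions, not something taken from the original post: the function names are hypothetical, the host and file path are supplied by the caller, and the on-disk file is assumed to be a PEM bundle whose first entry is the site's own certificate.

```python
import hashlib
import ssl


def fingerprint(pem_cert: str) -> str:
    """Return the SHA-256 fingerprint (hex) of one PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest()


def first_cert(pem_bundle: str) -> str:
    """Extract the first certificate from a PEM bundle (assumed to be the leaf)."""
    footer = "-----END CERTIFICATE-----"
    end = pem_bundle.index(footer) + len(footer)
    return pem_bundle[:end].strip() + "\n"


def is_serving_stale_cert(host: str, cert_path: str, port: int = 443) -> bool:
    """True if the certificate served by host differs from the one on disk."""
    # Fetches the certificate the running server presents right now.
    served = ssl.get_server_certificate((host, port))
    with open(cert_path) as f:
        on_disk = first_cert(f.read())
    return fingerprint(served) != fingerprint(on_disk)
```

If `is_serving_stale_cert` returns True shortly after a renewal, the files were updated but the running server has not picked them up yet.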

The remedy was to trigger a reload of NGINX’s configuration whenever the renewal process runs. (This is the ExecStartPost line in the systemd unit.) This is not ideal because Certbot’s renewal process is designed to run every day or even multiple times a day, so a reload happens every day even if the certificates haven’t changed. I’m sure there are ways to get around this, but for my situation a daily reload of the Web server is acceptable.
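One possible way around the unconditional daily reload is to reload only when the certificate file actually changed during the renewal run. The sketch below is illustrative rather than a description of my actual setup: `renew_then_reload_if_changed` is a hypothetical helper, and the `renew` and `reload` callables are placeholders for whatever commands run Certbot and reload NGINX in a given environment.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, or "" if the file does not exist yet."""
    if not path.exists():
        return ""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def renew_then_reload_if_changed(
    cert_path: Path,
    renew: Callable[[], None],
    reload: Callable[[], None],
) -> bool:
    """Run renew(), then call reload() only if cert_path changed on disk.

    Returns True if a reload was triggered.
    """
    before = file_digest(cert_path)
    renew()  # placeholder for the renewal step (assumption)
    changed = file_digest(cert_path) != before
    if changed:
        reload()  # placeholder for the NGINX reload step (assumption)
    return changed
```

In practice the two callables would probably wrap `subprocess.run` calls. Certbot also has a `--deploy-hook` option that runs a command only after a certificate has actually been renewed, which may be the simpler route where it fits the setup.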

The Lessons

Probably the biggest lesson here is to test everything, which includes the entire lifecycle of a process and not just the initial phases. In my case, that means making sure that the renewal actually works. I did monitor the certificate renewal on my site when it was due, but I should have run through the entire process while preparing my demo for the blog post, or at least held off on publishing the post until I had made sure that renewal worked.

Closely related is to monitor critical aspects of your application. Monitoring and alerts allow potential problems to be identified sooner, so they can be fixed with less impact on users. In my case, there are several tools and services that monitor HTTPS certificate expiration, any of which could have identified this problem had I not checked for myself.
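As a rough idea of what such a check involves, here is a minimal Python sketch using only the standard library. It is not one of the tools mentioned above: the function names are hypothetical, and any alert threshold would be a choice for the person running it.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between now (epoch seconds) and a certificate's notAfter string.

    not_after uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". Negative means already expired.
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def served_not_after(host: str, port: int = 443) -> str:
    """Return the notAfter field of the certificate that host is serving.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(served_not_after(host))` and send an alert when the result drops below the point at which renewal should already have happened. Because it inspects what the server is actually serving, this kind of check would catch a stale certificate even when the files on disk are up to date.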

Another lesson is to ask whether the right tools are being used. The error I made arose from the particular setup that I have, and there are several alternatives that either eliminate this particular problem or offer a different solution. Here are some questions I could ask about my setup:

  • Is NGINX the right Web server to use? One alternative is Caddy, which has HTTPS-related logic built in, including automatic Let’s Encrypt renewal and redirection from HTTP to HTTPS.
  • Should NGINX be running in a container? An alternative would be to run NGINX as a reverse proxy on the host machine while keeping the actual application in Docker. To be fair, the same error could still arise here, but this arrangement would allow me to use Certbot’s automatic setup, which handles the reload properly.
  • Is a systemd unit the right way to schedule automatic renewal? I mentioned some alternatives in the original blog post, and they are worth considering.
  • Should Docker Compose be used to manage the containers? I’m using Docker Compose because it’s relatively simple and works well enough for a low-traffic personal server. For a high-traffic application, something like Kubernetes might be better. Of course, Kubernetes presents its own challenges, even if it gets around this particular problem in my case.
  • Should containers be used at all? Containerized applications present new problems that must be solved, compared to non-containerized applications. Migrating an existing application to use containers is a significant feat, so there has to be a good reason for it.

All of these questions involve tradeoffs, which is the final lesson I’ll present here: Tradeoffs are inevitable in the realm of software. All of the alternative setups I’ve mentioned are “better” in some way, but they all have disadvantages. Whether the advantages outweigh the disadvantages depends on context: the application being built, the people building it, and the environment they operate in. All of these can change, which is why it’s good to periodically ask these questions and consider whether a different tradeoff makes sense.

Closing Thoughts

A postmortem is a document written after a failure or outage that describes the causes of the failure as well as measures that can be taken to prevent similar failures in the future. It presents an opportunity to reflect on the bigger picture of what led to the failure and not just the immediate cause. In a way this follow-up serves as a postmortem for the error I made in my previous post. Even though it wasn’t a software failure in the usual sense, the lessons learned can be applied to many different aspects of building software systems.

Philip Chung
Software Developer