Securing communication across untrusted networks is a must in modern infrastructure. Even where the traffic itself is not sensitive, the industry increasingly demands trust between parties, if for no other reason than to know with whom you are communicating. But how do you get a valid certificate? Well, LetsEncrypt!

I’ve recently found myself needing to secure traffic for private DNS zones and networks. If you’ve been around and paying attention, skip the background, as I’m sure you’re aware of all this. For those who don’t currently use LetsEncrypt, a bit of background may be in order.

Did someone order a bit of background?

The attitude towards securing communication across an untrusted link doesn’t always extend into the network segments we consider trusted. It might be a VPN, or a WPA key, or some other means of trusting those links, but we often consider this sufficient for getting business done. We have a single factor of trust in this scenario, and we shrug off the rest. The reason for this neglect often comes down to one simple fact: client management can be a pain, and many organizations choose not to do it at all.

There is so much variability in what organizations are willing to invest in client machines that client management is done very little outside of the larger enterprises.

Having spent the last few years working on startup infrastructure, I’ve observed that it’s something that certainly goes by the wayside during a company’s early years.

I’ve always loved the idea of rolling an internal and private PKI, deploying the trusted CA to each client, and, in grown-up environments, even doing certificate-based client authentication for internal infrastructure. I’ve personally only dabbled here, and in my career I’ve only ever seen this done well once, and it was there before I arrived. In that environment, there were over 6000 desktops, so the investment in managing those clients was a requirement for the business. Even in such a network there was enough clear HTTP traffic to make me nervous.

The overhead required to secure internal communication is harder to justify in small environments. Furthermore, at smaller scales, and particularly in young organizations, many of the policies, and the ability to enforce any corporate policy at all, are completely lacking, leaving employees free to do whatever they want on their client machines, which might include running a niche operating system, and like hell they’ll let the IT guy install some management software. This freedom is actually part of what many people appreciate about working in these small environments, and as such, it’s defended. So the human element only adds to the complexity of client management.

Securing communication between server machines is often much simpler. Many service configurations simply take a filesystem path from which to load the certificate, key, and CA. In such a case, passing your own CA certificate as the source of trust is just as simple as using a valid cartel-of-trust certificate from the internet at large.

Public TLS certificates have historically cost money, and an internal CA costs money to operate and maintain. For those who still run hardware: when was the last time you connected to your hardware’s out-of-band management interface and your browser was able to trust the session?

Many of these (perhaps legacy) attitudes toward managing TLS can now be rectified by the glory that is free TLS certificates. LetsEncrypt has made quite a splash by providing FREE TLS certificates for whatever you want. It’s cool shit, and I’ve wanted to use it for pretty much everything since I first heard of it.

The problem becomes my own

I have recently found myself in a position where securing internal traffic had been completely neglected. In the rare circumstance where TLS was used for communication, there was no trust between the parties.

The standard scripts for LetsEncrypt expect hosts to be directly reachable from the internet (the default http-01 challenge works by serving a token over HTTP), but almost none of my hosts are. Only the web infrastructure was available externally, while all of the communication I was looking to secure was on private networks, using private DNS zones.

So I set out to solve this problem:

  • Secure traffic between machines
  • Secure internal communication between machines and humans

Specifically, I’m just talking about TLS.

The solution will set you free…(from manual certificate renewal)

The solution I’ve applied is as follows, leaving out how a node comes online.

A node checks into Puppet for the first time, installs a service that will make use of the TLS certificate, exports a TLS service check for that certificate to the monitoring system, and configures dehydrated (the renamed letsencrypt.sh script).

The service will fail to start due to the missing certificate on disk, but a sudo entry is put in place to allow the Icinga server to execute dehydrated.
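That sudoers entry amounts to something like the following line. The icinga username and the SETENV tag (which lets the caller set the AWS variables on the sudo command line) are assumptions of this sketch, not necessarily what my manifests generate.

icinga ALL=(root) NOPASSWD:SETENV: /opt/bin/dehydrated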

The monitoring system then executes the check to determine the health of the TLS certificate, which fails on account of the service being unavailable. Upon failure, the monitoring system shells into the target node, passing in the environment variables necessary to modify the DNS records in Route53 through the SSH session, and executes the following command.

/opt/bin/dehydrated -c -d myservice.internal.example.com -t dns-01 -k /opt/bin/monitoring_renew_hook.sh
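Seen from the Icinga server’s side, the whole remote execution amounts to something roughly like this (the hostname and credential values are illustrative):

ssh myservice.internal.example.com \
    "sudo AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... \
    /opt/bin/dehydrated -c -d myservice.internal.example.com -t dns-01 -k /opt/bin/monitoring_renew_hook.sh"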

Here monitoring_renew_hook.sh is just a wrapper script to ensure the correct Python virtualenv is used; this was only to address an issue where the -k option wouldn’t take a command with a space in it. The name myservice.internal.example.com is the name of the service check that has been exported from the node. monitoring_renew_hook.sh contains the following.

#!/bin/bash
# Run the hook with the Python interpreter from the monitoring
# virtualenv so the script's dependencies resolve.
/opt/monitoring/bin/python3 /opt/bin/letsencrypt_dns_hook.py "$@"

This executes the Python script that manipulates DNS during the dns-01 validation phase of dehydrated. Since the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are available while the script runs, the node modifies the Route53 DNS records to pass validation; after about a minute, validation succeeds, the certificate is written to disk, and the temporary DNS record is removed.
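For the curious, here is a minimal sketch of what such a hook can look like against boto3’s Route53 API. It is not the actual script (that appears at the end of this post); the EXTERNAL_ZONE_ID environment variable and the simplified stage handling are assumptions of the sketch.

#!/usr/bin/env python3
# Hypothetical sketch of a dehydrated dns-01 hook for Route53.
import os
import sys
import time

import boto3  # picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment

# Hosted zone ID for the external zone (example.com); assumed here to
# arrive via the environment alongside the AWS credentials.
EXTERNAL_ZONE_ID = os.environ["EXTERNAL_ZONE_ID"]


def change_record(action, domain, token):
    """UPSERT or DELETE the _acme-challenge TXT record for `domain`."""
    client = boto3.client("route53")
    response = client.change_resource_record_sets(
        HostedZoneId=EXTERNAL_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": action,
                "ResourceRecordSet": {
                    "Name": "_acme-challenge.{}.".format(domain),
                    "Type": "TXT",
                    "TTL": 60,
                    # TXT record values must be wrapped in double quotes.
                    "ResourceRecords": [{"Value": '"{}"'.format(token)}],
                },
            }],
        },
    )
    # Block until Route53 reports the change as live, so dehydrated
    # doesn't ask LetsEncrypt to validate before the record resolves.
    change_id = response["ChangeInfo"]["Id"]
    while client.get_change(Id=change_id)["ChangeInfo"]["Status"] != "INSYNC":
        time.sleep(5)


def main():
    stage, args = sys.argv[1], sys.argv[2:]
    if stage == "deploy_challenge":
        # dehydrated passes: domain, token filename, token value
        change_record("UPSERT", args[0], args[2])
    elif stage == "clean_challenge":
        change_record("DELETE", args[0], args[2])
    # Other stages (deploy_cert, unchanged_cert, ...) need no DNS work.


if __name__ == "__main__":
    main()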

On the next Puppet run, the TLS certificate is available on the node’s disk and is deployed to a location where the service can use it, and the service is now able to start.

The monitoring server then checks the certificate, which now comes back clean.

This also means that when the certificate is nearing expiration, the Icinga check will fail and the dehydrated script is called again to renew it. The result should be completely automated, valid TLS certificates that are generated when they are needed and renewed as they near expiration.
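As an illustration, the Icinga side of that loop can be wired up with a Nagios-style service definition: the check warns inside an expiry window (check_http’s -C flag takes a number of days), and the event handler kicks off the renewal over SSH. The command names below are hypothetical:

define service {
    use                  generic-service
    host_name            myservice
    service_description  tls_myservice.internal.example.com
    check_command        check_tls_cert!myservice.internal.example.com!443
    event_handler        renew_letsencrypt_cert!myservice.internal.example.com
}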

Faking the DNS zone

The example zone we’re modifying is example.com, but the certificate we’re acquiring is for myservice.internal.example.com. This means the record created during validation is a TXT record at _acme-challenge.myservice.internal in the zone example.com. The private zone is untouched, and the record created in the external zone exists only for a brief moment during validation and is then removed. You’ll see references to the external_zone in the hook script.

Security of the approach

The AWS credentials are stored only on the Icinga server and passed to the dehydrated script on the target system only when they are required, and over SSH. This is nice because if I need to change the credentials, I can do so in a single place, so key rotation gets a bit easier. It has the added benefit of not leaving credentials lying around on every system that needs valid TLS.

This also means that the private key is generated in place, and I’m not shipping keys around to systems when they require them. When a service needs a TLS certificate, the monitoring system triggers generation of the required material only where it is needed, and it never has to leave that system. That said, this approach would still lend itself well to generating a bunch of keys on a single system and pushing them down to the nodes that require them, presumably through Puppet or some other secure distribution channel.

Results

We’re still in the process of rolling this out to more services, but so far we’re approaching a dozen or so certificates generated using this process, most but not all of them for Nginx. Time will tell us more, since certificates have to begin expiring before they are renewed, but I expect this approach to scale as far as my employer will require.

Here is the hook script I worked up to support this approach. Maybe it’s useful to someone else.