While the frontend ELB works out of the box with Varnish (no surprises there), the backend ELB doesn’t behave as expected. The problem lies in the fact that Varnish resolves the hostname assigned to the ELB and caches the resulting IP addresses until the VCL gets reloaded. Because of the dynamic nature of the ELB, the IPs behind the CNAME can change at any time, so Varnish can end up routing traffic to an IP that is no longer associated with the correct ELB.
The problem is discussed here and here, but after Googling around I couldn’t find any solution that didn’t involve doing:
ELB -> VARNISH -> NGINX (or HAproxy) -> ELB -> AUTOSCALING GROUP
Going through so many layers seemed excessive, considering that Varnish can load-balance requests and perform health checks on the backend nodes without the need for an internal ELB. The more I thought about it, the more I realised how simple it would be to implement a solution... so I did it. Using Varnish to perform the load balancing removes the overhead of going through an internal ELB, and the backends only need reloading when an autoscaling activity takes place.
The solution I’ve implemented uses the varnishadm command-line tool, boto, and some bash scripting to glue it all together.
First of all, we need to get the list of backend nodes configured in Varnish and store it in a file:
varnishadm -T $HOSTPORT -S $SECRET backend.list > varnish_ips
Then we have to query the autoscaling group and update the backends if any instance has been added or terminated. The following Python code does most of the work:
Let’s break it down:
- get_autoscaling_ips gets the IPs of the instances attached to a specific autoscaling group.
- get_varnish_ips loads the backend IPs into a Python list.
- update_vcl_file compares the two lists of IPs. If they differ at all (you might want to reconsider this criterion), it creates a new VCL file containing the IPs retrieved from the autoscaling group.
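The original script isn’t reproduced here, but the three functions above might be sketched as follows. The function names follow the post; the boto calls, the regex parsing of backend.list, the file paths, and the VCL template are my assumptions, not the original code:

```python
# Sketch of the backend-sync script described in the post.
import re

BACKENDS_FILE = "varnish_ips"            # output of `varnishadm backend.list`
VCL_FILE = "/etc/varnish/backends.vcl"   # generated backends config


def get_autoscaling_ips(group_name, region="us-east-1"):
    """Return the sorted private IPs of the instances in an autoscaling group."""
    # boto is imported lazily so the parsing helpers below can be used
    # (and tested) without AWS credentials.
    import boto.ec2
    import boto.ec2.autoscale

    asg = boto.ec2.autoscale.connect_to_region(region)
    ec2 = boto.ec2.connect_to_region(region)
    group = asg.get_all_groups(names=[group_name])[0]
    instance_ids = [i.instance_id for i in group.instances]
    reservations = ec2.get_all_instances(instance_ids=instance_ids)
    return sorted(i.private_ip_address
                  for r in reservations for i in r.instances)


def get_varnish_ips(path=BACKENDS_FILE):
    """Extract the backend IPs from a saved `backend.list` dump."""
    ips = set()
    with open(path) as f:
        for line in f:
            m = re.search(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b", line)
            if m:
                ips.add(m.group(1))
    return sorted(ips)


def update_vcl_file(asg_ips, varnish_ips, path=VCL_FILE):
    """Rewrite the backends VCL when the two IP lists differ.

    Returns True when a new file was written, False otherwise."""
    if sorted(asg_ips) == sorted(varnish_ips):
        return False
    lines = ['include "healthcheck.vcl";', ""]
    for n, ip in enumerate(sorted(asg_ips)):
        lines.append('backend node%d { .host = "%s"; .port = "80"; '
                     '.probe = healthcheck; }' % (n, ip))
    lines += ["", "director backends round-robin {"]
    lines += ["  { .backend = node%d; }" % n for n in range(len(asg_ips))]
    lines += ["}", "", 'include "use.vcl";', ""]
    with open(path, "w") as f:
        f.write("\n".join(lines))
    return True
```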
To decouple the VCL section that defines request handling and document caching policies (unlikely to change with autoscaling activity) from the section that configures the backends, the Python script outputs the new VCL in the following format:
The node definitions and the director definition are generated dynamically by the script, while healthcheck.vcl is a static file where the health-check conditions are defined (what a surprise :) and use.vcl is another static Varnish config file, which makes use of the director definition.
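Concretely, the generated file might look something like this (the node names, IPs, and Varnish 3 director syntax are illustrative, not taken from the original script):

```vcl
include "healthcheck.vcl";

backend node0 { .host = "10.0.1.12"; .port = "80"; .probe = healthcheck; }
backend node1 { .host = "10.0.1.34"; .port = "80"; .probe = healthcheck; }

director backends round-robin {
  { .backend = node0; }
  { .backend = node1; }
}

include "use.vcl";
```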
Once the new VCL is generated, it’s just a matter of reloading it by running:
varnishadm -T $HOSTPORT -S $SECRET vcl.load $NAME $FILE
varnishadm -T $HOSTPORT -S $SECRET vcl.use $NAME
Something I noticed while creating the script is that backend.list returns the list of all configured backends, regardless of whether the VCL that defines them is in use or not. This behaviour would make the whole exercise of comparing VCL backends with autoscaling IPs pointless, so we need to remove all the previous VCL configs by running:
varnishadm -T $HOSTPORT -S $SECRET vcl.discard $OLD_VCL
The three steps can be glued together in a bash script that runs as a cron job on each Varnish server. The code above has not been used in production yet, so please test thoroughly before using it. I’m always curious to hear feedback, so get in touch if you have any comments.
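As a rough idea of what that cron glue could look like, here is a sketch; the HOSTPORT/SECRET values, file paths, Python script name, and VCL naming scheme are all assumptions:

```shell
#!/bin/bash
# Sketch: dump current backends, regenerate the VCL if the autoscaling
# group changed, then load/activate it and discard the stale configs.
set -e

HOSTPORT="localhost:6082"
SECRET="/etc/varnish/secret"
NEW_NAME="backends_$(date +%s)"

# 1. Current backends as Varnish sees them
varnishadm -T "$HOSTPORT" -S "$SECRET" backend.list > /tmp/varnish_ips

# 2. Regenerate the VCL; assume the script exits non-zero when nothing changed
if python update_backends.py /tmp/varnish_ips; then
    # 3. Load and activate the new config
    varnishadm -T "$HOSTPORT" -S "$SECRET" vcl.load "$NEW_NAME" /etc/varnish/default.vcl
    varnishadm -T "$HOSTPORT" -S "$SECRET" vcl.use "$NEW_NAME"

    # 4. Discard every inactive VCL so backend.list stays accurate
    for vcl in $(varnishadm -T "$HOSTPORT" -S "$SECRET" vcl.list |
                 awk '$1 == "available" {print $NF}'); do
        varnishadm -T "$HOSTPORT" -S "$SECRET" vcl.discard "$vcl"
    done
fi
```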
As usual, please reach out to us if you need any help or advice using AWS!