Reconfiguring Tableau Server without Restart: Graceful Restart (part2)
Graceful restart: the promise of restarting core components like vizqlserver without impacting users. As of now Tableau Software offers no official support for black magic like this, but then, you aren’t here for officially supported stuff. You are here for actual solutions. Assuming that you know how Tableau Server’s configuration template engine works and are familiar with the basics of the Tableau Server gateway architecture, including its services’ sticky sessions, let’s jump into the details.
Graceful Tableau Server Restart aka Draining Mode
The key to graceful restart (= restarting Tableau Server services without impacting production, for example without disconnecting established vizql sessions) is the load balancer’s Draining Mode. As a reminder, the Tableau Server gateway uses load balancer clusters to balance requests across services of the same kind. For example, httpd.conf defines balancer://vizqlserver-cluster, which contains all vizqlserver backend application servers. When someone hits a URL starting with /vizql, the request is forwarded to one of the worker URLs. If the web session already carries a vizql session ID, the Apache gateway picks the same worker for that session ID.
If you check Tableau Server’s LoadBalancer status in the balancer-manager, you will see something like this:
You might remember that usual vizql session IDs look like:
The first part is a 128-bit unique identifier, then a dash, then the route identifier as seen in the balancer configuration. Yes, this is how you can tell which vizql session belongs to which vizqlserver worker.
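To make the session ID layout concrete, here is a minimal Python sketch that splits an ID into its unique part and its route. It assumes the 128-bit identifier is written as 32 hexadecimal characters and that the route looks like the 0:0 / 0:1 worker labels used by the balancer; the function name and the sample ID are made up for illustration.

```python
import re

# Assumption: 128 bits rendered as 32 hex characters, then a dash, then the route.
SESSION_ID_PATTERN = re.compile(r"^([0-9A-Fa-f]{32})-(.+)$")


def split_vizql_session_id(session_id: str) -> tuple[str, str]:
    """Return (unique_id, route) for a vizql session ID (hypothetical helper)."""
    match = SESSION_ID_PATTERN.match(session_id)
    if match is None:
        raise ValueError(f"unexpected session id format: {session_id!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # Made-up session ID, not taken from a real server.
    sample = "0123456789ABCDEF0123456789ABCDEF-0:1"
    unique_id, route = split_vizql_session_id(sample)
    print(f"session {unique_id} is pinned to worker route {route}")
```

If your session IDs use a different encoding for the first part, only the regular expression needs to change.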
Now the fun part.
Apache’s documentation defines drain mode as:
When worker is in drain mode it will only accept existing sticky sessions destined for itself and ignore all other requests.
Sounds like exactly what we need: the load balancer keeps routing requests of already established sessions (like vizql or data server) to their existing workers, while new sessions skip the workers in drain mode.
Now that we can control who is doing what, we can define the process flow for restarting vizql servers:
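The routing rule can be illustrated with a tiny simulation. This is not Apache’s code, only a sketch of the decision under simplifying assumptions: the Worker class and pick_worker function are hypothetical, and new sessions simply go to the first eligible worker instead of following a real lbmethod.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    route: str              # route label, e.g. "0:0"
    draining: bool = False  # drain mode: only existing sticky sessions allowed
    disabled: bool = False  # disabled: no requests at all


def pick_worker(workers: List[Worker], sticky_route: Optional[str] = None) -> Optional[Worker]:
    """Choose a worker for a request (simplified model of sticky routing)."""
    # An established session goes back to its own worker, even if it is draining.
    if sticky_route is not None:
        for worker in workers:
            if worker.route == sticky_route and not worker.disabled:
                return worker
    # New sessions (or sessions whose worker is disabled) skip draining workers.
    for worker in workers:
        if not worker.draining and not worker.disabled:
            return worker
    return None


if __name__ == "__main__":
    cluster = [Worker("0:0", draining=True), Worker("0:1")]
    print(pick_worker(cluster, sticky_route="0:0").route)  # existing session stays on 0:0
    print(pick_worker(cluster).route)                      # new session lands on 0:1
```

Once the last sticky session on the draining worker ends, nothing is routed to it anymore, which is the moment it becomes safe to restart.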
- Change a configuration setting in workgroup.yml (like trusted hosts or log level) and issue tabadmin configure
- Pick the first worker/route from the LoadBalancer and put it into draining mode. This ensures that no new sessions are routed to this worker
- Wait until all sessions finish or time out (you can monitor this with JMX). Additionally, you can add a hard timeout for the restart, like 30-60 mins.
- Change the worker state from Draining to Disabled
- Send a terminate signal to vizqlserver. If you’re sophisticated, you can enable Tomcat’s shutdown port, so you don’t have to kill the process in a barbarian way.
- vizqlserver will restart automatically
- Change the worker mode to Draining Mode = Off, Disabled = Off
- Go to the next vizql worker
That’s it. The same flow applies to other services like data server, vizportal or saml-service.
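The per-worker loop from the steps above can be sketched in Python as follows. This is not the tabadmin-cli implementation, only an outline under stated assumptions: set_state, active_sessions and stop_process are hypothetical callables you would have to back with the balancer-manager, JMX and your process control of choice, and the configuration change itself (step one) is expected to have happened already.

```python
import time
from typing import Callable, Iterable


def graceful_restart(
    routes: Iterable[str],
    set_state: Callable[..., None],          # hypothetical: set_state(route, draining=..., disabled=...)
    active_sessions: Callable[[str], int],   # hypothetical: session count for a route, e.g. read via JMX
    stop_process: Callable[[str], None],     # hypothetical: signal the service behind a route to stop
    poll_seconds: float = 60,
    hard_timeout_seconds: float = 3600,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Restart workers one at a time, draining each before stopping it."""
    for route in routes:
        # No new sessions from now on; established sticky sessions continue.
        set_state(route, draining=True, disabled=False)

        # Wait for sessions to finish, but never longer than the hard timeout.
        deadline = clock() + hard_timeout_seconds
        while active_sessions(route) > 0 and clock() < deadline:
            sleep(poll_seconds)

        # Draining -> Disabled, then stop the process; it restarts automatically.
        set_state(route, draining=False, disabled=True)
        stop_process(route)
        sleep(poll_seconds)  # fixed wait for the service to come back up

        # Put the worker back into rotation before moving to the next one.
        set_state(route, draining=False, disabled=False)
```

Passing sleep and clock as parameters keeps the loop easy to dry-run with fakes before pointing it at a production gateway.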
How do we use it?
Here at Starschema we prefer to have our own server administration tool chain. In addition to Palette Center and Insight, we built dozens of tools to support complex tasks like this phased, graceful restart. One of our tools of choice is tabadmin-cli, a readline-based tabadmin shell built on top of tabadmin.jar. It makes things faster, as you need to ramp up the JVM/JRuby stack only once and reuse that JVM for all subsequent calls, plus you get convenient code completion. Graceful restart is also included in the tool and uses the same process flow as described above.
e:\Tableau\Tableau Server\10.0\bin> tabadmin-cli
tabadmin> graceful restart vizqlserver
Locating vizqlserver-cluster workers from balancer-manager
vizqlserver 0:0 http://localhost:9100
vizqlserver 0:1 http://localhost:9101
Graceful restart worker 0:0
Switch worker to Draining mode
Connecting to JMX endpoint jmx://localhost:9400
Number of active sessions 3. Sleeping 60secs
Number of active sessions 1. Sleeping 60secs
No active sessions. Switch worker to Disabled mode
Sending stop signal to process 1844. Sleeping 60 secs
Switch worker to Non-disabled mode
Graceful restart worker 0:1
Switch worker to Draining mode
Connecting to JMX endpoint jmx://localhost:9401
Number of active sessions 1. Sleeping 60secs
No active sessions. Switch worker to Disabled mode
Sending stop signal to process 9176. Sleeping 60 secs
Switch worker to Non-disabled mode
Graceful restart complete
tabadmin> status
Status: RUNNING
tabadmin>
Magical, isn’t it? I have to confess, not just because it’s my own creation, but I just love this tool.
If Tableau Support or the Knowledge Base tells you that you have to restart your services to change log levels or add an IP to trusted hosts, just ignore it. No, you definitely don’t have to. You just need to know what needs to be restarted and how. Understanding the gateway’s load balancing and rewrite rules helps you perform the necessary steps and avoid planned outages, ensuring that your user base can see and understand data without any interruption.
Graceful restart is nice, but frankly, who wants to restart services if you can change a running process’ memory? With advanced reverse engineering, disassembler and debugger tools you can change any running process’ behaviour. Need to change the log level? No need to restart the server, just change the memory address where it manages the log writes. Sounds scary? Just stay tuned, you’ll learn a lot about assembly, linking, symbol hooking and memory patching in part3!