SSH slaves are great. Get a box, install Linux on it with a jenkins account, and tell the master to launch a slave on it. I've never had a problem with this.¹ When it comes to Windows, however, the slaves seem rather dodgy.
If the slave agent is running as a Windows service, it is mostly fine: when the service crashes, it recovers, because services are set to auto-restart by default. The catch is that if you need to launch OpenGL processes in job steps, you can't run the job on a slave running as a Windows service. A service process simply has no access to the graphics hardware. The only realistic option is to auto-login to a desktop and auto-launch the slave agent from a scheduled task.
To minimize the time spent tweaking, it's easiest to write a batch file that launches the slave agent from its Jenkins JNLP URL, and have the Windows Task Scheduler run it at login.
java -jar slave.jar -jnlpUrl http://yourserver:port/computer/slave-name/slave-agent.jnlp
Done, right? No.
Unfortunately, a slave agent running on a Windows desktop has a habit of crashing for one reason or another. I would regularly come into the office in the morning and find all of the slaves offline. Most of the time, the cause seemed to be a "Channel is already closed" exception; other times the slave process simply terminated with no log message at all. Internet searches turned up plenty of people with these and similar issues, but few solutions, none of which worked for me.
Even telling the Windows Task Scheduler to restart the process if it crashes didn't seem to work, which was puzzling.
The solution is a bit brute force: turn the batch file into an infinite loop to hide the terrible stability of the slave agent.
:loop
java -jar slave.jar -jnlpUrl http://yourserver:port/computer/slave-name/slave-agent.jnlp
goto loop
Now the batch file simply relaunches the slave whenever the previous instance inexplicably exits.
The other side of the socket was also a source of instability. Sometimes the master process decided not to reconnect to the slaves, or jobs froze on the Windows desktop executors. The longer the master's uptime, the more likely I was to see "Channel is already closed" exceptions on the slaves. I spent time looking into this too, and still have not found a proper solution.
In the meantime, a daily master restart seems to have been the final piece of the cluster stability puzzle. I opted for a job on a fixed schedule that triggers a Groovy script to restart Jenkins once current jobs are finished. This requires the Groovy plugin.
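The script itself can be tiny. A sketch of a system Groovy build step (the job name and schedule are whatever suits you; this assumes a system Groovy script running inside the master JVM, not a plain Groovy command):

// System Groovy script, executed inside the Jenkins master JVM.
// safeRestart() puts Jenkins into quiet-down mode, waits for
// running builds to finish, then restarts the master process.
import jenkins.model.Jenkins

Jenkins.instance.safeRestart()

Scheduling this once a day (e.g. in the small hours) keeps the master's uptime short enough that the channel exceptions rarely get a chance to accumulate.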
My conclusion for this article is rather sad and obvious.
If it fails, just restart everything automatically.
OK, no infrastructure-related problems; I still have trouble knowing how to Linux all the things... ↩