On TripleO and the EnablePacemaker bool

Posted on: Fri 01 May 2015

Yes, EnablePacemaker! A little old-fashioned and not very well suited for OpenStack, you say? Let's talk a bit about this then. The news is that we're adding an option to TripleO/Puppet to deploy Pacemaker on the Overcloud nodes. Puppet, you said? Sure, TripleO can now use Puppet to configure the Overcloud nodes, ask Dan Prince :) Actually, Puppet is nowadays the preferred approach and many features are available only in the Puppet scenario... amongst which is the EnablePacemaker bool in the subject.

To go back to Pacemaker: no, this won't be the Active/Passive implementation described in the OpenStack HA docs. Not at all: there are no special resource agents for the OpenStack services and, more importantly, there is no Active/Passive where it isn't needed. Instead, the architecture looks a lot more like the Active/Active scenario described in that doc, except that Pacemaker takes care of the few bits it leaves uncovered.

Whoever tried to set up a RabbitMQ cluster from scratch


####### OLD STUFF #######

One of the topics discussed during the TripleO mid-cycle meetup in RDU was our status in relation to deploying OpenStack in a highly available manner. This had been worked on for some time and recently reached a usable state.

The majority of the complications seem to come from two factors: 1) we need to guarantee the availability of external services too, like the database and the message broker, which aren't exactly designed for a scale-out scenario; 2) despite the OpenStack services being designed around a scale-out concept, while attempting to achieve that in TripleO we spotted a number of weak spots, some of which could be worked around, while others still need changes in the core services. You're encouraged to try what we have available today and help with the rest.

So to try out OpenStack HA with TripleO, you just set OVERCLOUD_CONTROLSCALE to a number >= 3 and continue with devtest as usual. The nodes will be configured appropriately:

export OVERCLOUD_CONTROLSCALE=3
source scripts/devtest_variables.sh
...
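If you script your devtest runs, a small guard can catch a control scale that is too low for HA before anything gets deployed. This is only a sketch: the controlscale_is_ha function is a hypothetical helper, not something shipped with devtest, and treating an unset variable as 1 is an assumption made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of devtest): succeeds only when the
# given value is a non-negative integer >= 3.
controlscale_is_ha() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;  # empty or non-numeric input
    esac
    [ "$1" -ge 3 ]
}

# Assumption for this sketch: an unset variable is treated as 1.
if controlscale_is_ha "${OVERCLOUD_CONTROLSCALE:-1}"; then
    echo "control scale is enough for HA"
else
    echo "control scale is too low for HA, need >= 3"
fi
```

You would run the check right after the export and before sourcing the devtest scripts.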

Don't forget this is only tested on a few distros for now; I'd pick Fedora 20.

On the controller nodes, MariaDB with Galera (for Fedora) is going to provide a reliable SQL service. There is still some work in progress to make sure the Galera cluster can be restarted correctly should all the controllers go down at the same time but, for single node failures, this should be safe to use.
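To see why three controllers are the minimum that survives a single node failure, here is the usual majority-quorum arithmetic as a quick sketch. It assumes equally weighted nodes and no extra arbitrator, and the function names are made up for this example.

```shell
#!/bin/sh
# Majority quorum for n equally weighted nodes: floor(n/2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while a majority is still up.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "$n controllers: quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

With two controllers, losing either one already loses the majority, so going from one to two buys nothing; three is the first size that tolerates one failure.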

RabbitMQ nodes are clustered and balanced (via HAProxy), queues replicated.

As for the OpenStack services, these are load balanced (again, using HAProxy) except where that wouldn't have worked, notably the Neutron L3 agent and the Ceilometer Central agent. Those two are managed by Pacemaker instead, and a single instance of each is expected to be running at all times. Cinder remains uncovered, as volumes would require shared storage for proper HA. A spec has been proposed for this though.

Also, behind the scenes, the Heat template language add-on shipped as merge.py and included in tripleo-heat-templates, which allows, for example, scaling of the resource definitions, is going to be removed and replaced with code living entirely in Heat.

And there is more, so once you've tried it, join us on #tripleo @ freenode for the real fun!

Category: techie – Tags: openstack, tripleo, high availability, pacemaker, fedoraplanet