Towards a highly available (HA) open cloud: an introduction to production OpenStack

by Oleg Gelbukh
August 21, 2012
When OpenStack needs to be deployed for production purposes, be it a small cluster for development environments in a start-up or a large-scale public cloud provider installation, there are several key demands the deployment must meet. The most frequent, and thus most important, requirements are:

- high availability and redundancy
- scalability
- automation of operations
At Mirantis we have developed an approach that allows you to satisfy all three of these demands. This article introduces a series of posts describing our approach and gives a bird’s-eye view of the methods and tools used.

High availability and redundancy

OpenStack services can generally be divided into several groups, based here on the HA approach for a given service.

API services

The first group includes the API servers, namely:

- nova-api
- glance-api
- keystone
- swift-proxy
As HTTP/REST services, they can be made redundant relatively simply by adding a load balancer to the cluster. If the load balancer supports health checking, that suffices to provide basic high availability of the API services. Note that in the 2012.1 (Essex) release of the OpenStack platform, only the Swift API supports a specific “healthcheck” call; the other services require API extensions to support such a call and make an actual check of service health.

Compute services

The second group includes the services that actually manage virtual servers and provide resources for them:

- nova-compute
- nova-network
- nova-volume
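Returning to the API services for a moment: the “healthcheck” call mentioned above is conceptually simple. The sketch below is a minimal, hypothetical WSGI middleware in the spirit of the one shipped with Swift, not the actual OpenStack implementation; it answers load balancer probes directly and passes everything else through.

```python
# Minimal sketch of a "healthcheck" WSGI middleware (illustrative, not the
# actual Swift code): requests to /healthcheck get an immediate 200 OK so a
# load balancer can probe the service; all other requests go to the wrapped
# application.

class HealthCheckMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO") == "/healthcheck":
            body = b"OK"
            start_response("200 OK", [("Content-Type", "text/plain"),
                                      ("Content-Length", str(len(body)))])
            return [body]
        # Not a probe: delegate to the real API application.
        return self.app(environ, start_response)
```

A balancer configured to expect “200 OK” from `/healthcheck` can then take an API server out of rotation as soon as the check fails.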
These services do not require specific redundancy in a production environment. The approach for this group is based on a fundamental paradigm of cloud computing: we have many interchangeable workers, and the loss of a single worker causes only a temporary, local disruption in manageability, not in the service provided by the cluster. Thus, it is enough to monitor these services with an external monitoring system and to implement basic recovery scenarios as event handlers. A simple scenario is to notify the administrator and attempt to restart the failed service.

High availability of the networking service, as provided by the multi-host feature of nova-network, is covered in the official OpenStack documentation. In actual production environments, however, a frequent change to this scheme is offloading the routing of project networks to an external hardware router, leaving only DHCP functions to nova-network.

Scheduler

Redundancy is built into the scheduler: multiple nova-scheduler instances can run concurrently, each picking up jobs from the message queue, so no special failover logic is required.

Queue server

The RabbitMQ queue server is the main communication bus for all Nova services, and it must be reliable in any production environment. Clustering and queue mirroring are supported natively by RabbitMQ, and a load balancer can be used to distribute connections between RabbitMQ servers running in clustered mode. Mirantis has also developed a patch for the Nova RPC library that allows it to fail over to a backup RabbitMQ server if the primary goes down and is unable to accept connections.

State database

The most widely used database for OpenStack deployments is MySQL, and it is the one most frequently used in Mirantis deployments. Currently, a number of solutions provide both high availability and scalability for MySQL. The most common among them is MySQL-MMM (Multi-Master Replication Manager). It is used in more than one Mirantis deployment and works well enough, despite numerous known limitations.
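Both the RabbitMQ and the MySQL HA schemes above ultimately rely on a client trying an alternate server when the current one fails. The following is a minimal, hypothetical sketch of that client-side failover pattern, not Mirantis’s actual Nova RPC patch; the broker host names and the `connect` callable are illustrative (with pika, for example, `connect` could be `lambda h: pika.BlockingConnection(pika.ConnectionParameters(h))`).

```python
# Sketch of client-side failover: try each server in order and return the
# first connection that succeeds. Hypothetical, not the actual Nova RPC patch.

def connect_with_failover(brokers, connect):
    """brokers: list of host names, tried in order (primary first).

    connect: callable(host) that returns a connection object or raises
    on failure (in real code, catch the client library's specific
    connection error rather than Exception).
    """
    last_error = None
    for host in brokers:
        try:
            return connect(host)
        except Exception as exc:
            last_error = exc  # remember the failure and try the next server
    raise RuntimeError("all servers unreachable") from last_error
```

The same loop applies whether the servers are clustered RabbitMQ nodes or replicated MySQL masters behind MMM.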
Even though we have had no serious issues with MMM, we are looking at more state-of-the-art open source solutions for database HA, particularly the Galera clustering engine for MySQL. The Galera cluster features simple and transparent scalability mechanisms and supports high availability through synchronous multi-master replication provided by the WSREP layer. The next post in this blog will cover the solutions Mirantis uses to implement high availability of RabbitMQ and the MySQL database.

Scalability

Once we know how to balance the load or parallelize the workload, we need a mechanism that allows us to add workers to the cluster and expand it to handle a bigger workload, also known as “horizontal” scaling. For most OpenStack platform components, it is simple to add an instance of the server, include it in the load balancer configuration, and have the cluster scaled out. However, this poses two specific problems in real-world production deployments:

- deciding which services a newly added node should run, i.e., defining its role in the cluster; and
- configuring the new node consistently with the rest of the cluster.
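The “add an instance and include it in the load balancer configuration” step above can be sketched as simple config generation. The function below emits a hypothetical HAProxy-style backend section from a list of API nodes; the backend name, addresses, and port are illustrative, and a real deployment would template its actual load balancer configuration instead.

```python
# Sketch of scaling out behind a balancer: regenerate an HAProxy-style
# backend section whenever a node is added. Names and addresses are
# hypothetical.

def haproxy_backend(name, nodes, port, check=True):
    lines = [f"backend {name}", "    balance roundrobin"]
    for i, host in enumerate(nodes, start=1):
        opts = " check" if check else ""  # enable per-server health checking
        lines.append(f"    server {name}-{i} {host}:{port}{opts}")
    return "\n".join(lines)

# haproxy_backend("nova-api", ["10.0.0.11", "10.0.0.12"], 8774) yields:
# backend nova-api
#     balance roundrobin
#     server nova-api-1 10.0.0.11:8774 check
#     server nova-api-2 10.0.0.12:8774 check
```

Scaling out is then just appending a host to the node list and regenerating the section.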
Nodes and roles

While OpenStack services can be distributed among servers with great flexibility, the most common way to deploy the OpenStack platform is to have two types of nodes: a controller node and compute nodes. A typical development OpenStack installation includes a single controller node that runs all services except the compute group, plus multiple compute nodes that run the compute services and host virtual servers. Obviously, this architecture does not work for production purposes. For small production clusters, we tend to recommend that you make cluster nodes as self-sufficient as possible by installing the API servers on the compute nodes, leaving only the database, queue server, and dashboard on the controller node. Controllers should run in a redundant configuration. The following node roles are defined in this architecture:

- controller node: runs the database, queue server, and dashboard, in a redundant configuration; and
- compute node: runs the API servers and compute services, and hosts virtual servers.
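The two-role layout described above can be expressed as a simple role-to-services mapping, which is essentially what a role-aware configuration manager consumes. This is an illustrative sketch following the small-cluster recommendation; the service names are the usual Essex-era components, not a prescribed manifest.

```python
# Sketch of the two-role architecture as a role -> services mapping
# (illustrative; service lists follow the small-cluster recommendation above).

ROLES = {
    "controller": ["mysql", "rabbitmq", "horizon"],     # DB, queue, dashboard
    "compute": ["nova-api", "glance-api", "keystone",   # self-sufficient node:
                "swift-proxy",                          # API servers plus
                "nova-compute", "nova-network"],        # compute services
}

def services_for(role):
    """Return the list of services a node with the given role should run."""
    return ROLES[role]
```

A configuration manager with such a mapping can bring any freshly provisioned node to its role without per-node scripting.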
Configuration management

The architecture proposed above requires a sequence of steps to be performed on every physical server in the cluster. Some of the steps are quite complex, and some involve more than one node; for example, load balancer configuration or multi-master replication setup. The complexity of the current OpenStack deployment process makes scripting of these tasks essential for success, which has already given birth to more than one project, including the well-known Devstack and Crowbar.

Simple scripting of the deployment process, however, is not enough to successfully install OpenStack in production environments, nor to ensure the scalability of the cluster. You would also need to develop new scripts whenever you wanted to change something in your architecture or upgrade component versions. There are tools designed for these tasks: configuration managers. The best known among them are Puppet and Chef, and there are products based on them (e.g., the aforementioned Crowbar, which has Chef under the hood). We have used both Puppet and Chef to deploy OpenStack in a variety of projects; naturally, each has its own limitations. From our own experience we know that the best results are achieved when the configuration manager is supported by a centralized orchestration engine for seamless deployment. By combining it with a bare-metal provisioning application that configures the physical servers at the hardware level, and a test suite responsible for validating the deployment, we get an end-to-end approach that can quickly install the OpenStack platform in a wide range of hardware configurations and logical architectures.

Automation of operations

Using an orchestration engine with a configuration management system that recognizes node roles allows us to automate the deployment process to a very high degree. We can automate the scaling process as well. All of this reduces the costs of OpenStack operation and support.
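The end-to-end pipeline described above (bare-metal provisioning, role-based configuration, deployment validation) can be sketched as an ordered sequence of per-node steps. The step functions here are hypothetical placeholders standing in for the real provisioning tool, configuration manager, and test suite.

```python
# Sketch of an orchestration pipeline: provision, configure, and validate
# each node according to its role. Step callables are hypothetical stand-ins
# for the real tools (e.g. PXE installer, Puppet/Chef run, test suite).

def deploy(nodes, provision, configure, validate):
    """nodes: mapping of hostname -> role.

    provision(host):        hardware-level setup of the server
    configure(host, role):  apply the role's manifests/recipes
    validate(host):         run deployment checks, return a result
    """
    report = {}
    for host, role in nodes.items():
        provision(host)                 # bare-metal provisioning first
        configure(host, role)           # then role-based configuration
        report[host] = validate(host)   # finally, validate the node
    return report
```

Because the pipeline is driven by the node-to-role mapping, scaling out reduces to adding an entry and re-running it for the new host.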
Most modern orchestrators have APIs, which allow you to create CLI or web-based user interfaces through which operators can perform administrative tasks across the whole cluster or specific parts of it. We’ll talk more about this in blog posts ahead.
Comments

Can you please detail what you meant by “offloading the routing of project networks to the external hardware router”? How exactly is this setup working? Thanks in advance.
September 4, 2012 20:26