
Friday, 27 March 2015

High-availability options for MySQL

The technologies for building highly-available (HA) MySQL solutions are in constant evolution and cover very different needs and use cases. To help people choose the best HA solution for their needs, Jay Janssen and I decided to publish, on a regular basis (hopefully, this is the first), an update on the most common technologies and their state, with a focus on what type of workloads suit them best. We restricted ourselves to open source solutions that provide automatic failover. Of course, don’t simply compare the number of positive and negative items; they don’t all carry the same weight. Should you pick any of these technologies, heavy testing is mandatory: HA never extends beyond the scenarios that have been tested.

Percona XtraDB Cluster (PXC)

Percona XtraDB Cluster (PXC) is a version of Percona Server implementing the Galera replication protocol from Codership.
Positive points:
  • Almost synchronous replication, very small lag if any
  • Automatic failover
  • At its best with small transactions
  • All nodes are writable
  • Very small read-after-write lag, usually no need to care about it
  • Scales reads very well and, to some extent, writes
  • New nodes are provisioned automatically through State Snapshot Transfer (SST)
  • Multi-threaded apply, greater write capacity than regular replication
  • Can do geographical disaster recovery (Geo DR)
  • More resilient to unresponsive nodes (swapping)
  • Can resolve split-brain situations by itself
Negative points:
  • Still under development, some rough edges
  • Large transactions like multi-statement transactions or large write operations cause issues and are usually not a good fit
  • For quorum reasons, 3 nodes are needed, but one can be a lightweight arbitrator
  • SST can be heavy over a WAN
  • Commits are affected by the network latency, which especially impacts Geo DR
  • To achieve HA, a load balancer, like haproxy, is needed (a health-check sketch follows this list)
  • Failover time is determined by the load balancer check frequency
  • Performance is affected by the weakest/busiest node
  • Foreign keys are potential issues
  • MyISAM should be avoided
  • Can be mixed with regular async replication as master or slave, but slaves are not easy to reconfigure after an SST on their master
  • Requires careful setup of the host, swapping can lead to node expulsion from the cluster
  • No manual failover mode
  • Debugging some Galera protocol issues isn’t trivial
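Since PXC nodes are reached through a load balancer, the health check usually verifies that a node is fully synced before it receives traffic. Below is a minimal sketch of such a check in Python, assuming the pymysql driver and the standard Galera status variables (wsrep_cluster_status, wsrep_local_state_comment); the host and credentials are placeholders.

# Minimal PXC node health-check sketch (assumes the pymysql driver).
# A load balancer such as haproxy would normally call something like this
# through an HTTP check service; this is only an outline, not a complete tool.
import sys
import pymysql

def node_is_healthy(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            # The node must be part of the primary component...
            cur.execute("SHOW STATUS LIKE 'wsrep_cluster_status'")
            cluster_status = cur.fetchone()[1]
            # ...and fully synced before it should receive queries.
            cur.execute("SHOW STATUS LIKE 'wsrep_local_state_comment'")
            local_state = cur.fetchone()[1]
        return cluster_status == 'Primary' and local_state == 'Synced'
    finally:
        conn.close()

if __name__ == '__main__':
    healthy = node_is_healthy('127.0.0.1', 'clustercheck', 'secret')
    sys.exit(0 if healthy else 1)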

Percona replication manager (PRM)

Percona replication manager (PRM) uses the Linux-HA Pacemaker resource manager to manage MySQL and replication, and provides high availability. Information about PRM can be found here; the official page on the Percona web site is in the making.
Positive points:
  • Nothing specific regarding the workload
  • Unlimited number of slaves
  • Slaves can have different roles
  • Typically VIP-based access, typically 1 writer VIP and many reader VIPs
  • Also works without VIPs (see the fake_mysql_novip agent)
  • Detects if a slave lags too much and removes its reader VIPs
  • All nodes are monitored
  • The best slave is picked as the new master after failover
  • Geographical disaster recovery is possible with the lightweight booth protocol
  • Can be operated in manual failover mode
  • Graceful failover is quick, under 2s in normal conditions
  • Ungraceful failover under 30s
  • Distributed operation with Pacemaker, no single point of failure
  • Built-in Pacemaker logic, STONITH, etc. Very rich and flexible.
Negative points:
  • Still under development, some rough edges
  • Transactions may be lost if the master crashes (async replication)
  • For quorum reasons, 3 nodes are needed, but one can be a lightweight arbitrator
  • Only one node is writable
  • Read after write may not be consistent (replication lag)
  • Only scales reads
  • Requires careful setup of the host, swapping can lead to node expulsion from the cluster
  • Data inconsistency can happen if the master crashes (fix coming)
  • Pacemaker is complex, logs are difficult to read and understand

MySQL master HA (MHA)

Like PRM above, MySQL master HA (MHA) provides high availability through replication. The approach is different: instead of relying on an HA framework like Pacemaker, it uses Perl scripts. Information about MHA can be found here.
Positive points:
  • Mature
  • Nothing specific regarding the workload
  • No latency effects on writes
  • Can have many slaves, and slaves can have different roles
  • Very good binlog/relay-log handling
  • Works pretty hard to minimise data loss
  • Can be operated in manual failover mode
  • Graceful failover is quick, under 5s in normal conditions
  • If the master crashes, slaves will be consistent
  • The logic is fairly easy to understand
Negative points:
  • Transactions may be lost if the master crashes (async replication)
  • Only one node is writable
  • Read after write may not be consistent (replication lag)
  • Only scales reads
  • Monitoring and logic are centralized, a single point of failure, and a network partition can cause a split-brain
  • Custom fencing devices, custom VIP scripts, no reuse of other projects' tools
  • Most of the deployments are using manual failover (at least at Percona)
  • Requires privileged ssh access to read relay logs, which can be a security concern
  • No monitoring of the slaves to invalidate one that lags too much or whose replication is broken; this needs to be done by an external tool like HAProxy (a lag-check sketch follows this list)
  • Requires careful setup of the host, swapping can lead to node expulsion from the cluster
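Since MHA leaves slave-lag monitoring to an external tool, the check is typically a small script that the load balancer calls. The sketch below is one illustrative way to do it in Python, assuming the pymysql driver; the threshold, host and credentials are placeholders.

# Sketch: invalidate a slave that lags too much or whose replication is broken.
# Assumes the pymysql driver; an external tool like HAProxy would call this,
# since MHA itself does not monitor slave lag.
import sys
import pymysql
import pymysql.cursors

MAX_LAG_SECONDS = 30  # illustrative threshold, tune per environment

def slave_is_usable(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute('SHOW SLAVE STATUS')
            status = cur.fetchone()
        if not status:
            return False  # not configured as a slave at all
        if status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
            return False  # replication is broken
        lag = status['Seconds_Behind_Master']
        return lag is not None and lag <= MAX_LAG_SECONDS
    finally:
        conn.close()

if __name__ == '__main__':
    sys.exit(0 if slave_is_usable('127.0.0.1', 'monitor', 'secret') else 1)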

NDB Cluster

NDB cluster is the most high-end form of high-availability configuration for MySQL. It is a complete shared nothing architecture where the storage engine is distributed over multiple servers (data nodes). Probably the best starting point with NDB is the official document, here.
Positive points:
  • Mature
  • Synchronous replication
  • Very good at small transactions
  • Very good at high concurrency (many client threads)
  • Huge transaction capacity, more than 1M trx/s is not uncommon
  • Failover can be ~1s
  • No single point of failure
  • Geographical disaster recovery capacity built in
  • Strong at async replication, applying in batches gives multi-threaded apply at the data node level
  • Can scale reads and writes, the framework implements sharding by hashes
Negative points:
  • Not a drop-in replacement for InnoDB, you need to tune the schema and the queries
  • Not a general-purpose database, some loads like reporting are just bad
  • Only the READ COMMITTED isolation level is available
  • Hardware heavy, needs 4 servers minimum for full HA
  • Memory (RAM) hungry, even with disk-based tables
  • Complex to operate, lots of parameters to adjust
  • Needs a load balancer for failover
  • Very new foreign key support, field reports on it are scarce

Shared storage/DRBD

Achieving high availability using a shared storage medium is an old and well-known method. It is used by nearly all the major databases. The shared storage can be a DAS connected to two servers, a LUN on a SAN accessible from two servers, or a DRBD partition replicated synchronously over the network. DRBD is by far the most common shared storage device used in the MySQL world.
Positive points:
  • Mature
  • Synchronous replication (DRBD)
  • Automatic failover is easy to implement
  • VIP-based access
Negative points:
  • Write capacity is impacted by network latency for DRBD
  • SANs are expensive
  • Only for InnoDB
  • Standby node, a big server doing nothing
  • Needs a warmup period after failover to be fully operational
  • Disk corruption can spread

Saturday, 10 January 2015

[How-to] Creating Highly Available Message Queues using RabbitMQ


High availability has become a key requirement at every layer of a modern technology stack, and message broker software has become a significant component of most stacks. In this article, we will discuss how to create highly available message queues using RabbitMQ.
RabbitMQ is an open source message broker software (also called message-oriented middleware) that implements the Advanced Message Queuing Protocol (AMQP). RabbitMQ server is written in the Erlang programming language.

The RabbitMQ Cluster

Clustering connects multiple nodes to form a single logical broker. Virtual hosts, exchanges, users and permissions are mirrored across all nodes in a cluster. A client connecting to any node can see all the queues in a cluster.
RabbitMQ tolerates the failure of individual nodes. Nodes can be stopped and started in a cluster.
Clustering enables high availability of queues and increases the throughput.
A node can be a disc node or a RAM node. RAM nodes keep the message state in memory, with the exception of queue contents, which can reside on disk if the queue is persistent or too big to fit into memory.
RAM nodes perform better than disc nodes because they don’t have to write to disk as much. But it is always recommended to have disc nodes for persistent queues.
We’ll discuss how to create and convert RAM and Disk nodes later in the post.

Prerequisites:

  1. Network connection between nodes must be reliable.
  2. All nodes must run the same version of Erlang and RabbitMQ.
  3. All TCP ports should be open between nodes.
We have used CentOS for the demo. Installation steps may vary for Ubuntu and OpenSuse. In this demo, we have launched two m1.small servers in AWS for master and slave nodes.

1. Install Rabbitmq

Install RabbitMQ on the master and slave nodes.
$ yum install rabbitmq-server.noarch

2. Start Rabbitmq

/etc/init.d/rabbitmq-server start

3. Create the Cluster

Stop RabbitMQ on the master and slave nodes. Ensure the service is stopped properly.
/etc/init.d/rabbitmq-server stop
Copy the Erlang cookie file below from the master to all nodes. This cookie file needs to be the same across all nodes.
$ sudo cat /var/lib/rabbitmq/.erlang.cookie
Make sure you start all nodes after copying the cookie file from the master.
Start RabbitMQ in master and all nodes.
$ /etc/init.d/rabbitmq-server start
Then run the following commands on each node except the master, to join it to the cluster:
$ rabbitmqctl stop_app
$ rabbitmqctl reset
$ rabbitmqctl join_cluster rabbit@master
$ rabbitmqctl start_app
Replace master with the hostname of the master node. You can add as many slave nodes as needed to the cluster by repeating these steps on each of them.
Check the cluster status from any node in the cluster:
$ rabbitmqctl cluster_status
By default, the cluster stores messages on the disk. You can also choose to store Queues in Memory.
You can have a node as a RAM node while attaching it to the cluster:
$ rabbitmqctl stop_app
$ rabbitmqctl join_cluster --ram rabbit@slave1
It is recommended to have at least one disc node in the cluster so that messages are stored on persistent disk and message loss can be avoided in case of a disaster.
The performance of RAM nodes is a little better than that of disc nodes and gives you better throughput.

4. Set the HA Policy

The following command will sync all the queues across all nodes:
$ rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
For example, a policy where queues whose names begin with "ha." are mirrored to all nodes in the cluster:
$ rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all"}'
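To confirm that a policy has actually been applied to your queues, you can inspect them through the RabbitMQ management plugin's HTTP API, if the plugin is enabled. The sketch below is an illustration only: it assumes the management plugin listens on port 15672 with the default guest/guest credentials and that the requests library is installed; field names such as slave_nodes correspond to mirrored queues of that era and may differ in your version.

# Sketch: list queues and the nodes mirroring them via the management HTTP API.
# Assumes the rabbitmq_management plugin is enabled on port 15672; adjust the
# host and credentials to your setup.
import requests

def list_mirrored_queues(host='localhost', user='guest', password='guest'):
    resp = requests.get('http://%s:15672/api/queues' % host, auth=(user, password))
    resp.raise_for_status()
    for queue in resp.json():
        # 'node' is the queue master; 'slave_nodes' lists the mirrors (if any).
        print("%s master=%s mirrors=%s" % (queue['name'],
                                           queue.get('node'),
                                           queue.get('slave_nodes', [])))

if __name__ == '__main__':
    list_mirrored_queues()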

5. Test the Queue mirror

We are going to run a sample Python program to create a sample queue. You need the packages below installed on the machine from which you want to run the program.
Install python-pip
$ yum install python-pip.noarch
Install Pika
$ sudo pip install pika==0.9.8
Create a send.py file and copy in the content below. You need to replace “localhost” with the name/IP of a master/slave node.
#!/usr/bin/env python
import pika

# Connect to the RabbitMQ node; replace 'localhost' with the name/IP of a cluster node
connection = pika.BlockingConnection(pika.ConnectionParameters(
              'localhost'))
channel = connection.channel()

# Declare the queue and publish a test message to it
channel.queue_declare(queue='hello')
channel.basic_publish(exchange='', routing_key='hello', body='Hello World!')
print " [x] Sent 'Hello World!'"
connection.close()
Run the python script using the command:
$ python send.py
This will create a Queue (hello) with a message on the RabbitMQ cluster.
Check if the message is available across all nodes.
$ sudo rabbitmqctl list_queues
Now, create a file named receive.py and copy the content below.
#!/usr/bin/env python
import pika
connection = pika.BlockingConnection(pika.ConnectionParameters(
              'localhost'))
channel = connection.channel()
channel.queue_declare(queue='hello')
def callback(ch, method, properties, body):
   print " [x] Received %r" % (body,)
channel.basic_consume(callback,
                     queue='hello',
                     no_ack=True)
print ' [*] Waiting for messages. To exit press CTRL+C'
channel.start_consuming()
Run the script and check the Queue in either the slave or master:
$ sudo rabbitmqctl list_queues

6. Setup Load Balancer

Now, we have multiple MQs running in a cluster. All are in sync and have the same queues.
How do we point our application to the cluster? We can’t point our application to a single node: if that node fails, we need a mechanism to automatically fail over to the other nodes in the cluster. There are multiple ways to achieve this, but we prefer to use a load balancer.
There are two advantages in using the load balancer:
  1. High availability
  2. Better network throughput because the load is evenly distributed across nodes.
Create a load balancer in front of the cluster and map the backend MQ instances. You can choose HAProxy, Apache, Nginx, or any hardware load balancer you use in your organization.
If the servers are running in AWS inside a VPC, then choose an internal load balancer. Update the application to point to the load balancer endpoint.
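If you would rather not run a dedicated load balancer, the client itself can try each cluster node in turn. The snippet below is a minimal sketch using pika, as in the earlier examples; the host names mq-node1 and mq-node2 are placeholders for your own nodes (or for the load balancer endpoint once it is in place).

# Sketch: fall back to another cluster node if the first one is unreachable.
# Node names are placeholders; replace them with your RabbitMQ hosts.
import pika
import pika.exceptions

NODES = ['mq-node1', 'mq-node2']

def connect(nodes):
    for host in nodes:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host))
        except pika.exceptions.AMQPConnectionError:
            continue  # this node is down, try the next one
    raise RuntimeError('No RabbitMQ node reachable: %s' % nodes)

connection = connect(NODES)
channel = connection.channel()
channel.queue_declare(queue='hello')
channel.basic_publish(exchange='', routing_key='hello', body='Hello HA!')
connection.close()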

Thursday, 20 November 2014

Setting up Entity Framework for Production Use

Abstract: Over the last couple of years, Entity Framework has steadily become the de facto data access story from Microsoft. With Entity Framework Code First, it became even easier to get started with an application. While EF lets you get off the ground and running fast, its deployment story (both first time and subsequent upgrades) has been a little sketchy. With the 4.2 release and EF Migrations, the upgrade story has been simplified a little, but the first-time deployment story still has a couple of points to keep in mind. Today we will explore these points and see how we can deploy a database that is accessed by EF (Code First). We will keep Migrations for another day.

Over the last couple of years, Entity Framework (EF) has matured steadily to become the de facto Data Access story from Microsoft. We all love the speed with which we can get off the ground with our prototypes. However, deploying the database to a typical shared hosting solution has a few challenges. Today we will take a simple application and see what it takes to deploy the database on AppHarbor. Use of AppHarbor is a matter of convenience. We could use any hosting provider; most of them have similar restrictions with respect to database creation and deletion permissions.

The Sample Application

Let’s say we have a very simple application with a single table that maps to an entity called BlogPost. For additional tables, we will use the ASP.NET Internet Template that includes the Authentication and Authorization entities. The Default Providers will create the required tables for these as well.
Before the first run, let’s add the BlogPost entity and add a default set of views using the default scaffold tooling.

The BlogPost Entity and Scaffolding Setup

- Add a new Class – BlogPost, in the Models folder. It has three properties, Id (of type int), Title (of type string) and PostDetails (of type string).
blog-post-entity
- We save and build our project at this point
- Next we right click on the Controller folder in Solution Explorer and select ‘Add New Controller’ to bring up the New Controller Wizard.
- We setup the Controller as follows:
  • Name: BlogPostController
  • Scaffolding Options: MVC Controller with read/write actions and views, using Entity Framework
  • Model Class: Select the entity created above e.g. BlogPost
  • Data Context Class: Press down arrow to select ‘Add New’ option. This will show another popup where we select the default Context Class name provided. As shown below it should be something like [Solution Name].Models.[ProjectName]Context
blog-post-controller
- Click on ‘Add’ to continue.
- The Scaffolding infrastructure will try to connect to the default .\SQLEXPRESS database. If it fails to find the DB, it will give an error as follows
possible-error
- In case of an error, click OK to dismiss the dialog and open the Web.config file. Update the server name in the ‘Data Source’ to point to the correct SQL Server (below we see the SQL Express instance is called SQLEXPRESSR2 instead of the traditional SQLEXPRESS). Change the ‘Initial Catalog’ name to something like EfDeploySample, instead of the default string.
connection-string-in-web-config
  • Now is a good time to ensure there is only one connection string being used. If you use the ‘Internet’ template, a connection string with the name ‘DefaultConnection’ is added, and the ‘Authentication’/’Authorization’/’Personalization’ providers use this ‘DefaultConnection’ connection string. Delete that connection from the <connectionStrings> section and replace references to it with the above connection string wherever required.
web-config-change-default-connection-string
- Repeat the Add Controller step above. If successful, the following will be added to the solution
  • A new folder BlogPost will be added in the ‘Views’ folder, with the default views for Create, Delete, Details, Edit and Index.
  • The BlogPostController is added to the Controllers folder
  • The EfDeploySampleContext is added to the Models folder
updated-solution-structure
- To make use of the Authentication framework that came along, we will add the [Authorize] attribute on top of the BlogPostController, so that actions on it cannot be invoked unless a user is Logged in. With that, our application is set to go.
- Run the Application. Use the ‘Register’ link on the top to register a username. Then navigate to /BlogPost URL to get to the Index page of the Blog Posts. Add a sample Post. With our application up and running, it’s time to attend to the Database Migration piece. If we look at our database in Management Studio, it looks as follows:

database-structure-in-management-studio
- Pretty neat, considering the fact that we didn’t write a single line of SQL to generate those tables.

EF Database Initialization Strategies

EntityFramework comes with a set of default initialization strategies. Without any configuration, it is set up as ‘Create if schema does not exist’. You can explicitly specify it in the DB Context class by overriding the OnModelCreating method as follows:
on-model-creating
In this strategy, EF creates the Database if it doesn’t find it. Once created, it will not update the schema unless you drop and recreate again (Yes, Migrations will update without a DB drop, but for now we are not looking at Migrations).
To test the default strategy, let’s drop our database and run the application. As we try to navigate to the BlogPost index, we will get redirected to the Login page. Once we register, if we go back and check, the new database will have been created.
Next, log in and navigate to BlogPost. Once we are at the Index page, if we refresh the database schema we will see that the BlogPost table has been re-created.

Strategy to keep up with rapid changes during prototyping

So far so good; we are fine with our automatic schema setup. However, in the prototyping stage we will be adding, modifying and deleting schema elements. In such a case, CreateDatabaseIfNotExists becomes a hindrance, because if you don’t drop the database manually, it will throw an InvalidOperationException complaining that the backend has changed.
exception-on-model-change
Thus, to keep up with rapid changes in the schema during development, we change the strategy to DropCreateDatabaseIfModelChanges
drop-and-create-strategy
Now when EF detects a change in Model, it will drop the DB and create the schema again. Remember NO data will be saved.

Connection String and Schema Names

If we look at the constructor of the MVC Tooling generated EfDeploySampleContext, we will notice a string parameter that corresponds to the name of the connection string in the web.config.
ef-deploy-sample-context-constructor
As a part of a clean setup, it is recommended that the connection string be passed into the DB context via the constructor.
The next important thing is to pass the schema name. We often take access to ‘dbo’ for granted, but some hosting providers do not provide us with dbo access. If we come across such a scenario, we have to do additional mapping between the entities and the tables.
To handle these two configurations, we update our EfDeploySampleContext constructor by passing the connectionName and schemaName parameters. We then pass the connection name to the base DBContext and use the schema name to map our entities to the tables in DB explicitly.
updated-ef-deploy-sample-context
As highlighted above, the default constructor has been modified to take in the two parameters and a guard clause ensures that if the parameters are not present, the instantiation throws an exception.
To match the above changes, we have to update our controllers so that we can pass these new values. To keep the schema name configurable, we will add a key in the appSettings section of our Web.config and pass that value into the DbContext’s constructor.
Our DbContext initialization code in the controllers would then look as follows
changes-to-db-context-instantiation

Going Live

With all the above changes, we are almost set to go live. Only two things remain:
1. Set the DropCreateDatabaseIfModelChanges strategy back to null. This way Entity Framework will not try to make any changes to the DB
set-initializer-strategy-to-null
2. Next use Management Studio Tasks to extract the DB Schema as a SQL script.
a. Launch the Generate Scripts Wizard
launch-generate-script-wizard
b. Choose Objects and Select all Tables/objects. Do not select the ‘Script entire database and all database objects’ option. It tries to create a new database. This is not what we want. Most hosting providers will not give us ‘create’ or ‘drop’ database rights.
generate-script-selected-tables
c. Save the Script
generate-script-save-files

Deploying to AppHarbor

Details of how to deploy code on AppHarbor have been explained in this article. We will look at the SQL Server setup more closely here.
- In AppHarbor, you have to create an application and select SQL Server as an AddOn. Most hosting providers will give you a connection string or server-name, user-name and password combination. AppHarbor provides you with the same.
  • Once the Add-On is installed you have access to one Database that you can access using the ‘Go To SQL Server’ link.
appharbor-go-to-sql-server
- The configuration page looks as follows
appharbor-database-config
- In the ‘ConnectionString alias’ setting, edit the alias and provide the same name as the connection string in our web.config i.e. EfDeploySampleContext
  • This actually creates a web.config transform that replaces the Web.config’s connection string setting with this database’s connection string, when the code is deployed on AppHarbor
  • If you are not deploying to AppHarbor and do not have a transform defined, make sure you update your connection string manually before deploy.
- Connect to the server using the above connection information using the SQL Server Management Studio.
- Open the SQL script (remove the ‘USE [dbname]’ statement if required) and execute it on the server. That’s it. Done.
- Deploy the code, and if AppHarbor does not throw an error, your application is now live. You are good till the next upgrade cycle comes along. We will look at EF Migrations in an upcoming article to handle that scenario.

Conclusion

To conclude, we saw how, with a little bit of self-enforced process, we can easily migrate our EF prototypes to a production environment. However, the real challenges in a production deployment are the updates, revisions and further schema changes. We will look into EF Migrations in the future, which will help us resolve that scenario as well.
The code repository is at https://github.com/dotnetcurry/EFDeploySample and you can also download the Zip file.

Thursday, 10 July 2014

Cassandra Monitoring


Guide to Cassandra Thread Pools


This guide provides a description of the different thread pools and how to monitor them. Includes what to alert on, common issues and solutions.

Concepts

Cassandra is based on a Staged Event-Driven Architecture (SEDA). This separates different tasks into stages that are connected by a messaging service. Each type of task is grouped into a stage with a queue and thread pool (a ScheduledThreadPoolExecutor, more specifically, for the Java folks). Some stages skip the messaging service and queue tasks immediately on a different stage if it exists on the same node. Each of these queues can back up if execution at a stage is being overrun, which is a common indication of an issue or performance bottleneck. To demonstrate, take for example a read request:
blog

Note: For the sake of this example the data exists on a different node than the one receiving the initial request, but in some scenarios this could all be on a single node. There are also multiple paths a read may take. This is just for demonstration purposes.
  1. NODE1> The request is queued from node1 on the ReadStage of node2 with the task holding a reference to a callback
  2. NODE2> The ReadStage will pull the task off the queue once one of its 32 threads become available
  3. NODE2> The task will execute and compute the resulting Row, which will be sent off to the RequestResponse stage of node1
  4. NODE1> One of the four threads in the RequestResponseStage will process the task, and pass the resulting Row to the callback referenced from step 1 which will return to the client.
  5. NODE1> It will also possibly kick off an async read repair task on node1 to be processed in the ReadRepairStage once one of its four threads become available
The ReadRepairStage, and how long it takes to process its work, is not in the feedback loop from the client.  This means that if reads occur at a higher rate than read repairs, the queue will grow until it is full without the client application ever noticing.  At the point of the queue being full (the limit varies between stages, more below) the act of enqueueing a task will block.
In some stages the processing of these tasks is not critical.  These tasks are marked as “DROPPABLE”, meaning the first thing the stage does when executing one is to check whether it has exceeded a timeout measured from when it was created.  If the timeout has passed, the task is thrown away instead of processed.

Accessing metrics

TPStats

For manual debugging this is the output given by
nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0         113702         0                 0
RequestResponseStage              0         0              0         0                 0
MutationStage                     0         0         164503         0                 0
ReadRepairStage                   0         0              0         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
MemoryMeter                       0         0             35         0                 0
MemtablePostFlusher               0         0           1427         0                 0
FlushWriter                       0         0             44         0                 0
MiscStage                         0         0              0         0                 0
PendingRangeCalculator            0         0              1         0                 0
commitlog_archiver                0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     0         0              0         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
PAGED_RANGE                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0
COUNTER_MUTATION             0
The descriptions of the pool values (Active, Pending, etc.) are given in the table below. The second table lists dropped tasks by message type; for more information on that, read below.

JMX

These are all available via JMX as well in:
org.apache.cassandra.request:type=*
and
org.apache.cassandra.internal:type=*
with the attributes (relative to tpstats) being:
MBean Attribute | tpstats name | Description
ActiveCount | Active | Number of tasks pulled off the queue with a Thread currently processing
PendingTasks | Pending | Number of tasks in queue waiting for a thread
CompletedTasks | Completed | Number of tasks completed
CurrentlyBlockedTasks | Blocked | When a pool reaches its max thread count (configurable or set per stage, more below) it will begin queuing until the max size is reached.  When this is reached it will block until there is room in the queue.
TotalBlockedTasks | All time blocked | Total number of tasks that have been blocked

Metrics Reporters

Cassandra has pluggable metrics reporting capabilities; you can read more about them here.

Droppable Messages

The second table in nodetool tpstats displays a list of messages that were DROPPABLE. These ran after a given timeout set per message type, so they were thrown away. In JMX these are accessible via org.apache.cassandra.net:MessagingService or org.apache.cassandra.metrics:DroppedMessage. The following table provides a little information on each type of message.
Message Type | Stage | Notes
BINARY | n/a | This is deprecated and no longer has any use
_TRACE | n/a (special) | Used for recording traces (nodetool settraceprobability). Has a special executor (1 thread, 1000 queue depth) that throws away messages on insertion instead of within the execute
MUTATION | MutationStage | If a write message is processed after its timeout (write_request_timeout_in_ms) it either sent a failure to the client or it met its requested consistency level and will rely on hinted handoff and read repairs to do the mutation if it succeeded
COUNTER_MUTATION | MutationStage | If a write message is processed after its timeout (write_request_timeout_in_ms) it either sent a failure to the client or it met its requested consistency level and will rely on hinted handoff and read repairs to do the mutation if it succeeded
READ_REPAIR | MutationStage | Times out after write_request_timeout_in_ms
READ | ReadStage | Times out after read_request_timeout_in_ms. No point in servicing reads after that point since an error would already have been returned to the client
RANGE_SLICE | ReadStage | Times out after range_request_timeout_in_ms
PAGED_RANGE | ReadStage | Times out after request_timeout_in_ms
REQUEST_RESPONSE | RequestResponseStage | Times out after request_timeout_in_ms. The response was completed and sent back, but not before the timeout

Stage Details

Below is a list of the majority of exposed stages as of 2.0.7, but they tend to be a little volatile, so don’t be surprised if some do not exist. The heading of each section correlates with the name of the stage from nodetool tpstats. The notes on the stages mostly refer to 2.x, with some references to things in 1.2; 1.1 and previous versions are not considered at all in the making of this document.
Alerts, given as boolean expressions, are intended as a starting point and should be tailored to specific use cases and environments in case of many false positives.
When a value is listed as “number of processors”, it is retrieved from the system property “cassandra.available_processors”, which can be overridden by adding -Dcassandra.available_processors to JVM_OPTS; it falls back to the default of Runtime.availableProcessors().
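As a starting point for the alert expressions above, the Pending and Blocked columns can be scraped straight from nodetool tpstats. The following is a rough sketch rather than a production monitoring tool: it shells out to nodetool, assumes the column layout shown earlier in this post, and uses the generic pending > 15 / blocked > 0 thresholds, which should be tuned per environment.

# Rough sketch: flag thread pools whose Pending or Blocked counts look unhealthy,
# using the generic thresholds from this post (tune per cluster and per stage).
import subprocess

PENDING_THRESHOLD = 15

def check_tpstats():
    output = subprocess.check_output(['nodetool', 'tpstats']).decode()
    alerts = []
    for line in output.splitlines():
        fields = line.split()
        # Pool rows look like: Name Active Pending Completed Blocked All-time-blocked
        if len(fields) == 6 and fields[1].isdigit():
            name = fields[0]
            pending, blocked = int(fields[2]), int(fields[4])
            if pending > PENDING_THRESHOLD or blocked > 0:
                alerts.append('%s: pending=%d blocked=%d' % (name, pending, blocked))
    return alerts

if __name__ == '__main__':
    for alert in check_tpstats():
        print(alert)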


ReadStage

Performing a local read. Also includes deserializing data from the row cache.  If there are pending tasks here, this can cause increased read latency.  This can spike due to disk problems, poor tuning, or overloading your cluster.  In many cases (not disk failure) this is resolved by adding nodes or tuning the system.
JMX beans:org.apache.cassandra.request.ReadStage
org.apache.cassandra.metrics.ThreadPools.request.ReadStage
Number of threads:concurrent_reads (default: 32)
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

RequestResponseStage

When a response to a request is received this is the stage used to execute any callbacks that were created with the original request
JMX beans:org.apache.cassandra.request.RequestResponseStage
org.apache.cassandra.metrics.ThreadPools.request.RequestResponseStage
Number of threads:number of processors
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

MutationStage

Performing a local mutation, including:
  • insert/updates
  • Schema merges
  • commit log replays
  • hints in progress
Similar to ReadStage, an increase in pending tasks here can be caused by disk issues, overloading a system, or poor tuning. If messages are backed up in this stage, you can add nodes, tune hardware and configuration, or update the data model and use case.
JMX beans:org.apache.cassandra.request.MutationStage
org.apache.cassandra.metrics.ThreadPools.request.MutationStage
Number of threads:concurrent_writes (default: 32)
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

ReadRepairStage

Performing read repairs. The chance of them occurring is configurable per column family with read_repair_chance. They are more likely to back up if using CL.ONE for reads (and, to a lesser extent, other non-CL.ALL consistency levels) and using multiple data centers. The repair is kicked off asynchronously outside of the query's feedback loop, as demonstrated in the diagram above. Note that this is not very likely to be a problem since it does not happen on all queries and is fast provided there is good connectivity between replicas. The repair being droppable also means that after write_request_timeout_in_ms it will be thrown away, which further mitigates this. If pending grows, attempt to lower the rate for high-read CFs:
ALTER TABLE column_family WITH read_repair_chance = 0.01;
JMX beans:org.apache.cassandra.request.ReadRepairStage
org.apache.cassandra.metrics.ThreadPools.request.ReadRepairStage
Number of threads:number of processors
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

ReplicateOnWriteStage

This stage became CounterMutation in 2.1, and counters changed dramatically after 2.0, so post-2.0 you should consider it obsolete. It performs counter writes on non-coordinator nodes and replicates after a local write. This includes a read, so it can be pretty expensive. It will back up if the rate of writes exceeds the rate at which the mutations can occur, which is particularly possible with CL.ONE and high counter-increment workloads.
JMX beans:org.apache.cassandra.request.ReplicateOnWriteStage
org.apache.cassandra.metrics.ThreadPools.request.ReplicateOnWriteStage
Number of threads:concurrent_replicates (default: 32)
Max pending tasks:1024 x number of processors
Alerts:pending > 15 || blocked > 0

GossipStage

Post 2.0.3 there should no longer be issues with pending tasks. Instead, monitor the logs for the message:
Gossip stage has {} pending tasks; skipping status check (no nodes will be marked down)
Before that change, in particular with older versions of 1.2, a lot of nodes (100+) combined with vnodes could cause a lot of CPU-intensive work that made the stage get behind.
It has also been known to be caused by out-of-sync schemas. Check that NTP is working correctly and attempt nodetool resetlocalschema or, more drastically, deleting the system column family folder.
JMX beans:org.apache.cassandra.internal.GossipStage
org.apache.cassandra.metrics.ThreadPools.internal.GossipStage
Number of threads:1
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

AntiEntropyStage

Repairing consistency. Handles repair messages such as Merkle tree transfers (from validation compaction) and streaming.
JMX beans:org.apache.cassandra.internal.AntiEntropyStage
org.apache.cassandra.metrics.ThreadPools.internal.AntiEntropyStage
Number of threads:1
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

MigrationStage

Making schema changes
JMX beans:org.apache.cassandra.internal.MigrationStage
org.apache.cassandra.metrics.ThreadPools.internal.MigrationStage
Number of threads:1
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

MemtablePostFlusher

Operations after flushing the memtable: discarding commit log files once all the data in them is in SSTables, and flushing non-CF-backed secondary indexes.
JMX beans:org.apache.cassandra.internal.MemtablePostFlusher
org.apache.cassandra.metrics.ThreadPools.internal.MemtablePostFlusher
Number of threads:1
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

FlushWriter

Sorts and writes memtables to disk. The vast majority of the time, this backing up is from overrunning the disk's capability. The sorting can cause issues as well, however; in the case of sorting being the problem, it is usually accompanied by high load but a small amount of actual flushes (seen in cfstats). It can come from huge rows with large column names, i.e. something inserting many large values into a CQL collection. If overrunning disk capabilities, it is recommended to add nodes or tune the configuration.
JMX beans:org.apache.cassandra.internal.FlushWriter
org.apache.cassandra.metrics.ThreadPools.internal.FlushWriter
Number of threads:memtable_flush_writers (1 per data directory)
Max pending tasks:memtable_flush_queue_size (default: 4)
Alerts:pending > 15 || blocked > 0

MiscStage

Snapshotting, and replicating data after a node removal has completed.
JMX beans:org.apache.cassandra.internal.MiscStage
org.apache.cassandra.metrics.ThreadPools.internal.MiscStage
Number of threads:1
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

InternalResponseStage

Responding to non-client initiated messages, including bootstrapping and schema checking
JMX beans:org.apache.cassandra.internal.InternalResponseStage
org.apache.cassandra.metrics.ThreadPools.internal.InternalResponseStage
Number of threads:number of processors
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

HintedHandoff

Sending missed mutations to other nodes. Usually a symptom of a problem elsewhere, so don't treat it as the root issue. Can use nodetool disablehandoff to prevent further saving. Can also use JMX (or nodetool truncatehints) to clear the handoffs for specific endpoints, or all of them, via the org.apache.cassandra.db:HintedHandoffManager operations. This must be followed up with repairs!
JMX beans:org.apache.cassandra.internal.HintedHandoff
org.apache.cassandra.metrics.ThreadPools.internal.HintedHandoff
Number of threads:max_hints_delivery_threads (default: 1)
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

MemoryMeter

Measures the memory usage and live ratio of a memtable. Pending should not grow beyond the number of memtables (the stage is front-ended by a set to prevent duplicates). This can take minutes for large memtables.
JMX beans:org.apache.cassandra.internal.MemoryMeter
org.apache.cassandra.metrics.ThreadPools.internal.MemoryMeter
Number of threads:1
Max pending tasks:2^31 - 1
Alerts:pending > column_family_count || blocked > 0

PendingRangeCalculator

Calculates the token ranges based on bootstrapping and leaving nodes. Instead of blocking, this stage discards tasks when one is already in progress, so there is no value in monitoring it.
JMX beans:org.apache.cassandra.internal.PendingRangeCalculator
org.apache.cassandra.metrics.ThreadPools.internal.PendingRangeCalculator
Number of threads:1
Max pending tasks:1

commitlog_archiver

(Renamed CommitLogArchiver in 2.1.) Executes a command to copy (or a user-defined command to archive) commit log files for recovery. Read more at the DataStax Dev Blog.
JMX beans:org.apache.cassandra.internal.commitlog_archiver
org.apache.cassandra.metrics.ThreadPools.internal.commitlog_archiver
Number of threads:1
Max pending tasks:2^31 - 1
Alerts:pending > 15 || blocked > 0

AntiEntropySessions

Number of active repairs that are in progress. Will not show up until after a repair has been run so any monitoring utilities should be able to handle it not existing.
JMX beans:org.apache.cassandra.internal.AntiEntropySessions
org.apache.cassandra.metrics.ThreadPools.internal.AntiEntropySessions
Number of threads:4
Max pending tasks:2^31 - 1
Alerts:pending > 1 || blocked > 0
