Saturday, March 14, 2015

MS clustering

Here I am putting together some questions that might help you prepare on clustering. This is just a start; you can follow it up with more in-depth study and CBTs.



What is a Cluster?

A group of two or more servers working together to keep a service or application available even when one of its members goes down. Examples of clusters are Microsoft Cluster Service (MSCS) and Microsoft Network Load Balancing (NLB) clusters.


Is a cluster a highly available or a fault-tolerant solution?

When we talk about MSCS, we are talking about a highly available solution; when we talk about WLBS, i.e. Microsoft NLB, we are talking about a fault-tolerant solution.


How is MSCS different from NLB?

The most important thing to remember is that MSCS preserves the application's state (its data on the shared storage) across a failover, but NLB doesn't preserve session state. For example, take a two-node MSCS cluster hosting a SQL database. The database lives on the shared storage, so if one node goes down, the other node takes over the same database. Work in progress is interrupted while the failover happens, but the service resumes as soon as the other node takes over, and committed data is intact. In the case of NLB, on the other hand, if a node goes down, any session associated with it ends, and the client has to reconnect and establish a new session. Take OWA as an example: if the node my OWA connection is established with goes down, I need to reconnect, because my session is lost when the node serving it fails.


What is a quorum, a quorum resource, or a quorum disk?

The quorum resource is a common resource in the cluster that is accessible by all of the cluster nodes. Normally a physical disk on the shared storage, the quorum resource maintains data integrity, cluster unity, and cluster operations (such as forming or joining a cluster) by performing the following tasks:
  • Enables a single node to gain and defend its physical control of the quorum resource: when the cluster is formed or when the cluster nodes fail to communicate, the quorum resource guarantees that only one set of active, communicating nodes is allowed to form a cluster.
  • Maintains cluster unity: the quorum resource allows cluster nodes that can communicate with the node containing the quorum resource to remain in the cluster. If a cluster node fails for any reason and the cluster node containing the quorum resource is unable to communicate with the remaining nodes in the cluster, MSCS automatically shuts down the node that does not control the quorum resource.
  • Stores the most current version of the cluster configuration database and state data: if a cluster node fails, the configuration database helps the cluster recover a failed resource or recreate the cluster in its current configuration.
The only type of resource supported by MSCS that can act as a quorum resource is the physical disk resource. However, developers can create their own quorum disk types for any resources that meet the arbitration and storage requirements.
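The arbitration rule above can be illustrated with a toy model. This is a minimal Python sketch under simplifying assumptions, not how MSCS actually arbitrates; the real mechanism works through disk reservations on the quorum device. It assumes the cluster has already split into partitions of nodes that can still talk to each other, and that we know which node controls the quorum resource. The function names and node names are hypothetical.

```python
from typing import FrozenSet, List, Optional, Set


def surviving_partition(partitions: List[FrozenSet[str]],
                        quorum_owner: str) -> Optional[FrozenSet[str]]:
    """Return the partition allowed to keep the cluster running.

    Toy model of a single-quorum-device cluster: only the partition that
    contains the node controlling the quorum resource survives.
    """
    for partition in partitions:
        if quorum_owner in partition:
            return partition
    return None  # no partition controls the quorum resource


def nodes_to_shut_down(partitions: List[FrozenSet[str]],
                       quorum_owner: str) -> Set[str]:
    """Nodes outside the winning partition stop their Cluster service."""
    winner = surviving_partition(partitions, quorum_owner)
    return {node for partition in partitions if partition != winner
            for node in partition}


# Hypothetical 3-node cluster split by a network failure; NODE3 owns the quorum disk.
split = [frozenset({"NODE1", "NODE2"}), frozenset({"NODE3"})]
print(sorted(nodes_to_shut_down(split, "NODE3")))  # ['NODE1', 'NODE2']
```

In this model the larger partition loses because it does not hold the quorum resource. That is the behavior a single quorum device gives you: control of the quorum decides, not the head count.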




What’s the advantage/disadvantage of having a one-node cluster?

A one-node cluster is used in situations where we just want a stopped service to be restarted automatically. Some services don't have the ability to restart on their own. They are hosted on a one-node cluster so that the Cluster service restarts the failed service and we are good to go. However, if the node itself fails, the service becomes unavailable.



How important is it to keep the public and heartbeat networks separate? Is a cluster possible with just one NIC per node?

It's very important to keep the public and private (heartbeat) networks separate, because combining the two can delay heartbeat packets on their way to the Cluster service. This in turn can make the cluster fail over. It is possible to build a cluster with a single NIC per node, but that is not a supported configuration. For more information, please read the relevant Microsoft support article.


What’s the port number for heartbeat communication between Cluster Nodes?

The Cluster service uses UDP port 3343 for heartbeat communication between the nodes. In addition, to ensure correct failover cluster functionality, add exceptions to the firewall configuration for File and Printer Sharing (TCP 139/445 and UDP 137/138).
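As a sketch of how those firewall exceptions might be scripted, the Python snippet below only builds command strings; it does not run anything. It assumes Windows Server 2008 or later, where the "netsh advfirewall firewall add rule" context exists (Windows 2003 uses the older "netsh firewall" context instead). The rule names are hypothetical. The ports are the File and Printer Sharing ports from the answer above (TCP 139/445, UDP 137/138), plus UDP 3343, which the Cluster service uses for heartbeat traffic.

```python
# (protocol, port, purpose) for the inbound exceptions discussed above
CLUSTER_PORTS = [
    ("TCP", 139, "NetBIOS session service"),
    ("TCP", 445, "SMB"),
    ("UDP", 137, "NetBIOS name service"),
    ("UDP", 138, "NetBIOS datagram service"),
    ("UDP", 3343, "Cluster service heartbeat"),
]


def netsh_rule(protocol: str, port: int) -> str:
    """Build one inbound allow rule for Windows Firewall (Server 2008 and later)."""
    name = f"Cluster {protocol} {port}"  # hypothetical rule name
    return (f'netsh advfirewall firewall add rule name="{name}" '
            f"dir=in action=allow protocol={protocol} localport={port}")


if __name__ == "__main__":
    for protocol, port, purpose in CLUSTER_PORTS:
        print(f"REM {purpose}")  # REM is a comment line in cmd.exe
        print(netsh_rule(protocol, port))
```

The output can be reviewed and then run from an elevated command prompt on each node.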


Is it possible to have heartbeat and public NICs on the same subnet without causing any problems?
No, it's not possible to have heartbeat and public NICs on the same subnet without causing problems. Public NICs generate a lot of traffic, and that traffic can delay the heartbeat traffic through congestion. This would cause the cluster to fail over, which we don't want.
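A toy model shows why congestion leads to an unwanted failover. In this Python sketch, which illustrates the idea and is not the Cluster service's actual algorithm, a node is declared down once no heartbeat has arrived for a given number of heartbeat intervals. The one-second interval and the five-miss threshold are assumptions chosen for the example, not documented MSCS defaults.

```python
def is_declared_down(last_heartbeat_s: float, now_s: float,
                     interval_s: float = 1.0, missed_threshold: int = 5) -> bool:
    """True once `missed_threshold` whole intervals pass with no heartbeat.

    interval_s and missed_threshold are illustrative assumptions.
    """
    missed = int((now_s - last_heartbeat_s) // interval_s)
    return missed >= missed_threshold


# The peer is healthy, but congestion on a shared public/heartbeat network
# holds its heartbeats back. The last one got through at t = 10.0 s.
print(is_declared_down(10.0, 14.5))  # False: only 4 intervals missed so far
print(is_declared_down(10.0, 15.0))  # True: a healthy node is treated as failed
```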


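Can two nodes on different network subnets form a single cluster?

When configuring a set of clustered nodes, we need to have them on the same subnet. This is the requirement for Windows 2003 server clusters; Windows Server 2008 failover clustering added support for nodes on different subnets.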
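A quick way to sanity-check the same-subnet requirement before building the cluster is to compare the networks of the nodes' addresses. This Python sketch uses the standard ipaddress module; the addresses and prefix lengths are hypothetical examples.

```python
import ipaddress


def on_same_subnet(*interfaces: str) -> bool:
    """True if every 'address/prefix' string belongs to the same IP network."""
    networks = {ipaddress.ip_interface(i).network for i in interfaces}
    return len(networks) == 1


print(on_same_subnet("192.168.10.11/24", "192.168.10.12/24"))  # True
print(on_same_subnet("192.168.10.11/24", "192.168.20.12/24"))  # False
```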


What is the difference between multicast and unicast in a cluster? In which scenario is unicast/multicast a more viable solution than multicast/unicast?


The scenario in which unicast or multicast is more viable is given below. It is taken from Microsoft's documentation:
As the number of nodes in a server cluster increases, the node-to-node communication rises significantly. In Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition, for server clusters with 3 or more nodes, multiple unicast messages for two classes of intracluster traffic are replaced by single multicast messages. This reduces the intracluster traffic, resulting in lower network bandwidth consumption and improved node performance. The configuration of the server cluster must meet the following conditions for heartbeat multicasting: 
  • The number of nodes that are members of the server cluster (rather than the number of nodes that are currently up and actively participating in the cluster) must be 3 or greater.
  • All the nodes in the server cluster must be running Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition.
Important 
  • Both conditions above refer to nodes configured in the cluster membership rather than nodes that are currently up and actively participating in the cluster.
  • 2-node server clusters use unicast, not multicast, messaging for all intracluster traffic.
  • If you operate a server cluster of 3 or more nodes as a mixed version cluster (that is, Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition is installed on some nodes, and Windows 2000 on others), then the cluster as a whole will send unicast, not multicast, intracluster messages.
The two kinds of intracluster traffic affected by heartbeat multicasting are: 
  • Heartbeat messages sent between nodes.
  • Node-to-node communication to verify node failures during cluster configuration changes.
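The two conditions, and the reason multicast saves bandwidth, can be sketched in Python. This is a simplified model. It assumes that with unicast every node sends one heartbeat per round to every other node, and that with multicast each node sends one message to the whole group. Real intracluster traffic involves more message types. The edition labels are hypothetical strings.

```python
from typing import List

# Editions that satisfy the second condition above.
MULTICAST_EDITIONS = {"2003 Enterprise", "2003 Datacenter"}


def uses_multicast(configured_members: List[str]) -> bool:
    """Both conditions apply to configured members, not just nodes that are up."""
    return (len(configured_members) >= 3
            and all(ed in MULTICAST_EDITIONS for ed in configured_members))


def heartbeat_messages_per_round(configured_members: List[str]) -> int:
    n = len(configured_members)
    if uses_multicast(configured_members):
        return n            # each node sends one multicast message
    return n * (n - 1)      # each node sends a unicast message to every peer


print(heartbeat_messages_per_round(["2003 Enterprise"] * 2))             # 2 (2-node: unicast)
print(heartbeat_messages_per_round(["2003 Enterprise"] * 4))             # 4 (multicast)
print(heartbeat_messages_per_round(["2003 Enterprise"] * 3 + ["2000"]))  # 12 (mixed: unicast)
```

Under unicast the message count grows roughly with the square of the node count, while under multicast it grows linearly. This is why the gain appears only from three nodes upward.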



What kind of application is called cluster-aware?

An application is capable of being cluster-aware if it has the following characteristics:
   1)  It uses TCP/IP as a network protocol.
   2)  It maintains data in a configurable location.
   3)  It supports transaction processing.

To learn more, please visit this page.



Is every Windows application cluster-aware? Why or why not?

No, not every Windows application is cluster-aware, because not all of them meet the three criteria laid out for an application to qualify as cluster-aware. For example, MS Office doesn't maintain its data in a configurable location, and it doesn't support transaction processing either. There could be better examples of cluster-unaware applications, but I can't think of any right now.

On the other hand, a service like File and Printer Sharing, or an application like SQL Server or Exchange, meets all three criteria, and hence they are called cluster-aware applications.


Is it very important to have shared storage for a cluster to run? What kind of applications may not run if we don’t have shared storage?

Yes, it's important to have shared storage for clusters to run. Without a shared storage location, using a cluster (the exception being a single-node cluster) is futile: the services won't be available after failover because the common data source isn't there to be accessed. MS SQL and MS Exchange are examples of applications that will not run if we don't use shared storage.


What are the start options for the Cluster service, and in which scenarios are they used?

The following are some of the start options for the Cluster service. Please refer to the MS KB article for more detail.

Each entry below gives the switch, its Windows 2003 abbreviation in parentheses (where one exists), and its function.
  • FixQuorum (FQ): Do not mount the quorum device; quorum logging is turned off.
  • NoQuorumLogging (NQ): Quorum logging is turned off.
  • Debug: Displays events during the start of the Cluster service. For the special syntax, see the "Debug" section of the KB article.
  • LogLevel N: Sets the log level for debug mode.
  • DebugResMon (DR): The Cluster service waits for a debugger to be attached to all Resource Monitor processes at their start.

Switches available only in Windows 2000 and later:
  • ResetQuorumLog (RQ): Dynamically re-creates the quorum log and checkpoint files (this functionality is automatic in Microsoft Windows NT 4.0).
  • NoRepEvtLogging: No replication of Event Log entries.

Switches available only in Windows Server 2003 and later:
  • ForceQuorum, followed by a node list (FO): Forces a majority node set with the node list N1, N2, and so forth. (Applicable only to Majority Node Set quorum.)
  • NoGroupInfoEvtLogging (NG): Do not log events to the event log related to groups going online and offline.



What is a cluster log? What is its default location?


The cluster log is a diagnostic log that is a more complete record of cluster activity than the Microsoft Windows 2000/2003/2008 event log; the cluster log records the Cluster service activity that leads up to the events recorded in the event log. Although the event log can point you to a problem, the cluster log helps you get at its root. So, for diagnosis, check the event log first, then the cluster log.
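In Windows 2000/2003, the default location of the cluster log is %SystemRoot%\Cluster\cluster.log. In Windows 2008 you won't find it at the same location. There, the log is kept in Event Tracing for Windows (ETW) format and has to be generated on demand by running "cluster log /g", which writes cluster.log to %SystemRoot%\Cluster\Reports.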


How many logging levels are there in 2003? What is the default logging level of the cluster log in 2003?
Logging levels range from 0 to 10. Of these, five are in use, and levels 6 to 10 are reserved for future use. The default logging level is 3, which tells the cluster to log errors, warnings, and informational messages.
The MS KB has some information, but it may not apply entirely to Windows 2003 or Windows 2008. Please research the articles relevant to each.




What is the maximum size of a cluster log, and where can it be set or configured?

The default diagnostic cluster log size is 8 megabytes (MB). In Windows 2000/2003, it can be changed through the ClusterLogSize system environment variable (value in MB).



Explain the steps or things happening in the background while a cluster disk is coming online.



What is the difference between a Windows 2000 and a Windows 2003 cluster?



What is the difference between NLB and a cluster? How do you decide whether a cluster or NLB will be required to achieve high availability?


For the answer, see the third question from the top, which explains how MSCS differs from NLB.


How many nodes can you have in a Windows 2003 cluster and in a Windows 2008 cluster? (It depends on the OS edition.)
A server cluster can consist of up to eight nodes and may be configured in one of three ways: as a single node server cluster, as a single quorum device server cluster, or as a majority node set server cluster. For more information about these three server cluster models, see Choosing a Cluster Model.

In Windows 2008, the number of supported nodes goes up to 16.

Visit this page to find out more about Windows 2003.




Is it possible to have Windows 2000 and Windows 2003 nodes coexist in the same cluster?

Yes, it's possible.
Note that if you are running a mixed-mode cluster, the maximum number of supported nodes is that of the most restrictive node. For example, suppose you have a three-node Windows Server 2003 cluster (whose maximum number of nodes is eight) and you add a single Windows 2000 Datacenter Server node. The maximum number of nodes is then reduced to four.
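The "most restrictive node" rule is just a minimum over the per-edition limits. In this Python sketch the limits are the documented maximums: two nodes for Windows 2000 Advanced Server, four for Windows 2000 Datacenter Server, and eight for Windows Server 2003 Enterprise and Datacenter Editions. The dictionary keys are labels chosen for the example.

```python
MAX_NODES = {
    "2000 Advanced Server": 2,
    "2000 Datacenter Server": 4,
    "2003 Enterprise": 8,
    "2003 Datacenter": 8,
}


def max_supported_nodes(node_editions):
    """A mixed-mode cluster is limited by its most restrictive member."""
    return min(MAX_NODES[edition] for edition in node_editions)


# The example above: three Windows Server 2003 nodes plus one Windows 2000 Datacenter node.
print(max_supported_nodes(["2003 Enterprise"] * 3 + ["2000 Datacenter Server"]))  # 4
```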


Is it possible to have x86 and x64 Windows 2003 nodes in the same cluster?

No, we cannot mix x86 and x64 OS nodes in one cluster.


What is the difference between a cluster and a geographically dispersed cluster from an administrative perspective?
Geographically dispersed clusters, also called stretched clusters or extended clusters, are clusters comprised of nodes that are placed in different physical sites. Geographically dispersed clusters are designed to provide failover in the event of a site loss due to power issues, natural disasters or other unforeseen events.

From an administrative perspective, the difference comes from the storage being used. Instead of one common storage shared by all nodes, each site has its own storage, and replication between the sites has to be set up and managed. Managing failover is also different from a normal cluster.


Are dynamic disks supported in 2003? What about GPT disks?

The Windows 2000 Advanced Server and Windows Server 2003 Cluster service cannot read dynamic disks, and it makes them unavailable to programs or services that depend on these disk resources in the server cluster. For this reason, the option to upgrade these disks to dynamic is unavailable. So, to summarize: "Dynamic disks are not supported in Windows 2003 clusters."

GUID partition table disks are called GPT disks. On a GPT disk you can create up to 128 partitions. Because GPT disks do not limit you to four partitions, extended partitions and logical drives are not available on them.
GUID partition table (GPT) disks are not supported in a Windows Server 2003 server cluster unless you apply hotfix 919117. As soon as you apply this hotfix to all nodes in the Windows Server 2003 server cluster, GPT disks can be added as physical disk resources in the cluster.


What is a File Share witness?


The file share witness is used to establish a majority node set. This is done by creating a share on a server, into which a small file is placed automatically. The server hosting the cluster resource keeps an open file lock on this file. In a DAG (an Exchange 2010 Database Availability Group), I think this is the Primary Active Manager server. The other servers see this open file lock and interpret it as meaning that another cluster node is online, healthy, and available.

A file share witness is used when the DAG contains an even number of servers. When you initially create the DAG, you must specify the server and file location that will act as the file share witness. This is required regardless of how many servers are in the DAG (zero to start), so that the FSW will be used properly if you later have an even number of DAG members.

You do not need a dedicated server for the FSW, and it is typically recommended to use a Hub Transport server in the primary data center. This is usually a safe choice, because the Exchange team also manages the Hub Transport servers. In addition, the Exchange Trusted Subsystem will already be a member of the local Administrators group and have the permissions needed to create the file share. Some people put the FSW on clustered file servers.
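The witness matters with an even number of members because of how votes are counted. The Python sketch below is a simplified model of a Node and File Share Majority cluster, the Windows Server 2008 quorum model a DAG uses when it has an even number of members. Each node has one vote, the witness adds one more, and the cluster keeps running only while a strict majority of all configured votes is present. The function names are hypothetical.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority of all configured votes."""
    return total_votes // 2 + 1


def has_quorum(nodes_configured: int, nodes_up: int,
               use_witness: bool, witness_reachable: bool = False) -> bool:
    total = nodes_configured + (1 if use_witness else 0)
    present = nodes_up + (1 if use_witness and witness_reachable else 0)
    return present >= votes_needed(total)


# Two-member DAG and one server fails:
print(has_quorum(2, 1, use_witness=False))                         # False: 1 of 2 is not a majority
print(has_quorum(2, 1, use_witness=True, witness_reachable=True))  # True: 2 of 3 votes
```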



What is a dependency tree, and why is it crucial?
A dependency tree is a series of dependency relationships such that resource A depends on resource B, resource B depends on resource C, and so on.

Resources in a dependency tree obey the following rules:
  • A dependent resource and all of its dependencies must be in the same group.
  • The Cluster service takes a dependent resource offline before any of its dependencies are taken offline, and brings a dependent resource online after all of its dependencies are online, as determined by the dependency hierarchy.
  • Resource dependencies determine bindings. For example, clients will be bound to the particular IP address that a Network Name resource is dependent on.
If any of the three rules laid out for a dependency tree is not obeyed, the overall working of the clustered servers suffers. For example, a virtual server is a combination of two resources (for instance, an IP address resource and a network name resource). Both of them have to be in the same group; if they are not, the Cluster service cannot fail the group over properly.
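The online/offline ordering rule is a topological sort of the dependency tree. Here is a minimal Python sketch (Python 3.9+ for graphlib) over a hypothetical file-share virtual server. It illustrates the ordering rules and is not the Cluster service's code.

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each key depends on the resources in its set.
DEPENDS_ON = {
    "IP Address": set(),
    "Physical Disk": set(),
    "Network Name": {"IP Address"},
    "File Share": {"Network Name", "Physical Disk"},
}
GROUP_OF = {name: "FileServerGroup" for name in DEPENDS_ON}


def online_order(deps):
    """Dependencies first: a resource comes online only after everything it needs."""
    return list(TopologicalSorter(deps).static_order())


def offline_order(deps):
    """Dependents first: the reverse of the online order."""
    return list(reversed(online_order(deps)))


def same_group_rule_holds(deps, group_of):
    """Rule 1: a resource and all of its dependencies share one group."""
    return all(group_of[res] == group_of[dep]
               for res, needed in deps.items() for dep in needed)


print(online_order(DEPENDS_ON))
print(offline_order(DEPENDS_ON))
print(same_group_rule_holds(DEPENDS_ON, GROUP_OF))  # True
```

A circular dependency has no valid order. In that case TopologicalSorter raises graphlib.CycleError, which is one reason dependencies have to form a tree rather than a loop.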


What are IsAlive and LooksAlive? What is their frequency? What happens if IsAlive fails for a resource?

LooksAlive is a general (cursory) check of the resource.
IsAlive is a detailed check of the resource.
Both of them are used to determine whether the resource is in the online state.
You can specify two polling intervals and a timeout value for resources. The polling intervals control how often the MSCS Resource Monitor checks that the resource is available and operating. The frequency of each check is therefore whatever its interval is set to; by default, that is the value defined for the resource type. There are two levels of polling, known in Cluster Administrator as "Looks Alive" and "Is Alive". These values are named after the calls that the Resource Monitor makes to the resource to perform the polling. In "Looks Alive" polling, MSCS performs a cursory check to determine whether the resource is available and running. In "Is Alive" polling, MSCS performs a more thorough check to determine whether the resource is fully operational. The timeout value specifies how many seconds MSCS waits before it considers the resource failed. If IsAlive fails, the resource is marked as failed and the Cluster service applies the resource's restart policy. It tries to restart the resource and, depending on the Threshold, Period, and Affect the group settings, may fail the group over to another node. Read more in the fourth paragraph of the MSCS Failover section.
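To see how the two intervals interact, here is a toy Python model of when a failure gets noticed. The 5-second LooksAlive and 60-second IsAlive intervals are assumptions for the example; check the values configured for your resource type. The model also assumes that polls fire exactly at multiples of each interval.

```python
import math

LOOKS_ALIVE_MS = 5_000   # assumed interval, for illustration only
IS_ALIVE_MS = 60_000     # assumed interval, for illustration only


def next_poll(fail_at_ms: int, interval_ms: int) -> int:
    """First poll at or after the failure; polls fire at 0, interval, 2*interval, ..."""
    return math.ceil(fail_at_ms / interval_ms) * interval_ms


def detected_at(fail_at_ms: int, visible_to_looks_alive: bool) -> int:
    """When the Resource Monitor would notice the failure in this model.

    A failure the cursory LooksAlive check can see is caught by whichever poll
    comes first; a subtler one is only caught by the thorough IsAlive poll.
    """
    is_alive_hit = next_poll(fail_at_ms, IS_ALIVE_MS)
    if visible_to_looks_alive:
        return min(next_poll(fail_at_ms, LOOKS_ALIVE_MS), is_alive_hit)
    return is_alive_hit


print(detected_at(12_000, visible_to_looks_alive=True))   # 15000
print(detected_at(12_000, visible_to_looks_alive=False))  # 60000
```

Shorter intervals catch failures sooner at the cost of more polling overhead. This trade-off is why the thorough check typically runs less often than the cursory one.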


How many times will the cluster try to restart a group/resource before it marks the group/resource as failed?

It all depends on how we configure it. You can configure the advanced resource properties using the Advanced tab in the resource Properties dialog box. Use the Advanced tab to have MSCS perform the following tasks:
  • Restart a resource or allow the resource to fail.
    • To restart the resource, select Affect the group (if applicable).
    • To fail over the resource group to another cluster node when the resource fails, select Affect the group and then enter the appropriate values in Threshold and Period. If you do not select Affect the group, the resource group will not fail over to the healthy cluster node.
    • The Threshold value determines the number of attempts by MSCS to restart the resource before the resource fails over to a healthy cluster node.
    • The Period value sets the time window within which those Threshold restart attempts are counted.
  • Adjust the time parameters for Looks Alive (general check of the resource) or Is Alive (detailed check of the resource) to determine whether the resource is in the online state.
  • Select the default number for the resource type. To apply the default number, select Use resource type value.
  • Specify the time allowed for a resource in a pending state (Online Pending or Offline Pending) to resolve its status before it is moved to Offline or Failed status.
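The Threshold/Period behavior amounts to a sliding window of failures. The Python class below is a simplified model, not MSCS code. It counts failures within the last Period seconds and restarts the resource while the count is within Threshold. Otherwise it fails the group over, or leaves the resource failed if Affect the group is not selected. The class name and the example values (Threshold 3, Period 900 seconds) are assumptions for illustration.

```python
from collections import deque


class RestartPolicy:
    """Toy model of a resource's Threshold/Period/Affect the group settings."""

    def __init__(self, threshold: int, period_s: float, affect_group: bool):
        self.threshold = threshold
        self.period_s = period_s
        self.affect_group = affect_group
        self._failures = deque()  # timestamps of recent failures

    def on_failure(self, t: float) -> str:
        # Drop failures that are older than the Period window.
        while self._failures and t - self._failures[0] > self.period_s:
            self._failures.popleft()
        self._failures.append(t)
        if len(self._failures) <= self.threshold:
            return "restart"
        return "fail over group" if self.affect_group else "mark failed"


policy = RestartPolicy(threshold=3, period_s=900, affect_group=True)
print([policy.on_failure(t) for t in (0, 60, 120, 180)])
# ['restart', 'restart', 'restart', 'fail over group']
```

Failures spread out more than one Period apart never accumulate, so a resource that fails only occasionally keeps being restarted in place.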



What are the improvements made in Windows 2008 clustering?


Some of the improvements worth mentioning here are given below:

  • New validation feature. With this feature, you can check that your system, storage, and network configuration is suitable for a cluster.
  • Support for GUID partition table (GPT) disks in cluster storage. GPT disks can have partitions larger than two terabytes and have built-in redundancy in the way partition information is stored, unlike master boot record (MBR) disks.
For the full list of improvements, please visit this page.


For some on-demand webcasts and PPTs on Windows 2008 clustering, please visit this page.



The following resources were referenced while writing these questions:

Microsoft TechNet
Microsoft MSDN
support.dell.com
Microsoft TechNet blog


Here are links to some demos that make the installation and setup of MSCS on Windows Server 2008 R2 much easier to understand.

Windows Server 2008 R2 - Failover Clustering Introduction (Part 1 of 4)
http://www.youtube.com/watch?v=wcByPD_PuQE&playnext=1&list=PL1CCA511156C92DB2&feature=results_video

Windows server 2008 R2 - Cluster Host Setup (Part 2 of 4)
http://www.youtube.com/watch?v=0RuwYDTQ2eQ&playnext=1&list=PL1CCA511156C92DB2&feature=results_video

windows Server 2008 R2 - Cluster Installation (Part 3 of 4)
http://www.youtube.com/watch?v=rtD0dF9i8AA&playnext=1&list=PL1CCA511156C92DB2&feature=results_video

MS SQL Enterprise Cluster Install (Part 4 of 4) 

http://www.youtube.com/watch?v=eBm6gr39lTI 
