Clustering Scenario-Based Problems, Issues, Troubleshooting, and Fixes
1. Re-validate a cluster configuration
Open Failover Cluster Management, and then, under Management, click Validate a Configuration.
To view the report, open %SystemRoot%\Cluster\Reports\Validation Report date and time.html, where "date and time" stands for when the validation ran.
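If several validation runs have accumulated, a short script can pick out the newest report. The following is a minimal sketch, assuming the reports folder is readable and that report file names start with "Validation Report"; the helper names and the C:\Windows fallback are illustrative, not part of any cluster tooling.

```python
import os
from pathlib import Path
from typing import Optional


def default_reports_dir() -> Path:
    """Build %SystemRoot%\\Cluster\\Reports from the environment."""
    # Assumption: fall back to C:\Windows when SystemRoot is not set.
    system_root = os.environ.get("SystemRoot", r"C:\Windows")
    return Path(system_root) / "Cluster" / "Reports"


def latest_validation_report(reports_dir: Path) -> Optional[Path]:
    """Return the most recently modified validation report, or None if there is none."""
    reports = list(reports_dir.glob("Validation Report*.html"))
    if not reports:
        return None
    return max(reports, key=lambda p: p.stat().st_mtime)
```

On a cluster node, latest_validation_report(default_reports_dir()) would return the path to open in a browser.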
2. Network is down.
Usually, a failover cluster failure is the result of one of two causes:
· Hardware failure in one node of the two-node cluster. This hardware failure could be caused by a failure in the SCSI card or in the operating system.
To recover from this failure, remove the failed node from the failover cluster using the SQL Server Setup program, address the hardware failure with the computer offline, bring the machine back up, and then add the repaired node back to the failover cluster instance.
· Operating system failure. In this case, the node is offline, but is not irretrievably broken.
To recover from an operating system failure, recover the node and test failover. If the SQL Server instance does not fail over properly, you must use the SQL Server Setup program to remove SQL Server from the failover cluster, make necessary repairs, bring the computer back up, and then add the repaired node back to the failover cluster instance.
Recovering from operating system failure this way can take time. If the operating system failure can be recovered easily, avoid using this technique.
The following list describes common usage issues and explains how to resolve them.
Issue 1: It is difficult to diagnose Setup issues when using the /qn switch from the command prompt, because the /qn switch suppresses all Setup dialog boxes and error messages. If the /qn switch is specified, all Setup messages, including error messages, are written to the Setup log files.
Resolution 1: Use the /qb switch instead of the /qn switch. With the /qb switch, the basic UI for each step is displayed, including error messages.
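When Setup is launched from a script, the switch swap can be applied to the argument list before the command runs. This is a minimal sketch; the function name and the example arguments are hypothetical, and only the /qn and /qb switches come from the text above.

```python
from typing import List


def use_basic_ui(setup_args: List[str]) -> List[str]:
    """Replace the silent /qn switch with /qb and leave every other argument alone."""
    return ["/qb" if arg.lower() == "/qn" else arg for arg in setup_args]
```

For example, use_basic_ui(["setup.exe", "/qn"]) returns ["setup.exe", "/qb"], so a failing unattended install can be rerun with visible error messages.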
Issue 1: SQL Server service accounts are unable to contact a domain controller.
Resolution 1: Check the event logs for signs of networking issues such as adapter failures or DNS problems. Verify that you can ping the domain controller, and check DNS Manager as well.
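Before pinging, it can help to confirm that the domain controller's name resolves at all, since a DNS failure looks much like an unreachable host. The sketch below uses only Python's standard socket module; the function name is an assumption, and the resolver parameter exists purely so the check can be exercised without a live network.

```python
import socket
from typing import Callable, List


def resolve_addresses(hostname: str, resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique IP addresses a name resolves to, or [] if resolution fails."""
    try:
        results = resolver(hostname, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in results})
```

An empty list for the domain controller's name points at DNS rather than at the network adapter.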
Issue 2: SQL Server service account passwords are not identical on all cluster nodes, or the node does not restart a SQL Server service that has migrated from a failed node.
Resolution 2: Change the SQL Server service account passwords using SQL Server Configuration Manager, which updates the password on all nodes automatically. If you change the password on one node by any other means, you must also change it on all other nodes.
Issue 3: The SQL Server service account password has expired.
Resolution 3: Change the password, and then update it on each node for the MSSQLSERVER service.
Issue 1: Firmware or drivers are not updated on all nodes.
Resolution 1: Verify that all nodes are using the correct firmware versions and the same driver versions.
Issue 2: A node cannot recover cluster disks that have migrated from a failed node, because the shared cluster disk has a different drive letter on that node.
Resolution 2: Disk drive letters for the cluster disks must be the same on both servers. If they are not, review your original installation of the operating system and Microsoft Cluster Service (MSCS).
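Once the drive letters have been noted on each node, the comparison is easy to automate. The following is a minimal sketch, assuming each node's mapping has been collected by hand into a dictionary of disk name to drive letter; the function name and the sample disk names are hypothetical.

```python
from typing import Dict, List


def drive_letter_mismatches(node_a: Dict[str, str], node_b: Dict[str, str]) -> List[str]:
    """Return the cluster disks whose drive letter differs, or is missing, on one node."""
    mismatched = []
    for disk in sorted(set(node_a) | set(node_b)):
        # Normalize "s:", "S:" and "S" to the same form before comparing.
        letter_a = node_a.get(disk, "").upper().rstrip(":")
        letter_b = node_b.get(disk, "").upper().rstrip(":")
        if not letter_a or letter_a != letter_b:
            mismatched.append(disk)
    return mismatched
```

An empty result means both servers agree on every cluster disk; any disk that is returned needs its letter corrected before failover will work.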
Resolution: To prevent the failure of specific services from causing the SQL Server group to fail over, configure those services using Cluster Administrator in Windows, as follows:
The Microsoft® Exchange Server Analyzer Tool reads the following registry entry to determine whether you have configured a failure of the Microsoft Distributed Transaction Coordinator (MSDTC) resource to affect the group:
HKEY_LOCAL_MACHINE\Cluster\Resources\
If the value of ClusterMSDTCInstance is 2, the Exchange Server Analyzer displays a warning.
The MSDTC resource must be present in an Exchange cluster to support initial installation and service pack upgrades. However, it is not required while Exchange is running.
By default, a failure of the MSDTC resource affects the group. Two examples of resource failure are as follows:
- The log file size exceeds the capacity of the disk.
- The physical disk for the MSDTC resource fails.
Specifically, a failure of the MSDTC resource causes all Exchange services that are running on that cluster node to fail over to a different node in the cluster. However, because the MSDTC resource is not a required resource, it does not have to be configured to affect the group.
To resolve this warning, configure the Exchange cluster so that a failure of the MSDTC resource does not affect the group.
To configure Exchange so that a failure of the MSDTC resource does not affect the group:
1. Log on to any node of the cluster.
2. Click Start, point to All Programs, point to Administrative Tools, and then click Cluster Administrator.
3. Under Groups, click the cluster group that includes the MSDTC resource.
4. Right-click the MSDTC resource, and then click Properties.
5. On the Advanced tab, clear the Affect the group check box, and then click OK.
· Clear the Affect the Group check box on the Advanced tab of the Full Text Properties dialog box. However, if SQL Server causes a failover, the full-text search service restarts.
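The "Affect the group" check box can be thought of as one of three restart settings. The sketch below assumes the numeric value follows the CLUSTER_RESOURCE_RESTART_ACTION enumeration from the Windows failover cluster API, which matches the warning on a value of 2 described earlier; the Python names are illustrative, and the value itself has to be read separately.

```python
from enum import IntEnum


class RestartAction(IntEnum):
    """Assumed mapping, mirroring CLUSTER_RESOURCE_RESTART_ACTION."""
    DONT_RESTART = 0        # the resource is not restarted after a failure
    RESTART_NO_NOTIFY = 1   # restart the resource without affecting the group
    RESTART_NOTIFY = 2      # restart the resource; a failure can fail over the group


def affects_group(value: int) -> bool:
    """True when a resource failure is configured to affect its group."""
    return RestartAction(value) is RestartAction.RESTART_NOTIFY
```

Under this assumption, clearing the check box corresponds to moving a resource from RESTART_NOTIFY to RESTART_NO_NOTIFY.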
Resolution: Use Cluster Administrator in MSCS to automatically start a failover cluster. The SQL Server service should be set to start manually; Cluster Administrator should be configured in MSCS to start the SQL Server service.
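To confirm that the service really is set to start manually, the output of sc qc for the service can be inspected. This is a minimal sketch that parses text already captured from that command; the function names are assumptions, and DEMAND_START is the label sc uses for a manual start type.

```python
import re
from typing import Optional


def start_type(sc_qc_output: str) -> Optional[str]:
    """Extract the start type label (for example DEMAND_START) from `sc qc` output."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


def is_manual_start(sc_qc_output: str) -> bool:
    """True when the service is configured for manual (demand) start."""
    return start_type(sc_qc_output) == "DEMAND_START"
```

On a node, the text could come from running sc qc MSSQLSERVER for a default instance; a result of AUTO_START would mean Windows, not the cluster, is trying to start SQL Server.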
Issue 1: DNS is failing with the cluster resource set to require DNS.
Resolution 1: Correct the DNS problems.
Issue 2: A duplicate name is on the network.
Resolution 2: Use NBTSTAT to find the duplicate name, and then correct the issue.
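In the local name table printed by nbtstat -n, a duplicate name shows a status of Conflict instead of Registered. The following sketch scans captured output for such rows; the function name is hypothetical, and the regular expression assumes the usual Name, Type, Status column layout.

```python
import re
from typing import List

# Matches rows such as: "    VIRTSQL        <20>  UNIQUE      Conflict"
ROW = re.compile(r"^\s*(\S.*?)\s*<[0-9A-Fa-f]{2}>\s+(?:UNIQUE|GROUP)\s+(\w+)\s*$")


def conflicting_names(nbtstat_output: str) -> List[str]:
    """Return the NetBIOS names whose status column reads Conflict."""
    names: List[str] = []
    for line in nbtstat_output.splitlines():
        match = ROW.match(line)
        if match and match.group(2).lower() == "conflict":
            if match.group(1) not in names:
                names.append(match.group(1))
    return names
```

Any name returned here is registered elsewhere on the network and must be renamed or removed on one of the two machines.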
Issue 3: SQL Server is not connecting using Named Pipes.
Resolution 3: To connect using Named Pipes, create an alias using SQL Server Configuration Manager to connect to the appropriate computer. For example, if you have a cluster with two nodes (Node A and Node B) and a failover cluster instance (Virtsql) with a default instance, you can connect to the server that has the Network Name resource offline using the following steps:
1. Determine on which node the group containing the instance of SQL Server is running by using Cluster Administrator. For this example, it is Node A.
2. Start the SQL Server service on that computer using net start.
3. Start SQL Server Configuration Manager on Node A. View the pipe name on which the server is listening. It should be similar to \\.\$$\VIRTSQL\pipe\sql\query.
4. On the client computer, start SQL Server Configuration Manager.
5. Create an alias SQLTEST1 to connect through Named Pipes to this pipe name. To do this, enter Node A as the server name and edit the pipe name to be \\.\pipe\$$\VIRTSQL\sql\query.
6. Connect to this instance using the alias SQLTEST1 as the server name.
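Backslashes in pipe names are easy to mistype, so it can be convenient to generate the alias pipe name from the virtual server name. This is a minimal sketch that reproduces the format shown in step 5 for a default instance; the function name is hypothetical, and named instances are not covered.

```python
def clustered_default_pipe(virtual_server: str) -> str:
    """Build the alias pipe name from step 5 for a default instance on a virtual server."""
    if not virtual_server:
        raise ValueError("virtual server name must not be empty")
    # Raw strings keep the backslashes literal; "\\" is a single backslash.
    return r"\\.\pipe\$$" + "\\" + virtual_server + r"\sql\query"
```

Calling clustered_default_pipe("VIRTSQL") produces the same string as the example in step 5.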
Issue: An orphan registry key exists in [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.X\Cluster].
Resolution: Make sure the MSSQL.X registry hive is not currently in use, and then delete the cluster key.
Issue: This error is caused by a SCSI shared drive that is not partitioned properly.
Resolution: Re-create a single partition on the shared disk using the following steps:
1. Delete the disk resource from the cluster.
2. Delete all partitions on the disk.
3. Verify in the disk properties that the disk is a basic disk.
4. Create one partition on the shared disk, format the disk, and assign a drive letter to the disk.
5. Add the disk to the cluster using Cluster Administrator (cluadmin).
6. Run SQL Server Setup.
Issue: Because the Microsoft Distributed Transaction Coordinator (MS DTC) is not completely configured in Windows, applications may fail to enlist SQL Server resources in a distributed transaction. This problem can affect linked servers, distributed queries, and remote stored procedures that use distributed transactions. For more information about how to configure MS DTC, see Before Installing Failover Clustering.
Resolution: To prevent such problems, you must fully enable MS DTC services on the servers where SQL Server is installed and MS DTC is configured.
To fully enable MS DTC, use the following steps:
1. In Control Panel, open Administrative Tools, and then open Computer Management.
2. In the left pane of Computer Management, expand Services and Applications, and then click Services.
3. In the right pane of Computer Management, right-click Distributed Transaction Coordinator, and select Properties.
4. In the Distributed Transaction Coordinator window, click the General tab, and then click Stop to stop the service.
5. In the Distributed Transaction Coordinator window, click the Log On tab, and set the logon account to NT AUTHORITY\NetworkService.
6. Click Apply and OK to close the Distributed Transaction Coordinator window. Close the Computer Management window. Close the Administrative Tools window.
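The stop and logon-account steps above can also be expressed as sc.exe commands for use on several servers. The sketch below only builds the command lines and does not run them. It assumes the service name is MSDTC and that sc config accepts the account through its obj= option without a password; both should be verified on a test machine first.

```python
from typing import List

SERVICE = "MSDTC"  # assumed service name of the Distributed Transaction Coordinator
ACCOUNT = r"NT AUTHORITY\NetworkService"


def msdtc_logon_commands() -> List[List[str]]:
    """Commands mirroring steps 4 and 5: stop the service, then set its logon account."""
    return [
        ["sc", "stop", SERVICE],
        # sc.exe expects "obj=" and its value as two separate arguments.
        ["sc", "config", SERVICE, "obj=", ACCOUNT],
    ]
```

On Windows, each list could be passed to subprocess.run(cmd, check=True) from an elevated prompt.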
(j) Quorum log too small
The Microsoft® Exchange Server Analyzer Tool reads the following registry entry to determine the size of the quorum log configured for the cluster:
HKEY_LOCAL_MACHINE\Cluster\Quorum\MaxQuorumLogSize
If the Exchange Server Analyzer finds that the value for MaxQuorumLogSize is less than 4194304 decimal (0x400000 hexadecimal), a warning is displayed.
Warning: The MaxQuorumLogSize registry value represents the currently configured value for the Reset quorum log at cluster quorum parameter. This warning is generated if MaxQuorumLogSize is less than 4096 kilobytes (KB).
The cluster records all changes to the cluster database in the quorum log file. When the quorum log attains the specified size, the cluster saves the database and resets the log file. On Microsoft Windows® 2000 Server-based clusters, the default quorum size limit is 64 KB. On Windows Server™ 2003-based clusters, the default quorum size limit is 4096 KB. For Exchange Server clusters, it is recommended that the Reset quorum log at property be configured to 4096 KB. This ensures that there will be sufficient space to hold the cluster configuration information, such as which servers are part of the cluster, what resources are installed in the cluster, and what state those resources are in (for example, online or offline).
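The threshold is the same number in three forms: 4096 KB, 4194304 bytes, and 0x400000. The following minimal sketch shows the arithmetic and the comparison; it assumes the registry value has already been read and is expressed in bytes, and the names are illustrative.

```python
RECOMMENDED_KB = 4096
RECOMMENDED_BYTES = RECOMMENDED_KB * 1024  # 4194304 decimal, 0x400000 hexadecimal


def quorum_log_too_small(max_quorum_log_size: int) -> bool:
    """True when the MaxQuorumLogSize value (in bytes) is below the 4096 KB threshold."""
    return max_quorum_log_size < RECOMMENDED_BYTES
```

A Windows 2000 default of 64 KB (65536 bytes) triggers the warning, while the Windows Server 2003 default of 4096 KB does not.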
To correct this warning:
1. Open Cluster Administrator.
2. In the left pane, right-click the object that represents the cluster, and then click Properties.
3. On the Quorum tab, configure Reset quorum log at with a value of 4096.
4. Click OK to save the changes.
Please comment here with other clustering scenario-based problems, issues, troubleshooting, and fixes. - Jainendra Verma