Multipath Usage Guide for SANs

1. Overview

The connection from the server through the HBA to the storage controller is referred to as a path. When multiple paths exist to a storage device (LUN) on a storage subsystem, it is referred to as multipath connectivity. It is an enterprise-level storage capability. The main purpose of multipath connectivity is to provide redundant access to the storage devices, i.e., to retain access to the storage device when one or more of the components in a path fail. Another advantage of multipathing is increased throughput by way of load balancing.
  • Note: Multipathing protects against the failure of path(s) and not the failure of a specific storage device.
A common example of multipath is a SAN-connected storage device. Usually one or more Fibre Channel HBAs from the host are connected to the fabric switch, and the storage controllers are connected to the same switch.
A simple example of multipath could be: two HBAs connected to a switch to which the storage controllers are connected. In this case the storage controller can be accessed from either of the HBAs, and hence we have multipath connectivity.

In Linux, a SCSI device is configured for a LUN seen on each path, i.e., if a LUN has four paths, then one will see four SCSI devices configured for the same LUN. Doing I/O to a LUN in such an environment is unmanageable without help, because someone has to take care of the following:
  • knowing which SCSI device applications and administrators should use
  • making all applications consistently use the same device
  • in case of a path failure, knowing to retry the I/O on a different path
  • always using the storage-device-specific preferred path
  • spreading I/O between multiple valid paths

1.1. Device Mapper

Device mapper is a block subsystem that provides a layering mechanism for block devices. One can write a device mapper module to provide a specific functionality on top of a block device.
Currently the following functional layers are available:
  • concatenation
  • mirror
  • striping
  • encryption
  • flaky
  • delay
  • multipath
Multiple device mapper modules can be stacked to get the combined functionality.

1.2. Device Mapper Multipathing

The objective of this document is to provide details on device mapper multipathing (DM-MP). DM-MP resolves the issues that arise in accessing a multipathed device in Linux. It also provides a consistent user interface for storage devices provided by multiple vendors. There is only one block device (/dev/mapper/XXX) for a LUN. This is the device created by device mapper.
Paths are grouped into priority groups. One of the priority groups is used for I/O and is called active. A path selector selects a path in the priority group to be used for an I/O based on some load balancing algorithm (for example round-robin).
When an I/O fails on a path, that path gets disabled and the I/O is retried on a different path in the same priority group. If all paths in a priority group fail, a different priority group which is enabled will be selected to send I/O.
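The retry behaviour described above can be illustrated with a small user-space model. This is only a sketch, not the kernel implementation: the class PriorityGroup, the function submit_io and the send callback are hypothetical names invented for this illustration, and the device names are placeholders.

```python
class PriorityGroup:
    """Toy model of a priority group with a round-robin path selector."""

    def __init__(self, name, paths):
        self.name = name
        self.paths = list(paths)
        self.failed = set()   # paths disabled after an I/O error
        self._next = 0        # round-robin counter

    def select_path(self):
        """Return the next usable path in round-robin order, or None."""
        usable = [p for p in self.paths if p not in self.failed]
        if not usable:
            return None
        path = usable[self._next % len(usable)]
        self._next += 1
        return path


def submit_io(groups, send):
    """Send one I/O, retrying on other paths and then on other groups.

    `groups` is ordered from most to least preferred; `send(path)` is a
    caller-supplied function returning True on success.
    """
    for group in groups:
        while True:
            path = group.select_path()
            if path is None:
                break                  # whole group failed, try the next one
            if send(path):
                return path
            group.failed.add(path)     # disable the path, retry in same group
    return None                        # no usable path anywhere
```

With two groups, [sda, sdb] and [sdc], an I/O that fails on sda is retried on sdb; only when both sda and sdb have failed does the model move on to sdc.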
DM-MP consists of four components:
  1. DM MP kernel module - Kernel module that is responsible for making the multipathing decisions in normal and failure situations.
  2. multipath command - User space tool that lets the user perform initial configuration, listing and deletion of multipathed devices.
  3. multipathd daemon - User space daemon that constantly monitors the paths. It marks a path as failed when it finds the path faulty, and if all the paths in a priority group are faulty, it switches to the next enabled priority group. It keeps checking the failed path; once the failed path comes alive, it can reactivate the path based on the failback policy. It provides a CLI to monitor and manage individual paths. It automatically creates device mapper entries when new devices come into existence.
  4. kpartx - User space command that creates device mapper entries for all the partitions in a multipathed disk/LUN. When the multipath command is invoked, this command automatically gets invoked. For DOS-based partitions this command needs to be run manually.

2. Terminology, Concepts and Usage

2.1. Output of multipath command

2.2. Terminology

Path
Connection from the server through an HBA to a specific LUN. Without DM-MP, each path would appear as a separate device.
Path Group
Paths are grouped into path groups. At any point in time only one path group will be active. The path selector decides which path in the path group gets to send the next I/O. I/O will be sent only to paths in the active path group.
Path Priority
Each path has a specific priority. A priority callout program provides the priority for a given path. The user space commands use this priority value to choose an active path. In the group_by_prio path grouping policy, path priority is used to group the paths together and change their relative weight with the round robin path selector.
Path Group Priority
Sum of priorities of all non-faulty paths in a path group. By default, the multipathd daemon tries to keep the path group with the highest priority active.
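As a worked example of this definition, the following sketch computes path group priorities and picks the group that multipathd would try to keep active by default. The function names and the (priority, state) tuple layout are assumptions made for this illustration only; they are not part of multipath-tools.

```python
def path_group_priority(paths):
    """Sum the priorities of all non-faulty paths in one path group.

    `paths` is a list of (priority, state) tuples.
    """
    return sum(prio for prio, state in paths if state != "faulty")


def preferred_group(groups):
    """Return the name of the path group with the highest priority.

    `groups` maps a group name to its list of (priority, state) tuples.
    """
    return max(groups, key=lambda name: path_group_priority(groups[name]))
```

For instance, a group with paths (50, ready) and (50, faulty) has priority 50, while a group with two ready paths of priority 10 has priority 20, so the first group is preferred.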
Path Grouping Policy
Determines how the path group(s) are formed using the available paths. There are five different policies:
  1. multibus: One path group is formed with all paths to a LUN. Suitable for devices that are in Active/Active mode.
  2. failover: Each path group will have only one path.
  3. group_by_serial: One path group per storage controller (serial). All paths that connect to the LUN through a controller are assigned to a path group. Suitable for devices that are in Active/Passive mode.
  4. group_by_prio: Paths with same priority will be assigned to a path group.
  5. group_by_node_name: Paths with same target node name will be assigned to a path group.
Setting multibus as path grouping policy for a storage device in Active/Passive mode will reduce the I/O performance.
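The five policies differ only in the key used to group the paths. The sketch below models that idea; the Path tuple, its field names and the function group_paths are hypothetical and exist only for this illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical description of one path; the field names are assumptions.
Path = namedtuple("Path", "dev serial node_name prio")


def group_paths(paths, policy):
    """Return a list of path groups (lists of device names) for a policy."""
    if policy == "multibus":
        return [[p.dev for p in paths]]      # one group with all paths
    if policy == "failover":
        return [[p.dev] for p in paths]      # one path per group
    keys = {
        "group_by_serial": lambda p: p.serial,
        "group_by_prio": lambda p: p.prio,
        "group_by_node_name": lambda p: p.node_name,
    }
    if policy not in keys:
        raise ValueError("unknown path grouping policy: %s" % policy)
    groups = defaultdict(list)
    for p in paths:
        groups[keys[policy](p)].append(p.dev)
    return list(groups.values())
```

With four paths spread over two controllers, multibus yields one group of four paths, failover yields four groups of one path, and group_by_serial yields two groups of two paths.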
Path Selector
A kernel multipath component that determines which path will be chosen for the next I/O. A path selector can implement an appropriate load balancing algorithm. Currently only one path selector exists, which is round-robin.
Path Checker
Functionality in user space that is used to check the availability of a path. This is implemented as a library function that is used by both the multipath command and the multipathd daemon. Currently, there are three path checkers:
  1. readsector0: sends a read command to sector 0 at regular intervals. Produces a lot of error messages in Active/Passive mode. Hence, suitable only for Active/Active mode.
  2. tur: sends a test unit ready command at regular intervals.
  3. rdac: specific to the lsi-rdac device. Sends an inquiry command and sets the status of the path appropriately.
Path States
This refers to the physical state of a path. A path can be in one of the following states:
  1. ready: Path is up and can handle I/O requests.
  2. faulty: Path is down and cannot handle I/O requests.
  3. ghost: Path is a passive path. This state is shown in the passive path in Active/Passive mode.
  4. shaky: Path is up, but temporarily not available for I/O requests.
DM Path States
This refers to the DM kernel module's view of the path's state. It can be in one of two states:
  1. active: Last I/O sent to this path successfully completed. Analogous to ready path state.
  2. failed: Last I/O to this path failed. Analogous to faulty path state.
Path Group State
Path Groups can be in one of the following three states:
  1. active: I/O sent to the multipath device will be sent to this path group. Only one path group will be in this state.
  2. enabled: If none of the paths in the active path group is in the ready state, I/O will be sent to these path groups. There can be one or more path groups in this state.
  3. disabled: If none of the paths in the active path group and the enabled path groups is in the ready state, I/O will be sent to these path groups. There can be one or more path groups in this state. This state is available only for certain storage devices.
UID Callout (or) WWID Callout
A standalone program that returns a globally unique identifier for a path. multipath/multipathd invokes this callout and uses the ID returned to coalesce multiple paths to a single multipath device.
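The coalescing step can be pictured as grouping SCSI devices by the identifier the callout returns. The following is a minimal sketch of that idea only; the function name is made up, and the device names and WWID strings used with it are placeholders, not output of a real callout.

```python
from collections import defaultdict


def coalesce_by_wwid(path_wwids):
    """Group SCSI device names by the WWID reported for each path.

    `path_wwids` maps a device name to the identifier returned for it.
    Each resulting entry corresponds to one multipath device.
    """
    maps = defaultdict(list)
    for dev, wwid in path_wwids.items():
        maps[wwid].append(dev)
    return {wwid: sorted(devs) for wwid, devs in maps.items()}
```

Four SCSI devices that report the same identifier therefore end up as four paths of a single multipath device.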
Priority Callout
A standalone program that returns the priority for a path. multipath/multipathd invokes this callout and uses the priority value of the paths to determine the active path group.
Hardware Handler
Kernel personality module for storage devices that need special handling. This module is responsible for enabling a path (at the device level) during initialization, failover and failback. It is also responsible for handling device-specific sense error codes.
Failover
When all the paths in the active path group are in the faulty state, one of the enabled path groups (the one with the highest priority) that has any paths in the ready state will be made active. If there are no paths in the ready state in any of the enabled path groups, then one of the disabled path groups (the one with the highest priority) will be made active. Making a new path group active is also referred to as switching of path group. The original active path group's state will be changed to enabled.
Failback
A failed path can become active again at any point in time. multipathd keeps checking the path. Once it finds that a path is active, it will change the state of the path to ready. If this action makes the priority of one of the enabled path groups higher than that of the current active path group, multipathd may choose to fail back to the highest priority path group.
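The path group switching rule can be written down as a short selection function. This is a simplified model of the description above, not the logic of multipathd itself; the dictionary keys (name, state, priority, ready_paths) and the function name are assumptions chosen for this sketch.

```python
def switch_path_group(groups):
    """Pick the path group to make active and update the group states.

    `groups` is a list of dicts with the keys name, state ("active",
    "enabled" or "disabled"), priority and ready_paths. Exactly one
    group is expected to be in the "active" state.
    """
    active = next(g for g in groups if g["state"] == "active")
    if active["ready_paths"] > 0:
        return active["name"]          # nothing to do, active group is usable

    # Prefer enabled groups that still have a path in the ready state.
    candidates = [g for g in groups
                  if g["state"] == "enabled" and g["ready_paths"] > 0]
    if not candidates:
        # Otherwise fall back to the disabled groups.
        candidates = [g for g in groups if g["state"] == "disabled"]
    if not candidates:
        return active["name"]          # no alternative group exists

    chosen = max(candidates, key=lambda g: g["priority"])
    active["state"] = "enabled"        # old active group becomes enabled
    chosen["state"] = "active"
    return chosen["name"]
```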
Failback Policy
Under failback situations multipathd can do one of the following three things:
  1. immediate: Immediately failback to the highest priority path group.
  2. # of seconds: Wait for the specified number of seconds, for I/O to stabilize, then failback to the highest priority path group.
  3. do nothing: Do nothing; the user explicitly fails back to the highest priority path group.
This policy selection can be set by the user through /etc/multipath.conf.
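The three failback behaviours can be summarized as a small decision function. This is a sketch under stated assumptions: the function name and the seconds_since_restore argument are invented for the illustration, and the string "manual" is used here to stand for the "do nothing" choice (check the multipath.conf man page of your version for the exact keywords it accepts).

```python
def should_failback(policy, seconds_since_restore):
    """Decide whether to switch back to the highest priority path group.

    `policy` is "immediate", "manual" (do nothing), or a number of
    seconds to wait; `seconds_since_restore` is the time elapsed since
    the higher priority group became usable again.
    """
    if policy == "immediate":
        return True
    if policy == "manual":
        return False                   # the user fails back explicitly
    return seconds_since_restore >= int(policy)
```

With a policy of 10, the function returns False five seconds after the path group recovers and True once ten seconds have passed.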
Active/Active
Storage devices with two controllers can be configured in this mode. Active/Active means that both controllers can process I/Os.
Active/Passive
Storage devices with two controllers can be configured in this mode. Active/Passive means that one of the controllers (active) can process I/Os, and the other one (passive) is in standby mode. I/Os to the passive controller will fail.
Alias
A user-friendly and/or user-defined name for a DM device. By default, the WWID is used for the DM device. This is the name that is listed in the /dev/disk/by-name directory. When the user_friendly_names configuration option is set, the alias of a DM device will have the form mpath<n>. The user also has the option of setting a unique alias for each multipath device.

2.3. Configuration File (/etc/multipath.conf)

DM-Multipath allows many of its features to be configured by the user through the configuration file /etc/multipath.conf. The multipath command and multipathd use the configuration information from this file. This file is consulted only during the configuration of multipath devices. In other words, if the user makes any changes to this file, the multipath command needs to be rerun to reconfigure the multipath devices (i.e., the user has to run multipath -F followed by multipath).
Support for many devices (as listed below) is built into the user space component of DM-Multipath. The user needs to modify this file only if support for a specific storage device is not built in or the user wants to override some of the values.
This file has 5 sections:
  1. System level defaults ("defaults"): Where the user can specify system level default override.
  2. Black listed devices ("blacklist"): User can specify the list of devices they do not want to be under the control of DM-Multipath. These devices will be excluded.
  3. Black list exceptions ("blacklist_exceptions"): Specific devices to be treated as multipath candidates even if they exist in the blacklist.
  4. Storage controller specific settings ("devices"): User specified configuration settings will be applied to devices with specified "Vendor" and "Product" information.
  5. Device specific settings ("multipaths"): User can fine tune configuration settings for individual LUNs.
The user can specify the values for the attributes in this file using regular expression syntax.
For a detailed explanation of the different attributes and their allowed values, please refer to the multipath.conf.annotated file.
  • In Mainline, this file is located in the root directory of multipath-tools.
  • In Red Hat, this file is located in the directory /usr/share/doc/device-mapper-multipath-X.Y.Z/.
  • In SuSE, this file is located in the directory /usr/share/doc/packages/multipath-tools/
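To show how the blacklist and blacklist_exceptions sections interact when their values are regular expressions, here is a minimal sketch. It uses Python's re module as a stand-in for the regular expression matching done by the multipath tools, which may differ in detail; the function name and the sample patterns are illustrative assumptions, not defaults shipped with any distribution.

```python
import re


def is_multipath_candidate(devnode, blacklist, exceptions):
    """Return True if a device node should be handled by DM-Multipath.

    A device matching an exception pattern is always a candidate;
    otherwise it is a candidate only if no blacklist pattern matches.
    """
    if any(re.search(pattern, devnode) for pattern in exceptions):
        return True
    return not any(re.search(pattern, devnode) for pattern in blacklist)
```

For example, with the blacklist patterns ^(ram|loop|fd|md|sr)[0-9]* and ^sda$ and the exception pattern ^sda$, loop0 is excluded, sdb is a candidate, and sda is a candidate because the exception takes precedence.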

2.3.1. Attribute value overrides

Attribute values are set at multiple levels (internally in the multipath tools and through the multipath.conf file). The following is the order in which attribute values are overwritten; each level overrides the ones before it.
  1. Global internal defaults, as specified in the man page of multipath.conf.
  2. Device specific internal defaults, as defined in libmultipath/hwtable.c.
  3. Items described in defaults section of /etc/multipath.conf.
  4. Items defined in device section of /etc/multipath.conf.
    • Note that this will completely overwrite the configuration information defined in (2) above. So, even if you want to change or add only one attribute, you have to provide the whole list of attributes for the device.
  5. Items defined in multipaths section of /etc/multipath.conf.
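The override order above, including the note that a device section replaces the built-in device defaults wholesale, can be modelled by merging dictionaries. This is only a sketch of the ordering as described in this section; the function name, the argument names and the attribute values used with it are illustrative assumptions and not the real defaults.

```python
def effective_attributes(global_defaults, builtin_device, conf_defaults,
                         conf_device, conf_multipath):
    """Merge attribute dictionaries in the order listed above.

    Pass conf_device=None when multipath.conf has no device section
    entry for the storage device.
    """
    attrs = dict(global_defaults)           # level 1
    if conf_device is None:
        attrs.update(builtin_device)        # level 2, used only if not replaced
    attrs.update(conf_defaults)             # level 3
    if conf_device is not None:
        attrs.update(conf_device)           # level 4, replaces level 2 entirely
    attrs.update(conf_multipath)            # level 5
    return attrs
```

Note how supplying a device section with a single attribute drops every built-in device default in this model, which is why the whole list has to be provided.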

2.4. multipath, multipathd command usage

The man pages of multipath and multipathd provide good details on the usage of the tools.
multipathd has an interactive mode that can be used to query and manage the paths and also to check the configuration details that will be used.
While the multipathd daemon is running, invoke multipathd -k to enter this mode. multipathd then presents a command line where the user can invoke different commands. Check the man page for the available commands.

3. Tips and Tricks

  1. Using alias: By default, multipathed devices are named with the uid of the device, which one accesses through /dev/mapper/${uid_name}. When one uses user_friendly_names, devices are named mpath0, mpath1, etc., which may meet one's needs. The user also has the option to define an alias in multipath.conf for each device. The alias is set with the alias attribute of a multipath entry, identified by its wwid, in the multipaths section.
  2. Persistent device names: The names (uid_names, mpath names or alias names) that appear in /dev/mapper are persistent across boots, whereas the names dm-0, dm-1, etc., can change between reboots. So, it is advisable to use the device names that appear under /dev/mapper and avoid using the dm-? names.
  3. Restart of tools after changing the multipath.conf file: Once the multipath.conf file is changed, the multipath tools need to be rerun for the new configuration values to take effect. One has to kill multipathd, run multipath -F, and then restart multipathd and multipath.
  4. Devices with partitions: Create device partitions before running multipath, as kpartx is configured to run at that point to create the multipathed partitions. Partitions on device mpath0 appear as /dev/mapper/mpath0p1, /dev/mapper/mpath0p2, etc.
  5. Using the bindings file in a clustered environment: The bindings file holds the bindings between the device mapper names and the uid of the underlying device. By default the file is /var/lib/multipath/bindings; this can be changed with the multipath command line option -b. In a clustered environment, this file can be created on one node and transferred to the others to get the same names.
    Note that the same effect can also be achieved by using aliases and having the same multipath.conf file on all the nodes of the cluster.
  6. Getting the multipath device name corresponding to a SCSI device: If one knows the name of a SCSI device and wants to get the device mapper name associated with it, one can use multipath -l /dev/sda, where sda is the SCSI device. On the other hand, if one knows the device mapper name and wants to know the underlying device names, one can use the same command with the device mapper name, i.e., multipath -l mpath0, where mpath0 is the device mapper name.
  7. Using LVM on dm-multipath devices: When using LVM on dm-multipath devices, it is better to turn LVM scanning off on the underlying SCSI devices. This can be done by changing the filter parameter in /etc/lvm/lvm.conf to be filter = [ "a/dev/mapper/.*/", "r/dev/sd.*/" ].
    If your root device is also a multipathed LVM device, then make the above change before you create a new initrd image.
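For the alias tip (item 1 above), the following sketch renders a multipaths section that assigns aliases to WWIDs, so that the same stanza can be generated for every node of a cluster. The helper name is hypothetical, the WWID passed in is a placeholder rather than a real identifier, and the generated text should be checked against the multipath.conf man page of your version before use.

```python
def render_multipaths_section(aliases):
    """Render a multipaths section from a {wwid: alias} mapping."""
    lines = ["multipaths {"]
    for wwid, alias in aliases.items():
        lines += [
            "    multipath {",
            "        wwid  %s" % wwid,
            "        alias %s" % alias,
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder WWID; substitute the identifier of your own LUN.
    print(render_multipaths_section({"WWID_PLACEHOLDER": "data1"}))
```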