Lustre
A high-performance file system that is guaranteed to make you buy Excedrin(TM) by the box load :) -- high performance, if you can get it to work.
Overview
- MDS: Metadata Server: manages file metadata, such as owner, attributes, locks, file pathname, and file layout (how the file is striped, which OSTs hold it). Many configs have a single MDS, which can be a SPOF; active/standby MDS can be implemented. Scalability is somewhat hindered by the single MDS, but it only handles metadata, not the actual file "block" data IO. Employing multiple MDTs helps with performance.
- MDT: Metadata Target. New in Lustre 2.4: an MDS can serve multiple MDTs to spread the metadata load.
- OSS: Object Storage Server. Handles file IO. Client bulk data access goes directly to the OSS without interaction with the MDS. Clusters tend to have multiple OSS nodes; each OSS manages 1-8 OSTs.
- OST: Object Storage Target. Each OST maps to a single local FS on the OSS. (A quick client-side look at these pieces is shown after this list.)
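A quick way to see these pieces from a client once the FS is mounted (output descriptions from memory, so treat as illustrative):

lfs df -h            # one line per MDT and OST, with usage of each
lfs getstripe -d .   # default stripe layout for a directory
lctl dl              # device list: the MDC/OSC devices this client talks to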
Client side
Settings
/etc/fstab ::
10.46.8.150@tcp0:10.46.8.151@tcp0:/lstr2  /mnt/lustre  lustre  defaults,_netdev  0 0

/etc/modprobe.conf ::
options ksocklnd peer_timeout=0
options lnet networks=tcp0(eth0)   # (2nd line is being tested for stability)
# use eth1 if cluster node is multi-homed,
# pick the NIC on the same network as the storage where the mount takes place
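The fstab line above is the same as mounting by hand; the two colon-separated NIDs in front of /lstr2 are failover MGS nodes, tried in order:

mount -t lustre 10.46.8.150@tcp0:10.46.8.151@tcp0:/lstr2 /mnt/lustre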
Recommended settings
TCP settings, set before mounting lustre
echo 16777216 > /proc/sys/net/core/wmem_default
echo 16777216 > /proc/sys/net/core/wmem_max
echo 16777216 > /proc/sys/net/core/rmem_default
echo 16777216 > /proc/sys/net/core/rmem_max
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_wmem
echo 30000 > /proc/sys/net/core/netdev_max_backlog
ifconfig eth1 txqueuelen ${TXQLEN:-40}
exit 0
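To make these survive a reboot, the same values can live in /etc/sysctl.conf (equivalent sysctl keys shown below; the txqueuelen setting still needs a boot-time script):

net.core.wmem_default = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.core.netdev_max_backlog = 30000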
Lustre settings, set after mounting
for i in /proc/fs/lustre/osc/*/checksums;    do echo 0   > $i; done   # disable data checksums (saves CPU)
for i in /proc/fs/lustre/osc/*/max_dirty_mb; do echo 512 > $i; done   # more dirty cache per OSC
for i in /proc/fs/lustre/osc/*/max_rpc*;     do echo 32  > $i; done   # more RPCs in flight
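On Lustre versions that provide lctl set_param, the same tunables can be set without the /proc loops (values do not persist across remounts):

lctl set_param osc.*.checksums=0
lctl set_param osc.*.max_dirty_mb=512
lctl set_param osc.*.max_rpcs_in_flight=32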
Commands
lfs df -h                        # df
lfs osts                         # list object stores
lfs mdts                         # list metadata stores
lctl
lctl ping 10.46.8.50
lctl lustre_build_version
lctl device_list
lctl list_nids                   # see if using tcp or ib
lctl --net tcp peer_list
lctl --net tcp interface_list
Linux Performance Cmd
sudo ethtool eth1                # see nic speed, duplex
# ssh to any compute node (lustre client)
time -p tcpclient 10.46.8.151 50000 100000 -10000
# tcpclient binary obtained from lustre server.
Performance/Tuning
Lustre is a cluster-based file system; striping across multiple servers and disk stores is where most of the performance comes from. The typical setup keeps each file within a single stripe (one OST), so as different files are created they end up on different stripes (servers). But for a really large file accessed by many clients, the best performance comes from actually striping that one file across OSTs. To do this, use the lfs setstripe command.
To simplify life, striping settings are applied on a per-directory basis; Lustre then applies them to files created therein.
Check/Set striping
lfs getstripe -d .       # -d = --directory only
lfs getstripe . | less   # long listing, includes stripe settings for all files in the dir
lfs getstripe my-file    # stripe settings for "my-file" only.
                         # multiple obdidx entries mean the file is striped across those OST index numbers
lfs setstripe -c -1 .    # -c = --stripe-count; -1 = use all OSTs
# changing stripe-count only affects blocks appended to a file in the future;
# it does not restripe the existing blocks.
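For reference, lfs getstripe on a striped file prints something roughly like this (all numbers made up for illustration):

my-file
lmm_stripe_count:   4
lmm_stripe_size:    4194304
lmm_stripe_offset:  2
        obdidx           objid          objid            group
             2             1201          0x4b1                0
             5             1188          0x4a4                0
             0             1210          0x4ba                0
             3             1195          0x4ab                0

Four different obdidx values = the file is striped across four OSTs.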
Method 0: Default
# sequentially create files that are 1G each.
# each file will land on a different stripe
# (but each file stays in one stripe)
mkdir test-default
cd test-default
for i in 0 1 2 3 4 5 6 7 ; do
    time -p dd if=/dev/zero of=$i.tmp bs=1024k count=1024
done
lfs getstripe *
# will see that different files have different obdidx,
# which means they are on different OST servers.
# note that the first number is tremendously skewed by caching;
# smaller file sizes get even more caching and are less indicative of sustained performance.
# on a 1 Gbps NIC, wire speed is about 125 MB/s
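Quick sanity math for reading the dd timings (back-of-envelope, loosely mixing MB and MiB):

# a 1 GiB file over a 1 Gbps NIC can't sustain better than ~125 MB/s,
# so a "real" time much under ~8 s means cache, not sustained IO:
echo "1024 / 125" | bc -l    # => 8.192 seconds minimum at wire speed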
Method 1: stripe w/in single file
# sequentially create files that are 1G each.
# this time each file is itself striped, so blocks within
# a single file land on different OSTs.
# the setstripe command does this:
#   -s NUMBER : stripe size; 4194304 = 4MB
#   -i -1     : index of the OST where files start; -1 means any
#   -c -1     : stripe count; -1 = use all OSTs. fine for up to 8 OSTs, but as
#               OSTs are added, especially unevenly, this may need tweaking.
mkdir test-file-stripe
lfs setstripe test-file-stripe -s 4194304 -i -1 -c -1
cd test-file-stripe
for i in 0 1 2 3 4 5 6 7 ; do
    time -p dd if=/dev/zero of=$i.tmp bs=1024k count=1024
done
lfs getstripe *
# will see that each file has multiple obdidx numbers,
# indicating striping within a single file
Method 2: Geek testing
Terascala created a script that creates a series of directories at the top level of the Lustre file system, where different directories mean writing data with different striping patterns.

# login to lustre server (eg mds1)
cd /root/bin/misc/clients
./setup_directories

The above will result in these dirs:

/mnt/lustre/ost/ost-00 ... ost-0n :: a series of OST dirs, where files written to any of these dirs stay within that single OST. This is good for benchmarking to ensure all OSTs are performing equally. Test as root:

for i in /mnt/lustre/ost/ost* ; do
    time -p dd if=/dev/zero of=$i/dummy-file.tmp bs=1024k count=1024
done
lfs getstripe /mnt/lustre/ost/ost-00
# will show obdidx all pointing to the same number
# (within each ost-0n dir)

/mnt/lustre/strip-xM :: where x indicates the stripe size. 1M may be too small for optimal performance; 4M is a good starting point. Different-size dirs are created so that performance against each of them can be tested quickly.

lfs getstripe *
compiling driver
The Lustre module is kernel-version specific. It would be nice if they used DKMS so that the Lustre module is automatically recompiled each time a new kernel is installed on the system. A patched kernel is required on the server side (ie MDS, OSS), but not needed on the client side (ie compute nodes mounting the Lustre FS).
To compile the Lustre module, the 'quilt' tool is helpful (it manages the kernel patch series).
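A rough sketch of a server-side build (paths and versions below are placeholders; exact steps vary by Lustre and kernel version):

cd /usr/src/kernels/2.6.18-xx        # kernel source tree (version hypothetical)
quilt push -a                        # apply the Lustre kernel patch series prepared for this kernel
cd ~/src/lustre-1.8.x                # unpacked Lustre source (version hypothetical)
./configure --with-linux=/usr/src/kernels/2.6.18-xx
make rpms                            # build packages containing the lustre modules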
Server side
Config
- MDS: the metadata server. Typically one is enough, as it only handles file path/metadata lookups. A second, standby MDS may be useful for HA. The Lustre MDS is not cluster-aware, thus the other node is a standby rather than active-active (?).
- OST: these are the file/data block servers (served by OSS nodes). Deploy as many as needed to control the disk shelves and provide performance. Terascala makes these in HA active/standby pairs. (A formatting sketch follows this list.)
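For reference, server-side targets are formatted with mkfs.lustre. A minimal sketch, reusing the fsname and MGS NID from the client example above (device paths hypothetical):

# on the MDS: combined MGS + MDT
mkfs.lustre --fsname=lstr2 --mgs --mdt /dev/sda
mount -t lustre /dev/sda /mnt/mdt

# on each OSS: one OST per backing device, registered with the MGS
mkfs.lustre --fsname=lstr2 --ost --index=0 --mgsnode=10.46.8.150@tcp0 /dev/sdb
mount -t lustre /dev/sdb /mnt/ost0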
start up
/etc/init.d/tslagent start
cmd
ts        # similar to tentakel or gsh, run command on all server nodes
for i in 1 2 3 4 5 6; do ts $i echo ======= \; hostname\; mount ; done
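If ts is not installed, a plain ssh loop gets the same information (node names hypothetical):

for h in mds1 mds2 oss1 oss2 oss3 oss4; do
    echo ======= $h ; ssh $h 'hostname; mount -t lustre'
done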
Ref/Links
Copyright info about this work
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
Pocket Sys Admin Survival Guide: for content that I wrote, (CC) some rights reserved.
2005,2012 Tin Ho [ tin6150 (at) gmail.com ]
Some contents are "cached" here for easy reference. Sources include man pages, vendor documents, online references, discussion groups, etc. Copyrights of those obviously belong to the vendors and original authors. I am merely caching them here for quick reference and to avoid broken-URL problems.
Where is PSG hosted these days?
http://tin6150.github.io/psg/psg2.html (this new home page at github)
http://tiny.cc/tin6150/ New home in 2011.06.
http://tiny.cc/tin6150a/ Alt home in 2011.06.
http://tin6150.s3-website-us-west-1.amazonaws.com/psg.html (won't be coming after all)
ftp://sn.is-a-geek.com/psg/psg.html My home "server". Up sporadically.
http://tin6150.github.io/psg/psg.html
http://www.fiu.edu/~tho01/psg/psg.html (no longer updated as of 2007-05)