Lustre
A high-performance file system that is guaranteed to make you buy Excedrin(TM) by the box load :) -- high performance, if you can get it to work.
Overview
- MDS: Metadata Server: manages file metadata, such as owner, attributes, locks, file pathname, and file layout (how the file is striped, which OSTs hold it). Many configs have a single MDS, which can be a SPOF; active/standby MDS can be implemented. Scalability is somewhat hindered by the single MDS, but it only handles metadata, not the actual file "block" data IO. Employing multiple MDTs helps with performance.
- MDT: Metadata Target. New in Lustre 2.4: an MDS can serve multiple MDTs to spread the metadata load.
- OSS: Object Storage Server. Handles file IO. Client bulk data access goes directly to the OSS without interaction with the MDS. Clusters tend to have multiple OSS nodes; each OSS manages 1-8 OSTs.
- OST: Object Storage Target. Each OST maps to a single local FS on the OSS. (A quick client-side look at these pieces is shown after this list.)
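A quick way to see these pieces from a client once the FS is mounted (output descriptions from memory, so treat as illustrative):

lfs df -h            # one line per MDT and OST, with usage of each
lfs getstripe -d .   # default stripe layout for a directory
lctl dl              # device list: the MDC/OSC devices this client talks to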
Client side
Settings
/etc/fstab ::
10.46.8.150@tcp0:10.46.8.151@tcp0:/lstr2  /mnt/lustre  lustre  defaults,_netdev  0 0

/etc/modprobe.conf ::
options ksocklnd peer_timeout=0
options lnet networks=tcp0(eth0)   # (2nd line is being tested for stability)
# use eth1 if cluster node is multi-homed,
# pick the NIC on the same network as the storage where the mount takes place
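The fstab line above is the same as mounting by hand; the two colon-separated NIDs in front of /lstr2 are failover MGS nodes, tried in order:

mount -t lustre 10.46.8.150@tcp0:10.46.8.151@tcp0:/lstr2 /mnt/lustre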
Recommended settings
TCP settings, set before mounting lustre
echo 16777216 > /proc/sys/net/core/wmem_default
echo 16777216 > /proc/sys/net/core/wmem_max
echo 16777216 > /proc/sys/net/core/rmem_default
echo 16777216 > /proc/sys/net/core/rmem_max
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_wmem
echo 30000 > /proc/sys/net/core/netdev_max_backlog
ifconfig eth1 txqueuelen ${TXQLEN:-40}
exit 0
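To make these survive a reboot, the same values can live in /etc/sysctl.conf (equivalent sysctl keys shown below; the txqueuelen setting still needs a boot-time script):

net.core.wmem_default = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.core.netdev_max_backlog = 30000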
Lustre settings, set after mounting
for i in /proc/fs/lustre/osc/*/checksums;    do echo 0   > $i; done   # disable data checksums (saves CPU)
for i in /proc/fs/lustre/osc/*/max_dirty_mb; do echo 512 > $i; done   # more dirty cache per OSC
for i in /proc/fs/lustre/osc/*/max_rpc*;     do echo 32  > $i; done   # more RPCs in flight
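On Lustre versions that provide lctl set_param, the same tunables can be set without the /proc loops (values do not persist across remounts):

lctl set_param osc.*.checksums=0
lctl set_param osc.*.max_dirty_mb=512
lctl set_param osc.*.max_rpcs_in_flight=32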
Commands
lfs df -h                        # df
lfs osts                         # list object stores
lfs mdts                         # list metadata stores
lctl
lctl ping 10.46.8.50
lctl lustre_build_version
lctl device_list
lctl list_nids                   # see if using tcp or ib
lctl --net tcp peer_list
lctl --net tcp interface_list
Linux Performance Cmd
sudo ethtool eth1                # see nic speed, duplex
# ssh to any compute node (lustre client)
time -p tcpclient 10.46.8.151 50000 100000 -10000
# tcpclient binary obtained from lustre server.
Performance/Tuning
Lustre is a cluster-based file system; striping across multiple servers and disk stores is where most of the performance comes from. The typical setup keeps each file within a single stripe (one OST), so as different files are created they end up on different stripes (servers). But for a really large file accessed by many clients, the best performance comes from actually striping that one file across OSTs. To do this, use the lfs setstripe command.
To simplify life, striping settings are applied on a per-directory basis; Lustre then applies them to files created therein.
Check/Set striping
lfs getstripe -d .       # -d = --directory only
lfs getstripe . | less   # long listing, includes stripe settings for all files in the dir
lfs getstripe my-file    # stripe settings for "my-file" only.
                         # multiple obdidx entries mean the file is striped across those OST index numbers
lfs setstripe -c -1 .    # -c = --stripe-count; -1 = use all OSTs
# changing stripe-count only affects blocks appended to a file in the future;
# it does not restripe the existing blocks.
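For reference, lfs getstripe on a striped file prints something roughly like this (all numbers made up for illustration):

my-file
lmm_stripe_count:   4
lmm_stripe_size:    4194304
lmm_stripe_offset:  2
        obdidx           objid          objid            group
             2             1201          0x4b1                0
             5             1188          0x4a4                0
             0             1210          0x4ba                0
             3             1195          0x4ab                0

Four different obdidx values = the file is striped across four OSTs.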
Method 0: Default
# sequentially create files that are 1G each.
# each file will land on a different stripe
# (but each file stays in one stripe)
mkdir test-default
cd test-default
for i in 0 1 2 3 4 5 6 7 ; do
    time -p dd if=/dev/zero of=$i.tmp bs=1024k count=1024
done
lfs getstripe *
# will see that different files have different obdidx,
# which means they are on different OST servers.
# note that the first number is tremendously skewed by caching;
# smaller file sizes get even more caching and are less indicative of sustained performance.
# on a 1 Gbps NIC, wire speed is about 125 MB/s
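Quick sanity math for reading the dd timings (back-of-envelope, loosely mixing MB and MiB):

# a 1 GiB file over a 1 Gbps NIC can't sustain better than ~125 MB/s,
# so a "real" time much under ~8 s means cache, not sustained IO:
echo "1024 / 125" | bc -l    # => 8.192 seconds minimum at wire speed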
Method 1: stripe w/in single file
# sequentially create files that are 1G each.
# this time each file is itself striped, so blocks within
# a single file land on different OSTs.
# the setstripe command does this:
#   -s NUMBER : stripe size; 4194304 = 4MB
#   -i -1     : index of the OST where files start; -1 means any
#   -c -1     : stripe count; -1 = use all OSTs. fine for up to 8 OSTs, but as
#               OSTs are added, especially unevenly, this may need tweaking.
mkdir test-file-stripe
lfs setstripe test-file-stripe -s 4194304 -i -1 -c -1
cd test-file-stripe
for i in 0 1 2 3 4 5 6 7 ; do
    time -p dd if=/dev/zero of=$i.tmp bs=1024k count=1024
done
lfs getstripe *
# will see that each file has multiple obdidx numbers,
# indicating striping within a single file
Method 2: Geek testing
Terascala created a script that creates a series of directories at the top level of the Lustre file system, where different directories mean writing data with different striping patterns.

# login to lustre server (eg mds1)
cd /root/bin/misc/clients
./setup_directories

The above will result in these dirs:

/mnt/lustre/ost/ost-00 ... ost-0n :: a series of OST dirs, where files written to any of these dirs stay within that single OST. This is good for benchmarking to ensure all OSTs are performing equally. Test as root:

for i in /mnt/lustre/ost/ost* ; do
    time -p dd if=/dev/zero of=$i/dummy-file.tmp bs=1024k count=1024
done
lfs getstripe /mnt/lustre/ost/ost-00
# will show obdidx all pointing to the same number
# (within each ost-0n dir)

/mnt/lustre/strip-xM :: where x indicates the stripe size. 1M may be too small for optimal performance; 4M is a good starting point. Different-size dirs are created so that performance against each of them can be tested quickly.

lfs getstripe *
compiling driver
The Lustre module is kernel-version specific. It would be nice if they used DKMS so that the Lustre module is automatically recompiled each time a new kernel is installed on the system. A patched kernel is required on the server side (ie MDS, OSS), but not needed on the client side (ie compute nodes mounting the Lustre FS).
To compile the Lustre module, the 'quilt' tool is helpful (it manages the kernel patch series).
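A rough sketch of a server-side build (paths and versions below are placeholders; exact steps vary by Lustre and kernel version):

cd /usr/src/kernels/2.6.18-xx        # kernel source tree (version hypothetical)
quilt push -a                        # apply the Lustre kernel patch series prepared for this kernel
cd ~/src/lustre-1.8.x                # unpacked Lustre source (version hypothetical)
./configure --with-linux=/usr/src/kernels/2.6.18-xx
make rpms                            # build packages containing the lustre modules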
Server side
Config
- MDS: the metadata server. Typically one is enough, as it only handles file path/metadata lookups. A second, standby MDS may be useful for HA. The Lustre MDS is not cluster-aware, thus the other node is a standby rather than active-active (?).
- OST: these are the file/data block servers (served by OSS nodes). Deploy as many as needed to control the disk shelves and provide performance. Terascala makes these in HA active/standby pairs. (A formatting sketch follows this list.)
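For reference, server-side targets are formatted with mkfs.lustre. A minimal sketch, reusing the fsname and MGS NID from the client example above (device paths hypothetical):

# on the MDS: combined MGS + MDT
mkfs.lustre --fsname=lstr2 --mgs --mdt /dev/sda
mount -t lustre /dev/sda /mnt/mdt

# on each OSS: one OST per backing device, registered with the MGS
mkfs.lustre --fsname=lstr2 --ost --index=0 --mgsnode=10.46.8.150@tcp0 /dev/sdb
mount -t lustre /dev/sdb /mnt/ost0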
start up
/etc/init.d/tslagent start
cmd
ts        # similar to tentakel or gsh, run command on all server nodes
for i in 1 2 3 4 5 6; do ts $i echo ======= \; hostname\; mount ; done
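If ts is not installed, a plain ssh loop gets the same information (node names hypothetical):

for h in mds1 mds2 oss1 oss2 oss3 oss4; do
    echo ======= $h ; ssh $h 'hostname; mount -t lustre'
done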
Ref/Links
Copyright info about this work
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
Pocket Sys Admin Survival Guide: for content that I wrote, (CC) some rights reserved.
2005,2012 Tin Ho [ tin6150 (at) gmail.com ]
Some contents are "cached" here for easy reference. Sources include man pages, vendor documents, online references, discussion groups, etc. Copyrights of those obviously belong to the vendors and original authors. I am merely caching them here for quick reference and to avoid broken-URL problems.
Where is PSG hosted these days?
http://tin6150.github.io/psg/psg2.html (this new home page at github)
http://tiny.cc/tin6150/ New home in 2011.06.
http://tiny.cc/tin6150a/ Alt home in 2011.06.
http://tin6150.s3-website-us-west-1.amazonaws.com/psg.html (won't be coming after all)
ftp://sn.is-a-geek.com/psg/psg.html My home "server". Up sporadically.
http://tin6150.github.io/psg/psg.html
http://www.fiu.edu/~tho01/psg/psg.html (no longer updated as of 2007-05)