POHMELFS: Parallel Optimized Host Message Exchange Layered File System.

		Evgeniy Polyakov <zbr@ioremap.net>

Homepage: http://www.ioremap.net/projects/pohmelfs

POHMELFS began as a network filesystem with coherent local data and
metadata caches, but is now evolving into a parallel distributed filesystem.

Main features of this FS include:
    * Locally coherent cache for data and metadata with (potentially) byte-range locks.
      Since all Linux filesystems lock the whole inode during writing, the algorithm
      is very simple and does not use byte-ranges, although they are sent in
      locking messages.
    * Completely async processing of all events except creation of hard and symbolic
      links, and rename events.
      Object creation and data reading and writing are processed asynchronously.
    * Flexible object architecture optimized for network processing.
      Ability to create long paths to objects and remove arbitrarily huge
      directories with a single network command
      (for example, removing the whole kernel tree).
    * Very high performance.
    * Fast and scalable multithreaded userspace server. Being in userspace, it works
      with any underlying filesystem and is still much faster than the async in-kernel NFS one.
    * Client is able to switch between different servers (if one goes down, the client
      automatically reconnects to the next one, and so on).
    * Transaction support. Full failover for all operations.
      Transactions are resent to different servers on timeout or error.
    * Read request (data read, directory listing, lookup requests) balancing between multiple servers.
    * Write requests are replicated to multiple servers and completed only when all of them have acked.
    * Ability to add and/or remove servers from the working set at run-time.
    * Strong authentication and possible data encryption in the network channel.
    * Extended attributes support.

POHMELFS is based on transactions, which are potentially long-standing objects that live
in the client's memory. Each transaction contains all the information needed to process a given
command (or set of commands, which is frequently used during data writing: a single transaction
can contain both creation and data writing commands). Transactions are committed by all the servers
to which they are sent and, in case of failures, are eventually resent or dropped with an error.
For example, reading will return an error if no servers are available.
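
The resend-or-drop behaviour described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the kernel implementation:
the Transaction class, the ServerDown exception, send_with_failover() and the two
test servers are hypothetical names, and the sketch covers a read-style request that
is satisfied by the first server that answers.

```python
from dataclasses import dataclass


class ServerDown(Exception):
    """Raised by a (hypothetical) send function on timeout or error."""


@dataclass
class Transaction:
    """Hypothetical in-memory transaction holding a batch of commands."""
    trans_id: int
    commands: list      # e.g. [("create", "/a"), ("write", "/a", b"data")]
    attempts: int = 0   # how many times the transaction was sent


def send_with_failover(trans, servers, max_rounds=2):
    """Send to the first server that accepts; resend elsewhere on failure."""
    for _ in range(max_rounds):
        for send in servers:
            trans.attempts += 1
            try:
                return send(trans)
            except ServerDown:
                continue  # try the next server in the working set
    # Nobody accepted the transaction: drop it and report an error.
    raise OSError(f"transaction {trans.trans_id} dropped: no servers available")


def dead_server(trans):
    raise ServerDown("timeout")


def live_server(trans):
    return f"committed {len(trans.commands)} commands"


# One transaction carries both the creation and the data writing command.
t = Transaction(1, [("create", "/a"), ("write", "/a", b"data")])
result = send_with_failover(t, [dead_server, live_server])
```

Because the transaction stays in client memory until it is acknowledged, resending it
to another server needs no extra state beyond the object itself.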

POHMELFS uses an asynchronous approach to data processing. Thanks to transactions, it is
possible to detach replies from requests and, if the command requires data to be received, the
caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different
servers, and async threads will pick up replies in parallel, find the appropriate transactions in the
system and put the data where it belongs (like the page or inode cache).
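
A minimal sketch of detaching replies from requests, assuming each request is tagged
with a transaction id and a receiver thread looks the id up when a reply arrives.
PendingTable, wait_for() and the id values are hypothetical and only illustrate the
idea; the real client works on kernel structures, not Python objects.

```python
import threading


class PendingTable:
    """Maps transaction ids to waiters so replies may arrive in any order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}

    def register(self, trans_id):
        """Record an outstanding request and return the slot to wait on."""
        slot = {"event": threading.Event(), "data": None}
        with self._lock:
            self._pending[trans_id] = slot
        return slot

    def complete(self, trans_id, data):
        """Called by a receiver thread when a reply arrives."""
        with self._lock:
            slot = self._pending.pop(trans_id, None)
        if slot is None:
            return False            # unknown or already completed transaction
        slot["data"] = data
        slot["event"].set()         # wake up the sleeping caller
        return True


def wait_for(slot, timeout=1.0):
    """Sleep until the reply for this slot has been delivered."""
    if not slot["event"].wait(timeout):
        raise TimeoutError("no reply")
    return slot["data"]


table = PendingTable()
slot_a = table.register(1)
slot_b = table.register(2)


def receiver():
    # Replies come back in a different order than the requests were issued.
    table.complete(2, b"second")
    table.complete(1, b"first")


worker = threading.Thread(target=receiver)
worker.start()
data_a = wait_for(slot_a)
data_b = wait_for(slot_b)
worker.join()
```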

The main feature of POHMELFS is its writeback data and metadata cache.
Only a few non-performance-critical operations use the write-through cache and
are synchronous: hard and symbolic link creation, and object rename. Creation
and removal of objects and data writing are asynchronous and are sent to
the server during system writeback. Only one writer at a time is allowed for any
given inode, which is guarded by an appropriate locking protocol.
Because of this feature, POHMELFS is extremely fast at metadata-intensive
workloads and can fully utilize the bandwidth to the servers when doing bulk
data transfers.
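
The split between writeback and write-through operations can be illustrated with a
small sketch. WritebackClient and its methods are hypothetical, and flushing pending
commands before a rename is an assumption made here to keep ordering simple, not
something the text above specifies.

```python
class WritebackClient:
    """Hypothetical client: buffers most operations, sends a few synchronously."""

    def __init__(self, send):
        self._send = send      # callable that delivers a list of commands
        self._dirty = []       # commands waiting for writeback

    def create(self, path):
        self._dirty.append(("create", path))

    def write(self, path, data):
        self._dirty.append(("write", path, data))

    def remove(self, path):
        self._dirty.append(("remove", path))

    def rename(self, old, new):
        # Synchronous (write-through) operation.
        # Assumption: pending commands are flushed first to preserve ordering.
        self.writeback()
        self._send([("rename", old, new)])

    def writeback(self):
        """Send everything queued so far as a single batch."""
        if self._dirty:
            batch, self._dirty = self._dirty, []
            self._send(batch)


sent = []                               # stands in for the network
client = WritebackClient(sent.append)
client.create("/a")
client.write("/a", b"hello")
queued_before_flush = len(sent)         # nothing has reached the server yet
client.writeback()
client.rename("/a", "/b")
```

Creation and writing cost no network round trips until writeback, which is where the
speed on metadata-intensive workloads comes from.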

POHMELFS clients operate with a working set of servers and are capable of balancing read-only
operations (like lookups or directory listings) between them according to IO priorities.
Administrators can add or remove servers from the set at run-time via special commands (described
in the Documentation/filesystems/pohmelfs/info.txt file). Writes are replicated to all servers that
are connected with write permission turned on. IO priority and permissions can be changed
at run-time.
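
As a rough sketch of how a working set could drive request routing: the Server class,
its field names and the rule that a larger priority value is preferred for reads are
all assumptions for illustration; the actual balancing policy is not described here.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    prio: int               # assumption: larger value is preferred for reads
    can_read: bool = True
    can_write: bool = True


def pick_reader(servers):
    """Choose the readable server with the highest IO priority."""
    readers = [s for s in servers if s.can_read]
    if not readers:
        raise OSError("no readable servers in the working set")
    return max(readers, key=lambda s: s.prio)


def write_targets(servers):
    """Writes go to every server that has write permission turned on."""
    return [s for s in servers if s.can_write]


working_set = [Server("a", prio=1), Server("b", prio=5, can_write=False)]
first_reader = pick_reader(working_set).name
writers = [s.name for s in write_targets(working_set)]

# Run-time change: a new server joins the set with a higher priority.
working_set.append(Server("c", prio=9))
second_reader = pick_reader(working_set).name
```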

POHMELFS is capable of full data channel encryption and/or strong crypto hashing.
One can select any kernel-supported cipher, encryption mode, hash type and operation mode
(hmac or digest). It is also possible to use both encryption and hashing, or neither (the default).
Crypto configuration is checked at mount time and, if the server does not support it, the
corresponding capabilities will be disabled or the mount will fail (if the 'crypto_fail_unsupported'
mount option is specified).
Crypto performance heavily depends on the number of crypto threads, which asynchronously perform
crypto operations and send the resulting data to the server or submit it up the stack. This number
can be controlled via a mount option.
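
The difference between the two hash operation modes can be shown with the Python
standard library. This is only a sketch: SHA-256, the key and the sign()/verify()
helpers are assumptions chosen for the example, whereas POHMELFS itself uses whatever
hash the kernel crypto layer was configured with.

```python
import hashlib
import hmac


def sign(payload, mode, key=b""):
    """Return an integrity tag for payload in the given operation mode."""
    if mode == "digest":
        # Plain digest: detects corruption, but anyone can recompute it.
        return hashlib.sha256(payload).digest()
    if mode == "hmac":
        # Keyed hash: only holders of the key can produce a valid tag.
        return hmac.new(key, payload, hashlib.sha256).digest()
    raise ValueError(f"unknown mode: {mode}")


def verify(payload, tag, mode, key=b""):
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(payload, mode, key), tag)


message = b"write /a offset=0 size=5"
tag = sign(message, "hmac", key=b"shared secret")
accepted = verify(message, tag, "hmac", key=b"shared secret")
rejected = verify(message, tag, "hmac", key=b"wrong key")
```

A plain digest only protects against accidental corruption, while hmac also
authenticates the sender, which is why the mode is a separate configuration choice.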