| ORANGEFS |
| ======== |
| |
| OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal |
| for large storage problems faced by HPC, BigData, Streaming Video, |
| Genomics, Bioinformatics. |
| |
| Orangefs, originally called PVFS, was first developed in 1993 by |
| Walt Ligon and Eric Blumer as a parallel file system for Parallel |
| Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns |
| of parallel programs. |
| |
| Orangefs features include: |
| |
| * Distributes file data among multiple file servers |
| * Supports simultaneous access by multiple clients |
| * Stores file data and metadata on servers using local file system |
| and access methods |
| * Userspace implementation is easy to install and maintain |
| * Direct MPI support |
| * Stateless |
| |
| |
| MAILING LIST |
| ============ |
| |
| http://beowulf-underground.org/mailman/listinfo/pvfs2-users |
| |
| |
| DOCUMENTATION |
| ============= |
| |
| http://www.orangefs.org/documentation/ |
| |
| |
| USERSPACE FILESYSTEM SOURCE |
| =========================== |
| |
| http://www.orangefs.org/download |
| |
| Orangefs versions prior to 2.9.3 would not be compatible with the |
| upstream version of the kernel client. |
| |
| |
| BUILDING THE USERSPACE FILESYSTEM ON A SINGLE SERVER |
| ==================================================== |
| |
| When Orangefs is upstream, "--with-kernel" shouldn't be needed, but |
| until then the path to where the kernel with the Orangefs kernel client |
| patch was built is needed to ensure that pvfs2-client-core (the bridge |
| between kernel space and user space) will build properly. You can omit |
| --prefix if you don't care that things are sprinkled around in |
| /usr/local. |
| |
| ./configure --prefix=/opt/ofs --with-kernel=/path/to/orangefs/kernel |
| |
| make |
| |
| make install |
| |
| Create an orangefs config file: |
| /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf |
| |
| for "Enter hostnames", use the hostname, don't let it default to |
| localhost. |
| |
| create a pvfs2tab file in /etc: |
| cat /etc/pvfs2tab |
| tcp://myhostname:3334/orangefs /mymountpoint pvfs2 defaults,noauto 0 0 |
| |
| create the mount point you specified in the tab file if needed: |
| mkdir /mymountpoint |
| |
| bootstrap the server: |
| /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf -f |
| |
| start the server: |
| /opt/osf/sbin/pvfs2-server /etc/pvfs2.conf |
| |
| Now the server is running. At this point you might like to |
| prove things are working with: |
| |
| /opt/osf/bin/pvfs2-ls /mymountpoint |
| |
| You might not want to enforce selinux, it doesn't seem to matter by |
| linux 3.11... |
| |
| If stuff seems to be working, turn on the client core: |
| /opt/osf/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core |
| |
| Mount your filesystem. |
| mount -t pvfs2 tcp://myhostname:3334/orangefs /mymountpoint |
| |
| |
| OPTIONS |
| ======= |
| |
| The following mount options are accepted: |
| |
| acl |
| Allow the use of Access Control Lists on files and directories. |
| |
| intr |
| Some operations between the kernel client and the user space |
| filesystem can be interruptible, such as changes in debug levels |
| and the setting of tunable parameters. |
| |
| local_lock |
| Enable posix locking from the perspective of "this" kernel. The |
| default file_operations lock action is to return ENOSYS. Posix |
| locking kicks in if the filesystem is mounted with -o local_lock. |
| Distributed locking is being worked on for the future. |
| |
| |
| DEBUGGING |
| ========= |
| |
| If you want the debug (GOSSIP) statements in a particular |
| source file (inode.c for example) go to syslog: |
| |
| echo inode > /sys/kernel/debug/orangefs/kernel-debug |
| |
| No debugging (the default): |
| |
| echo none > /sys/kernel/debug/orangefs/kernel-debug |
| |
| Debugging from several source files: |
| |
| echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug |
| |
| All debugging: |
| |
| echo all > /sys/kernel/debug/orangefs/kernel-debug |
| |
| Get a list of all debugging keywords: |
| |
| cat /sys/kernel/debug/orangefs/debug-help |
| |
| |
| PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE |
| ============================================ |
| |
| Orangefs is a user space filesystem and an associated kernel module. |
| We'll just refer to the user space part of Orangefs as "userspace" |
| from here on out. Orangefs descends from PVFS, and userspace code |
| still uses PVFS for function and variable names. Userspace typedefs |
| many of the important structures. Function and variable names in |
| the kernel module have been transitioned to "orangefs", and The Linux |
| Coding Style avoids typedefs, so kernel module structures that |
| correspond to userspace structures are not typedefed. |
| |
| The kernel module implements a pseudo device that userspace |
| can read from and write to. Userspace can also manipulate the |
| kernel module through the pseudo device with ioctl. |
| |
| THE BUFMAP: |
| |
| At startup userspace allocates two page-size-aligned (posix_memalign) |
| mlocked memory buffers, one is used for IO and one is used for readdir |
| operations. The IO buffer is 41943040 bytes and the readdir buffer is |
| 4194304 bytes. Each buffer contains logical chunks, or partitions, and |
| a pointer to each buffer is added to its own PVFS_dev_map_desc structure |
| which also describes its total size, as well as the size and number of |
| the partitions. |
| |
| A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a |
| mapping routine in the kernel module with an ioctl. The structure is |
| copied from user space to kernel space with copy_from_user and is used |
| to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which |
| then contains: |
| |
| * refcnt - a reference counter |
| * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's |
| partition size, which represents the filesystem's block size and |
| is used for s_blocksize in super blocks. |
| * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of |
| partitions in the IO buffer. |
| * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. |
| * total_size - the total size of the IO buffer. |
| * page_count - the number of 4096 byte pages in the IO buffer. |
| * page_array - a pointer to page_count * (sizeof(struct page*)) bytes |
| of kcalloced memory. This memory is used as an array of pointers |
| to each of the pages in the IO buffer through a call to get_user_pages. |
| * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc)) |
| bytes of kcalloced memory. This memory is further intialized: |
| |
| user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc |
| structure. user_desc->ptr points to the IO buffer. |
| |
| pages_per_desc = bufmap->desc_size / PAGE_SIZE |
| offset = 0 |
| |
| bufmap->desc_array[0].page_array = &bufmap->page_array[offset] |
| bufmap->desc_array[0].array_count = pages_per_desc = 1024 |
| bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) |
| offset += 1024 |
| . |
| . |
| . |
| bufmap->desc_array[9].page_array = &bufmap->page_array[offset] |
| bufmap->desc_array[9].array_count = pages_per_desc = 1024 |
| bufmap->desc_array[9].uaddr = (user_desc->ptr) + |
| (9 * 1024 * 4096) |
| offset += 1024 |
| |
| * buffer_index_array - a desc_count sized array of ints, used to |
| indicate which of the IO buffer's partitions are available to use. |
| * buffer_index_lock - a spinlock to protect buffer_index_array during update. |
| * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element |
| int array used to indicate which of the readdir buffer's partitions are |
| available to use. |
| * readdir_index_lock - a spinlock to protect readdir_index_array during |
| update. |
| |
| OPERATIONS: |
| |
| The kernel module builds an "op" (struct orangefs_kernel_op_s) when it |
| needs to communicate with userspace. Part of the op contains the "upcall" |
| which expresses the request to userspace. Part of the op eventually |
| contains the "downcall" which expresses the results of the request. |
| |
| The slab allocator is used to keep a cache of op structures handy. |
| |
| At init time the kernel module defines and initializes a request list |
| and an in_progress hash table to keep track of all the ops that are |
| in flight at any given time. |
| |
| Ops are stateful: |
| |
| * unknown - op was just initialized |
| * waiting - op is on request_list (upward bound) |
| * inprogr - op is in progress (waiting for downcall) |
| * serviced - op has matching downcall; ok |
| * purged - op has to start a timer since client-core |
| exited uncleanly before servicing op |
| * given up - submitter has given up waiting for it |
| |
| When some arbitrary userspace program needs to perform a |
| filesystem operation on Orangefs (readdir, I/O, create, whatever) |
| an op structure is initialized and tagged with a distinguishing ID |
| number. The upcall part of the op is filled out, and the op is |
| passed to the "service_operation" function. |
| |
| Service_operation changes the op's state to "waiting", puts |
| it on the request list, and signals the Orangefs file_operations.poll |
| function through a wait queue. Userspace is polling the pseudo-device |
| and thus becomes aware of the upcall request that needs to be read. |
| |
| When the Orangefs file_operations.read function is triggered, the |
| request list is searched for an op that seems ready-to-process. |
| The op is removed from the request list. The tag from the op and |
| the filled-out upcall struct are copy_to_user'ed back to userspace. |
| |
| If any of these (and some additional protocol) copy_to_users fail, |
| the op's state is set to "waiting" and the op is added back to |
| the request list. Otherwise, the op's state is changed to "in progress", |
| and the op is hashed on its tag and put onto the end of a list in the |
| in_progress hash table at the index the tag hashed to. |
| |
| When userspace has assembled the response to the upcall, it |
| writes the response, which includes the distinguishing tag, back to |
| the pseudo device in a series of io_vecs. This triggers the Orangefs |
| file_operations.write_iter function to find the op with the associated |
| tag and remove it from the in_progress hash table. As long as the op's |
| state is not "canceled" or "given up", its state is set to "serviced". |
| The file_operations.write_iter function returns to the waiting vfs, |
| and back to service_operation through wait_for_matching_downcall. |
| |
| Service operation returns to its caller with the op's downcall |
| part (the response to the upcall) filled out. |
| |
| The "client-core" is the bridge between the kernel module and |
| userspace. The client-core is a daemon. The client-core has an |
| associated watchdog daemon. If the client-core is ever signaled |
| to die, the watchdog daemon restarts the client-core. Even though |
| the client-core is restarted "right away", there is a period of |
| time during such an event that the client-core is dead. A dead client-core |
| can't be triggered by the Orangefs file_operations.poll function. |
| Ops that pass through service_operation during a "dead spell" can timeout |
| on the wait queue and one attempt is made to recycle them. Obviously, |
| if the client-core stays dead too long, the arbitrary userspace processes |
| trying to use Orangefs will be negatively affected. Waiting ops |
| that can't be serviced will be removed from the request list and |
| have their states set to "given up". In-progress ops that can't |
| be serviced will be removed from the in_progress hash table and |
| have their states set to "given up". |
| |
| Readdir and I/O ops are atypical with respect to their payloads. |
| |
| - readdir ops use the smaller of the two pre-allocated pre-partitioned |
| memory buffers. The readdir buffer is only available to userspace. |
| The kernel module obtains an index to a free partition before launching |
| a readdir op. Userspace deposits the results into the indexed partition |
| and then writes them to back to the pvfs device. |
| |
| - io (read and write) ops use the larger of the two pre-allocated |
| pre-partitioned memory buffers. The IO buffer is accessible from |
| both userspace and the kernel module. The kernel module obtains an |
| index to a free partition before launching an io op. The kernel module |
| deposits write data into the indexed partition, to be consumed |
| directly by userspace. Userspace deposits the results of read |
| requests into the indexed partition, to be consumed directly |
| by the kernel module. |
| |
| Responses to kernel requests are all packaged in pvfs2_downcall_t |
| structs. Besides a few other members, pvfs2_downcall_t contains a |
| union of structs, each of which is associated with a particular |
| response type. |
| |
| The several members outside of the union are: |
| - int32_t type - type of operation. |
| - int32_t status - return code for the operation. |
| - int64_t trailer_size - 0 unless readdir operation. |
| - char *trailer_buf - initialized to NULL, used during readdir operations. |
| |
| The appropriate member inside the union is filled out for any |
| particular response. |
| |
| PVFS2_VFS_OP_FILE_IO |
| fill a pvfs2_io_response_t |
| |
| PVFS2_VFS_OP_LOOKUP |
| fill a PVFS_object_kref |
| |
| PVFS2_VFS_OP_CREATE |
| fill a PVFS_object_kref |
| |
| PVFS2_VFS_OP_SYMLINK |
| fill a PVFS_object_kref |
| |
| PVFS2_VFS_OP_GETATTR |
| fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) |
| fill in a string with the link target when the object is a symlink. |
| |
| PVFS2_VFS_OP_MKDIR |
| fill a PVFS_object_kref |
| |
| PVFS2_VFS_OP_STATFS |
| fill a pvfs2_statfs_response_t with useless info <g>. It is hard for |
| us to know, in a timely fashion, these statistics about our |
| distributed network filesystem. |
| |
| PVFS2_VFS_OP_FS_MOUNT |
| fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref |
| except its members are in a different order and "__pad1" is replaced |
| with "id". |
| |
| PVFS2_VFS_OP_GETXATTR |
| fill a pvfs2_getxattr_response_t |
| |
| PVFS2_VFS_OP_LISTXATTR |
| fill a pvfs2_listxattr_response_t |
| |
| PVFS2_VFS_OP_PARAM |
| fill a pvfs2_param_response_t |
| |
| PVFS2_VFS_OP_PERF_COUNT |
| fill a pvfs2_perf_count_response_t |
| |
| PVFS2_VFS_OP_FSKEY |
| file a pvfs2_fs_key_response_t |
| |
| PVFS2_VFS_OP_READDIR |
| jamb everything needed to represent a pvfs2_readdir_response_t into |
| the readdir buffer descriptor specified in the upcall. |
| |
| Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests |
| made by the kernel side. |
| |
| A buffer_list containing: |
| - a pointer to the prepared response to the request from the |
| kernel (struct pvfs2_downcall_t). |
| - and also, in the case of a readdir request, a pointer to a |
| buffer containing descriptors for the objects in the target |
| directory. |
| ... is sent to the function (PINT_dev_write_list) which performs |
| the writev. |
| |
| PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; |
| |
| The first four elements of io_array are initialized like this for all |
| responses: |
| |
| io_array[0].iov_base = address of local variable "proto_ver" (int32_t) |
| io_array[0].iov_len = sizeof(int32_t) |
| |
| io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) |
| io_array[1].iov_len = sizeof(int32_t) |
| |
| io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) |
| io_array[2].iov_len = sizeof(int64_t) |
| |
| io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) |
| of global variable vfs_request (vfs_request_t) |
| io_array[3].iov_len = sizeof(pvfs2_downcall_t) |
| |
| Readdir responses initialize the fifth element io_array like this: |
| |
| io_array[4].iov_base = contents of member trailer_buf (char *) |
| from out_downcall member of global variable |
| vfs_request |
| io_array[4].iov_len = contents of member trailer_size (PVFS_size) |
| from out_downcall member of global variable |
| vfs_request |
| |
| Orangefs exploits the dcache in order to avoid sending redundant |
| requests to userspace. We keep object inode attributes up-to-date with |
| orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to |
| help it decide whether or not to update an inode: "new" and "bypass". |
| Orangefs keeps private data in an object's inode that includes a short |
| timeout value, getattr_time, which allows any iteration of |
| orangefs_inode_getattr to know how long it has been since the inode was |
| updated. When the object is not new (new == 0) and the bypass flag is not |
| set (bypass == 0) orangefs_inode_getattr returns without updating the inode |
| if getattr_time has not timed out. Getattr_time is updated each time the |
| inode is updated. |
| |
| Creation of a new object (file, dir, sym-link) includes the evaluation of |
| its pathname, resulting in a negative directory entry for the object. |
| A new inode is allocated and associated with the dentry, turning it from |
| a negative dentry into a "productive full member of society". Orangefs |
| obtains the new inode from Linux with new_inode() and associates |
| the inode with the dentry by sending the pair back to Linux with |
| d_instantiate(). |
| |
| The evaluation of a pathname for an object resolves to its corresponding |
| dentry. If there is no corresponding dentry, one is created for it in |
| the dcache. Whenever a dentry is modified or verified Orangefs stores a |
| short timeout value in the dentry's d_time, and the dentry will be trusted |
| for that amount of time. Orangefs is a network filesystem, and objects |
| can potentially change out-of-band with any particular Orangefs kernel module |
| instance, so trusting a dentry is risky. The alternative to trusting |
| dentries is to always obtain the needed information from userspace - at |
| least a trip to the client-core, maybe to the servers. Obtaining information |
| from a dentry is cheap, obtaining it from userspace is relatively expensive, |
| hence the motivation to use the dentry when possible. |
| |
| The timeout values d_time and getattr_time are jiffy based, and the |
| code is designed to avoid the jiffy-wrap problem: |
| |
| "In general, if the clock may have wrapped around more than once, there |
| is no way to tell how much time has elapsed. However, if the times t1 |
| and t2 are known to be fairly close, we can reliably compute the |
| difference in a way that takes into account the possibility that the |
| clock may have wrapped between times." |
| |
| from course notes by instructor Andy Wang |
| |