Tom Zanussi | e82894f | 2005-09-06 15:16:30 -0700 | [diff] [blame] | 1 | |
relayfs - a high-speed data relay filesystem
============================================

relayfs is a filesystem designed to provide an efficient mechanism for
tools and facilities to relay large and potentially sustained streams
of data from kernel space to user space.

The main abstraction of relayfs is the 'channel'. A channel consists
of a set of per-cpu kernel buffers each represented by a file in the
relayfs filesystem. Kernel clients write into a channel using
efficient write functions which automatically log to the current cpu's
channel buffer. User space applications mmap() the per-cpu files and
retrieve the data as it becomes available.

The format of the data logged into the channel buffers is completely
up to the relayfs client; relayfs does however provide hooks which
allow clients to impose some structure on the buffer data. relayfs
does not implement any form of data filtering either - this too is
left to the client. The purpose is to keep relayfs as simple as
possible.

This document provides an overview of the relayfs API. The details of
the function parameters are documented along with the functions in the
filesystem code - please see that for details.

Semantics
=========

Each relayfs channel has one buffer per CPU; each buffer has one or
more sub-buffers. Messages are written to the first sub-buffer until
it is too full to contain a new message, in which case the message is
written to the next sub-buffer (if available). Messages are never
split across sub-buffers. At this point, userspace can be notified so
that it can empty the first sub-buffer while the kernel continues
writing to the next.

When a sub-buffer becomes full, the kernel knows how many bytes of it
are padding, i.e. unused. Userspace can use this knowledge to copy
only valid data.
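
The sub-buffer rules above can be illustrated with a small user-space
model. This is only a sketch for illustration, not relayfs code:
model_buf, model_write(), model_copy_valid() and the sizes are
hypothetical, and the model is linear rather than circular.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 16	/* hypothetical sub-buffer size in bytes */
#define N_SUBBUFS   4	/* hypothetical number of sub-buffers */

struct model_buf {
	char data[N_SUBBUFS * SUBBUF_SIZE];
	size_t padding[N_SUBBUFS];	/* unused bytes at the end of each sub-buffer */
	size_t cur;			/* index of the sub-buffer being written */
	size_t offset;			/* write offset within the current sub-buffer */
};

/* Append one message; returns 0 on success, -1 if it cannot be stored. */
static int model_write(struct model_buf *b, const void *msg, size_t len)
{
	if (len > SUBBUF_SIZE)
		return -1;		/* can never fit in a single sub-buffer */
	if (b->offset + len > SUBBUF_SIZE) {
		if (b->cur + 1 == N_SUBBUFS)
			return -1;	/* no next sub-buffer available */
		/* Record the unused tail, then switch to the next sub-buffer. */
		b->padding[b->cur] = SUBBUF_SIZE - b->offset;
		b->cur++;
		b->offset = 0;
	}
	memcpy(b->data + b->cur * SUBBUF_SIZE + b->offset, msg, len);
	b->offset += len;
	return 0;
}

/* Copy only the valid bytes of a finished sub-buffer; returns the count. */
static size_t model_copy_valid(const struct model_buf *b, size_t idx, char *out)
{
	size_t valid;

	assert(idx < N_SUBBUFS);
	valid = SUBBUF_SIZE - b->padding[idx];
	memcpy(out, b->data + idx * SUBBUF_SIZE, valid);
	return valid;
}
```

In this model, writing a 10-byte message followed by an 8-byte message
leaves 6 bytes of padding in the first sub-buffer, and the second
message starts the next one; the reader then copies only the first 10
bytes of the first sub-buffer.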

After copying it, userspace can notify the kernel that a sub-buffer
has been consumed.

relayfs can operate in a mode where it will overwrite data not yet
collected by userspace, rather than waiting for userspace to consume
it.

relayfs itself does not provide for communication of such buffer-state
data between userspace and kernel, allowing the kernel side to remain
simple and not impose a single interface on userspace. It does provide
a set of examples and a separate helper though, described below.

klog and relay-apps example code
================================

relayfs itself is ready to use, but to make things easier, a couple
of simple utility functions and a set of examples are provided.

The relay-apps example tarball, available on the relayfs sourceforge
site, contains a set of self-contained examples, each consisting of a
pair of .c files containing boilerplate code for each of the user and
kernel sides of a relayfs application; combined, these two sets of
boilerplate code provide glue to easily stream data to disk, without
having to bother with mundane housekeeping chores.

The 'klog debugging functions' patch (klog.patch in the relay-apps
tarball) provides a couple of high-level logging functions to the
kernel which allow writing formatted text or raw data to a channel,
regardless of whether a channel to write into exists or not, or
whether relayfs is compiled into the kernel or is configured as a
module. These functions allow you to put unconditional 'trace'
statements anywhere in the kernel or kernel modules; only when there
is a 'klog handler' registered will data actually be logged (see the
klog and kleak examples for details).

It is of course possible to use relayfs from scratch, i.e. without
using any of the relay-apps example code or klog, but you'll have to
implement communication between userspace and kernel, allowing both to
convey the state of buffers (full, empty, amount of padding).

klog and the relay-apps examples can be found in the relay-apps
tarball on http://relayfs.sourceforge.net


The relayfs user space API
==========================

relayfs implements basic file operations for user space access to
relayfs channel buffer data. Here are the file operations that are
available and some comments regarding their behavior:

open()   enables the user to open an _existing_ buffer.

mmap()   results in the channel buffer being mapped into the caller's
         memory space. Note that you can't do a partial mmap - you must
         map the entire file, which is NRBUF * SUBBUFSIZE, i.e. the
         number of sub-buffers multiplied by the sub-buffer size.

read()   reads the contents of a channel buffer. The bytes read are
         'consumed' by the reader, i.e. they won't be available again
         to subsequent reads. If the channel is being used in
         no-overwrite mode (the default), it can be read at any time
         even if there's an active kernel writer. If the channel is
         being used in overwrite mode and there are active channel
         writers, results may be unpredictable - users should make
         sure that all logging to the channel has ended before using
         read() with overwrite mode.

poll()   POLLIN/POLLRDNORM/POLLERR supported. User applications are
         notified when sub-buffer boundaries are crossed.

close()  decrements the channel buffer's refcount. When the refcount
         reaches 0, i.e. when no process or kernel client has the
         buffer open, the channel buffer is freed.


In order for a user application to make use of relayfs files, the
relayfs filesystem must be mounted. For example,

        mount -t relayfs relayfs /mnt/relay

NOTE:   relayfs doesn't need to be mounted for kernel clients to create
        or use channels - it only needs to be mounted when user space
        applications need access to the buffer data.
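
As a rough illustration of the file operations listed above, the
following user-space sketch opens one per-cpu file, maps it in full
and waits for data with poll(). The helper names (relay_file_path(),
relay_map(), relay_wait()) are hypothetical, as are the choice of
MAP_PRIVATE and the assumption that the reader already knows the
sub-buffer size and count used by the kernel client. The file naming
follows the basenameN scheme described under 'Creating a channel'.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Build "<dir>/<base><cpu>", e.g. "/mnt/relay/cpu2"; returns 0 or -1. */
static int relay_file_path(char *out, size_t outlen, const char *dir,
			   const char *base, unsigned int cpu)
{
	int n = snprintf(out, outlen, "%s/%s%u", dir, base, cpu);

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Open an existing buffer file and map all of it read-only.
 * Partial maps are not allowed, so the length is always
 * subbuf_size * n_subbufs.  Returns MAP_FAILED on error.
 */
static void *relay_map(const char *path, size_t subbuf_size,
		       size_t n_subbufs, int *fd_out)
{
	void *p;
	int fd;

	assert(fd_out != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return MAP_FAILED;
	p = mmap(NULL, subbuf_size * n_subbufs, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return MAP_FAILED;
	}
	*fd_out = fd;
	return p;
}

/* Wait for data: 1 if readable, 0 on timeout, -1 on error. */
static int relay_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc <= 0)
		return rc;
	if (pfd.revents & POLLERR)
		return -1;
	return (pfd.revents & POLLIN) ? 1 : 0;
}
```

A reader would typically call relay_map() once per cpu file, then loop
on relay_wait() and process whichever sub-buffers have become ready.
How the reader learns which sub-buffers are ready and how much padding
they contain is up to the application, as noted earlier.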


The relayfs kernel API
======================

Here's a summary of the API relayfs provides to in-kernel clients:


  channel management functions:

    relay_open(base_filename, parent, subbuf_size, n_subbufs,
               callbacks)
    relay_close(chan)
    relay_flush(chan)
    relay_reset(chan)
    relayfs_create_dir(name, parent)
    relayfs_remove_dir(dentry)
    relayfs_create_file(name, parent, mode, fops, data)
    relayfs_remove_file(dentry)

  channel management typically called on instigation of userspace:

    relay_subbufs_consumed(chan, cpu, subbufs_consumed)

  write functions:

    relay_write(chan, data, length)
    __relay_write(chan, data, length)
    relay_reserve(chan, length)

  callbacks:

    subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
    buf_mapped(buf, filp)
    buf_unmapped(buf, filp)
    create_buf_file(filename, parent, mode, buf, is_global)
    remove_buf_file(dentry)

  helper functions:

    relay_buf_full(buf)
    subbuf_start_reserve(buf, length)


Creating a channel
------------------

relay_open() is used to create a channel, along with its per-cpu
channel buffers. Each channel buffer will have an associated file
created for it in the relayfs filesystem, which can be opened and
mmapped from user space if desired. The files are named
basename0...basenameN-1 where N is the number of online cpus, and by
default will be created in the root of the filesystem. If you want a
directory structure to contain your relayfs files, you can create it
with relayfs_create_dir() and pass the parent directory to
relay_open(). Clients are responsible for cleaning up any directory
structure they create when the channel is closed - use
relayfs_remove_dir() for that.

The total size of each per-cpu buffer is calculated by multiplying the
number of sub-buffers by the sub-buffer size passed into relay_open().
The idea behind sub-buffers is that they're basically an extension of
double-buffering to N buffers, and they also allow applications to
easily implement random-access-on-buffer-boundary schemes, which can
be important for some high-volume applications. The number and size
of sub-buffers is completely dependent on the application and even for
the same application, different conditions will warrant different
values for these parameters at different times. Typically, the right
values to use are best decided after some experimentation; in general,
though, it's safe to assume that having only 1 sub-buffer is a bad
idea - you're guaranteed to either overwrite data or lose events
depending on the channel mode being used.

Channel 'modes'
---------------

relayfs channels can be used in either of two modes - 'overwrite' or
'no-overwrite'. The mode is entirely determined by the implementation
of the subbuf_start() callback, as described below. In 'overwrite'
mode, also known as 'flight recorder' mode, writes continuously cycle
around the buffer and will never fail, but will unconditionally
overwrite old data regardless of whether it's actually been consumed.
In no-overwrite mode, writes will fail, i.e. data will be lost, if the
number of unconsumed sub-buffers equals the total number of
sub-buffers in the channel. It should be clear that if there is no
consumer or if the consumer can't consume sub-buffers fast enough,
data will be lost in either case; the only difference is whether data
is lost from the beginning or the end of a buffer.

As explained above, a relayfs channel is made up of one or more
per-cpu channel buffers, each implemented as a circular buffer
subdivided into one or more sub-buffers. Messages are written into
the current sub-buffer of the channel's current per-cpu buffer via the
write functions described below. Whenever a message can't fit into
the current sub-buffer, because there's no room left for it, the
client is notified via the subbuf_start() callback that a switch to a
new sub-buffer is about to occur. The client uses this callback to 1)
initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually go ahead with the sub-buffer switch.

To implement 'no-overwrite' mode, the kernel client would provide
an implementation of the subbuf_start() callback something like the
following:

    static int subbuf_start(struct rchan_buf *buf,
                            void *subbuf,
                            void *prev_subbuf,
                            unsigned int prev_padding)
    {
            /* Store the padding count in the previous sub-buffer's header. */
            if (prev_subbuf)
                    *((unsigned int *)prev_subbuf) = prev_padding;

            /* Refuse the switch while every sub-buffer is unconsumed. */
            if (relay_buf_full(buf))
                    return 0;

            /* Reserve header space for this sub-buffer's padding count. */
            subbuf_start_reserve(buf, sizeof(unsigned int));

            return 1;
    }

If the current buffer is full, i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
occur yet, i.e. until the consumer has had a chance to read the
current set of ready sub-buffers. For the relay_buf_full() function to
make sense, the consumer is responsible for notifying relayfs when
sub-buffers have been consumed via relay_subbufs_consumed(). Any
subsequent attempts to write into the buffer will again invoke the
subbuf_start() callback with the same parameters; only when the
consumer has consumed one or more of the ready sub-buffers will
relay_buf_full() return 0, in which case the buffer switch can
continue.

The implementation of the subbuf_start() callback for 'overwrite' mode
would be very similar:

    static int subbuf_start(struct rchan_buf *buf,
                            void *subbuf,
                            void *prev_subbuf,
                            unsigned int prev_padding)
    {
            /* Store the padding count in the previous sub-buffer's header. */
            if (prev_subbuf)
                    *((unsigned int *)prev_subbuf) = prev_padding;

            /* Reserve header space for this sub-buffer's padding count. */
            subbuf_start_reserve(buf, sizeof(unsigned int));

            /* Always switch, even if unconsumed data gets overwritten. */
            return 1;
    }

In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur
unconditionally. It's also meaningless for the client to use the
relay_subbufs_consumed() function in this mode, as it's never
consulted.
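
The difference between the two modes can be seen in a small user-space
model of the accounting involved. This is a sketch under stated
assumptions, not the relayfs implementation: model_chan and the
model_*() functions are hypothetical stand-ins, and 'full' is assumed
to mean that the number of produced but unconsumed sub-buffers has
reached the sub-buffer count.

```c
#include <assert.h>
#include <stddef.h>

#define N_SUBBUFS 4	/* hypothetical sub-buffer count */

struct model_chan {
	size_t produced;	/* sub-buffers filled by the writer */
	size_t consumed;	/* sub-buffers released or lost at the old end */
	size_t lost_old;	/* old sub-buffers clobbered (overwrite mode) */
	size_t lost_new;	/* new sub-buffers refused (no-overwrite mode) */
	int overwrite;		/* non-zero selects 'overwrite' mode */
};

/* Stand-in for relay_buf_full(): every sub-buffer is still unconsumed. */
static int model_buf_full(const struct model_chan *c)
{
	return c->produced - c->consumed >= N_SUBBUFS;
}

/* Stand-in for subbuf_start(): 1 allows the switch, 0 refuses it. */
static int model_subbuf_start(const struct model_chan *c)
{
	if (c->overwrite)
		return 1;
	return !model_buf_full(c);
}

/* The writer tries to fill one more sub-buffer; returns 1 if it could. */
static int model_fill(struct model_chan *c)
{
	if (!model_subbuf_start(c)) {
		c->lost_new++;		/* newest data is dropped */
		return 0;
	}
	if (model_buf_full(c)) {
		/* The oldest sub-buffer is clobbered and is gone for good. */
		c->lost_old++;
		c->consumed++;
	}
	c->produced++;
	return 1;
}

/* Stand-in for relay_subbufs_consumed(). */
static void model_consumed(struct model_chan *c, size_t n)
{
	assert(c->consumed + n <= c->produced);
	c->consumed += n;
}
```

With no consumer running, six calls to model_fill() keep the first
four sub-buffers and refuse the last two in no-overwrite mode, while
overwrite mode keeps the last four and clobbers the first two. Once
model_consumed() releases a sub-buffer, the no-overwrite writer can
make progress again.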

The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
implements the simplest possible 'no-overwrite' mode, i.e. it does
nothing but return 0.

Header information can be reserved at the beginning of each sub-buffer
by calling the subbuf_start_reserve() helper function from within the
subbuf_start() callback. This reserved area can be used to store
whatever information the client wants. In the example above, room is
reserved in each sub-buffer to store the padding count for that
sub-buffer. This is filled in for the previous sub-buffer in the
subbuf_start() implementation; the padding value for the previous
sub-buffer is passed into the subbuf_start() callback along with a
pointer to the previous sub-buffer, since the padding value isn't
known until a sub-buffer is filled. The subbuf_start() callback is
also called for the first sub-buffer when the channel is opened, to
give the client a chance to reserve space in it. In this case the
previous sub-buffer pointer passed into the callback will be NULL, so
the client should check the value of the prev_subbuf pointer before
writing into the previous sub-buffer.
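
On the consuming side, a header like the one in the example callbacks
lets a reader skip the padding. The following user-space sketch
assumes the layout produced by those callbacks, i.e. an unsigned int
padding count at the start of each sub-buffer followed by the payload;
subbuf_payload() is a hypothetical helper, not part of relayfs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the payload of one finished sub-buffer into out.
 * Assumed layout: [unsigned int padding][payload ...][padding bytes].
 * Returns the number of payload bytes copied, or 0 if there is nothing
 * valid to copy (empty payload or an implausible padding count).
 */
static size_t subbuf_payload(const void *subbuf, size_t subbuf_size, void *out)
{
	unsigned int padding;
	size_t hdr = sizeof(padding);
	size_t len;

	assert(subbuf != NULL && out != NULL);
	if (subbuf_size < hdr)
		return 0;
	/* memcpy avoids assumptions about the alignment of subbuf. */
	memcpy(&padding, subbuf, hdr);
	if (padding > subbuf_size - hdr)
		return 0;	/* header is corrupt or not yet filled in */
	len = subbuf_size - hdr - padding;
	memcpy(out, (const char *)subbuf + hdr, len);
	return len;
}
```

The caller supplies an output area at least subbuf_size bytes long.
Note that the padding count is only valid once the kernel has switched
away from the sub-buffer, since that is when the callback fills it in.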

Writing to a channel
--------------------

Kernel clients write data into the current cpu's channel buffer using
relay_write() or __relay_write(). relay_write() is the main logging
function - it uses local_irq_save() to protect the buffer and should
be used if you might be logging from interrupt context. If you know
you'll never be logging from interrupt context, you can use
__relay_write(), which only disables preemption. These functions
don't return a value, so you can't determine whether or not they
failed - the assumption is that you wouldn't want to check a return
value in the fast logging path anyway, and that they'll always succeed
unless the buffer is full and no-overwrite mode is being used, in
which case you can detect a failed write in the subbuf_start()
callback by calling the relay_buf_full() helper function.

relay_reserve() is used to reserve a slot in a channel buffer which
can be written to later. This would typically be used in applications
that need to write directly into a channel buffer without having to
stage data in a temporary buffer beforehand. Because the actual write
may not happen immediately after the slot is reserved, applications
using relay_reserve() can keep a count of the number of bytes actually
written, either in space reserved in the sub-buffers themselves or as
a separate array. See the 'reserve' example in the relay-apps tarball
at http://relayfs.sourceforge.net for an example of how this can be
done. Because the write is under control of the client and is
separated from the reserve, relay_reserve() doesn't protect the buffer
at all - it's up to the client to provide the appropriate
synchronization when using relay_reserve().
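
The reserve-then-write pattern can be sketched with a user-space
model. Everything here is hypothetical and for illustration only:
model_reserve() merely stands in for relay_reserve() on a single
sub-buffer, the event layout is invented, and no synchronization is
shown even though a real client would need to provide it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 32	/* hypothetical sub-buffer size in bytes */

struct model_subbuf {
	char data[SUBBUF_SIZE];
	size_t offset;		/* next free byte, advanced by reservations */
	size_t committed;	/* bytes actually written, kept by the client */
};

/* Hand out a slot of len bytes, or NULL if it does not fit. */
static void *model_reserve(struct model_subbuf *sb, size_t len)
{
	void *slot;

	if (len > SUBBUF_SIZE - sb->offset)
		return NULL;
	slot = sb->data + sb->offset;
	sb->offset += len;
	return slot;
}

/* Write an (id, value) event straight into a reserved slot. */
static int log_event(struct model_subbuf *sb, unsigned int id,
		     unsigned int value)
{
	char *slot;

	assert(sb != NULL);
	slot = model_reserve(sb, sizeof(id) + sizeof(value));
	if (!slot)
		return -1;
	/* Fields go directly into the buffer; no staging copy is built. */
	memcpy(slot, &id, sizeof(id));
	memcpy(slot + sizeof(id), &value, sizeof(value));
	/* Count the bytes only once the write has really happened. */
	sb->committed += sizeof(id) + sizeof(value);
	return 0;
}
```

Keeping 'committed' separate from 'offset' mirrors the advice above: a
reader can compare the two to tell reserved slots that have been
filled from those still waiting for their data.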

Closing a channel
-----------------

The client calls relay_close() when it's finished using the channel.
The channel and its associated buffers are destroyed when there are no
longer any references to any of the channel buffers. relay_flush()
forces a sub-buffer switch on all the channel buffers, and can be used
to finalize and process the last sub-buffers before the channel is
closed.

Creating non-relay files
------------------------

relay_open() automatically creates files in the relayfs filesystem to
represent the per-cpu kernel buffers; it's often useful for
applications to be able to create their own files alongside the relay
files in the relayfs filesystem as well, e.g. 'control' files much like
those created in /proc or debugfs for similar purposes, used to
communicate control information between the kernel and user sides of a
relayfs application. For this purpose the relayfs_create_file() and
relayfs_remove_file() API functions exist. For relayfs_create_file(),
the caller passes in a set of user-defined file operations to be used
for the file and an optional void * to a user-specified data item,
which will be accessible via inode->u.generic_ip (see the relay-apps
tarball for examples). The file_operations are a required parameter
to relayfs_create_file() and thus the semantics of these files are
completely defined by the caller.

See the relay-apps tarball at http://relayfs.sourceforge.net for
examples of how these non-relay files are meant to be used.

Creating relay files in other filesystems
-----------------------------------------

By default, relay_open() creates relay files in the relayfs
filesystem. Because relay_file_operations is exported, however, it's
also possible to create and use relay files in other pseudo-filesystems
such as debugfs.

For this purpose, two callback functions are provided,
create_buf_file() and remove_buf_file(). create_buf_file() is called
once for each per-cpu buffer from relay_open() to allow the client to
create a file to be used to represent the corresponding buffer; if
this callback is not defined, the default implementation will create
and return a file in the relayfs filesystem to represent the buffer.
The callback should return the dentry of the file created to represent
the relay buffer. Note that the parent directory passed to
relay_open() (and passed along to the callback), if specified, must
exist in the same filesystem the new relay file is created in. If
create_buf_file() is defined, remove_buf_file() must also be defined;
it's responsible for deleting the file(s) created in create_buf_file()
and is called during relay_close().

The create_buf_file() implementation can also be defined in such a way
as to allow the creation of a single 'global' buffer instead of the
default per-cpu set. This can be useful for applications interested
mainly in seeing the relative ordering of system-wide events without
the need to bother with saving explicit timestamps for the purpose of
merging/sorting per-cpu files in a postprocessing step.

To have relay_open() create a global buffer, the create_buf_file()
implementation should set the value of the is_global outparam to a
non-zero value in addition to creating the file that will be used to
represent the single buffer. In the case of a global buffer,
create_buf_file() and remove_buf_file() will be called only once. The
normal channel-writing functions e.g. relay_write() can still be used
- writes from any cpu will transparently end up in the global buffer -
but since it is a global buffer, callers should make sure they use the
proper locking for such a buffer, either by wrapping writes in a
spinlock, or by copying a write function from relayfs_fs.h and
creating a local version that internally does the proper locking.
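
The 'wrap writes in a lock' approach can be sketched in user space as
follows. This is only a model of the pattern: a C11 atomic_flag stands
in for a kernel spinlock, global_write() stands in for a locked
wrapper around a write function, and all names and sizes are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define GLOBAL_BUF_SIZE 64	/* hypothetical size of the single buffer */

static atomic_flag global_lock = ATOMIC_FLAG_INIT;
static char global_buf[GLOBAL_BUF_SIZE];
static size_t global_offset;

/* Spin until the flag is acquired; models taking a spinlock. */
static void model_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&global_lock,
						 memory_order_acquire))
		;
}

/* Release the flag; models dropping the spinlock. */
static void model_unlock(void)
{
	atomic_flag_clear_explicit(&global_lock, memory_order_release);
}

/* Append data to the shared buffer; returns 0, or -1 if it is full. */
static int global_write(const void *data, size_t len)
{
	int ret = -1;

	assert(data != NULL);
	model_lock();
	/* Offset check and copy happen atomically with respect to writers. */
	if (len <= GLOBAL_BUF_SIZE - global_offset) {
		memcpy(global_buf + global_offset, data, len);
		global_offset += len;
		ret = 0;
	}
	model_unlock();
	return ret;
}
```

Because every writer goes through global_write(), events from all
writers land in the buffer in the order in which the lock was taken,
which is what gives a global buffer its system-wide ordering.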

See the 'exported-relayfile' examples in the relay-apps tarball for
examples of creating and using relay files in debugfs.

Misc
----

Some applications may want to keep a channel around and re-use it
rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings. It should however only be called when it's safe to
do so, i.e. when the channel isn't currently being written to.

Finally, there are a couple of utility callbacks that can be used for
different purposes. buf_mapped() is called whenever a channel buffer
is mmapped from user space and buf_unmapped() is called when it's
unmapped. The client can use this notification to trigger actions
within the kernel application, such as enabling/disabling logging to
the channel.
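
One way to use these notifications is to log only while a reader has
the buffer mapped. The user-space model below sketches that idea; the
model_client structure and model_*() functions are hypothetical and
merely stand in for a client's buf_mapped() and buf_unmapped()
callbacks and its logging path.

```c
#include <assert.h>

struct model_client {
	int maps;		/* number of active user-space mappings */
	int logging;		/* non-zero while logging is enabled */
	unsigned long events;	/* events actually logged */
};

/* Stand-in for buf_mapped(): the first mapping enables logging. */
static void model_buf_mapped(struct model_client *c)
{
	if (c->maps++ == 0)
		c->logging = 1;
}

/* Stand-in for buf_unmapped(): the last unmap disables logging. */
static void model_buf_unmapped(struct model_client *c)
{
	assert(c->maps > 0);
	if (--c->maps == 0)
		c->logging = 0;
}

/* Log one event, but only while some reader has the buffer mapped. */
static int model_log(struct model_client *c)
{
	if (!c->logging)
		return 0;
	c->events++;
	return 1;
}
```

Counting mappings rather than using a simple flag keeps logging
enabled until the last of several readers has unmapped the buffer.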


Resources
=========

For news, example code, mailing list, etc. see the relayfs homepage:

    http://relayfs.sourceforge.net


Credits
=======

The ideas and specs for relayfs came about as a result of discussions
on tracing involving the following:

Michel Dagenais         <michel.dagenais@polymtl.ca>
Richard Moore           <richardj_moore@uk.ibm.com>
Bob Wisniewski          <bob@watson.ibm.com>
Karim Yaghmour          <karim@opersys.com>
Tom Zanussi             <zanussi@us.ibm.com>

Also thanks to Hubertus Franke for a lot of useful suggestions and bug
reports.