Contents:

1) TCM Userspace Design
  a) Background
  b) Benefits
  c) Design constraints
  d) Implementation overview
     i. Mailbox
     ii. Command ring
     iii. Data Area
  e) Device discovery
  f) Device events
  g) Other contingencies
2) Writing a user pass-through handler
  a) Discovering and configuring TCMU uio devices
  b) Waiting for events on the device(s)
  c) Managing the command ring
3) Command filtering and pass_level
4) A final note


TCM Userspace Design
--------------------

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel. TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols. TCM also modularizes the data storage. There are existing
modules for file, block device, RAM or using another SCSI device as
storage. These are called "backstores" or "storage engines". These
built-in modules are implemented entirely as kernel code.

Background:

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also
modularizes the actual data storage. These are referred to as
"backstores" or "storage engines". The target comes with backstores
that allow a file, a block device, RAM, or another SCSI device to be
used for the local storage needed for the exported SCSI LUN. Like the
rest of LIO, these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".


Benefits:

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is that there are more distinct components to
configure, and potentially to malfunction. This is unavoidable, but
hopefully not fatal if we're careful to keep things as simple as
possible.

Design constraints:

- Good performance: high throughput, low latency
- Cleanly handle if userspace:
   1) never attaches
   2) hangs
   3) dies
   4) misbehaves
- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend


Implementation overview:

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region are: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

See target_core_user.h for the struct definitions.

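Because every reference in the region is an offset, a handler will
usually want a small helper that converts between offsets and pointers
in its own mapping. The following is a minimal sketch of that idea;
region_ptr() and region_off() are hypothetical names used only for
illustration and are not part of the TCMU interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a region-relative offset into a pointer
 * within this process's mapping of the shared region. */
static inline void *region_ptr(void *region_base, size_t off)
{
  return (uint8_t *)region_base + off;
}

/* Hypothetical inverse: convert a pointer inside the mapping back into
 * an offset, which stays meaningful if the region is later remapped at
 * a different virtual address. */
static inline size_t region_off(const void *region_base, const void *p)
{
  return (size_t)((const uint8_t *)p - (const uint8_t *)region_base);
}
```
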
The Mailbox:

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)
flags - none yet defined.
cmdr_off - The offset of the start of the command ring from the start
of the memory region, to account for the mailbox size.
cmdr_size - The size of the command ring. This does *not* need to be a
power of two.
cmd_head - Modified by the kernel to indicate when a command has been
placed on the ring.
cmd_tail - Modified by userspace to indicate when it has completed
processing of a command.

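A handler may want to sanity-check the mailbox before touching the
ring. The sketch below shows one way to do that. struct demo_mailbox
and demo_mailbox_ok() are assumptions made to keep the example
self-contained; the authoritative layout is struct tcmu_mailbox in
target_core_user.h, and the exact set of checks is a judgment call.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the mailbox fields listed above. This is NOT
 * the real layout; use struct tcmu_mailbox in actual handlers. */
struct demo_mailbox {
  uint16_t version;
  uint16_t flags;
  uint32_t cmdr_off;
  uint32_t cmdr_size;
  uint32_t cmd_head;
  uint32_t cmd_tail;
};

/* Return 1 if the mailbox looks usable for a mapping of map_len bytes. */
static int demo_mailbox_ok(const struct demo_mailbox *mb, size_t map_len)
{
  if (mb->version != 1)
    return 0;                 /* unknown interface version: give up */
  if (mb->cmdr_size == 0)
    return 0;
  if (mb->cmdr_off < sizeof(*mb))
    return 0;                 /* ring must start after the mailbox */
  if ((uint64_t)mb->cmdr_off + mb->cmdr_size > map_len)
    return 0;                 /* ring must fit inside the mapping */
  if (mb->cmd_head >= mb->cmdr_size || mb->cmd_tail >= mb->cmdr_size)
    return 0;                 /* head and tail are offsets into the ring */
  return 1;
}
```
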
The Command Ring:

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.

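The head/tail arithmetic described above can be sketched as follows.
The function names are hypothetical; the point is only that both
offsets advance modulo cmdr_size, which need not be a power of two, so
a plain modulo is used rather than a bit mask.

```c
#include <stdint.h>

/* Advance a ring offset by len bytes, wrapping at size. */
static uint32_t ring_advance(uint32_t pos, uint32_t len, uint32_t size)
{
  return (uint32_t)(((uint64_t)pos + len) % size);
}

/* The ring is empty when head and tail coincide. */
static int ring_is_empty(uint32_t head, uint32_t tail)
{
  return head == tail;
}

/* Bytes posted by the kernel but not yet completed by userspace. */
static uint32_t ring_pending(uint32_t head, uint32_t tail, uint32_t size)
{
  return (uint32_t)(((uint64_t)head + size - tail) % size);
}
```
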
TCMU commands are 8-byte aligned. They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).

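Since entries are 8-byte aligned, the low three bits of any entry
length are always zero, which is what leaves room for the opcode. The
sketch below illustrates the packing under the assumption of a 3-bit
opcode mask. The demo_ names are hypothetical; real handlers should
use the accessors provided by target_core_user.h instead.

```c
#include <stdint.h>

#define DEMO_OP_MASK 0x7u  /* assumption: opcode lives in the low 3 bits */

/* Extract the opcode from a packed len_op value. */
static uint32_t demo_get_op(uint32_t len_op)
{
  return len_op & DEMO_OP_MASK;
}

/* Extract the entry length (a multiple of 8) from a packed value. */
static uint32_t demo_get_len(uint32_t len_op)
{
  return len_op & ~DEMO_OP_MASK;
}

/* Combine an 8-byte-aligned length and an opcode into one value. */
static uint32_t demo_pack(uint32_t len, uint32_t op)
{
  return (len & ~DEMO_OP_MASK) | (op & DEMO_OP_MASK);
}
```
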
Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Descriptor
Block) via tcmu_cmd_entry.req.cdb_off. This is an offset from the
start of the overall shared memory region, not the entry. The data
in/out buffers are accessible via the req.iov[] array. iov_cnt
contains the number of entries in iov[] needed to describe either the
Data-In or Data-Out buffers. For bidirectional commands, iov_cnt
specifies how many iovec entries cover the Data-Out area, and
iov_bidi_cnt specifies how many iovec entries immediately after that
in iov[] cover the Data-In area. Just like other fields, iov.iov_base
is an offset from the start of the region.

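To make the "iov_base is an offset" rule concrete, here is a minimal
sketch that copies the bytes described by an iovec array out of the
shared region into a private buffer. demo_gather() is a hypothetical
helper, and a real handler would more likely operate on the data in
place; the sketch only shows how each iov_base is rebased onto the
mapping before use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Copy the bytes described by iov[0..iov_cnt) into dst. Each iov_base
 * is treated as an offset from region, not as a pointer. Returns the
 * number of bytes copied, or -1 if dst is too small. */
static long demo_gather(void *region, const struct iovec *iov,
                        size_t iov_cnt, void *dst, size_t dst_len)
{
  size_t done = 0;

  for (size_t i = 0; i < iov_cnt; i++) {
    const uint8_t *src = (uint8_t *)region + (uintptr_t)iov[i].iov_base;

    if (iov[i].iov_len > dst_len - done)
      return -1;
    memcpy((uint8_t *)dst + done, src, iov[i].iov_len);
    done += iov[i].iov_len;
  }
  return (long)done;
}
```
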
When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by the entry's length, as encoded in hdr.len_op (mod
cmdr_size), and signals the kernel via the UIO method, a 4-byte write
to the file descriptor.

When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

More opcodes may be added in the future. If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands, if any.

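The three cases above (CMD, PAD, anything else) can be sketched as a
small classification step. Everything prefixed SKETCH_ or sketch_ is
hypothetical: the opcode values and header layout are assumptions made
so the example compiles on its own, and a real handler would use the
definitions in target_core_user.h.

```c
#include <stdint.h>

#define SKETCH_OP_MASK          0x7u  /* assumption: 3-bit opcode field */
#define SKETCH_OP_PAD           0u    /* assumed value of TCMU_OP_PAD */
#define SKETCH_OP_CMD           1u    /* assumed value of TCMU_OP_CMD */
#define SKETCH_UFLAG_UNKNOWN_OP 0x1u  /* bit 0 of hdr.uflags */

/* Illustrative stand-in for the common entry header. */
struct sketch_hdr {
  uint32_t len_op;
  uint16_t cmd_id;
  uint8_t kflags;
  uint8_t uflags;
};

/* Returns 1 if the entry is a CMD the caller must execute, 0 otherwise.
 * Unknown opcodes are flagged for the kernel; PAD entries need no
 * action. In every case the caller still advances cmd_tail. */
static int sketch_classify(struct sketch_hdr *hdr)
{
  uint32_t op = hdr->len_op & SKETCH_OP_MASK;

  if (op == SKETCH_OP_CMD)
    return 1;
  if (op != SKETCH_OP_PAD)
    hdr->uflags |= SKETCH_UFLAG_UNKNOWN_OP;
  return 0;
}
```
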
The Data Area:

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.


Device Discovery:

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning
/sys/class/uio/uio*/name. For TCMU devices, these names will be of the
format:

tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at:

/sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.

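Splitting a UIO name into its components, and deriving the configfs
directory from it, might look like the sketch below. The struct, the
function names and the buffer sizes are all hypothetical choices made
for illustration; the name is assumed to have already had its trailing
newline removed.

```c
#include <stdio.h>
#include <string.h>

struct demo_uio_name {
  unsigned int hba_num;
  char device_name[64];
  char subtype[64];
  char path[256];   /* empty if the name carries no <path> */
};

/* Parse a UIO name of the form described above. Returns 0 on success,
 * -1 if the string is not a TCMU-style name. */
static int demo_parse_uio_name(const char *name, struct demo_uio_name *out)
{
  int n;

  memset(out, 0, sizeof(*out));
  n = sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]/%255[^\n]",
             &out->hba_num, out->device_name, out->subtype, out->path);
  return n >= 3 ? 0 : -1;
}

/* Build the configfs directory for the device, assuming the usual
 * mount point. Returns the snprintf() result. */
static int demo_configfs_path(const struct demo_uio_name *n,
                              char *buf, size_t len)
{
  return snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
                  n->hba_num, n->device_name);
}
```

For a made-up name such as "tcm-user/1/disk0/file//tmp/img", this
yields hba_num 1, device_name "disk0", subtype "file" and path
"/tmp/img". A handler would then compare subtype against the string it
is responsible for before opening the device.
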
For all devices so discovered, the user handler opens /dev/uioX and
calls mmap():

mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.


Device Events:

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.


Other contingencies:

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds).

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. Command ring is preserved. However, after the timeout period,
  the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
-------------------------------------------------------

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices:

(error checking omitted for brevity)

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int fd, dev_fd;
ssize_t ret;
char buf[256];
unsigned long long map_len;
void *map;

fd = open("/sys/class/uio/uio0/name", O_RDONLY);
ret = read(fd, buf, sizeof(buf));
close(fd);
buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

/* we only want uio devices whose name is a format we expect */
if (strncmp(buf, "tcm-user", 8))
  exit(-1);

/* Further checking for subtype also needed here */

fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
ret = read(fd, buf, sizeof(buf));
close(fd);
buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

map_len = strtoull(buf, NULL, 0);

dev_fd = open("/dev/uio0", O_RDWR);
map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);


b) Waiting for events on the device(s)

while (1) {
  char buf[4];

  int ret = read(dev_fd, buf, 4); /* will block */

  handle_device_events(dev_fd, map);
}


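The loop above blocks on a single device. A handler that serves
several devices could instead wait on all of their descriptors at
once. The following is a sketch of that approach using poll();
demo_wait_any() and DEMO_MAX_DEVS are hypothetical, and the fixed
upper bound on devices is only there to keep the example short.

```c
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

#define DEMO_MAX_DEVS 16

/* Block until one of the UIO file descriptors becomes readable, consume
 * its 4-byte event counter, and return the index of that descriptor.
 * Returns -1 on error or timeout. */
static int demo_wait_any(const int *fds, int nfds, int timeout_ms)
{
  struct pollfd pfds[DEMO_MAX_DEVS];
  uint32_t count;

  if (nfds <= 0 || nfds > DEMO_MAX_DEVS)
    return -1;
  for (int i = 0; i < nfds; i++) {
    pfds[i].fd = fds[i];
    pfds[i].events = POLLIN;
    pfds[i].revents = 0;
  }
  if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
    return -1;
  for (int i = 0; i < nfds; i++) {
    if (!(pfds[i].revents & POLLIN))
      continue;
    if (read(pfds[i].fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
      return -1;
    return i;
  }
  return -1;
}
```

The caller would then run its ring-processing routine for the device
at the returned index, passing that device's descriptor and mapping.
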
c) Managing the command ring

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/target_core_user.h>

int handle_device_events(int fd, void *map)
{
  struct tcmu_mailbox *mb = map;
  struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
  int did_some_work = 0;

  /* Process events from cmd ring until we catch up with cmd_head */
  while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {

    if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
      uint8_t *cdb = (void *)mb + ent->req.cdb_off;
      bool success = true;

      /* Handle command here. */
      printf("SCSI opcode: 0x%x\n", cdb[0]);

      /* Set response fields. The status constants are assumed to be
         defined by the handler; see "A final note" on their values. */
      if (success)
        ent->rsp.scsi_status = SCSI_NO_SENSE;
      else {
        /* Also fill in rsp.sense_buffer here */
        ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
      }
    }
    else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
      /* Tell the kernel we didn't handle unknown opcodes */
      ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
    }
    else {
      /* Do nothing for PAD entries except update cmd_tail */
    }

    /* update cmd_tail */
    mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
    ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
    did_some_work = 1;
  }

  /* Notify the kernel that work has been finished */
  if (did_some_work) {
    uint32_t buf = 0;

    write(fd, &buf, 4);
  }

  return 0;
}


Command filtering and pass_level
--------------------------------

TCMU supports a "pass_level" option with valid values of 0 or 1. When
the value is 0 (the default), nearly all SCSI commands received for
the device are passed through to the handler. This allows maximum
flexibility but increases the amount of code required by the handler,
to support all mandatory SCSI commands. If pass_level is set to 1,
then only IO-related commands are presented, and the rest are handled
by LIO's in-kernel command emulation. The commands presented at level
1 include all versions of:

READ
WRITE
WRITE_VERIFY
XDWRITEREAD
WRITE_SAME
COMPARE_AND_WRITE
SYNCHRONIZE_CACHE
UNMAP


A final note
------------

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.
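
As an illustration of returning spec-defined values, the sketch below
defines the two status codes mentioned in this document and fills in
fixed-format sense data. The DEMO_ constants and demo_set_sense() are
hypothetical names. The byte positions follow the fixed-format sense
data layout (response code 0x70) from the SCSI Primary Commands
specification, and the 18-byte minimum is an assumption about how much
of the buffer the handler wants to populate.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_SAM_STAT_GOOD            0x00
#define DEMO_SAM_STAT_CHECK_CONDITION 0x02
#define DEMO_SENSE_LEN                18  /* minimal fixed-format sense */

/* Fill buf with fixed-format sense data. Does nothing if buf is too
 * small to hold the minimal 18-byte layout. */
static void demo_set_sense(uint8_t *buf, size_t len, uint8_t key,
                           uint8_t asc, uint8_t ascq)
{
  if (len < DEMO_SENSE_LEN)
    return;
  memset(buf, 0, len);
  buf[0] = 0x70;        /* current error, fixed format */
  buf[2] = key & 0x0f;  /* sense key */
  buf[7] = 0x0a;        /* additional sense length: bytes 8..17 */
  buf[12] = asc;        /* additional sense code */
  buf[13] = ascq;       /* additional sense code qualifier */
}
```

For example, a handler rejecting an unsupported CDB opcode could set
the status to CHECK CONDITION (0x02) and call demo_set_sense() with
sense key 0x05 (ILLEGAL REQUEST), ASC 0x20 and ASCQ 0x00.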