Ian Kent | 4b22ff1 | 2008-10-15 22:02:53 -0700 | 1 | |
| 2 | Miscellaneous Device control operations for the autofs4 kernel module |
| 3 | ==================================================================== |
| 4 | |
| 5 | The problem |
| 6 | =========== |
| 7 | |
| 8 | There is a problem with active restarts in autofs (that is to say |
| 9 | restarting autofs when there are busy mounts). |
| 10 | |
| 11 | During normal operation autofs uses a file descriptor opened on the |
| 12 | directory that is being managed in order to be able to issue control |
| 13 | operations. Using a file descriptor gives ioctl operations access to |
| 14 | autofs specific information stored in the super block. The operations |
| 15 | are things such as setting an autofs mount catatonic, setting the |
| 16 | expire timeout and requesting expire checks. As is explained below, |
| 17 | certain types of autofs triggered mounts can end up covering an autofs |
| 18 | mount itself, which prevents us from using open(2) to obtain a |
| 19 | file descriptor for these operations if we don't already have one open. |
| 20 | |
| 21 | Currently autofs uses "umount -l" (lazy umount) to clear active mounts |
| 22 | at restart. While using lazy umount works for most cases, anything that |
| 23 | needs to walk back up the mount tree to construct a path, such as |
| 24 | getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works |
| 25 | because the point from which the path is constructed has been detached |
| 26 | from the mount tree. |
| 27 | |
| 28 | The actual problem with autofs is that it can't reconnect to existing |
| 29 | mounts. One immediately thinks that just adding the ability to remount |
| 30 | autofs file systems would solve it, but alas, that can't work. This is |
| 31 | because autofs direct mounts and the implementation of "on demand mount |
| 32 | and expire" of nested mount trees have the file system mounted directly |
| 33 | on top of the mount trigger directory dentry. |
| 34 | |
| 35 | For example, there are two types of automount maps, direct (in the kernel |
| 36 | module source you will see a third type called an offset, which is just |
| 37 | a direct mount in disguise) and indirect. |
| 38 | |
| 39 | Here is a master map with direct and indirect map entries: |
| 40 | |
| 41 | /- /etc/auto.direct |
| 42 | /test /etc/auto.indirect |
| 43 | |
| 44 | and the corresponding map files: |
| 45 | |
| 46 | /etc/auto.direct: |
| 47 | |
| 48 | /automount/dparse/g6 budgie:/autofs/export1 |
| 49 | /automount/dparse/g1 shark:/autofs/export1 |
| 50 | and so on. |
| 51 | |
| 52 | /etc/auto.indirect: |
| 53 | |
| 54 | g1 shark:/autofs/export1 |
| 55 | g6 budgie:/autofs/export1 |
| 56 | and so on. |
| 57 | |
| 58 | For the above indirect map an autofs file system is mounted on /test and |
| 59 | mounts are triggered for each sub-directory key by the inode lookup |
| 60 | operation. So we see a mount of shark:/autofs/export1 on /test/g1, for |
| 61 | example. |
| 62 | |
| 63 | The way that direct mounts are handled is by making an autofs mount on |
| 64 | each full path, such as /automount/dparse/g1, and using it as a mount |
| 65 | trigger. So when we walk on the path we mount shark:/autofs/export1 "on |
| 66 | top of this mount point". Since these are always directories we can |
| 67 | use the follow_link inode operation to trigger the mount. |
| 68 | |
| 69 | But, each entry in direct and indirect maps can have offsets (making |
| 70 | them multi-mount map entries). |
| 71 | |
| 72 | For example, an indirect mount map entry could also be: |
| 73 | |
| 74 | g1 \ |
| 75 | / shark:/autofs/export5/testing/test \ |
| 76 | /s1 shark:/autofs/export/testing/test/s1 \ |
| 77 | /s2 shark:/autofs/export5/testing/test/s2 \ |
| 78 | /s1/ss1 shark:/autofs/export1 \ |
| 79 | /s2/ss2 shark:/autofs/export2 |
| 80 | |
| 81 | and similarly a direct mount map entry could also be: |
| 82 | |
| 83 | /automount/dparse/g1 \ |
| 84 | / shark:/autofs/export5/testing/test \ |
| 85 | /s1 shark:/autofs/export/testing/test/s1 \ |
| 86 | /s2 shark:/autofs/export5/testing/test/s2 \ |
| 87 | /s1/ss1 shark:/autofs/export2 \ |
| 88 | /s2/ss2 shark:/autofs/export2 |
| 89 | |
| 90 | One of the issues with version 4 of autofs was that, when mounting an |
| 91 | entry with a large number of offsets, possibly with nesting, we needed |
| 92 | to mount and umount all of the offsets as a single unit. Not really a |
| 93 | problem, except for people with a large number of offsets in map entries. |
| 94 | This mechanism is used for the well known "hosts" map and we have seen |
| 95 | cases (in 2.4) where the available number of mounts is exhausted or |
| 96 | where the number of privileged ports available is exhausted. |
| 97 | |
| 98 | In version 5 we mount only as we go down the tree of offsets, and |
| 99 | similarly for expiring them, which resolves the above problem. There is |
| 100 | somewhat more detail to the implementation but it isn't needed for the |
| 101 | sake of the problem explanation. The one important detail is that these |
| 102 | offsets are implemented using the same mechanism as the direct mounts |
| 103 | above and so the mount points can be covered by a mount. |
| 104 | |
| 105 | The current autofs implementation uses an ioctl file descriptor opened |
| 106 | on the mount point for control operations. The references held by the |
| 107 | descriptor are accounted for in checks made to determine if a mount is |
| 108 | in use, and the descriptor is also used to access autofs file system |
| 109 | information held in the mount super block. So the use of a file handle |
| 110 | needs to be retained. |
| 111 | |
| 112 | |
| 113 | The Solution |
| 114 | ============ |
| 115 | |
| 116 | To be able to restart autofs leaving existing direct, indirect and |
| 117 | offset mounts in place we need to be able to obtain a file handle |
| 118 | for these potentially covered autofs mount points. Rather than just |
| 119 | implement an isolated operation it was decided to re-implement the |
| 120 | existing ioctl interface and add new operations to provide this |
| 121 | functionality. |
| 122 | |
| 123 | In addition, to be able to reconstruct a mount tree that has busy mounts, |
| 124 | the uid and gid of the last user that triggered the mount need to be |
| 125 | available because these can be used as macro substitution variables in |
| 126 | autofs maps. They are recorded at mount request time and an operation |
| 127 | has been added to retrieve them. |
| 128 | |
| 129 | Since we're re-implementing the control interface, a couple of other |
| 130 | problems with the existing interface have been addressed. First, when |
| 131 | a mount or expire operation completes, a status is returned to the |
| 132 | kernel by either a "send ready" or a "send fail" operation. The |
| 133 | "send fail" operation of the ioctl interface could only ever send |
| 134 | ENOENT so the re-implementation allows user space to send an actual |
| 135 | status. Another expensive operation in user space, for those using |
| 136 | very large maps, is discovering if a mount is present. Usually this |
| 137 | involves scanning /proc/mounts and since it needs to be done quite |
| 138 | often it can introduce significant overhead when there are many entries |
| 139 | in the mount table. An operation to look up the mount status of a mount |
| 140 | point dentry (covered or not) has also been added. |
| 141 | |
| 142 | Current kernel development policy recommends avoiding the use of the |
| 143 | ioctl mechanism in favor of systems such as Netlink. An implementation |
| 144 | using this system was attempted to evaluate its suitability and it was |
| 145 | found to be inadequate, in this case. The Generic Netlink system was |
| 146 | used for this as raw Netlink would lead to a significant increase in |
| 147 | complexity. There's no question that the Generic Netlink system is an |
| 148 | elegant solution for common case ioctl functions but it's not a complete |
| 149 | replacement, probably because its primary purpose in life is to be a |
| 150 | message bus implementation rather than specifically an ioctl replacement. |
| 151 | While it would be possible to work around this there is one concern |
| 152 | that led to the decision to not use it. This is that the autofs |
| 153 | expire in the daemon has become far too complex because umount |
| 154 | candidates are enumerated, almost for no other reason than to "count" |
| 155 | the number of times to call the expire ioctl. This involves scanning |
| 156 | the mount table which has proved to be a big overhead for users with |
| 157 | large maps. The best way to improve this is to try to get back to the |
| 158 | way the expire was done long ago. That is, when an expire request is |
| 159 | issued for a mount (file handle) we should continually call back to |
| 160 | the daemon until we can't umount any more mounts, then return the |
| 161 | appropriate status to the daemon. At the moment we just expire one |
| 162 | mount at a time. A Generic Netlink implementation would exclude this |
| 163 | possibility for future development due to the requirements of the |
| 164 | message bus architecture. |
| 165 | |
| 166 | |
| 167 | autofs4 Miscellaneous Device mount control interface |
| 168 | ==================================================== |
| 169 | |
| 170 | The control interface is accessed by opening a device node, typically /dev/autofs. |
| 171 | |
| 172 | All the ioctls use a common structure to pass the needed parameter |
| 173 | information and return operation results: |
| 174 | |
| 175 | struct autofs_dev_ioctl { |
| 176 | __u32 ver_major; |
| 177 | __u32 ver_minor; |
| 178 | __u32 size; /* total size of data passed in |
| 179 | * including this struct */ |
| 180 | __s32 ioctlfd; /* automount command fd */ |
| 181 | |
| 182 | __u32 arg1; /* Command parameters */ |
| 183 | __u32 arg2; |
| 184 | |
| 185 | char path[0]; |
| 186 | }; |
| 187 | |
| 188 | The ioctlfd field is a mount point file descriptor of an autofs mount |
| 189 | point. It is returned by the open call and is used by all calls except |
| 190 | the check for whether a given path is a mount point, where it may |
| 191 | optionally be used to check a specific mount corresponding to a given |
| 192 | mount point file descriptor, and when requesting the uid and gid of the |
| 193 | last successful mount on a directory within the autofs file system. |
| 194 | |
| 195 | The fields arg1 and arg2 are used to communicate parameters and results of |
| 196 | calls made as described below. |
| 197 | |
| 198 | The path field is used to pass a path where it is needed and the size field |
| 199 | is used to account for the increased structure length when translating the |
| 200 | structure sent from user space. |
| 201 | |
| 202 | This structure can be initialized before setting specific fields by using |
| 203 | the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *). |
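As a rough illustration of what such an initializer does, the user-space sketch below mirrors the structure above with <stdint.h> types so that it builds without kernel headers. The names sketch_dev_ioctl and sketch_init_dev_ioctl, the version numbers 1.0 and the choice of -1 for an unset ioctlfd are assumptions made for this example; real code would use the definitions from the kernel header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed interface version for this sketch; the real values come from
 * the kernel header. */
#define SKETCH_VER_MAJOR 1
#define SKETCH_VER_MINOR 0

/* Local mirror of struct autofs_dev_ioctl (hypothetical name). */
struct sketch_dev_ioctl {
	uint32_t ver_major;
	uint32_t ver_minor;
	uint32_t size;      /* total size of data passed in, including this struct */
	int32_t  ioctlfd;   /* automount command fd */
	uint32_t arg1;      /* command parameters */
	uint32_t arg2;
	char     path[];    /* C99 flexible array member */
};

/* Zero the fixed part, record version and size, and mark ioctlfd as
 * "no descriptor yet" (assumed to be -1). */
static void sketch_init_dev_ioctl(struct sketch_dev_ioctl *in)
{
	assert(in != NULL);
	memset(in, 0, sizeof(*in));
	in->ver_major = SKETCH_VER_MAJOR;
	in->ver_minor = SKETCH_VER_MINOR;
	in->size = (uint32_t)sizeof(*in);
	in->ioctlfd = -1;
}
```

Callers would then set only the fields a particular command needs, as described for each ioctl below.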
| 204 | |
| 205 | All of the ioctls perform a copy of this structure from user space to |
| 206 | kernel space and return -EINVAL if the size parameter is smaller than |
| 207 | the structure size itself, -ENOMEM if the kernel memory allocation fails |
| 208 | or -EFAULT if the copy itself fails. Other checks include a version check |
| 209 | of the compiled in user space version against the module version and a |
| 210 | mismatch results in a -EINVAL return. If the size field is greater than |
| 211 | the structure size then a path is assumed to be present and is checked to |
| 212 | ensure it begins with a "/" and is NUL terminated, otherwise -EINVAL is |
| 213 | returned. Following these checks, for all ioctl commands except |
| 214 | AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and |
| 215 | AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is |
| 216 | not a valid descriptor or doesn't correspond to an autofs mount point |
| 217 | an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is |
| 218 | returned. |
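The size, version and path checks described above can be sketched in user space as follows. This is not the kernel's code: the structure is a local mirror with a hypothetical name, the version numbers are assumed, and the exact rule for what counts as a version mismatch (same major, minor not newer) is an assumption for illustration. The memory allocation, copy and ioctlfd checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_VER_MAJOR 1   /* assumed version numbers */
#define SKETCH_VER_MINOR 0

/* Local mirror of struct autofs_dev_ioctl (hypothetical name). */
struct sketch_dev_ioctl {
	uint32_t ver_major;
	uint32_t ver_minor;
	uint32_t size;
	int32_t  ioctlfd;
	uint32_t arg1;
	uint32_t arg2;
	char     path[];
};

/* Returns 0 if the request passes the checks, -EINVAL otherwise.
 * The caller guarantees that p points to at least p->size bytes. */
static int sketch_validate(const struct sketch_dev_ioctl *p)
{
	size_t extra;

	assert(p != NULL);

	/* The size must cover at least the fixed part of the structure. */
	if (p->size < sizeof(*p))
		return -EINVAL;

	/* Assumed mismatch rule: same major, minor not newer than ours. */
	if (p->ver_major != SKETCH_VER_MAJOR ||
	    p->ver_minor > SKETCH_VER_MINOR)
		return -EINVAL;

	/* Anything beyond the fixed part is treated as a path. */
	extra = p->size - sizeof(*p);
	if (extra > 0) {
		if (p->path[0] != '/')
			return -EINVAL;
		if (memchr(p->path, '\0', extra) == NULL)
			return -EINVAL;
	}
	return 0;
}
```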
| 219 | |
| 220 | |
| 221 | The ioctls |
| 222 | ========== |
| 223 | |
| 224 | An example of an implementation which uses this interface can be seen |
| 225 | in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the |
| 226 | distribution tar available for download from kernel.org in directory |
| 227 | /pub/linux/daemons/autofs/v5. |
| 228 | |
| 229 | The device node ioctl operations implemented by this interface are: |
| 230 | |
| 231 | |
| 232 | AUTOFS_DEV_IOCTL_VERSION |
| 233 | ------------------------ |
| 234 | |
| 235 | Get the major and minor version of the autofs4 device ioctl kernel module |
| 236 | implementation. It requires an initialized struct autofs_dev_ioctl as an |
| 237 | input parameter and sets the version information in the passed in structure. |
| 238 | It returns 0 on success or the error -EINVAL if a version mismatch is |
| 239 | detected. |
| 240 | |
| 241 | |
| 242 | AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD |
| 243 | ------------------------------------------------------------------ |
| 244 | |
| 245 | Get the major and minor version of the autofs4 protocol version understood |
| 246 | by the loaded module. This call requires an initialized struct autofs_dev_ioctl |
| 247 | with the ioctlfd field set to a valid autofs mount point descriptor |
| 248 | and sets the requested version number in structure field arg1. These |
| 249 | commands return 0 on success or one of the negative error codes if |
| 250 | validation fails. |
| 251 | |
| 252 | |
| 253 | AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT |
| 254 | ---------------------------------------------------------- |
| 255 | |
| 256 | Obtain and release a file descriptor for an autofs managed mount point |
| 257 | path. The open call requires an initialized struct autofs_dev_ioctl with |
| 258 | the path field set and the size field adjusted appropriately as well |
| 259 | as the arg1 field set to the device number of the autofs mount. The |
| 260 | device number can be obtained from the mount options shown in |
| 261 | /proc/mounts. The close call requires an initialized struct |
| 262 | autofs_dev_ioctl with the ioctlfd field set to the descriptor obtained |
| 263 | from the open call. The release of the file descriptor can also be done |
| 264 | with close(2) so any open descriptors will also be closed at process exit. |
| 265 | The close call is included in the implemented operations largely for |
| 266 | completeness and to provide for a consistent user space implementation. |
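Preparing the argument for the open call might look roughly like the sketch below, which allocates room for the path after the fixed part and adjusts the size field to match. The structure is a local mirror, and the helper name sketch_openmount_request and the version numbers are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_VER_MAJOR 1   /* assumed version numbers */
#define SKETCH_VER_MINOR 0

/* Local mirror of struct autofs_dev_ioctl (hypothetical name). */
struct sketch_dev_ioctl {
	uint32_t ver_major;
	uint32_t ver_minor;
	uint32_t size;
	int32_t  ioctlfd;
	uint32_t arg1;
	uint32_t arg2;
	char     path[];
};

/* Build a request carrying a path and the device number of the autofs
 * mount. Returns NULL on allocation failure; the caller releases the
 * result with free(). */
static struct sketch_dev_ioctl *
sketch_openmount_request(const char *path, uint32_t devid)
{
	struct sketch_dev_ioctl *p;
	size_t len;

	assert(path != NULL && path[0] == '/');

	len = strlen(path) + 1;          /* include the terminating NUL */
	p = calloc(1, sizeof(*p) + len);
	if (p == NULL)
		return NULL;

	p->ver_major = SKETCH_VER_MAJOR;
	p->ver_minor = SKETCH_VER_MINOR;
	p->size = (uint32_t)(sizeof(*p) + len);  /* fixed part plus path */
	p->ioctlfd = -1;                 /* to be filled in by the open call */
	p->arg1 = devid;                 /* device number of the autofs mount */
	memcpy(p->path, path, len);
	return p;
}
```

The resulting buffer would then be handed to ioctl(2) on the /dev/autofs descriptor with the open mount request, after which the ioctlfd field holds the new descriptor.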
| 267 | |
| 268 | |
| 269 | AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD |
| 270 | -------------------------------------------------------- |
| 271 | |
| 272 | Return mount and expire result status from user space to the kernel. |
| 273 | Both of these calls require an initialized struct autofs_dev_ioctl |
| 274 | with the ioctlfd field set to the descriptor obtained from the open |
| 275 | call and the arg1 field set to the wait queue token number, received |
| 276 | by user space in the foregoing mount or expire request. The arg2 field |
| 277 | is set to the status to be returned. For the ready call this is always |
| 278 | 0 and for the fail call it is set to the errno of the operation. |
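The field usage for the two calls can be sketched like this. The structure is a local mirror and the helper names are hypothetical. The text above does not say whether the status is expected as a positive or a negated errno value, so the sketch stores whatever the caller supplies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_VER_MAJOR 1   /* assumed version numbers */
#define SKETCH_VER_MINOR 0

/* Local mirror of struct autofs_dev_ioctl (hypothetical name). */
struct sketch_dev_ioctl {
	uint32_t ver_major;
	uint32_t ver_minor;
	uint32_t size;
	int32_t  ioctlfd;
	uint32_t arg1;
	uint32_t arg2;
	char     path[];
};

/* Common part: ioctlfd from the open call, arg1 = wait queue token,
 * arg2 = status to return to the kernel. */
static void sketch_status_request(struct sketch_dev_ioctl *p, int32_t ioctlfd,
				  uint32_t token, uint32_t status)
{
	assert(p != NULL);
	memset(p, 0, sizeof(*p));
	p->ver_major = SKETCH_VER_MAJOR;
	p->ver_minor = SKETCH_VER_MINOR;
	p->size = (uint32_t)sizeof(*p);
	p->ioctlfd = ioctlfd;
	p->arg1 = token;
	p->arg2 = status;
}

/* "send ready": the status is always 0. */
static void sketch_ready_request(struct sketch_dev_ioctl *p, int32_t ioctlfd,
				 uint32_t token)
{
	sketch_status_request(p, ioctlfd, token, 0);
}

/* "send fail": the status is the errno of the failed operation. */
static void sketch_fail_request(struct sketch_dev_ioctl *p, int32_t ioctlfd,
				uint32_t token, int op_errno)
{
	sketch_status_request(p, ioctlfd, token, (uint32_t)op_errno);
}
```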
| 279 | |
| 280 | |
| 281 | AUTOFS_DEV_IOCTL_SETPIPEFD_CMD |
| 282 | ------------------------------ |
| 283 | |
| 284 | Set the pipe file descriptor used for kernel communication to the daemon. |
| 285 | Normally this is set at mount time using an option but when reconnecting |
| 286 | to an existing mount we need to use this to tell the autofs mount about |
| 287 | the new kernel pipe descriptor. In order to protect mounts against |
| 288 | incorrectly setting the pipe descriptor we also require that the autofs |
| 289 | mount be catatonic (see next call). |
| 290 | |
| 291 | The call requires an initialized struct autofs_dev_ioctl with the |
| 292 | ioctlfd field set to the descriptor obtained from the open call and |
| 293 | the arg1 field set to the descriptor of the pipe. On success the call |
| 294 | also sets the process group id used to identify the controlling process |
| 295 | (e.g. the owning automount(8) daemon) to the process group of the caller. |
| 296 | |
| 297 | |
| 298 | AUTOFS_DEV_IOCTL_CATATONIC_CMD |
| 299 | ------------------------------ |
| 300 | |
| 301 | Make the autofs mount point catatonic. The autofs mount will no longer |
| 302 | issue mount requests, the kernel communication pipe descriptor is released |
| 303 | and any remaining waits in the queue are released. |
| 304 | |
| 305 | The call requires an initialized struct autofs_dev_ioctl with the |
| 306 | ioctlfd field set to the descriptor obtained from the open call. |
| 307 | |
| 308 | |
| 309 | AUTOFS_DEV_IOCTL_TIMEOUT_CMD |
| 310 | ---------------------------- |
| 311 | |
| 312 | Set the expire timeout for mounts within an autofs mount point. |
| 313 | |
| 314 | The call requires an initialized struct autofs_dev_ioctl with the |
| 315 | ioctlfd field set to the descriptor obtained from the open call. |
| 316 | |
| 317 | |
| 318 | AUTOFS_DEV_IOCTL_REQUESTER_CMD |
| 319 | ------------------------------ |
| 320 | |
| 321 | Return the uid and gid of the last process to successfully trigger the |
| 322 | mount on the given path dentry. |
| 323 | |
| 324 | The call requires an initialized struct autofs_dev_ioctl with the path |
| 325 | field set to the mount point in question and the size field adjusted |
| 326 | appropriately as well as the arg1 field set to the device number of the |
| 327 | containing autofs mount. Upon return the struct field arg1 contains the |
| 328 | uid and arg2 the gid. |
| 329 | |
| 330 | When reconstructing an autofs mount tree with active mounts we need to |
| 331 | re-connect to mounts that may have used the original process uid and |
| 332 | gid (or string variations of them) for mount lookups within the map entry. |
| 333 | This call provides the ability to obtain this uid and gid so they may be |
| 334 | used by user space for the mount map lookups. |
| 335 | |
| 336 | |
| 337 | AUTOFS_DEV_IOCTL_EXPIRE_CMD |
| 338 | --------------------------- |
| 339 | |
| 340 | Issue an expire request to the kernel for an autofs mount. Typically |
| 341 | this ioctl is called until no further expire candidates are found. |
| 342 | |
| 343 | The call requires an initialized struct autofs_dev_ioctl with the |
| 344 | ioctlfd field set to the descriptor obtained from the open call. In |
| 345 | addition an immediate expire, independent of the mount timeout, can be |
| 346 | requested by setting the arg1 field to 1. If no expire candidates can |
| 347 | be found the ioctl returns -1 with errno set to EAGAIN. |
| 348 | |
| 349 | This call causes the kernel module to check the mount corresponding |
| 350 | to the given ioctlfd for mounts that can be expired, issues an expire |
| 351 | request back to the daemon and waits for completion. |
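The "call until no further candidates are found" pattern can be sketched as a small loop. To keep the example runnable without an autofs mount, the expire request itself is passed in as a callback that is expected to behave like ioctl(2), and a stand-in callback is included for demonstration. All names here are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Performs one expire request; expected to behave like ioctl(2):
 * 0 on success, -1 with errno set on failure. The "how" argument
 * corresponds to arg1 (1 requests an immediate expire). */
typedef int (*sketch_expire_fn)(void *ctx, unsigned int how);

/* Repeat the expire request until it reports EAGAIN. Returns the number
 * of successful expires, or -1 if a different error occurred. */
static int sketch_expire_all(sketch_expire_fn expire_once, void *ctx,
			     unsigned int how)
{
	int expired = 0;

	assert(expire_once != NULL);
	for (;;) {
		if (expire_once(ctx, how) == 0) {
			expired++;
			continue;
		}
		if (errno == EAGAIN)
			return expired;   /* no more candidates */
		return -1;
	}
}

/* Stand-in for the real ioctl, for demonstration only: pretends that
 * *ctx mounts can still be expired. */
static int sketch_fake_expire(void *ctx, unsigned int how)
{
	int *remaining = ctx;

	(void)how;
	if (*remaining == 0) {
		errno = EAGAIN;
		return -1;
	}
	(*remaining)--;
	return 0;
}
```

In a real daemon the callback would wrap the expire ioctl on the /dev/autofs descriptor with ioctlfd set as described above.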
| 352 | |
| 353 | AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD |
| 354 | ------------------------------ |
| 355 | |
| 356 | Checks if an autofs mount point is in use. |
| 357 | |
| 358 | The call requires an initialized struct autofs_dev_ioctl with the |
| 359 | ioctlfd field set to the descriptor obtained from the open call and |
| 360 | it returns the result in the arg1 field, 1 for busy and 0 otherwise. |
| 361 | |
| 362 | |
| 363 | AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD |
| 364 | --------------------------------- |
| 365 | |
| 366 | Check if the given path is a mountpoint. |
| 367 | |
| 368 | The call requires an initialized struct autofs_dev_ioctl. There are two |
| 369 | possible variations. Both use the path field set to the path of the mount |
| 370 | point to check and the size field adjusted appropriately. One uses the |
| 371 | ioctlfd field to identify a specific mount point to check while the other |
| 372 | variation uses the path and optionally arg1 set to an autofs mount type. |
| 373 | The call returns 1 if this is a mount point and sets arg1 to the device |
| 374 | number of the mount and field arg2 to the relevant super block magic |
| 375 | number (described below) or 0 if it isn't a mountpoint. In both cases |
| 376 | the device number (as returned by new_encode_dev()) is returned |
| 377 | in field arg1. |
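For reference, the packing used by new_encode_dev() puts the low 8 bits of the minor number in bits 0-7, the major number from bit 8 upward and the remaining minor bits from bit 20 upward. The sketch below reproduces that layout in user space; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Pack major/minor the way the kernel's new_encode_dev() does. */
static uint32_t sketch_encode_dev(unsigned int major, unsigned int minor)
{
	return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Reverse of the above, matching new_decode_dev(). */
static void sketch_decode_dev(uint32_t dev, unsigned int *major,
			      unsigned int *minor)
{
	assert(major != NULL && minor != NULL);
	*major = (dev & 0xfff00u) >> 8;
	*minor = (dev & 0xffu) | ((dev >> 12) & 0xfff00u);
}
```

For example, major 8 and minor 1 encode to 0x801.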
| 378 | |
| 379 | If supplied with a file descriptor we're looking for a specific mount, |
| 380 | not necessarily at the top of the mounted stack. In this case the path |
| 381 | the descriptor corresponds to is considered a mountpoint if it is itself |
| 382 | a mountpoint or contains a mount, such as a multi-mount without a root |
| 383 | mount. In this case we return 1 if the descriptor corresponds to a mount |
| 384 | point and also return the super magic of the covering mount if there |
| 385 | is one or 0 if it isn't a mountpoint. |
| 386 | |
| 387 | If a path is supplied (and the ioctlfd field is set to -1) then the path |
| 388 | is looked up and is checked to see if it is the root of a mount. If a |
| 389 | type is also given we are looking for a particular autofs mount and if |
| 390 | a match isn't found a fail is returned. If the located path is the |
| 391 | root of a mount 1 is returned along with the super magic of the mount |
| 392 | or 0 otherwise. |
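A caller might interpret the result along the following lines, using the return value together with the super block magic in arg2 to tell an autofs mount from any other mounted file system. The enum and function names are hypothetical. The magic value 0x0187 is AUTOFS_SUPER_MAGIC from <linux/magic.h>, repeated here so that the sketch builds without kernel headers.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_AUTOFS_SUPER_MAGIC 0x0187u  /* AUTOFS_SUPER_MAGIC */

enum sketch_mount_kind {
	SKETCH_NOT_MOUNTED,     /* call returned 0 */
	SKETCH_MOUNTED_AUTOFS,  /* call returned 1, magic is autofs */
	SKETCH_MOUNTED_OTHER    /* call returned 1, some other file system */
};

/* ret is the (non-negative) return value of the call and magic the
 * value left in arg2. Errors (negative returns) are assumed to have
 * been handled by the caller beforehand. */
static enum sketch_mount_kind sketch_classify(int ret, uint32_t magic)
{
	assert(ret == 0 || ret == 1);

	if (ret == 0)
		return SKETCH_NOT_MOUNTED;
	if (magic == SKETCH_AUTOFS_SUPER_MAGIC)
		return SKETCH_MOUNTED_AUTOFS;
	return SKETCH_MOUNTED_OTHER;
}
```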
| 393 | |