Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation supports RDS over TCP as well
as IB.

The high-level semantics of RDS from the application's point of view are

 * Addressing
        RDS uses IPv4 addresses and 16-bit port numbers to identify
        the end point of a connection. All socket operations that involve
        passing addresses between kernel and user space generally
        use a struct sockaddr_in.

        The fact that IPv4 addresses are used does not mean the underlying
        transport has to be IP-based. In fact, RDS over IB uses a
        reliable IB connection; the IP address is used exclusively to
        locate the remote node's GID (by ARPing for the given IP).

        The port space is entirely independent of UDP, TCP or any other
        protocol.

 * Socket interface
        RDS sockets work *mostly* as you would expect from a BSD
        socket. The next section will cover the details. At any rate,
        all I/O is performed through the standard BSD socket API.
        Some additions like zerocopy support are implemented through
        control messages, while other extensions use the getsockopt/
        setsockopt calls.

        Sockets must be bound before you can send or receive data.
        This is needed because binding also selects a transport and
        attaches it to the socket. Once bound, the transport assignment
        does not change. RDS will tolerate IPs moving around (e.g. in
        an active-active HA scenario), but only as long as the address
        doesn't move to a different transport.

 * sysctls
        RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
        AF_RDS and PF_RDS are the domain type to be used with socket(2)
        to create RDS sockets. SOL_RDS is the socket level to be used
        with setsockopt(2) and getsockopt(2) for RDS-specific socket
        options.

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDSIZE bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVSIZE option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.

  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
        transport, if one has not already been selected via the
        SO_RDS_TRANSPORT socket option.

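        As a sketch, the create-and-bind sequence looks roughly as
        below. This is illustrative userspace code, not kernel code:
        the helper names are invented here, and PF_RDS is given a
        fallback definition (its value in <linux/socket.h>) for libcs
        that do not expose it.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef PF_RDS
#define PF_RDS 21               /* AF_RDS/PF_RDS from <linux/socket.h> */
#endif

/* Fill in an RDS endpoint. The port lives in RDS's own port space,
 * independent of TCP and UDP. */
struct sockaddr_in rds_endpoint(const char *ip, uint16_t port)
{
        struct sockaddr_in sin;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(port);
        inet_pton(AF_INET, ip, &sin.sin_addr);
        return sin;
}

/* Create an RDS socket and bind it; the bind also selects and pins
 * the transport for the socket's lifetime. */
int rds_socket_bound(const char *local_ip, uint16_t port)
{
        struct sockaddr_in sin = rds_endpoint(local_ip, port);
        int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);

        if (fd < 0)
                return -1;      /* e.g. kernel built without RDS */
        if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}
```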
  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDSIZE will
        return EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDSIZE threshold will return
        EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will return ENOBUFS.

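        A sketch of a send call covering the error cases above. The
        helper names are invented; the errno meanings are the ones
        this section describes.

```c
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/socket.h>

/* Map the documented sendmsg() failures to the action a caller takes. */
const char *rds_send_errno(int err)
{
        switch (err) {
        case EMSGSIZE: return "message exceeds SO_SNDSIZE";
        case EAGAIN:   return "send queue full; wait for POLLOUT";
        case ENOBUFS:  return "destination congested; wait for update";
        default:       return "other error";
        }
}

/* One datagram per call; the kernel brings up the underlying reliable
 * connection transparently on the first send to a destination. */
ssize_t rds_send(int fd, const struct sockaddr_in *dst,
                 const void *buf, size_t len)
{
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        struct msghdr msg = {
                .msg_name    = (void *)dst,
                .msg_namelen = sizeof(*dst),
                .msg_iov     = &iov,
                .msg_iovlen  = 1,
        };

        return sendmsg(fd, &msg, 0);
}
```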
  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        recv queue accounting is adjusted, and if the queue length
        drops below SO_RCVSIZE, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrives, or when an RDMA
        operation completes). These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in manpages.

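        Scanning msg_control for such notifications can be sketched as
        below. The fallback constant values are copied from the kernel
        headers and assumed correct here; congestion-update
        notifications must first be enabled with the RDS_CONG_MONITOR
        socket option (see rds(7)).

```c
#include <stddef.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276                     /* from <linux/socket.h> */
#endif
#ifndef RDS_CMSG_CONG_UPDATE
#define RDS_CMSG_CONG_UPDATE 5          /* from <linux/rds.h> */
#endif

/* Receive one datagram and note any congestion-update notification. */
ssize_t rds_recv(int fd, void *buf, size_t len,
                 struct sockaddr_in *from, int *saw_cong_update)
{
        char cbuf[256];
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg = {
                .msg_name       = from,
                .msg_namelen    = sizeof(*from),
                .msg_iov        = &iov,
                .msg_iovlen     = 1,
                .msg_control    = cbuf,
                .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cm;
        ssize_t n = recvmsg(fd, &msg, 0);

        *saw_cong_update = 0;
        if (n < 0)
                return n;
        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
                if (cm->cmsg_level == SOL_RDS &&
                    cm->cmsg_type == RDS_CMSG_CONG_UPDATE)
                        *saw_cong_update = 1;   /* a peer drained its queue */
        return n;
}
```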
  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        this - by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.

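        The caveat can be captured in a small (hypothetical) helper:
        POLLOUT only reflects sendbuf space, so a congested destination
        (ENOBUFS) has to be waited out via congestion notifications,
        not via poll().

```c
#include <errno.h>
#include <sys/types.h>

/* What to do after a send attempt. */
typedef enum {
        SEND_DONE,              /* message queued */
        WAIT_POLLOUT,           /* sendbuf full: poll() will wake us */
        WAIT_CONG_UPDATE,       /* peer congested: wait for notification */
        SEND_FAILED             /* hard error */
} send_wait_t;

send_wait_t rds_after_send(ssize_t rc, int err)
{
        if (rc >= 0)
                return SEND_DONE;
        switch (err) {
        case EAGAIN:  return WAIT_POLLOUT;
        case ENOBUFS: return WAIT_CONG_UPDATE;
        default:      return SEND_FAILED;
        }
}
```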
  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        In particular, this lets the application cancel outstanding
        messages if it detects a timeout. For instance, if it tried to
        send a message, and the remote host is unreachable, RDS will
        keep trying forever. The application may decide it's not worth
        it, and cancel the operation. In this case, it would use
        RDS_CANCEL_SENT_TO to nuke any pending messages.

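        A minimal sketch, with the fallback constant values copied
        from the kernel headers:

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276             /* from <linux/socket.h> */
#endif
#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1    /* from <linux/rds.h> */
#endif

/* Drop everything still queued to one destination on this socket. */
int rds_cancel_sent_to(int fd, const struct sockaddr_in *dst)
{
        return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
                          dst, sizeof(*dst));
}
```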
  setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
  getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
        Set or read an integer defining the underlying
        encapsulating transport to be used for RDS packets on the
        socket. When setting the option, the integer argument may be
        one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
        value, RDS_TRANS_NONE will be returned on an unbound socket.
        This socket option may only be set exactly once on the socket,
        prior to binding it via the bind(2) system call. Attempts to
        set SO_RDS_TRANSPORT on a socket for which the transport has
        been previously attached explicitly (by SO_RDS_TRANSPORT) or
        implicitly (via bind(2)) will return an error of EOPNOTSUPP.
        An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
        always return EINVAL.

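        For example, pinning a socket to the TCP transport before
        bind(2) could look like this (fallback constant values copied
        from the kernel headers):

```c
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276             /* from <linux/socket.h> */
#endif
#ifndef SO_RDS_TRANSPORT
#define SO_RDS_TRANSPORT 8      /* from <linux/rds.h> */
#endif
#ifndef RDS_TRANS_TCP
#define RDS_TRANS_TCP 2         /* from <linux/rds.h> */
#endif

/* Valid exactly once, and only before bind(2); later attempts
 * fail with EOPNOTSUPP. */
int rds_force_tcp(int fd)
{
        int t = RDS_TRANS_TCP;

        return setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, &t, sizeof(t));
}
```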
RDMA for RDS
============

see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

see rds(7) manpage


RDS Protocol
============

Message header

  The message header is a 'struct rds_header' (see rds.h):
  Fields:
        h_sequence:
            per-packet sequence number
        h_ack:
            piggybacked acknowledgment of last packet received
        h_len:
            length of data, not including header
        h_sport:
            source port
        h_dport:
            destination port
        h_flags:
            CONG_BITMAP - this is a congestion update bitmap
            ACK_REQUIRED - receiver must ack this packet
            RETRANSMITTED - packet has previously been sent
        h_credit:
            indicates to the other end of the connection that
            it has more credits available (i.e. there is
            more send room)
        h_padding[4]:
            unused, for future use
        h_csum:
            header checksum
        h_exthdr:
            optional data can be passed here. This is currently used for
            passing RDMA-related information.

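  The fields above can be restated as a struct sketch with userspace
  fixed-width types. The authoritative definition is 'struct
  rds_header' in net/rds/rds.h; multi-byte fields are big-endian on
  the wire, and the 16-byte extension-header space is an assumption
  made here.

```c
#include <stdint.h>

#define RDS_HDR_EXT_SPACE 16    /* assumed; RDS_HEADER_EXT_SPACE upstream */

struct rds_header_sketch {
        uint64_t h_sequence;            /* per-packet sequence number */
        uint64_t h_ack;                 /* piggybacked ack */
        uint32_t h_len;                 /* payload length, header excluded */
        uint16_t h_sport;               /* source RDS port */
        uint16_t h_dport;               /* destination RDS port */
        uint8_t  h_flags;               /* CONG_BITMAP / ACK_REQUIRED /
                                           RETRANSMITTED */
        uint8_t  h_credit;              /* new send credits for the peer */
        uint8_t  h_padding[4];          /* unused, for future use */
        uint16_t h_csum;                /* header checksum */
        uint8_t  h_exthdr[RDS_HDR_EXT_SPACE];   /* e.g. RDMA info */
};
```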
ACK and retransmit handling

  One might think that with reliable IB connections you wouldn't need
  to ack messages that have been received. The problem is that IB
  hardware generates an ack message before it has DMAed the message
  into memory. This creates a window for message loss if the HCA is
  disabled for any reason between sending the ack and DMAing and
  processing the message. This is only a potential issue if another
  HCA is available for fail-over.

  Sending an ack immediately would allow the sender to free the sent
  message from their send queue quickly, but could cause excessive
  traffic to be used for acks. RDS piggybacks acks on sent data
  packets. Ack-only packets are reduced by only allowing one to be
  in flight at a time, and by the sender only asking for acks when
  its send buffers start to fill up. All retransmissions are also
  acked.

Flow Control

  RDS's IB transport uses a credit-based mechanism to verify that
  there is space in the peer's receive buffers for more data. This
  eliminates the need for hardware retries on the connection.

Congestion

  Messages waiting in the receive queue on the receiving socket
  are accounted against the socket's SO_RCVBUF option value. Only
  the payload bytes in the message are accounted for. If the
  number of bytes queued equals or exceeds rcvbuf then the socket
  is congested. All sends attempted to this socket's address
  should block or return EWOULDBLOCK.

  Applications are expected to be reasonably tuned such that this
  situation very rarely occurs. An application that routinely
  encounters this "back-pressure" is considered buggy.

  This is implemented by having each node maintain bitmaps which
  indicate which ports on bound addresses are congested. As the
  bitmap changes it is sent through all the connections which
  terminate in the local address of the bitmap which changed.

  The bitmaps are allocated as connections are brought up. This
  avoids allocation in the interrupt handling path which queues
  messages on sockets. The dense bitmaps let transports send the
  entire bitmap on any bitmap change reasonably efficiently. This
  is much easier to implement than some finer-grained
  communication of per-port congestion. The sender does a very
  inexpensive bit test to check whether the port it's about to
  send to is congested or not.
| 260 | |
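  The idea behind the bitmap can be sketched as below: one bit per
  16-bit port, so the whole map is 8 KB and cheap to ship in full on
  any change. This is a flat illustration only; the kernel version
  pages the bitmap and lives in struct rds_cong_map.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDS_CONG_MAP_BYTES (65536 / 8)  /* one bit per possible port */

struct cong_map {
        uint8_t bits[RDS_CONG_MAP_BYTES];
};

/* Mark a port congested (receive queue over rcvbuf). */
void cong_set(struct cong_map *m, uint16_t port)
{
        m->bits[port / 8] |= (uint8_t)(1u << (port % 8));
}

/* Mark a port uncongested again (queue drained below the limit). */
void cong_clear(struct cong_map *m, uint16_t port)
{
        m->bits[port / 8] &= (uint8_t)~(1u << (port % 8));
}

/* The sender's very inexpensive pre-send check. */
bool cong_test(const struct cong_map *m, uint16_t port)
{
        return m->bits[port / 8] & (1u << (port % 8));
}
```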
| 261 | |
RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other Infiniband details.


RDS Kernel Structures
=====================

  struct rds_message
        aka possibly "rds_outgoing", the generic RDS layer copies data to
        be sent and sets header fields as needed, based on the socket API.
        This is then queued for the individual connection and sent by the
        connection's transport.
  struct rds_incoming
        a generic struct referring to incoming data that can be handed from
        the transport to the general code and queued by the general code
        while the socket is awoken. It is then passed back to the transport
        code to handle the actual copy-to-user.
  struct rds_socket
        per-socket information
  struct rds_connection
        per-connection information
  struct rds_transport
        pointers to transport-specific functions
  struct rds_statistics
        non-transport-specific statistics
  struct rds_cong_map
        wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.


The send path
=============

  rds_sendmsg()
        struct rds_message built from incoming data
        CMSGs parsed (e.g. RDMA ops)
        transport connection alloced and connected if not already
        rds_message placed on send queue
        send worker awoken
  rds_send_worker()
        calls rds_send_xmit() until queue is empty
  rds_send_xmit()
        transmits congestion map if one is pending
        may set ACK_REQUIRED
        calls transport to send either non-RDMA or RDMA message
        (RDMA ops never retransmitted)
  rds_ib_xmit()
        allocs work requests from send ring
        adds any new send credits available to peer (h_credits)
        maps the rds_message's sg list
        piggybacks ack
        populates work requests
        posts send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
        looks at recv completions
        unmaps recv buffer from device
        no errors, call rds_ib_process_recv()
        refill recv ring
  rds_ib_process_recv()
        validate header checksum
        copy header to rds_ib_incoming struct if start of a new datagram
        add to ibinc's fraglist
        if completed datagram:
            update cong map if datagram was cong update
            call rds_recv_incoming() otherwise
            note if ack is required
  rds_recv_incoming()
        drop duplicate packets
        respond to pings
        find the sock associated with this datagram
        add to sock queue
        wake up sock
        do some congestion calculations
  rds_recvmsg()
        copy data into user iovec
        handle CMSGs
        return to application

Multipath RDS (mprds)
=====================

  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
  (though the concept can be extended to other transports). The classical
  implementation of RDS-over-TCP is implemented by multiplexing multiple
  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
  port]) over a single TCP socket between the 2 IP addresses involved. This
  has the limitation that it ends up funneling multiple RDS flows over a
  single TCP flow, thus it is
  (a) upper-bounded by the single-flow bandwidth, and
  (b) subject to head-of-line blocking across all the RDS sockets.

  Better throughput (for a fixed small packet size, MTU) can be achieved
  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
  RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
  connection. RDS sockets will be attached to a path based on some hash
  (e.g., of local address and RDS port number) and packets for that RDS
  socket will be sent over the attached path using TCP to segment/reassemble
  RDS datagrams on that path.

  Multipathed RDS is implemented by splitting the struct rds_connection into
  a common (to all paths) part, and a per-path struct rds_conn_path. All
  I/O workqs and reconnect threads are driven from the rds_conn_path.
  Transports such as TCP that are multipath capable may then set up a
  TCP socket per rds_conn_path, and this is managed by the transport via
  the transport-private cp_transport_data pointer.

  Transports announce themselves as multipath capable by setting the
  t_mp_capable bit during registration with the rds core module. When the
  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
  across multiple paths. The outgoing hash is computed based on the
  local address and port that the PF_RDS socket is bound to.

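  The path selection above can be sketched as a stable hash of the
  bound endpoint. The mixing function below is illustrative only, not
  the kernel's hash; what matters is that it is deterministic, so one
  socket always maps to one path (and therefore one TCP flow),
  preserving per-socket ordering.

```c
#include <stdint.h>

/* Pick a path index in [0, n_paths) from the bound local endpoint.
 * n_paths must be non-zero. */
uint32_t mprds_path_index(uint32_t local_addr, uint16_t local_port,
                          uint32_t n_paths)
{
        uint32_t h = local_addr ^ ((uint32_t)local_port * 2654435761u);

        h ^= h >> 16;           /* mix high bits into low bits */
        h *= 0x45d9f3bu;
        h ^= h >> 16;
        return h % n_paths;
}
```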
  Additionally, even if the transport is MP capable, we may be
  peering with some node that does not support mprds, or supports
  a different number of paths. As a result, the peering nodes need
  to agree on the number of paths to be used for the connection.
  This is done by sending out a control packet exchange before the
  first data packet. The control packet exchange must have completed
  prior to outgoing hash completion in rds_sendmsg() when the transport
  is multipath capable.

  The control packet is an RDS ping packet (i.e., a packet to rds dest
  port 0) with the ping packet having an rds extension header option of
  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
  number of paths supported by the sender. The "probe" ping packet will
  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
  be able to compute the min(sender_paths, rcvr_paths). The pong
  sent in response to a probe-ping should contain the rcvr's npaths
  when the rcvr is mprds-capable.

  If the rcvr is not mprds-capable, the exthdr in the ping will be
  ignored. In this case the pong will not have any exthdrs, so the sender
  of the probe-ping can default to single-path mprds.