Andy Grover | 0c5f9b8 | 2009-02-24 15:30:38 +0000
Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation used to support RDS over TCP as well
as IB. Work is in progress to support RDS over iWARP, and using DCE to
guarantee no dropped packets on Ethernet, it may be possible to use RDS over
UDP in the future.

The high-level semantics of RDS from the application's point of view are

  * Addressing
        RDS uses IPv4 addresses and 16-bit port numbers to identify
        the end point of a connection. All socket operations that involve
        passing addresses between kernel and user space generally
        use a struct sockaddr_in.

        The fact that IPv4 addresses are used does not mean the underlying
        transport has to be IP-based. In fact, RDS over IB uses a
        reliable IB connection; the IP address is used exclusively to
        locate the remote node's GID (by ARPing for the given IP).

        The port space is entirely independent of UDP, TCP or any other
        protocol.

  * Socket interface
        RDS sockets work *mostly* as you would expect from a BSD
        socket. The next section will cover the details. At any rate,
        all I/O is performed through the standard BSD socket API.
        Some additions like zerocopy support are implemented through
        control messages, while other extensions use the getsockopt/
        setsockopt calls.

        Sockets must be bound before you can send or receive data.
        This is needed because binding also selects a transport and
        attaches it to the socket. Once bound, the transport assignment
        does not change. RDS will tolerate IPs moving around (e.g. in
        an active-active HA scenario), but only as long as the address
        doesn't move to a different transport.

  * sysctls
        RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
        These constants haven't been assigned yet, because RDS isn't in
        mainline yet. Currently, the kernel module assigns some constant
        and publishes it to user space through two sysctl files
                /proc/sys/net/rds/pf_rds
                /proc/sys/net/rds/sol_rds

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDBUF bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVBUF option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.

  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
        transport.

  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDBUF will
        return with -EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDBUF threshold will return
        -EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will return -ENOBUFS.

  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        recv queue accounting is adjusted, and if the queue length
        drops below SO_RCVBUF, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrived, or when an RDMA
        operation completes). These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in manpages.

  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        this - by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.

  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        This can be used to cancel outstanding messages if the
        application detects a timeout. For instance, if it tried to send
        a message, and the remote host is unreachable, RDS will keep
        trying forever. The application may decide it's not worth it, and
        cancel the operation. In this case, it would use
        RDS_CANCEL_SENT_TO to nuke any pending messages.


RDMA for RDS
============

see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):
    Fields:
      h_sequence:
          per-packet sequence number
      h_ack:
          piggybacked acknowledgment of last packet received
      h_len:
          length of data, not including header
      h_sport:
          source port
      h_dport:
          destination port
      h_flags:
          CONG_BITMAP - this is a congestion update bitmap
          ACK_REQUIRED - receiver must ack this packet
          RETRANSMITTED - packet has previously been sent
      h_credit:
          indicate to other end of connection that
          it has more credits available (i.e. there is
          more send room)
      h_padding[4]:
          unused; reserved for future use
      h_csum:
          header checksum
      h_exthdr:
          optional data can be passed here. This is currently used for
          passing RDMA-related information.

  ACK and retransmit handling

    One might think that with reliable IB connections you wouldn't need
    to ack messages that have been received. The problem is that IB
    hardware generates an ack message before it has DMAed the message
    into memory. This creates a window for message loss if the HCA is
    disabled for any reason between sending the ack and DMAing and
    processing the message. This is only a potential issue if another
    HCA is available for fail-over.

    Sending an ack immediately would allow the sender to free the sent
    message from their send queue quickly, but could cause excessive
    traffic to be used for acks. RDS piggybacks acks on sent data
    packets. Ack-only packets are reduced by only allowing one to be
    in flight at a time, and by the sender only asking for acks when
    its send buffers start to fill up. All retransmissions are also
    acked.

  Flow Control

    RDS's IB transport uses a credit-based mechanism to verify that
    there is space in the peer's receive buffers for more data. This
    eliminates the need for hardware retries on the connection.

  Congestion

    Messages waiting in the receive queue on the receiving socket
    are accounted against the socket's SO_RCVBUF option value. Only
    the payload bytes in the message are accounted for. If the
    number of bytes queued equals or exceeds rcvbuf then the socket
    is congested. All sends attempted to this socket's address
    should block or return -EWOULDBLOCK.

    Applications are expected to be reasonably tuned such that this
    situation very rarely occurs. Encountering this "back-pressure"
    is considered an application bug.

    This is implemented by having each node maintain bitmaps which
    indicate which ports on bound addresses are congested. As the
    bitmap changes it is sent through all the connections which
    terminate in the local address of the bitmap which changed.

    The bitmaps are allocated as connections are brought up. This
    avoids allocation in the interrupt handling path which queues
    messages on sockets. The dense bitmaps let transports send the
    entire bitmap on any bitmap change reasonably efficiently. This
    is much easier to implement than some finer-grained
    communication of per-port congestion. The sender does a very
    inexpensive bit test to test if the port it's about to send to
    is congested or not.


RDS Transport Layer
===================

As mentioned above, RDS is not IB-specific. Its code is divided
into a general RDS layer and a transport layer.

The general layer handles the socket API, congestion handling,
loopback, stats, usermem pinning, and the connection state machine.

The transport layer handles the details of the transport. The IB
transport, for example, handles all the queue pairs, work requests,
CM event handlers, and other Infiniband details.


RDS Kernel Structures
=====================

  struct rds_message
        aka possibly "rds_outgoing", the generic RDS layer copies data to
        be sent and sets header fields as needed, based on the socket API.
        This is then queued for the individual connection and sent by the
        connection's transport.
  struct rds_incoming
        a generic struct referring to incoming data that can be handed from
        the transport to the general code and queued by the general code
        while the socket is awoken. It is then passed back to the transport
        code to handle the actual copy-to-user.
  struct rds_socket
        per-socket information
  struct rds_connection
        per-connection information
  struct rds_transport
        pointers to transport-specific functions
  struct rds_statistics
        non-transport-specific statistics
  struct rds_cong_map
        wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
ERROR states.

The first time an attempt is made by an RDS socket to send data to
a node, a connection is allocated and connected. That connection is
then maintained forever -- if there are transport errors, the
connection will be dropped and re-established.

Dropping a connection while packets are queued will cause queued or
partially-sent datagrams to be retransmitted when the connection is
re-established.


The send path
=============

  rds_sendmsg()
        struct rds_message built from incoming data
        CMSGs parsed (e.g. RDMA ops)
        transport connection alloced and connected if not already
        rds_message placed on send queue
        send worker awoken
  rds_send_worker()
        calls rds_send_xmit() until queue is empty
  rds_send_xmit()
        transmits congestion map if one is pending
        may set ACK_REQUIRED
        calls transport to send either non-RDMA or RDMA message
        (RDMA ops never retransmitted)
  rds_ib_xmit()
        allocs work requests from send ring
        adds any new send credits available to peer (h_credits)
        maps the rds_message's sg list
        piggybacks ack
        populates work requests
        post send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
        looks at recv completions
        unmaps recv buffer from device
        no errors, call rds_ib_process_recv()
        refill recv ring
  rds_ib_process_recv()
        validate header checksum
        copy header to rds_ib_incoming struct if start of a new datagram
        add to ibinc's fraglist
        if completed datagram:
            update cong map if datagram was cong update
            call rds_recv_incoming() otherwise
            note if ack is required
  rds_recv_incoming()
        drop duplicate packets
        respond to pings
        find the sock associated with this datagram
        add to sock queue
        wake up sock
        do some congestion calculations
  rds_recvmsg
        copy data into user iovec
        handle CMSGs
        return to application
