| 1 | |
| 2 | Overview |
| 3 | ======== |
| 4 | |
| 5 | This readme tries to provide some background on the hows and whys of RDS, |
| 6 | and will hopefully help you find your way around the code. |
| 7 | |
| 8 | In addition, please see this email about RDS origins: |
| 9 | http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html |
| 10 | |
| 11 | RDS Architecture |
| 12 | ================ |
| 13 | |
| 14 | RDS provides reliable, ordered datagram delivery by using a single |
| 15 | reliable connection between any two nodes in the cluster. This allows |
| 16 | applications to use a single socket to talk to any other process in the |
| 17 | cluster - so in a cluster with N processes you need N sockets, in contrast |
| 18 | to N*N if you use a connection-oriented socket transport like TCP. |
| 19 | |
| 20 | RDS is not Infiniband-specific; it was designed to support different |
| 21 | transports. The current implementation supports RDS over TCP as well |
| 22 | as IB. |
| 23 | |
| 24 | The high-level semantics of RDS from the application's point of view are: |
| 25 | |
| 26 | * Addressing |
| 27 | RDS uses IPv4 addresses and 16-bit port numbers to identify |
| 28 | the end point of a connection. Socket operations that pass |
| 29 | addresses between kernel and user space generally |
| 30 | use a struct sockaddr_in. |
| 31 | |
| 32 | The fact that IPv4 addresses are used does not mean the underlying |
| 33 | transport has to be IP-based. In fact, RDS over IB uses a |
| 34 | reliable IB connection; the IP address is used exclusively to |
| 35 | locate the remote node's GID (by ARPing for the given IP). |
| 36 | |
| 37 | The port space is entirely independent of UDP, TCP or any other |
| 38 | protocol. |
| 39 | |
| 40 | * Socket interface |
| 41 | RDS sockets work *mostly* as you would expect from a BSD |
| 42 | socket. The next section will cover the details. At any rate, |
| 43 | all I/O is performed through the standard BSD socket API. |
| 44 | Some additions like zerocopy support are implemented through |
| 45 | control messages, while other extensions use the getsockopt/ |
| 46 | setsockopt calls. |
| 47 | |
| 48 | Sockets must be bound before you can send or receive data. |
| 49 | This is needed because binding also selects a transport and |
| 50 | attaches it to the socket. Once bound, the transport assignment |
| 51 | does not change. RDS will tolerate IPs moving around (e.g. in |
| 52 | an active-active HA scenario), but only as long as the address |
| 53 | doesn't move to a different transport. |
| 54 | |
| 55 | * sysctls |
| 56 | RDS supports a number of sysctls in /proc/sys/net/rds |
| 57 | |
| 58 | |
| 59 | Socket Interface |
| 60 | ================ |
| 61 | |
| 62 | AF_RDS, PF_RDS, SOL_RDS |
| 63 | AF_RDS and PF_RDS are the domain types to be used with socket(2) |
| 64 | to create RDS sockets. SOL_RDS is the socket-level to be used |
| 65 | with setsockopt(2) and getsockopt(2) for RDS specific socket |
| 66 | options. |
| 67 | |
| 68 | fd = socket(PF_RDS, SOCK_SEQPACKET, 0); |
| 69 | This creates a new, unbound RDS socket. |
| 70 | |
| 71 | setsockopt(SOL_SOCKET): send and receive buffer size |
| 72 | RDS honors the send and receive buffer size socket options. |
| 73 | You are not allowed to queue more than SO_SNDBUF bytes to |
| 74 | a socket. A message is queued when sendmsg is called, and |
| 75 | it leaves the queue when the remote system acknowledges |
| 76 | its arrival. |
| 77 | |
| 78 | The SO_RCVBUF option controls the maximum receive queue length. |
| 79 | This is a soft limit rather than a hard limit - RDS will |
| 80 | continue to accept and queue incoming messages, even if that |
| 81 | takes the queue length over the limit. However, it will also |
| 82 | mark the port as "congested" and send a congestion update to |
| 83 | the source node. The source node is supposed to throttle any |
| 84 | processes sending to this congested port. |
| 85 | |
| 86 | bind(fd, &sockaddr_in, ...) |
| 87 | This binds the socket to a local IP address and port, and a |
| 88 | transport. |
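The socket() and bind() steps above can be sketched in user space as
follows. This is a minimal illustration, not code from the RDS tree: the
helper name rds_open_bound() is hypothetical, and the fallback value for
AF_RDS is an assumption for headers that do not define it. The call fails
with -1 when the rds module is not loaded or the address is not local.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_RDS
#define AF_RDS 21	/* assumption: Linux value, normally from <sys/socket.h> */
#endif

/*
 * Hypothetical helper: create an RDS socket and bind it to a local IPv4
 * address and port. Returns the descriptor, or -1 with errno set.
 */
int rds_open_bound(const char *local_ip, uint16_t port)
{
	struct sockaddr_in sin;
	int fd, saved;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	if (inet_pton(AF_INET, local_ip, &sin.sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}

	fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return -1;

	/* Binding is what selects the transport and attaches it to the socket. */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

A caller would typically pass the address of the interface that carries
RDS traffic and close() the descriptor when done.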
| 89 | |
| 90 | sendmsg(fd, ...) |
| 91 | Sends a message to the indicated recipient. The kernel will |
| 92 | transparently establish the underlying reliable connection |
| 93 | if it isn't up yet. |
| 94 | |
| 95 | An attempt to send a message that exceeds SO_SNDBUF will |
| 96 | return with -EMSGSIZE. |
| 97 | |
| 98 | An attempt to send a message that would take the total number |
| 99 | of queued bytes over the SO_SNDBUF threshold will return |
| 100 | EAGAIN. |
| 101 | |
| 102 | An attempt to send a message to a destination that is marked |
| 103 | as "congested" will return ENOBUFS. |
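The three sendmsg failure cases above map onto different recovery
strategies, so an application may want to tell them apart explicitly. The
sketch below is only an illustration; the enum and the function
rds_classify_send() are hypothetical names, not part of any RDS API.

```c
#include <errno.h>

/* Hypothetical classification of a sendmsg() result on an RDS socket. */
enum rds_send_action {
	RDS_SEND_OK,			/* message was queued */
	RDS_SEND_TOO_BIG,		/* message exceeds the send buffer size */
	RDS_SEND_QUEUE_FULL,		/* send queue is full, retry after POLLOUT */
	RDS_SEND_PEER_CONGESTED,	/* destination port is marked congested */
	RDS_SEND_OTHER_ERROR,		/* anything else */
};

/* ret is the return value of sendmsg(), err the errno observed after it. */
enum rds_send_action rds_classify_send(long ret, int err)
{
	if (ret >= 0)
		return RDS_SEND_OK;

	switch (err) {
	case EMSGSIZE:
		return RDS_SEND_TOO_BIG;
	case EAGAIN:
		return RDS_SEND_QUEUE_FULL;
	case ENOBUFS:
		return RDS_SEND_PEER_CONGESTED;
	default:
		return RDS_SEND_OTHER_ERROR;
	}
}
```

Typical use would be rds_classify_send(n, errno) immediately after
n = sendmsg(fd, &msg, MSG_DONTWAIT).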
| 104 | |
| 105 | recvmsg(fd, ...) |
| 106 | Receives a message that was queued to this socket. The socket's |
| 107 | recv queue accounting is adjusted, and if the queue length |
| 108 | drops below SO_RCVBUF, the port is marked uncongested, and |
| 109 | a congestion update is sent to all peers. |
| 110 | |
| 111 | Applications can ask the RDS kernel module to receive |
| 112 | notifications via control messages (for instance, there is a |
| 113 | notification when a congestion update arrives, or when an RDMA |
| 114 | operation completes). These notifications are received through |
| 115 | the msg.msg_control buffer of struct msghdr. The format of the |
| 116 | messages is described in the manpages. |
| 117 | |
| 118 | poll(fd) |
| 119 | RDS supports the poll interface to allow the application |
| 120 | to implement async I/O. |
| 121 | |
| 122 | POLLIN handling is pretty straightforward. When there's an |
| 123 | incoming message queued to the socket, or a pending notification, |
| 124 | we signal POLLIN. |
| 125 | |
| 126 | POLLOUT is a little harder. Since you can essentially send |
| 127 | to any destination, RDS will always signal POLLOUT as long as |
| 128 | there's room on the send queue (i.e. the number of bytes queued |
| 129 | is less than the sendbuf size). |
| 130 | |
| 131 | However, the kernel will refuse to accept messages to |
| 132 | a destination marked congested - in this case you will loop |
| 133 | forever if you rely on poll to tell you what to do. |
| 134 | This isn't a trivial problem, but applications can deal with |
| 135 | this - by using congestion notifications, and by checking for |
| 136 | ENOBUFS errors returned by sendmsg. |
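One way to avoid the busy loop described above is to choose the poll
events according to why the last send failed: wait for POLLOUT when the
local send queue was full, but wait only for POLLIN (which also covers
pending notifications) after an ENOBUFS. The helper below is a sketch
under that assumption; rds_wait() is a hypothetical name.

```c
#include <poll.h>

/*
 * Hypothetical helper: wait for events on one descriptor.
 * want_out selects whether POLLOUT is of interest; after a send failed
 * because the destination was congested it should be 0, since POLLOUT
 * would be signalled immediately and the caller would spin.
 * Returns the revents mask, 0 on timeout, or -1 on error.
 */
int rds_wait(int fd, int want_out, int timeout_ms)
{
	struct pollfd pfd;
	int n;

	pfd.fd = fd;
	pfd.events = POLLIN | (want_out ? POLLOUT : 0);
	pfd.revents = 0;

	n = poll(&pfd, 1, timeout_ms);
	if (n < 0)
		return -1;
	return n == 0 ? 0 : pfd.revents;
}
```

When POLLIN is reported, the application would call recvmsg() and inspect
any control messages before retrying the send.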
| 137 | |
| 138 | setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) |
| 139 | This allows the application to discard all messages queued to a |
| 140 | specific destination on this particular socket. |
| 141 | |
| 142 | This allows the application to cancel outstanding messages if |
| 143 | it detects a timeout. For instance, if it tried to send a message, |
| 144 | and the remote host is unreachable, RDS will keep trying forever. |
| 145 | The application may decide it's not worth it, and cancel the |
| 146 | operation. In this case, it would use RDS_CANCEL_SENT_TO to |
| 147 | nuke any pending messages. |
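A user-space call to this option might look like the sketch below. The
helper name rds_cancel_sent_to() is hypothetical; the fallback values for
SOL_RDS and RDS_CANCEL_SENT_TO are assumptions that should match
<linux/rds.h>, which is the authoritative source.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276		/* assumption: value from <linux/rds.h> */
#endif
#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1	/* assumption: value from <linux/rds.h> */
#endif

/*
 * Hypothetical helper: discard all messages queued on fd for the given
 * destination. Returns 0 on success, or -1 with errno set.
 */
int rds_cancel_sent_to(int fd, const char *dest_ip, uint16_t dest_port)
{
	struct sockaddr_in sin;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(dest_port);
	if (inet_pton(AF_INET, dest_ip, &sin.sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}

	return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO, &sin, sizeof(sin));
}
```

An application would usually call this from its own timeout handling,
since RDS itself keeps retrying indefinitely.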
| 148 | |
| 149 | |
| 150 | RDMA for RDS |
| 151 | ============ |
| 152 | |
| 153 | see rds-rdma(7) manpage (available in rds-tools) |
| 154 | |
| 155 | |
| 156 | Congestion Notifications |
| 157 | ======================== |
| 158 | |
| 159 | see rds(7) manpage |
| 160 | |
| 161 | |
| 162 | RDS Protocol |
| 163 | ============ |
| 164 | |
| 165 | Message header |
| 166 | |
| 167 | The message header is a 'struct rds_header' (see rds.h): |
| 168 | Fields: |
| 169 | h_sequence: |
| 170 | per-packet sequence number |
| 171 | h_ack: |
| 172 | piggybacked acknowledgment of last packet received |
| 173 | h_len: |
| 174 | length of data, not including header |
| 175 | h_sport: |
| 176 | source port |
| 177 | h_dport: |
| 178 | destination port |
| 179 | h_flags: |
| 180 | CONG_BITMAP - this is a congestion update bitmap |
| 181 | ACK_REQUIRED - receiver must ack this packet |
| 182 | RETRANSMITTED - packet has previously been sent |
| 183 | h_credit: |
| 184 | indicate to other end of connection that |
| 185 | it has more credits available (i.e. there is |
| 186 | more send room) |
| 187 | h_padding[4]: |
| 188 | unused, for future use |
| 189 | h_csum: |
| 190 | header checksum |
| 191 | h_exthdr: |
| 192 | optional data can be passed here. This is currently used for |
| 193 | passing RDMA-related information. |
| 194 | |
| 195 | ACK and retransmit handling |
| 196 | |
| 197 | One might think that with reliable IB connections you wouldn't need |
| 198 | to ack messages that have been received. The problem is that IB |
| 199 | hardware generates an ack message before it has DMAed the message |
| 200 | into memory. This creates a potential message loss if the HCA is |
| 201 | disabled for any reason between when it sends the ack and before |
| 202 | the message is DMAed and processed. This is only a potential issue |
| 203 | if another HCA is available for fail-over. |
| 204 | |
| 205 | Sending an ack immediately would allow the sender to free the sent |
| 206 | message from their send queue quickly, but could cause excessive |
| 207 | traffic to be used for acks. RDS piggybacks acks on sent data |
| 208 | packets. Ack-only packets are reduced by only allowing one to be |
| 209 | in flight at a time, and by the sender only asking for acks when |
| 210 | its send buffers start to fill up. All retransmissions are also |
| 211 | acked. |
| 212 | |
| 213 | Flow Control |
| 214 | |
| 215 | RDS's IB transport uses a credit-based mechanism to verify that |
| 216 | there is space in the peer's receive buffers for more data. This |
| 217 | eliminates the need for hardware retries on the connection. |
| 218 | |
| 219 | Congestion |
| 220 | |
| 221 | Messages waiting in the receive queue on the receiving socket |
| 222 | are accounted against the socket's SO_RCVBUF option value. Only |
| 223 | the payload bytes in the message are accounted for. If the |
| 224 | number of bytes queued equals or exceeds rcvbuf then the socket |
| 225 | is congested. All sends attempted to this socket's address |
| 226 | should block or return -EWOULDBLOCK. |
| 227 | |
| 228 | Applications are expected to be reasonably tuned such that this |
| 229 | situation very rarely occurs. An application encountering this |
| 230 | "back-pressure" is considered a bug. |
| 231 | |
| 232 | This is implemented by having each node maintain bitmaps which |
| 233 | indicate which ports on bound addresses are congested. As the |
| 234 | bitmap changes it is sent through all the connections which |
| 235 | terminate in the local address of the bitmap which changed. |
| 236 | |
| 237 | The bitmaps are allocated as connections are brought up. This |
| 238 | avoids allocation in the interrupt handling path which queues |
| 239 | messages on sockets. The dense bitmaps let transports send the |
| 240 | entire bitmap on any bitmap change reasonably efficiently. This |
| 241 | is much easier to implement than some finer-grained |
| 242 | communication of per-port congestion. The sender does a very |
| 243 | inexpensive bit test to test if the port it's about to send to |
| 244 | is congested or not. |
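The dense bitmap and the sender's bit test can be illustrated with a flat
array holding one bit per 16-bit port. This is a simplified sketch: the
names are hypothetical, and the kernel's real map (struct rds_cong_map)
is organised differently and carries additional state.

```c
#include <limits.h>
#include <stdint.h>

#define CONG_PORTS		65536
#define CONG_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* One bit per port: 65536 bits, i.e. 8192 bytes per bound address. */
struct cong_map_sketch {
	unsigned long bits[CONG_PORTS / (sizeof(unsigned long) * CHAR_BIT)];
};

/* Mark a port congested (receiver side, queue at or over rcvbuf). */
void cong_set(struct cong_map_sketch *m, uint16_t port)
{
	m->bits[port / CONG_BITS_PER_WORD] |= 1UL << (port % CONG_BITS_PER_WORD);
}

/* Mark a port uncongested again (receiver side, queue drained). */
void cong_clear(struct cong_map_sketch *m, uint16_t port)
{
	m->bits[port / CONG_BITS_PER_WORD] &= ~(1UL << (port % CONG_BITS_PER_WORD));
}

/* Sender side: inexpensive test before queueing a message to a port. */
int cong_test(const struct cong_map_sketch *m, uint16_t port)
{
	return (m->bits[port / CONG_BITS_PER_WORD] >>
		(port % CONG_BITS_PER_WORD)) & 1UL;
}
```

Because the whole map is only 8 KiB, sending it in full on every change
stays reasonably cheap, which is the trade-off described above.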
| 245 | |
| 246 | |
| 247 | RDS Transport Layer |
| 248 | =================== |
| 249 | |
| 250 | As mentioned above, RDS is not IB-specific. Its code is divided |
| 251 | into a general RDS layer and a transport layer. |
| 252 | |
| 253 | The general layer handles the socket API, congestion handling, |
| 254 | loopback, stats, usermem pinning, and the connection state machine. |
| 255 | |
| 256 | The transport layer handles the details of the transport. The IB |
| 257 | transport, for example, handles all the queue pairs, work requests, |
| 258 | CM event handlers, and other Infiniband details. |
| 259 | |
| 260 | |
| 261 | RDS Kernel Structures |
| 262 | ===================== |
| 263 | |
| 264 | struct rds_message |
| 265 | aka possibly "rds_outgoing", the generic RDS layer copies data to |
| 266 | be sent and sets header fields as needed, based on the socket API. |
| 267 | This is then queued for the individual connection and sent by the |
| 268 | connection's transport. |
| 269 | struct rds_incoming |
| 270 | a generic struct referring to incoming data that can be handed from |
| 271 | the transport to the general code and queued by the general code |
| 272 | while the socket is awoken. It is then passed back to the transport |
| 273 | code to handle the actual copy-to-user. |
| 274 | struct rds_socket |
| 275 | per-socket information |
| 276 | struct rds_connection |
| 277 | per-connection information |
| 278 | struct rds_transport |
| 279 | pointers to transport-specific functions |
| 280 | struct rds_statistics |
| 281 | non-transport-specific statistics |
| 282 | struct rds_cong_map |
| 283 | wraps the raw congestion bitmap, contains rbnode, waitq, etc. |
| 284 | |
| 285 | Connection management |
| 286 | ===================== |
| 287 | |
| 288 | Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and |
| 289 | ERROR states. |
| 290 | |
| 291 | The first time an attempt is made by an RDS socket to send data to |
| 292 | a node, a connection is allocated and connected. That connection is |
| 293 | then maintained forever -- if there are transport errors, the |
| 294 | connection will be dropped and re-established. |
| 295 | |
| 296 | Dropping a connection while packets are queued will cause queued or |
| 297 | partially-sent datagrams to be retransmitted when the connection is |
| 298 | re-established. |
| 299 | |
| 300 | |
| 301 | The send path |
| 302 | ============= |
| 303 | |
| 304 | rds_sendmsg() |
| 305 | struct rds_message built from incoming data |
| 306 | CMSGs parsed (e.g. RDMA ops) |
| 307 | transport connection alloced and connected if not already |
| 308 | rds_message placed on send queue |
| 309 | send worker awoken |
| 310 | rds_send_worker() |
| 311 | calls rds_send_xmit() until queue is empty |
| 312 | rds_send_xmit() |
| 313 | transmits congestion map if one is pending |
| 314 | may set ACK_REQUIRED |
| 315 | calls transport to send either non-RDMA or RDMA message |
| 316 | (RDMA ops never retransmitted) |
| 317 | rds_ib_xmit() |
| 318 | allocs work requests from send ring |
| 319 | adds any new send credits available to peer (h_credit) |
| 320 | maps the rds_message's sg list |
| 321 | piggybacks ack |
| 322 | populates work requests |
| 323 | post send to connection's queue pair |
| 324 | |
| 325 | The recv path |
| 326 | ============= |
| 327 | |
| 328 | rds_ib_recv_cq_comp_handler() |
| 329 | looks at write completions |
| 330 | unmaps recv buffer from device |
| 331 | no errors, call rds_ib_process_recv() |
| 332 | refill recv ring |
| 333 | rds_ib_process_recv() |
| 334 | validate header checksum |
| 335 | copy header to rds_ib_incoming struct if start of a new datagram |
| 336 | add to ibinc's fraglist |
| 337 | if completed datagram: |
| 338 | update cong map if datagram was cong update |
| 339 | call rds_recv_incoming() otherwise |
| 340 | note if ack is required |
| 341 | rds_recv_incoming() |
| 342 | drop duplicate packets |
| 343 | respond to pings |
| 344 | find the sock associated with this datagram |
| 345 | add to sock queue |
| 346 | wake up sock |
| 347 | do some congestion calculations |
| 348 | rds_recvmsg() |
| 349 | copy data into user iovec |
| 350 | handle CMSGs |
| 351 | return to application |
| 352 | |
| 353 | |