Blame - Documentation/networking/rds.txt - kernel/msm-4.19

blob: 9d219d856d46bf3235ce99c40e3a26454c51adfc [file] [log] [blame]

Andy Grover	0c5f9b8	2009-02-24 15:30:38 +0000	[diff] [blame]	1
				2	Overview
				3	========
				4
				5	This readme tries to provide some background on the hows and whys of RDS,
				6	and will hopefully help you find your way around the code.
				7
				8	In addition, please see this email about RDS origins:
				9	http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
				10
				11	RDS Architecture
				12	================
				13
				14	RDS provides reliable, ordered datagram delivery by using a single
				15	reliable connection between any two nodes in the cluster. This allows
				16	applications to use a single socket to talk to any other process in the
				17	cluster - so in a cluster with N processes you need N sockets, in contrast
				18	to N*N if you use a connection-oriented socket transport like TCP.
				19
				20	RDS is not Infiniband-specific; it was designed to support different
				21	transports. The current implementation used to support RDS over TCP as well
santosh.shilimkar@oracle.com	dcdede0	2016-03-01 15:20:42 -0800	[diff] [blame]	22	as IB.
Andy Grover	0c5f9b8	2009-02-24 15:30:38 +0000	[diff] [blame]	23
				24	The high-level semantics of RDS from the application's point of view are
				25
				26	* Addressing
				27	RDS uses IPv4 addresses and 16bit port numbers to identify
				28	the end point of a connection. All socket operations that involve
				29	passing addresses between kernel and user space generally
				30	use a struct sockaddr_in.
				31
				32	The fact that IPv4 addresses are used does not mean the underlying
				33	transport has to be IP-based. In fact, RDS over IB uses a
				34	reliable IB connection; the IP address is used exclusively to
				35	locate the remote node's GID (by ARPing for the given IP).
				36
				37	The port space is entirely independent of UDP, TCP or any other
				38	protocol.
				39
				40	* Socket interface
				41	RDS sockets work mostly as you would expect from a BSD
				42	socket. The next section will cover the details. At any rate,
				43	all I/O is performed through the standard BSD socket API.
				44	Some additions like zerocopy support are implemented through
				45	control messages, while other extensions use the getsockopt/
				46	setsockopt calls.
				47
				48	Sockets must be bound before you can send or receive data.
				49	This is needed because binding also selects a transport and
				50	attaches it to the socket. Once bound, the transport assignment
				51	does not change. RDS will tolerate IPs moving around (eg in
				52	a active-active HA scenario), but only as long as the address
				53	doesn't move to a different transport.
				54
				55	* sysctls
				56	RDS supports a number of sysctls in /proc/sys/net/rds
				57
				58
				59	Socket Interface
				60	================
				61
				62	AF_RDS, PF_RDS, SOL_RDS
Sowmini Varadhan	ebe96e6	2015-04-08 12:33:45 -0400	[diff] [blame]	63	AF_RDS and PF_RDS are the domain type to be used with socket(2)
				64	to create RDS sockets. SOL_RDS is the socket-level to be used
				65	with setsockopt(2) and getsockopt(2) for RDS specific socket
				66	options.
Andy Grover	0c5f9b8	2009-02-24 15:30:38 +0000	[diff] [blame]	67
				68	fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
				69	This creates a new, unbound RDS socket.
				70
				71	setsockopt(SOL_SOCKET): send and receive buffer size
				72	RDS honors the send and receive buffer size socket options.
				73	You are not allowed to queue more than SO_SNDSIZE bytes to
				74	a socket. A message is queued when sendmsg is called, and
				75	it leaves the queue when the remote system acknowledges
				76	its arrival.
				77
				78	The SO_RCVSIZE option controls the maximum receive queue length.
				79	This is a soft limit rather than a hard limit - RDS will
				80	continue to accept and queue incoming messages, even if that
				81	takes the queue length over the limit. However, it will also
				82	mark the port as "congested" and send a congestion update to
				83	the source node. The source node is supposed to throttle any
				84	processes sending to this congested port.
				85
				86	bind(fd, &sockaddr_in, ...)
				87	This binds the socket to a local IP address and port, and a
				88	transport.
				89
				90	sendmsg(fd, ...)
				91	Sends a message to the indicated recipient. The kernel will
				92	transparently establish the underlying reliable connection
				93	if it isn't up yet.
				94
				95	An attempt to send a message that exceeds SO_SNDSIZE will
				96	return with -EMSGSIZE
				97
				98	An attempt to send a message that would take the total number
				99	of queued bytes over the SO_SNDSIZE threshold will return
				100	EAGAIN.
				101
				102	An attempt to send a message to a destination that is marked
				103	as "congested" will return ENOBUFS.
				104
				105	recvmsg(fd, ...)
				106	Receives a message that was queued to this socket. The sockets
				107	recv queue accounting is adjusted, and if the queue length
				108	drops below SO_SNDSIZE, the port is marked uncongested, and
				109	a congestion update is sent to all peers.
				110
				111	Applications can ask the RDS kernel module to receive
				112	notifications via control messages (for instance, there is a
				113	notification when a congestion update arrived, or when a RDMA
				114	operation completes). These notifications are received through
				115	the msg.msg_control buffer of struct msghdr. The format of the
				116	messages is described in manpages.
				117
				118	poll(fd)
				119	RDS supports the poll interface to allow the application
				120	to implement async I/O.
				121
				122	POLLIN handling is pretty straightforward. When there's an
				123	incoming message queued to the socket, or a pending notification,
				124	we signal POLLIN.
				125
				126	POLLOUT is a little harder. Since you can essentially send
				127	to any destination, RDS will always signal POLLOUT as long as
				128	there's room on the send queue (ie the number of bytes queued
				129	is less than the sendbuf size).
				130
				131	However, the kernel will refuse to accept messages to
				132	a destination marked congested - in this case you will loop
				133	forever if you rely on poll to tell you what to do.
				134	This isn't a trivial problem, but applications can deal with
				135	this - by using congestion notifications, and by checking for
				136	ENOBUFS errors returned by sendmsg.
				137
				138	setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
				139	This allows the application to discard all messages queued to a
				140	specific destination on this particular socket.
				141
				142	This allows the application to cancel outstanding messages if
				143	it detects a timeout. For instance, if it tried to send a message,
				144	and the remote host is unreachable, RDS will keep trying forever.
				145	The application may decide it's not worth it, and cancel the
				146	operation. In this case, it would use RDS_CANCEL_SENT_TO to
				147	nuke any pending messages.
				148
				149
				150	RDMA for RDS
				151	============
				152
				153	see rds-rdma(7) manpage (available in rds-tools)
				154
				155
				156	Congestion Notifications
				157	========================
				158
				159	see rds(7) manpage
				160
				161
				162	RDS Protocol
				163	============
				164
				165	Message header
				166
				167	The message header is a 'struct rds_header' (see rds.h):
				168	Fields:
				169	h_sequence:
				170	per-packet sequence number
				171	h_ack:
				172	piggybacked acknowledgment of last packet received
				173	h_len:
				174	length of data, not including header
				175	h_sport:
				176	source port
				177	h_dport:
				178	destination port
				179	h_flags:
				180	CONG_BITMAP - this is a congestion update bitmap
				181	ACK_REQUIRED - receiver must ack this packet
				182	RETRANSMITTED - packet has previously been sent
				183	h_credit:
				184	indicate to other end of connection that
				185	it has more credits available (i.e. there is
				186	more send room)
				187	h_padding[4]:
				188	unused, for future use
				189	h_csum:
				190	header checksum
				191	h_exthdr:
				192	optional data can be passed here. This is currently used for
				193	passing RDMA-related information.
				194
				195	ACK and retransmit handling
				196
				197	One might think that with reliable IB connections you wouldn't need
				198	to ack messages that have been received. The problem is that IB
				199	hardware generates an ack message before it has DMAed the message
				200	into memory. This creates a potential message loss if the HCA is
				201	disabled for any reason between when it sends the ack and before
				202	the message is DMAed and processed. This is only a potential issue
				203	if another HCA is available for fail-over.
				204
				205	Sending an ack immediately would allow the sender to free the sent
				206	message from their send queue quickly, but could cause excessive
				207	traffic to be used for acks. RDS piggybacks acks on sent data
				208	packets. Ack-only packets are reduced by only allowing one to be
				209	in flight at a time, and by the sender only asking for acks when
				210	its send buffers start to fill up. All retransmissions are also
				211	acked.
				212
				213	Flow Control
				214
				215	RDS's IB transport uses a credit-based mechanism to verify that
				216	there is space in the peer's receive buffers for more data. This
				217	eliminates the need for hardware retries on the connection.
				218
				219	Congestion
				220
				221	Messages waiting in the receive queue on the receiving socket
				222	are accounted against the sockets SO_RCVBUF option value. Only
				223	the payload bytes in the message are accounted for. If the
				224	number of bytes queued equals or exceeds rcvbuf then the socket
				225	is congested. All sends attempted to this socket's address
				226	should return block or return -EWOULDBLOCK.
				227
				228	Applications are expected to be reasonably tuned such that this
				229	situation very rarely occurs. An application encountering this
				230	"back-pressure" is considered a bug.
				231
				232	This is implemented by having each node maintain bitmaps which
				233	indicate which ports on bound addresses are congested. As the
				234	bitmap changes it is sent through all the connections which
				235	terminate in the local address of the bitmap which changed.
				236
				237	The bitmaps are allocated as connections are brought up. This
				238	avoids allocation in the interrupt handling path which queues
				239	sages on sockets. The dense bitmaps let transports send the
				240	entire bitmap on any bitmap change reasonably efficiently. This
				241	is much easier to implement than some finer-grained
				242	communication of per-port congestion. The sender does a very
				243	inexpensive bit test to test if the port it's about to send to
				244	is congested or not.
				245
				246
				247	RDS Transport Layer
				248	==================
				249
				250	As mentioned above, RDS is not IB-specific. Its code is divided
				251	into a general RDS layer and a transport layer.
				252
				253	The general layer handles the socket API, congestion handling,
				254	loopback, stats, usermem pinning, and the connection state machine.
				255
				256	The transport layer handles the details of the transport. The IB
				257	transport, for example, handles all the queue pairs, work requests,
				258	CM event handlers, and other Infiniband details.
				259
				260
				261	RDS Kernel Structures
				262	=====================
				263
				264	struct rds_message
				265	aka possibly "rds_outgoing", the generic RDS layer copies data to
				266	be sent and sets header fields as needed, based on the socket API.
				267	This is then queued for the individual connection and sent by the
				268	connection's transport.
				269	struct rds_incoming
				270	a generic struct referring to incoming data that can be handed from
				271	the transport to the general code and queued by the general code
				272	while the socket is awoken. It is then passed back to the transport
				273	code to handle the actual copy-to-user.
				274	struct rds_socket
				275	per-socket information
				276	struct rds_connection
				277	per-connection information
				278	struct rds_transport
				279	pointers to transport-specific functions
				280	struct rds_statistics
				281	non-transport-specific statistics
				282	struct rds_cong_map
				283	wraps the raw congestion bitmap, contains rbnode, waitq, etc.
				284
				285	Connection management
				286	=====================
				287
				288	Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
				289	ERROR states.
				290
				291	The first time an attempt is made by an RDS socket to send data to
				292	a node, a connection is allocated and connected. That connection is
				293	then maintained forever -- if there are transport errors, the
				294	connection will be dropped and re-established.
				295
				296	Dropping a connection while packets are queued will cause queued or
				297	partially-sent datagrams to be retransmitted when the connection is
				298	re-established.
				299
				300
				301	The send path
				302	=============
				303
				304	rds_sendmsg()
				305	struct rds_message built from incoming data
				306	CMSGs parsed (e.g. RDMA ops)
				307	transport connection alloced and connected if not already
				308	rds_message placed on send queue
				309	send worker awoken
				310	rds_send_worker()
				311	calls rds_send_xmit() until queue is empty
				312	rds_send_xmit()
				313	transmits congestion map if one is pending
				314	may set ACK_REQUIRED
				315	calls transport to send either non-RDMA or RDMA message
				316	(RDMA ops never retransmitted)
				317	rds_ib_xmit()
				318	allocs work requests from send ring
				319	adds any new send credits available to peer (h_credits)
				320	maps the rds_message's sg list
				321	piggybacks ack
				322	populates work requests
				323	post send to connection's queue pair
				324
				325	The recv path
				326	=============
				327
				328	rds_ib_recv_cq_comp_handler()
				329	looks at write completions
				330	unmaps recv buffer from device
				331	no errors, call rds_ib_process_recv()
				332	refill recv ring
				333	rds_ib_process_recv()
				334	validate header checksum
				335	copy header to rds_ib_incoming struct if start of a new datagram
				336	add to ibinc's fraglist
				337	if competed datagram:
				338	update cong map if datagram was cong update
				339	call rds_recv_incoming() otherwise
				340	note if ack is required
				341	rds_recv_incoming()
				342	drop duplicate packets
				343	respond to pings
				344	find the sock associated with this datagram
				345	add to sock queue
				346	wake up sock
				347	do some congestion calculations
				348	rds_recvmsg
				349	copy data into user iovec
				350	handle CMSGs
				351	return to application
				352
				353