Blame - Documentation/sparc/oradax/oracle-dax.txt - kernel/msm-5.4

Oracle Data Analytics Accelerator (DAX)
---------------------------------------

DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
(DAX2) processor chips, and has direct access to the CPU's L3 caches
as well as physical memory. It can perform several operations on data
streams with various input and output formats. A driver provides a
transport mechanism and has limited knowledge of the various opcodes
and data formats. A user space library provides high level services
and translates these into low level commands which are then passed
into the driver and subsequently the Hypervisor and the coprocessor.
The library is the recommended way for applications to use the
coprocessor, and the driver interface is not intended for general use.
This document describes the general flow of the driver, its
structures, and its programmatic interface. It also provides example
code sufficient to write user or kernel applications that use DAX
functionality.

The user library is open source and available at:
https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

The Hypervisor interface to the coprocessor is described in detail in
the accompanying document, dax-hv-api.txt, which is a plain text
excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
Specification" version 3.0.20+15, dated 2017-09-25.


High Level Overview
-------------------

A coprocessor request is described by a Command Control Block
(CCB). The CCB contains an opcode and various parameters. The opcode
specifies what operation is to be done, and the parameters specify
options, flags, sizes, and addresses. The CCB (or an array of CCBs)
is passed to the Hypervisor, which handles queueing and scheduling of
requests to the available coprocessor execution units. The returned
status code indicates whether the request was submitted successfully
or an error occurred. One of the addresses given in each CCB is a
pointer to a "completion area", which is a 128-byte memory block that
is written by the coprocessor to provide execution status. No
interrupt is generated upon completion; the completion area must be
polled by software to find out when a transaction has finished, but
the M7 and later processors provide a mechanism to pause the virtual
processor until the completion status has been updated by the
coprocessor. This is done using the monitored load and mwait
instructions, which are described in more detail later. The DAX
coprocessor was designed so that after a request is submitted, the
kernel is no longer involved in the processing of it. The polling is
done at the user level, which results in almost zero latency between
completion of a request and resumption of execution of the requesting
thread.


Addressing Memory
-----------------

The kernel does not have access to physical memory in the Sun4v
architecture, as there is an additional level of memory virtualization
present. This intermediate level is called "real" memory, and the
kernel treats this as if it were physical. The Hypervisor handles the
translations between real memory and physical so that each logical
domain (LDOM) can have a partition of physical memory that is isolated
from that of other LDOMs. When the kernel sets up a virtual mapping,
it specifies a virtual address and the real address to which it should
be mapped.

The DAX coprocessor can only operate on physical memory, so before a
request can be fed to the coprocessor, all the addresses in a CCB must
be converted into physical addresses. The kernel cannot do this since
it has no visibility into physical addresses. So a CCB may contain
either the virtual or real addresses of the buffers or a combination
of them. An "address type" field is available for each address that
may be given in the CCB. In all cases, the Hypervisor will translate
all the addresses to physical before dispatching to hardware. Address
translations are performed using the context of the process initiating
the request.


The Driver API
--------------

An application makes requests to the driver via the write() system
call, and gets results (if any) via read(). The completion areas are
made accessible via mmap(), and are read-only for the application.

The request may either be an immediate command or an array of CCBs to
be submitted to the hardware.

Each open instance of the device is exclusive to the thread that
opened it, and must be used by that thread for all subsequent
operations. The driver open function creates a new context for the
thread and initializes it for use. This context contains pointers and
values used internally by the driver to keep track of submitted
requests. The completion area buffer is also allocated, and this is
large enough to contain the completion areas for many concurrent
requests. When the device is closed, any outstanding transactions are
flushed and the context is cleaned up.

On a DAX1 system (M7), the device will be called "oradax1", while on a
DAX2 system (M8) it will be "oradax2". If an application requires one
or the other, it should simply attempt to open the appropriate
device. Only one of the devices will exist on any given system, so the
name can be used to determine what the platform supports.

The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
all of these, success is indicated by a return value from write()
equal to the number of bytes given in the call. Otherwise, -1 is
returned and errno is set.

CCB_DEQUEUE

	Tells the driver to clean up resources associated with past
	requests. Since no interrupt is generated upon the completion of a
	request, the driver must be told when it may reclaim resources. No
	further status information is returned, so the user should not
	subsequently call read().

CCB_KILL

	Kills a CCB during execution. The CCB is guaranteed to not continue
	executing once this call returns successfully. On success, read() must
	be called to retrieve the result of the action.

CCB_INFO

	Retrieves information about a currently executing CCB. Note that some
	Hypervisors might return 'notfound' when the CCB is in 'inprogress'
	state. To ensure a CCB in the 'notfound' state will never be executed,
	CCB_KILL must be invoked on that CCB. Upon success, read() must be
	called to retrieve the details of the action.

Submission of an array of CCBs for execution

	A write() whose length is a multiple of the CCB size is treated as a
	submit operation. The file offset is treated as the index of the
	completion area to use, and may be set via lseek() or using the
	pwrite() system call. If -1 is returned then errno is set to indicate
	the error. Otherwise, the return value is the length of the array that
	was actually accepted by the coprocessor. If the accepted length is
	equal to the requested length, then the submission was completely
	successful and there is no further status needed; hence, the user
	should not subsequently call read(). Partial acceptance of the CCB
	array is indicated by a return value less than the requested length,
	and read() must be called to retrieve further status information. The
	status will reflect the error caused by the first CCB that was not
	accepted, and status_data will provide additional data in some cases.
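The three possible outcomes of a submit write() described above can be
captured in a small helper. The following is a minimal sketch and not
part of the driver or libdax: the names CCB_BYTES, dax_submit_outcome,
dax_classify_submit() and dax_ccbs_accepted() are hypothetical, and it
assumes the returned length is counted in bytes, as the comparison
with the requested length suggests. It performs no I/O itself.

```c
#include <stddef.h>
#include <sys/types.h>

#define CCB_BYTES 64	/* hypothetical name; a CCB is 8 64-bit words */

enum dax_submit_outcome {
	DAX_SUBMIT_ERROR,	/* write() returned -1; consult errno */
	DAX_SUBMIT_COMPLETE,	/* whole array accepted; do not call read() */
	DAX_SUBMIT_PARTIAL	/* some CCBs not accepted; call read() */
};

/* Classify the return value of a submit write() of 'requested' bytes. */
static enum dax_submit_outcome dax_classify_submit(ssize_t ret,
						   size_t requested)
{
	if (ret < 0)
		return DAX_SUBMIT_ERROR;
	if ((size_t)ret == requested)
		return DAX_SUBMIT_COMPLETE;
	return DAX_SUBMIT_PARTIAL;
}

/* Number of whole CCBs accepted (assumes 'ret' is a byte count). */
static size_t dax_ccbs_accepted(ssize_t ret)
{
	return ret < 0 ? 0 : (size_t)ret / CCB_BYTES;
}
```

A caller would pass the value returned by write() or pwrite() together
with the byte count it asked for, and call read() only in the partial
case.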

MMAP

	The mmap() function provides access to the completion area allocated
	in the driver. Note that the completion area is not writeable by the
	user process, and the mmap call must not specify PROT_WRITE.
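Since a completion area is 128 bytes and the submit offset is an index
rather than a byte offset, finding one completion area inside the
mapping is plain pointer arithmetic. The sketch below assumes that the
completion areas are laid out back to back in the mapped buffer, which
this document does not state explicitly; COMPLETION_AREA_BYTES and
dax_completion_slot() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define COMPLETION_AREA_BYTES 128	/* size given in the overview */

/*
 * Return a read-only pointer to completion area 'index' within the
 * mapping that starts at 'base'. Assumes contiguous 128-byte slots.
 */
static const volatile uint8_t *dax_completion_slot(const volatile void *base,
						   size_t index)
{
	return (const volatile uint8_t *)base + index * COMPLETION_AREA_BYTES;
}
```

The pointer is const because the mapping is read-only for the
application, and volatile because the coprocessor updates it
asynchronously.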


Completion of a Request
-----------------------

The first byte in each completion area is the command status which is
updated by the coprocessor hardware. Software may take advantage of
new M7/M8 processor capabilities to efficiently poll this status byte.
First, a "monitored load" is achieved via a Load from Alternate Space
(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
"monitored wait" is achieved via the mwait instruction (a write to
%asr28). This instruction is like pause in that it suspends execution
of the virtual processor for the given number of nanoseconds, but in
addition will terminate early when one of several events occurs. If the
block of data containing the monitored location is modified, then the
mwait terminates. This causes software to resume execution immediately
(without a context switch or kernel to user transition) after a
transaction completes. Thus the latency between transaction completion
and resumption of execution may be just a few nanoseconds.
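A polling loop built on these two instructions might look like the
sketch below. It is an illustration only: dax_poll_status() and the
max_spins bound are hypothetical additions, and on non-SPARC builds
the monitored load and mwait are replaced by an ordinary volatile read
so that the control flow can be compiled and exercised anywhere.

```c
#include <stdint.h>

/*
 * Poll the command status byte until it becomes non-zero or until
 * max_spins iterations have elapsed. Returns the last status read
 * (0 means the request was still in progress when polling stopped).
 */
static uint8_t dax_poll_status(const volatile uint8_t *status_byte,
			       unsigned long max_spins)
{
	unsigned long status = 0;

	while (max_spins--) {
#if defined(__sparc__)
		/* Monitored load, ASI 0x84 (ASI_MONITOR_PRIMARY) */
		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
				     : "=r" (status)
				     : "r" (status_byte)
				     : "memory");
		if (status)
			break;
		/* mwait: pause up to 1000 ns or until the block changes */
		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::: "memory");
#else
		/* Portable stand-in: plain volatile load, no mwait */
		status = *status_byte;
		if (status)
			break;
#endif
	}
	return (uint8_t)status;
}
```

Bounding the loop is a design choice made here so that a caller can
fall back to CCB_INFO or CCB_KILL instead of waiting forever.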


Application Life Cycle of a DAX Submission
------------------------------------------

 - open dax device
 - call mmap() to get the completion area address
 - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
 - submit CCB via write() or pwrite()
 - go into a loop executing monitored load + monitored wait and
   terminate when the command status indicates the request is complete
   (CCB_KILL or CCB_INFO may be used any time as necessary)
 - perform a CCB_DEQUEUE
 - call munmap() for completion area
 - close the dax device


Memory Constraints
------------------

The DAX hardware operates only on physical addresses. Therefore, it is
not aware of virtual memory mappings and the discontiguities that may
exist in the physical memory that a virtual buffer maps to. There is
no I/O TLB or any scatter/gather mechanism. All buffers, whether input
or output, must reside in a physically contiguous region of memory.

The Hypervisor translates all addresses within a CCB to physical
before handing off the CCB to DAX. The Hypervisor determines the
virtual page size for each virtual address given, and uses this to
program a size limit for each address. This prevents the coprocessor
from reading or writing beyond the bound of the virtual page, even
though it is accessing physical memory directly. A simpler way of
saying this is that a DAX operation will never "cross" a virtual page
boundary. If an 8k virtual page is used, then the data is strictly
limited to 8k. If a user's buffer is larger than 8k, then a larger
page size must be used, or the transaction size will be truncated to
8k.
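An application can check this constraint before building a CCB. The
helper below is a minimal sketch with a hypothetical name,
dax_fits_in_page(); it assumes the caller already knows the page size
backing the buffer and that the size is a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the buffer [addr, addr + len) lies entirely within
 * one page of 'page_size' bytes. 'page_size' must be a power of two.
 */
static bool dax_fits_in_page(uint64_t addr, uint64_t len, uint64_t page_size)
{
	uint64_t offset = addr & (page_size - 1);	/* offset within page */

	return len != 0 && len <= page_size - offset;
}
```

For example, an 8192-byte buffer fits in an 8k page only if it starts
exactly on the page boundary; any other starting offset leaves less
than 8192 bytes before the end of the page.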

Huge pages. A user may allocate huge pages using standard interfaces.
Memory buffers residing on huge pages may be used to achieve much
larger DAX transaction sizes, but the rules must still be followed,
and no transaction will cross a page boundary, even a huge page. A
major caveat is that Linux on SPARC presents 8MB as one of the huge
page sizes. SPARC does not actually provide an 8MB hardware page size,
and this size is synthesized by pasting together two 4MB pages. The
reasons for this are historical, and it creates an issue because only
half of this 8MB page can actually be used for any given buffer in a
DAX request, and it must be either the first half or the second half;
it cannot be a 4MB chunk in the middle, since that crosses a
(hardware) page boundary. Note that this entire issue may be hidden by
higher level libraries.
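The effect of the 8MB caveat can be shown with a little arithmetic.
The sketch below uses hypothetical names (HW_PAGE_4MB and
dax_max_len_4mb()) and computes how many bytes remain before the next
4MB hardware page boundary; it assumes the buffer address is backed by
one of the two 4MB pages that make up an 8MB huge page.

```c
#include <stdint.h>

#define HW_PAGE_4MB (4ULL << 20)	/* hardware page behind an 8MB huge page */

/*
 * Largest transaction, in bytes, that can start at 'addr' without
 * crossing the next 4MB hardware page boundary.
 */
static uint64_t dax_max_len_4mb(uint64_t addr)
{
	return HW_PAGE_4MB - (addr & (HW_PAGE_4MB - 1));
}
```

A buffer that starts 3MB into an 8MB huge page therefore has only 1MB
available to a single DAX request, even though 5MB of the huge page
lies beyond its start address.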


CCB Structure
-------------

A CCB is an array of 8 64-bit words. Several of these words provide
command opcodes, parameters, flags, etc., and the rest are addresses
for the completion area, output buffer, and various inputs:

	struct ccb {
		u64 control;
		u64 completion;
		u64 input0;
		u64 access;
		u64 input1;
		u64 op_data;
		u64 output;
		u64 table;
	};

See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
each of these fields, and see dax-hv-api.txt for a complete description
of the Hypervisor API available to the guest OS (i.e., the Linux kernel).

The first word (control) is examined by the driver for the following:
 - CCB version, which must be consistent with hardware version
 - Opcode, which must be one of the documented allowable commands
 - Address types, which must be set to "virtual" for all the addresses
   given by the user, thereby ensuring that the application can
   only access memory that it owns
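In user space the same layout can be declared with fixed-width types
and checked at compile time. This is a sketch rather than the
definition used by libdax: it substitutes uint64_t for the kernel's
u64, and ccb_array_bytes() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the CCB layout shown above */
struct ccb {
	uint64_t control;
	uint64_t completion;
	uint64_t input0;
	uint64_t access;
	uint64_t input1;
	uint64_t op_data;
	uint64_t output;
	uint64_t table;
};

/* Catch accidental layout changes at compile time (C11). */
_Static_assert(sizeof(struct ccb) == 64, "a CCB must be exactly 64 bytes");

/* Byte count to pass to write() when submitting 'n' CCBs. */
static size_t ccb_array_bytes(size_t n)
{
	return n * sizeof(struct ccb);
}
```

Because a submit write() must have a length that is a multiple of the
CCB size, deriving the byte count from sizeof(struct ccb) avoids
hard-coding 64 at each call site.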


Example Code
------------

The DAX is accessible to both user and kernel code. The kernel code
can make hypercalls directly while the user code must use wrappers
provided by the driver. The setup of the CCB is nearly identical for
both; the only difference is in preparation of the completion area. An
example of user code is given now, with kernel code afterwards.

In order to program using the driver API, the file
arch/sparc/include/uapi/asm/oradax.h must be included.

First, the proper device must be opened. For M7 it will be
/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
procedure is to attempt to open both, as only one will succeed:

	fd = open("/dev/oradax1", O_RDWR);
	if (fd < 0)
		fd = open("/dev/oradax2", O_RDWR);
	if (fd < 0) {
		/* No DAX found, bail out */
	}

Next, the completion area must be mapped:

	completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);

All input and output buffers must be fully contained in one hardware
page, since as explained above, the DAX is strictly constrained by
virtual page boundaries. In addition, the output buffer must be
64-byte aligned and its size must be a multiple of 64 bytes because
the coprocessor writes in units of cache lines.

This example demonstrates the DAX Scan command, which takes as input a
vector and a match value, and produces a bitmap as the output. For
each input element that matches the value, the corresponding bit is
set in the output.

In this example, the input vector consists of a series of single bits,
and the match value is 0. So each 0 bit in the input will produce a 1
in the output, and vice versa, which produces an output bitmap which
is the input bitmap inverted.

For details of all the parameters and bits used in this CCB, please
refer to section 36.2.1.3 of the DAX Hypervisor API document, which
describes the Scan command in detail.

	ccb->control =		/* Table 36.1, CCB Header Format */
		  (2L << 48)	/* command = Scan Value */
		| (3L << 40)	/* output address type = primary virtual */
		| (3L << 34)	/* primary input address type = primary virtual */
				/* Section 36.2.1, Query CCB Command Formats */
		| (1 << 28)	/* 36.2.1.1.1 primary input format = fixed width bit packed */
		| (0 << 23)	/* 36.2.1.1.2 primary input element size = 0 (1 bit) */
		| (8 << 10)	/* 36.2.1.1.6 output format = bit vector */
		| (0 << 5)	/* 36.2.1.3 First scan criteria size = 0 (1 byte) */
		| (31 << 0);	/* 36.2.1.3 Disable second scan criteria */

	ccb->completion = 0;	/* Completion area address, to be filled in by driver */

	ccb->input0 = (unsigned long) input;	/* primary input address */

	ccb->access =		/* Section 36.2.1.2, Data Access Control */
		  (2 << 24)	/* Primary input length format = bits */
		| (nbits - 1);	/* number of bits in primary input stream, minus 1 */

	ccb->input1 = 0;	/* secondary input address, unused */

	ccb->op_data = 0;	/* scan criteria (value to be matched) */

	ccb->output = (unsigned long) output;	/* output address */

	ccb->table = 0;		/* table address, unused */
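The literal shifts in the example above can be given names to make the
control and access words easier to audit. The macro and function names
below are hypothetical and do not come from oradax.h or libdax; only
the shift amounts and field values are taken from the example.

```c
#include <stdint.h>

/* Field positions copied from the shifts used in the example above */
#define CCB_CMD_SHIFT		48
#define CCB_OUT_ATYPE_SHIFT	40
#define CCB_IN0_ATYPE_SHIFT	34
#define CCB_IN0_FMT_SHIFT	28
#define CCB_IN0_ELEM_SHIFT	23
#define CCB_OUT_FMT_SHIFT	10
#define CCB_SCAN1_SIZE_SHIFT	5
#define CCB_SCAN2_SIZE_SHIFT	0
#define CCB_LEN_FMT_SHIFT	24

/* Control word for the 1-bit Scan Value example. */
static uint64_t scan_control_word(void)
{
	return ((uint64_t)2 << CCB_CMD_SHIFT)		/* Scan Value */
	     | ((uint64_t)3 << CCB_OUT_ATYPE_SHIFT)	/* primary virtual */
	     | ((uint64_t)3 << CCB_IN0_ATYPE_SHIFT)	/* primary virtual */
	     | ((uint64_t)1 << CCB_IN0_FMT_SHIFT)	/* fixed width bit packed */
	     | ((uint64_t)0 << CCB_IN0_ELEM_SHIFT)	/* element size 0 = 1 bit */
	     | ((uint64_t)8 << CCB_OUT_FMT_SHIFT)	/* bit vector output */
	     | ((uint64_t)0 << CCB_SCAN1_SIZE_SHIFT)	/* criteria size 0 = 1 byte */
	     | ((uint64_t)31 << CCB_SCAN2_SIZE_SHIFT);	/* second criteria off */
}

/* Access word for an input of 'nbits' bits; nbits must be at least 1. */
static uint64_t scan_access_word(uint64_t nbits)
{
	return ((uint64_t)2 << CCB_LEN_FMT_SHIFT)	/* length format = bits */
	     | (nbits - 1);				/* length minus 1 */
}
```

Casting every field to a 64-bit type before shifting avoids relying on
the width of int or long, which matters for the fields above bit 31.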

The CCB submission is a write() or pwrite() system call to the
driver. If the call fails, then a read() must be used to retrieve the
status:

	if (pwrite(fd, ccb, 64, 0) != 64) {
		struct ccb_exec_result status;
		read(fd, &status, sizeof(status));
		/* bail out */
	}

After a successful submission of the CCB, the completion area may be
polled to determine when the DAX is finished. Detailed information on
the contents of the completion area can be found in section 36.2.2 of
the DAX HV API document.

	while (1) {
		/* Monitored Load */
		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
			: "=r" (status)
			: "r" (completion_area));

		if (status)	/* 0 indicates command in progress */
			break;

		/* MWAIT */
		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);	/* 1000 ns */
	}

A completion area status of 1 indicates successful completion of the
CCB and validity of the output bitmap, which may be used immediately.
All other non-zero values indicate error conditions which are
described in section 36.2.2.

	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
		/* completion_area[0] contains the completion status */
		/* completion_area[1] contains an error code, see 36.2.2 */
	}
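The three cases for the status byte can be made explicit with a small
decoder. This is only a sketch: the enum and dax_decode_status() are
hypothetical names, and the individual error values are deliberately
not enumerated here because they are defined in section 36.2.2 rather
than in this document.

```c
#include <stdint.h>

enum dax_completion {
	DAX_IN_PROGRESS,	/* status byte 0 */
	DAX_SUCCESS,		/* status byte 1 */
	DAX_FAILED		/* any other value; see section 36.2.2 */
};

/* Map the first byte of a completion area to one of three states. */
static enum dax_completion dax_decode_status(uint8_t status)
{
	switch (status) {
	case 0:
		return DAX_IN_PROGRESS;
	case 1:
		return DAX_SUCCESS;
	default:
		return DAX_FAILED;
	}
}
```

Only when the result is DAX_FAILED does the second byte of the
completion area need to be examined for the error code.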

After the completion area has been processed, the driver must be
notified that it can release any resources associated with the
request. This is done via the dequeue operation:

	struct dax_command cmd;
	cmd.command = CCB_DEQUEUE;
	if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
		/* bail out */
	}

Finally, normal program cleanup should be done, i.e., unmapping
completion area, closing the dax device, freeing memory etc.

[Kernel example]

The only difference in using the DAX in kernel code is the treatment
of the completion area. Unlike user applications which mmap the
completion area allocated by the driver, kernel code must allocate its
own memory to use for the completion area, and this address and its
type must be given in the CCB:

	ccb->control |=		/* Table 36.1, CCB Header Format */
		(3L << 32);	/* completion area address type = primary virtual */

	ccb->completion = (unsigned long) completion_area;	/* Completion area address */

The dax submit hypercall is made directly. The flags used in the
ccb_submit call are documented in the DAX HV API in section 36.3.1.

	#include <asm/hypervisor.h>

	hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
				 HV_CCB_QUERY_CMD |
				 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
				 HV_CCB_VA_PRIVILEGED,
				 0, &bytes_accepted, &status_data);

	if (hv_rv != HV_EOK) {
		/* hv_rv is an error code, status_data contains */
		/* potential additional status, see 36.3.1.1 */
	}

After the submission, the completion area polling code is identical to
that in user land:

	while (1) {
		/* Monitored Load */
		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
			: "=r" (status)
			: "r" (completion_area));

		if (status)	/* 0 indicates command in progress */
			break;

		/* MWAIT */
		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);	/* 1000 ns */
	}

	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
		/* completion_area[0] contains the completion status */
		/* completion_area[1] contains an error code, see 36.2.2 */
	}

The output bitmap is ready for consumption immediately after the
completion status indicates success.
				426	}
				427
				428	The output bitmap is ready for consumption immediately after the
				429	completion status indicates success.