sfc: Use write-combining to reduce TX latency

Based on work by Neil Turton <nturton@solarflare.com> and
Kieran Mansley <kmansley@solarflare.com>.

The BIU has now been verified to handle 3- and 4-dword writes within a
single 128-bit register correctly.  This means we can enable write-
combining and only insert write barriers between writes to distinct
registers.
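
As a rough illustration of that barrier placement (not the driver code
itself), the user-space sketch below writes all four dwords of one
128-bit register with no barrier in between, then issues a single
barrier before moving on to the next register.  Every name in it
(sketch_writed, sketch_wmb, sketch_write_reg128, fake_bar, op_log) is
hypothetical, and sketch_wmb() only records where a real wmb() would
go; it does not order any hardware writes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: two 128-bit registers, four dwords each. */
enum { REG_DWORDS = 4, NUM_REGS = 2, LOG_MAX = 32 };

static uint32_t fake_bar[NUM_REGS * REG_DWORDS];	/* stand-in for MMIO space */
static char op_log[LOG_MAX + 1];			/* 'w' = dword write, 'B' = barrier */
static size_t log_len;

/* Clear the fake register file and the operation log. */
static void sketch_reset(void)
{
	memset(fake_bar, 0, sizeof(fake_bar));
	memset(op_log, 0, sizeof(op_log));
	log_len = 0;
}

static void log_op(char c)
{
	assert(log_len < LOG_MAX);
	op_log[log_len++] = c;
}

/* Stand-in for a single 32-bit MMIO write (assumption, not a driver API). */
static void sketch_writed(unsigned int reg, unsigned int dword, uint32_t value)
{
	assert(reg < NUM_REGS && dword < REG_DWORDS);
	fake_bar[reg * REG_DWORDS + dword] = value;
	log_op('w');
}

/* Marks the point where the kernel would call wmb(); it only logs here. */
static void sketch_wmb(void)
{
	log_op('B');
}

/*
 * Write one whole 128-bit register.  The four dword writes are left free
 * to be combined; the barrier comes only after the last one, so it
 * separates this register from whichever register is written next.
 */
static void sketch_write_reg128(unsigned int reg,
				const uint32_t value[REG_DWORDS])
{
	unsigned int i;

	for (i = 0; i < REG_DWORDS; i++)
		sketch_writed(reg, i, value[i]);
	sketch_wmb();
}
```

Writing two registers back to back with this sketch logs four writes,
one barrier, four writes, one barrier.  The MCDI copy-in path changed
below does the opposite on purpose: it keeps a wmb() after every dword
to inhibit combining.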

This has been observed to save about 0.5 us when pushing a TX
descriptor to an empty TX queue.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
diff --git a/drivers/net/sfc/mcdi.c b/drivers/net/sfc/mcdi.c
index 8bba895..5e118f0 100644
--- a/drivers/net/sfc/mcdi.c
+++ b/drivers/net/sfc/mcdi.c
@@ -94,14 +94,15 @@
 
 	efx_writed(efx, &hdr, pdu);
 
-	for (i = 0; i < inlen; i += 4)
+	for (i = 0; i < inlen; i += 4) {
 		_efx_writed(efx, *((__le32 *)(inbuf + i)), pdu + 4 + i);
-
-	/* Ensure the payload is written out before the header */
-	wmb();
+		/* use wmb() within loop to inhibit write combining */
+		wmb();
+	}
 
 	/* ring the doorbell with a distinctive value */
 	_efx_writed(efx, (__force __le32) 0x45789abc, doorbell);
+	wmb();
 }
 
 static void efx_mcdi_copyout(struct efx_nic *efx, u8 *outbuf, size_t outlen)