drbd: fix race between disconnect and receive_state

If the asender thread, or request_timer_fn(), or some other part of the
code, decided to drop the connection (because of timeout or other), but
the receiver just now was processing a P_STATE packet, there was a
chance that receive_state() would do a hard state change
"re-establishing" an already failed connection without additional
handshake.

Log excerpt:
  Remote failed to finish a request within ko-count * timeout
  peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
  asender terminated
  ...
  peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
  ...
  Connection closed
  peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
  receiver terminated

Impact: while the connection state is erroneously "Connected",
requests may be queued and even sent, which would never be
acknowledged, and may have been missed by the cleanup.
These requests would never be completed.
The next drbd_suspend_io() will then lock up, waiting forever for
these requests to complete.

Fixed in several code paths:

* Make sure the connection state is NetworkFailure or worse before
  starting the cleanup in drbd_disconnect().
  This should make sure the cleanup won't miss any requests.

* Disallow receive_state() to "upgrade" the connection state from an
  error state. This will make sure the "illegal" state transition
  won't happen.

* For all connection failure states, relax the safe-guard in
  sanitize_state() again to silently mask out those state changes
  (e.g. Timeout -> Connected becomes Timeout -> Timeout).

Note by Philipp Reisner:
  The 3rd chunk described as "relax the safe-guard..." is not
  there in 8.4 as it is relaxed to the maximum in 8.4 already.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
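
The guard in receive_state() and the masking described for sanitize_state() can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the `sk_`/`SK_` names are hypothetical and not kernel identifiers, the enum is a simplified stand-in for the connection states, and the one property it relies on (all failure and teardown states sort at or below the tear-down state, the connected state sorts above it) is assumed here rather than taken from the headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the connection states.
 * Assumption of this sketch: failure/teardown states sort at or
 * below SK_TEAR_DOWN, live states sort above it. */
enum sk_conn {
	SK_STANDALONE,
	SK_DISCONNECTING,
	SK_UNCONNECTED,
	SK_TIMEOUT,
	SK_BROKEN_PIPE,
	SK_NETWORK_FAILURE,
	SK_PROTOCOL_ERROR,
	SK_TEAR_DOWN,
	SK_WF_CONNECTION,
	SK_WF_REPORT_PARAMS,
	SK_CONNECTED,
};

/* Guard in the spirit of the receive_state() hunk: once some other
 * code path has taken the connection down, a peer state packet must
 * not be applied, because that would "re-establish" the connection
 * without a handshake. */
static inline bool sk_may_apply_peer_state(enum sk_conn current)
{
	return current > SK_TEAR_DOWN;
}

/* Masking in the spirit of the sanitize_state() description: while in
 * a failure state, a requested transition to a live state is silently
 * replaced by "stay where you are" (Timeout -> Connected becomes
 * Timeout -> Timeout). Transitions further down are left alone. */
static inline enum sk_conn sk_sanitize_conn(enum sk_conn os, enum sk_conn ns)
{
	assert(os <= SK_CONNECTED && ns <= SK_CONNECTED);

	if (os >= SK_TIMEOUT && os <= SK_TEAR_DOWN && ns > SK_TEAR_DOWN)
		return os;
	return ns;
}
```

With this model, `sk_sanitize_conn(SK_TIMEOUT, SK_CONNECTED)` yields `SK_TIMEOUT`, while a legitimate downward transition such as `sk_sanitize_conn(SK_TIMEOUT, SK_UNCONNECTED)` passes through unchanged.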

commit: b8853dbd8c6410d1faef2785e8ee4c990b068a77
author: Philipp Reisner <philipp.reisner@linbit.com> Tue Dec 13 11:09:16 2011 +0100
committer: Philipp Reisner <philipp.reisner@linbit.com> Thu Nov 08 16:58:12 2012 +0100
tree: 5888fa63f3b5cd0c95f89bb122fd744159b85762
parent: 57bcb6cf1ddb1593face20a13b140be19af9f6cd
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 733b8bd..1b6845a 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c

@@ -3787,6 +3787,12 @@
 	os = ns = drbd_read_state(mdev);
 	spin_unlock_irq(&mdev->tconn->req_lock);
 
+	/* If some other part of the code (asender thread, timeout)
+	 * already decided to close the connection again,
+	 * we must not "re-establish" it here. */
+	if (os.conn <= C_TEAR_DOWN)
+		return false;
+
 	/* If this is the "end of sync" confirmation, usually the peer disk
 	 * transitions from D_INCONSISTENT to D_UP_TO_DATE. For empty (0 bits
 	 * set) resync started in PausedSyncT, or if the timing of pause-/
@@ -4368,6 +4374,13 @@
 	if (tconn->cstate == C_STANDALONE)
 		return;
 
+	/* We are about to start the cleanup after connection loss.
+	 * Make sure drbd_make_request knows about that.
+	 * Usually we should be in some network failure state already,
+	 * but just in case we are not, we fix it up here.
+	 */
+	conn_request_state(tconn, NS(conn, C_NETWORK_FAILURE), CS_HARD);
+
 	/* asender does not clean up anything. it must not interfere, either */
 	drbd_thread_stop(&tconn->asender);
 	drbd_free_sock(tconn);
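
Why the forced state change has to come before the cleanup can be illustrated with a small single-threaded model. This is a sketch, not the driver's logic: `dm_link`, `dm_submit()` and `dm_disconnect()` are hypothetical names, the state list is reduced to what the example needs, and the real race involves several threads, which this model does not reproduce. It only shows the ordering: first make the submit path refuse new requests, then clean up what is pending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced state list; ordering is an assumption. */
enum dm_conn {
	DM_STANDALONE,
	DM_TIMEOUT,
	DM_NETWORK_FAILURE,
	DM_TEAR_DOWN,
	DM_CONNECTED,
};

struct dm_link {
	enum dm_conn cstate;
	int pending;	/* requests queued but not yet completed */
};

/* Submit path: requests are only queued while the link is connected. */
static inline bool dm_submit(struct dm_link *l)
{
	if (l->cstate != DM_CONNECTED)
		return false;
	l->pending++;
	return true;
}

/* Cleanup after connection loss. */
static inline void dm_disconnect(struct dm_link *l)
{
	assert(l->pending >= 0);

	if (l->cstate == DM_STANDALONE)
		return;

	/* Usually the link is in a failure state already; if it is not,
	 * force one so dm_submit() stops queueing before the cleanup. */
	if (l->cstate > DM_TEAR_DOWN)
		l->cstate = DM_NETWORK_FAILURE;

	/* Cleanup: everything still pending is completed (here: dropped). */
	l->pending = 0;
}
```

After `dm_disconnect()` on a link that still claimed to be connected, the state is `DM_NETWORK_FAILURE`, nothing is pending, and `dm_submit()` returns false, so no request can slip in behind the cleanup.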