.. Blame: Jeff Layton, 80aafd5, 2017-07-24 06:22:16 -0400

The errseq_t datatype
=====================
An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether it has changed since a previous
point where it was sampled.

The initial use case for this is tracking errors for file
synchronization syscalls (fsync, fdatasync, msync and sync_file_range),
but it may be usable in other situations.

It's implemented as an unsigned 32-bit value.  The low order bits are
designated to hold an error code (between 1 and MAX_ERRNO).  The upper bits
are used as a counter.  This is done with atomics instead of locking so that
these functions can be called from any context.

Note that there is a risk of collisions if new errors are being recorded
frequently, since we have so few bits to use as a counter.

To mitigate this, the bit between the error value and counter is used as
a flag to tell whether the value has been sampled since a new value was
recorded.  That allows us to avoid bumping the counter if no one has
sampled it since the last time an error was recorded.

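The counter-saving rule can be sketched in plain userspace C. This is a
single-threaded model under stated assumptions, not the kernel
implementation: the ``MODEL_*`` constants and ``model_set()`` are
hypothetical names, the field widths assume MAX_ERRNO is 4095, and the
real code uses atomics rather than a plain read-modify-write.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical constants; widths assume MAX_ERRNO == 4095. */
#define MODEL_MAX_ERRNO 4095u        /* mask for the errno field */
#define MODEL_SEEN      (1u << 12)   /* "sampled since last set" flag */
#define MODEL_CTR_INC   (1u << 13)   /* one step of the counter */

/*
 * Single-threaded model of recording an error.
 * err is a negative errno value such as -EIO.
 */
static uint32_t model_set(uint32_t old, int err)
{
	uint32_t new_val;

	assert(err < 0 && (uint32_t)-err <= MODEL_MAX_ERRNO);

	/* Replace the errno bits and clear the "seen" flag. */
	new_val = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)-err;

	/* Bump the counter only if someone sampled the previous value. */
	if (old & MODEL_SEEN)
		new_val += MODEL_CTR_INC;

	return new_val;
}
```

In this model, recording the same error twice with no sampling in between
leaves the value unchanged, so no counter bits are consumed.
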
Thus we end up with a value that looks something like this::

	bit:  31..13            12   11..0
	     +-----------------+----+----------------+
	     |     counter     | SF |     errno      |
	     +-----------------+----+----------------+

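To make the layout concrete, here is a small userspace sketch that packs
and unpacks the three fields. The ``DEMO_*`` macros, ``struct demo_fields``
and both helpers are illustrative names only and are not part of the
errseq_t API; the bit positions are taken from the diagram.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative constants taken from the diagram; not kernel API. */
#define DEMO_MAX_ERRNO 4095u                 /* bits 11..0 */
#define DEMO_SEEN      (1u << 12)            /* bit 12, the SF flag */
#define DEMO_CTR_SHIFT 13                    /* counter is bits 31..13 */
#define DEMO_CTR_MAX   ((1u << 19) - 1)      /* 19 counter bits */

struct demo_fields {
	uint32_t counter;   /* number of counter bumps, modulo 2^19 */
	int seen;           /* 1 if the SF flag is set */
	int err;            /* positive errno, 0 if none recorded */
};

/* Pack the three fields into one 32-bit value. */
static uint32_t demo_encode(uint32_t counter, int seen, int err)
{
	assert(counter <= DEMO_CTR_MAX);
	assert(err >= 0 && (uint32_t)err <= DEMO_MAX_ERRNO);

	return (counter << DEMO_CTR_SHIFT) |
	       (seen ? DEMO_SEEN : 0u) |
	       (uint32_t)err;
}

/* Split a 32-bit value back into its fields. */
static struct demo_fields demo_decode(uint32_t v)
{
	struct demo_fields f;

	f.counter = v >> DEMO_CTR_SHIFT;
	f.seen = (v & DEMO_SEEN) != 0;
	f.err = (int)(v & DEMO_MAX_ERRNO);
	return f;
}
```
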
The general idea is for "watchers" to sample an errseq_t value and keep
it as a running cursor.  That value can later be used to tell whether
any new errors have occurred since that sampling was done, and atomically
record the state at the time that it was checked.  This allows us to
record errors in one place, and then have a number of "watchers" that
can tell whether the value has changed since they last checked it.

A new errseq_t should always be zeroed out.  An errseq_t value of all zeroes
is the special (but common) case where there has never been an error.  An all
zero value thus serves as the "epoch" if one wishes to know whether there
has ever been an error set since it was first initialized.

API usage
=========
Let me tell you a story about a worker drone.  Now, he's a good worker
overall, but the company is a little...management heavy.  He has to
report to 77 supervisors today, and tomorrow the "big boss" is coming in
from out of town and he's sure to test the poor fellow too.

They're all handing him work to do -- so much he can't keep track of who
handed him what, but that's not really a big problem.  The supervisors
just want to know when he's finished all of the work they've handed him so
far and whether he made any mistakes since they last asked.

He might have made the mistake on work they didn't actually hand him,
but he can't keep track of things at that level of detail.  All he can
remember is the most recent mistake that he made.

Here's our worker_drone representation::

	struct worker_drone {
		errseq_t	wd_err;	/* for recording errors */
	};

Every day, the worker_drone starts out with a blank slate::

	struct worker_drone wd;

	wd.wd_err = (errseq_t)0;

The supervisors come in and get an initial read for the day.  They
don't care about anything that happened before their watch begins::

	struct supervisor {
		errseq_t	s_wd_err;	/* private "cursor" for wd_err */
		spinlock_t	s_wd_err_lock;	/* protects s_wd_err */
	};

	struct supervisor su;

	su.s_wd_err = errseq_sample(&wd.wd_err);
	spin_lock_init(&su.s_wd_err_lock);

Now they start handing him tasks to do.  Every few minutes they ask him to
finish up all of the work they've handed him so far.  Then they ask him
whether he made any mistakes on any of it::

	spin_lock(&su.s_wd_err_lock);
	err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err);
	spin_unlock(&su.s_wd_err_lock);

Up to this point, that just keeps returning 0.

Now, the owners of this company are quite miserly and have given him
substandard equipment with which to do his job.  Occasionally it
glitches and he makes a mistake.  He sighs a heavy sigh, and marks it
down::

	errseq_set(&wd.wd_err, -EIO);

...and then gets back to work.  The supervisors eventually poll again
and they each get the error when they next check.  Subsequent calls will
return 0, until another error is recorded, at which point it's reported
to each of them once.

Note that the supervisors can't tell how many mistakes he made, only
whether one was made since they last checked, and the latest value
recorded.

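The "reported to each of them once" behaviour can be modelled in a few
lines of userspace C. This is a sketch under stated assumptions rather
than the kernel code: everything named ``model_*`` or ``MODEL_*`` is
hypothetical, the model is single-threaded (the real functions use
atomics), and having ``model_sample()`` return 0 for a value nobody has
seen yet is a modelling choice that the real errseq_sample() may not share.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint32_t model_errseq_t;     /* stand-in for errseq_t */

#define MODEL_MAX_ERRNO 4095u        /* errno field mask */
#define MODEL_SEEN      (1u << 12)   /* "sampled" flag */
#define MODEL_CTR_INC   (1u << 13)   /* one counter step */

/* Record a negative errno; bump the counter only if the old value was seen. */
static void model_set(model_errseq_t *eseq, int err)
{
	model_errseq_t old = *eseq;
	model_errseq_t new_val;

	assert(err < 0 && (uint32_t)-err <= MODEL_MAX_ERRNO);

	new_val = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)-err;
	if (old & MODEL_SEEN)
		new_val += MODEL_CTR_INC;
	*eseq = new_val;
}

/* Take a cursor. An unseen value is treated as still "new" (assumption). */
static model_errseq_t model_sample(const model_errseq_t *eseq)
{
	model_errseq_t cur = *eseq;

	return (cur & MODEL_SEEN) ? cur : 0;
}

/* Report a change once per cursor, then move that cursor forward. */
static int model_check_and_advance(model_errseq_t *eseq, model_errseq_t *since)
{
	model_errseq_t cur = *eseq;

	if (cur == *since)
		return 0;

	cur |= MODEL_SEEN;       /* remember that someone has looked */
	*eseq = cur;
	*since = cur;
	return -(int)(cur & MODEL_MAX_ERRNO);
}
```

With two cursors sampled from a zeroed value, two back-to-back
``model_set(&wd_err, -EIO)`` calls produce exactly one ``-EIO`` per cursor,
followed by 0 until the next error is recorded. Neither cursor can tell
that two mistakes were made.
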
Occasionally the big boss comes in for a spot check and asks the worker
to do a one-off job for him.  He's not really watching the worker
full-time like the supervisors, but he does need to know whether a
mistake occurred while his job was processing.

He can just sample the current errseq_t in the worker, and then use that
to tell whether an error has occurred later::

	errseq_t since = errseq_sample(&wd.wd_err);
	/* submit some work and wait for it to complete */
	err = errseq_check(&wd.wd_err, since);

Since he's just going to discard "since" after that point, he doesn't
need to advance it here.  He also doesn't need any locking since it's
not usable by anyone else.

Serializing errseq_t cursor updates
===================================
Note that the errseq_t API does not protect the errseq_t cursor during a
check_and_advance operation.  Only the canonical error code is handled
atomically.  In a situation where more than one task might be using the
same errseq_t cursor at the same time, it's important to serialize
updates to that cursor.

If that's not done, then it's possible for the cursor to go backward,
in which case the same error could be reported more than once.

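One way to see how the cursor can go backward is to split a
check-and-advance into its two halves and interleave them by hand. The
following is a deterministic, single-threaded simulation in userspace C,
not the kernel code: all ``model_*`` names are hypothetical, and the
"tasks" are just a chosen ordering of calls standing in for a real race.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint32_t model_errseq_t;     /* stand-in for errseq_t */

#define MODEL_MAX_ERRNO 4095u
#define MODEL_SEEN      (1u << 12)
#define MODEL_CTR_INC   (1u << 13)

static void model_set(model_errseq_t *eseq, int err)
{
	model_errseq_t old = *eseq;
	model_errseq_t new_val;

	assert(err < 0 && (uint32_t)-err <= MODEL_MAX_ERRNO);
	new_val = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)-err;
	if (old & MODEL_SEEN)
		new_val += MODEL_CTR_INC;
	*eseq = new_val;
}

/* Result of the first half, waiting to be written to the cursor. */
struct model_pending {
	model_errseq_t cursor;
	int changed;
	int err;
};

/* First half: read the shared value, compare, mark it as seen. */
static struct model_pending model_begin(model_errseq_t *eseq,
					model_errseq_t since)
{
	struct model_pending p = { since, 0, 0 };
	model_errseq_t cur = *eseq;

	if (cur != since) {
		cur |= MODEL_SEEN;
		*eseq = cur;
		p.cursor = cur;
		p.changed = 1;
		p.err = -(int)(cur & MODEL_MAX_ERRNO);
	}
	return p;
}

/* Second half: publish the new cursor and return the error. */
static int model_finish(model_errseq_t *since, struct model_pending p)
{
	if (p.changed)
		*since = p.cursor;
	return p.err;
}

/* Tasks A and B share one cursor with no lock; returns the third check. */
static int model_racy_third_check(void)
{
	model_errseq_t wd_err = 0, cursor = 0;
	struct model_pending a, b;

	model_set(&wd_err, -EIO);
	a = model_begin(&wd_err, cursor);     /* A reads, then stalls */
	model_set(&wd_err, -ENOSPC);
	b = model_begin(&wd_err, cursor);     /* B runs start to finish */
	(void)model_finish(&cursor, b);       /* B reports -ENOSPC */
	(void)model_finish(&cursor, a);       /* A moves the cursor backward */

	return model_finish(&cursor, model_begin(&wd_err, cursor));
}

/* Same events, but each check-and-advance completes before the next. */
static int model_serialized_third_check(void)
{
	model_errseq_t wd_err = 0, cursor = 0;

	model_set(&wd_err, -EIO);
	(void)model_finish(&cursor, model_begin(&wd_err, cursor));
	model_set(&wd_err, -ENOSPC);
	(void)model_finish(&cursor, model_begin(&wd_err, cursor));

	return model_finish(&cursor, model_begin(&wd_err, cursor));
}
```

In the racy ordering the third check reports ``-ENOSPC`` a second time,
because task A's stale write left the cursor pointing at the older value.
In the serialized ordering the third check returns 0.
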
Because of this, it's often advantageous to first do an errseq_check to
see if anything has changed, and only later do an
errseq_check_and_advance after taking the lock. e.g.::

	if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err))) {
		/* su.s_wd_err is protected by s_wd_err_lock */
		spin_lock(&su.s_wd_err_lock);
		err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err);
		spin_unlock(&su.s_wd_err_lock);
	}

That avoids the spinlock in the common case where nothing has changed
since the last time it was checked.