The Spidernet Device Driver
===========================

Written by Linas Vepstas <linas@austin.ibm.com>

Version of 7 June 2007

Abstract
========
This document sketches the structure of portions of the spidernet
device driver in the Linux kernel tree. The spidernet is a gigabit
ethernet device built into the Toshiba southbridge commonly used
in the SONY Playstation 3 and the IBM QS20 Cell blade.

The Structure of the RX Ring.
=============================
The receive (RX) ring is a circular linked list of RX descriptors,
together with three pointers into the ring that are used to manage its
contents.

The elements of the ring are called "descriptors" or "descrs"; they
describe the received data. This includes a pointer to a buffer
containing the received data, the buffer size, and various status bits.

There are three primary states that a descriptor can be in: "empty",
"full" and "not-in-use". An "empty" or "ready" descriptor is ready
to receive data from the hardware. A "full" descriptor has data in it,
and is waiting to be emptied and processed by the OS. A "not-in-use"
descriptor is neither empty nor full; it is simply not ready. It may
not even have a data buffer attached, or may be otherwise unusable.

During normal operation, on device startup, the OS (specifically, the
spidernet device driver) allocates a set of RX descriptors and RX
buffers. These are all marked "empty", ready to receive data. This
ring is handed off to the hardware, which sequentially fills in the
buffers, and marks them "full". The OS follows up, taking the full
buffers, processing them, and re-marking them empty.

This filling and emptying is managed by three pointers: the "head"
and "tail" pointers, which are managed by the OS, and a hardware
current descriptor pointer (GDACTDPA). The GDACTDPA points at the descr
currently being filled. When this descr is filled, the hardware
marks it full, and advances the GDACTDPA by one. Thus, when there is
flowing RX traffic, every descr behind it should be marked "full",
and everything in front of it should be "empty". If the hardware
discovers that the current descr is not empty, it will signal an
interrupt, and halt processing.

The tail pointer tails or trails the hardware pointer. When the
hardware is ahead, the tail pointer will be pointing at a "full"
descr. The OS will process this descr, and then mark it "not-in-use",
and advance the tail pointer. Thus, when there is flowing RX traffic,
all of the descrs in front of the tail pointer should be "full", and
all of those behind it should be "not-in-use". When RX traffic is not
flowing, then the tail pointer can catch up to the hardware pointer.
The OS will then note that the current tail is "empty", and halt
processing.

The head pointer (somewhat mis-named) follows after the tail pointer.
When traffic is flowing, then the head pointer will be pointing at
a "not-in-use" descr. The OS will perform various housekeeping duties
on this descr. This includes allocating a new data buffer and
dma-mapping it so as to make it visible to the hardware. The OS will
then mark the descr as "empty", ready to receive data. Thus, when there
is flowing RX traffic, everything in front of the head pointer should
be "not-in-use", and everything behind it should be "empty". If no
RX traffic is flowing, then the head pointer can catch up to the tail
pointer, at which point the OS will notice that the head descr is
"empty", and it will halt processing.

Thus, in an idle system, the GDACTDPA, tail and head pointers will
all be pointing at the same descr, which should be "empty". All of the
other descrs in the ring should be "empty" as well.
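The interplay of the three pointers can be modelled in a few lines of C.
The sketch below only illustrates the description above; it is not code
from the driver. The type and function names (rx_ring, hw_receive,
os_process, os_refill) and the ring size of 8 are made up for this
example, the ring is a plain array rather than a linked list, and buffer
allocation and dma-mapping are reduced to a state change.

```c
#include <assert.h>
#include <stddef.h>

/* Descriptor states as named in this document (illustrative only). */
enum rx_state { RX_EMPTY, RX_FULL, RX_NOT_IN_USE };

#define RING_SIZE 8	/* hypothetical; chosen small for the example */

struct rx_ring {
	enum rx_state descr[RING_SIZE];
	size_t hw;	/* models GDACTDPA: descr the hardware fills next */
	size_t tail;	/* OS: next full descr to process */
	size_t head;	/* OS: next not-in-use descr to refill */
};

/* Device startup: every descr is empty, all pointers coincide. */
static void ring_init(struct rx_ring *r)
{
	assert(r != NULL);
	for (size_t i = 0; i < RING_SIZE; i++)
		r->descr[i] = RX_EMPTY;
	r->hw = r->tail = r->head = 0;
}

/* Hardware side: fill the current descr and advance.
 * Returns 0 when the current descr is not empty (hardware halts). */
static int hw_receive(struct rx_ring *r)
{
	assert(r != NULL);
	if (r->descr[r->hw] != RX_EMPTY)
		return 0;
	r->descr[r->hw] = RX_FULL;
	r->hw = (r->hw + 1) % RING_SIZE;
	return 1;
}

/* OS side, tail pointer: process full descrs, mark them not-in-use.
 * Stops at the first descr that is not full. Returns the count. */
static int os_process(struct rx_ring *r)
{
	int n = 0;

	assert(r != NULL);
	while (r->descr[r->tail] == RX_FULL) {
		r->descr[r->tail] = RX_NOT_IN_USE;
		r->tail = (r->tail + 1) % RING_SIZE;
		n++;
	}
	return n;
}

/* OS side, head pointer: give not-in-use descrs a fresh buffer
 * (not modelled here) and mark them empty. Returns the count. */
static int os_refill(struct rx_ring *r)
{
	int n = 0;

	assert(r != NULL);
	while (r->descr[r->head] == RX_NOT_IN_USE) {
		r->descr[r->head] = RX_EMPTY;
		r->head = (r->head + 1) % RING_SIZE;
		n++;
	}
	return n;
}
```

In this model, filling all eight descrs makes the next hw_receive()
return 0, which corresponds to the hardware halting on a not-empty
descr. One call each to os_process() and os_refill() then returns the
ring to the idle state, with all three indices equal.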

The show_rx_chain() routine will print out the locations of the
GDACTDPA, tail and head pointers. It will also summarize the contents
of the ring, starting at the tail pointer, and listing the status
of the descrs that follow.

A typical example of the output, for a nearly idle system, might be:

net eth1: Total number of descrs=256
net eth1: Chain tail located at descr=20
net eth1: Chain head is at 20
net eth1: HW curr desc (GDACTDPA) is at 21
net eth1: Have 1 descrs with stat=x40800101
net eth1: HW next desc (GDACNEXTDA) is at 22
net eth1: Last 255 descrs with stat=xa0800000

In the above, the hardware has filled in one descr, number 20. Both
head and tail are pointing at 20, because it has not yet been emptied.
Meanwhile, hw is pointing at 21, which is free.

The "Have nnn descrs" refers to the descrs starting at the tail: in this
case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers
to all of the rest of the descrs, from the last status change. The "nnn"
is a count of how many descrs have exactly the same status.
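The counting behind the "Have nnn descrs" lines amounts to a run-length
scan of the ring. The helper below is a hypothetical sketch of such a
scan, not the code of show_rx_chain(); the name rx_run_length and the
use of a plain array of 32-bit status words are assumptions made for
the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many consecutive descrs, starting at 'start' and wrapping
 * around a ring of 'n' entries, carry exactly the same status word as
 * the descr at 'start'. Returns 0 for an empty ring. */
static size_t rx_run_length(const uint32_t *stat, size_t n, size_t start)
{
	size_t len = 0;

	if (n == 0)
		return 0;
	start %= n;
	while (len < n && stat[(start + len) % n] == stat[start])
		len++;
	return len;
}
```

Calling this repeatedly, each time starting where the previous run
ended, yields one count per status change, which is the shape of the
summary shown above.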

The status x4... corresponds to "full" and status xa... corresponds
to "empty". The actual value printed is RXCOMST_A.

In the device driver source code, a different set of names is
used for these same concepts, so that

"empty"      == SPIDER_NET_DESCR_CARDOWNED  == 0xa
"full"       == SPIDER_NET_DESCR_FRAME_END  == 0x4
"not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
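
As a reading aid for the printed status values, the following sketch
maps a status word to the state names used in this document. It assumes
that the status is a 32-bit word and that the state lives in its top
four bits, which is inferred from printed values such as x40800101; the
macro and function names are invented for the example and are not the
driver's definitions.

```c
#include <stdint.h>

/* Top-nibble values quoted in this document; names are illustrative. */
#define RX_STAT_EMPTY		0xau	/* SPIDER_NET_DESCR_CARDOWNED */
#define RX_STAT_FULL		0x4u	/* SPIDER_NET_DESCR_FRAME_END */
#define RX_STAT_NOT_IN_USE	0xfu	/* SPIDER_NET_DESCR_NOT_IN_USE */

/* Extract the top four bits of a 32-bit status word (assumed layout). */
static unsigned int rx_stat_nibble(uint32_t status)
{
	return (unsigned int)(status >> 28);
}

/* Translate a status word into the state name used in this document. */
static const char *rx_stat_name(uint32_t status)
{
	switch (rx_stat_nibble(status)) {
	case RX_STAT_EMPTY:
		return "empty";
	case RX_STAT_FULL:
		return "full";
	case RX_STAT_NOT_IN_USE:
		return "not in use";
	default:
		return "other";
	}
}
```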


The RX RAM full bug/feature
===========================

As long as the OS can empty out the RX buffers at a rate faster than
the hardware can fill them, there is no problem. If, for some reason,
the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
pointer will catch up to the head, notice the not-empty condition,
and stop. However, RX packets may still continue arriving on the wire.
The spidernet chip can save some limited number of these in local RAM.
When this local RAM fills up, the spider chip will issue an interrupt
indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
will be set in GHIINT1STS). When the RX RAM full condition occurs,
a certain bug/feature is triggered that has to be specially handled.
This section describes the special handling for this condition.

When the OS finally has a chance to run, it will empty out the RX ring.
In particular, it will clear the descriptor on which the hardware had
stopped. However, once the hardware has decided that a certain
descriptor is invalid, it will not restart at that descriptor; instead
it will restart at the next descr. This potentially will lead to a
deadlock condition, as the tail pointer will be pointing at this descr,
which, from the OS point of view, is empty; the OS will be waiting for
this descr to be filled. However, the hardware has skipped this descr,
and is filling the next descrs. Since the OS doesn't see this, there
is a potential deadlock, with the OS waiting for one descr to fill,
while the hardware is waiting for a different set of descrs to become
empty.

A call to show_rx_chain() at this point indicates the nature of the
problem. A typical print when the network is hung shows the following:

net eth1: Spider RX RAM full, incoming packets might be discarded!
net eth1: Total number of descrs=256
net eth1: Chain tail located at descr=255
net eth1: Chain head is at 255
net eth1: HW curr desc (GDACTDPA) is at 0
net eth1: Have 1 descrs with stat=xa0800000
net eth1: HW next desc (GDACNEXTDA) is at 1
net eth1: Have 127 descrs with stat=x40800101
net eth1: Have 1 descrs with stat=x40800001
net eth1: Have 126 descrs with stat=x40800101
net eth1: Last 1 descrs with stat=xa0800000

Both the tail and head pointers are pointing at descr 255, which is
marked xa... which is "empty". Thus, from the OS point of view, there
is nothing to be done. In particular, there is the implicit assumption
that everything in front of the "empty" descr must surely also be empty,
as explained in the previous section. The OS is waiting for descr 255 to
become non-empty, which, in this case, will never happen.

The HW pointer is at descr 0. This descr is marked x4... or "full".
Since it's already full, the hardware can do nothing more, and thus has
halted processing. Notice that descrs 0 through 253 are all marked
"full", while descrs 254 and 255 are empty. (The "Last 1 descrs" is
descr 254, since tail was at 255.) Thus, the system is deadlocked,
and there can be no forward progress; the OS thinks there's nothing
to do, and the hardware has nowhere to put incoming data.

This bug/feature is worked around with the spider_net_resync_head_ptr()
routine. When the driver receives RX interrupts, but an examination
of the RX chain seems to show it is empty, then it is probable that
the hardware has skipped a descr or two (sometimes dozens under heavy
network conditions). The spider_net_resync_head_ptr() subroutine will
search the ring for the next full descr, and the driver will resume
operations there. Since this will leave "holes" in the ring, there
is also a spider_net_resync_tail_ptr() that will skip over such holes.
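
The search for the next full descr can be pictured as a simple forward
scan. The following is a sketch only, assuming the status is a 32-bit
word whose top four bits hold the state, as the printed values suggest.
The name rx_find_next_full is hypothetical, and the array stands in for
the linked list of descriptors that the driver actually walks.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_STAT_FULL 0x4u	/* top nibble for "full", per this document */

/* Search forward from 'start', wrapping around a ring of 'n' entries,
 * for the first descr marked full. Returns its index, or 'n' if no
 * descr in the ring is full. */
static size_t rx_find_next_full(const uint32_t *stat, size_t n, size_t start)
{
	for (size_t i = 0; i < n; i++) {
		size_t idx = (start + i) % n;

		if ((stat[idx] >> 28) == RX_STAT_FULL)
			return idx;
	}
	return n;
}
```

Applied to a small ring laid out like the hung example above (tail on
an empty descr at the last index, full descrs starting at index 0), a
scan from the tail returns index 0, which is where processing would
resume.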

As of this writing, the spider_net_resync() strategy seems to work very
well, even under heavy network loads.


The TX ring
===========
The TX ring uses a low-watermark interrupt scheme to make sure that
the TX queue is appropriately serviced for large packet sizes.

For packet sizes greater than about 1 KByte, the kernel can fill
the TX ring quicker than the device can drain it. Once the ring
is full, the netdev is stopped. When there is room in the ring,
the netdev needs to be reawakened, so that more TX packets are placed
in the ring. The hardware can empty the ring about four times per jiffy,
so it's not appropriate to wait for the poll routine to refill, since
the poll routine runs only once per jiffy. The low-watermark mechanism
marks a descr about one quarter of the way from the bottom of the queue,
so that an interrupt is generated when the descr is processed. This
interrupt wakes up the netdev, which can then refill the queue.
For large packets, this mechanism generates a relatively small number
of interrupts, about 1K/sec. For smaller packets, this will drop to zero
interrupts, as the hardware can empty the queue faster than the kernel
can fill it.
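
A rough sketch of how the low-watermark descr might be chosen is given
below. It assumes that "the bottom of the queue" means the most recently
queued end, so that the marked descr lies three quarters of the way in
from the oldest pending descr. That reading, the array-index
representation and the name tx_low_watermark are assumptions made for
illustration; they are not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the index of the descr to flag for a low-watermark interrupt.
 * 'tail' is the index of the oldest pending descr, 'pending' is the
 * number of descrs queued, 'ring_size' is the total ring length.
 * The result lies 3/4 of the way from the tail into the pending run. */
static size_t tx_low_watermark(size_t tail, size_t pending, size_t ring_size)
{
	assert(ring_size > 0);
	assert(tail < ring_size && pending <= ring_size);
	return (tail + (pending * 3) / 4) % ring_size;
}
```

With a full ring of 256 descrs and the tail at 0, this picks descr 192,
leaving roughly a quarter of the queue still to be sent when the
interrupt fires.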


======= END OF DOCUMENT ========