<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>
<title>Ogg Documentation</title>

<style type="text/css">
body {
  margin: 0 18px 0 18px;
  padding-bottom: 30px;
  font-family: Verdana, Arial, Helvetica, sans-serif;
  color: #333333;
  font-size: .8em;
}

a {
  color: #3366cc;
}

img {
  border: 0;
}

#xiphlogo {
  margin: 30px 0 16px 0;
}

#content p {
  line-height: 1.4;
}

h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {
  font-weight: bold;
  color: #ff9900;
  margin: 1.3em 0 8px 0;
}

h1 {
  font-size: 1.3em;
}

h2 {
  font-size: 1.2em;
}

h3 {
  font-size: 1.1em;
}

li {
  line-height: 1.4;
}

#copyright {
  margin-top: 30px;
  line-height: 1.5em;
  text-align: center;
  font-size: .8em;
  color: #888888;
  clear: both;
}
</style>

</head>

<body>

<div id="xiphlogo">
  <a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.org"/></a>
</div>

<h1>Page Multiplexing and Ordering in a Physical Ogg Stream</h1>

<p>The low-level mechanisms of an Ogg stream (as described in the Ogg
Bitstream Overview) provide means for mixing multiple logical streams
and media types into a single linear-chronological stream. This
document specifies the high-level arrangement and use of page
structure to multiplex multiple streams of mixed media type within a
physical Ogg stream.</p>

<h2>Design Elements</h2>

<p>The design and arrangement of the Ogg container format is governed by
several high-level design decisions that form the reasoning behind
specific low-level design decisions.</p>

<h3>Linear media</h3>

<p>The Ogg bitstream is intended to encapsulate chronological,
time-linear mixed media into a single delivery stream or file. The
design is such that an application can always encode and/or decode a
full-featured bitstream in one pass with no seeking and minimal
buffering. Seeking to provide optimized encoding (such as two-pass
encoding) or interactive decoding (such as scrubbing or instant
replay) is neither disallowed nor discouraged; however, no bitstream
feature may require nonlinear operation on the bitstream.</p>

<h3>Multiplexing</h3>

<p>Ogg bitstreams multiplex multiple logical streams into a single
physical stream at the page level. Each page contains an abstract
time stamp (the Granule Position) that represents an absolute time
landmark within the stream. After the pages representing stream
headers (all logical stream headers occur at the beginning of a
physical bitstream section before any logical stream data), logical
stream data pages are arranged in a physical bitstream in strict
non-decreasing order by chronological absolute time as
specified by the granule position.</p>

<p>The only exception to arranging pages in non-decreasing time order
by granule position is those pages that do not set the granule
position value. This special case arises when exceptionally large
packets span multiple pages; the specifics of handling this special
case are described later under 'Continuous and Discontinuous
Streams'.</p>
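
<p>As a rough illustration of the ordering rule above, the following
Python sketch interleaves data pages from several logical streams in
non-decreasing order of absolute time. The <tt>Page</tt> tuple, the
<tt>multiplex</tt> helper and the time values are hypothetical and not
part of any Ogg library; header pages and pages with an unset granule
position are left out for brevity.</p>

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Page(NamedTuple):
    """Hypothetical stand-in for an Ogg data page (not a libogg type)."""
    time: float      # absolute time the page's granule position maps to
    serialno: int    # identifies the logical stream the page belongs to


def multiplex(*streams: Iterable[Page]) -> Iterator[Page]:
    """Merge per-stream page sequences into one physical page order.

    Each input must already be in non-decreasing time order, as the
    granule position rules require within a logical stream.
    """
    return heapq.merge(*streams, key=lambda page: page.time)


# Two made-up logical streams, each already sorted by time.
audio = [Page(0.5, 1), Page(1.0, 1), Page(1.5, 1)]
video = [Page(0.4, 2), Page(1.0, 2), Page(1.6, 2)]
physical = list(multiplex(audio, video))
```

<p>The merge only compares the absolute times, so it would apply
equally to start-time and end-time encoded streams once each granule
position has been mapped to a time by its codec.</p>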

<h3>Seeking</h3>

<p>Ogg is designed to use a bisection search to implement exact
positional seeking rather than building an index; an index requires
two-pass encoding and as such is not acceptable given the requirement
for full-featured linear encoding.</p>

<p><i>Even making an index optional then requires an
application to support multiple methods (bisection search for a
one-pass stream, indexing for a two-pass stream), which adds no
additional functionality as bisection search delivers the same
functionality for both stream types.</i></p>

<p>Seek operations are by absolute time; a direct bisection search must
find the exact time position requested. Information in the Ogg
bitstream is arranged such that all information to be presented for
playback from the desired seek point will occur at or after the
desired seek point. Seek operations are neither 'fuzzy' nor
heuristic.</p>
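
<p>The bisection idea can be sketched in Python as follows. This is a
simplified model under stated assumptions, not the algorithm of any
particular Ogg implementation: the <tt>pages</tt> list of (byte offset,
absolute time) pairs and the <tt>seek</tt> function are hypothetical,
and a real decoder would discover pages by scanning forward from a
guessed byte offset rather than holding them in a list.</p>

```python
from typing import List, Tuple

# Hypothetical page table: (byte offset, absolute time in seconds) for
# each page, in file order. It stands in for "read the next page found
# at or after this byte offset" in a real, indexless decoder.
Pages = List[Tuple[int, float]]


def seek(pages: Pages, target: float) -> int:
    """Return the byte offset of the first page whose time is >= target.

    Assumes a non-empty page list sorted by time. Falls back to the
    last page if the target lies past the end of the stream.
    """
    lo, hi = 0, len(pages) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if pages[mid][1] < target:
            lo = mid + 1   # target lies strictly after this page
        else:
            hi = mid       # this page is a candidate; keep looking left
    return pages[lo][0]


pages = [(0, 0.0), (4096, 0.5), (8192, 1.0), (12288, 1.5), (16384, 2.0)]
offset = seek(pages, 1.2)
```

<p>Because pages are ordered by time, each probe halves the remaining
search range, so the number of probes grows with the logarithm of the
stream length.</p>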

<p><i>Although key frame handling in video appears to be an exception to
"all needed playback information lies ahead of a given seek",
key frames can still be handled directly within this indexless
framework. Seeking to a key frame in video (as well as seeking in other
media types with analogous constraints) is handled as two seeks; first
a seek to the desired time which extracts state information that
decodes to the time of the last key frame, followed by a second seek
directly to the key frame. The location of the previous key frame is
embedded as state information in the granulepos; this mechanism is
described in more detail later.</i></p>

<h3>Continuous and Discontinuous Streams</h3>

<p>Logical streams within a physical Ogg stream belong to one of two
categories, "Continuous" streams and "Discontinuous" streams.
Although these are discussed in more detail later, the distinction is
important to a high-level understanding of how to buffer an Ogg
stream.</p>

<p>A stream that provides a gapless, time-continuous media type with a
fine-grained timebase is considered to be 'Continuous'. A continuous
stream should never be starved of data. Clear examples of continuous
data types include broadcast audio and video.</p>

<p>A stream that delivers data in a potentially irregular pattern or with
widely spaced timing gaps is considered to be 'Discontinuous'. A
discontinuous stream may be best thought of as data representing
scattered events; although they happen in order, they are typically
unconnected data often located far apart. One possible example of a
discontinuous stream type would be captioning. Although it's
possible to design captions as a continuous stream type, it's most
natural to think of captions as widely spaced pieces of text with
little happening between.</p>

<p>The fundamental design distinction between continuous and
discontinuous streams concerns buffering.</p>

<h3>Buffering</h3>

<p>Because a continuous stream is, by definition, gapless, Ogg buffering
is based on the simple premise of never allowing any active continuous
stream to starve for data during decode; buffering proceeds ahead
until all continuous streams in a physical stream have data ready to
decode on demand.</p>

<p>Discontinuous stream data may occur on a fairly regular basis, but the
timing of, for example, a specific caption is impossible to predict
with certainty in most captioning systems. Thus the buffering system
should take discontinuous data 'as it comes' rather than working ahead
(for a potentially unbounded period) to look for future discontinuous
data. As such, discontinuous streams are ignored when managing
buffering; their pages simply 'fall out' of the stream when continuous
streams are handled properly.</p>

<p>Buffering requirements need not be explicitly declared or managed for
the encoded stream; the decoder simply reads as much data as is
necessary to keep all continuous stream types gapless (also ensuring
discontinuous data arrives in time) and no more, resulting in optimum
implicit buffer usage for a given stream. Because all pages of all
data types are stamped with absolute timing information within the
stream, inter-stream synchronization timing is always explicitly
maintained without the need for explicitly declared buffer-ahead
hinting.</p>
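
<p>One way to picture this buffering rule is the Python sketch below.
It is only an illustration under assumptions: the (serial number,
payload) page tuples, the <tt>fill_buffers</tt> function and the stream
numbering are hypothetical, and the caller is assumed to know which
serial numbers belong to continuous streams.</p>

```python
from collections import deque
from typing import Deque, Dict, Iterator, Set, Tuple

# Hypothetical page representation: (serial number, payload).
PageT = Tuple[int, bytes]


def fill_buffers(pages: Iterator[PageT],
                 continuous: Set[int],
                 buffers: Dict[int, Deque[bytes]]) -> None:
    """Read ahead only until every continuous stream has data queued.

    buffers must already hold an entry for each continuous stream.
    Pages of discontinuous streams are queued as they are encountered;
    they never cause additional read-ahead.
    """
    while any(not buffers[s] for s in continuous):
        try:
            serialno, payload = next(pages)
        except StopIteration:
            break  # the physical stream ended
        buffers.setdefault(serialno, deque()).append(payload)


# Streams 1 (audio) and 2 (video) are continuous; 3 (captions) is not.
source = iter([(1, b"a0"), (3, b"c0"), (1, b"a1"), (2, b"v0"), (1, b"a2")])
buffers = {1: deque(), 2: deque(), 3: deque()}
fill_buffers(source, {1, 2}, buffers)
```

<p>In this run, reading stops as soon as the first video page arrives.
The caption page is picked up along the way without being waited for,
and the final audio page is left unread.</p>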

<p>Further details, mechanisms and reasons for the differing arrangement
and behavior of continuous and discontinuous streams are discussed
later.</p>

<h3>Whole-stream navigation</h3>

<p>Ogg is designed so that the simplest navigation operations treat the
physical Ogg stream as a whole summary of its streams, rather than
navigating each interleaved stream as a separate entity.</p>

<p>First Example: seeking to a desired time position in a multiplexed (or
unmultiplexed) Ogg stream can be accomplished through a bisection
search on time position of all pages in the stream (as encoded in the
granule position). More powerful searches (such as a key frame-aware
seek within video) are also possible with additional search
complexity, but similar computational complexity.</p>

<p>Second Example: A bitstream section may consist of three multiplexed
streams of differing lengths. The result of multiplexing these
streams should be thought of as a single mixed stream with a length
equal to the longest of the three component streams. Although it is
also possible to think of the multiplexed results as three concurrent
streams of different lengths and it is possible to recover the three
original streams, it will also become obvious that once multiplexed,
it isn't possible to find the internal lengths of the component
streams without a linear search of the whole bitstream section.
However, it is possible to find the length of the whole bitstream
section easily (in near-constant time per section) just as it is for a
single-media unmultiplexed stream.</p>

<h2>Granule Position</h2>

<h3>Description</h3>

<p>The Granule Position is a signed 64-bit field appearing in the header
of every Ogg page. Although the granule position represents absolute
time within a logical stream, its value does not necessarily directly
encode a simple timestamp. It may represent frames elapsed (as in
Vorbis), a simple timestamp, or a more complex bit-division encoding
(such as in Theora). The exact encoding of the granule position is up
to a specific codec.</p>

<p>The granule position is governed by the following rules:</p>

<ul>

<li>Granule Position must always increase forward or remain equal from
page to page, be unset, or be zero for a header page. The absolute
time to which any correct sequence of granule position maps must
similarly always increase forward or remain equal. <i>(A codec may
make use of data, such as a control sequence, that only affects codec
working state without producing data, and thus without advancing
granule position and time. Although the packet sequence number
increases in this case, the granule position, and thus the time
position, do not.)</i></li>

<li>Granule position may only be unset if there is no packet defining a
time boundary on the page (that is, if no packet in a continuous
stream ends on the page, or no packet in a discontinuous stream begins
on the page). This will be discussed in more detail under Continuous
and Discontinuous streams.</li>

<li>A codec must be able to translate a given granule position value
to a unique, deterministic absolute time value through direct
calculation. A codec is not required to be able to translate an
absolute time value into a unique granule position value.</li>

<li>Codecs shall choose a granule position definition that gives the
codec a means to seek as directly as possible to an immediately
decodable point; for example, the bit-divided granule position
encoding of Theora allows the codec to seek efficiently to a key frame
without using an index. That is, additional information other than
absolute time may be encoded into a granule position value so long as
the granule position obeys the above points.</li>

</ul>

<h4>Example: timestamp</h4>

<p>In general, a codec/stream type should choose the simplest granule
position encoding that addresses its requirements. The examples here
are by no means exhaustive of the possibilities within Ogg.</p>

<p>A simple granule position could encode a timestamp directly. For
example, a granule position that encoded milliseconds from beginning
of stream would allow a logical stream length of over 100,000,000,000
days before beginning a new logical stream (to avoid the granule
position wrapping).</p>
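
<p>The figure above can be checked with a few lines of arithmetic. The
Python sketch below assumes only that the granule position is a signed
64-bit value counting milliseconds; the constant names are made up for
this example.</p>

```python
MAX_GRANULEPOS = 2**63 - 1          # largest value of a signed 64-bit field
MS_PER_DAY = 24 * 60 * 60 * 1000    # milliseconds in one day

# Whole days that fit before a millisecond counter would wrap.
max_days = MAX_GRANULEPOS // MS_PER_DAY
print(max_days)
```

<p>The result is a little over 100,000,000,000 days, consistent with
the estimate in the text.</p>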

<h4>Example: framestamp</h4>

<p>A simple millisecond timestamp granule encoding might suit many stream
types, but a millisecond resolution is inappropriate to, e.g., most
audio encodings where exact single-sample resolution is generally a
requirement. A millisecond is both too large a granule and often does
not represent an integer number of samples.</p>

<p>In the event that audio frames are always encoded as the same number of
samples, the granule position could simply be a linear count of frames
since beginning of stream. This has the advantages of being exact and
efficient. Position in time would simply be <tt>[granule_position] *
[samples_per_frame] / [samples_per_second]</tt>.</p>
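
<p>As a small worked example of this formula, the Python sketch below
uses exact rational arithmetic so that no rounding is introduced. The
function name and the frame size and sample rate are assumptions chosen
for illustration, not values taken from any codec.</p>

```python
from fractions import Fraction


def frame_granule_to_seconds(granule_position: int,
                             samples_per_frame: int,
                             samples_per_second: int) -> Fraction:
    """Map a frame-count granule position to exact absolute time."""
    return Fraction(granule_position * samples_per_frame, samples_per_second)


# Assumed example values: 1024-sample frames at 48000 samples per second.
t = frame_granule_to_seconds(375, 1024, 48000)
print(t)
```

<p>With these values, 375 frames correspond to exactly 8 seconds, and a
single frame lasts 8/375 of a second, which a millisecond timestamp
could not represent exactly.</p>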

<h4>Example: samplestamp (Vorbis)</h4>

<p>Frame counting is insufficient in codecs such as Vorbis where an audio
frame [packet] encodes a variable number of samples. In Vorbis's
case, the granule position is a count of the number of raw samples
from the beginning of stream; the absolute time of
a granule position is <tt>[granule_position] /
[samples_per_second]</tt>.</p>
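
<p>A sample-count granule position can be illustrated in the same way.
In the Python sketch below, the per-packet sample counts and the sample
rate are hypothetical and are not derived from real Vorbis block sizes;
the point is only that a running sample total stays exact when packets
differ in length.</p>

```python
from fractions import Fraction
from itertools import accumulate


def sample_granule_to_seconds(granule_position: int,
                              samples_per_second: int) -> Fraction:
    """Map a sample-count granule position to exact absolute time."""
    return Fraction(granule_position, samples_per_second)


# Hypothetical per-packet sample counts; a frame count cannot describe
# these because the packets do not all carry the same number of samples.
packet_samples = [1024, 128, 576, 1024]

# Running total of samples after each packet, i.e. the granule position
# that would apply if a page ended on that packet.
granule_positions = list(accumulate(packet_samples))

# Assumed sample rate of 44100 samples per second.
end_time = sample_granule_to_seconds(granule_positions[-1], 44100)
```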

<h4>Example: bit-divided framestamp (Theora)</h4>

<p>Some video codecs may be able to use the simple framestamp scheme for
granule position. However, most modern video codecs introduce at
least the following complications:</p>

<ul>

<li>Video frames are relatively far apart compared to audio samples;
for this reason, the point at which a video frame changes to the next
frame is usually a strictly defined offset within the frame 'period'.
That is, video at 50fps could just as easily define frame transitions
&lt;.015, .035, .055...&gt; as at &lt;.00, .02, .04...&gt;.</li>

<li>Frame rates often include drop-frames, leap-frames or other
rational-but-non-integer timings.</li>

<li>Decode must begin at a 'key frame' or 'I frame'. Key frames usually
occur relatively infrequently.</li>

</ul>

<p>The first two points can be handled straightforwardly via the fact
that the codec has complete control over mapping granule position to
absolute time; non-integer frame rates and offsets can be set in the
codec's initial header, and the rest is just arithmetic.</p>
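
<p>The arithmetic for the first two points might look like the Python
sketch below. The function and parameter names are assumptions made for
this example, and expressing the offset as a fraction of one frame
period is just one possible convention; a real codec would define its
own header fields.</p>

```python
from fractions import Fraction


def video_granule_to_seconds(frame_count: int,
                             frames_per_second: Fraction,
                             frame_offset: Fraction = Fraction(0)) -> Fraction:
    """Map a frame count to absolute time.

    frame_offset is the position of the frame transition inside the
    frame period, expressed as a fraction of one frame.
    """
    return (frame_count + frame_offset) / frames_per_second


# 50 fps with transitions three quarters of the way into each period
# reproduces the .015, .035, .055... sequence from the list above.
offset_times = [video_granule_to_seconds(n, Fraction(50), Fraction(3, 4))
                for n in range(3)]

# A rational but non-integer rate: 30000/1001 frames per second.
rational_time = video_granule_to_seconds(30000, Fraction(30000, 1001))
```

<p>Keeping the frame rate as a ratio of integers means that 30000
frames at 30000/1001 frames per second map to exactly 1001 seconds,
with no accumulated rounding error.</p>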

<p>The third point appears trickier at first glance, but it too can be
handled through the granule position mapping mechanism. Here we
arrange the granule position in such a way that granule positions of
key frames are easy to find. Divide the granule position into two
fields; the most-significant bits are an absolute frame counter, but
it's only updated at each key frame. The least significant bits encode
the number of frames since the last key frame. In this way, each
granule position encodes both the absolute time of the last key frame
and (as the sum of the two fields) the absolute time of the current frame.</p>
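
<p>A bit-divided granule position of this kind can be sketched in
Python as shown below. The 6-bit width of the low field
(<tt>KEYFRAME_SHIFT</tt>) is an assumption chosen for illustration; a
real codec declares its own field width, and the helper names are
hypothetical.</p>

```python
from typing import Tuple

KEYFRAME_SHIFT = 6  # assumed width of the low field, for illustration only
LOW_MASK = (1 << KEYFRAME_SHIFT) - 1


def pack(keyframe_number: int, frames_since_keyframe: int) -> int:
    """Combine the two fields into a single granule position."""
    if not 0 <= frames_since_keyframe <= LOW_MASK:
        raise ValueError("offset does not fit in the low field")
    return (keyframe_number << KEYFRAME_SHIFT) | frames_since_keyframe


def unpack(granulepos: int) -> Tuple[int, int]:
    """Split a granule position into (key frame number, offset)."""
    return granulepos >> KEYFRAME_SHIFT, granulepos & LOW_MASK


def frame_number(granulepos: int) -> int:
    """Absolute frame number: last key frame plus frames since it."""
    keyframe, offset = unpack(granulepos)
    return keyframe + offset


# Frame 127, whose most recent key frame was frame 120.
gp = pack(120, 7)
```

<p>Because the key frame counter occupies the most significant bits,
granule positions built this way still never decrease from one frame to
the next, as the granule position rules require.</p>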

<p>Seeking to the most recent preceding key frame is then accomplished by
first seeking to the original desired point, inspecting the granulepos
of the resulting video page, extracting from that granulepos the
absolute time of the desired key frame, and then seeking directly to
that key frame's page. Of course, it's still possible for an
application to ignore key frames and use a simpler seeking algorithm
(decode would be unable to present decoded video until the next
key frame). Surprisingly many player applications do choose the
simpler approach.</p>
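
<p>The two-step seek can be sketched in Python as follows. This is a
simplified model, not a description of any player: it assumes one video
frame per page, a hypothetical 6-bit low field, a target frame that
lies within the stream, and a list of granule positions standing in for
pages that a real decoder would locate by bisecting over byte
offsets.</p>

```python
from typing import List

SHIFT = 6  # assumed width of the "frames since key frame" field
MASK = (1 << SHIFT) - 1


def frame_of(granulepos: int) -> int:
    """Absolute frame number encoded by a bit-divided granule position."""
    return (granulepos >> SHIFT) + (granulepos & MASK)


def seek_to_frame(granules: List[int], frame: int) -> int:
    """Bisect for the index of the first page at or after the frame."""
    lo, hi = 0, len(granules)
    while lo < hi:
        mid = (lo + hi) // 2
        if frame_of(granules[mid]) < frame:
            lo = mid + 1
        else:
            hi = mid
    return lo


def seek_to_keyframe(granules: List[int], frame: int) -> int:
    """Seek to the desired frame, then to its preceding key frame."""
    first = seek_to_frame(granules, frame)     # first seek: desired point
    keyframe = granules[first] >> SHIFT        # key frame number from granulepos
    return seek_to_frame(granules, keyframe)   # second seek: the key frame


# Made-up stream of 12 frames with a key frame every 4 frames.
granules = [((f - f % 4) << SHIFT) | (f % 4) for f in range(12)]
index = seek_to_keyframe(granules, 6)
```

<p>Both steps are ordinary bisection searches, so finding the key frame
costs roughly twice as many probes as a plain seek and requires no
index.</p>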

<h3>Granule position, packets and pages</h3>

<p>Although each packet of data in a logical stream theoretically has a
specific granule position, only one granule position is encoded
per page. It is possible to encode a logical stream such that each
page contains only a single packet (so that granule positions are
preserved for each packet); however, a one-to-one packet/page mapping
is not intended to be the general case.</p>

<p>Because Ogg functions at the page, not packet, level, this
once-per-page time information provides Ogg with the finest-grained
time information it can use. Ogg passes this granule positioning data
to the codec (along with the packets extracted from a page); it is the
responsibility of codecs to track timing information at granularities
finer than a single page.</p>

<h3>Start-time and end-time positioning</h3>

<p>A granule position represents the <em>instantaneous time location
between two pages</em>. However, continuous streams and discontinuous
streams differ on whether the granulepos represents the end-time of
the data on a page or the start-time. Continuous streams are
'end-time' encoded; the granulepos represents the point in time
immediately after the last data decoded from a page. Discontinuous
streams are 'start-time' encoded; the granulepos represents the point
in time of the first data decoded from the page.</p>

<p>An Ogg stream type is declared continuous or discontinuous by its
codec. A given codec may support both continuous and discontinuous
operation so long as any given logical stream is continuous or
discontinuous for its entirety and the codec is able to ascertain (and
inform the Ogg layer) which applies after decoding the initial stream
header. The majority of codecs will always be continuous (such as
Vorbis) or discontinuous (such as Writ).</p>

<p>Start- and end-time encoding do not affect multiplexing sort-order;
pages are still sorted by the absolute time a given granulepos maps to
regardless of whether that granulepos represents start- or
end-time.</p>

<h2>Multiplex/Demultiplex Division of Labor</h2>

<p>The Ogg multiplex/demultiplex layer provides mechanisms for encoding
raw packets into Ogg pages, decoding Ogg pages back into the original
codec packets, determining the logical structure of an Ogg stream, and
navigating through and synchronizing with an Ogg stream at a desired
stream location. Strict multiplex/demultiplex operations are entirely
in the Ogg domain and require no intervention from codecs.</p>

<p>Implementation of more complex operations does require codec
knowledge, however. Unlike other framing systems, Ogg maintains
strict separation between framing and the framed bitstream data; Ogg
does not replicate codec-specific information in the page/framing
data, nor does Ogg blur the line between framing and stream
data/metadata. Because Ogg is fully data-agnostic toward the data it
frames, operations which require specifics of bitstream data (such as
'seek to key frame') also require interaction with the codec layer
(because, in this example, the Ogg layer is not aware of the concept
of key frames). This is different from systems that blur the
separation between framing and stream data in order to simplify the
separation of code. The Ogg system purposely keeps the distinction in
data simple so that later codec innovations are not constrained by
framing design.</p>

<p>For this reason, however, complex seeking operations require
interaction with the codecs in order to decode the granule position of
a given stream type back to absolute time or in order to find
'decodable points' such as key frames in video.</p>

<h2>Unsorted Discussion Points</h2>

<p>flushes around key frames? RFC suggestion: repaginating or building a
stream this way is nice but not required</p>

<h2>Appendix A: multiplexing examples</h2>

<div id="copyright">
  The Xiph Fish Logo is a
  trademark (&trade;) of Xiph.Org.<br/>

  These pages &copy; 1994 - 2005 Xiph.Org. All rights reserved.
</div>

</body>
</html>