NAME

bzip2, bunzip2 - a block-sorting file compressor, v1.0.8
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files

SYNOPSIS

bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
bunzip2 [ -fkvsVL ] [ filenames ... ]
bzcat [ -s ] [ filenames ... ]
bzip2recover filename

DESCRIPTION

bzip2 compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more
conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.

The command-line options are deliberately very similar to those
of GNU gzip, but they are not identical.

bzip2 expects a list of file names to accompany the command-line
flags. Each file is replaced by a compressed version of itself,
with the name "original_name.bz2". Each compressed file has the
same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these
properties can be correctly restored at decompression time. File
name handling is naive in the sense that there is no mechanism
for preserving original file names, permissions, ownerships or
dates in filesystems which lack these concepts, or have serious
file name length restrictions, such as MS-DOS.

bzip2 and bunzip2 will by default not overwrite existing files.
If you want this to happen, specify the -f flag.

If no file names are specified, bzip2 compresses from standard
input to standard output. In this case, bzip2 will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

bunzip2 (or bzip2 -d) decompresses all specified files. Files
which were not created by bzip2 will be detected and ignored,
and a warning issued. bzip2 attempts to guess the filename for
the decompressed file from that of the compressed file as
follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings, .bz2,
.bz, .tbz2 or .tbz, bzip2 complains that it cannot guess the
name of the original file, and uses the original name with .out
appended.
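
The naming rules above can be written as a small lookup. This is only a
sketch of the rules as listed, not bzip2's own code; the names
SUFFIX_RULES and guess_output_name are hypothetical.

```python
# Suffix rules exactly as listed in the text: (ending, replacement).
SUFFIX_RULES = [
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
    (".bz2", ""),
    (".bz", ""),
]


def guess_output_name(compressed_name: str) -> str:
    """Return the output name implied by the rules above (hypothetical helper)."""
    for suffix, replacement in SUFFIX_RULES:
        if compressed_name.endswith(suffix):
            return compressed_name[: -len(suffix)] + replacement
    # No recognised ending: the original name is kept and ".out" is appended.
    return compressed_name + ".out"
```

For instance, guess_output_name("backup.tbz2") gives "backup.tar", while
a name with no recognised ending simply gains ".out".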

As with compression, supplying no filenames causes decompression
from standard input to standard output.

bunzip2 will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (-t) of concatenated compressed files is also supported.
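
The same behaviour can be observed without the command-line tool by
using Python's standard bz2 module, which is built on libbzip2. This is
an illustration under that assumption, not a description of how bunzip2
itself is implemented; bz2.decompress accepts multi-stream input in
Python 3.3 and later.

```python
import bz2

# Two independently compressed "files".
first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")

# Joining the two complete streams gives a valid multi-stream input.
combined = first + second

# Decompression yields the concatenation of the original contents.
restored = bz2.decompress(combined)
```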

You can also compress or decompress files to the standard output
by giving the -c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed
sequentially to stdout. Compression of multiple files in this
manner generates a stream containing multiple compressed file
representations. Such a stream can be decompressed correctly
only by bzip2 version 0.9.0 or later. Earlier versions of bzip2
will stop after decompressing the first file in the stream.

bzcat (or bzip2 -dc) decompresses all specified files to the
standard output.

bzip2 will read arguments from the environment variables BZIP2
and BZIP, in that order, and will process them before any
arguments read from the command line. This gives a convenient
way to supply default arguments.
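
The ordering described above can be modelled as follows. The helper
effective_args is hypothetical, and splitting the variables on
whitespace with no quoting rules is an assumption of this sketch rather
than something the text specifies.

```python
import os


def effective_args(cli_args, environ=None):
    """Hypothetical model: words from BZIP2, then BZIP, then the command line."""
    environ = os.environ if environ is None else environ
    env_args = []
    for name in ("BZIP2", "BZIP"):
        # Assumption: each variable is split on whitespace, without quoting.
        env_args.extend(environ.get(name, "").split())
    return env_args + list(cli_args)
```

With BZIP2 set to "-9 -v" and a command line of "-k file", the modelled
argument order is -9, -v, -k, file.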

Compression is always performed, even if the compressed file is
slightly larger than the original. Files of less than about one
hundred bytes tend to get larger, since the compression
mechanism has a constant overhead in the region of 50 bytes.
Random data (including the output of most file compressors) is
coded at about 8.05 bits per byte, giving an expansion of around
0.5%.
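
Both effects are easy to observe with Python's bz2 module (a libbzip2
binding, used here as a stand-in for the command-line tool). The exact
sizes printed depend on the input and are not quoted here; the sketch
only shows how to measure them.

```python
import bz2
import os

# A very small input: the fixed overhead dominates.
tiny = b"hello"
tiny_out = bz2.compress(tiny)

# Random bytes are essentially incompressible, so the output grows slightly.
random_in = os.urandom(200_000)
random_out = bz2.compress(random_in)

print(len(tiny), "->", len(tiny_out))
growth = len(random_out) / len(random_in) - 1
print(len(random_in), "->", len(random_out), f"({growth:+.2%})")
```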

As a self-check for your protection, bzip2 uses 32-bit CRCs to
make sure that the decompressed version of a file is identical
to the original. This guards against corruption of the
compressed data, and against undetected bugs in bzip2 (hopefully
very unlikely). The chance of data corruption going undetected
is microscopic, about one chance in four billion for each file
processed. Be aware, though, that the check occurs upon
decompression, so it can only tell you that something is wrong.
It can't help you recover the original uncompressed data. You
can use bzip2recover to try to recover data from damaged files.
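
A trial decompression is enough to see the check at work. The sketch
below uses Python's bz2 module rather than the bzip2 command, and the
helper is_intact is hypothetical. It also assumes that bytes 10 to 13 of
a stream hold the first block's stored CRC, which comes from the bzip2
file layout and is not stated in this page.

```python
import bz2


def is_intact(compressed: bytes) -> bool:
    """Trial-decompress the data and report whether libbzip2 accepted it."""
    try:
        bz2.decompress(compressed)
    except (OSError, EOFError, ValueError):
        return False
    return True


good = bz2.compress(b"some data worth protecting " * 100)

# Assumption: byte 10 is part of the first block's stored CRC, so flipping
# one bit there leaves the block decodable but makes the CRC check fail.
damaged = bytearray(good)
damaged[10] ^= 0x01
damaged = bytes(damaged)

print(is_intact(good), is_intact(damaged))
```

As the text says, the failed check only reports that something is wrong;
nothing in it helps reconstruct the original bytes.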

Return values: 0 for a normal exit, 1 for environmental problems
(file not found, invalid flags, I/O errors, etc.), 2 to indicate
a corrupt compressed file, 3 for an internal consistency error
(e.g., a bug) which caused bzip2 to panic.
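
A script that wraps bzip2 might translate these statuses into messages.
The table and the helper describe_exit below are hypothetical and simply
restate the list above.

```python
# Exit statuses as documented above.
EXIT_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error (bzip2 panicked)",
}


def describe_exit(code: int) -> str:
    """Hypothetical helper: map a bzip2 exit status to a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit status {code}")
```

The status itself could come from, for example, the returncode attribute
of the object returned by subprocess.run.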

OPTIONS

-c --stdout
       Compress or decompress to standard output.

-d --decompress
       Force decompression. bzip2, bunzip2 and bzcat are really
       the same program, and the decision about what actions to
       take is done on the basis of which name is used. This
       flag overrides that mechanism, and forces bzip2 to
       decompress.

-z --compress
       The complement to -d: forces compression, regardless of
       the invocation name.

-t --test
       Check integrity of the specified file(s), but don't
       decompress them. This really performs a trial
       decompression and throws away the result.

-f --force
       Force overwrite of output files. Normally, bzip2 will not
       overwrite existing output files. Also forces bzip2 to
       break hard links to files, which it otherwise wouldn't
       do.

       bzip2 normally declines to decompress files which don't
       have the correct magic header bytes. If forced (-f),
       however, it will pass such files through unmodified. This
       is how GNU gzip behaves.

-k --keep
       Keep (don't delete) input files during compression or
       decompression.

-s --small
       Reduce memory usage, for compression, decompression and
       testing. Files are decompressed and tested using a
       modified algorithm which only requires 2.5 bytes per
       block byte. This means any file can be decompressed in
       2300k of memory, albeit at about half the normal speed.

       During compression, -s selects a block size of 200k,
       which limits memory use to around the same figure, at the
       expense of your compression ratio. In short, if your
       machine is low on memory (8 megabytes or less), use -s
       for everything. See MEMORY MANAGEMENT below.

-q --quiet
       Suppress non-essential warning messages. Messages
       pertaining to I/O errors and other critical events will
       not be suppressed.

-v --verbose
       Verbose mode -- show the compression ratio for each file
       processed. Further -v's increase the verbosity level,
       spewing out lots of information which is primarily of
       interest for diagnostic purposes.

-L --license -V --version
       Display the software version, license terms and
       conditions.

-1 (or --fast) to -9 (or --best)
       Set the block size to 100 k, 200 k .. 900 k when
       compressing. Has no effect when decompressing. See MEMORY
       MANAGEMENT below. The --fast and --best aliases are
       primarily for GNU gzip compatibility. In particular,
       --fast doesn't make things significantly faster. And
       --best merely selects the default behaviour.

--     Treats all subsequent arguments as file names, even if
       they start with a dash. This is so you can handle files
       with names beginning with a dash, for example:
       bzip2 -- -myfilename.

--repetitive-fast --repetitive-best
       These flags are redundant in versions 0.9.5 and above.
       They provided some coarse control over the behaviour of
       the sorting algorithm in earlier versions, which was
       sometimes useful. 0.9.5 and above have an improved
       algorithm which renders these flags irrelevant.

MEMORY MANAGEMENT

bzip2 compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory
needed for compression and decompression. The flags -1 through
-9 specify the block size to be 100,000 bytes through 900,000
bytes (the default) respectively. At decompression time, the
block size used for compression is read from the header of the
compressed file, and bunzip2 then allocates itself just enough
memory to decompress the file. Since block sizes are stored in
compressed files, it follows that the flags -1 to -9 are
irrelevant to and so ignored during decompression.
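
The stored block size can be read back from a compressed stream. The
sketch below assumes the stream begins with the three bytes "BZh"
followed by an ASCII digit 1 to 9 giving the block size in units of
100,000 bytes; that layout comes from the bzip2 file format and is not
spelled out in this page. The helper block_size_from_header is
hypothetical, and Python's bz2 module stands in for the bzip2 command.

```python
import bz2


def block_size_from_header(compressed: bytes) -> int:
    """Return the block size in bytes recorded in a bzip2 stream header."""
    # Assumed layout: b"BZh" followed by one ASCII digit, '1' through '9'.
    if compressed[:3] != b"BZh":
        raise ValueError("not a bzip2 stream")
    level = compressed[3] - ord("0")
    if not 1 <= level <= 9:
        raise ValueError("invalid block-size digit")
    return level * 100_000


small = bz2.compress(b"example", compresslevel=1)   # comparable to -1
large = bz2.compress(b"example", compresslevel=9)   # comparable to -9
```

A decompressor needs no flag to learn the block size, because the value
travels with the data.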

Compression and decompression requirements, in bytes, can be
estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )
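
The estimates above translate directly into code. In this sketch the
function names are hypothetical, results are in units of k, and the
block size for flag -N is taken to be N x 100k as described earlier in
this section. The 2.5 multiplier is the reduced-memory mode selected by
-s.

```python
def compress_kbytes(level: int) -> float:
    """Estimated compression memory in k for flags -1 .. -9."""
    return 400 + 8 * (100 * level)


def decompress_kbytes(level: int, small: bool = False) -> float:
    """Estimated decompression memory in k; small=True models the -s flag."""
    factor = 2.5 if small else 4
    return 100 + factor * (100 * level)
```

For the default -9 this gives 7600k to compress, 3700k to decompress,
and 2350k to decompress with -s.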

Larger block sizes give rapidly diminishing marginal returns.
Most of the compression comes from the first two or three
hundred k of block size, a fact worth bearing in mind when using
bzip2 on small machines. It is also important to appreciate that
the decompression memory requirement is set at compression time
by the choice of block size.

For files compressed with the default 900k block size, bunzip2
will require about 3700 kbytes to decompress. To support
decompression of any file on a 4 megabyte machine, bunzip2 has
an option to decompress using approximately half this amount of
memory, about 2300 kbytes. Decompression speed is also halved,
so you should use this option only where necessary. The relevant
flag is -s.

In general, try to use the largest block size memory constraints
allow, since that maximises the compression achieved.
Compression and decompression speed are virtually unaffected by
block size.

Another significant point applies to files which fit in a single
block -- that means most files you'd encounter using a large
block size. The amount of real memory touched is proportional to
the size of the file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
-9 will cause the compressor to allocate around 7600k of memory,
but only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly,
the decompressor will allocate 3700k but only touch 100k + 20000
* 4 = 180 kbytes.
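
The worked example can be checked with a few lines of arithmetic. The
helper touched_kbytes is hypothetical, and the sketch assumes that "k"
means 1000 bytes, which is what makes 20000 * 8 come out as 160k.

```python
def touched_kbytes(file_bytes, level, fixed_k, bytes_per_block_byte):
    """Estimate the memory actually touched, in k (assuming k = 1000 bytes)."""
    block_bytes = 100_000 * level
    # Only the part of the block that the file occupies is touched.
    used = min(file_bytes, block_bytes)
    return fixed_k + used * bytes_per_block_byte / 1000


compress_touch = touched_kbytes(20_000, 9, fixed_k=400, bytes_per_block_byte=8)
decompress_touch = touched_kbytes(20_000, 9, fixed_k=100, bytes_per_block_byte=4)
```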

Here is a table which summarises the maximum memory usage for
different block sizes. Also recorded is the total compressed
size for 14 files of the Calgary Text Compression Corpus
totalling 3,141,622 bytes. This column gives some feel for how
compression varies with block size. These figures tend to
understate the advantage of larger block sizes for larger files,
since the Corpus is dominated by smaller files.

                  Compress   Decompress   Decompress   Corpus
           Flag     usage      usage       -s usage     Size

            -1      1200k       500k         350k      914704
            -2      2000k       900k         600k      877703
            -3      2800k      1300k         850k      860338
            -4      3600k      1700k        1100k      846899
            -5      4400k      2100k        1350k      845160
            -6      5200k      2500k        1600k      838626
            -7      6100k      2900k        1850k      834096
            -8      6800k      3300k        2100k      828642
            -9      7600k      3700k        2350k      828642

RECOVERING DATA FROM DAMAGED FILES

bzip2 compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error
causes a multi-block .bz2 file to become damaged, it may be
possible to recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a
48-bit pattern, which makes it possible to find the block
boundaries with reasonable certainty. Each block also carries
its own 32-bit CRC, so damaged blocks can be distinguished from
undamaged ones.
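
Locating block boundaries amounts to searching the bit stream for the
delimiter. The sketch below assumes the 48-bit pattern is
0x314159265359, a value taken from the bzip2 file format rather than
from this page, and searches at bit granularity because blocks need not
start on a byte boundary. It illustrates the idea only and is not the
code of bzip2recover; find_block_bit_offsets is a hypothetical name.

```python
import bz2
import os

# Assumed 48-bit block delimiter, rendered as a string of 48 binary digits.
BLOCK_MAGIC = format(0x314159265359, "048b")


def find_block_bit_offsets(data: bytes) -> list:
    """Return the bit offset of every block delimiter found in data."""
    # Render the whole input as one binary string, keeping leading zeros.
    bits = format(int.from_bytes(data, "big"), f"0{len(data) * 8}b")
    offsets = []
    pos = bits.find(BLOCK_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = bits.find(BLOCK_MAGIC, pos + 1)
    return offsets


# A short input fits in a single block.
single = find_block_bit_offsets(bz2.compress(b"hello world"))

# 150,000 random bytes with a 100,000-byte block size need several blocks.
multi = find_block_bit_offsets(bz2.compress(os.urandom(150_000), compresslevel=1))
```

A 48-bit pattern can in principle occur by chance inside compressed
data, which is why the text speaks of "reasonable certainty" and why the
per-block CRC is still needed to confirm each candidate block.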

bzip2recover is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use bzip2 -t to test the integrity of the
resulting files, and decompress those which are undamaged.

bzip2recover takes a single argument, the name of the damaged
file, and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc., containing the extracted blocks. The
output filenames are designed so that the use of wildcards in
subsequent processing -- for example, "bzip2 -dc rec*file.bz2 >
recovered_data" -- processes the files in the correct order.
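
The zero-padded block numbers are what make this work: names of equal
length sort lexicographically in the same order as their numbers. The
sketch below assumes the shell expands wildcards in plain lexicographic
order, which can depend on the locale, and uses made-up block numbers.

```python
block_numbers = [10, 2, 1, 100]

# Zero-padded names in the style shown above.
padded = sorted(f"rec{n:05d}file.bz2" for n in block_numbers)

# The same names without padding, for contrast.
unpadded = sorted(f"rec{n}file.bz2" for n in block_numbers)

print(padded)    # numeric order is preserved
print(unpadded)  # block 100 sorts before block 2
```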

bzip2recover should be of most use dealing with large .bz2
files, as these will contain many blocks. It is clearly futile
to use it on damaged single-block files, since a damaged block
cannot be recovered. If you wish to minimise any potential data
loss through media or transmission errors, you might consider
compressing with a smaller block size.

PERFORMANCE NOTES

The sorting phase of compression gathers together similar
strings in the file. Because of this, files containing very long
runs of repeated symbols, like "aabaabaabaab ..." (repeated
several hundred times) may compress more slowly than normal.
Versions 0.9.5 and above fare much better than previous versions
in this respect. The ratio between worst-case and average-case
compression time is in the region of 10:1. For previous
versions, this figure was more like 100:1. You can use the -vvvv
option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

bzip2 usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion.
This means that performance, both for compressing and
decompressing, is largely determined by the speed at which your
machine can service cache misses. Because of this, small changes
to the code to reduce the miss rate have been observed to give
disproportionately large performance improvements. I imagine
bzip2 will perform best on machines with very large caches.

CAVEATS

I/O error messages are not as helpful as they could be. bzip2
tries hard to detect I/O errors and exit cleanly, but the
details of what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.8 of bzip2. Compressed
data created by this version is entirely forwards and backwards
compatible with the previous public releases, versions 0.1pl2,
0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the
following exception: 0.9.0 and above can correctly decompress
multiple concatenated compressed files. 0.1pl2 cannot do this;
it will stop after decompressing just the first file in the
stream.

bzip2recover versions prior to 1.0.2 used 32-bit integers to
represent bit positions in compressed files, so they could not
handle compressed files more than 512 megabytes long. Versions
1.0.2 and above use 64-bit ints on some platforms which support
them (GNU supported targets, and Windows). To establish whether
or not bzip2recover was built with such a limitation, run it
without arguments. In any event you can build yourself an
unlimited version if you can recompile it with MaybeUInt64 set
to be an unsigned 64-bit integer.
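
The 512 megabyte figure follows from counting bit positions. The
arithmetic below assumes an unsigned 32-bit counter and reads "megabyte"
as 2**20 bytes; both are assumptions of this sketch.

```python
# Distinct bit positions addressable with an unsigned 32-bit integer.
bit_positions = 2 ** 32

# Eight bits per byte gives the largest file whose bits can all be indexed.
max_bytes = bit_positions // 8

# Expressed in units of 2**20 bytes.
max_megabytes = max_bytes // 2 ** 20
```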


AUTHOR

Julian Seward, jseward@acm.org

https://sourceware.org/bzip2/

The ideas embodied in bzip2 are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder),
Peter Fenwick (for the structured coding model in the original
bzip, and many refinements), and Alistair Moffat, Radford Neal
and Ian Witten (for the arithmetic coder in the original bzip).
I am much indebted for their help, support and advice. See the
manual in the source distribution for pointers to sources of
documentation. Christian von Roques encouraged me to look for
faster sorting algorithms, so as to speed up compression. Bela
Lubkin encouraged me to improve the worst-case compression
performance. Donna Robinson XMLised the documentation. The bz*
scripts are derived from those of GNU gzip. Many people sent
patches, helped with portability problems, lent machines, gave
advice and were generally helpful.