bzip2(1)                                                  bzip2(1)

NAME
bzip2, bunzip2 - a block-sorting file compressor, v1.0.8
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files

SYNOPSIS
bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
bunzip2 [ -fkvsVL ] [ filenames ... ]
bzcat [ -s ] [ filenames ... ]
bzip2recover filename

DESCRIPTION
bzip2 compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more
conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.

The command-line options are deliberately very similar to those
of GNU gzip, but they are not identical.

bzip2 expects a list of file names to accompany the command-line
flags. Each file is replaced by a compressed version of itself,
with the name "original_name.bz2". Each compressed file has the
same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties
can be correctly restored at decompression time. File name
handling is naive in the sense that there is no mechanism for
preserving original file names, permissions, ownerships or dates
in filesystems which lack these concepts, or have serious file
name length restrictions, such as MS-DOS.

bzip2 and bunzip2 will by default not overwrite existing files.
If you want this to happen, specify the -f flag.

If no file names are specified, bzip2 compresses from standard
input to standard output. In this case, bzip2 will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

bunzip2 (or bzip2 -d) decompresses all specified files. Files
which were not created by bzip2 will be detected and ignored, and
a warning issued. bzip2 attempts to guess the filename for the
decompressed file from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings, .bz2,
.bz, .tbz2 or .tbz, bzip2 complains that it cannot guess the name
of the original file, and uses the original name with .out
appended.
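
The suffix rules above can be sketched in a few lines of Python.
This is only an illustration of the mapping as described here, not
code taken from bzip2; the helper name guess_output_name and the
sample file names are invented for the example.

```python
# Suffix rules as listed in the text: (recognised ending, replacement).
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(compressed_name):
    """Return the output name implied by the rules above (hypothetical helper)."""
    for suffix, replacement in SUFFIX_RULES:
        if compressed_name.endswith(suffix):
            return compressed_name[: -len(suffix)] + replacement
    # Unrecognised ending: keep the whole name and append ".out".
    return compressed_name + ".out"


for name in ("notes.bz2", "notes.bz", "src.tbz2", "src.tbz", "mystery"):
    print(name, "->", guess_output_name(name))
```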

As with compression, supplying no filenames causes decompression
from standard input to standard output.

bunzip2 will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (-t) of concatenated compressed files is also supported.
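
The same behaviour can be observed without the command-line tools.
The sketch below uses the bz2 module from the Python standard
library (a binding to libbzip2, not bunzip2 itself), whose
decompress() function accepts multi-stream input in Python 3.3 and
later. The sample payloads are made up.

```python
import bz2

first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")

# Byte-level concatenation of two complete compressed streams,
# as "cat a.bz2 b.bz2 > both.bz2" would produce.
combined = first + second

# The result is the concatenation of the two original payloads.
restored = bz2.decompress(combined)
print(restored)
```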

You can also compress or decompress files to the standard output
by giving the -c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed
sequentially to stdout. Compression of multiple files in this
manner generates a stream containing multiple compressed file
representations. Such a stream can be decompressed correctly only
by bzip2 version 0.9.0 or later. Earlier versions of bzip2 will
stop after decompressing the first file in the stream.

bzcat (or bzip2 -dc) decompresses all specified files to the
standard output.

bzip2 will read arguments from the environment variables BZIP2
and BZIP, in that order, and will process them before any
arguments read from the command line. This gives a convenient way
to supply default arguments.

Compression is always performed, even if the compressed file is
slightly larger than the original. Files of less than about one
hundred bytes tend to get larger, since the compression mechanism
has a constant overhead in the region of 50 bytes. Random data
(including the output of most file compressors) is coded at about
8.05 bits per byte, giving an expansion of around 0.5%.
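
A quick way to see both effects is to compress a very short input
and a block of pseudo-random bytes. The sketch below uses Python's
standard bz2 module rather than the bzip2 program, so the exact
sizes may differ slightly from those of the command-line tool; the
inputs and the seed are arbitrary choices.

```python
import bz2
import random

# A tiny input: the fixed overhead dominates, so the output is larger.
tiny = b"hello"
tiny_packed = bz2.compress(tiny)
print(len(tiny), "bytes ->", len(tiny_packed), "bytes")

# Pseudo-random bytes (fixed seed so the run is repeatable): these
# cannot be compressed, so the output is slightly larger than the input.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(100_000))
noise_packed = bz2.compress(noise)
print("bits per input byte:", 8 * len(noise_packed) / len(noise))
```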

As a self-check for your protection, bzip2 uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to
the original. This guards against corruption of the compressed
data, and against undetected bugs in bzip2 (hopefully very
unlikely). The chance of data corruption going undetected is
microscopic, about one in four billion for each file processed.
Be aware, though, that the check occurs upon decompression, so it
can only tell you that something is wrong. It can’t help you
recover the original uncompressed data. You can use bzip2recover
to try to recover data from damaged files.
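
The check can be seen in action by damaging a compressed stream on
purpose. The sketch below uses Python's standard bz2 module and
relies on a detail of the .bz2 layout that this page does not
spell out: a 4-byte stream header, a 6-byte block marker, then the
4-byte CRC of the first block at byte offsets 10 to 13. Treat that
offset as an assumption of the example.

```python
import bz2

payload = b"important data\n" * 50
packed = bytearray(bz2.compress(payload))

# Assumed layout: bytes 0-3 stream header, 4-9 block marker,
# 10-13 stored CRC of the first block. Flip one bit of that CRC.
packed[10] ^= 0x01

try:
    bz2.decompress(bytes(packed))
    detected = False
except (OSError, ValueError) as exc:
    # The decompressor reports the mismatch but cannot repair the data.
    detected = True
    print("corruption detected:", exc)
```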

Return values: 0 for a normal exit, 1 for environmental problems
(file not found, invalid flags, I/O errors, etc.), 2 to indicate
a corrupt compressed file, 3 for an internal consistency error
(e.g. a bug) which caused bzip2 to panic.
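
A wrapper script may want to turn these return values into
messages. The following is a minimal sketch; the names
Bzip2Status and describe_status are invented for the example and
are not part of bzip2.

```python
from enum import IntEnum


class Bzip2Status(IntEnum):
    """Return values listed above; member names are hypothetical."""
    OK = 0
    ENVIRONMENT = 1
    CORRUPT_FILE = 2
    INTERNAL_ERROR = 3


MESSAGES = {
    Bzip2Status.OK: "normal exit",
    Bzip2Status.ENVIRONMENT: "environmental problem",
    Bzip2Status.CORRUPT_FILE: "corrupt compressed file",
    Bzip2Status.INTERNAL_ERROR: "internal consistency error",
}


def describe_status(code):
    """Map a bzip2 return value to a short description."""
    try:
        return MESSAGES[Bzip2Status(code)]
    except ValueError:
        # Any value outside 0..3 is not documented above.
        return "unrecognised return value %d" % code


for code in (0, 1, 2, 3):
    print(code, describe_status(code))
```

The code argument would typically be the returncode attribute of
the object returned by subprocess.run().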

OPTIONS
-c --stdout
       Compress or decompress to standard output.

-d --decompress
       Force decompression. bzip2, bunzip2 and bzcat are really
       the same program, and the decision about what actions to
       take is made on the basis of which name is used. This flag
       overrides that mechanism, and forces bzip2 to decompress.

-z --compress
       The complement to -d: forces compression, regardless of
       the invocation name.

-t --test
       Check integrity of the specified file(s), but don’t
       decompress them. This really performs a trial
       decompression and throws away the result.

-f --force
       Force overwrite of output files. Normally, bzip2 will not
       overwrite existing output files. Also forces bzip2 to
       break hard links to files, which it otherwise wouldn’t do.

       bzip2 normally declines to decompress files which don’t
       have the correct magic header bytes. If forced (-f),
       however, it will pass such files through unmodified. This
       is how GNU gzip behaves.

-k --keep
       Keep (don’t delete) input files during compression or
       decompression.

-s --small
       Reduce memory usage, for compression, decompression and
       testing. Files are decompressed and tested using a
       modified algorithm which only requires 2.5 bytes per block
       byte. This means any file can be decompressed in 2300k of
       memory, albeit at about half the normal speed.

       During compression, -s selects a block size of 200k, which
       limits memory use to around the same figure, at the
       expense of your compression ratio. In short, if your
       machine is low on memory (8 megabytes or less), use -s for
       everything. See MEMORY MANAGEMENT below.

-q --quiet
       Suppress non-essential warning messages. Messages
       pertaining to I/O errors and other critical events will
       not be suppressed.

-v --verbose
       Verbose mode -- show the compression ratio for each file
       processed. Further -v’s increase the verbosity level,
       spewing out lots of information which is primarily of
       interest for diagnostic purposes.

-L --license -V --version
       Display the software version, license terms and
       conditions.

-1 (or --fast) to -9 (or --best)
       Set the block size to 100 k, 200 k .. 900 k when
       compressing. Has no effect when decompressing. See MEMORY
       MANAGEMENT below. The --fast and --best aliases are
       primarily for GNU gzip compatibility. In particular,
       --fast doesn’t make things significantly faster. And
       --best merely selects the default behaviour.

--     Treats all subsequent arguments as file names, even if
       they start with a dash. This is so you can handle files
       with names beginning with a dash, for example:
       bzip2 -- -myfilename.

--repetitive-fast --repetitive-best
       These flags are redundant in versions 0.9.5 and above.
       They provided some coarse control over the behaviour of
       the sorting algorithm in earlier versions, which was
       sometimes useful. 0.9.5 and above have an improved
       algorithm which renders these flags irrelevant.

MEMORY MANAGEMENT
bzip2 compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory
needed for compression and decompression. The flags -1 through -9
specify the block size to be 100,000 bytes through 900,000 bytes
(the default) respectively. At decompression time, the block size
used for compression is read from the header of the compressed
file, and bunzip2 then allocates itself just enough memory to
decompress the file. Since block sizes are stored in compressed
files, it follows that the flags -1 to -9 are irrelevant to and
so ignored during decompression.
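
The stored block size is easy to inspect. In the .bz2 format the
stream begins with the bytes "BZh" followed by an ASCII digit
from 1 to 9 giving the block size in units of 100,000 bytes; that
header layout is a property of the file format rather than
something stated on this page. A small sketch using Python's
standard bz2 module, with made-up sample data:

```python
import bz2

data = b"block size demo " * 1000

for level in range(1, 10):
    packed = bz2.compress(data, compresslevel=level)
    header = packed[:4]                      # b"BZh" plus the level digit
    block_size = int(chr(header[3])) * 100_000
    print(level, header, block_size, "byte blocks")
```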

Compression and decompression requirements, in bytes, can be
estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )
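
These estimates are simple enough to compute directly. The sketch
below assumes that "k" means 1000 bytes and that flag -N selects a
block of N x 100,000 bytes, as stated earlier; the function names
are invented for the example.

```python
def block_size_bytes(level):
    """Block size selected by flags -1 .. -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100_000


def compress_memory(level):
    """Estimated bytes needed to compress: 400k + 8 x block size."""
    return 400_000 + 8 * block_size_bytes(level)


def decompress_memory(level, small=False):
    """Estimated bytes needed to decompress; small=True models -s."""
    factor = 2.5 if small else 4
    return int(100_000 + factor * block_size_bytes(level))


for level in (1, 5, 9):
    print(level,
          compress_memory(level) // 1000, "k",
          decompress_memory(level) // 1000, "k",
          decompress_memory(level, small=True) // 1000, "k")
```

For -7 the compression estimate works out to 6000k, slightly
below the 6100k given in the table further down; the table figure
is the author's own.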

Larger block sizes give rapidly diminishing marginal returns.
Most of the compression comes from the first two or three hundred
k of block size, a fact worth bearing in mind when using bzip2 on
small machines. It is also important to appreciate that the
decompression memory requirement is set at compression time by
the choice of block size.

For files compressed with the default 900k block size, bunzip2
will require about 3700 kbytes to decompress. To support
decompression of any file on a 4 megabyte machine, bunzip2 has an
option to decompress using approximately half this amount of
memory, about 2300 kbytes. Decompression speed is also halved, so
you should use this option only where necessary. The relevant
flag is -s.

In general, try to use the largest block size memory constraints
allow, since that maximises the compression achieved. Compression
and decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single
block -- that means most files you’d encounter using a large
block size. The amount of real memory touched is proportional to
the size of the file, since the file is smaller than a block. For
example, compressing a file 20,000 bytes long with the flag -9
will cause the compressor to allocate around 7600k of memory, but
only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the
decompressor will allocate 3700k but only touch 100k + 20000 * 4
= 180 kbytes.

Here is a table which summarises the maximum memory usage for
different block sizes. Also recorded is the total compressed size
for 14 files of the Calgary Text Compression Corpus totalling
3,141,622 bytes. This column gives some feel for how compression
varies with block size. These figures tend to understate the
advantage of larger block sizes for larger files, since the
Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

RECOVERING DATA FROM DAMAGED FILES
bzip2 compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error
causes a multi-block .bz2 file to become damaged, it may be
possible to recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a
48-bit pattern, which makes it possible to find the block
boundaries with reasonable certainty. Each block also carries its
own 32-bit CRC, so damaged blocks can be distinguished from
undamaged ones.
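
To make the idea concrete, here is a rough sketch of a bit-level
search for block boundaries. It is not the bzip2recover code. It
assumes the block marker is the 48-bit value 0x314159265359 and
that bits are packed most significant bit first, both of which are
details of the .bz2 format not given on this page. Blocks need
not start on a byte boundary, hence the bit-by-bit scan.

```python
import bz2

# Assumption: 48-bit block marker of the .bz2 format.
BLOCK_MAGIC = 0x314159265359
MASK_48 = (1 << 48) - 1


def find_block_starts(data):
    """Return the bit offsets at which the block marker appears."""
    window = 0          # the most recent 48 bits read
    bits_seen = 0
    starts = []
    for byte in data:
        for shift in range(7, -1, -1):       # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                starts.append(bits_seen - 48)
    return starts


packed = bz2.compress(b"some sample text\n" * 200)
starts = find_block_starts(packed)
print(starts)
```

The first block follows the 4-byte stream header, so bit offset
32 should appear in the result.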

bzip2recover is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use bzip2 -t to test the integrity of the
resulting files, and decompress those which are undamaged.

bzip2recover takes a single argument, the name of the damaged
file, and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc, containing the extracted blocks. The
output filenames are designed so that the use of wildcards in
subsequent processing -- for example, "bzip2 -dc rec*file.bz2 >
recovered_data" -- processes the files in the correct order.
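
The zero-padded block number is what makes this work: wildcard
expansion normally sorts names as strings, and padded numbers sort
in numeric order. A small sketch, assuming a five-digit number as
in the names above; the helper recovered_names is invented for the
example.

```python
def recovered_names(damaged_name, block_count):
    """Build names of the form rec00001file.bz2 (hypothetical helper)."""
    return ["rec%05d%s" % (i, damaged_name)
            for i in range(1, block_count + 1)]


names = recovered_names("file.bz2", 12)

# Without padding, string order and block order disagree.
unpadded = ["rec%d%s" % (i, "file.bz2") for i in range(1, 13)]

print(sorted(names) == names)        # padded names keep block order
print(sorted(unpadded) == unpadded)  # unpadded names do not
```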

bzip2recover should be of most use dealing with large .bz2 files,
as these will contain many blocks. It is clearly futile to use it
on damaged single-block files, since a damaged block cannot be
recovered. If you wish to minimise any potential data loss
through media or transmission errors, you might consider
compressing with a smaller block size.

PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings
in the file. Because of this, files containing very long runs of
repeated symbols, like "aabaabaabaab ..." (repeated several
hundred times) may compress more slowly than normal. Versions
0.9.5 and above fare much better than previous versions in this
respect. The ratio between worst-case and average-case
compression time is in the region of 10:1. For previous versions,
this figure was more like 100:1. You can use the -vvvv option to
monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

bzip2 usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion. This
means that performance, both for compressing and decompressing,
is largely determined by the speed at which your machine can
service cache misses. Because of this, small changes to the code
to reduce the miss rate have been observed to give
disproportionately large performance improvements. I imagine
bzip2 will perform best on machines with very large caches.

CAVEATS
I/O error messages are not as helpful as they could be. bzip2
tries hard to detect I/O errors and exit cleanly, but the details
of what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.8 of bzip2. Compressed
data created by this version is entirely forwards and backwards
compatible with the previous public releases, versions 0.1pl2,
0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the
following exception: 0.9.0 and above can correctly decompress
multiple concatenated compressed files. 0.1pl2 cannot do this; it
will stop after decompressing just the first file in the stream.

bzip2recover versions prior to 1.0.2 used 32-bit integers to
represent bit positions in compressed files, so they could not
handle compressed files more than 512 megabytes long. Versions
1.0.2 and above use 64-bit ints on some platforms which support
them (GNU supported targets, and Windows). To establish whether
or not bzip2recover was built with such a limitation, run it
without arguments. In any event you can build yourself an
unlimited version if you can recompile it with MaybeUInt64 set to
be an unsigned 64-bit integer.
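
The 512 megabyte figure follows from counting bits rather than
bytes. A short check of the arithmetic, assuming an unsigned
32-bit counter and a megabyte of 2**20 bytes:

```python
# An unsigned 32-bit counter can address this many distinct bit positions.
addressable_bits = 2 ** 32

# Eight bits per byte gives the largest file that can be indexed.
limit_bytes = addressable_bits // 8
limit_megabytes = limit_bytes // 2 ** 20

print(limit_megabytes, "megabytes")
```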

AUTHOR
Julian Seward, jseward@acm.org.

https://sourceware.org/bzip2/

The ideas embodied in bzip2 are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder),
Peter Fenwick (for the structured coding model in the original
bzip, and many refinements), and Alistair Moffat, Radford Neal
and Ian Witten (for the arithmetic coder in the original bzip). I
am much indebted for their help, support and advice. See the
manual in the source distribution for pointers to sources of
documentation. Christian von Roques encouraged me to look for
faster sorting algorithms, so as to speed up compression. Bela
Lubkin encouraged me to improve the worst-case compression
performance. Donna Robinson XMLised the documentation. The bz*
scripts are derived from those of GNU gzip. Many people sent
patches, helped with portability problems, lent machines, gave
advice and were generally helpful.

bzip2(1)