What the zip archive looks like and what we can do with it. Part 2 - Data Descriptor and Compression

Continuation of the article "What the zip archive looks like and what we can do with it".







Foreword



Good day.

And once again we're on the air with unconventional programming in PHP.







In the previous article, readers were interested in zip compression and zip streaming. Let's try to dig into that topic a little today.







Let's take a look.







Code from the last article
<?php

// Contents of the archive (files 1.txt and 2.txt):
$entries = [
    '1.txt' => 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc id ante ultrices, fermentum nibh eleifend, ullamcorper nunc. Sed dignissim ut odio et imperdiet. Nunc id felis et ligula viverra blandit a sit amet magna. Vestibulum facilisis venenatis enim sed bibendum. Duis maximus felis in suscipit bibendum. Mauris suscipit turpis eleifend nibh commodo imperdiet. Donec tincidunt porta interdum. Aenean interdum condimentum ligula, vitae ornare lorem auctor in. Suspendisse metus ipsum, porttitor et sapien id, fringilla aliquam nibh. Curabitur sem lacus, ultrices quis felis sed, blandit commodo metus. Duis tincidunt vel mauris at accumsan. Integer et ipsum fermentum leo viverra blandit.',
    '2.txt' => 'Mauris in purus sit amet ante tempor finibus nec sed justo. Integer ac nibh tempus, mollis sem vel, consequat diam. Pellentesque ut condimentum ex. Praesent finibus volutpat gravida. Vivamus eleifend neque sit amet diam scelerisque lacinia. Nunc imperdiet augue in suscipit lacinia. Curabitur orci diam, iaculis non ligula vitae, porta pellentesque est. Duis dolor erat, placerat a lacus eu, scelerisque egestas massa. Aliquam molestie pulvinar faucibus. Quisque consequat, dolor mattis lacinia pretium, eros eros tempor neque, volutpat consectetur elit elit non diam. In faucibus nulla justo, non dignissim erat maximus consectetur. Sed porttitor turpis nisl, elementum aliquam dui tincidunt nec. Nunc eu enim at nibh molestie porta ut ac erat. Sed tortor sem, mollis eget sodales vel, faucibus in dolor.',
];

// The archive will be written to Lorem.zip in the cwd
$destination = 'Lorem.zip';
$handle = fopen($destination, 'w');

// We count the bytes written and collect a "dictionary"
// of data for the Central Directory File Headers
$written = 0;
$dictionary = [];

foreach ($entries as $filename => $content) {
    // Each entry begins with a Local File Header describing the file;
    // since we will need the same fields again later, we keep them in an array
    $fileInfo = [
        // Minimum version required to extract
        'versionToExtract' => 10,
        // No flags set
        'generalPurposeBitFlag' => 0,
        // No compression for now
        'compressionMethod' => 0,
        // MS-DOS time and date of last modification
        'modificationTime' => 28021,
        'modificationDate' => 20072,
        // Checksum of the uncompressed data
        'crc32' => hexdec(hash('crc32b', $content)),
        // Without compression both sizes are the same :)
        'compressedSize' => $size = strlen($content),
        'uncompressedSize' => $size,
        // Length of the file name
        'filenameLength' => strlen($filename),
        // No extra field, so 0
        'extraFieldLength' => 0,
    ];

    // Pack the Local File Header
    $LFH = pack('LSSSSSLLLSSa*', ...array_values([
        'signature' => 0x04034b50, // Local File Header signature
    ] + $fileInfo + ['filename' => $filename]));

    // Remember the data we will need for the Central Directory File Header
    $dictionary[$filename] = [
        'signature' => 0x02014b50, // Central Directory File Header signature
        'versionMadeBy' => 798,
    ] + $fileInfo + [
        'fileCommentLength' => 0, // No comments
        'diskNumber' => 0,
        'internalFileAttributes' => 0,
        'externalFileAttributes' => 2176057344,
        'localFileHeaderOffset' => $written, // offset of the Local File Header
        'filename' => $filename,
    ];

    // Write the header, then the content
    $written += fwrite($handle, $LFH);
    $written += fwrite($handle, $content);
}

// All entries are written; now the End of central directory record (EOCD)
$EOCD = [
    'signature' => 0x06054b50, // EOCD signature
    'diskNumber' => 0,
    'startDiskNumber' => 0,
    'numberCentralDirectoryRecord' => $records = count($dictionary),
    'totalCentralDirectoryRecord' => $records,
    // Will be filled in once the Central Directory Records are written
    'sizeOfCentralDirectory' => 0,
    'centralDirectoryOffset' => $written,
    'commentLength' => 0
];

// Write the Central Directory File Headers
foreach ($dictionary as $entryInfo) {
    $CDFH = pack('LSSSSSSLLLSSSSSLLa*', ...array_values($entryInfo));
    $written += fwrite($handle, $CDFH);
}

// Now the size of the central directory is known
$EOCD['sizeOfCentralDirectory'] = $written - $EOCD['centralDirectoryOffset'];

// Pack and write the End of central directory record
$EOCD = pack('LSSSSLLS', ...array_values($EOCD));
$written += fwrite($handle, $EOCD);

// Done
fclose($handle);

echo 'Bytes written: ' . $written . PHP_EOL;
echo 'Check the archive with `unzip -tq ' . $destination . '`' . PHP_EOL;
echo PHP_EOL;







What is wrong with it? Well, to be fair, its one redeeming quality is that it works at all; still, it has its share of problems.



In my opinion, the main problem is that we must first write the Local File Header (LFH) with crc32 and the file length, and then the contents of the file itself.

What does this mean in practice? Either we load the entire file into memory, compute its crc32, write the LFH and then the file contents — cheap in terms of I/O, but unacceptable for large files. Or we read the file twice: first to calculate the hash, then to read the contents and write them to the archive — cheap in terms of RAM, but it doubles the load on the drive, which is not necessarily an SSD.



And what if the file is remote and weighs, say, 1.5GB? We would either have to load all 1.5GB into memory, or wait until all 1.5GB are downloaded to compute the hash, and then download them again to emit the contents. And if we want to stream something generated on the fly — say, a database dump read from stdout — this approach fails completely: the data in the database changes, the dump changes, the hash comes out completely different, and we get an invalid archive. Yeah, things are bad.
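The two-pass variant described above can be sketched as a small function (the file name below is made up for the demo): pass one computes the checksum, pass two re-reads the same bytes, as we would when writing them into the archive. Every byte hits the drive twice.

```php
<?php
// Sketch of the two-pass approach: hash first, re-read second.
function crcTwoPass(string $path): array
{
    // Pass 1: hash only, the data itself is discarded.
    $crc32 = hexdec(hash_file('crc32b', $path));

    // Pass 2: read the file again, as we would when streaming it
    // into the archive (here we only count the bytes).
    $length = 0;
    $handle = fopen($path, 'r');
    while (!feof($handle)) {
        $length += strlen(fread($handle, 8 * 1024));
    }
    fclose($handle);

    return ['crc32' => $crc32, 'length' => $length];
}

// Demo on a small temporary file.
file_put_contents('two-pass-demo.txt', 'Lorem ipsum dolor sit amet');
$result = crcTwoPass('two-pass-demo.txt');
unlink('two-pass-demo.txt');
```

It works, but the double read is exactly the cost we want to avoid.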







Data Descriptor Structure for Streaming Archive Records



But don't be discouraged: the ZIP specification allows us to write the data first and append a Data Descriptor (DD) structure after it, which carries the crc32, the length of the compressed data and the length of the uncompressed data. All we need to do (three times a day, on an empty stomach) is set generalPurposeBitFlag to 0x0008 in the LFH and set crc32, compressedSize and uncompressedSize there to 0. Then, after the data, we write the DD structure, which looks something like this:









pack('LLLL', ...array_values([
    'signature' => 0x08074b50,               // Data Descriptor signature
    'crc32' => $crc32,                       // crc32 of the uncompressed data
    'compressedSize' => $compressedSize,     // length of the compressed data
    'uncompressedSize' => $uncompressedSize, // length of the uncompressed data
]));





And in the Central Directory File Header (CDFH) only the generalPurposeBitFlag changes; the rest of the fields must contain the real values. That is not a problem, since we write the CDFH after all the data, by which point the hashes and lengths are known anyway.
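As a small sanity check of the structure above: the packed Data Descriptor (a signature plus three 32-bit fields) occupies exactly 16 bytes. The field values below are placeholders for the demo.

```php
<?php
// Pack a Data Descriptor with placeholder values and check its size.
$DD = pack('LLLL', ...array_values([
    'signature' => 0x08074b50,
    'crc32' => 0x12345678,
    'compressedSize' => 1024,
    'uncompressedSize' => 2048,
]));

// On a little-endian machine 'L' writes the signature to disk
// as the familiar "PK\x07\x08" byte sequence.
var_dump(strlen($DD)); // int(16)
```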







This is all well and good; now we just have to implement it in PHP.

And the standard Hash extension will help us a lot here. We can create a hash context, feed it chunks of data, and get the final hash value at the end. Of course, this solution is somewhat more cumbersome than hash('crc32b', $content), but it saves us an unimaginable amount of resources and time.



It looks something like this:







$hashCtx = hash_init('crc32b');
$handle = fopen($source, 'r');

while (!feof($handle)) {
    $chunk = fread($handle, 8 * 1024);
    hash_update($hashCtx, $chunk);
    $chunk = null;
}

$hash = hash_final($hashCtx);





If everything is done correctly, the value will not differ at all from hash_file('crc32b', $source) or hash('crc32b', file_get_contents($source)).
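That claim is easy to self-check: feeding a file to a hash context in 8 KB chunks yields exactly the same digest as hash_file(). The file name below is just for the demo.

```php
<?php
// Create ~100 KB of random data to hash.
$source = 'hash-demo.bin';
file_put_contents($source, random_bytes(100 * 1024));

// Chunked hashing via a hash context.
$hashCtx = hash_init('crc32b');
$handle = fopen($source, 'r');
while (!feof($handle)) {
    hash_update($hashCtx, fread($handle, 8 * 1024));
}
fclose($handle);
$chunked = hash_final($hashCtx);

// One-shot hashing of the whole file.
$whole = hash_file('crc32b', $source);
unlink($source);

var_dump($chunked === $whole); // bool(true)
```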



Let's wrap all of this up in a single function that lets us conveniently read a file and, at the end, gives us its hash and length. Generators are a great fit here:







function read(string $path): \Generator
{
    $length = 0;
    $handle = fopen($path, 'r');
    $hashCtx = hash_init('crc32b');

    while (!feof($handle)) {
        $chunk = fread($handle, 8 * 1024);
        $length += strlen($chunk);
        hash_update($hashCtx, $chunk);
        yield $chunk;
        $chunk = null;
    }

    fclose($handle);

    return ['length' => $length, 'crc32' => hexdec(hash_final($hashCtx))];
}





and now we can simply do:



$reader = read('https://speed.hetzner.de/1GB.bin');

foreach ($reader as $chunk) {
    // do something with the chunk
}

// The generator's return value holds the length and the hash
['length' => $length, 'crc32' => $crc32] = $reader->getReturn();

echo round(memory_get_peak_usage(true) / 1024 / 1024, 2) . 'MB - Memory Peak Usage' . PHP_EOL;





In my opinion it’s quite simple and convenient. With a 1GB file, my peak memory consumption was 2MB.

Now let's try to modify the code from the previous article so that we can use this function.







Final script
<?php

function read(string $path): \Generator
{
    $length = 0;
    $handle = fopen($path, 'r');
    $hashCtx = hash_init('crc32b');

    while (!feof($handle)) {
        $chunk = fread($handle, 8 * 1024);
        $length += strlen($chunk);
        hash_update($hashCtx, $chunk);
        yield $chunk;
        $chunk = null;
    }

    fclose($handle);

    return ['length' => $length, 'crc32' => hexdec(hash_final($hashCtx))];
}

$entries = ['https://speed.hetzner.de/100MB.bin', __FILE__];
$destination = 'test.zip';
$handle = fopen($destination, 'w');

$written = 0;
$dictionary = [];

foreach ($entries as $entry) {
    $filename = basename($entry);

    $fileInfo = [
        'versionToExtract' => 10,
        // Since we now use the Data Descriptor, the flag
        // is 0x0008 instead of the 0x0000 we had before
        'generalPurposeBitFlag' => 0x0008,
        'compressionMethod' => 0,
        'modificationTime' => 28021,
        'modificationDate' => 20072,
        'crc32' => 0,
        'compressedSize' => 0,
        'uncompressedSize' => 0,
        'filenameLength' => strlen($filename),
        'extraFieldLength' => 0,
    ];

    $LFH = pack('LSSSSSLLLSSa*', ...array_values([
        'signature' => 0x04034b50,
    ] + $fileInfo + ['filename' => $filename]));

    $fileOffset = $written;
    $written += fwrite($handle, $LFH);

    // Stream the file contents into the archive
    $reader = read($entry);
    foreach ($reader as $chunk) {
        $written += fwrite($handle, $chunk);
        $chunk = null;
    }

    // Now the length and the hash are known
    ['length' => $length, 'crc32' => $crc32] = $reader->getReturn();

    // Update fileInfo so the CDFH gets the real values
    $fileInfo['crc32'] = $crc32;
    $fileInfo['compressedSize'] = $length;
    $fileInfo['uncompressedSize'] = $length;

    // Write the Data Descriptor
    $DD = pack('LLLL', ...array_values([
        'signature' => 0x08074b50,
        'crc32' => $fileInfo['crc32'],
        'compressedSize' => $fileInfo['compressedSize'],
        'uncompressedSize' => $fileInfo['uncompressedSize'],
    ]));
    $written += fwrite($handle, $DD);

    $dictionary[$filename] = [
        'signature' => 0x02014b50,
        'versionMadeBy' => 798,
    ] + $fileInfo + [
        'fileCommentLength' => 0,
        'diskNumber' => 0,
        'internalFileAttributes' => 0,
        'externalFileAttributes' => 2176057344,
        'localFileHeaderOffset' => $fileOffset,
        'filename' => $filename,
    ];
}

$EOCD = [
    'signature' => 0x06054b50,
    'diskNumber' => 0,
    'startDiskNumber' => 0,
    'numberCentralDirectoryRecord' => $records = count($dictionary),
    'totalCentralDirectoryRecord' => $records,
    'sizeOfCentralDirectory' => 0,
    'centralDirectoryOffset' => $written,
    'commentLength' => 0
];

foreach ($dictionary as $entryInfo) {
    $CDFH = pack('LSSSSSSLLLSSSSSLLa*', ...array_values($entryInfo));
    $written += fwrite($handle, $CDFH);
}

$EOCD['sizeOfCentralDirectory'] = $written - $EOCD['centralDirectoryOffset'];
$EOCD = pack('LSSSSLLS', ...array_values($EOCD));
$written += fwrite($handle, $EOCD);

fclose($handle);

echo 'Memory peak usage: ' . memory_get_peak_usage(true) . ' bytes' . PHP_EOL;
echo 'Bytes written: ' . $written . PHP_EOL;
echo 'Checking the archive with `unzip -tq ' . $destination . '`: ' . PHP_EOL;
echo '> ' . exec('unzip -tq ' . $destination) . PHP_EOL;
echo PHP_EOL;







As output we should get a zip archive named test.zip containing the above script itself and 100MB.bin, about 100 MB in size.



Compression in zip archives



Now we have virtually everything we need to compress the data, and to do it on the fly too.

Just as we computed the hash by feeding small chunks to a function, we can compress data the same way thanks to the wonderful Zlib extension and its deflate_init and deflate_add functions.







It looks something like this:







$deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);
$handle = fopen($source, 'r');

while (!feof($handle)) {
    $chunk = fread($handle, 8 * 1024);
    yield deflate_add($deflateCtx, $chunk, feof($handle) ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);
    $chunk = null;
}





I also came across a variant like the one below, which, compared to the previous one, appends a few extra zero bytes at the end.
Alternative variant
while (!feof($handle)) {
    $chunk = fread($handle, 8 * 1024);
    yield deflate_add($deflateCtx, $chunk, ZLIB_SYNC_FLUSH);
}

yield deflate_add($deflateCtx, '', ZLIB_FINISH);





But unzip complained about it, so I had to drop that simplification.
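The chunked compression scheme we settled on (ZLIB_SYNC_FLUSH for intermediate chunks, ZLIB_FINISH for the last one) can be verified in isolation: the concatenated output must be a valid raw deflate stream that inflates back to the original data. A minimal round-trip sketch:

```php
<?php
// Highly repetitive input, so compression will be very effective.
$data = str_repeat('zip streaming in php ', 10000);

// Compress in 8 KB chunks, finishing the stream on the last chunk.
$deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);
$compressed = '';
$chunks = str_split($data, 8 * 1024);
$last = count($chunks) - 1;
foreach ($chunks as $i => $chunk) {
    $compressed .= deflate_add($deflateCtx, $chunk, $i === $last ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);
}

// Inflate the whole raw stream in one go to confirm the round trip.
$restored = inflate_add(inflate_init(ZLIB_ENCODING_RAW), $compressed, ZLIB_FINISH);
var_dump($restored === $data); // bool(true)
```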



Let's fix our reader so that it compresses the data on the fly and, at the end, returns the hash, the uncompressed data length and the compressed data length:







function read(string $path): \Generator
{
    $uncompressedSize = 0;
    $compressedSize = 0;
    $hashCtx = hash_init('crc32b');
    $deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);
    $handle = fopen($path, 'r');

    while (!feof($handle)) {
        $chunk = fread($handle, 8 * 1024);
        hash_update($hashCtx, $chunk);
        $compressedChunk = deflate_add($deflateCtx, $chunk, feof($handle) ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);
        $uncompressedSize += strlen($chunk);
        $compressedSize += strlen($compressedChunk);
        yield $compressedChunk;
        $chunk = null;
        $compressedChunk = null;
    }

    fclose($handle);

    return [
        'uncompressedSize' => $uncompressedSize,
        'compressedSize' => $compressedSize,
        'crc32' => hexdec(hash_final($hashCtx))
    ];
}





and try it on a 100 MB file:



$reader = read('https://speed.hetzner.de/100MB.bin');

foreach ($reader as $chunk) {
    // do something with the chunk
}

['uncompressedSize' => $uncompressedSize, 'compressedSize' => $compressedSize, 'crc32' => $crc32] = $reader->getReturn();

echo 'Uncompressed size: ' . $uncompressedSize . PHP_EOL;
echo 'Compressed size: ' . $compressedSize . PHP_EOL;
echo round(memory_get_peak_usage(true) / 1024 / 1024, 2) . 'MB - Memory Peak Usage' . PHP_EOL;





Memory consumption still shows that we did not load the entire file into memory.



Let's put it all together and finally get a truly streaming archiver script.

Unlike the previous version, our generalPurposeBitFlag changes again — its value is now 0x0018 — as does compressionMethod, which becomes 8 (meaning Deflate).







Final script
<?php

function read(string $path): \Generator
{
    $uncompressedSize = 0;
    $compressedSize = 0;
    $hashCtx = hash_init('crc32b');
    $deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);
    $handle = fopen($path, 'r');

    while (!feof($handle)) {
        $chunk = fread($handle, 8 * 1024);
        hash_update($hashCtx, $chunk);
        $compressedChunk = deflate_add($deflateCtx, $chunk, feof($handle) ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);
        $uncompressedSize += strlen($chunk);
        $compressedSize += strlen($compressedChunk);
        yield $compressedChunk;
        $chunk = null;
        $compressedChunk = null;
    }

    fclose($handle);

    return [
        'uncompressedSize' => $uncompressedSize,
        'compressedSize' => $compressedSize,
        'crc32' => hexdec(hash_final($hashCtx))
    ];
}

$entries = ['https://speed.hetzner.de/100MB.bin', __FILE__];
$destination = 'test.zip';
$handle = fopen($destination, 'w');

$written = 0;
$dictionary = [];

foreach ($entries as $entry) {
    $filename = basename($entry);

    $fileInfo = [
        'versionToExtract' => 10,
        // Compared to the previous version the flag is now 0x0018 instead of 0x0008
        'generalPurposeBitFlag' => 0x0018,
        'compressionMethod' => 8, // the compression method: 8 means Deflate
        'modificationTime' => 28021,
        'modificationDate' => 20072,
        'crc32' => 0,
        'compressedSize' => 0,
        'uncompressedSize' => 0,
        'filenameLength' => strlen($filename),
        'extraFieldLength' => 0,
    ];

    $LFH = pack('LSSSSSLLLSSa*', ...array_values([
        'signature' => 0x04034b50,
    ] + $fileInfo + ['filename' => $filename]));

    $fileOffset = $written;
    $written += fwrite($handle, $LFH);

    $reader = read($entry);
    foreach ($reader as $chunk) {
        $written += fwrite($handle, $chunk);
        $chunk = null;
    }

    [
        'uncompressedSize' => $uncompressedSize,
        'compressedSize' => $compressedSize,
        'crc32' => $crc32
    ] = $reader->getReturn();

    $fileInfo['crc32'] = $crc32;
    $fileInfo['compressedSize'] = $compressedSize;
    $fileInfo['uncompressedSize'] = $uncompressedSize;

    $DD = pack('LLLL', ...array_values([
        'signature' => 0x08074b50,
        'crc32' => $fileInfo['crc32'],
        'compressedSize' => $fileInfo['compressedSize'],
        'uncompressedSize' => $fileInfo['uncompressedSize'],
    ]));
    $written += fwrite($handle, $DD);

    $dictionary[$filename] = [
        'signature' => 0x02014b50,
        'versionMadeBy' => 798,
    ] + $fileInfo + [
        'fileCommentLength' => 0,
        'diskNumber' => 0,
        'internalFileAttributes' => 0,
        'externalFileAttributes' => 2176057344,
        'localFileHeaderOffset' => $fileOffset,
        'filename' => $filename,
    ];
}

$EOCD = [
    'signature' => 0x06054b50,
    'diskNumber' => 0,
    'startDiskNumber' => 0,
    'numberCentralDirectoryRecord' => $records = count($dictionary),
    'totalCentralDirectoryRecord' => $records,
    'sizeOfCentralDirectory' => 0,
    'centralDirectoryOffset' => $written,
    'commentLength' => 0
];

foreach ($dictionary as $entryInfo) {
    $CDFH = pack('LSSSSSSLLLSSSSSLLa*', ...array_values($entryInfo));
    $written += fwrite($handle, $CDFH);
}

$EOCD['sizeOfCentralDirectory'] = $written - $EOCD['centralDirectoryOffset'];
$EOCD = pack('LSSSSLLS', ...array_values($EOCD));
$written += fwrite($handle, $EOCD);

fclose($handle);

echo 'Memory peak usage: ' . memory_get_peak_usage(true) . ' bytes' . PHP_EOL;
echo 'Bytes written: ' . $written . PHP_EOL;
echo 'Checking the archive with `unzip -tq ' . $destination . '`: ' . PHP_EOL;
echo '> ' . exec('unzip -tq ' . $destination) . PHP_EOL;
echo PHP_EOL;





As a result, I got an archive of 360,183 bytes (our 100MB file compressed very well — it most likely consists of a run of identical bytes), and unzip reported no errors in the archive.



Conclusion



If I have enough energy and time for another article, then I will try to show how and, most importantly, why all this can be used.



If you are interested in anything else on this topic, suggest it in the comments and I will try to answer your questions. We most likely won't tackle encryption, since the script has already grown quite a bit, and in real life such encrypted archives, it seems to me, are not used very often.









Thank you for your attention and for your comments.







