Episode #22 - Common Archive and Compression Formats

About Episode - Duration: 11 minutes, Published: 2014-04-24

In this episode, I wanted to talk about common archive and compression formats that you are likely to encounter on Linux machines. We will review the differences between archive and compression formats and then focus on how to create and extract these formats using various utilities.

Links, Code, and Transcript


Before we dive in, I should define what I think the most common archive and compression formats are. As a benchmark, let's focus on how the Linux Kernel has been distributed for many years using three key archive and compression types. First, we have the tar.gz package, which I personally think is the most common. Next, there is the tar.xz package, which is gaining traction and is actually the default download type when you visit kernel.org looking for the Linux Kernel. Finally, we have tar.bz2. Even though the distribution of the tar.bz2 package was dropped in late 2013, it is still very common out in the wild.

You will notice that these package names all contain the word tar, which denotes the archive type, followed by a trailing compression type. For example, gz, xz, and bz2 are all different compression types. This is where the distinction between archive and compression formats comes in.

So, at this point we have defined three key archive and compression format combinations, but I also think this serves as a great example use case. The kernel.org site needs to distribute the Linux Kernel source code, so it packages these files and directories up using archive and compression tools with the goal of saving disk space, bandwidth, and making these packages much easier to work with and distribute.

So, what does the workflow look like for creating and then extracting one of these packages? Well, at a high level, we start with the files we want to share. In the case of the Linux Kernel source code, this would be many thousands of files and directories. We can use an archive tool, like tar, to create a wrapper around the files and directories. I like to think of this like a bucket. This bucket holds our files and directories and also has a listing of their associated metadata, things like their timestamps, permissions, and ownership, amongst other things. Next, we use a compression tool, like gzip, which takes this archive or bucket holding our files and compresses it down to a much smaller file. In the case of the Linux Kernel, we go from roughly 600 megs worth of source code and documentation across thousands of files and directories, down to a single compressed archive of roughly 75 megs in size.

As you can probably guess by now, this is where the distinction between archive and compression formats happens. We are using the tar archive format as a bucket to store our many files and directories as a single easily handled archive, then we take that tar archive, commonly referred to as a tarball, and compress it down using the gzip compression format. This is why you see files with names like tar.gz, as it indicates we are using the tar archive format along with gzip compression.

To extract the files, we simply need to reverse the process, using gunzip to remove the compression, and then use tar to extract the files out of the archive. I should mention that in this workflow example, I am using two commands to do this, tar and gunzip, but in reality you also have the much simpler option of having the tar utility do both the archive and compression steps. The tar command can be used as a type of Swiss Army Knife, by having it act as a wrapper around many compression formats so you do not actually have to use multiple commands to compress or extract archives.
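To make the two-command workflow concrete, here is a minimal sketch of it. The temporary working directory and the sample file names are assumptions made up for illustration, not the files used in the episode.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so nothing outside it is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data standing in for the files we want to share.
mkdir archive
printf 'hello\n' > archive/notes.txt
printf 'world\n' > archive/readme.txt

# Step 1: bundle the directory into a single uncompressed tar archive.
tar --create --file archive.tar archive/

# Step 2: compress the tarball; gzip replaces archive.tar with archive.tar.gz.
gzip archive.tar

# To extract, reverse the steps. Remove the original directory first so we
# can see that the files really come back out of the archive.
rm -r archive
gunzip archive.tar.gz               # turns archive.tar.gz back into archive.tar
tar --extract --file archive.tar    # recreates archive/ and its files

cat archive/notes.txt
```

Note that gzip and gunzip replace their input file by default, so at any point only one of archive.tar or archive.tar.gz exists on disk.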

Compress and Extract

For simplicity's sake, I have broken this episode into four logical sections. In the first section, we are going to walk through creating and extracting a tar archive, basically just an archive without the compression step. The following three sections will cover creating tar.gz, tar.xz, and tar.bz2 compressed archives.

Before I create the tar archive, I should mention that I have created an example directory called archive, and inside that directory are four files of various types that will be used throughout the example sections. I highly recommend checking out the man page for the tar command as it is very well written and packed with detailed information. Also, in my examples today, I will show you the longer command arguments first as I find them helpful to use when explaining what a command does, and then we can look at the shorthand or abbreviated ones later on.

How to create and extract a tar archive

So, let's get started with creating and extracting tar archives.

tar --create --preserve-permissions --verbose --file archive.tar archive/

As you can see, I am using the tar command with the create option. Next, we tell tar to preserve the file permissions, which can be handy if you have special modes, users, or groups that you want to preserve. (Strictly speaking, GNU tar always records permissions in the archive when creating it, and this option mainly matters at extraction time, but it is harmless here.) Then we tell tar to be verbose, which basically means giving us feedback as things are happening. After that, we use the file option to tell tar to create our archive with a given name. Lastly, we tell tar where the files are that we want to archive.

However, if we use the abbreviated options, the command is much shorter, and it gives the same result.

tar cpvf archive.tar archive/

Just to review, we create, preserve permissions, be verbose, provide a destination archive name, and give the source files or directory for the archive.

So, what about extracting the tarball? Well, you just reverse the process like we talked about earlier. The command to extract the tarball looks like this.

tar --extract --verbose --file archive.tar

Essentially, just tell tar to extract the contents, tell us about the status, and point tar at the archive we want to extract. The shorthand version is even simpler.

tar xvf archive.tar

We are extracting the archive contents, using verbose mode to give us feedback as things are happening, and then specifying the file we want to extract.
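If you want to convince yourself that a tarball really round-trips, something like the following sketch can help. It uses the list and directory options, which are not covered in this episode, and the sample files and the restored directory name are assumptions for illustration only.

```shell
#!/bin/sh
set -eu

# Throwaway working directory with hypothetical sample files.
workdir=$(mktemp -d)
cd "$workdir"
mkdir archive
printf 'alpha\n' > archive/a.txt
printf 'beta\n' > archive/b.txt
chmod 640 archive/b.txt

# Create the archive (same shorthand as above, minus the verbose flag).
tar cpf archive.tar archive/

# List the contents without extracting anything. The verbose listing shows
# the metadata stored in the bucket: mode, owner, size, and timestamp.
tar --list --verbose --file archive.tar

# Extract into a separate directory so the originals are left untouched.
mkdir restored
tar --extract --file archive.tar --directory restored

# diff exits non-zero if any file content differs between the two trees.
diff -r archive restored/archive
echo "round trip OK"
```

Extracting into a separate directory is just a cautious habit here, since a plain extract in the same location would overwrite the original files.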

Okay, so that covers the archive part of the equation, but what about the compression part? Well, as I mentioned earlier, the following three sections will cover creating and extracting tar.gz, tar.xz, and tar.bz2 compressed archives. So, let's dive into creating and extracting tar.gz files.

How to create and extract a tar.gz archive

The starting point for this example will be our archive directory and some sample files. The command to create a tar.gz compressed archive looks like this.

tar --create --gzip --preserve-permissions --verbose --file archive.tar.gz archive/

We actually use almost the exact same command for creating a tar.gz file as we do for creating a standalone tar archive, just with one additional switch. You will notice that I am using this new gzip option in conjunction with the create option. What this is actually doing is creating a tarball and then compressing it with gzip. The much shorter version of the command looks like this.

tar czpvf archive.tar.gz archive/

So we are creating the tarball, using gzip compression, preserving permissions, being verbose, and specifying the file name we want to create, along with the folder to archive.

To reverse the process you can use either of the following commands. Basically, we are just telling tar to extract the contents like it would any typical tarball, but we are also providing the gzip option, which tells it to decompress the archive first.

tar --extract --gzip --verbose --file archive.tar.gz

The shorthand version of the command looks like this.

tar xzvf archive.tar.gz

Again, we extract the archive contents, tell tar that it is a gzipped archive, be verbose about what is going on, and use this file name as the source.
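As a rough sketch of what the gzip step buys you, the following compares a plain tarball against its gzipped counterpart and checks that the compressed file is intact. The sample data is a made-up, highly repetitive text file, so the savings here will look better than on real-world data.

```shell
#!/bin/sh
set -eu

# Throwaway working directory with a hypothetical, very compressible file.
workdir=$(mktemp -d)
cd "$workdir"
mkdir archive
i=0
while [ "$i" -lt 2000 ]; do
  echo "the quick brown fox jumps over the lazy dog"
  i=$((i + 1))
done > archive/sample.txt

# Same directory archived twice: once plain, once with gzip compression.
tar cpf archive.tar archive/
tar czpf archive.tar.gz archive/

# gzip -t exits non-zero if the compressed stream is damaged.
gzip -t archive.tar.gz

# List the contents of the compressed archive without extracting it.
tar tzf archive.tar.gz

# Compare sizes in bytes; $(( ... )) strips any padding wc may add.
plain_size=$(($(wc -c < archive.tar)))
gz_size=$(($(wc -c < archive.tar.gz)))
echo "archive.tar:    $plain_size bytes"
echo "archive.tar.gz: $gz_size bytes"
```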

How to create and extract a tar.xz archive

Next, let's look at creating and extracting tar.xz compressed archives. We are going to use almost the exact same command as for creating tar.gz files, but swap out the gzip compression option for xz. So, the command looks like this.

tar --create --xz --preserve-permissions --verbose --file archive.tar.xz archive/

Easy enough, right? You will also notice that the command took a little longer to complete, and when we list the directory contents, you can see that xz compression yields a smaller file than gzip in this case. The shorthand version is a little less intuitive, since you need to remember the letter that corresponds to the compression type. After a while it will become muscle memory, though. The key thing to remember is the uppercase J, which indicates we are using xz compression.

tar cJpvf archive.tar.xz archive/

You can probably guess what the extraction command looks like. The long version uses the extract option along with the xz option to indicate this is a compressed archive.

tar --extract --xz --verbose --file archive.tar.xz

Again, the shorthand version is a little less intuitive. This is why the man pages are so helpful, in that you can quickly look up these options if needed.

tar xJvf archive.tar.xz

How to create and extract a tar.bz2 archive

Finally, let's look at creating and extracting tar.bz2 compressed archives. Personally, in my day-to-day work, I find that tar.gz and tar.bz2 compressed archives are the most common. So, I think it is fitting that we are finishing off this episode looking at bz2 archives.

You can probably already guess that the command looks almost identical to the one used for creating the tar.gz and tar.xz compressed archives. The only difference is that we are using the bzip2 option.

tar --create --bzip2 --preserve-permissions --verbose --file archive.tar.bz2 archive/ 

You will notice that each of these compression formats yields various output file sizes. These are highly dependent on your workload, so you should experiment if you are looking for optimal file sizes.
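If you want to run that kind of experiment yourself, a sketch along these lines could be a starting point. The sample data is hypothetical and very repetitive, so the numbers only illustrate the method and say nothing about how the formats rank on your own files. The loop assumes that the gzip, xz, and bzip2 tools are what tar calls for each option, and it skips any that are not installed.

```shell
#!/bin/sh
set -eu

# Throwaway working directory with hypothetical sample data.
workdir=$(mktemp -d)
cd "$workdir"
mkdir archive
i=0
while [ "$i" -lt 5000 ]; do
  echo "line $i: the quick brown fox jumps over the lazy dog"
  i=$((i + 1))
done > archive/sample.txt

# Uncompressed baseline.
tar --create --file archive.tar archive/

# Each entry is "tool:extension"; the tool name doubles as the tar option.
for spec in gzip:gz xz:xz bzip2:bz2; do
  tool=${spec%%:*}
  ext=${spec##*:}
  if command -v "$tool" >/dev/null 2>&1; then
    tar --create "--$tool" --file "archive.tar.$ext" archive/
  else
    echo "skipping $tool: not installed"
  fi
done

# Print the size in bytes of every archive that was created.
for f in archive.tar archive.tar.gz archive.tar.xz archive.tar.bz2; do
  if [ -f "$f" ]; then
    printf '%10d  %s\n' "$(($(wc -c < "$f")))" "$f"
  fi
done
```

To get numbers that mean something, point the tar commands at a directory that resembles your real workload instead of the generated sample file.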

The shorthand version of this command looks like this.

tar cjpvf archive.tar.bz2 archive/

In this case, to specify bzip2 compression, we are using the lowercase j. The man pages have full details about all the options, but you will find that as you interact with these formats, you will start to remember the options, or at least know where to find the information.

And finally, let's look at extracting a tar.bz2 archive.

tar --extract --bzip2 --verbose --file archive.tar.bz2 

You should start to notice a theme here. We have used almost the exact same commands throughout this episode, while just swapping out the compression switches.

And to close off, here is the shorthand version of how to extract a tar.bz2 compressed archive.

tar xjvf archive.tar.bz2

Before I end this episode, I should mention that I have also listed the Wikipedia pages for these various formats in the episode notes below, since they are also packed with useful information.