FreeXL  1.0.6
About the .xls binary format

What a .xls binary file really is

(Prepare yourself to be continuously surprised by many unexpected revelations ...)

You may already know that there are many different versions of .xls files. Different versions have different capabilities. So we'll start by reviewing the Excel evolutionary history and and we'll introduce some Microsoft jargon because it's central to understanding the underlying operations.

CFBF

Unexpected Revelation #1: There is no .xls file format. Its really a common file suffix applied to many different things.

Recent Microsoft Office document files are based on a common container layout named CFBF (Compound File Binary Format). This container format is the same for Excel (.xls), Word (.doc_ and PowerPoint (.ppt) amongst other applications. More information:

Unexpected Revelation #2: CFBF is more of a file system than a file format

A CFBF file is divided into many equal-sized blocks named sectors. Such sectors cannot be directly accessed. In order to retrieve sectors in the expected logical order a FAT (File Allocation Table) is allocated within the CFBF file. A CFBF file is internally organized as if it was a raw physical disk. The design is based on Microsoft own FAT file-system as used by MS-DOS and early versions of Windows. The first sector of the CFBF file acts as if it was a kind-of MBR (Master Boot Record) - this first sector provides information about the layout and type of the CFBF file, such as block/sector size and version. A FAT chain allows a reader to re-assemble the sectors in the required logical order. There is a list of free block, and very large files may use a double indirection (DIFAT - Double Indirection FAT). A CFBF file always has at least a root directory: but a complete directory tree can be provided.

A CFBF file can contain many and many distinct independent files. Just to make things a little clearer, Microsoft calls such pseudo-files (I mean: the many fake ones contained within the real CFBF file) streams

The practical consequence is that any software tool attempting to access an Excel binary document must first be able to correctly access this CFBF container format.

BIFF

Unexpected Revelation #3: An Excel document will contain a stream (pseudo-file) named Workbook in the root directory of the CFBF file (filesystem).

The Workbook stream is internally structured accordingly to the BIFF (Binary Interchange File Format) specifications. You can think of the BIFF as the real Excel binary format (following more conventional naming rules). Several BIFF versions were introduced during the years: and there are significant differences between them.

An useful correspondence table relating corresponding Excel and BIFF versions:

Excel Version Commercial Name BIFF Version Release Year Notes
2.x Excel 2.0 BIFF2 1987 Before CFBF. File is the BIFF stream, containing a single worksheet.
3.0 Excel 3.0 BIFF3 1990 Before CFBF. File is the BIFF stream, containing a single worksheet.
4.0 Excel 4.0 BIFF4 1992 Before CFBF. File is the BIFF stream, containing a single worksheet.
5.0 Excel 5.0 BIFF5 1993 Starting with BIFF5, a single Workbook can internally store many individual Worksheets. The BIFF stream is stored in the CFBF file container.
7.0 Excel 95 BIFF5 1995
8.0 Excel 98 BIFF8 1998
9.0 Excel 2000 BIFF8 1999
10.0 Excel XP BIFF8 2001
11.0 Excel 2003 BIFF8 2003
12.0 Excel 2007 BIFF8 2007 Introduced alternate XML format, which is usually the default for new files.
14.0 Excel 2010 BIFF8 2010 XML format is usually the default for new files.

Note that FreeXL does not support the new XML format which is a completely different and unrelated format.

Perhaps you are now expecting that BIFF will simply and directly encode your spreadsheet data. Unfortunately, you should have know better given the steps we took to get here...

Unexpected Revelation #4: Any BIFF stream (pseudo-file stored within a CFBF container file) is internally organized as a collection of variable-length records..

Each record starts with

  • a 16 bit unsigned integer specifying the record type
  • another 16 bit unsigned integer specifies the record data length (in bytes) excluding the standard type-size prefix.

Note that there are many different record types, and the record size / layout may differ for different BIFF versions.

Three record types have an absolutely special meaning:

  • a BOF [Beginning Of File] record marks starting of a different sub-stream.
  • an EOF [End Of File] record marks ending of current sub-stream.
  • a CONTINUE record means that the previous record exceeded the maximum size for a record, and the previous record data payload will be spanned on following CONTINUE records for as many CONTINUE records as are required to store the full data size.

Unexpected Revelation #5: So a BIFF stream (pseudo-file) isn't really a file - it's more like a collection of individual sub-streams, each one of which is enclosed between BOF / EOF markers.

The most recent BIFF8 requires that at least the following internal sub-streams are be defined:

  • the first sub-stream contains workbook level global data and metadata, such as author, password protection, styles, formats, window settings and so on
  • list of individual worksheets included into the Workbook, where each worksheet is identified by a name and by a type (data Worksheet, Chart, Visual Basic module ... visible, hidden ...), and relative offset position of the corresponding BOF record allows for fast positioning.
  • any text string is stored here into the SST [Shared String Table], so individual text cells simply refer the corresponding SST entry by index (in all previous BIFF version text strings are directly stored into the appropriate cell).
  • any subsequent sub-stream represents a single Worksheet, and the most relevant data stored at Worksheet level are dimension (number of valid rows and columns) and any cell value data.

We will now see how BIFF encodes individual data types with several further amazing surprises are still to come. Be prepared!

BIFF

Leaving aside special values such as images, OLE, COM, Visual Basic related items and so on, the basic data types are supported in BIFF:

  • text strings
  • numbers (both integers and decimals)
  • dates, date-times and times
  • NULL (empty cell)

Note that any multi-byte value is stored in BIFF accordingly using Little Endian byte ordering (i.e. least significant byte comes first, most significant byte comes last).

BIFF Record Type BIFF Version Content Type
INTEGER BIFF2 16 bit unsigned integer
NUMBER BIFF2 BIFF3 BIFF4 BIFF5 BIFF8 64 bit floating point (double precision)
RK BIFF3 BIFF4 BIFF5 BIFF8 number, variant-type: INTEGER FLOAT DATE DATETIME TIME (please see the corresponding detailed description)
MULRK BIFF5 BIFF8 a variable-sized array of elementary RK values. associated to a range of consecutive cells on the same row
LABEL BIFF2 BIFF3 BIFF4 BIFF5 BIFF8 text string, variable-length. (please see the corresponding detailed description)
LABELSST BIFF8 text string, variable-length. based on the global SST [Shared String Table] stored at the workbook level, so a LABELSST simply requires providing the corresponding SST index.

So the BIFF record type that is easy to handle is NUMBER, which is essentially a C-style double. Other record types require additional handling.

RK values

An RK value is a 32 bit value.

The least significant two bits are a bit-mask (in little endian order, so the least two significant bits in the first byte that is read):

  • if 0x02 is set the RK value represents a 30 bit signed integer, otherwise it represents a 64 bit floating point double precision number requiring special reconstruction.
  • if 0x01 is set the corresponding value needs to be divided by 100, so even an integer actually becomes a floating point double precision.

When interpreting RK values as a signed integer, right shifting two bits is required:

int value = rk_value >> 2;

When interpreting RK values as a 64 bit floating point, two steps are required:

  • the RK value requires appropriate masking:
    int value = rk_value & 0xfffffffc;
  • then the 32 bit value will be copied into a 64 bit buffer, and the least significant four bytes need to be initialized as zeroes: 0x00000000.

As a final step, if 0x01 was set into the bit-mask, now we have to divide by 100 before returning the effective cell value. So for 32 bit integers:

double final_value = (double)value / 100.0;

and for 64 bit floats:

double final_value = value / 100.0;

Text values

Any BIFF version from BIFF2 to BIFF5 simply supports CodePage based character encoding, i.e. each character simply requires 8 bits to be represented (single byte). Correct representation of characters requires knowing which one CodePage table has to be applied. This can be determined from the workbook or worksheet metadata (it is the CODEPAGE record).

BIFF8 is much more sophisticated, since any text string is usually encoded as Unicode in UTF-16 Little Endian [UTF-16LE] format. This encoding is a multi-byte encoding (two bytes are required to represent a single character), but being universal no character table is required.

BIFF text strings are never null-terminated. The actual length is always explicitly stated, as an 8 bit unsigned int or as a 16 bit unsigned int (depending on BIFF versions).

FreeXL is intended to be strictly interoperable with SQLite and SpatiaLite, so any text string has to be converted to UTF-8 encoding. GNU libiconv can easily handle any required charset conversion. So we can simply fetch the appropriate bytes, then call iconv() as appropriate, and we'll immediately get back the corresponding UTF-8 encoded text string.

Converting Unicode based text strings is a little more complex, because each Unicode string is prefixed by a mask byte, specifying how the string is encoded:

  • if 0x01 is set, then the string really is 16 bit per character Unicode, otherwise a stripped notation is used instead. Stripped notation means that the characters are actually represented as single bytes, so already have the UTF8 equivalent.
  • if 0x04 and/or 0x08 are set, than some further variable-length data (providing information on text decoration such as italics, bold, underline) is inserted immediately before and after the text string itself, so we must carefully skip over this extra data so to maintain the right byte alignment.

Note that the string length is expressed in characters, not in bytes, so the actual length in bytes is twice the indicated length.

Retrieving Date, DateTime and Time values.

Dates, DateTimes and Time values are also a little complicated. Any Date is expressed as an Integer (number of days since the conventional reference day):

  • for Windows Excel the reference day (day 0) is 1900, January 1
  • for Mac Excel the reference day (day 0) is 1904, January 2

There is no possible ambiguity, because the DATEMODE metadata record specifies tells which reference day is to be used.

An odd bug affects Excel, which (incorrectly) treats 1900 as a leap year. Therefore, the non-existent 29 February 1900 has to be included in the days calculation so to get the expected Date value.

Any Time is expressed as a Fraction (percent of seconds since midnight). 0.5 corresponds to 12:00:00 (midday), 0.25 corresponds to 06:00:00, 0.75 corresponds to 18:00:00 and so on.

So a DateTime is simply the sum of a Date value and of a Time value. Dates can be represented by Integers: but Times and DateTimes require a floating point number.

The complication with Dates, DateTimes and Time values is that the data-type does not specify when a cell values has to be interpreted as a Date or Time - it is simply an Integer or Float numbers like any other. A further indirection has to applied so to correctly recognize Dates, DateTimes and Times:

  • each NUMBER, RK or MULRK value exposes an index referencing the XF (Extended Format) entry associated with the corresponding cell.
  • each XF record specifies an unique combination of font, alignment, color and so on, however a further indirection specifies the corresponding FORMAT entry
  • each FORMAT record specifies an output format, such as M/D/YY, h:mm:ss AM/PM or M/D/YY h:mm: and this finally gives us a good chance to guess which cell values are intended to represent Date/Time values.

Both XF and FORMAT records are globally stored at the Workbook level, and represent ordered arrays.

If you haven't yet given up, if you aren't yet become totally mind-boggled, and if you are still awake and conscious, then you now know how .xls files are internally organized and structured.

Be happy and feel proud of yourself.