FreeXL
1.0.6
|
What a .xls binary file really is
(Prepare yourself to be continuously surprised by many unexpected revelations ...)
You may already know that there are many different versions of .xls files. Different versions have different capabilities. So we'll start by reviewing the Excel evolutionary history and and we'll introduce some Microsoft jargon because it's central to understanding the underlying operations.
Unexpected Revelation #1: There is no .xls file format. Its really a common file suffix applied to many different things.
Recent Microsoft Office document files are based on a common container layout named CFBF (Compound File Binary Format). This container format is the same for Excel (.xls), Word (.doc_ and PowerPoint (.ppt) amongst other applications. More information:
Unexpected Revelation #2: CFBF is more of a file system than a file format
A CFBF file is divided into many equal-sized blocks named sectors. Such sectors cannot be directly accessed. In order to retrieve sectors in the expected logical order a FAT (File Allocation Table) is allocated within the CFBF file. A CFBF file is internally organized as if it was a raw physical disk. The design is based on Microsoft own FAT file-system as used by MS-DOS and early versions of Windows. The first sector of the CFBF file acts as if it was a kind-of MBR (Master Boot Record) - this first sector provides information about the layout and type of the CFBF file, such as block/sector size and version. A FAT chain allows a reader to re-assemble the sectors in the required logical order. There is a list of free block, and very large files may use a double indirection (DIFAT - Double Indirection FAT). A CFBF file always has at least a root directory: but a complete directory tree can be provided.
A CFBF file can contain many and many distinct independent files. Just to make things a little clearer, Microsoft calls such pseudo-files (I mean: the many fake ones contained within the real CFBF file) streams
The practical consequence is that any software tool attempting to access an Excel binary document must first be able to correctly access this CFBF container format.
Unexpected Revelation #3: An Excel document will contain a stream (pseudo-file) named Workbook in the root directory of the CFBF file (filesystem).
The Workbook stream is internally structured accordingly to the BIFF (Binary Interchange File Format) specifications. You can think of the BIFF as the real Excel binary format (following more conventional naming rules). Several BIFF versions were introduced during the years: and there are significant differences between them.
An useful correspondence table relating corresponding Excel and BIFF versions:
Excel Version | Commercial Name | BIFF Version | Release Year | Notes |
---|---|---|---|---|
2.x | Excel 2.0 | BIFF2 | 1987 | Before CFBF. File is the BIFF stream, containing a single worksheet. |
3.0 | Excel 3.0 | BIFF3 | 1990 | Before CFBF. File is the BIFF stream, containing a single worksheet. |
4.0 | Excel 4.0 | BIFF4 | 1992 | Before CFBF. File is the BIFF stream, containing a single worksheet. |
5.0 | Excel 5.0 | BIFF5 | 1993 | Starting with BIFF5, a single Workbook can internally store many individual Worksheets. The BIFF stream is stored in the CFBF file container. |
7.0 | Excel 95 | BIFF5 | 1995 | |
8.0 | Excel 98 | BIFF8 | 1998 | |
9.0 | Excel 2000 | BIFF8 | 1999 | |
10.0 | Excel XP | BIFF8 | 2001 | |
11.0 | Excel 2003 | BIFF8 | 2003 | |
12.0 | Excel 2007 | BIFF8 | 2007 | Introduced alternate XML format, which is usually the default for new files. |
14.0 | Excel 2010 | BIFF8 | 2010 | XML format is usually the default for new files. |
Note that FreeXL does not support the new XML format which is a completely different and unrelated format.
Perhaps you are now expecting that BIFF will simply and directly encode your spreadsheet data. Unfortunately, you should have know better given the steps we took to get here...
Unexpected Revelation #4: Any BIFF stream (pseudo-file stored within a CFBF container file) is internally organized as a collection of variable-length records..
Each record starts with
Note that there are many different record types, and the record size / layout may differ for different BIFF versions.
Three record types have an absolutely special meaning:
Unexpected Revelation #5: So a BIFF stream (pseudo-file) isn't really a file - it's more like a collection of individual sub-streams, each one of which is enclosed between BOF / EOF markers.
The most recent BIFF8 requires that at least the following internal sub-streams are be defined:
We will now see how BIFF encodes individual data types with several further amazing surprises are still to come. Be prepared!
Leaving aside special values such as images, OLE, COM, Visual Basic related items and so on, the basic data types are supported in BIFF:
Note that any multi-byte value is stored in BIFF accordingly using Little Endian byte ordering (i.e. least significant byte comes first, most significant byte comes last).
BIFF Record Type | BIFF Version | Content Type |
---|---|---|
INTEGER | BIFF2 | 16 bit unsigned integer |
NUMBER | BIFF2 BIFF3 BIFF4 BIFF5 BIFF8 | 64 bit floating point (double precision) |
RK | BIFF3 BIFF4 BIFF5 BIFF8 | number, variant-type: INTEGER FLOAT DATE DATETIME TIME (please see the corresponding detailed description) |
MULRK | BIFF5 BIFF8 | a variable-sized array of elementary RK values. associated to a range of consecutive cells on the same row |
LABEL | BIFF2 BIFF3 BIFF4 BIFF5 BIFF8 | text string, variable-length. (please see the corresponding detailed description) |
LABELSST | BIFF8 | text string, variable-length. based on the global SST [Shared String Table] stored at the workbook level, so a LABELSST simply requires providing the corresponding SST index. |
So the BIFF record type that is easy to handle is NUMBER, which is essentially a C-style double. Other record types require additional handling.
An RK value is a 32 bit value.
The least significant two bits are a bit-mask (in little endian order, so the least two significant bits in the first byte that is read):
When interpreting RK values as a signed integer, right shifting two bits is required:
When interpreting RK values as a 64 bit floating point, two steps are required:
As a final step, if 0x01 was set into the bit-mask, now we have to divide by 100 before returning the effective cell value. So for 32 bit integers:
and for 64 bit floats:
Any BIFF version from BIFF2 to BIFF5 simply supports CodePage based character encoding, i.e. each character simply requires 8 bits to be represented (single byte). Correct representation of characters requires knowing which one CodePage table has to be applied. This can be determined from the workbook or worksheet metadata (it is the CODEPAGE record).
BIFF8 is much more sophisticated, since any text string is usually encoded as Unicode in UTF-16 Little Endian [UTF-16LE] format. This encoding is a multi-byte encoding (two bytes are required to represent a single character), but being universal no character table is required.
BIFF text strings are never null-terminated. The actual length is always explicitly stated, as an 8 bit unsigned int or as a 16 bit unsigned int (depending on BIFF versions).
FreeXL is intended to be strictly interoperable with SQLite and SpatiaLite, so any text string has to be converted to UTF-8 encoding. GNU libiconv can easily handle any required charset conversion. So we can simply fetch the appropriate bytes, then call iconv() as appropriate, and we'll immediately get back the corresponding UTF-8 encoded text string.
Converting Unicode based text strings is a little more complex, because each Unicode string is prefixed by a mask byte, specifying how the string is encoded:
Note that the string length is expressed in characters, not in bytes, so the actual length in bytes is twice the indicated length.
Dates, DateTimes and Time values are also a little complicated. Any Date is expressed as an Integer (number of days since the conventional reference day):
There is no possible ambiguity, because the DATEMODE metadata record specifies tells which reference day is to be used.
An odd bug affects Excel, which (incorrectly) treats 1900 as a leap year. Therefore, the non-existent 29 February 1900 has to be included in the days calculation so to get the expected Date value.
Any Time is expressed as a Fraction (percent of seconds since midnight). 0.5 corresponds to 12:00:00 (midday), 0.25 corresponds to 06:00:00, 0.75 corresponds to 18:00:00 and so on.
So a DateTime is simply the sum of a Date value and of a Time value. Dates can be represented by Integers: but Times and DateTimes require a floating point number.
The complication with Dates, DateTimes and Time values is that the data-type does not specify when a cell values has to be interpreted as a Date or Time - it is simply an Integer or Float numbers like any other. A further indirection has to applied so to correctly recognize Dates, DateTimes and Times:
Both XF and FORMAT records are globally stored at the Workbook level, and represent ordered arrays.
If you haven't yet given up, if you aren't yet become totally mind-boggled, and if you are still awake and conscious, then you now know how .xls files are internally organized and structured.
Be happy and feel proud of yourself.