\documentclass[a4paper]{article} % generated by Docutils \usepackage{cmap} % fix search and cut-and-paste in Acrobat \usepackage{ifthen} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{alltt} \setcounter{secnumdepth}{0} \usepackage{longtable,ltcaption,array} \setlength{\extrarowheight}{2pt} \newlength{\DUtablewidth} % internal use in tables %%% Custom LaTeX preamble % PDF Standard Fonts \usepackage{mathptmx} % Times \usepackage[scaled=.90]{helvet} \usepackage{courier} %%% User specified packages and stylesheets \usepackage{fullpage} \usepackage{Sweave} %\VignetteIndexEntry{sas7bdat} %%% Fallback definitions for Docutils-specific commands % Provide a length variable and set default, if it is new \providecommand*{\DUprovidelength}[2]{ \ifthenelse{\isundefined{#1}}{\newlength{#1}\setlength{#1}{#2}}{} } % line block environment \DUprovidelength{\DUlineblockindent}{2.5em} \ifthenelse{\isundefined{\DUlineblock}}{ \newenvironment{DUlineblock}[1]{% \list{}{\setlength{\partopsep}{\parskip} \addtolength{\partopsep}{\baselineskip} \setlength{\topsep}{0pt} \setlength{\itemsep}{0.15\baselineskip} \setlength{\parsep}{0pt} \setlength{\leftmargin}{#1}} \raggedright } {\endlist} }{} % hyperlinks: \ifthenelse{\isundefined{\hypersetup}}{ \usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue]{hyperref} \usepackage{bookmark} \urlstyle{same} % normal text font (alternatives: tt, rm, sf) }{} \hypersetup{ pdftitle={SAS7BDAT Database Binary Format}, } %%% Body \begin{document} \title{SAS7BDAT Database Binary Format% \label{sas7bdat-database-binary-format}} \author{} \date{} \maketitle by: \begin{quote} \begin{DUlineblock}{0em} \item[] Matthew S. Shotwell, PhD \item[] Assistant Professor \item[] Department of Biostatistics \item[] Vanderbilt University \item[] \href{mailto:matt.shotwell@vanderbilt.edu}{matt.shotwell@vanderbilt.edu} \end{DUlineblock} \end{quote} 1/9/2013 update (\textbf{u64} format extensions, Row Size fields, and RLE compression) by: \begin{quote} \begin{DUlineblock}{0em} \item[] Clint Cummins, PhD \item[] \href{mailto:clint@stanford.edu}{clint@stanford.edu} \end{DUlineblock} \end{quote} Copyright (C) 2013 is retained by the authors listed above. This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit \url{http://creativecommons.org/licenses/by-sa/3.0/}. \section{Contents% \label{contents}% } \begin{itemize} \item \hyperref[introduction]{Introduction} \item \hyperref[sas7bdat-header]{SAS7BDAT Header} \item \hyperref[sas7bdat-pages]{SAS7BDAT Pages} \item \hyperref[sas7bdat-subheaders]{SAS7BDAT Subheaders} \item \hyperref[sas7bdat-packed-binary-data]{SAS7BDAT Packed Binary Data} \item \hyperref[platform-differences]{Platform Differences} \item \hyperref[compression-data]{Compression Data} \item \hyperref[software-prototype]{Software Prototype} \item \hyperref[todo]{ToDo} \end{itemize} \section{Introduction% \label{introduction}% } The SAS7BDAT file is a binary database storage file. At the time of this writing, no description of the SAS7BDAT file format was publicly available. Hence, users who wish to read and manipulate these files were required to obtain a license for the SAS software, or third party software with support for SAS7BDAT files. The purpose of this document is to promote interoperability between SAS and other popular statistical software packages, especially R (\url{http://www.r-project.org/}). The information below was deduced by examining the contents of many SAS7BDAT databases downloaded freely from internet resources (see \texttt{data/sas7bdat.sources.RData}). No guarantee is made regarding its accuracy. No SAS software, nor any other software requiring the purchase of a license was used. SAS7BDAT files consist of binary encoded data. Data files encoded in this format often have the extension '.sas7bdat'. The name 'SAS7BDAT' is not official, but is used throughout this document to refer to SAS database files formatted according to the descriptions below. There are significant differences in the SAS7BDAT format depending on the operating systems and computer hardware platforms (32bit vs. 64bit). See the section on \hyperref[platform-differences]{platform differences} for more details. The format described below is sufficient to read the entire collection of test files referenced in \texttt{data/sas7bdat.sources.RData} (i.e. files associated with 32bit and some 64bit builds of SAS for Microsoft Windows, and \textbf{u64} SAS versions). This includes files created with COMPRESS=CHAR. The format described here is probably not sufficient to \textbf{write} SAS7BDAT format files, due to lingering uncertainties. The figure below illustrates the overall structure of the SAS7BDAT database. Each file consists of a header (length := HL bytes), followed by PC pages, each of length PL bytes (PC and PL are shorthand for 'page count' and 'page size' respectively, and are used to denote these quantities throughout this document).: \begin{quote} \begin{alltt} -{}-{}-{}-{}-{}-{}-{}-{}-{}- | HL | header -{}-{}-{}-{}-{}-{}-{}-{}-{}- | PL | page 1 -{}-{}-{}-{}-{}-{}-{}-{}-{}- | PL | page 2 -{}-{}-{}-{}-{}-{}-{}-{}-{}- ... -{}-{}-{}-{}-{}-{}-{}-{}-{}- | PL | page PC -{}-{}-{}-{}-{}-{}-{}-{}-{}- \end{alltt} \end{quote} Throughout this document, hexadecimal digits are denoted with a preceding 'x', binary digits with a preceding 'b', and decimal digits with no preceding character. For example, see the below \hyperref[table-of-hexadecimal-decimal-and-binary-values]{table of hexadecimal, decimal, and binary values}. \section{SAS7BDAT Header% \label{sas7bdat-header}% } The SAS7BDAT file header contains a binary file identifier (\emph{i.e.}, a magic number), the dataset name, timestamp, the number pages (PC), their size (PL) and a variety of other values that pertain to the database as a whole. The purpose of many header fields remain unknown, but are likely to include specifications for data compression and encryption, password protection, and dates/times of creation and/or modification. Most files encountered encode multi-byte values little-endian (least significant byte first). However, some files have big-endian values. Hence, it appears that multi-byte values are encoded using endianness of the platform where the file was written. See \hyperref[platform-differences]{Platform Differences} for a table of key test files which differ in several ways. The \emph{offset table} below describes the SAS7BDAT file header as a sequence of bytes. Information stored in the table is indexed by its byte offset (first column) in the header and its length (second column) in bytes. For example, the field at offset 0 has length 32 bytes. Hence, bytes 0,1,...,31 comprise the data for this field. Byte lengths having the form '\%n' should read: 'the number of bytes remaining up to, but not including byte n'. The fourth column gives a shorthand description of the data contained at the corresponding offset. For example, 'int, page size := PL' indicates that the data stored at the corresponding location is a signed integer representing the page size, which we denote PL. The description \emph{????????????} indicates that the meaning of data stored at the corresponding offset is unknown. The third column represents the author's confidence (low, medium, high) in the corresponding offset, length, and description. Each offset table in this document is formatted in a similar fashion. Variables defined in an offset table are sometimes used in subsequent tables. \subsection{Header Offset Table% \label{header-offset-table}% } \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.114\DUtablewidth}|p{0.056\DUtablewidth}|p{0.056\DUtablewidth}|p{0.724\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 32 & high & binary, \hyperref[magic-number]{magic number} \\ \hline 32 & 1 & high & binary, \hyperref[alignment]{Alignment}: if (byte==x33) a2=4 else a2=0 . \textbf{u64} is true if a2=4 (unix 64 bit format). \\ \hline 33 & 2 & low & \emph{????????????} \\ \hline 35 & 1 & high & binary, \hyperref[alignment]{Alignment} if (byte==x33) a1=4 else a1=0 \\ \hline 36 & 1 & low & \emph{????????????} \\ \hline 37 & 1 & high & int, endianness (x01-little {[}Intel{]} x00-big) \\ \hline 38 & 1 & low & \emph{????????????} \\ \hline 39 & 1 & medium & ascii, OS type (1-UNIX or 2-WIN). Does not affect format except for the OS strings. \\ \hline 40 & 8 & low & \emph{????????????} \\ \hline 48 & 8 & low & \emph{????????????} \\ \hline 56 & 8 & low & repeat of 32:32+8 \\ \hline 64 & 6 & low & \emph{????????????} \\ \hline 70 & 2 & low & int, \hyperref[character-encoding]{Character Encoding} \\ \hline 72 & 12 & low & \emph{????????????} \\ \hline 84 & 8 & high & ascii 'SAS FILE' \\ \hline 92 & 64 & high & ascii, dataset name \\ \hline 156 & 8 & medium & ascii, file type, e.g. \texttt{'DATA ~ ~'} \\ \hline 164 & a1 & medium & zero padding when a1=4 . Aligns the double timestamps below on double word boundaries. \\ \hline 164+a1 & 8 & high & double, timestamp, date created, secs since 1/1/60 (for SAS version 8.x and higher) \\ \hline 172+a1 & 8 & high & double, timestamp, date modified, secs since 1/1/60 (for SAS version 8.x and higher) \\ \hline 180+a1 & 16 & low & \emph{????????????} \\ \hline 196+a1 & 4 & high & int, length of SAS7BDAT header := HL \\ \hline 200+a1 & 4 & high & int, page size := % \phantomsection\label{pl}PL \\ \hline 204+a1 & 4+a2 & high & int, page count := PC . Length 4 or 8 (\textbf{u64}), henceforth denoted \textbf{4|8} \\ \hline 208+a1+a2 & 8 & low & \emph{????????????} \\ \hline 216+a1+a2 & 8 & high & ascii, SAS release (e.g. 9.0101M3 ) \\ \hline 224+a1+a2 & 16 & high & ascii, host (SAS server type, longest observed string has 9 bytes) \\ \hline 240+a1+a2 & 16 & high & ascii, OS version number (for UNIX, else null) \\ \hline 256+a1+a2 & 16 & high & ascii, OS maker or version (SUN, IBM, sometimes WIN) \\ \hline 272+a1+a2 & 16 & high & ascii, OS name (for UNIX, else null) \\ \hline 288+a1+a2 & 32 & low & \emph{????????????} \\ \hline 320+a1+a2 & 4 & low & int, page sequence signature? (value is close to the value at start of each Page Offset Table) \\ \hline 324+a1+a2 & 4 & low & \emph{????????????} \\ \hline 328+a1+a2 & 8 & medium & double, 3rd timestamp, sometimes zero \\ \hline 336+a1+a2 & \%HL & medium & zeros \\ \hline 1024|8192 & & medium & Total length of header (8192 for \textbf{u64}), HL \\ \hline \end{longtable*} The 8 bytes beginning at offset 32 hold information which affects the offset of the 'release' and 'host' information. In particular: \begin{enumerate} \item The byte at offset 32 defines the \textbf{u64} (unix 64 bit) file format, which affects many field and header lengths (usually via 4 vs. 8 byte integers). \item The byte at offset 35 controls an offset before the timestamps. \item The byte at offset 37 defines byte ordering of ints and doubles (most test files were created on Windows and use Intel byte ordering; little endian). \item The byte at offset 39 appears to distinguish the OS type, where '1' indicates that the file was generated on a UNIX-like system, such as Linux or SunOS, and '2' indicates the file was generated on a Microsoft Windows platform. However, this does not affect any important fields in the file format. \end{enumerate} The following table describes some of the possible polymorphisms for the 8 bytes at offset 32. The first field lists the name of the file where the sequence was found (see \texttt{data/sas7bdat.sources.RData}), the second lists the eight byte values (hexadecimal), the third field shows bytes 216-239 in ASCII ('.' represents a non-ASCII character or '0'), and the fourth field lists the SAS7BDAT sub-format. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.229\DUtablewidth}|p{0.294\DUtablewidth}|p{0.237\DUtablewidth}|p{0.189\DUtablewidth}|} \hline \textbf{% filename } & \textbf{% bytes 32-39 } & \textbf{% bytes 216-239 } & \textbf{% format } \\ \hline \endfirsthead \hline \textbf{% filename } & \textbf{% bytes 32-39 } & \textbf{% bytes 216-239 } & \textbf{% format } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot \texttt{compress\_no.sas7bdat} & \texttt{x22 x22 x00 x32 x22 x01 x02 x32} & \texttt{9.0101M3NET\_ASRV........} & Windows Intel \\ \hline \texttt{compress\_yes.sas7bdat} & \texttt{x22 x22 x00 x32 x22 x01 x02 x32} & \texttt{9.0101M3NET\_ASRV........} & Windows Intel \\ \hline \texttt{lowbwt\_i386.sas7bdat} & \texttt{x22 x22 x00 x32 x22 x01 x02 x32} & \texttt{9.0202M0W32\_VSPRO.......} & Windows Intel \\ \hline \texttt{missing\_values.sas7bdat} & \texttt{x22 x22 x00 x32 x22 x01 x02 x32} & \texttt{9.0202M0W32\_VSPRO.......} & Windows Intel \\ \hline \texttt{obs\_all\_perf\_1.sas7bdat} & \texttt{x22 x22 x00 x32 x22 x01 x02 x32} & \texttt{9.0101M3XP\_PRO..........} & Windows Intel \\ \hline \texttt{adsl.sas7bdat} & \texttt{x22 x22 x00 x33 x33 x01 x02 x32} & \texttt{....9.0202M3X64\_ESRV....} & Windows x64 Intel \\ \hline \texttt{eyecarex.sas7bdat} & \texttt{x22 x22 x00 x33 x22 x00 x02 x31} & \texttt{....9.0000M0WIN.........} & Unix non-Intel \\ \hline \texttt{lowbwt\_x64.sas7bdat} & \texttt{x22 x22 x00 x33 x33 x01 x02 x32} & \texttt{....9.0202M2X64\_VSPRO...} & Windows x64 Intel \\ \hline \texttt{natlterr1994.sas7bdat} & \texttt{x33 x22 x00 x33 x33 x00 x02 x31} & \texttt{........9.0101M3SunOS...} & u64 Unix non-Intel \\ \hline \texttt{natlterr2006.sas7bdat} & \texttt{x33 x22 x00 x33 x33 x00 x02 x31} & \texttt{........9.0101M3SunOS...} & u64 Unix non-Intel \\ \hline \texttt{txzips.sas7bdat} & \texttt{x33 x22 x00 x33 x33 x01 x02 x31} & \texttt{........9.0201M0Linux...} & u64 Unix Intel \\ \hline \end{longtable*} \phantomsection\label{table-of-hexadecimal-decimal-and-binary-values} The binary representation for the hexadecimal values present in the table above are given below. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.145\DUtablewidth}|p{0.098\DUtablewidth}|p{0.168\DUtablewidth}|} \hline \textbf{% hexadecimal } & \textbf{% decimal } & \textbf{% binary } \\ \hline \endfirsthead \hline \textbf{% hexadecimal } & \textbf{% decimal } & \textbf{% binary } \\ \hline \endhead \multicolumn{3}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot \texttt{x01} & \texttt{001} & \texttt{b00000001} \\ \hline \texttt{x02} & \texttt{002} & \texttt{b00000010} \\ \hline \texttt{x22} & \texttt{034} & \texttt{b00010010} \\ \hline \texttt{x31} & \texttt{049} & \texttt{b00011001} \\ \hline \texttt{x32} & \texttt{050} & \texttt{b00011010} \\ \hline \texttt{x33} & \texttt{051} & \texttt{b00011011} \\ \hline \end{longtable*} \subsubsection{Alignment% \label{alignment}% } In files generated by 64 bit builds of SAS, 'alignment' means that all data field offsets containing doubles or 8 byte ints should be a factor of 8 bytes. For files generated by 32 bit builds of SAS, the alignment is 4 bytes. Because \hyperref[sas7bdat-packed-binary-data]{SAS7BDAT Packed Binary Data} may contain double precision values, it appears that all data rows are 64 bit aligned, regardless of whether the file was written with a 32 bit or 64 bit build of SAS. Alignment of data structures according to the platform word length (4 bytes for 32 bit, and 8 bytes for 64 bit architectures) facilitates efficient operations on data stored in memory. It also suggests that parts of SAS7BDAT data file format are platform dependent. One theory is that the SAS implementation utilizes a common C or C++ structure or class to reference data stored in memory. When compiled, these structures are aligned according to the word length of the target platform. Of course, when SAS was originally written, platform differences may not have been forseeable. Hence, these inconsistencies may not have been intentional. \subsubsection{Magic Number% \label{magic-number}% } The SAS7BDAT magic number is the following 32 byte (hex) sequence.: \begin{quote} \begin{alltt} x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 xc2 xea x81 x60 xb3 x14 x11 xcf xbd x92 x08 x00 x09 xc7 x31 x8c x18 x1f x10 x11 \end{alltt} \end{quote} In all test files except one (not listed in \texttt{data/sas7bdat.sources.RData}), the magic number above holds. The one anomalous file has the following magic number: \begin{quote} \begin{alltt} x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x00 x18 x1f x10 x11 \end{alltt} \end{quote} In addition, the anomalous file is associated with the SAS release \textquotedbl{}3.2TK\textquotedbl{}. Indeed, this file may not have been written by SAS. Otherwise, the anomalous file appears to be formatted similarly to other test files. \subsubsection{Character Encoding% \label{character-encoding}% } The integer (one or two bytes) at header offset 70 (bytes) indicates the character encoding of string data. The table below lists the values that are known to occur and the associated character encoding. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.179\DUtablewidth}|p{0.179\DUtablewidth}|p{0.168\DUtablewidth}|} \hline \textbf{% bytes 70-72 } & \textbf{% SAS name } & \textbf{% iconv name } \\ \hline \endfirsthead \hline \textbf{% bytes 70-72 } & \textbf{% SAS name } & \textbf{% iconv name } \\ \hline \endhead \multicolumn{3}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & (Unspecified) & (Unspecified) \\ \hline 20 & utf-8 & UTF-8 \\ \hline 28 & us-ascii & US-ASCII \\ \hline 29 & latin1 & ISO-8859-1 \\ \hline 30 & latin2 & ISO-8859-2 \\ \hline 31 & latin3 & ISO-8859-3 \\ \hline 34 & arabic & ISO-8859-6 \\ \hline 36 & hebrew & ISO-8859-8 \\ \hline 39 & thai & ISO-8859-11 \\ \hline 40 & latin5 & ISO-8859-9 \\ \hline 60 & wlatin2 & WINDOWS-1250 \\ \hline 61 & wcyrillic & WINDOWS-1251 \\ \hline 62 & wlatin1 & WINDOWS-1252 \\ \hline 63 & wgreek & WINDOWS-1253 \\ \hline 64 & wturkish & WINDOWS-1254 \\ \hline 65 & whebrew & WINDOWS-1255 \\ \hline 66 & warabic & WINDOWS-1256 \\ \hline 119 & euc-tw & EUC-TW \\ \hline 123 & big5 & BIG-5 \\ \hline 125 & euc-cn & EUC-CN \\ \hline 134 & euc-jp & EUC-JP \\ \hline 138 & shift-jis & SHIFT-JIS \\ \hline 140 & euc-kr & EUC-KR \\ \hline \end{longtable*} When the encoding is unspecified, the file uses the encoding of the SAS session that produced it (usually Windows-1252). \section{SAS7BDAT Pages% \label{sas7bdat-pages}% } Following the SAS7BDAT header are pages of data. Each page can be one of (at least) four types. The first three are those that contain meta-information (e.g. field/column attributes), packed binary data, or a combination of both. These types are denoted 'meta', 'data', and 'mix' respectively. Meta-information is required to correctly interpret the packed binary information. Hence, this information must be parsed first. In test files, 'meta' and 'mix' pages always precede 'data' pages. In some test data files, there is a fourth page type, denoted 'amd' which appears to encode additional meta information. This page usually occurs last, and appears to contain amended meta information. The \hyperref[page-offset-table]{page offset table} below describes each page type. Byte offsets appended with one of '(meta/mix)', '(mix)', or '(data)' indicate that the corresponding length and description apply only to pages of the listed type. Provisionally, the internal structure of the 'amd' page type is considered identical to the 'meta' page type. \subsection{Page Offset Table% \label{page-offset-table}% } \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.146\DUtablewidth}|p{0.146\DUtablewidth}|p{0.071\DUtablewidth}|p{0.587\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4 & low & int, page sequence signature? \\ \hline 4 & 12|28 & low & \emph{????????????} length 12 or 28 (\textbf{u64}) \\ \hline B & 2 & medium & int, bit field \hyperref[page-type]{page type} := \_PGTYPE; B = 16|32 \\ \hline B+2 & 2 & medium & int, data block count := % \phantomsection\label{bc}BC \\ \hline B+4 & 2 & medium & int, \hyperref[subheader-pointers]{subheader pointers} count := % \phantomsection\label{sc}SC <= \hyperref[bc]{BC} \\ \hline B+6 & 2 & low & \emph{????????????} \\ \hline B+8 & SC*SL & medium & SC \hyperref[subheader-pointers]{subheader pointers}, SL = 12|24 \\ \hline B+8+SC*SL & DL & medium & if NRD>0, 8 byte alignment; DL = (B+8+SC*SL+7) \% 8 * 8 \\ \hline B+8+SC*SL+DL & RC*`RL`\_ & medium & \hyperref[sas7bdat-packed-binary-data]{SAS7BDAT packed binary data} data row count := RC = (BC-SC) \\ \hline C & \%`PL`\_ & medium & subheader data and/or filler; C = (B+8+SC*SL+DL+RC*RL) \\ \hline \end{longtable*} \subsubsection{Page Type% \label{page-type}% } \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.070\DUtablewidth}|p{0.051\DUtablewidth}|p{0.107\DUtablewidth}|p{0.386\DUtablewidth}|p{0.340\DUtablewidth}|} \hline \textbf{% PGTYPE } & \textbf{% name } & \textbf{% subheaders } & \textbf{% uncompressed row data (after subheaders) } & \textbf{% compressed row data (in subheaders) } \\ \hline \endfirsthead \hline \textbf{% PGTYPE } & \textbf{% name } & \textbf{% subheaders } & \textbf{% uncompressed row data (after subheaders) } & \textbf{% compressed row data (in subheaders) } \\ \hline \endhead \multicolumn{5}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & meta & yes (SC>0) & no (BC=SC) & yes \\ \hline 256 & data & no (SC=0) & yes (RC=BC) & no \\ \hline 512 & mix & yes (SC>0) & yes (RC=BC-SC) & no \\ \hline 1024 & amd & yes? & yes? & no? \\ \hline 16384 & meta & yes (SC>0) & no (BC=SC) & yes \\ \hline -28672 & comp & no & no & no \\ \hline \end{longtable*} There are at least four page types 'meta', 'data', 'mix', and 'amd'. These types are encoded in the most significant byte of a two byte bit field at page offset 16|32. If no bit is set, the following page is of type 'meta'. If the first, second, or third bits are set, then the page is of type 'data', 'mix', or 'amd', respectively. Hence, if the two bytes are interpreted as an unsigned integer, then the 'meta', 'data', 'mix', and 'amd' types correspond to 0, 256, 512, and 1024, respectively. In compressed files, other bits (and sometimes multiple bits) have been set (e.g., \texttt{1 <{}< 16 | 1 <{}< 13}, which is \texttt{-28672} signed, or \texttt{36864} unsigned). However, the pattern is unclear. If a page is of type 'meta', 'mix', or 'amd', data beginning at offset byte 24|40 are a sequence of SC SL-byte \hyperref[subheader-pointers]{subheader pointers}, which point to an offset farther down the page. \hyperref[sas7bdat-subheaders]{SAS7BDAT Subheaders} stored at these offsets hold meta information about the database, including the column names, labels, and types. If a page is of type 'mix', then \textbf{packed binary data begin at the next 8 byte boundary following the last subheader pointer}. In this case, the data begin at offset B+8+SC*SL+DL, where DL = (B+8+SC*SL+PL+7) \% 8 * 8, and '\%' is the modulo operator. If a page is of type 'data', then packed binary data begin at offset 24|40. The 'comp' page was observed as page 2 of the compress\_yes.sas7bdat test file (not distributed with the \texttt{sas7bdat} package). It has BC and SC fields, but no subheader pointers. It contains some initial data and 2 tables. The first table has many rows of length 24; its purpose is unknown. The second table has one entry per data page with the page number and the number of data rows on the page for SC pages. It could be used to access a particular row without reading all preceding data pages. \subsubsection{Subheader Pointers% \label{subheader-pointers}% } The subheader pointers encode information about the offset and length of subheaders relative to the beginning of the page where the subheader pointer is located. The purpose of the last four bytes of the subheader pointer are uncertain, but may indicate that additional subheader pointers are to be found on the next page, or that the corresponding subheader is not crucial. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.098\DUtablewidth}|p{0.086\DUtablewidth}|p{0.086\DUtablewidth}|p{0.633\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, offset from page start to subheader \\ \hline 4|8 & 4|8 & high & int, length of subheader := % \phantomsection\label{ql}QL \\ \hline 8|16 & 1 & medium & int, compression := % \phantomsection\label{comp}COMP \\ \hline 9|17 & 1 & low & int, subheader type := ST \\ \hline 10|18 & 2|6 & low & zeroes \\ \hline 12|24 & & high & Total length of subheader pointer 12|24 (\textbf{u64}), SL \\ \hline \end{longtable*} QL is sometimes zero, which indicates that no data is referenced by the corresponding subheader pointer. When this occurs, the subheader pointer may be ignored. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.098\DUtablewidth}|p{0.493\DUtablewidth}|} \hline \textbf{% \hyperref[comp]{COMP} } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% \hyperref[comp]{COMP} } & \textbf{% description } \\ \hline \endhead \multicolumn{2}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & uncompressed \\ \hline 1 & truncated (ignore data) \\ \hline 4 & RLE compressed row data with control byte \\ \hline \end{longtable*} \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.056\DUtablewidth}|p{0.884\DUtablewidth}|} \hline \textbf{% ST } & \textbf{% subheaders } \\ \hline \endfirsthead \hline \textbf{% ST } & \textbf{% subheaders } \\ \hline \endhead \multicolumn{2}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & Row Size, Column Size, Subheader Counts, Column Format and Label, in Uncompressed file \\ \hline 1 & Column Text, Column Names, Column Attributes, Column List \\ \hline 1 & all subheaders (including row data), in Compressed file. \\ \hline \end{longtable*} \section{SAS7BDAT Subheaders% \label{sas7bdat-subheaders}% } Subheaders contain meta information regarding the SAS7BDAT database, including row and column counts, column names, labels, and types. Each subheader is associated with a four- or eight-byte 'signature' (\textbf{u64}) that identifies the subheader type, and hence, how it should be parsed. \subsection{Row Size Subheader% \label{row-size-subheader}% } The row size subheader holds information about row length (in bytes), their total count, and their count on a page of type 'mix'. Fields at offset 28|56 and higher are not needed to read the file, but are documented here for completeness. The four test files used for example data in the higher fields are \texttt{eyecarex.sas7bdat}, \texttt{acadindx.sas7bdat}, \texttt{natlterr1994.sas7bdat}, \texttt{txzips.sas7bdat} (non-Intel/Intel x regular/u64). \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.076\DUtablewidth}|p{0.076\DUtablewidth}|p{0.055\DUtablewidth}|p{0.743\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & binary, signature xF7F7F7F7|xF7F7F7F700000000 \\ \hline 4|8 & 16|32 & low & \emph{????????????} \\ \hline 20|40 & 4|8 & high & int, row length (in bytes) := % \phantomsection\label{rl}RL \\ \hline 24|48 & 4|8 & high & int, total row count := TRC \\ \hline 28|56 & 8|16 & low & \emph{????????????} \\ \hline 36|72 & 4|8 & medium & int, number of \hyperref[column-format-and-label-subheader]{Column Format and Label Subheader} on first page where they appear := % \phantomsection\label{ncfl1}NCFL1 \\ \hline 40|80 & 4|8 & medium & int, number of \hyperref[column-format-and-label-subheader]{Column Format and Label Subheader} on second page where they appear (or 0) := % \phantomsection\label{ncfl2}NCFL2 \\ \hline 44|88 & 8|16 & low & \emph{????????????} \\ \hline 52|104 & 4|8 & medium & int, page size, equals PL \\ \hline 56|112 & 4|8 & low & \emph{????????????} \\ \hline 60|120 & 4|8 & medium & int, max row count on \textquotedbl{}mix\textquotedbl{} page := % \phantomsection\label{mrc}MRC \\ \hline 64|128 & 8|16 & medium & sequence of 8|16 FF, end of initial header \\ \hline 72|144 & 148|296 & medium & zeroes \\ \hline 220|440 & 4 & low & int, page sequence signature (equals current page sequence signature) \\ \hline 224|444 & 40|68 & low & zeroes \\ \hline 264|512 & 4|8 & low & int, value 1 observed in 4 test files \\ \hline 268|520 & 2 & low & int, value 2 observed \\ \hline 270|522 & 2|6 & low & zeroes (pads length of 3 fields to 8|16) \\ \hline 272|528 & 4|8 & medium & int, number of pages with subheader data := NPSHD \\ \hline 276|536 & 2 & medium & int, number of subheaders with positive length on last page with subheader data := NSHPL \\ \hline 278|538 & 2|6 & low & zeroes \\ \hline 280|544 & 4|8 & low & int, values equal to NPSHD observed \\ \hline 284|552 & 2 & low & int, values equal to NSHPL+2 observed \\ \hline 286|554 & 2|6 & low & zeroes \\ \hline 288|560 & 4|8 & medium & int, number of pages in file, equals PC \\ \hline 292|568 & 2 & low & int, values 22,26,9,56 observed \\ \hline 294|570 & 2|6 & low & zeroes \\ \hline 296|576 & 4|8 & low & int, value 1 observed \\ \hline 300|584 & 2 & low & int, values 7|8 observed \\ \hline 302|586 & 2|6 & low & zeroes \\ \hline 304|592 & 40|80 & low & zeroes \\ \hline 344|672 & 2 & low & int, value 0 \\ \hline 346|674 & 2 & low & int, values 0|8 \\ \hline 348|676 & 2 & low & int, value 4 \\ \hline 350|678 & 2 & low & int, value 0 \\ \hline 352|680 & 2 & low & int, values 12,32|0 \\ \hline 354|682 & 2 & low & int, length of Creator Software string := LCS \\ \hline 356|684 & 2 & low & int, value 0 \\ \hline 358|686 & 2 & low & int, value 20 \\ \hline 360|688 & 2 & low & int, value of 8 indicates MXNAM and MXLAB valid := IMAXN \\ \hline 362|690 & 8 & low & zeroes \\ \hline 370|698 & 2 & low & int, value 12 \\ \hline 372|700 & 2 & low & int, value 8 \\ \hline 374|702 & 2 & low & int, value 0 \\ \hline 376|704 & 2 & low & int, value 28 \\ \hline 378|706 & 2 & low & int, length of Creator PROC step name := LCP \\ \hline 380|708 & 36 & low & zeroes \\ \hline 416|744 & 2 & low & int, value 4 \\ \hline 418|746 & 2 & low & int, value 1 \\ \hline 420|748 & 2 & low & int, number of Column Text subheaders in file := % \phantomsection\label{nct}NCT \\ \hline 422|750 & 2 & low & int, max length of column names := MXNAM (see IMAXN) \\ \hline 424|752 & 2 & low & int, max length of column labels := MXLAB (see IMAXN) \\ \hline 426|754 & 12 & low & zeroes \\ \hline 438|766 & 2 & medium & int, number of data rows on a full page INT{[}(PL - 24 / 40)/\hyperref[rl]{RL}{]}; 0 for compressed file \\ \hline 440|768 & 27 & low & zeroes \\ \hline 467|795 & 1 & low & int, bit field, values 1,5 \\ \hline 468|796 & 12 & low & zeroes \\ \hline 480|808 & & medium & Total length of subheader, QL \\ \hline \end{longtable*} \subsection{Column Size Subheader% \label{column-size-subheader}% } The \hyperref[column-size-subheader]{column size subheader} holds the number of columns (variables). \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.098\DUtablewidth}|p{0.086\DUtablewidth}|p{0.086\DUtablewidth}|p{0.540\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & binary, signature xF6F6F6F6|xF6F6F6F600000000 \\ \hline 4|8 & 4|8 & high & int, number of columns := NCOL \\ \hline 8|16 & 4|8 & low & \emph{????????????} usually zeroes \\ \hline 12|24 & & medium & Total length of subheader, QL \\ \hline \end{longtable*} \subsection{Subheader Counts Subheader% \label{subheader-counts-subheader}% } This subheader contains information on the first and last appearances of at least 7 common subheader types. Any of these subheaders may appear once or more. Multiple instances of a subheader provide information for an exclusive subset of columns. The order in which data is read from multiple subheaders corresponds to the reading order (left to right) of columns. The structure of this subheader was deduced and reported by Clint Cummins. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.108\DUtablewidth}|p{0.088\DUtablewidth}|p{0.077\DUtablewidth}|p{0.677\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, signature -1024 (x00FCFFFF|x00FCFFFFFFFFFFFF) \\ \hline 4|8 & 4|8 & low & int, length or offset, usually >= 48 \\ \hline 8|16 & 4|8 & low & int, usually 4 \\ \hline 12|24 & 2 & low & int, usually 7 (number of nonzero SCVs?) \\ \hline 14|26 & 50|94 & low & \emph{????????????} \\ \hline 64|120 & 12*LSCV & medium & 12 \hyperref[subheader-count-vectors]{subheader count vectors}, length := LSCV = 20|40 bytes each \\ \hline 304|600 & & medium & Total length of subheader, QL \\ \hline \end{longtable*} \subsubsection{Subheader Count Vectors% \label{subheader-count-vectors}% } The subheader count vectors encode information for each of 4 common subheader types, and potentially 12 total subheader types. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.098\DUtablewidth}|p{0.086\DUtablewidth}|p{0.086\DUtablewidth}|p{0.633\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, signature (see list below) \\ \hline 4|8 & 4|8 & medium & int, page where this subheader first appears := PAGE1 \\ \hline 8|16 & 2 & medium & int, position of subheader pointer in PAGE1 := LOC1 \\ \hline 10|18 & 2|6 & low & \emph{????????????} zero padding \\ \hline 12|24 & 4|8 & medium & int, page where this subheader last appears := PAGEL \\ \hline 16|32 & 2 & medium & int, position of subheader pointer in PAGEL := LOCL \\ \hline 18|34 & 2|6 & low & \emph{????????????} zero padding \\ \hline 20|40 & & medium & Total length of subheader count vector, LSCV \\ \hline \end{longtable*} The LOC1 and LOCL give the positions of the corresponding subheader pointer in PAGE1 and PAGEL, respectively. That is, if there are SC subheader pointers on page PAGE1, then the corresponding subheader pointer first occurs at the LOC1'th position in this array, enumerating from 1. If PAGE1=0, the subheader is not present. If PAGE1=PAGEL and LOC1=LOCL, the subheader appears exactly once. If PAGE1!=PAGEL or LOC1!=LOCL, the subheader appears 2 or more times. In all test files, PAGE1 <= PAGEL, and the corresponding subheaders appear only once per page. The variable \hyperref[nct]{NCT} in the \hyperref[row-size-subheader]{Row Size Subheader} should be used to ensure that all Column Text subheaders are located (and to avoid scanning through all pages in the file when all subheaders are already located). The first 7 binary signatures in the \hyperref[subheader-count-vectors]{Subheader Count Vectors} array are always: \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.121\DUtablewidth}|p{0.249\DUtablewidth}|} \hline \textbf{% signature } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% signature } & \textbf{% description } \\ \hline \endhead \multicolumn{2}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot -4 & Column Attributes \\ \hline -3 & Column Text \\ \hline -1 & Column Names \\ \hline -2 & Column List \\ \hline -5 & unknown signature \#1 \\ \hline -6 & unknown signature \#2 \\ \hline -7 & unknown signature \#3 \\ \hline \end{longtable*} The remaining 5 out of 12 signatures are zeros in the observed source files. Presumably, these are for subheaders not yet defined, or not present in the collection of test files. A \hyperref[column-format-and-label-subheader]{Column Format and Label Subheader} may appear on multiple pages, but are not indexed in Subheader Counts. The variables NCFL1 and NCFL2 in the \hyperref[row-size-subheader]{Row Size subheader} may be helpful if you want to know in advance if these appear across multiple pages. \subsection{Column Text Subheader% \label{column-text-subheader}% } The column text subheader contains a block of text associated with columns, including the column names, labels, and formats. However, this subheader is not sufficient to parse this information. Other subheaders (e.g. the \hyperref[column-name-subheader]{column name subheader}), which point to specific elements within this subheader are also needed. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.092\DUtablewidth}|p{0.081\DUtablewidth}|p{0.081\DUtablewidth}|p{0.697\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, signature -3 (xFDFFFFFF|xFDFFFFFFFFFFFFFF) \\ \hline 4|8 & 2 & medium & int, size of text block (QL - 16|20) \\ \hline 6|10 & 2 & low & \emph{????????????} \\ \hline 8|12 & 2 & low & \emph{????????????} \\ \hline 10|14 & 2 & low & \emph{????????????} \\ \hline 12|16 & 2 & low & \emph{????????????} \\ \hline 14|18 & 2 & low & \emph{????????????} \\ \hline 16|20 & varies & medium & ascii, compression \& Creator PROC step name that generated data \\ \hline varies & \%QL & high & ascii, combined column names, labels, formats \\ \hline \end{longtable*} This subheader sometimes appears more than once; each is a separate array. If so, the \textquotedbl{}column name index\textquotedbl{} field in \hyperref[column-name-pointers]{column name pointers} selects a particular text array - 0 for the first array, 1 for the second, etc. Similarly, \textquotedbl{}column format index\textquotedbl{} and \textquotedbl{}column label index\textquotedbl{} fields also select a text array. Offsets to strings within the text array are multiples of 4, so the column names and labels section of the array often contains many nulls for padding. The variables LCS and LCP from the \hyperref[row-size-subheader]{Row Size subheader} refer to a text field at the start of the text array (at offset 16|20) in the first Column Text subheader (before the column name strings). This text field also contains compression information. The following logic decodes this initial field: \begin{enumerate} \item If the first 8 bytes of the field are blank, file is not compressed, and set LCS=0. The Creator PROC step name is the LCP bytes starting at offset 16. \item If LCS > 0 (still), the file is not compressed, the first LCS bytes are the Creator Software string (padded with nulls). Set LCP=0. Stat/Transfer files use this pattern. \item If the first 8 bytes of the field are \texttt{SASYZCRL}, the file is compressed with Run Length Encoding. The Creator PROC step name is the LCP bytes starting at offset 24. \item If the first 8 bytes are nonblank and options 2 or 3 above are not used, this probably indicates COMPRESS=BINARY. We need test files to confirm this, though. \end{enumerate} \subsection{Column Name Subheader% \label{column-name-subheader}% } Column name subheaders contain a sequence of \hyperref[column-name-pointers]{column name pointers} to the offset of each column name \textbf{relative to a} \hyperref[column-text-subheader]{column text subheader}. There may be multiple column name subheaders, indexing into multiple column text subheaders. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.098\DUtablewidth}|p{0.086\DUtablewidth}|p{0.086\DUtablewidth}|p{0.644\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, signature -1 (xFFFFFFFF|xFFFFFFFFFFFFFFFF) \\ \hline 4|8 & 2 & medium & int, length of remaining subheader (QL - 16|20) \\ \hline 6|10 & 2 & low & \emph{????????????} \\ \hline 8|12 & 2 & low & \emph{????????????} \\ \hline 10|14 & 2 & low & \emph{????????????} \\ \hline 12|16 & 8*CMAX & medium & \hyperref[column-name-pointers]{column name pointers} (see below), CMAX=(QL-20|28)/8 \\ \hline MCN & 8|12 & low & zeros, 12|16 + 8*CMAX := MCN \\ \hline \end{longtable*} Each column name subheader holds CMAX column name pointers. When there are multiple column name subheaders, CMAX will be less than NCOL. \subsubsection{Column Name Pointers% \label{column-name-pointers}% } \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.061\DUtablewidth}|p{0.061\DUtablewidth}|p{0.061\DUtablewidth}|p{0.767\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 2 & high & int, column name index to select \hyperref[column-text-subheader]{Column Text Subheader} \\ \hline 2 & 2 & high & int, column name offset w.r.t. end of selected Column Text signature. Always a multiple of 4. \\ \hline 4 & 2 & high & int, column name length \\ \hline 6 & 2 & low & zeros \\ \hline 8 & & high & Total length of column name pointer \\ \hline \end{longtable*} \subsection{Column Attributes Subheader% \label{column-attributes-subheader}% } The column attribute subheader holds information regarding the column offsets within a data row, the column widths, and the column types (either numeric or character). The column attribute subheader sometimes occurs more than once (in test data). In these cases, column attributes are applied in the order they are parsed. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.080\DUtablewidth}|p{0.099\DUtablewidth}|p{0.071\DUtablewidth}|p{0.700\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, signature -4 (hex xFCFFFFFF|FCFFFFFFFFFFFFFF) \\ \hline 4|8 & 2 & medium & int, length of remaining subheader \\ \hline 6|10 & 2 & low & \emph{????????????} \\ \hline 8|12 & 2 & low & \emph{????????????} \\ \hline 10|14 & 2 & low & \emph{????????????} \\ \hline 12|16 & LCAV*CMAX & high & \hyperref[column-attribute-vectors]{column attribute vectors} (see below), CMAX=(QL-20|28)/LCAV, LCAV=12|16 \\ \hline MCA & 8|12 & low & MCA = 12|16 + LCAV*CMAX \\ \hline \end{longtable*} \subsubsection{Column Attribute Vectors% \label{column-attribute-vectors}% } \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.179\DUtablewidth}|p{0.086\DUtablewidth}|p{0.086\DUtablewidth}|p{0.563\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, column offset in data row (in bytes) \\ \hline 4|8 & 4 & high & int, column width \\ \hline 8|12 & 2 & low & name length flag \\ \hline 10|14 & 1 & high & int, column type (1 = numeric, 2 = character) \\ \hline 11|15 & 1 & low & \emph{????????????} \\ \hline 12|16 & & high & Total length of column attribute vector, LCAV \\ \hline \end{longtable*} Observed values of name length flag in the source files: \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.183\DUtablewidth}|p{0.757\DUtablewidth}|} \hline \textbf{% name length flag } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% name length flag } & \textbf{% description } \\ \hline \endhead \multicolumn{2}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 4 & name length <= 8 \\ \hline 1024 & usually means name length <= 8 , but sometimes the length is 9-12 \\ \hline 2048 & name length > 8 \\ \hline 2560 & name length > 8 \\ \hline \end{longtable*} \subsection{Column Format and Label Subheader% \label{column-format-and-label-subheader}% } The column format and label subheader contains pointers to a column format and label \textbf{relative to a} \hyperref[column-text-subheader]{column text subheader}. Since the column label subheader only contains information regarding a single column, there are typically as many of these subheaders as columns. The structure of column format pointers was contributed by Clint Cummins. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.071\DUtablewidth}|p{0.071\DUtablewidth}|p{0.063\DUtablewidth}|p{0.746\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, signature -1026 (hex FEFB \& 2 or 6 FFs) \\ \hline 4|8 & 30|38 & low & \emph{????????????} \\ \hline 34|46 & 2 & high & int, column format index to select \hyperref[column-text-subheader]{Column Text Subheader} \\ \hline 36|48 & 2 & high & int, column format offset w.r.t. end of selected Column Text signature. A multiple of 4. \\ \hline 38|50 & 2 & high & int, column format length \\ \hline 40|52 & 2 & high & int, column label index to select \hyperref[column-text-subheader]{Column Text Subheader} \\ \hline 42|54 & 2 & high & int, column label offset w.r.t. end of selected Column Text signature. A multiple of 4. \\ \hline 44|56 & 2 & high & int, column label length \\ \hline 46|58 & 6 & low & \emph{????????????} \\ \hline 52|64 & & medium & Total length of subheader, QL \\ \hline \end{longtable*} \subsection{Column List Subheader% \label{column-list-subheader}% } The purpose of this subheader is not clear. But the structure is partly identified. Information related to this subheader was contributed by Clint Cummins. eyecarex (created by Stat/Transfer) does not have this subheader. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.098\DUtablewidth}|p{0.086\DUtablewidth}|p{0.086\DUtablewidth}|p{0.610\DUtablewidth}|} \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endfirsthead \hline \textbf{% offset } & \textbf{% length } & \textbf{% conf. } & \textbf{% description } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 4|8 & high & int, signature -2 (hex FE \& 3 or 7 FFs) \\ \hline 4|8 & 2 & low & int, value close to offset in subheader pointer \\ \hline 6|10 & 6 & low & \emph{????????????} \\ \hline 12|16 & 4|8 & medium & int, length of remaining subheader \\ \hline 16|24 & 2 & low & int, usually equals NCOL \\ \hline 18|26 & 2 & medium & int, length of column list := CL, usually CL > NCOL \\ \hline 20|28 & 2 & low & int, usually 1 \\ \hline 22|30 & 2 & low & int, usually equals NCOL \\ \hline 24|32 & 2 & low & int, usually 3 equal values \\ \hline 26|34 & 2 & low & int, usually 3 equal values \\ \hline 28|36 & 2 & low & int, usually 3 equal values \\ \hline 30|38 & 2*CL & medium & \hyperref[column-list-values]{column list values} (see below) \\ \hline MCL & 8 & low & usually zeros, 30|38 + 2*CL := MCL \\ \hline \end{longtable*} \subsubsection{Column List Values% \label{column-list-values}% } These values are 2 byte integers, with (CL-NCOL) zero values. Each nonzero value is unique, between -NCOL and NCOL. The significance of signedness and ordering is unknown. The values do not correspond to a sorting order of columns. \subsection{Compressed Binary Data Subheader% \label{compressed-binary-data-subheader}% } When a SAS7BDAT file is created by SAS with the option COMPRESS=CHAR or COMPRESS=YES, each row of data is compressed independently with a Run Length Encoding (RLE) structure. This yields a variable length compressed row. Each such row is stored in a single subheader in sequential order, indexed by the \hyperref[subheader-pointers]{subheader pointers}. A RLE compressed data row is identified by COMP=4 in the subheader pointer, and does not have a subheader signature. If a particular row had highly variable data and yielded no compression, it is still stored in a subheader, but uncompressed with COMP=0 instead of COMP=4. The test file \texttt{compress\_yes.sas7bdat} has such highly variable (random) data and all its rows are in this COMP=0 form of subheaders. It takes up more space than the uncompressed version \texttt{compress\_no.sas7bdat}, due to the extra length of the subheader pointers. The final subheader on a page is usually COMP=1, which indicates a truncated row to be ignored; the complete data row appears on the next page. The SAS option COMPRESS=BINARY apparently uses a RDC (Ross Data Compression) structure instead of RLE. We need more test files to investigate this structure, and only document RLE at present. \subsubsection{Run Length Encoding% \label{run-length-encoding}% } In RLE, the compressed row data is a series of control bytes, each optionally followed by data bytes. The control byte specifies how the data bytes are interpreted, or is self contained. The control byte has 2 parts - the upper 4 bits are the Command, and the lower 4 bits are the Length. Each is an uint in the range 0-15. For example, control byte 82 (hex) is Command 8 and Length 2, and control byte F4 (hex) is command 15 (F hex) and Length 4. We have identified the functions of the 11 different Command values which are observed in the test files. The RLE structure was contributed by Clint Cummins. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.052\DUtablewidth}|p{0.046\DUtablewidth}|p{0.087\DUtablewidth}|p{0.765\DUtablewidth}|} \hline \textbf{% Command } & \textbf{% Length } & \textbf{% Name } & \textbf{% Function } \\ \hline \endfirsthead \hline \textbf{% Command } & \textbf{% Length } & \textbf{% Name } & \textbf{% Function } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 0 & 0 & Copy64 & using the first byte as a uint length L (0-255), Copy the next N=64+L bytes from the input to the output (copies 64 to 319 bytes) \\ \hline 1 & ? & ? & \emph{????????????} (not observed in test files) \\ \hline 2 & ? & ? & \emph{????????????} (not observed in test files) \\ \hline 3 & ? & ? & \emph{????????????} (not observed in test files) \\ \hline 4 & ? & ? & \emph{????????????} (not observed in test files) \\ \hline 5 & ? & ? & \emph{????????????} (not observed in test files) \\ \hline 6 & 0 & InsertBlank17 & using the first byte as a uint length L, Insert N=17+L blanks (decimal 32, hex 20) in the output (inserts 17 to 273 blanks) \\ \hline 7 & 0 & InsertZero17 & using the first byte as a uint length L, Insert N=17+L zero bytes in the output \\ \hline 8 & L & Copy1 & using the Length bits as a uint length L (0-15), Copy the next N=1+L bytes from the input to the output (copies 1 to 16 bytes) \\ \hline 9 & L & Copy17 & Copy the next N=17+L bytes from the input to the output (copies 17 to 32 bytes) \\ \hline 10 (A) & L & Copy33 & Copy the next N=33+L bytes from the input to the output (copies 33 to 48 bytes) \\ \hline 11 (B) & L & Copy49 & Copy the next N=49+L bytes from the input to the output (copies 49 to 64 bytes) \\ \hline 12 (C) & L & InsertByte3 & Insert N=3+L copies of the next byte in the output (inserts 3 to 18 bytes) \\ \hline 13 (D) & L & Insert@2 & Insert N=2+L @ (decimal 64, hex 40) bytes in the output (inserts 2 to 17 @ bytes) \\ \hline 14 (E) & L & InsertBlank2 & Insert N=2+L blanks in the output \\ \hline 15 (F) & L & InsertZero2 & Insert N=2+L zero bytes in the output \\ \hline \end{longtable*} The most common Commands in \texttt{obs\_all\_perf\_1.sas7bdat} are F and 8 (alternating). This file is entirely 8 byte doubles, so the F commands often handle consecutive zero bytes in zero value doubles. \subsubsection{RLE Example 1% \label{rle-example-1}% } Compressed data row: \texttt{87 A B C D E F G H F2 8A 1 2 3 4 5 6 7 8 9 A B D0 A1 a b c d e f g ... z} \texttt{CB -8-data-bytes-{}- CB CB -{}-11-data-bytes-{}-{}-{}-{}-{}- CB CB -{}-34-data-bytes-{}-} \texttt{Copy1 ~ ~ ~ ~ ~ ~ ~InsertZero2 ~ ~ ~ ~ ~ ~ ~ ~ Ins Copy33 next 34 bytes} \texttt{Next 8 bytes ~ ~ ~ 4 00h bytes ~ ~ ~ ~ ~ ~ ~ ~ 2 40h} There are 5 Control Bytes (CB) in the above sequence. \begin{enumerate} \item 87: Copy1 next 8 bytes \item F2: InsertZero2 4 00h bytes \item 8A: Copy1 next 11 bytes \item D0: Insert@2 2 40h bytes \item A1: Copy33 next 34 bytes \end{enumerate} Output uncompressed row: \texttt{A B C D E F G H 00 00 00 00 1 2 3 4 5 6 7 8 9 A B 40 40 a b c ... z} \subsubsection{RLE Example 2% \label{rle-example-2}% } Compressed data row: \texttt{87 A B C D E F G H C1 99 A5 a b c ... z} \texttt{CB -8-data-bytes-{}- CB ar CB -last-bytes} \texttt{Copy1 8 ~ ~ ~ ~ ~ ~InsBy Copy33 38 bytes} Control Bytes in Example 2: \begin{enumerate} \item 87: Copy1 next 8 bytes \item C1,99: InsertByte3 4 99h bytes \item A5: Copy33 next 38 bytes \end{enumerate} Output uncompressed row: \texttt{A B C D E F G H 99 99 99 99 a b c ... z} Once a data row is uncompressed, use the \hyperref[sas7bdat-packed-binary-data]{SAS7BDAT Packed Binary Data} description below to read the variables. \section{SAS7BDAT Packed Binary Data% \label{sas7bdat-packed-binary-data}% } SAS7BDAT packed binary are uncompressed, and appear after any subheaders on the page; see the \hyperref[page-offset-table]{Page Offset Table}. These data are stored by rows, where the size of a row (in bytes) is defined by the \hyperref[row-size-subheader]{row size subheader}. When multiple rows occur on a single page, they are immediately adjacent. When a database contains many rows, it is typical that the collection of rows (i.e. their data) is evenly distributed to a number of 'data' pages. However, in test files, no single row's data is broken across two or more pages. A single data row is parsed by interpreting the binary data according to the collection of column attributes contained in the \hyperref[column-attributes-subheader]{column attributes subheader}. Binary data can be interpreted in two ways, as ASCII characters, or as floating point numbers. The column width attribute specifies the number of bytes associated with a column. For character data, this interpretation is straight-forward. For numeric data, interpretation of the column width is more complex. The common binary representation of floating point numbers has three parts; the sign (\texttt{s}), exponent (\texttt{e}), and mantissa (\texttt{m}). The corresponding floating point number is \texttt{s * m * b \textasciicircum{} e}, where \texttt{b} is the base (2 for binary, 10 for decimal). Under the IEEE 754 floating point standard, the sign, exponent, and mantissa are encoded by 1, 11, and 52 bits respectively, totaling 8 bytes. In SAS7BDAT file, numeric quantities can be 3, 4, 5, 6, 7, or 8 bytes in length. For numeric quantities of less than 8 bytes, the remaining number of bytes are truncated from the least significant part of the mantissa. Hence, the minimum and maximum numeric values are identical for all byte lengths, but shorter numeric values have reduced precision. Reduction in precision is characterized by the largest integer such that itself and all smaller integers have an exact representation, denoted \texttt{M}. At best, all integers greater than \texttt{M} are approximated to the nearest multiple of \texttt{b}. The table of \hyperref[numeric-binary-formats]{numeric binary formats} below lists \texttt{M} values and describes how bits are distributed among the six possible column widths in SAS7BDAT files, and lists. \subsection{Numeric Binary Formats% \label{numeric-binary-formats}% } \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.075\DUtablewidth}|p{0.075\DUtablewidth}|p{0.063\DUtablewidth}|p{0.110\DUtablewidth}|p{0.110\DUtablewidth}|p{0.203\DUtablewidth}|} \hline \textbf{% size } & \textbf{% bytes } & \textbf{% sign } & \textbf{% exponent } & \textbf{% mantissa } & \textbf{% \texttt{M} } \\ \hline \endfirsthead \hline \textbf{% size } & \textbf{% bytes } & \textbf{% sign } & \textbf{% exponent } & \textbf{% mantissa } & \textbf{% \texttt{M} } \\ \hline \endhead \multicolumn{6}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot 24bit & 3 & 1 & 11 & 12 & 8192 \\ \hline 32bit & 4 & 1 & 11 & 20 & 2097152 \\ \hline 40bit & 5 & 1 & 11 & 28 & 536870912 \\ \hline 48bit & 6 & 1 & 11 & 36 & 137438953472 \\ \hline 56bit & 7 & 1 & 11 & 44 & 35184372088832 \\ \hline 64bit & 8 & 1 & 11 & 52 & 9007199254740990 \\ \hline \end{longtable*} \subsection{Dates, Currency, and Formatting% \label{dates-currency-and-formatting}% } Column formatting infomation is encoded within the \hyperref[column-text-subheader]{Column Text Subheader} and \hyperref[column-format-and-label-subheader]{Column Format and Label Subheader}. Columns with formatting information have special meaning and interpretation. For example, numeric values may represent dates, encoded as the number of seconds since midnight, January 1, 1960. The format string for fields encoded this way is \textquotedbl{}DATETIME\textquotedbl{}. Using R, these values may be converted using the as.POSIXct or as.POSIXlt functions with argument \texttt{origin=\textquotedbl{}1960-01-01\textquotedbl{}}. The most common date format strings correspond to numeric fields, and are interpreted as follows: \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.110\DUtablewidth}|p{0.470\DUtablewidth}|p{0.156\DUtablewidth}|} \hline \textbf{% Format } & \textbf{% Interpretation } & \textbf{% R Function } \\ \hline \endfirsthead \hline \textbf{% Format } & \textbf{% Interpretation } & \textbf{% R Function } \\ \hline \endhead \multicolumn{3}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot DATE & Number of days since January 1, 1960 & chron::chron \\ \hline TIME & Number of seconds since midnight & as.POSIXct \\ \hline DATETIME & Number of seconds since January 1, 1960 & as.POSIXct \\ \hline \end{longtable*} There are many additional format strings for numeric and character fields. \section{Platform Differences% \label{platform-differences}% } The test files referenced in \texttt{data/sas7bdat.sources.RData} were examined over a period of time. Files with non-Microsoft Windows markings were only observed late into the writing of this document. Consequently (but not intentionally), the SAS7BDAT description above was first deduced for SAS datasets generated on the most commonly observed platform: Microsoft Windows. The extensions to SAS7BDAT files for \textbf{u64} and non-Intel formats was contributed a little later by Clint Cummins. In particular, the files \texttt{natlerr1944.sas7bdat}, \texttt{natlerr2006.sas7bdat} appear to be generated on the 'SunOS' platform (\textbf{u64}, non-Intel). \texttt{txzips.sas7bdat} was created on Linux 64-bit SAS server (\textbf{u64}, Intel). \texttt{eyecarex.sas7bdat} is non-Intel, possibly 32-bit PowerPC. The files \texttt{cfrance2.sas7bdat}, \texttt{cfrance.sas7bdat}, \texttt{coutline.sas7bdat}, \texttt{gfrance2.sas7bdat}, \texttt{gfrance.sas7bdat}, \texttt{goutline.sas7bdat}, \texttt{xfrance2.sas7bdat}, \texttt{xfrance.sas7bdat}, \texttt{xoutline.sas7bdat} appear to be generated on a 32-bit 'Linux' Intel system. They have the same format as Windows files, except for the (ignorable) OS strings in the first header. Text may appear in non-ASCII compatible, partially ASCII compatible, or multi-byte encodings. In particular, Kasper Sorenson discovered some text that appears to be encoded using the Windows-1252 'code page'. \textbf{Key Test Files} \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.301\DUtablewidth}|p{0.639\DUtablewidth}|} \hline \textbf{% filename } & \textbf{% format features } \\ \hline \endfirsthead \hline \textbf{% filename } & \textbf{% format features } \\ \hline \endhead \multicolumn{2}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot \texttt{acadindx.sas7bdat} & non-u64, Intel (most files are like this one) \\ \hline \texttt{br.sas7bdat} & truncated doubles (widths 3,4,6; compare with br2 widths all 8) \\ \hline \texttt{eyecarex.sas7bdat} & non-u64, non-Intel, written by Stat/Transfer \\ \hline \texttt{txzips.sas7bdat} & u64, Intel \\ \hline \texttt{natlterr1994.sas7bdat} & u64, non-Intel \\ \hline \texttt{hltheds2006.sas7bdat} & 2 Column Attributes subheaders \\ \hline \texttt{moshim.sas7bdat} & 3 Column Attributes subheaders \\ \hline \texttt{flightdelays.sas7bdat} & 2 Column Text subheaders \\ \hline \texttt{ymcls\_p2\_long\_040506.sas7bdat} & 5 Column Text subheaders, first Column Attributes subheader is on page 6 \\ \hline \texttt{flightschedule.sas7bdat} & 2+ Column Text subheaders \\ \hline \texttt{internationalflight.sas7bdat} & 2+ Column Text subheaders \\ \hline \texttt{marchflights.sas7bdat} & 2+ Column Text subheaders \\ \hline \texttt{mechanicslevel1.sas7bdat} & 2+ Column Text subheaders \\ \hline \texttt{compress\_yes.sas7bdat} & COMPRESS=CHAR, one PGTYPE=-28672, no RLE compression (COMP=0) \\ \hline \texttt{obs\_all\_perf\_1.sas7bdat} & COMPRESS=CHAR, many PGTYPE=16384, much RLE compression (COMP=4) \\ \hline \end{longtable*} \section{Compression Data% \label{compression-data}% } The table below presents the results of compression tests on a collection of 142 SAS7BDAT data files (sources in \texttt{data/}). The 'type' field represents the type of compression, 'ctime' is the compression time (in seconds), 'dtime' is the decompression time, and the 'compression ratio' field holds the cumulative disk usage (in megabytes) before and after compression. Although the \texttt{xz} algorithm requires significantly more time to compress these data, the decompression time is on par with gzip. \setlength{\DUtablewidth}{\linewidth}% \begin{longtable*}{|p{0.168\DUtablewidth}|p{0.086\DUtablewidth}|p{0.086\DUtablewidth}|p{0.307\DUtablewidth}|} \hline \textbf{% type } & \textbf{% ctime } & \textbf{% dtime } & \textbf{% compression ratio } \\ \hline \endfirsthead \hline \textbf{% type } & \textbf{% ctime } & \textbf{% dtime } & \textbf{% compression ratio } \\ \hline \endhead \multicolumn{4}{c}{\hfill ... continued on next page} \\ \endfoot \endlastfoot gzip -9 & 76.7s & 2.6s & 541M / 30.3M = 17.9 \\ \hline bzip2 -9 & 92.7s & 11.2s & 541M / 19.0M = 28.5 \\ \hline xz -9 & 434.2s & 2.7s & 541M / 12.8M = 42.3 \\ \hline \end{longtable*} \section{Software Prototype% \label{software-prototype}% } The prototype program for reading SAS7BDAT formatted files is implemented entirely in R (see file \texttt{src/sas7bdat.R}). Files not recognized as having been generated under a Microsoft Windows platform are rejected (for now). Implementation of the \texttt{read.sas7bdat} function should be considered a 'reference implementation', and not one designed with performance in mind. There are certain advantages and disadvantages to developing a prototype of this nature in R. Advantages: \begin{enumerate} \item R is an interpreted language with built-in debugger. Hence, experimental routines may be implemented and debugged quickly and interactively, without the need of external compiler or debugger tools (e.g. gcc, gdb). \item R programs are portable across a variety of computing platforms. This is especially important in the present context, because manipulating files stored on disk is a platform-specific task. Platform-specific operations are abstracted from the R user. \end{enumerate} Disadvantages: \begin{enumerate} \item Manipulating binary (raw) data in R is a relatively new capability. The best tools and practices for binary data operations are not as developed as those for other data types. \item Interpreted code is often much less efficient than compiled code. This is not major disadvantage for prototype implementations because human code development is far less efficient than the R interpreter. Gains made in efficient code development using an interpreted language far outweigh benefit of compiled languages. \end{enumerate} Another software implementation was made by Clint Cummins, in the TSP econometrics package (mainly as an independent platform for exploring the format). \section{ToDo% \label{todo}% } \begin{itemize} \item obtain test files which use COMPRESS=BINARY, and develop identification and uncompression procedures \item look for data which will reliably distinguish between structural subheaders (which have one of the known signatures) and uncompressed row data, which may have row data in the signature position that matches one of the known signatures. Both use COMP=0. Are NPSHD and NSHPL sufficient to do this? \item obtain test files with more than 2.1 billion (and more than 4.2 billion) data rows, i.e. where 8 byte integer TRC in \textbf{u64} is apparently needed. Do the non-u64 files handle this, with additional fields beyond the 4 byte TRC used for segmentation? Is TRC a (signed) int or (unsigned) uint? \item identify any SAS7BDAT encryption flag (this is not the same as 'cracking', or breaking encryption); we just identify if a file is encrypted and not readable without a key \item experiment further with 'amendment page' concept \item consider header bytes -by- SAS\_host \item check that only one page of type \textquotedbl{}mix\textquotedbl{} is observed. If so insert \textquotedbl{}In all test cases (\texttt{data/sources.csv}), there are exactly zero or one pages of type 'mix'.\textquotedbl{} under the \hyperref[page-offset-table]{Page Offset Table} header. {[}May not be needed, because the BC and SC fields in each Page Offset Table make the \hyperref[mrc]{MRC} field in the initial header unnecessary.{]} \item identify all missing value representations: missing numeric values appear to be represented as '0000000000D1FFFF' (nan) for numeric 'double' quantities. \item identify purpose of various unknown header quantities \item determine purpose of Column List subheader \item determine purpose and pattern of 'page sequence signature' fields. Are they useful? \item identify how non-ASCII encoding is specified \item implement R options to read just header (and subheader) information without data, and an option to read just some data fields, and not all fields. {[}The TSP implemenation already does this, and can also read a subset of the data rows.{]} \end{itemize} \end{document}