Honey, I Serialized the Data

Nan Xiao 2022-05-01 20 min read

The R code to reproduce the results in this post is available from https://github.com/nanxstats/r-serialize-timemachine.

Photo by Alex Gogan.

A mystery on serialize()

Serialization/deserialization is an important topic for exchanging data efficiently at scale. In R, there is a native choice for this: serialize()/unserialize() and their more convenient interface saveRDS()/readRDS().
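
As a minimal sketch of the two interfaces in a recent R version (the variable names are only for illustration), the same string can be round-tripped in memory and through a file:

```r
x <- "ABCDEF"

# In-memory round trip: connection = NULL returns a raw vector.
raw_bytes <- serialize(x, connection = NULL)
restored <- unserialize(raw_bytes)

# File round trip through the higher-level interface.
path <- tempfile(fileext = ".rds")
saveRDS(x, path, compress = FALSE)
from_file <- readRDS(path)

# The binary (XDR) output starts with the bytes 58 0a, that is, "X\n".
print(raw_bytes[1:2])
```

Setting compress = FALSE keeps the file content as plain serialized bytes, which is what makes it readable with a hex viewer.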

Yihui once asked why digest::digest() skips the first 14 bytes of R serialized data rather than the first 17 bytes for the binary format, given that the three additional zero-filled bytes are always there.

Although there is an entire section in R Internals about serialization formats, I did not find any detailed technical explanations about the bytes in the header. So I decided to collect more empirical evidence to answer the question.

An unlikely solution

My first assumption was that serializing the same data in different R versions, rather than different data in the same R version, would give us more information. The non-data section of the header likely changes only when the R version changes, which makes any minor variation easier to observe and interpret.

This solution then becomes a pure automation exercise. To maximize the number of R versions we can test, we need to choose the right platform.

  • We should avoid compiling from source because it is almost impossible to reuse the original toolchains after so many years. Using compiled R binaries would be our best bet.
  • To run all the previously compiled R binaries on a single, modern platform, we will want to choose Windows because it has probably the best ABI compatibility among the common platforms.

It eventually took about 130 lines of R code to accomplish this automation. The project is available at https://github.com/nanxstats/r-serialize-timemachine. The serialization results are listed in the table below.

R Version Hex value of serialized "ABCDEF"
1.9.1 58 0a 00 00 00 02 00 01 09 01 00 01 04 00 00 00 04 10 00 00 00 01 00 00 04 09 00 00 00 06 41 42 43 44 45 46
2.0.0 58 0a 00 00 00 02 00 02 00 00 00 01 04 00 00 00 04 10 00 00 00 01 00 00 04 09 00 00 00 06 41 42 43 44 45 46
2.0.1 58 0a 00 00 00 02 00 02 00 01 00 01 04 00 00 00 04 10 00 00 00 01 00 00 04 09 00 00 00 06 41 42 43 44 45 46
2.1.0 58 0a 00 00 00 02 00 02 01 00 00 01 04 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.1.1 58 0a 00 00 00 02 00 02 01 01 00 01 04 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.2.0 58 0a 00 00 00 02 00 02 02 00 00 01 04 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.2.1 58 0a 00 00 00 02 00 02 02 01 00 01 04 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.3.0 58 0a 00 00 00 02 00 02 03 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.3.1 58 0a 00 00 00 02 00 02 03 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.4.0 58 0a 00 00 00 02 00 02 04 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.4.1 58 0a 00 00 00 02 00 02 04 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.5.0 58 0a 00 00 00 02 00 02 05 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.5.1 58 0a 00 00 00 02 00 02 05 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.6.0 58 0a 00 00 00 02 00 02 06 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.6.1 58 0a 00 00 00 02 00 02 06 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.6.2 58 0a 00 00 00 02 00 02 06 02 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.7.0 58 0a 00 00 00 02 00 02 07 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.7.1 58 0a 00 00 00 02 00 02 07 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.7.2 58 0a 00 00 00 02 00 02 07 02 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.8.0 58 0a 00 00 00 02 00 02 08 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.8.1 58 0a 00 00 00 02 00 02 08 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.9.0 58 0a 00 00 00 02 00 02 09 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.9.1 58 0a 00 00 00 02 00 02 09 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.9.2 58 0a 00 00 00 02 00 02 09 02 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.10.0 58 0a 00 00 00 02 00 02 0a 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.10.1 58 0a 00 00 00 02 00 02 0a 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.11.0 58 0a 00 00 00 02 00 02 0b 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.11.1 58 0a 00 00 00 02 00 02 0b 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.12.0 58 0a 00 00 00 02 00 02 0c 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.12.1 58 0a 00 00 00 02 00 02 0c 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.12.2 58 0a 00 00 00 02 00 02 0c 02 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.13.0 58 0a 00 00 00 02 00 02 0d 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.13.1 58 0a 00 00 00 02 00 02 0d 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.13.2 58 0a 00 00 00 02 00 02 0d 02 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.14.0 58 0a 00 00 00 02 00 02 0e 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.14.1 58 0a 00 00 00 02 00 02 0e 01 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.14.2 58 0a 00 00 00 02 00 02 0e 02 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46
2.15.0 58 0a 00 00 00 02 00 02 0f 00 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
2.15.1 58 0a 00 00 00 02 00 02 0f 01 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
2.15.2 58 0a 00 00 00 02 00 02 0f 02 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
2.15.3 58 0a 00 00 00 02 00 02 0f 03 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.0.0 58 0a 00 00 00 02 00 03 00 00 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.0.1 58 0a 00 00 00 02 00 03 00 01 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.0.2 58 0a 00 00 00 02 00 03 00 02 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.0.3 58 0a 00 00 00 02 00 03 00 03 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.1.0 58 0a 00 00 00 02 00 03 01 00 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.1.1 58 0a 00 00 00 02 00 03 01 01 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.1.2 58 0a 00 00 00 02 00 03 01 02 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.1.3 58 0a 00 00 00 02 00 03 01 03 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.2.0 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.2.1 58 0a 00 00 00 02 00 03 02 01 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.2.2 58 0a 00 00 00 02 00 03 02 02 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.2.3 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.2.4 58 0a 00 00 00 02 00 03 02 04 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.2.5 58 0a 00 00 00 02 00 03 02 05 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.3.0 58 0a 00 00 00 02 00 03 03 00 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.3.1 58 0a 00 00 00 02 00 03 03 01 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.3.2 58 0a 00 00 00 02 00 03 03 02 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.3.3 58 0a 00 00 00 02 00 03 03 03 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.4.0 58 0a 00 00 00 02 00 03 04 00 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.4.1 58 0a 00 00 00 02 00 03 04 01 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.4.2 58 0a 00 00 00 02 00 03 04 02 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.4.3 58 0a 00 00 00 02 00 03 04 03 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.4.4 58 0a 00 00 00 02 00 03 04 04 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.5.0 58 0a 00 00 00 02 00 03 05 00 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.5.1 58 0a 00 00 00 02 00 03 05 01 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.5.2 58 0a 00 00 00 02 00 03 05 02 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.5.3 58 0a 00 00 00 02 00 03 05 03 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.6.0 58 0a 00 00 00 03 00 03 06 00 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.6.1 58 0a 00 00 00 03 00 03 06 01 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.6.2 58 0a 00 00 00 03 00 03 06 02 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
3.6.3 58 0a 00 00 00 03 00 03 06 03 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.0.0 58 0a 00 00 00 03 00 04 00 00 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.0.1 58 0a 00 00 00 03 00 04 00 01 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.0.2 58 0a 00 00 00 03 00 04 00 02 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.0.3 58 0a 00 00 00 03 00 04 00 03 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.0.4 58 0a 00 00 00 03 00 04 00 04 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.0.5 58 0a 00 00 00 03 00 04 00 05 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.1.0 58 0a 00 00 00 03 00 04 01 00 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.1.1 58 0a 00 00 00 03 00 04 01 01 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.1.2 58 0a 00 00 00 03 00 04 01 02 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.1.3 58 0a 00 00 00 03 00 04 01 03 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
4.2.0 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46
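
To make the comparison idea concrete, here is a small sketch that finds the byte positions where two rows of the table disagree. The two hex strings are copied from the R 2.14.2 and R 2.15.0 rows; the helper name split_hex is only for illustration.

```r
# Two rows copied from the table above.
hex_2_14_2 <- "58 0a 00 00 00 02 00 02 0e 02 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 41 42 43 44 45 46"
hex_2_15_0 <- "58 0a 00 00 00 02 00 02 0f 00 00 02 03 00 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46"

# Split a space-separated hex string into a character vector of bytes.
split_hex <- function(x) strsplit(x, " ", fixed = TRUE)[[1]]

a <- split_hex(hex_2_14_2)
b <- split_hex(hex_2_15_0)
stopifnot(length(a) == length(b))

# 1-based positions where the two serializations disagree.
diff_pos <- which(a != b)
print(data.frame(position = diff_pos, r_2_14_2 = a[diff_pos], r_2_15_0 = b[diff_pos]))
```

For this pair, the differing positions are 9, 10, and 24: the first two sit where the R version appears to be encoded, and the third sits in the body after the header. A position-by-position comparison like this only works when both rows have the same length, so it does not apply directly across the R 3.5.3 to 3.6.0 boundary.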

An evolving language

Through this small window, we can catch a glimpse of how the infrastructure in R has evolved over the last 20 years, tracing from the earliest release we can test (R 1.9.1, released in 2004):

  • The differences in the serialized data since R 3.6.0 are apparent. As you may recall, this is when serialization format version 3 became the default, although it had been available since R 3.5.0.
  • There are notable differences in R 4.2.0, although it still uses serialization format version 3: the bytes 43 50 31 32 35 32 (ASCII for CP1252) are replaced by 55 54 46 2d 38 (ASCII for UTF-8). Perhaps this is related to the UCRT update?
  • serialize() return value. We could not use serialize(connection = NULL) to produce our test payload directly, since it returned a character string instead of a raw vector until R 2.4.0. Therefore, we used the higher-level function saveRDS() as a proxy to get the outputs.
  • saveRDS() compression option. For cross-version comparison, we set saveRDS(compress = FALSE) because the default of compress was flipped to TRUE in R 2.4.0.
  • saveRDS() was called .saveRDS() before R 2.13.0.
  • Rscript.exe did not exist until R 2.5.0. Therefore, we used Rcmd.exe instead in the earlier versions.
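
The version-dependent quirks in the list above could be handled with a small lookup like the following. This is only a sketch under my own assumptions: plan_for_version() and its field names are hypothetical and are not taken from the project repository.

```r
# Hypothetical helper: pick version-specific settings for one R release.
plan_for_version <- function(version) {
  v <- numeric_version(version)
  list(
    # saveRDS() was called .saveRDS() before R 2.13.0.
    save_fun = if (v < "2.13.0") ".saveRDS" else "saveRDS",
    # Rscript.exe did not exist until R 2.5.0.
    runner = if (v < "2.5.0") "Rcmd.exe" else "Rscript.exe"
  )
}

print(plan_for_version("1.9.1"))
print(plan_for_version("2.10.0"))
```

Using numeric_version() instead of comparing strings matters here: as plain text, "2.10.0" sorts before "2.5.0", while as a version number it comes after.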

I think these are all very positive language and tooling improvements that benefit R developers every day. Consistency and compatibility in other aspects are also amazingly high: if you keep each R version after it is extracted into dist/, you can run every app/bin/Rgui.exe on the latest Windows 10 without issues.

A possible answer

Here is my answer to the original question on why the skipping offset should be 14 instead of 17.

The table above contains many 00 bytes that look like zero-byte fillers, so it is critical to know how these filler bytes are used. serialize() builds on the XDR format, and the corresponding RFC 1832 offers an informative example and some clues:

BASIC BLOCK SIZE

The representation of all items requires a multiple of four bytes (or 32 bits) of data. … If the n bytes needed to contain the data are not a multiple of four, then the n bytes are followed by enough (0 to 3) residual zero bytes, r, to make the total byte count a multiple of 4.
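
The padding rule in this quote is simple arithmetic. As a minimal sketch (the function name xdr_padding is mine, not from the RFC), the number of residual zero bytes r for n data bytes is:

```r
# Number of zero bytes needed to pad n data bytes to a multiple of four.
xdr_padding <- function(n) (4 - n %% 4) %% 4

# n = 4 needs no padding; n = 5 needs three zero bytes; n = 6 needs two.
print(xdr_padding(c(4, 5, 6, 7)))
```

This describes XDR as specified in the RFC; whether every field written by serialize() follows the same rule is a separate question that the table can help check.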

Using R 4.2.0 as an example, the serialized "ABCDEF" is:

58 0a
00 00 00 03
00 04 02 00
00 03 05 00
00 00 00 05
...

We can annotate it like this:

OFFSET      HEX BYTES       ASCII    COMMENTS
------      ---------       -----    --------
     0      58 0a           X\n      -- X (XDR format) and line break
     2      00 00 00 03     ...3     -- serialization format version = 3
     6      00 04 02 00     .420     -- current R version = 4.2.0
    10      00 03 05 00     .350     -- format available since 3.5.0
    14      00 00 00 05     ...5     -- serialized data starting
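
The annotation can also be reproduced in code. The sketch below parses the R 4.2.0 row from the table, assuming that each header field after the two-byte prefix is a 4-byte big-endian integer and that R versions are packed as major * 65536 + minor * 256 + patch, which matches the bytes shown. The helper names read_int and unpack_version are hypothetical. The last field relies on my reading of the ?serialize help page, which says that version 3 also stores the native encoding at serialization time, so treat it as an interpretation rather than a confirmed layout.

```r
# Bytes for R 4.2.0, copied from the table above.
hex <- "58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 06 41 42 43 44 45 46"
bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))

# Read a 4-byte big-endian integer starting at a 0-based offset.
read_int <- function(x, offset) {
  readBin(x[offset + 1:4], what = "integer", size = 4L, endian = "big")
}

# Unpack a version stored as major * 65536 + minor * 256 + patch.
unpack_version <- function(v) {
  paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep = ".")
}

# Assumption: in format version 3, offset 14 holds the length of the
# native encoding name, and the name itself follows immediately.
enc_len <- read_int(bytes, 14)

header <- list(
  magic = rawToChar(bytes[1:2]),
  format_version = read_int(bytes, 2),
  writer_version = unpack_version(read_int(bytes, 6)),
  min_reader_version = unpack_version(read_int(bytes, 10)),
  native_encoding = rawToChar(bytes[18 + seq_len(enc_len)])
)
str(header)
```

Under these assumptions, the parsed values are format version 3, writer version 4.2.0, minimum reader version 3.5.0, and native encoding UTF-8. Read this way, the integer 5 at offset 14 would be the length of the string "UTF-8".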

This is a rough hypothesis, and I could be wrong. So, don’t be shy and leave a comment to add the correct explanation.