Finalization and Weak References in R

Luke Tierney
School of Statistics
University of Minnesota

Introduction

This note describes a preliminary implementation of weak references for R that subsumes the finalization mechanism added to R 1.2. The interface (except for the finalization component) is definitely subject to change.

Interfaces

For now, there is only a C level interface. An R level interface can easily be built, but there are some issues that suggest this should perhaps not be done until we have threading support.

Finalization Interface

<finalization interface>=
typedef void (*R_CFinalizer_t)(SEXP);
void R_RegisterFinalizer(SEXP s, SEXP fun);
void R_RegisterCFinalizer(SEXP s, R_CFinalizer_t fun);
void R_RegisterFinalizerEx(SEXP s, SEXP fun, Rboolean onexit);
void R_RegisterCFinalizerEx(SEXP s, R_CFinalizer_t fun, Rboolean onexit);

Weak Reference Interface

<weak reference interface>=
SEXP R_MakeWeakRef(SEXP key, SEXP val, SEXP fin, Rboolean onexit);
SEXP R_MakeWeakRefC(SEXP key, SEXP val, R_CFinalizer_t fin, Rboolean onexit);
SEXP R_WeakRefKey(SEXP w);
SEXP R_WeakRefValue(SEXP w);
void R_RunWeakRefFinalizer(SEXP w);

The design of the weak reference system is based on the one used in the Glasgow Haskell system. A weak reference contains a key and a value. Values are reachable if they are reachable directly from roots or through weak references with reachable keys. Whether or not the weak reference itself is reachable does not matter. This recursive definition requires a fixed point calculation to determine the reachable nodes.

When the collector determines that a key in a weak reference is no longer reachable, the key and value of the reference are replaced by R_NilValue and the finalizer is scheduled to run.
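
The fixed point calculation can be illustrated with a small stand-alone sketch. This is a toy model and not R's collector: the heap, weakref, mark_from, and collect names are hypothetical, objects are plain array indices, and each object is assumed to hold at most one strong pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OBJS 16   /* illustrative capacity of the toy heap */

/* A weak reference in the toy model: indices into the object table,
   plus a flag recording that its finalizer has been scheduled. */
typedef struct {
    int key;          /* object index, or -1 once cleared */
    int value;        /* object index, or -1 once cleared */
    bool scheduled;   /* finalizer queued to run */
} weakref;

/* Toy heap: each object holds at most one strong pointer (-1 for none). */
typedef struct {
    int strong[MAX_OBJS];
    bool marked[MAX_OBJS];
    size_t nobjs;
} heap;

/* Mark obj and everything reachable from it through strong pointers. */
static void mark_from(heap *h, int obj)
{
    while (obj >= 0 && !h->marked[obj]) {
        h->marked[obj] = true;
        obj = h->strong[obj];
    }
}

/* Mark phase with weak references.  Returns the number of weak
   references whose keys were unreachable and were therefore cleared. */
static int collect(heap *h, const int *roots, size_t nroots,
                   weakref *refs, size_t nrefs)
{
    size_t i;
    int cleared = 0;
    bool changed;

    assert(h->nobjs <= MAX_OBJS);
    for (i = 0; i < h->nobjs; i++) h->marked[i] = false;
    for (i = 0; i < nroots; i++) mark_from(h, roots[i]);

    /* Fixed point: a value becomes reachable once its key is reachable,
       and that may in turn make further keys reachable. */
    do {
        changed = false;
        for (i = 0; i < nrefs; i++) {
            int k = refs[i].key, v = refs[i].value;
            if (k >= 0 && h->marked[k] && v >= 0 && !h->marked[v]) {
                mark_from(h, v);
                changed = true;
            }
        }
    } while (changed);

    /* Clear references with unreachable keys and schedule finalizers. */
    for (i = 0; i < nrefs; i++) {
        if (refs[i].key >= 0 && !h->marked[refs[i].key]) {
            refs[i].key = refs[i].value = -1;
            refs[i].scheduled = true;
            cleared++;
        }
    }
    return cleared;
}
```

Note that the reachability of the weakref records themselves never enters the calculation; only the keys matter. A chain in which the value of one reference is the key of another may need more than one pass of the loop, which is why it repeats until nothing changes.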

The finalization interface is layered on top of the weak reference system. For example, the R_RegisterCFinalizerEx function is just

<finalization implementation>=
void R_RegisterCFinalizerEx(SEXP s, R_CFinalizer_t fun, Rboolean onexit)
{
    R_MakeWeakRefC(s, R_NilValue, fun, onexit);
}

Problems With An R Level Interface

Currently the collector runs finalizers after each collection. This means that finalizations conceptually run concurrently with other R code, and thus there is the potential for interference between finalization and non-finalization code. A conceptually cleaner approach, and the approach used in Haskell, Java, and other systems, would have the finalizations run by a separate thread. This would then allow thread synchronization mechanisms to be used to deal with any potential interference.

R does not yet have thread support, so this is not an option. With code written in C there is complete control over where a GC could occur, and hence where a finalizer might be run. This allows safe code to be written in C. In R we cannot control when the collector runs and hence cannot control when finalizers might run.

A possible interim solution might be to allow finalizations to be suspended temporarily from R, for example allowing

<suspend finalizations in R>=
without.finalizations(expr)

The main drawback of doing this is that it does not make sense in a threaded context, and code that uses it would have to be changed once we add threads. On the other hand, very little code will use this, and all uses would be easy enough to find.
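
What suspension might mean can be sketched outside of R as follows. Everything here is hypothetical: schedule_finalizer stands in for the collector finding an unreachable key, and without_finalizations takes a C callback where the R version would take an expression. The sketch also ignores non-local exits, which a real implementation would have to handle so that the suspension count is restored after an error.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 32   /* illustrative limit on deferred finalizers */

typedef void (*finalizer_fn)(void *);

static struct { finalizer_fn fun; void *obj; } pending[MAX_PENDING];
static size_t npending = 0;
static int suspend_depth = 0;   /* > 0 while finalizers are deferred */

/* Run and discard every deferred finalizer, most recent first. */
static void run_pending(void)
{
    while (npending > 0) {
        npending--;
        pending[npending].fun(pending[npending].obj);
    }
}

/* Stand-in for the collector: obj has become unreachable. */
static void schedule_finalizer(finalizer_fn fun, void *obj)
{
    assert(npending < MAX_PENDING);
    pending[npending].fun = fun;
    pending[npending].obj = obj;
    npending++;
    if (suspend_depth == 0) run_pending();
}

/* Call body(arg) with finalizers deferred; calls may be nested. */
static void without_finalizations(void (*body)(void *), void *arg)
{
    suspend_depth++;
    body(arg);
    suspend_depth--;
    if (suspend_depth == 0) run_pending();
}

/* Example finalizer: bump the counter the object points to. */
static void bump(void *obj) { ++*(int *)obj; }

/* Example body: a "collection" happens inside the protected region,
   and the body records the counter value it observed afterwards. */
static int seen_inside = -1;
static void critical(void *arg)
{
    int *count = arg;
    schedule_finalizer(bump, count);
    seen_inside = *count;
}
```

With a counter initialized to zero, calling without_finalizations(critical, &count) leaves seen_inside at zero, since the finalizer is held back until the body returns, and count at one afterwards.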

Another approach would be to borrow from MzScheme's weak boxes the idea that an object ready for finalization be placed on some sort of queue and leave it to the programmer to run the finalizers in that queue periodically. With threads, the default queue could be one that is managed by a system finalization thread, but an alternate queue could be provided if needed.
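
A minimal sketch of the queue idea, again outside of R and with hypothetical names (fin_queue, fin_queue_push, fin_queue_drain), might look like this. The collector only records that an object is ready; nothing runs until the programmer drains the queue at a point of their choosing.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 32   /* illustrative capacity */

typedef void (*finalizer_fn)(void *);

/* FIFO ring buffer of objects that are ready for finalization. */
typedef struct {
    struct { finalizer_fn fun; void *obj; } items[QUEUE_CAP];
    size_t head, count;
} fin_queue;

/* Collector side: record that obj is ready for finalization.
   Returns 0 on success and -1 if the queue is full. */
static int fin_queue_push(fin_queue *q, finalizer_fn fun, void *obj)
{
    size_t tail;
    assert(q != NULL);
    if (q->count == QUEUE_CAP) return -1;
    tail = (q->head + q->count) % QUEUE_CAP;
    q->items[tail].fun = fun;
    q->items[tail].obj = obj;
    q->count++;
    return 0;
}

/* Programmer side: run the queued finalizers in the order they were
   queued.  Returns the number of finalizers that were run. */
static int fin_queue_drain(fin_queue *q)
{
    int n = 0;
    assert(q != NULL);
    while (q->count > 0) {
        finalizer_fn fun = q->items[q->head].fun;
        void *obj = q->items[q->head].obj;
        /* Remove the entry before calling, so a finalizer may push. */
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        fun(obj);
        n++;
    }
    return n;
}

/* Example finalizer: bump the counter the object points to. */
static void bump(void *obj) { ++*(int *)obj; }
```

In a threaded system the default queue would be drained by the finalization thread, while code that needs tighter control could push to its own fin_queue and call fin_queue_drain where interference is known to be harmless.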

Example: Simple File Stream Interface

This example is available as a package wfile.

A simple interface to the fopen and fclose calls could be implemented using external pointer objects to represent file streams and finalization to ensure files are closed. Weak references allow us to maintain a list of open files without preventing the garbage collection of unreachable files.

The internal portions of the interface might consist of a file wfile.c and the R portions might be in wfile.R.

<wfile.c>=
#include <stdio.h>
#include "Rinternals.h"
#include "R_ext/Rdynload.h"
<wfile.c declarations>
<wfile.c globals and macros>
<wfile.c functions>

<wfile.R>=
<wfile.R public functions>
<wfile.R initialization function>

File Stream Representation

To allow some type checking on the file pointer, we use a symbol with a reasonably unique name as a type tag. This symbol is stored in a local static variable; it is initialized by calling the package initialization function.

<wfile.c globals and macros>= (<-U) [D->]
static SEXP WFILE_type_tag;
Defines WFILE_type_tag (links are to index).

<initialize type tag>= (U->)
WFILE_type_tag = install("WFILE_TYPE_TAG");

Checking of a file stream argument is done by the macro CHECK_WFILE_STREAM:

<wfile.c globals and macros>+= (<-U) [<-D->]
#define CHECK_WFILE_STREAM(s) do { \
    if (TYPEOF(s) != EXTPTRSXP || \
        R_ExternalPtrTag(s) != WFILE_type_tag) \
        error("bad file stream"); \
} while (0)
Defines CHECK_WFILE_STREAM (links are to index).

An alternative to using a symbol as the type identifier would be to use an arbitrary allocated object, which would then have to be stored in the precious list. The advantage would be complete uniqueness within the session; the drawback is somewhat unclear semantics across save/load.

Opening and Closing File Streams

The R function fopen passes its file name and mode arguments to the C function WFILE_open.

<wfile.R public functions>= (U->) [D->]
fopen <- function(name, mode = "r")
    .Call("WFILE_open", as.character(name), as.character(mode),
          PACKAGE="wfile")
Defines fopen (links are to index).

The C function WFILE_open opens the file and creates a weak reference to register a finalizer and store the name of the file stream while it is reachable.

<wfile.c declarations>= (<-U) [D->]
static SEXP WFILE_open(SEXP name, SEXP mode);
Defines WFILE_open (links are to index).

<wfile.c functions>= (<-U) [D->]
static SEXP WFILE_open(SEXP name, SEXP mode)
{
    FILE *f = fopen(CHAR(STRING_ELT(name, 0)), CHAR(STRING_ELT(mode, 0)));
    if (f == NULL)
        return R_NilValue;
    else {
        SEXP val, ref;
        PROTECT(val = R_MakeExternalPtr(f, WFILE_type_tag, R_NilValue));
        PROTECT(ref = R_MakeWeakRefC(val, name,
                                     (R_CFinalizer_t) WFILE_close, TRUE));
        AddFileRef(ref);
        UNPROTECT(2);
        return val;
    }
}
Defines WFILE_open (links are to index).

The R function fclose just calls the C function WFILE_close:

<wfile.R public functions>+= (U->) [<-D->]
fclose <- function(stream)
    .Call("WFILE_close", stream, PACKAGE="wfile")
Defines fclose (links are to index).

The C function WFILE_close closes the stream and clears the pointer unless the pointer is already NULL, which would indicate that the file has already been closed.

<wfile.c declarations>+= (<-U) [<-D->]
static SEXP WFILE_close(SEXP s);
Defines WFILE_close (links are to index).

<wfile.c functions>+= (<-U) [<-D->]
static SEXP WFILE_close(SEXP s)
{
    FILE *f;
    CHECK_WFILE_STREAM(s);
    f = R_ExternalPtrAddr(s);
    if (f != NULL) {
        fclose(f);
        R_ClearExternalPtr(s);
    }
    return R_NilValue;
}
Defines WFILE_close (links are to index).

If a file stream is closed by user code, there is no longer any need for finalization. Providing a mechanism for removing finalizers is more trouble than it is worth, however, so the finalization mechanism will eventually call WFILE_close on the stream anyway; nothing much will happen, since the stream pointer will already have been cleared. This issue needs to be kept in mind when designing finalizer functions.
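
The general pattern is a close operation that is safe to call more than once. A stand-alone sketch, with a hypothetical stream_cell structure playing the role of the external pointer, could be:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the external pointer: a mutable cell
   holding the FILE pointer, or NULL once the stream is closed. */
typedef struct {
    FILE *fp;
} stream_cell;

/* Close the stream if it is still open.  Returns 1 if this call closed
   the file and 0 if there was nothing left to do. */
static int stream_close(stream_cell *s)
{
    assert(s != NULL);
    if (s->fp == NULL)
        return 0;
    fclose(s->fp);
    s->fp = NULL;     /* clear so that later calls are no-ops */
    return 1;
}
```

An explicit call followed by a later call from a finalizer then closes the file exactly once: the first call returns 1 and the second returns 0.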

Reading Lines From The Stream

Just to have something to do with these file pointers, we can add a simple fgets function that uses a fixed size buffer.

<wfile.R public functions>+= (U->) [<-D->]
fgets <- function(stream) .Call("WFILE_gets", stream, PACKAGE="wfile")
Defines fgets (links are to index).

<wfile.c declarations>+= (<-U) [<-D->]
static SEXP WFILE_gets(SEXP s);

<wfile.c functions>+= (<-U) [<-D->]
static SEXP WFILE_gets(SEXP s)
{
    char buf[512];
    FILE *f;
    CHECK_WFILE_STREAM(s);
    f = R_ExternalPtrAddr(s);
    if (f == NULL)
        error("file pointer is NULL");
    if (fgets(buf, sizeof(buf), f) == NULL)
        return R_NilValue;
    else {
        SEXP val;
        PROTECT(val = allocVector(STRSXP, 1));
        SET_STRING_ELT(val, 0, mkChar(buf));
        UNPROTECT(1);
        return val;
    }
}
Defines WFILE_gets (links are to index).
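
One consequence of the fixed size buffer is that a line longer than the buffer comes back in several pieces. The following stand-alone sketch shows how a caller could tell the cases apart; read_piece is a hypothetical helper and is not part of the wfile package.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Read at most size-1 characters of the current line into buf.
   Returns -1 at end of file or on a read error, 1 if buf ends with a
   newline (a complete line or the final piece of one), and 0 if no
   newline was seen: either the line is longer than the buffer and more
   of it remains, or the file ended without a trailing newline. */
static int read_piece(FILE *f, char *buf, size_t size)
{
    size_t len;
    assert(size > 1 && size <= INT_MAX);
    if (fgets(buf, (int) size, f) == NULL)
        return -1;
    len = strlen(buf);
    return len > 0 && buf[len - 1] == '\n';
}
```

A caller that wants whole lines would keep calling read_piece and appending the pieces until it returns something other than 0.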

Managing The List Of Open Files

The table of open files is contained in a variable FileList. The value is a CONS cell that is registered as a permanent object. The actual list is stored in the CDR of the cell.

<wfile.c globals and macros>+= (<-U) [<-D->]
static SEXP FileList;
Defines FileList (links are to index).

<initialize file list>= (U->)
FileList = CONS(R_NilValue, R_NilValue);
R_PreserveObject(FileList);

This should probably be in a public header file:

<wfile.c declarations>+= (<-U) [<-D->]
extern void R_PreserveObject(SEXP);
Defines R_PreserveObject (links are to index).

A new file is added to the list with AddFileRef.

<wfile.c declarations>+= (<-U) [<-D->]
static void AddFileRef(SEXP ref);
Defines AddFileRef (links are to index).

<wfile.c functions>+= (<-U) [<-D->]
static void AddFileRef(SEXP ref)
{
    SEXP f, files, next = NULL, last = NULL;
    files = CDR(FileList);
    for (f = files; f != R_NilValue; f = next) {
        SEXP old = CAR(f);  /* existing entry; avoids shadowing the ref argument */
        SEXP key = R_WeakRefKey(old);
        next = CDR(f);
        if (key == R_NilValue ||  R_ExternalPtrAddr(key) == NULL) {
            if (last == NULL) files = next;
            else SETCDR(last, next);
        }
        else last = f;
    }
    SETCDR(FileList, CONS(ref, files));
}
Defines AddFileRef (links are to index).

The function flist returns a list of the names, as specified to fopen, of the open files.

<wfile.R public functions>+= (U->) [<-D]
flist <- function() .Call("WFILE_list", PACKAGE="wfile")
Defines flist (links are to index).

<wfile.c declarations>+= (<-U) [<-D]
static SEXP WFILE_list(void);
Defines WFILE_list (links are to index).

<wfile.c functions>+= (<-U) [<-D->]
static SEXP WFILE_list(void)
{
    SEXP files, val = R_NilValue;
    for (files = CDR(FileList); files != R_NilValue; files = CDR(files)) {
        SEXP ref = CAR(files);
        SEXP key = R_WeakRefKey(ref);
        if (key != R_NilValue && R_ExternalPtrAddr(key) != NULL) {
            PROTECT(key);
            val = CONS(R_WeakRefValue(ref), val);
            UNPROTECT(1);
        }
    }
    return PairToVectorList(val);
}
Defines WFILE_list (links are to index).

The list returned reflects files that were open at some time while this routine was running. It is possible for files at the end of the list to be closed by a collection triggered by an allocation needed for adding items to the beginning of the list. A more sophisticated implementation would return a list of the file objects, and these objects would provide access to their file names.

Package Initialization

The routine registration entry for the package is

<wfile.c globals and macros>+= (<-U) [<-D]
static R_CallMethodDef WFILE_CallDefs[] = {
    {"WFILE_open", (DL_FUNC) WFILE_open, 2},
    {"WFILE_close", (DL_FUNC) WFILE_close, 1},
    {"WFILE_gets", (DL_FUNC) WFILE_gets, 1},
    {"WFILE_list", (DL_FUNC) WFILE_list, 0},
    {NULL}
};
Defines WFILE_CallDefs (links are to index).

The initialization routines are

<wfile.c functions>+= (<-U) [<-D]
void R_init_wfile(DllInfo *info)
{
    <initialize type tag>
    <initialize file list>
    R_registerRoutines(info, NULL, WFILE_CallDefs, NULL, 0);
}
Defines R_init_wfile (links are to index).

<wfile.R initialization function>= (U->)
.First.lib <- function(lib, pkg) {
    library.dynam( "wfile", pkg, lib )
}
Defines .First.lib (links are to index).

Sample Usage

Load the package and open some files:

> library(wfile)
> f<-fopen("simpleref.nw")
> g<-fopen("weakfinex.nw")

The list of open files:

> flist()
[[1]]
[1] "simpleref.nw"

[[2]]
[1] "weakfinex.nw"

Read a few lines from each:

> fgets(g)
[1] "% -*- mode: Noweb; noweb-code-mode: c-mode -*-\n"
> fgets(g)
[1] "\n"
> fgets(f)
[1] "% -*- mode: Noweb; noweb-code-mode: c-mode -*-\n"
> fgets(f)
[1] "\n"

Now drop the reference to f, run the garbage collector and look at the new list of open files:

> f<-NULL
> gc()
         used (Mb) gc trigger (Mb)
Ncells 194292  5.2     407500 10.9
Vcells  37333  0.3     786432  6.0
> flist()
[[1]]
[1] "weakfinex.nw"

If we open a new file and explicitly close g, then the result will also be reflected in the open file list:

> f<-fopen("weakfin.nw")
> fclose(g)
NULL
> flist()
[[1]]
[1] "weakfin.nw"