Refactoring Rust Transpiled from C

Memory management errors are responsible for 70 to 80 percent of security vulnerabilities¹ ². This would be much less of a problem if computers were running more Rust and less (if any) C code.

So do we rewrite every widely used C project in Rust? Doing so entirely by hand is infeasible. Doing so in an entirely mechanized way is impossible. Between these extremes lies a point that strikes the optimal balance between the efficiency of automation and the intelligence of humans.

The C2Rust project shows that most C code can be converted into something that is technically Rust. That is not to say that we’ve exhausted all opportunities for automation; far from it. What we can do today is making syntactic changes. The structure of the program is preserved including the way memory is managed. Therefore, the Rust output is as unsafe as the input C. To take advantage of Rust’s type system, we must also make semantic changes so the code comports with Rust’s ownership-based model which is how we enforce memory safety.

This post explores how the current output of the C2Rust translator can be further hardened. Most blog posts on porting C code to Rust rely on human intelligence. The goal here is to explore the smallest set of changes we can make to get to safe Rust without a human in the loop. The idea is that the simpler the changes, the greater the odds that we can automate them.

Retyping the Central Data Structure of a Library

To keep things manageable, we’ll work on a small and easy-to-test buffer library.

If we grep for malloc, we see code like this: buffer_t *self = malloc(sizeof(buffer_t)); so we’ll focus on the buffer_t structure. Is defined like this in C and Rust (transpiled from C):

// C definition 
typedef struct {
 size_t len;
 char *alloc;
 char *data;
} buffer_t;

// Transpiled Rust
#[repr(C)]
pub struct buffer_t {
   pub len: size_t,
   pub alloc: *mut libc::c_char,
   pub data: *mut libc::c_char,
}

Notice that the transpiled Rust code uses raw pointers for the alloc and data fields. To retype these fields, we must first analyze how they are used. Let’s take a closer look at how buffer_t structures are allocated and deallocated.

// C allocation 
buffer_t *
buffer_new_with_size(size_t n) {
 buffer_t *self =   
   malloc(sizeof(buffer_t));
 if (!self) return NULL;
 self->len = n;
 self->data = self->alloc = calloc(n + 1, 1);
 return self;
}

// C deallocation
void
buffer_free(buffer_t *self) {
 free(self->alloc);
 free(self);
}

The next to last statement in buffer_new_with_size tells us that data and alloc initially point to the heap allocation returned from calloc. From buffer_free we learn that alloc owns the pointer returned by calloc meaning that data is probably a view into the allocation. Looking at some functions that use the data pointer confirms the hypothesis that data is a view into alloc.

void
buffer_trim_left(buffer_t *self) {
  int c;
  while ((c = *self->data) && isspace(c)) {
    ++self->data;
  }
}

int
buffer_equals(buffer_t *self, buffer_t *other) {
  return 0 == strcmp(self->data, other->data);
}

As programmers, we recognize these functions as string operations whereas an automated analysis would pay no attention to the function names and thus only see operations on strings. With our human knowledge, we can confirm that data points to the beginning of the string which may or may not coincide with the beginning of alloc. For automation purposes, we can look at all the way data and alloc can interact with well-known C APIs. For instance, we see that alloc is passed as the first parameter to realloc which means it must point to an allocation returned by an allocation function such as malloc. The call to realloc suggests that wes should retype alloc as a Vec<c_char> rather than Box<[c_char]>. As for data, we can statically determine that it initially points to the same memory as alloc but the pointer manipulations are (or can easily be) too complex to reason about with static program analysis. What we can do instead, is to run the program and see if any program execution violates the assumption that data points into alloc (or one element beyond if not dereferenced) in the same buffer_t instance. A dynamic analysis can also help us observe the lifetimes of data or alloc pointers in relation to the lifetimes of buffer_t instances.

Now, let’s consider a couple of ways to express buffer_t in the Rust type system.

// Unsafe
#[repr(C)]
pub struct buffer_t {
   pub len: size_t,
   pub alloc: *mut c_char,
   pub data: *mut c_char,
}

First try

// data as mutable slice of alloc
pub struct buffer_t<'a> {
   pub len: size_t,
   pub alloc: Vec<c_char>,
   pub data: &'a mut [c_char],
}

Since our starting point is a mutable raw pointer, the closest safe equivalent is a mutable reference. That gets us into trouble with Rust’s borrow checker once we start instantiating buffers however. The problem is that buffer_t is self-referential because the data field is initialized by borrowing the structure itself (buf.data = &mut buf.alloc). Unlike in C, Rust moves structures around which is troublesome because self-referential structures cannot simply be copied from one memory address to another.

Second try

// alloc as Rc<..> data as Weak<..>
pub struct buffer_t {
   pub len: size_t,
   pub alloc: Rc<Vec<c_char>>,
   pub data: Weak<Vec<c_char>>
}

When borrowing patterns are too complex to be understood by the compiler, we can use Rust’s smart pointer types to have ownership rules enforced at runtime instead. At first glance, it might seem that we can model the relationship between the owning pointer (alloc) and non-owning pointer (data) using the Rc and Weak smart-pointer types respectively. However, that doesn’t quite satisfy our needs since data needs to be able to point to any element of alloc, not just the first one. We could of course type data as a tuple: (Weak<Vec<c_char>>, size_t) but at that point, we don’t really need the reference anymore.

Third try

// data as index into alloc
pub struct buffer_t {
   pub len: size_t,
   pub alloc: Vec<c_char>,
   pub data: size_t,
}

We’ve arrived at a common workaround in Rust: using indexes in place of references. This means that in every expression where data was used as a pointer, we create a slice of alloc starting at data like so:

impl buffer_t {
   pub fn data_slice(&self) -> &[libc::c_char] {
       &self.alloc[self.data as usize..]
   }
   pub fn data_mut_slice(&mut self) -> &mut [libc::c_char] {
       &mut self.alloc[self.data as usize..]
   }
}

This means that we must call one of these two functions everywhere data is dereferenced in the original C code.

No matter how we use buffer_t outside of unsafe blocks, it must adhere to Rust’s ownership rules. For instance, we cannot store the return value of data_mut_slice into a variable and then access alloc while it is borrowed mutably. The equivalent pattern in C is perfectly legitimate (as long as the pointers do not have the restrict qualifier). This means some uses of buffer_t may fall outside of the set of patterns we can reasonably expect to handle automatically. Luckily, we did not encounter any such patterns in this code base.

Propagating the Change

Let’s see what the buffer_new_with_size function looks like using our safe buffer_t struct:

pub fn buffer_new_with_size(mut n: size_t) -> buffer_t {
    let mut self_0 = buffer_t { len: 0, alloc: vec![], data: 0 };
    self_0.len = n;
    self_0.alloc = vec![0; n as usize + 1];
    self_0.data = 0;
    self_0
}

Two things stand out. First, the code contains no unsafe code, which is an improvement over the initial code output by the transpiler. Second, the buffer structure is initialized twice. A human can easily detect and remove such redundancy by moving the expressions that initialize each field into the instantiation of the buffer_t struct. Automatically rewriting the code to avoid double initialization is, in the general case, much harder. This is because we’d have to make certain that there are no data dependencies that prevent such reordering; reasoning about data dependencies in presence of loops and pointer aliasing is often not possible. We could also have used MaybeUninit<buffer_t> to avoid double initialization without having to reorder any statements. Writing to a MaybeUninit instance is an unsafe operation, however, so this strategy would have left us with some initialization code in an unsafe block. Sometimes the juice just isn’t worth the squeeze.

Next, let’s update buffer_trim_left and buffer_equals to use the safe buffer_t:

pub fn buffer_trim_left(self_0: &mut buffer_t) {
   loop {
       let mut c = self_0.data_slice()[0] as c_int;
       if !(c != 0 && isspace(c) != 0) {
           break;
       }
       self_0.data += 1;
   }
}
pub fn buffer_equals(mut self_0: &buffer_t, mut other: &buffer_t) -> c_int {
   (strcmp(self_0.data_slice(), other.data_slice()) == 0) as c_int
}

Again we note that a Rust programmer could certainly express these functions more elegantly. For instance, buffer_trim_left could use the position iterator to find the first non-null character. At the same time, we have gone from having entirely unsafe Rust code to fully safe code in a way that lends itself to automation. One might object that strcmp is an unsafe function but note that it is accepting two slice arguments – it is actually a safe replacement that we provide to help change from raw pointers to slices. We intend to provide safe replacement functions for the most common C APIs to avoid having to convert safe Rust types such as slices back to raw pointers that we can pass to external, unsafe library code.

Finally, let’s look at buffer_resize and buffer_fill in C and safe, transpiled Rust.

int 
buffer_resize(buffer_t *self, size_t n) {
  n = nearest_multiple_of(1024, n);
  self->len = n;
  self->alloc = self->data = realloc(self->alloc, n + 1);
  if (!self->alloc) return -1;
  self->alloc[n] = '\0';
  return 0;
}

pub fn buffer_resize(
    self_0: &mut buffer_t,
    mut n: size_t
) -> libc::c_int {
    n = nearest_multiple_of(1024, n);
    self_0.len = n;
    self_0.data = 0;
    self_0.alloc.resize_with(n + 1, Default::default);
    // this store is dead because resize_with zero-initializes
    self_0.alloc[n] = 0;
    return 0;
}

The call to realloc (and the early return on error) has been replaced with a call to resize_with. This performs more work than realloc since the new slice elements are zero-initialized but is otherwise an entirely safe replacement for the C API.

buffer_t * 
buffer_slice(buffer_t *buf, size_t from, ssize_t to) {
  size_t len = strlen(buf->data);
  if (to < from) return NULL;   // bad range
  if (to < 0) to = len - ~to;   // relative to end
  if (to > len) to = len;       // cap end
  size_t n = to - from;
  buffer_t *self = buffer_new_with_size(n);
  memcpy(self->data, buf->data + from, n);
  return self;
}

pub fn buffer_slice(
    mut buf: &buffer_t,
    mut from: size_t,
    mut to: ssize_t,
) -> Option<buffer_t> {
    let mut len: size_t = strlen(buf.data_slice());
    if (to as size_t) < from { return None }      // bad range
    if to < 0 {
        to = (len - (!to as usize)) as ssize_t
    }                                             // relative to end
    if to as size_t > len { to = len as ssize_t } // cap end
    let mut n: size_t = (to as size_t) - from;
    let mut self_0 = buffer_new_with_size(n);
    let src = &buf.data_slice()[from..from + n];  // buf-> data + from
    let dst = &mut self_0.data_mut_slice()[..n];  // self->data
    dst.copy_from_slice(src);                     // memcpy(dst,src,n)
    return Some(self_0);
}

The buffer_slice function is interesting because the C version returns a null pointer to signal errors. Moreover, it creates a new buffer and transfers ownership of it to the caller; thus the appropriate return type is Option<buffer_t> and we must refactor the caller to properly handle options. In this case, it is clear from the source code that the return value can be null; in other complex cases, we may need to execute the code to find out. Finally, the memcpy call in the C version is replaced by a call to copy_from_slice. That is straightforward but translating the pointer expressions into slice expressions is more difficult and it is thus likely that we will be limited in the types of pointer expressions we’ll be able to handle.

The converted buffer library and a few safe helper methods are available on this branch.

The Road to Automation

Let’s step back and recap what we learned from refactoring a small string handling library:

C memory management APIs provide us with a lot of useful information. By looking at malloc calls in the sources, we identified the data structure we should focus on and by looking at calls to free helps us differentiate owning pointers from non-owning pointers. Similarly, calls to calloc is an indicator that the allocation is likely for an array-like type.
We likely need to execute programs to observe how data is referenced through pointers. Because C puts few restrictions on pointers, their relationships to allocations and other pointers are very hard to determine via static analysis. The drawback is that the programs we can handle must include a large test suite or some other way to execute a large number of paths through the code we are migrating. Guided fuzzing can help but it is no silver bullet.
We believe that it is feasible to eliminate large swaths of unsafe code automatically. At the same time, the code will not become fully idiomatic as a result of using safe Rust types. The C-style of coding will still be apparent and the output will not be as elegant as hand-written Rust code using iterators, generics, and closures. On the other hand, we think the transformed code we show in this post provides a better starting point than the original C code does. The semantic distance is less after refactoring and each additional refactoring performed by hand can be tested against the original test suite.
While many C allocation and deallocation can be replaced with Rust types that are automatically allocated and dropped, we must offer safe replacements for frequently used C APIs such those defined in string.h and used extensively in the buffer code.
C memory management routines such as realloc, memset, and memcpy can be replaced with operations on built-in Rust types via pattern matching and rewriting in the common case but likely not in complex cases.

Let us know what you think! The work required to perform the semantic changes we cover here is substantial. The effort is underway and we remain optimistic that reaching a meaningful level of automation is possible. Like the development of the Rust language itself, this has to be a collaborative effort. In other words, we need your inputs and insights if we are to have a shot at succeeding. If you have concerns, share them with us, if you have a good idea, share it with us. If you’d like to help us test out an early prototype, get in touch with us.

Also, we are hiring!