Ruby Learning by Reversing: Native Gems, Part 5

The first series of Learning by Reversing examines a Ruby native gem to understand how it works. Part 5 looks at the C code needed to interface with the Ruby code.

We have now looked at all the bits on how the gem is built, developed, packaged and installed. It’s now time to look at the actual code that provides the capabilities. Our intention is to try to understand the Ruby-C interface (Ruby C API) that is used for creating extension libraries for Ruby. This post uses much of the documentation from https://docs.ruby-lang.org/en/3.0/extension_rdoc.html to explain the concepts.

I took the lines of C code that use parts of the C API and interface, and created a word cloud that came out looking a bit like this. We’ll try to cover many of these things in this post.

Background and Problem

C and Ruby are very different languages in their programming model and experience. Ruby is object-oriented, garbage-collects allocated memory and is duck-typed. In comparison, C is procedural, leaves memory management to the programmer and is statically typed. In addition, Ruby has its own way of storing objects and of providing access to methods and data. For this reason, there is no “direct equivalent” that C code can use. Access is instead provided by C macros and functions that the C programmer can use in their code.

As we go through the post, we will cover the following:

  • Bringing in the C API/interface
  • Telling Ruby about the native extension methods (and how to call them)
  • Accessing Ruby data and methods
  • Checking arguments and raising Ruby exceptions
  • Creating Ruby objects

Bringing in the C API/interface

The C interface is brought into the C program by including ruby.h which is what you see very early in the C source code of the native extension.

#include <ruby.h>

Once this is done, we are ready to use the facilities provided by the interface.

Telling Ruby about the native extension methods

As far as a Ruby program is concerned, there is no difference in how a Ruby method or a native extension method is called. The Ruby interpreter knows how to manage this depending on where the target is. The Ruby C API provides a few functions that can be called from C code.

In Part 2 of the series, we looked at the Init_fast_polylines function that is called when the gem is loaded. We explained back then that this function defines a module called FastPolylines, creates two methods under that module, decode and encode, and makes them available to the Ruby code. Once the native extension is loaded, the FastPolylines module with the methods encode and decode is available to Ruby.

void Init_fast_polylines() {
	VALUE mFastPolylines = rb_define_module("FastPolylines");
	rb_define_module_function(mFastPolylines, "decode", rb_FastPolylines__decode, -1);
	rb_define_module_function(mFastPolylines, "encode", rb_FastPolylines__encode, -1);
}

Let’s look at the few Ruby specific things that are used here:

  • VALUE – a C data type for Ruby objects
  • rb_define_module – an interface for defining a Ruby module
  • rb_define_module_function – an interface for adding a C function to a module

VALUE

If you look at the word cloud above, VALUE jumps out in the middle as the largest word. That’s because it’s used a lot in C code that talks to Ruby. In C, variables have types and data do not have types. In contrast, Ruby variables do not have a static type, and data themselves have types, so data will need to be converted between the languages. Data in Ruby are represented by the C type VALUE. Each VALUE has its own data type and, from the C perspective, we need to:

  • Identify the VALUE’s data type
  • Convert the VALUE into C data

In the code above, we only define a Ruby object by calling rb_define_module that returns the C VALUE for the Ruby object that we use in the next couple of function calls. Reading from VALUE types is covered later in the post.

rb_define_module

This defines a new Ruby module and returns a VALUE that represents it. The interface is VALUE rb_define_module(const char *name) and it defines a module with the specified name. In our case, it defines a module called FastPolylines.

rb_define_module_function

This defines a method in a Ruby module that is linked to a C function. We call it as below:

static VALUE
rb_FastPolylines__decode(int argc, VALUE *argv, VALUE self) {
...
}

static VALUE
rb_FastPolylines__encode(int argc, VALUE *argv, VALUE self) {
...
}

...
void Init_fast_polylines() {
  VALUE mFastPolylines = rb_define_module("FastPolylines");
  rb_define_module_function(mFastPolylines, "decode", rb_FastPolylines__decode, -1);
  rb_define_module_function(mFastPolylines, "encode", rb_FastPolylines__encode, -1);
}

You can see that we pass rb_define_module_function the module that we had defined using rb_define_module, the name "decode" for the Ruby method, and rb_FastPolylines__decode, which is the C function that will be called when the developer does FastPolylines.decode(...) in their Ruby script. The final argument is -1 and this needs a little bit of explanation based on the documentation. When the final argument is -1 (or -2), it specifies the way in which the Ruby interpreter will call the C function. When it is -1, the function will be called as VALUE func(int argc, VALUE *argv, VALUE obj) which you can see matches how we define rb_FastPolylines__decode in the C code. The Ruby interpreter will call it with argc – the number of arguments, argv – a C array of the arguments and obj – the receiver object from Ruby. Pictorially, this is shown below.

As a note, if we had passed -2 instead of -1, the C function would have been called as VALUE func(VALUE obj, VALUE args) which would provide the Ruby receiver and the arguments as a Ruby array.

So, that is how we inform Ruby where our methods are and how they can be called. In effect, we implemented the Ruby code shown below from the C side.

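module FastPolylines
  # rb_define_module_function behaves like module_function:
  # the methods can be called as FastPolylines.decode(...)
  module_function

  # An arity of -1 accepts a variable number of arguments
  def decode(*args)
    ...
  end

  def encode(*args)
    ...
  end
end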
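For readers who find it easier to think in Ruby, the -1 calling convention can be pictured as a method that takes a splat: the collected array plays the role of argv and its length plays the role of argc. The sketch below is a hypothetical pure-Ruby stand-in (the module name FastPolylinesSketch and the hash it returns are made up for illustration); it is not the gem's code.

```ruby
# Hypothetical stand-in; the real methods are implemented in C.
module FastPolylinesSketch
  module_function

  # *argv collects every argument, much like the (argc, argv) pair
  # that a C function registered with -1 receives.
  def decode(*argv)
    argc = argv.length
    { argc: argc, argv: argv }
  end
end

result = FastPolylinesSketch.decode("some polyline", 5)
puts result[:argc]          # number of arguments passed
puts result[:argv].inspect  # the arguments themselves
```

Because of module_function, the method is reachable as FastPolylinesSketch.decode(...) without creating an instance, which is the calling style the article uses for FastPolylines.decode(...).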

Accessing Ruby Data and Methods

Since we are in C code, we need to convert Ruby data into C data types. As we mentioned above, to do this, we need to:

  • Identify the VALUE’s data type
  • Convert the VALUE into C data

Let’s look at the decode method first. The decode method is called like this from Ruby:

FastPolylines.decode(polyline, precision = 5) -> [[lat, lng], ...]

When we call it, we pass a String (polyline) and optionally, a Number (precision). Due to this, in our C code, we need to:

  • Decide if we received 1 or 2 arguments
  • Convert the first argument to a C string
  • Convert the second argument, if any, to a C integer (or use the default precision instead)

This is all handled in the code shown below.

  rb_check_arity(argc, 1, 2);
  Check_Type(argv[0], T_STRING);
  char* polyline = StringValueCStr(argv[0]);
  uint precision = _get_precision(argc > 1 ? argv[1] : Qnil);

Let’s look at everything we use in this part:

  • rb_check_arity – this will check the number of arguments and raise an ArgumentError if the number (argc) is not within the range of the two numbers provided after that (min, max). If max is UNLIMITED_ARGUMENTS, then the upper bound is not checked. In our case, we expect to receive 1 or 2 arguments so we call rb_check_arity(argc, 1, 2) to check that we were given the correct number of arguments. If it was called with 0 arguments or with more than 2 arguments, this will automatically raise an ArgumentError.
  • Check_Type – this checks that the VALUE given as its first argument has the type given as its second argument. It raises a TypeError if the VALUE does not have the type specified. In our case, we check that argv[0] (the first argument) is a T_STRING since we expect it to be a string.
  • StringValueCStr – this takes a VALUE and returns a pointer to a NUL-terminated C string.
  • T_STRING – the type constant for strings. Used above by Check_Type.
  • Qnil – the C constant that represents Ruby’s nil

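The checks above can be mirrored in plain Ruby to make their effect easier to see. This is only a sketch under assumptions: parse_decode_args is a hypothetical helper that does not exist in the gem, is_a?(String) is an approximation of Check_Type with T_STRING, and the error messages are modelled on the ones Ruby usually prints rather than copied from the C API.

```ruby
# Ruby-side sketch of the checks the C prologue performs.
# parse_decode_args is a hypothetical helper, not part of the gem.
def parse_decode_args(*argv)
  # rb_check_arity(argc, 1, 2)
  unless (1..2).cover?(argv.length)
    raise ArgumentError,
          "wrong number of arguments (given #{argv.length}, expected 1..2)"
  end

  # Check_Type(argv[0], T_STRING)
  polyline = argv[0]
  unless polyline.is_a?(String)
    raise TypeError, "wrong argument type #{polyline.class} (expected String)"
  end

  # argc > 1 ? argv[1] : Qnil
  precision = argv.length > 1 ? argv[1] : nil
  [polyline, precision]
end

p parse_decode_args("abc")     # => ["abc", nil]
p parse_decode_args("abc", 6)  # => ["abc", 6]
```

Calling the helper with no arguments, or with three, raises an ArgumentError, and passing a non-String first argument raises a TypeError, which is the behaviour the bullet points describe for the C code.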
Most of the code is now clearer except the use of _get_precision which is called from both the decode and encode functions to check if precision is provided and valid. If argc is more than 1, we pass this function the value of argv[1], else we pass it Qnil which represents nil. The function then does this:

static inline uint _get_precision(VALUE value) {
  int precision = NIL_P(value) ? DEFAULT_PRECISION : NUM2INT(value);
  if (precision > MAX_PRECISION) rb_raise(rb_eArgError, "precision too high (https://xkcd.com/2170/)");
  if (precision < 0) rb_raise(rb_eArgError, "negative precision doesn't make sense");
  return (uint)precision;
}

In this piece of code, we use:

  • NIL_P – this is used to check if the provided value is nil
  • NUM2INT – this converts a number from a Ruby Numeric to a C integer (the reverse is INT2NUM)
  • rb_raise – this raises an exception of the given class, with a message built from a format string just like printf().
  • rb_eArgError – this is the exception class for ArgumentError, defined in the Ruby C source code

In this function, then:

  • We assign the C integer precision the DEFAULT_PRECISION if the provided VALUE is Qnil; else, we convert the Numeric value to a C integer.
  • Next, we check if it is larger than MAX_PRECISION and raise an ArgumentError if it is
  • Next, we check if it is negative and raise an ArgumentError if it is
  • Finally, we return the precision to the caller
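The same steps can be written as a short Ruby method for comparison. This is a sketch, not the gem's implementation: the module name PrecisionSketch is hypothetical, the default of 5 comes from the documented signature precision = 5, the value of MAX_PRECISION is assumed purely for illustration, and to_int only approximates what NUM2INT does.

```ruby
# Hypothetical Ruby mirror of the C helper _get_precision.
module PrecisionSketch
  DEFAULT_PRECISION = 5   # matches the documented default (precision = 5)
  MAX_PRECISION = 13      # assumed value, for illustration only

  module_function

  def get_precision(value)
    # NIL_P(value) ? DEFAULT_PRECISION : NUM2INT(value)
    precision = value.nil? ? DEFAULT_PRECISION : value.to_int
    if precision > MAX_PRECISION
      raise ArgumentError, "precision too high (https://xkcd.com/2170/)"
    end
    if precision < 0
      raise ArgumentError, "negative precision doesn't make sense"
    end
    precision
  end
end

p PrecisionSketch.get_precision(nil)  # => 5
p PrecisionSketch.get_precision(6)    # => 6
```

Passing nil stands in for the Qnil that the C code supplies when the caller omitted the second argument.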

The encode function has a lot more woven into it, but the start of the function should now be easy enough to understand:

  rb_check_arity(argc, 1, 2);
  Check_Type(argv[0], T_ARRAY);
  size_t len = RARRAY_LEN(argv[0]);
  uint64_t i;
  VALUE current_pair;
  int64_t prev_pair[2] = {0, 0};
  uint precision = _get_precision(argc > 1 ? argv[1] : Qnil);

There are a few new things that come in since this method call expects to receive an array of Latitude and Longitude pairs as the first argument. That’s why we see:

  • T_ARRAY – this represents the Type of Array and is used in Check_Type to ensure that the first argument is indeed an array
  • RARRAY_LEN – this returns the length / count / size of the Ruby array.

In this function, we use these to confirm that we have an array and then to get its size. Since the data is provided to us as a nested array, unfortunately, we also need to check each element of the array to make sure that it meets our requirements. As a result, we see code like this in rb_FastPolylines__encode.

  for (i = 0; i < len; i++) {
      current_pair = RARRAY_AREF(argv[0], i);
      uint8_t j;
      Check_Type(current_pair, T_ARRAY);
      if (RARRAY_LEN(current_pair) != 2) {
        free(chunks);
        rb_raise(rb_eArgError, "wrong number of coordinates");
      }
      for (j = 0; j < 2; j++) {
        VALUE current_value = RARRAY_AREF(current_pair, j);
        switch (TYPE(current_value)) {
          case T_BIGNUM:
          case T_FLOAT:
          case T_FIXNUM:
          case T_RATIONAL:
            break;
          default:
            free(chunks);
            rb_raise(rb_eTypeError, "no implicit conversion to Float from %s", rb_obj_classname(current_value));
        };

        double parsed_value = NUM2DBL(current_value);
      ...
  }

As we iterate from the start (i = 0) to the size of the array (i < len), we need to do the following:

  • Get the current element from the array (which itself should be an array) into current_pair (declared as VALUE) using RARRAY_AREF
  • Check if the current element has a type of T_ARRAY
  • Check that its size is 2 since we expect 2 elements in each sub-array; raise an exception if it is not.
  • Then, get each element of the sub-array using RARRAY_AREF and:
    • Find the data type of each element (using TYPE(current_value))
    • If it is not a type that we can convert to double, raise an exception
    • If we can convert it, convert it to double using NUM2DBL

The code introduced the following new things:

  • RARRAY_AREF – from the provided Ruby array, this returns the element at the specified position. We use it as RARRAY_AREF(argv[0], i) which means from the first argument, we return the element at position i as a VALUE.
  • TYPE – this is used to check the type of the VALUE and it returns things like T_BIGNUM, T_FLOAT, etc.
  • T_BIGNUM, T_FLOAT, T_FIXNUM, T_RATIONAL: the type representations for the Ruby types
  • rb_eTypeError – this is the exception class for TypeError, defined in the Ruby C source code
  • NUM2DBL – this converts a numeric Ruby data into a C double

In addition to this, we used rb_raise as below:

rb_raise(rb_eTypeError, "no implicit conversion to Float from %s", rb_obj_classname(current_value));

In this, we see 2 things that help us to raise an exception that reads like “no implicit conversion to Float from String”:

  • rb_obj_classname – this returns a C string with the name of the class of the object.
  • We also use "no implicit conversion to Float from %s", rb_obj_classname(current_value) to prepare the error string since it is formatted in the same way as a printf string.
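To see the whole validation loop in one place, here is a Ruby sketch of the same checks. It is an approximation under assumptions: validate_coordinates is a hypothetical helper, is_a? stands in for Check_Type and TYPE, Integer covers both T_FIXNUM and T_BIGNUM, and to_f plays the role of NUM2DBL. The memory handling (free(chunks)) has no Ruby counterpart and is left out.

```ruby
# Hypothetical Ruby mirror of the validation loop in the C encode function.
def validate_coordinates(pairs)
  unless pairs.is_a?(Array)
    raise TypeError, "wrong argument type #{pairs.class} (expected Array)"
  end

  pairs.map do |pair|
    # Check_Type(current_pair, T_ARRAY)
    unless pair.is_a?(Array)
      raise TypeError, "wrong argument type #{pair.class} (expected Array)"
    end
    # RARRAY_LEN(current_pair) != 2
    raise ArgumentError, "wrong number of coordinates" unless pair.length == 2

    pair.map do |value|
      case value
      when Integer, Float, Rational # T_FIXNUM / T_BIGNUM, T_FLOAT, T_RATIONAL
        value.to_f                  # NUM2DBL
      else
        raise TypeError, "no implicit conversion to Float from #{value.class}"
      end
    end
  end
end

p validate_coordinates([[38.5, -120.2], [40, Rational(1, 2)]])
# => [[38.5, -120.2], [40.0, 0.5]]
```

Passing something like [[1, "2"]] raises a TypeError whose message ends in "from String", which corresponds to the %s and rb_obj_classname combination in the C code.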

Creating Ruby Objects

To return string data back after we encode the data, we need to create a Ruby String. Similarly, after we decode the data, we need to return an array.

We use the following methods to create Ruby objects that we hold in VALUE variables:

  • rb_define_module – discussed earlier, this defines a module.
  • rb_define_module_function – discussed earlier, this links a C function to a method name in a module.
  • rb_str_new – this creates a new string object. We call it as VALUE polyline = rb_str_new(chunks, chunks_index); passing it the pointer to the character array and the length of the string.
  • rb_float_new – this creates a new float object from a C double. We call it as rb_float_new((double) latlng[index] / precision_value) passing in the C double and getting back a VALUE for the Ruby float.
  • rb_ary_new – this creates a new array with no elements. In our code, we call it as VALUE ary = rb_ary_new(); in the decode function.
  • rb_ary_new_from_values – this creates a new n-element array from a C array that holds VALUE elements. We use it as rb_ary_new_from_values(2, sub_ary) to create a new array that consists of the 2 elements from the sub_ary array.

All of these functions return a VALUE that represents a Ruby object and can either be used where a Ruby object is needed or can be returned from the function as the return value. In our case, we:

  • return a string from the encode function
  • return an array of coordinates from the decode function, with each coordinate represented as an array of 2 values (a latitude and a longitude)

The Last Few

We have covered almost all the Ruby-C API interfaces that the code uses. There are only a couple of other items that are left.

rb_ary_push

Of the remaining functions, this is the simplest to cover. It appends a VALUE to the end of an existing Ruby array. We pass it the VALUE corresponding to the array and the VALUE corresponding to the element that is to be pushed into the array. In our case, the code uses it as if (index) rb_ary_push(ary, rb_ary_new_from_values(2, sub_ary)); to push the new array created from the sub_ary into ary (which is eventually returned to the caller) in the decode function.
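Put together, the object-creating calls in decode build a nested array one pair at a time. The Ruby sketch below shows the shape of that result. It rests on assumptions: build_result is a hypothetical helper, the input integers are made up, and the scaling factor of 10 raised to the precision is assumed from how polyline encoding generally works rather than taken from the code shown in this post.

```ruby
# Hypothetical helper: turns scaled integer coordinates into the nested
# array that decode returns. The inputs are made up for illustration.
def build_result(scaled_pairs, precision)
  precision_value = (10**precision).to_f  # assumed scaling factor
  ary = []                                # rb_ary_new()
  scaled_pairs.each do |lat, lng|
    # rb_float_new((double) latlng[index] / precision_value), twice
    sub_ary = [lat / precision_value, lng / precision_value]
    # rb_ary_push(ary, rb_ary_new_from_values(2, sub_ary))
    ary.push(sub_ary)
  end
  ary
end

p build_result([[3850000, -12020000]], 5)  # => [[38.5, -120.2]]
```

Each inner array corresponds to one rb_ary_new_from_values(2, sub_ary) call, and the outer array is the VALUE that is finally returned to the Ruby caller.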

rb_funcall and rb_intern

Both of these are used in a macro to output a debug message using Ruby if DEBUG was set during installation. This is defined as:

#ifdef DEBUG
#define dbg(...) printf(__VA_ARGS__)
#define rdbg(value) rb_funcall(Qnil, rb_intern("p"), 1, value)
#else
#define dbg(...)
#define rdbg(...)
#endif

In short, the Ruby code it tries to execute is: p value but it requires a bit more code to connect up from C.

The function rb_funcall allows C code to invoke a Ruby method. It ignores method visibility, so it can call public, protected and private Ruby methods. We call rb_funcall with the receiver of the method call, the ID of the method to be called, the number of arguments, and then the arguments themselves. Since we do not have the ID of the method at hand, we use rb_intern to turn the method name into its ID. In our case, we do: rb_funcall(Qnil, rb_intern("p"), 1, value) which looks up the ID for the name p and asks the interpreter to call that method with 1 argument – the value. Since p is a private method of Kernel that every object inherits, the choice of receiver does not matter much, so we pass Qnil (instead of a module, class or instance).
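A rough Ruby analogue of this call is send, which also bypasses method visibility. The sketch below is only an analogy: to_sym stands in for rb_intern (a Ruby Symbol is the Ruby-level view of a method name, not the C-level ID itself), and the sample value is made up.

```ruby
value = [38.5, -120.2]  # made-up sample value

# Roughly what rb_intern("p") provides: an identifier for the method name.
method_name = "p".to_sym

# Roughly rb_funcall(Qnil, rb_intern("p"), 1, value):
# call the private method p with nil as the receiver.
returned = nil.send(method_name, value)

# An ordinary call with an explicit receiver is rejected, because p is private.
begin
  nil.p(value)
rescue NoMethodError => e
  puts "direct call failed: #{e.class}"
end
```

The send call prints the inspected value and returns it, while the direct nil.p(value) call raises NoMethodError, which illustrates why a visibility-ignoring call such as rb_funcall is convenient here.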

A Model to Follow

If you are writing simple C functions to replace some Ruby methods that are slow, we now have a model that we can follow:

  • First, create a module and add your methods to it. These will happen in the Init_ function of the extension.
  • Each of the methods must follow the signature that matches the definition when calling rb_define_module_function (e.g., we used -1 to ensure that it’s called with an argc/ argv combination)
  • Then, in each of the functions, at a minimum, make sure you do:
    • Use rb_check_arity to confirm that the correct number of arguments have been passed (or raise an exception)
    • For each mandatory argument, check the type is correct and convert it into the C data type you expect
    • Check if optional arguments are provided. For each, check the type is correct and convert them into C data types you expect
    • This might be needed multiple times if your code receives an array that has many elements – each of them needs to be checked and converted
    • Build the VALUE that will be returned to the Ruby caller by converting your C results into the matching Ruby objects
    • If you allocated memory using malloc, remember to free it.
    • If you allocated memory using malloc, remember to free it (just a reminder)
    • If you allocated memory using malloc, remember to free it wherever control is transferred back to the Ruby space, e.g., it’s especially easy to overlook freeing memory before raising an exception.
  • Where necessary, use methods like rb_raise to raise an exception
  • If needed, use rb_funcall to call arbitrary methods that exist in the Ruby space

The extension we looked at did not wrap an existing library or large body of C code in the native extension. For that reason, it could follow the above method and interleave the logic and the code to create VALUE items and use the Ruby API directly. If, on the other hand, we wrapped an existing C library, we would need to write interface code that would receive the arguments from the Ruby code, convert them all to C types, then call the existing C methods in the library, and finally convert the returned C types back for returning to the Ruby caller.

Summary

We have actually looked at all the Ruby related C functions that the code uses and have identified how C code could be used to add functionality. Specifically, we have looked at how to let the Ruby interpreter know where the native code exists and how to call it, how to check the number of arguments passed to the C function and their types, how to convert between Ruby data and C types, how to do type checking, how to create Ruby objects from C, how to raise errors and how to call Ruby methods from C.

It might seem like a lot but in reality, it is just the tip of the iceberg. There are many more functions and capabilities that are possible when using the Ruby C API. Also, our case has been particularly simple since we don’t provide any classes that have a representation in C and persist data. If we were trying to create bindings from Ruby to an existing C library, we might need to define Ruby classes and have objects that store data. We don’t cover any of those since the gem we are looking at does not need these.

However, having come this far, we are now able to add simple functions to accelerate processing of select Ruby methods by providing a C implementation that compiles to native code and might run faster than the Ruby equivalent.

Looking ahead

In the next couple of parts, we look at documentation and benchmarking. Stay tuned for Part 6. If you have any comments, please feel free to leave them below.

Short list of references

I will add links and references later, possibly in the last post of the series. However, some of the links below were very useful and heavily used while creating this post.

  1. The main Ruby RDoc for extension – https://docs.ruby-lang.org/en/3.0/extension_rdoc.html
  2. List of Ruby exception classes in the source code – https://sonots.github.io/ruby-capi/da/da6/group__exception.html#gac5bd9245454f935493c41daa4f2651ae
  3. A note on memory leaks and using rb_funcall – https://blog.appsignal.com/2023/01/25/calling-ruby-methods-in-c-avoid-memory-leaks.html
  4. The Definitive Guide to Ruby’s C API (a bit dated, though) – http://silverhammermba.github.io/emberb/c/