The first series of Learning by Reversing examines a Ruby native gem to understand how it works. Part 5 looks at the C code needed to interface with the Ruby code.
Previously, in this series, we have had:
- Part 1 – Background to the gem we are looking at (including installing, changing and rebuilding it)
- Part 2 – How a native gem is loaded
- Part 3 – How the files get packaged so that the native extension is built during gem installation
- Part 4 – Understanding the Development Makefile
We have now looked at all the bits on how the gem is built, developed, packaged and installed. It’s now time to look at the actual code that provides the capabilities. Our intention is to try to understand the Ruby-C interface (Ruby C API) that is used for creating extension libraries for Ruby. This post uses much of the documentation from https://docs.ruby-lang.org/en/3.0/extension_rdoc.html to explain the concepts.
I took the lines of C code that use parts of the C API and interface, and created a word cloud that came out looking a bit like this. We’ll try to cover many of these things in this post.
Background and Problem
C and Ruby are very different languages in their programming model and developer experience. Ruby is object-oriented, manages garbage collection for allocated memory and is duck-typed. In comparison, C is procedural, leaves memory management to the programmer and is statically typed. In addition, there are differences in how Ruby manages and stores objects and provides access to methods and data. For this reason, there is no “direct equivalent” that C code can use. Access is instead provided by C macros and functions that the C programmer can use in their code.
As we go through the post, we will cover the following:
- Bringing in the C API/interface
- Telling Ruby about the native extension methods (and how to call them)
- Accessing Ruby data and methods
- Checking arguments and raising Ruby exceptions
- Creating Ruby objects
Bringing in the C API/interface
The C interface is brought into the C program by including ruby.h, which is what you see very early in the C source code of the native extension.
Once this is done, we are ready to use the facilities provided by the interface.
Telling Ruby about the native extension methods
As far as a Ruby program is concerned, there is no difference in how a Ruby method or a native extension method is called. The Ruby interpreter knows how to manage this depending on where the target is. The Ruby C API provides a few functions that can be called from C code.
In Part 2 of the series, we looked at the Init_fast_polylines function that is called when the gem is loaded. We explained back then that this function defines a module called FastPolylines, creates two methods under that module, decode and encode, and makes them available to the Ruby code. Once the native extension is loaded, the FastPolylines module and its encode and decode methods are available to Ruby.
Let’s look at the few Ruby-specific things that are used here:
- VALUE – a C data type for Ruby objects
- rb_define_module – an interface for defining a Ruby module
- rb_define_module_function – an interface for adding a C function to a module
VALUE
If you look at the word cloud above, VALUE jumps out in the middle as the largest word. That’s because it’s used a lot in C code that talks to Ruby. In C, variables have types and data do not have types. In contrast, Ruby variables do not have a static type, and data themselves have types, so data will need to be converted between the languages. Data in Ruby are represented by the C type VALUE. Each VALUE carries its own data type and, from the C perspective, we need to:
- Identify the VALUE’s data type
- Convert the VALUE into C data
In the Init_fast_polylines function, we only create a Ruby object by calling rb_define_module, which returns the C VALUE for the Ruby module that we then use in the next couple of function calls. Reading from VALUE types is covered later in the post.
rb_define_module
This defines a new Ruby module and returns a VALUE that represents it. The interface is VALUE rb_define_module(const char *name) and it defines a module with the specified name. In our case, it defines a module called FastPolylines.
rb_define_module_function
This defines a method in a Ruby module that is linked to a C function.
We pass rb_define_module_function the module that we defined using rb_define_module, the name "decode" for the Ruby method, and rb_FastPolylines__decode, which is the C function that will be called when the developer does FastPolylines.decode(...) in their Ruby script. The final argument is -1 and this needs a little bit of explanation based on the documentation. The final argument is normally the number of arguments that the C function takes; when it is negative (-1 or -2), it instead specifies the way in which the Ruby interpreter will call the C function. When it is -1, the function will be called as VALUE func(int argc, VALUE *argv, VALUE obj), which you can see matches how we define rb_FastPolylines__decode in the C code. The Ruby interpreter will call it with argc (the number of arguments), argv (a C array of the arguments) and obj (the receiver object from Ruby). Pictorially, this is shown below.
As a note, if we had passed -2 instead of -1, the C function would have been called as VALUE func(VALUE obj, VALUE args), which would provide the Ruby receiver and the arguments as a Ruby array.
So, that is how we inform Ruby where our methods are and how they can be called. In effect, we implemented the Ruby code shown below from the C side.
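    module FastPolylines
      # rb_define_module_function creates module functions, so the methods
      # can be called as FastPolylines.decode and FastPolylines.encode
      module_function

      def decode
        ...
      end

      def encode
        ...
      end
    end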
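To make the Ruby-side picture concrete, here is a minimal runnable sketch in plain Ruby. The module name SketchPolylines and the method bodies are placeholders for illustration and are not the gem's code. The sketch assumes the documented behaviour that rb_define_module_function creates a module function (callable on the module itself and also present as a private instance method), and it uses a splat to mimic the argc/argv style that -1 selects.

```ruby
# Hypothetical stand-in for the module that the C code defines.
module SketchPolylines
  module_function

  # *argv mimics the -1 calling convention: all arguments arrive together
  # and the method itself decides how many are acceptable.
  def decode(*argv)
    "decode received #{argv.length} argument(s)"
  end

  def encode(*argv)
    "encode received #{argv.length} argument(s)"
  end
end

puts SketchPolylines.decode("abc", 5)
```

With the -2 convention the arguments would still arrive as one collection, but as a Ruby Array rather than a C array plus a count.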
Accessing Ruby Data and Methods
Since we are in C code, we need to convert Ruby data into C data types. As we mentioned above, to do this, we need to:
- Identify the VALUE’s data type
- Convert the VALUE into C data
Let’s look at the decode method first. The decode method is called like this from Ruby:
FastPolylines.decode(polyline, precision = 5) -> [[lat, lng], ...]
When we call it, we pass a String (polyline) and, optionally, a Number (precision). Due to this, in our C code, we need to:
- Decide if we received 1 or 2 arguments
- Convert the first argument as a string
- Convert the second argument, if any, as an integer (or assign the default precision to it)
This is all handled in the code shown below.
    rb_check_arity(argc, 1, 2);
    Check_Type(argv[0], T_STRING);
    char* polyline = StringValueCStr(argv[0]);
    uint precision = _get_precision(argc > 1 ? argv[1] : Qnil);
Let’s look at everything we use in this part:
- rb_check_arity – this checks the number of arguments and raises an ArgumentError if the number (argc) is not within the range of the two numbers provided after it (min, max). If max is UNLIMITED_ARGUMENTS, then the upper bound is not checked. In our case, we expect to receive 1 or 2 arguments so we call rb_check_arity(argc, 1, 2) to check that we were given the correct number of arguments. If the method was called with 0 arguments or with more than 2 arguments, this will automatically raise an ArgumentError.
- Check_Type – this checks that the VALUE given as the first argument has the type given as the second argument. It raises a TypeError if the VALUE does not have the type specified. In our case, we check that argv[0] (the first argument) is a T_STRING since we expect it to be a string.
- StringValueCStr – this takes a VALUE and returns a pointer to a NUL-terminated C string.
- T_STRING – the type constant for the string type. Used above by Check_Type.
- Qnil – the C constant equivalent to Ruby’s nil.
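The checks listed above can be mirrored in plain Ruby to see how they behave. This is only a sketch: check_arity, check_string and decode_arguments are hypothetical helper names, and the error messages are approximations rather than the exact text that the C API produces.

```ruby
# Hypothetical helper mirroring rb_check_arity(argc, min, max).
def check_arity(argc, min, max)
  return if argc >= min && argc <= max

  raise ArgumentError, "wrong number of arguments (given #{argc}, expected #{min}..#{max})"
end

# Hypothetical helper mirroring Check_Type(value, T_STRING).
def check_string(value)
  return if value.is_a?(String)

  raise TypeError, "wrong argument type #{value.class} (expected String)"
end

# Mirrors the opening lines of the decode function.
def decode_arguments(*argv)
  check_arity(argv.length, 1, 2)
  check_string(argv[0])
  polyline = argv[0]
  precision = argv.length > 1 ? argv[1] : nil # nil plays the role of Qnil
  [polyline, precision]
end

p decode_arguments("abc")    # one argument: precision falls back to nil
p decode_arguments("abc", 6) # two arguments
```

Calling decode_arguments with no arguments, or with three, raises ArgumentError; passing a non-String first argument raises TypeError.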
Most of the code is now clearer except the use of _get_precision, which is called from both the decode and encode functions to check if precision is provided and valid. If argc is more than 1, we pass this function the value of argv[1]; else we pass it Qnil, which represents nil. The function then does this:
    static inline uint _get_precision(VALUE value) {
      int precision = NIL_P(value) ? DEFAULT_PRECISION : NUM2INT(value);
      if (precision > MAX_PRECISION) rb_raise(rb_eArgError, "precision too high (https://xkcd.com/2170/)");
      if (precision < 0) rb_raise(rb_eArgError, "negative precision doesn't make sense");
      return (uint)precision;
    }
In this piece of code, we use:
- NIL_P – this is used to check if the provided value is nil
- NUM2INT – this converts a number from a Ruby Numeric to a C integer (the reverse is INT2NUM)
- rb_raise – this raises an exception of the given class, with a message built from a format string just like printf().
- rb_eArgError – this is the exception class for ArgumentError, defined in the Ruby C source code
In this function, then:
- We assign the C integer precision the DEFAULT_PRECISION if the provided VALUE is Qnil; else, we convert the Numeric value to a C integer.
- Next, we check if it is larger than MAX_PRECISION and raise an ArgumentError if it is
- Next, we check if it is negative and raise an ArgumentError if it is
- Finally, we return the precision to the caller
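The same four steps can be sketched in plain Ruby. The DEFAULT_PRECISION of 5 comes from the decode signature shown earlier; the MAX_PRECISION of 13 is an assumed value used only for illustration, and get_precision is a hypothetical name. The Numeric check and to_i only approximate what NUM2INT does.

```ruby
DEFAULT_PRECISION = 5 # matches the default in the decode signature
MAX_PRECISION = 13    # assumed value, for illustration only

# Plain-Ruby approximation of the C helper _get_precision.
def get_precision(value)
  precision =
    if value.nil?                 # NIL_P(value)
      DEFAULT_PRECISION
    elsif value.is_a?(Numeric)    # rough stand-in for NUM2INT(value)
      value.to_i
    else
      raise TypeError, "no implicit conversion of #{value.class} into Integer"
    end

  raise ArgumentError, "precision too high" if precision > MAX_PRECISION
  raise ArgumentError, "negative precision doesn't make sense" if precision < 0

  precision
end

p get_precision(nil) # falls back to the default
p get_precision(6)
```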
The encode function has a lot more woven into it, but the start of the function should now be easy enough to understand:
    rb_check_arity(argc, 1, 2);
    Check_Type(argv[0], T_ARRAY);
    size_t len = RARRAY_LEN(argv[0]);
    uint64_t i;
    VALUE current_pair;
    int64_t prev_pair[2] = {0, 0};
    uint precision = _get_precision(argc > 1 ? argv[1] : Qnil);
There are a few new things that come in since this method call expects to receive an array of Latitude and Longitude pairs as the first argument. That’s why we see:
- T_ARRAY – this represents the type of Array and is used in Check_Type to ensure that the first argument is indeed an array
- RARRAY_LEN – this returns the length / count / size of the Ruby array.
In this function, we use these to confirm that we have an array and then to check the size of the array. Since the data is provided to us as a nested array, unfortunately, we also need to check each element along the array to make sure that it meets our requirement. As a result, we see code like this in rb_FastPolylines__encode.
    for (i = 0; i < len; i++) {
      current_pair = RARRAY_AREF(argv[0], i);
      uint8_t j;
      Check_Type(current_pair, T_ARRAY);
      if (RARRAY_LEN(current_pair) != 2) {
        free(chunks);
        rb_raise(rb_eArgError, "wrong number of coordinates");
      }
      for (j = 0; j < 2; j++) {
        VALUE current_value = RARRAY_AREF(current_pair, j);
        switch (TYPE(current_value)) {
          case T_BIGNUM:
          case T_FLOAT:
          case T_FIXNUM:
          case T_RATIONAL:
            break;
          default:
            free(chunks);
            rb_raise(rb_eTypeError, "no implicit conversion to Float from %s", rb_obj_classname(current_value));
        };
        double parsed_value = NUM2DBL(current_value);
        ...
      }
As we iterate from the start (i = 0) to the size of the array (i < len), we need to do the following:
- Get the current element from the array (which itself should be an array) into current_pair (declared as VALUE) using RARRAY_AREF
- Check if the current element has a type of T_ARRAY
- Check that its size is 2 since we expect 2 elements in each sub-array; raise an exception if it is not.
- Then, get each element of the sub-array using RARRAY_AREF and:
  - Find the data type of each element (using TYPE(current_value))
  - If it is not a type that we can convert to double, raise an exception
  - If we can convert it, convert it to double using NUM2DBL
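The validation loop described above can be approximated in plain Ruby. This is a sketch under assumptions: validate_pairs is a hypothetical name, Integer stands in for both T_FIXNUM and T_BIGNUM, and to_f stands in for NUM2DBL. The real C code also has to free its buffer before raising, which Ruby's garbage collector makes unnecessary here.

```ruby
# Plain-Ruby approximation of the per-element checks in the encode function.
def validate_pairs(points)
  raise TypeError, "wrong argument type #{points.class} (expected Array)" unless points.is_a?(Array)

  points.map do |pair|
    raise TypeError, "wrong argument type #{pair.class} (expected Array)" unless pair.is_a?(Array)
    raise ArgumentError, "wrong number of coordinates" unless pair.length == 2

    pair.map do |value|
      # Integer covers T_FIXNUM and T_BIGNUM; Float and Rational match the other cases.
      unless value.is_a?(Integer) || value.is_a?(Float) || value.is_a?(Rational)
        raise TypeError, "no implicit conversion to Float from #{value.class.name}"
      end
      value.to_f # stand-in for NUM2DBL
    end
  end
end

p validate_pairs([[38.5, -120.2], [40, Rational(1, 2)]])
```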
The code introduced the following new things:
- RARRAY_AREF – from the provided Ruby array, this returns the element at the specified position. We use it as RARRAY_AREF(argv[0], i) which means from the first argument, we return the element at position i as a VALUE.
- TYPE – this is used to check the type of the VALUE and it returns things like T_BIGNUM, T_FLOAT, etc.
- T_BIGNUM, T_FLOAT, T_FIXNUM, T_RATIONAL: the type representations for the Ruby types
- rb_eTypeError – this is the exception class for TypeError, defined in the Ruby C source code
- NUM2DBL – this converts a Ruby numeric value into a C double
In addition to this, we used rb_raise as below:
rb_raise(rb_eTypeError, "no implicit conversion to Float from %s", rb_obj_classname(current_value));
In this, we see 2 things that help us to raise an exception that reads like “no implicit conversion to Float from String”:
- rb_obj_classname – this returns a C string with the name of the class of the object.
- We also use "no implicit conversion to Float from %s", rb_obj_classname(current_value) to prepare the error string since it is formatted in the same way as a printf string.
Creating Ruby Objects
To return string data back after we encode the data, we need to create a Ruby String. Similarly, after we decode the data, we need to return an array.
We use the following methods to create Ruby objects that we hold in VALUE variables:
- rb_define_module – discussed earlier, this defines a module.
- rb_define_module_function – discussed earlier, this links a C function to a method name in a module.
- rb_str_new – this creates a new string object. We call it as VALUE polyline = rb_str_new(chunks, chunks_index); passing it the pointer to the character array and the length of the string.
- rb_float_new – this creates a new float object from a C double. We call it as rb_float_new((double) latlng[index] / precision_value) passing in the C double and getting back a VALUE for the Ruby float.
- rb_ary_new – this creates a new array with no elements. In our code, we call it as VALUE ary = rb_ary_new(); in the decode function.
- rb_ary_new_from_values – this creates a new n-element array from a C array that holds VALUE elements. We use it as rb_ary_new_from_values(2, sub_ary) to create a new array that comprises 2 elements from the sub_ary array.
All of these functions return a VALUE that represents a Ruby object and can either be used where a Ruby object is needed or can be returned from the function as the return value. In our case, we:
- return a string from the encode function
- return an array of coordinates from the decode function, with each coordinate being represented as an array of 2 values (a latitude and longitude)
The Last Few
We have covered almost all the Ruby-C API interfaces that the code uses. There are only a couple of other items that are left.
rb_ary_push
Out of the pending methods, this is the simplest to cover. It simply pushes a VALUE onto the end of an existing Ruby array. We pass it the VALUE corresponding to the array and the VALUE corresponding to the element that is to be pushed into the array. In our case, the code uses it as if (index) rb_ary_push(ary, rb_ary_new_from_values(2, sub_ary)); to push the new array created from the sub_ary into ary (which is eventually returned to the caller) in the decode function.
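To see how these object-creation calls fit together, here is a rough plain-Ruby sketch of how a decoded result could be assembled. It is not the gem's decode logic: pairs_from_integers is a hypothetical name, the input is assumed to be an already-decoded flat list with an even number of scaled integers, and the divisor is assumed to be 10 raised to the precision.

```ruby
# Builds [[lat, lng], ...] from a flat list of scaled integers.
def pairs_from_integers(scaled, precision)
  factor = 10.0**precision # assumed meaning of precision_value in the C code
  ary = []                 # rb_ary_new()
  scaled.each_slice(2) do |lat, lng|
    sub_ary = [lat / factor, lng / factor] # two rb_float_new calls
    ary.push(sub_ary)                      # rb_ary_push(ary, rb_ary_new_from_values(2, sub_ary))
  end
  ary
end

p pairs_from_integers([3850000, -12020000], 5)
```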
rb_funcall and rb_intern
Both of these are used in a macro to output a debug message using Ruby if DEBUG was set during installation. This is defined as:
    #ifdef DEBUG
    #define dbg(...) printf(__VA_ARGS__)
    #define rdbg(value) rb_funcall(Qnil, rb_intern("p"), 1, value)
    #else
    #define dbg(...)
    #define rdbg(...)
    #endif
In short, the Ruby code it tries to execute is p value, but it requires a bit more code to connect up from C.
The function rb_funcall allows C code to invoke a Ruby method. It allows calling public, private and protected Ruby methods from C. The way to do this is to call rb_funcall with the receiver of the method call, the ID of the method to be called, the number of arguments and then the arguments for the method. Since we may not readily know the ID of the method, we use rb_intern to look up the ID from the method name. In our case, we do rb_funcall(Qnil, rb_intern("p"), 1, value), which looks up the ID for the method p and requests the interpreter to call it with 1 argument, the value. Since p is a Kernel method that is available on every object, including nil, and rb_funcall does not enforce method visibility, we can pass Qnil as the receiver on which to execute the method (instead of a specific module, class or other object).
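A rough Ruby-level counterpart of the rdbg macro helps to show why nil works as the receiver. In this sketch, to_sym is used as a loose analogue of rb_intern and send as a loose analogue of rb_funcall, since send also ignores method visibility; the method name rdbg is reused here only for illustration.

```ruby
# Ruby-level counterpart of the rdbg macro: call the method named "p"
# on nil with one argument.
def rdbg(value)
  method_id = "p".to_sym     # loose analogue of rb_intern("p")
  nil.send(method_id, value) # loose analogue of rb_funcall(Qnil, id, 1, value)
end

rdbg([38.5, -120.2]) # prints [38.5, -120.2]

# p is private on nil, so only a visibility-ignoring call can reach it.
puts nil.respond_to?(:p)       # false
puts nil.respond_to?(:p, true) # true
```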
A Model to Follow
If you are writing simple C functions to replace some Ruby methods that are slow, you now have a model that you can follow:
- First, create a module and add your methods to it. These will happen in the Init_ function of the extension.
- Each of the methods must follow the signature that matches the definition when calling rb_define_module_function (e.g., we used -1 to ensure that it’s called with an argc/argv combination)
- Then, in each of the functions, at a minimum, make sure you do:
  - Use rb_check_arity to confirm that the correct number of arguments have been passed (or raise an exception)
  - For each mandatory argument, check the type is correct and convert it into the C data type you expect
  - Check if optional arguments are provided. For each, check the type is correct and convert them into C data types you expect
  - This might be needed multiple times if your code receives an array that has many elements – each of them needs to be checked and converted
  - You need a VALUE that will be returned to the Ruby caller and convert the C types into the correct VALUE
  - If you allocated memory using malloc, remember to free it.
  - If you allocated memory using malloc, remember to free it (just a reminder)
  - If you allocated memory using malloc, remember to free it wherever control is transferred back to the Ruby space, e.g., it’s especially easy to overlook freeing memory before raising an exception.
- Where necessary, use methods like rb_raise to raise an exception
- If needed, use rb_funcall to call arbitrary methods that exist in the Ruby space
The extension we looked at did not wrap an existing library or large body of C code in the native extension. For that reason, it could follow the above method and interleave the logic and the code to create VALUE items and use the Ruby API directly. If, on the other hand, we wrapped an existing C library, we would need to write interface code that would receive the arguments from the Ruby code, convert them all to C types, then call the existing C methods in the library, and finally convert the returned C types back for returning to the Ruby caller.
Summary
We have actually looked at all the Ruby related C functions that the code uses and have identified how C code could be used to add functionality. Specifically, we have looked at how to let the Ruby interpreter know where the native code exists and how to call it, how to check the number of arguments passed to the C function and their types, how to convert between Ruby data and C types, how to do type checking, how to create Ruby objects from C, how to raise errors and how to call Ruby methods from C.
It might seem like a lot but in reality, it is just the tip of the iceberg. There are many more functions and capabilities that are possible when using the Ruby C API. Also, our case has been particularly simple since we don’t provide any classes that have a representation in C and persist data. If we were trying to create bindings from Ruby to an existing C library, we might need to define Ruby classes and have objects that store data. We don’t cover any of those since the gem we are looking at does not need these.
However, having come this far, we are now able to add simple functions to accelerate processing of select Ruby methods by providing a C implementation that compiles to native code and might run faster than the Ruby equivalent.
Looking ahead
In the next couple of parts, we look at documentation and benchmarking. Stay tuned for Part 6. If you have any comments, please feel free to leave them below.
Short list of references
I will add links and references later, possibly in the last post of the series. However, some of the links below were very useful and heavily used while creating this post.
- The main Ruby RDoc for extension – https://docs.ruby-lang.org/en/3.0/extension_rdoc.html
- List of Ruby exception classes in the source code – https://sonots.github.io/ruby-capi/da/da6/group__exception.html#gac5bd9245454f935493c41daa4f2651ae
- A note on memory leaks and using rb_funcall – https://blog.appsignal.com/2023/01/25/calling-ruby-methods-in-c-avoid-memory-leaks.html
- The Definitive Guide to Ruby’s C API (a bit dated, though) – http://silverhammermba.github.io/emberb/c/