JRuby on Windows: Day 3 - Using Apache POI (Java) to create XLSX files

One of the great advantages of using JRuby is the ability to connect with Java libraries (JARs) and call Java code from your Ruby scripts. This post shows how to use Apache POI (the Java API for Microsoft Documents) to generate a simple XLSX file.

Apache POI is implemented in Java and can be called from other Java code if it is set up correctly. JRuby takes care of a lot of the plumbing that is required to make this work from your Ruby scripts. In this post, we go through the basics of making the code work.

Framing our Approach

At a very high level, there are two ways to approach the creation of the XLSX file:

  • Direct mode: in this mode, the data for the full Excel file is kept in memory and then finalised and flushed to disk all at once. The memory consumption in this case can be significant, especially if the file is very large.
  • Streaming mode: in this mode, the data is continuously written to a temporary file and this results in a much tighter memory profile since a specified number of records are kept in memory. The rest are written to the disk.

Picking one over the other is a matter of trade-offs and is outside the scope of this post. You should read the benefits and limitations of using the SXSSF Streaming User API on the Apache POI site. Also, see the table at the end of this page that shows which features are not available in which format. Finally, if you try something else, the main how-to page is very good reading.

In this post, our aim is also to be able to use the sample code on the how-to page for the SXSSF Streaming User API from JRuby and to see how it is approached and done.

If you want to jump ahead and just get the source code, take a look at my rb-xlsx-converter-comparison repository on GitHub and read the README and specifically look at the code in use_poi_stream.rb for what works.

Background to Apache POI

Here are a few things that we need to know about Apache POI and that will help us find the way to proceed when integrating from JRuby.

  1. As we said earlier, it’s Java – so, it’s distributed as compiled JARs. We will need to grab the latest stable distribution
  2. Many JAVA libraries are very well documented and Apache POI is no exception. We will need to refer to the API Documents for Apache POI frequently.
  3. Naturally, it helps to have some code that we use to get started. We will use the sample code on the how-to page for the SXSSF Streaming User API. If you try to do something more complicated and want to look up how to achieve it, the Busy Developers’ Guide to HSSF and XSSF Features is an excellent quick reference. Keep it handy.
  4. Of course, for this work, we are using JRuby. So, you need Java, and you also need to install JRuby on your system

Our Scope

Simply: use JRuby + Apache POI to create an Excel file that has 1000 rows with each row having 10 cells filled with some random data.

So, if we are to put in pseudocode, we just need to do this:

initialize the xlsx file
for 1000 rows
  for 10 columns
    generate and add a cell
  end
end
finalize the xlsx file

Going back to JRuby

By now, you probably know how to use jruby to run a ruby script. If you need a quick refresher take a look at this post on running scripts from JRuby on this site. In short, we expect to basically just do jruby script.rb on the command line.

In preparation, you might also want to read the page on Calling Java from JRuby on the JRuby wiki.

Putting this post together basically needed:

  • Download stuff
  • Read the wiki on how it works
  • Find the sample code
  • Try to convert it to JRuby
  • Stare at the failure
  • Read the wiki again
  • Check the sample again
  • Check the API documentation (ah!)
  • Fix the code and repeat

So, yes, anyone can do it!

Let’s get started – setting up

The first thing we need to do is set up the project and then we will get into the rest of the details.

For this post, we will follow a simple approach to get this working. The main Ruby script will go into the root of the project folder for now and we’ll include folders for the Java JARs.

Now, let’s download what we need.

The directory structure should like below:

use_poi.rb
 ├───log4j
 └───poi-bin-5.1.0
     ├───auxiliary
     ├───lib
     └───ooxml-lib

Good! We are ready to get started. If you have a different directory structure or layout, don’t worry – you’ll only need to make some minor changes in the require_relative statements.

I’ve used JRuby 9.2.18.0 but any recent version should work.

Quick Overview of Connecting up from JRuby

Here are some things to keep in mind when using a Java JAR from JRuby. There’s a lot more on the Calling Java from JRuby page which you should definitely read.

  1. You need to require 'java' so that you are able to bridge through to the Java world
  2. You need to require all the JAR files that are needed by the script (and by the JARs you require)
  3. Once that is done, you’ll be able to refer to Java classes via their full paths
  4. You will then be able to call the Java methods defined on any of these classes

I found that the most difficult part was #3 – finding what is a class and is available for you to call from your JRuby code. Sometimes, you get errors that do not point out obviously what might have failed, and this can take a bit of time to work out. It naturally gets easier as you do more of it (or if you’re also familiar with Java).

Constructing our script

So, we are ready to start now that we have the directories set up as required and the JAR files available.

Require Java & the JAR files

The first bits of code we need are:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
require 'java'

require_relative 'log4j/log4j-core-2.16.0.jar'
require_relative 'poi-bin-5.1.0/poi-5.1.0.jar'
require_relative 'poi-bin-5.1.0/lib/commons-codec-1.15.jar'
require_relative 'poi-bin-5.1.0/lib/commons-collections4-4.4.jar'
require_relative 'poi-bin-5.1.0/lib/commons-io-2.11.0.jar'
require_relative 'poi-bin-5.1.0/lib/commons-math3-3.6.1.jar'
require_relative 'poi-bin-5.1.0/lib/log4j-api-2.16.0.jar'
require_relative 'poi-bin-5.1.0/lib/SparseBitSet-1.2.jar'
require_relative 'poi-bin-5.1.0/poi-ooxml-5.1.0.jar'
require_relative 'poi-bin-5.1.0/poi-ooxml-full-5.1.0.jar'
require_relative 'poi-bin-5.1.0/ooxml-lib/commons-compress-1.21.jar'
require_relative 'poi-bin-5.1.0/ooxml-lib/commons-logging-1.2.jar'
require_relative 'poi-bin-5.1.0/ooxml-lib/xmlbeans-5.0.2.jar'
require_relative 'poi-bin-5.1.0/ooxml-lib/curvesapi-1.06.jar'

This ensures that we have access to everything that we require (no puns intended). If you’re unsure, you can start by requiring everything first from the poi folders and then removing things till your code breaks.

You should also look at the Apache POI Component Map that explains which JARs are needed for which document format or capability.

Identify the Main Class

As you know, we intend to use the SXXSF Streaming User API and the way the code works is basically this:

  • Create a new workbook and then:
    • Use the API from the workbook to create a new sheet in it
    • Use the API from the sheet to create each row
    • Use the API from the row to create each cell
    • Assign a value to the cell
  • Finally, save to a file
  • Clear up the temporary workbook

A lot rests on the new workbook class since everything else is created from there. If you look at the code it does this to access the SXSSFWorkbook.

1
2
3
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

SXSSFWorkbook wb = new SXSSFWorkbook(100); // keep 100 rows in memory, exceeding rows will be flushed to disk

In JRuby, this becomes:

1
2
3
4
5
# Point to the Java class for the streaming file
StWb = org.apache.poi.xssf.streaming.SXSSFWorkbook

# Create new file keeping only the latest 100 records in memory
wb = StWb.new(100)

The rest of the work is now not complicated.

1
2
3
4
5
6
7
8
9
Sheet sh = wb.createSheet();
for(int rownum = 0; rownum < 1000; rownum++){
  Row row = sh.createRow(rownum);
  for(int cellnum = 0; cellnum < 10; cellnum++){
      Cell cell = row.createCell(cellnum);
      String address = new CellReference(cell).formatAsString();
      cell.setCellValue(address);
  }
}

This quickly becomes roughly as below.

1
2
3
4
5
6
7
8
9
10
# Add a sheet to the new workbook (must provide sheet name)
sheet = wb.createSheet('Sheet 1')

# Add 1000 rows to the sheet
(0...1000).each {|rownum|
  row = sheet.createRow(rownum)
  (0...10).each {|cellnum|  # Add 10 colums to the sheet
    row.createCell(cellnum).cell_value = (rand * 1000).to_i # store a 4 digit random value
  }
}

As you can see, we just follow the same arrangement and do createSheet, createRow and createCell as required. In addition, for a cell, we assign a value to it by doing cell_value = which is equivalent to doing setCellValue in Java.

One quirk I found was that a sheet name needed to be provided in Ruby while it seems to be optional in Java.

Next, we need to finalise the file. The wb.write function needs a Java IO FileOutputStream (as per its Java API) and so, we need to create that by again bridging across to Java to create it. After that, we close the FileOutputStream and clean up the temporary file(s) that were created by caling dispose on the SXSSF Workbook we used.

1
2
3
4
5
6
7
# Finalise the file - needs a Java IO FileOutputStream to write it out
java_import java.io.FileOutputStream
fos = FileOutputStream.new('poi_1k-rows.xlsx')

wb.write(fos) # Combines data from the temporary file into the final file
fos.close # Closes the file
wb.dispose # Removes the temporary file(s) that were created along the way

Wrapping up

We covered quite a bit of ground in this post. A very similar example to this is on my rb-xlsx-converter-comparison repository on GitHub – check the README and specifically look at the code in use_poi_stream.rb for what works.

There’s a lot of further reading if you are interested.

Further reading on JRuby and calling Java

  1. Naturally, the canonical post is the page on Calling Java from JRuby on the JRuby wiki.
  2. A very old page (~2007) is still relevant where it comes to using Java classes from JRuby
  3. Some additional code in a gist – https://gist.github.com/adilsoncarvalho/1077172/8d4f6ea519377db3ac81dc53544f6bd9b17b79aa

Further reading on Apache POI

The Apache POI documentation is very extensive and there is quite a lot to look at as you work through it.

  1. First up, refer to the API Documentation – you’ll need this as you try to find out what is available and can be called and at what level
  2. POI Components Mapping lists which dependencies are needed for which file format
  3. The porject has a “page on accesing POI from JVM Languages” https://poi.apache.org/components/poi-jvm-languages.html that is relevant to give an idea of how things are done from other languages – currently, they don’t have information for JRuby
  4. The New Halloween Document is the main how-to page that is essential reading if you intend to use POI. Also, the busy developer’s guide to HSSF and HSSFX features is the quickest way to find a way to solve a specific problem you are facing

Other Sample code

If you’re struggling, it helps to look at other people’s code.

  1. A simple post on using Apache POI from Java is at https://www.codejava.net/coding/how-to-write-excel-files-in-java-using-apache-poi
  2. There are posts in Japanese here and here that show some of the code to use POI
  3. There is a gem that works with older versions of POI but the code can serve as a guide even for later versions

With that, we come to the end of this post. I hope you find it useful. If you have any other information to add, please let me know by commenting below and I will try to update the post.

comments powered by Disqus