PDF generation in Ruby using Google's Puppeteer, on Heroku

October 24, 2017 Building Apps

Google Puppeteer for Ruby

PDF generation has always been a big part of my apps, so over the years I've tried a lot of different PDF libraries. I've used PrawnPDF, which is great but has its own language. I've also used wkhtmltopdf and, most recently, PhantomJS. However, they've all had different issues, either with stability or with the quality of the HTML to PDF conversion. Custom fonts have been the toughest part to get right, usually failing unexpectedly and without warning. So when I heard Google were adding a "headless" mode to Chrome, and also releasing a Node.js library for interacting with it, Puppeteer, I was very excited!

“Puppeteer is a Node library which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.”

— https://github.com/GoogleChrome/puppeteer

However, running a Node.js library from within a Rails app was not something I'd done before, and I couldn't find much info or complete instructions, so I decided to document it here for my own future reference.

Headless is great because it gives us much of the functionality of a normal Chrome browser, but it can run entirely on a remote server. You can interact with websites in code, including the PDF features. This is the same idea behind wkhtmltopdf and PhantomJS, but Puppeteer is powered by a modern, maintained browser. This means it handles HTML5, CSS3 and Google Fonts, which the alternatives have always had trouble with, and it's maintained by Google themselves. The resulting PDFs from my experiments so far look perfect!

DISCLAIMER: I'm not saying the below is a good idea, or even that I know what I am doing.. just that it can be done and had no negative impact on my experimental project

Adding node modules to your Rails app

Firstly, make sure you have npm and node installed, in a recent version (I'm using 8.6.x): https://www.npmjs.com/get-npm

To get node running alongside Rails, this great article helped me on the right track: https://ricostacruz.com/til/npm-in-rails

I changed a few commands and skipped a couple, so here's what I did:

1. cd into the folder containing your Rails project.

2. Ignore the folder that will contain puppeteer and any other node module:

echo '/node_modules' >> .gitignore

3. Initialize a new npm/node project. It will ask you some questions; I went with the default for everything, except that I named my entry point file "pdf.js" instead of "index.js":

npm init

4. Create a new initializer, which will take care of pulling in the node modules on new dev or test machines: config/initializers/npm.rb

system 'npm install' if Rails.env.development? || Rails.env.test?
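Running npm install on every boot can get slow, so here is a slightly more defensive variant as a sketch. The helper name npm_install_needed? and the rule "skip when node_modules is newer than package.json" are my own assumptions for illustration, not something npm or Rails provides.

```ruby
require "pathname"

# Hypothetical helper: returns true when `npm install` looks necessary,
# i.e. there is a package.json but node_modules is missing or older than it.
def npm_install_needed?(root)
  root = Pathname.new(root)
  package_json = root.join("package.json")
  node_modules = root.join("node_modules")

  return false unless package_json.file?      # nothing to install
  return true unless node_modules.directory?  # never installed

  node_modules.mtime < package_json.mtime
end

# In config/initializers/npm.rb this could be used roughly like so:
#
#   if (Rails.env.development? || Rails.env.test?) && npm_install_needed?(Rails.root)
#     system("npm install") || warn("npm install failed, Puppeteer may be missing")
#   end
```

The modification-time comparison is only a heuristic; when in doubt, running npm install by hand is always safe.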

5. Add the puppeteer module. --save records it under "dependencies" in package.json, and --save-exact pins the exact version instead of a version range:

npm install --save --save-exact puppeteer

6. I added the node version I want to use directly into package.json. This ensures the same node version is used when you deploy to Heroku later.

"engines": { "node": "8.6.x" },

7. So now in your app root, you have a "node_modules" folder, "package.json" and "package-lock.json" files, and a "pdf.js" file (npm init only records the entry point name, so create an empty pdf.js if it is not there yet). Your package.json should look something like this:

{
  "name": "my-awesome-app",
  "version": "1.0.0",
  "engines": {
    "node": "8.6.x"
  },
  "description": "My Awesome app - includes Puppeteer",
  "main": "pdf.js",
  "directories": {
    "doc": "doc",
    "test": "test"
  },
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "repository": {
    "type": "git",
    "url": "git+https://github.com/myawesomegithub/myawesomeapp.git"
  },
  "author": "",
  "license": "ISC",
  "bugs": {
    "url": "https://github.com/myawesomegithub/myawesomeapp/issues"
  },
  "homepage": "https://github.com/myawesomegithub/myawesomeapp#readme",
  "dependencies": {
    "puppeteer": "0.12.0"
  }
}
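If you want to double check from Ruby that the version really is pinned and that the engines entry is in place, something like the sketch below would do. The helper names are made up for this example, and "left-pad" is just a stand-in dependency to show what an unpinned range looks like.

```ruby
require "json"

# Hypothetical helper: lists dependencies whose version is a range
# (^1.2.3, ~1.2.3, *, ...) rather than an exact pin like "0.12.0".
def unpinned_dependencies(package_json_text)
  deps = JSON.parse(package_json_text).fetch("dependencies", {})
  deps.reject { |_name, version| version.match?(/\A\d+\.\d+\.\d+\z/) }.keys
end

# Hypothetical helper: the Node version requested via "engines", or nil.
def requested_node_version(package_json_text)
  JSON.parse(package_json_text).dig("engines", "node")
end

# Made-up sample; in the app you would pass File.read("package.json").
sample = <<~PACKAGE
  {
    "engines": { "node": "8.6.x" },
    "dependencies": { "puppeteer": "0.12.0", "left-pad": "^1.3.0" }
  }
PACKAGE

puts unpinned_dependencies(sample).inspect # puppeteer is pinned, left-pad is not
puts requested_node_version(sample)
```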

Generating a PDF with Puppeteer

Once you have everything above set up, you can create PDFs with Puppeteer from the Terminal to see some actual results.

8. Replace the contents of pdf.js with the PDF example from the puppeteer repo: https://github.com/GoogleChrome/puppeteer

9. One thing that caused me some issues was that the example does not catch and handle errors, so a failure can leave zombie browser processes running on your server, eventually eating up all memory and CPU :(

The example PDF code for puppeteer looks similar to this:

'use strict';
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
  await page.pdf({path: 'hn.pdf', format: 'A4'});

  await browser.close();
})();

I changed it to be this instead (UPDATED August, 2018):

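'use strict';

const puppeteer = require('puppeteer');

// Usage: node pdf.js <url> <output path>
const createPdf = async () => {
  let browser;
  let exitCode = 0;
  try {
    browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
    const page = await browser.newPage();
    // Give up if the page has not loaded within 3 seconds.
    await page.goto(process.argv[2], {timeout: 3000, waitUntil: 'networkidle2'});
    // Short extra pause after the network has gone quiet.
    await page.waitFor(250);
    await page.pdf({
      path: process.argv[3],
      format: 'A4',
      margin: { top: 36, right: 36, bottom: 20, left: 36 },
      printBackground: true
    });
  } catch (err) {
    console.log(err.message);
    exitCode = 1; // let the caller see that something went wrong
  } finally {
    if (browser) {
      // Wait for Chrome to shut down before ending the process.
      await browser.close();
    }
    process.exit(exitCode);
  }
};
createPdf();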

This makes sure the browser and the process are still closed if any errors are thrown in your script. I've also added the --no-sandbox and --disable-setuid-sandbox flags, as they are currently needed to run headless Chrome on Heroku. Running Chrome without its sandbox is not recommended, as it can introduce security issues.

10. I also replaced the hardcoded path of the website and the path for saving the PDF, with arguments that we pass in instead (much like params in Rails):

process.argv[2]

and

process.argv[3]

11. In Terminal, type:

node pdf.js https://google.com output.pdf

12. Once done, you should have a PDF file called output.pdf of Google's website in the same folder.. winning!

Calling Puppeteer from within Ruby

If you've ever used PhantomJS or another binary from within Ruby, then this part will feel familiar. All you need to do is call the system command from Ruby. There are a few ways to do this, such as system(command) and %x{command}. I'm using system() for the PDF generation, as I just need to know when it's done and don't need the PDF itself back from the process. On another project I need a JSON response back from the process, so there I will use %x{command}.

13. So in your Rails controller, you can now do either of these:

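# Returns true when the command exits with status 0; use this when you only need to know it finished.
system("node pdf.js #{Shellwords.escape(website_path)} #{Shellwords.escape(filename)}")

# Returns whatever the script printed. Note there are no quotes around the command inside %x{}.
output = %x{node pdf.js #{Shellwords.escape(website_path)} #{Shellwords.escape(filename)}}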

This great flowchart explains which one to choose, depending on your use case: https://stackoverflow.com/a/30463900

Anything you console.log() in the node script is what ends up in your Ruby variable if you use the %x{} method above.
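If you would rather not build a shell string at all, Ruby's Open3 from the standard library accepts the command as separate arguments, which avoids the escaping step. Below is a minimal sketch; run_command and CommandFailed are names I made up for this example, and the exit-status check is only as useful as the exit code your pdf.js actually reports.

```ruby
require "open3"

# Raised when the child process exits with a non-zero status.
class CommandFailed < StandardError; end

# Runs a command given as separate arguments, so no shell is involved and
# no escaping is needed. Returns everything the command printed
# (stdout and stderr combined).
def run_command(*argv)
  output, status = Open3.capture2e(*argv)
  unless status.success?
    raise CommandFailed, "#{argv.first} exited with #{status.exitstatus}: #{output}"
  end
  output
end

# Hypothetical usage from a controller or service object:
#
#   log = run_command("node", "pdf.js", website_path, filename)
```

This gives you both things at once: the console.log() output as the return value, and an exception when the script signals a failure.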

Getting it all running on Heroku

Getting things running on Heroku is actually quite straightforward, as it's really just a matter of adding the correct buildpacks to your application. I'm going to assume you are adding this to an existing Rails project running on Heroku already.

Keep in mind that Heroku's filesystem should not be used for persistence. However, you can use Tempfiles, and they will be there for the duration of the request. In other words, you can create a Tempfile, save the PDF to it and then use Rails' send_data to output the final PDF to a user ;) Something like this simplified example:

website = "https://google.com"
tmp = Tempfile.new(["pdf-chrome-puppeteer", ".pdf"])
begin
  system("node pdf.js #{Shellwords.escape(website)} #{Shellwords.escape(tmp.path)}")
  pdf_data = File.binread(tmp.path) # binread, because a PDF is binary data
  pdf_filename = "output.pdf"
  send_data(pdf_data, filename: pdf_filename, type: 'application/pdf', disposition: 'inline')
ensure
  tmp.close! # close and delete the temp file
end
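One thing the simplified example does not do is check that a PDF was actually written before sending it. A possible sketch is below; render_pdf_to_string is a hypothetical helper name, and checking for the "%PDF-" header is just a cheap sanity check I picked, not a full validation.

```ruby
require "tempfile"

# Hypothetical helper: creates a temp file, lets the block write a PDF to
# its path, returns the bytes and always removes the file afterwards.
def render_pdf_to_string
  tmp = Tempfile.new(["pdf-chrome-puppeteer", ".pdf"])
  tmp.close                     # the external process writes to the path
  yield tmp.path
  data = File.binread(tmp.path)
  # Every PDF file starts with "%PDF-", so anything else means it failed.
  raise "no PDF was produced" unless data.start_with?("%PDF-")
  data
ensure
  tmp&.unlink
end

# Hypothetical usage in a controller action:
#
#   pdf_data = render_pdf_to_string do |path|
#     system("node", "pdf.js", website, path)
#   end
#   send_data(pdf_data, filename: "output.pdf", type: "application/pdf", disposition: "inline")
```

Passing the arguments to system() separately, as in the usage comment, skips the shell, so Shellwords.escape is not needed there.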

15. In terminal, run the following command to see which buildpacks you are running on. Note: I'm on Cedar-14 stack.

heroku buildpacks

You should see:

1. heroku/ruby

16. First you need to add the puppeteer buildpack, to run before your Ruby buildpack (Ruby should always be last, as that is what will actually be handling the web requests etc.). The --index 1 puts it in the 1st position:

heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack

17. Now add the node buildpack to position 1, so the final order is node, puppeteer and then ruby:

heroku buildpacks:add --index 1 heroku/nodejs

18. Commit all of your changes:

git add .
git commit -m 'Adding node modules support and puppeteer'

19. Push to Heroku and follow the build logs! You should see everything installing, and finally your Ruby app starts as normal.

From this point on, you can call any Puppeteer script in the same way, so you are not limited to PDF generation. For example, I'm experimenting with Puppeteer's page.evaluate function, which allows you to execute JavaScript in the context of a page to get interesting font and color info out of it, much like this awesome site: http://stylifyme.com/?stylify=shopify.com

As mentioned earlier, keep an eye on memory and CPU usage though; we all know how much Chrome loves both ;) I've noticed it using about 60-70 MB per Chrome instance, and as long as you close the browser and the process in your scripts, it's been very stable.
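As an extra safety net on the Ruby side, you could also put a hard time limit on the node process and kill it, together with anything it started, if it hangs. This is only a sketch under my own assumptions: run_with_timeout is a made-up helper, the 30 second limit in the usage comment is arbitrary, and process groups like this only work on Unix-like systems such as Heroku's.

```ruby
require "timeout"

# Hypothetical helper: runs a command and kills it if it takes longer than
# `seconds`. Returns true on a zero exit status, false otherwise or on timeout.
def run_with_timeout(*argv, seconds:)
  # pgroup: true puts the child in its own process group, so that processes
  # it spawned itself (such as Chrome) can be signalled together with it.
  pid = Process.spawn(*argv, pgroup: true)
  begin
    Timeout.timeout(seconds) do
      _, status = Process.wait2(pid)
      status.success?
    end
  rescue Timeout::Error
    begin
      Process.kill("-KILL", pid) # a negative signal targets the whole group
      Process.wait(pid)          # reap the child so it does not linger as a zombie
    rescue Errno::ESRCH, Errno::ECHILD
      # the process was already gone
    end
    false
  end
end

# Hypothetical usage:
#
#   ok = run_with_timeout("node", "pdf.js", website, tmp.path, seconds: 30)
```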

Hope this helps. Feel free to ask any questions in the comments, or point out things I missed :)