Achieving web-scale with Drupal and Unicorns
When I began my tenure at Acquia I was immediately faced with a problem: all of our sites, systems, and services (excepting Drupal Gardens and the Acquia Cloud backend) were running on a single Drupal site. Compounding that, it was a site with so many custom modules and features that everyone was all but terrified to touch it because any mistake could bring nearly all our services crashing to the ground.
Because of this, the needs of the business units and of the services being provided by the site were not being met. Additionally, because the site was serving so many purposes, it was often running too hot and performing poorly.
The site had hit a wall both in the codebase and on its existing servers. We were left with a serious problem: how to scale this site to meet rapidly increasing traffic and data demands.
A simple option was to take advantage of the fact that the site was in the Acquia Cloud and we could easily add more servers to scale it up. But that would only mask the performance issues — when we hit the next wall (in the database layer) fixing things would be even harder. Adding more hardware also did nothing to address the fact that the codebase was becoming increasingly hard to maintain.
Armed with that understanding as a starting point, coupled with the knowledge that we needed a better way to track all customer actions and centralize data, we embarked on a journey to build out the “enterprise” version of that single site. This journey would ultimately lead us to an architecture that now spans dozens of servers which, combined, make thousands of API calls a day amongst an interconnected web of Drupal sites.
Many Sites Make a Light Load
Experiences building enterprise websites for large corporations taught me that sites need to work together when customer interactions happen across multiple sites.
And so the first requirement was set: The site must be split into many smaller sites, each of which would serve a single purpose (or limited set thereof).
However, past experiences had also taught me that there was a trap in that line of thinking — one that had to be avoided at all costs: the sites could not rely on one another being up to serve their core purpose. Any number of the sites could be down and the remaining sites must still function as intended.
This meant that we had to transfer users, roles, and content (which is where we store our customer entitlements and info) between a lot of different Drupal sites, and we had to do it while maintaining and improving all of our offerings — all those starved internal business units really wanted to add value to their online offerings. That meant that Drupal 5 (yeah, you read that right — more on this later), 6, and 7 sites had to talk and share information with each other.
To facilitate this, we chose the Services module.
Why We Chose the Services Module
After deciding that our core need was a bi-directional synchronization of users, roles, taxonomy, content, entities, and other arbitrary data, we began looking for existing modules that would provide a base for some or all of these needs.
A requirement that I placed on this search was REST. Because I knew that we would eventually need to expose this data internally to the business as we grew, and because it was very likely that some of the APIs we created would need to be exposed externally, REST was the obvious choice. It is terribly simple to integrate, and anyone who can use cURL can test an API built with it.
The 3.x branch of the Services module is a leap forward from the previous versions. It is available for Drupal 6 and 7, and provides a clear and concise system that separates authentication (we use both session-based and OAuth), resources (things like nodes, users, roles, taxonomy, etc.), and endpoints (essentially separate base paths on your site that can expose different resources with different authentication methods). It is also incredibly simple to add resources of your own devising, so that even custom code and systems you have written can be exposed as resources.
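To give a sense of what that looks like in code, here is a minimal sketch of a custom resource definition. The module, callback, and permission names are invented for illustration, and the exact array keys can vary between Services releases, so treat it as a starting point rather than production code:

```php
<?php
/**
 * Implements hook_services_resources().
 *
 * Exposes a hypothetical "entitlement" record as a REST resource.
 */
function example_entitlements_services_resources() {
  return array(
    'entitlement' => array(
      'retrieve' => array(
        'help' => 'Retrieves a single entitlement record.',
        'callback' => '_example_entitlements_retrieve',
        'access callback' => 'user_access',
        'access arguments' => array('view entitlements'),
        'access arguments append' => FALSE,
        'args' => array(
          array(
            'name' => 'id',
            'type' => 'int',
            'description' => 'The entitlement ID.',
            'source' => array('path' => 0),
            'optional' => FALSE,
          ),
        ),
      ),
    ),
  );
}

/**
 * Resource callback: load and return the record (stubbed here).
 */
function _example_entitlements_retrieve($id) {
  // A real module would load this from a custom table or entity.
  return (object) array('id' => $id);
}
```

Attach a resource like this to an endpoint in the Services UI and a quick cURL request against the endpoint path is enough to confirm that it responds.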
It’s important to note that Services' close cousin is the UUID module. In our architecture, all nodes, users, vocabularies, roles, etc., have a distinct UUID, which makes it possible to know whether a node (for example) already exists on a remote server so that you can update it rather than create it.
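On the receiving end, that update-or-create decision boils down to a UUID lookup. Assuming the Drupal 7 UUID module's entity_get_id_by_uuid() helper (the function name and payload shape below are otherwise hypothetical), a simplified version looks like this:

```php
<?php
/**
 * Saves an incoming node payload, updating the local copy if its UUID
 * already exists. A rough sketch; the $incoming array shape is hypothetical.
 */
function example_sync_save_node(array $incoming) {
  // Map the remote UUID to a local node ID, if we already have one.
  $ids = entity_get_id_by_uuid('node', array($incoming['uuid']));

  if (!empty($ids[$incoming['uuid']])) {
    // The node already exists locally: update it.
    $node = node_load($ids[$incoming['uuid']]);
  }
  else {
    // No local copy yet: create a new node carrying the same UUID,
    // which the UUID module preserves on save.
    $node = new stdClass();
    $node->type = $incoming['type'];
    $node->uuid = $incoming['uuid'];
    $node->language = LANGUAGE_NONE;
    $node->status = 1;
  }

  $node->title = $incoming['title'];
  node_save($node);
  return $node;
}
```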
Unified, but Separate
An architecture where all the sites had only the information and modules which they needed to function, and where they did not rely on any other sites being up to do so, meant that all of the individual sites (our documentation sites, knowledgebase, network and insight sites, acquia.com, and other systems) could never make a blocking external API call during the course of any user-facing action.
If you haven’t already guessed, even though we had chosen Services as the module that would facilitate the sharing of all this data, we still had to make a key decision: peer-to-peer data pulling, or a system with a central server facilitating the data sharing.
Three reasons led us to choose a central server: first, the business needed a single source of truth when it came to the customer data being shared; second, a peer-to-peer system would have customer-facing sites passing data to x-number of other customer-facing sites, a scenario which would place too much load and latency on those sites; third, a central server would allow us to move all third-party integrations (CRM, lead management, finance, etc.) off the customer-facing sites and allow us to do each remote integration only once.
The logical outcome of this set of decisions was a push-architecture, where the individual sites push data to the central server when a change is made. From there, the central server puts the update into a queue and (in our current scenario) checks every minute to see if any data needs to be pushed to any of the other sites. Fig. 1 shows a simplified diagram of what this looks like.
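In Drupal 7 terms, the pattern is straightforward: hooks on the central server enqueue changes, and cron drains the queue. The sketch below uses the core queue API; the module, queue, and helper names are hypothetical, and the helper that makes the actual REST call is sketched a little later in this article:

```php
<?php
/**
 * Implements hook_node_update().
 *
 * Queue the change rather than calling client sites inline, so no
 * user-facing request ever blocks on a remote API call.
 */
function example_hub_node_update($node) {
  DrupalQueue::get('example_hub_outbound')->createItem(array(
    'uuid' => $node->uuid,
    'type' => 'node',
  ));
}

/**
 * Implements hook_cron_queue_info().
 *
 * Cron (run every minute in this setup) drains the queue.
 */
function example_hub_cron_queue_info() {
  return array(
    'example_hub_outbound' => array(
      'worker callback' => 'example_hub_push_item',
      'time' => 30,
    ),
  );
}

/**
 * Queue worker: push one queued change to every client site.
 */
function example_hub_push_item($item) {
  // example_hub_push_to_site() makes the actual REST call; see the
  // timeout and error-handling sketch later in the article.
  foreach (variable_get('example_hub_client_sites', array()) as $site_url) {
    example_hub_push_to_site($site_url, $item);
  }
}
```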
Was Drupal Really the Best Solution for the Central Server?
When we began talking about how to set up our central server, there was a lot of discussion. Everything from whether or not to use Drupal, to other MVC frameworks (in a variety of languages), to a purely custom solution written in a host of different possible languages was considered. We also talked a lot about the data store and whether we should go with a schema-less, NoSQL setup or a full-on relational schema in MySQL.
Two things drove us to use Drupal. The first was that it was the one thing we all knew well. The second was that we knew we would have to build all sorts of reports and views on the data that passed through the central server. In this author’s humble opinion, the ability to create arbitrarily related data entities in Drupal is second to none, and the ability to extend those entities, report on them (Views module), and create all sorts of REST endpoints using Services to access that data is best-in-class as well.
Combine that with the fact that I personally love relational databases (for a lot of very good reasons that would take a whole other article) and Drupal was cemented as the obvious choice.
Oh… and This Isn’t Just an Idea Anymore
We’ve borne this out now in reality. Our central server handles thousands of inbound requests, updates its records (and gives us a historical view thanks to revisions), and then pushes data out to all the client sites. This happens with a shockingly small number of errors and failed connections, and is now essentially the backbone of Acquia’s data infrastructure.
Wait, You Said There Are Errors and Failed Connections
I did indeed say that. As with any web-based API, there will be errors. Sites will attempt to connect to other sites that are down, have some problem, or maybe are just temporarily un-routable. The first rule of API-based communications is: plan for failure. Well, maybe not the first, but it should certainly be in the top five.
Now that we’re done talking about architecture, let’s get into scaling this beast.
Any good hosting setup needs to have a very firm handle on concurrency and memory management. You should never allow more processes to run than the hardware (most importantly the memory) can support. The end result of this is that there is a limit to the number of PHP processes that can run at any time.
When you’re dealing with users requesting pages via browsers, you have a lot of options as to how you handle this problem. Many of your requests can be served from a front-end caching solution like a Varnish reverse proxy. Failing that, you can load parts of your page dynamically, or even just make your users wait.
APIs tend to be less forgiving. You can almost never cache what they’re asking for (especially if they are pushing), and no one wants to let their API call wait for excessive periods of time while the server on the other end gets around to responding. Our system is very impatient: most of our sites wait only 3-5 seconds for an API call to succeed before timing out and returning an error.
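Concretely, the push helper referenced in the earlier queue sketch might look something like the following. The endpoint path and re-queue policy are illustrative, and authentication (session or OAuth headers) is omitted for brevity:

```php
<?php
/**
 * Pushes one queued item to a client site, with a short timeout and
 * explicit failure handling. Hypothetical helper, not production code.
 */
function example_hub_push_to_site($site_url, array $item) {
  $response = drupal_http_request($site_url . '/api/node', array(
    'method' => 'POST',
    'data' => drupal_json_encode($item),
    'headers' => array('Content-Type' => 'application/json'),
    // Give the remote site only a few seconds before giving up.
    'timeout' => 5,
  ));

  // Treat anything other than a 2xx response as a failure: log it and
  // re-queue the item so the next cron run can retry.
  if (isset($response->error) || $response->code < 200 || $response->code >= 300) {
    watchdog('example_hub', 'Push to @site failed: @error', array(
      '@site' => $site_url,
      '@error' => isset($response->error) ? $response->error : $response->code,
    ), WATCHDOG_WARNING);
    DrupalQueue::get('example_hub_outbound')->createItem($item);
    return FALSE;
  }
  return TRUE;
}
```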
We needed a way to guarantee that API calls between the various sites and the central server would be allowed a connection and execute in a reasonable amount of time.
I don’t want to blow our own horns here too much, but the Acquia Cloud made this solution incredibly simple. You see, like good doobies, we were using memcache on all our sites. Because money is no object (hopefully my boss doesn’t read this — they’re m1.smalls, so it's cheap!), we have dedicated servers for memcache. However, we don’t typically use all the memory on them (they usually have about 500MB unallocated) and, due to an architectural quirk, they actually have the codebase for our sites checked out onto them. So I talked the idea over with our ops team and we decided that these servers would be excellent dedicated machines for exclusively handling REST calls from the central server. This means that there are dedicated PHP processes used exclusively for REST calls; even an attack that hits the regular webservers will not break this line of communication.
At left is an image showing what the architecture for one of the sites looks like.
Is the Services Module Enough?
It’s likely that while reading this article you’ve found yourself wondering if the Services module alone was enough to pull this off. Simply put: not by a long shot. Aside from aspects of the Acquia Cloud which we leveraged to the hilt, we had to deal with a fundamental limitation of the Services module: it only listens. What Services gives us (in our case) is a REST server sitting on a Drupal site that is invoked by making PUT, POST, DELETE, or GET requests to various URLs on that Drupal site.
What we needed was the yin to Services’ yang: something that would push data to a site running the Services module on a variety of hook invocations. Enter Services Client. This module has a sub-module named Services Connection that is used to define connections to remote sites running Services and also handles loop prevention. The core Services Client module hooks into saves and updates of users, nodes, taxonomy terms, etc., and then, if certain conditions (defined in the UI) are met, passes the data to the remote site.
Services Client also handles all field mapping. This means that you can map field ‘foo’ on one site to field ‘bar’ on another site. It also means that you can map between Drupal 7 core fields and Drupal 5/6 CCK fields. You read that right: Services Client can talk between Drupal 5, 6, and 7. That also makes it a fantastic way to handle an upgrade from one version of Drupal to another, since you can move functionality from one site to another while leaving both in operation for a period of time.
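Services Client’s mapping is configured through its UI rather than in code, but purely to illustrate the concept, a hand-rolled mapping between one site’s field names and another’s might look like this (the field names are made up):

```php
<?php
/**
 * Illustration only: not Services Client's API, just the field-mapping
 * idea in its simplest form.
 */
function example_map_fields(array $source) {
  // Local field name => field name the remote site expects, e.g. a
  // Drupal 7 field mapped onto a Drupal 5/6 CCK field.
  $map = array(
    'field_foo' => 'field_bar',
    'field_subscription_level' => 'field_entitlement',
  );

  $payload = array();
  foreach ($map as $local => $remote) {
    if (isset($source[$local])) {
      $payload[$remote] = $source[$local];
    }
  }
  return $payload;
}
```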
A variety of other modules were also used to create this setup, such as Services Raw, Services UUID, Services Translator, UUID, and a set of custom integration modules.
The Whole Picture
Now that we have created this architecture and have had it in use for nearly a year, it has begun to work its way into all our systems. Not only does it power the core sites and services, but other internal tools are beginning to make internal API calls to the central server for data, and we’ve even created a setup where every day the central server dumps to a reporting site where users can generate and run reports on our datasets, without concern about interrupting production systems.
Most importantly, our core goal has been met: our systems can scale indefinitely, and in many cases users move between different Drupal sites on different Drupal servers without even knowing it has happened.
The central server performs admirably and provides near real-time data synchronization not only between all of our Drupal sites, but also out to CRM and lead management systems — with other third-party integrations in the works. This gives us an in-house single source of truth when it comes to our customers and their entitlements, and complete control over our data and what we do with it.
The number of sites surrounding our central server now totals fourteen, and though some are shared servers, many are on dedicated systems, meaning that over forty AWS instances now participate in this system. This is no small group of sites either; this is powering nearly all aspects of an enterprise business setup that we believe is a banner implementation of how to make Drupal function as the backbone of any business.