The latest rage in the Content Management world is the idea of a “decoupled CMS.” That is, rather than having a single monolithic system that handles everything from content entry to management to display and theming in one program, that responsibility is spread across different systems: one that's really, really good at content storage, one that's really, really good at content management, one that's really, really good at display and theming, and so on.
At the same time, there has been a huge push for web services in almost every market. If you want content to be available anywhere besides an HTML page, then your answer is web services.
Drupal 8 will make huge strides in this area, but alas, it's not out yet. Fortunately, the answer to the second problem is the first: it is entirely possible to build a solid, scalable, performant RESTful web service with Drupal 7, by decoupling Drupal itself from the web service.
Recently, Palantir.net did exactly that for a major media client, video hosting service Ooyala, and it really drove home both the power of web services and the potential of a decoupled architecture.
The Problem
Ooyala wanted us to build a video curation service for one of their customers.
The first part of the problem was that the customer had regularly updated data, but that existing data source was incomplete, occasionally unreliable, and could be enriched with additional metadata. Human management was therefore required before the data could be used in its desired context: describing video content in end-user-facing video-on-demand applications. The solution to this particular problem was the CMS.
The second part of the problem was getting the data in the desired context: highly interactive video-on-demand applications where users could purchase access to individual movies or episodes, collections of movies, seasons of a show, or other arbitrary groupings. The solution here was a REST API separated from the CMS.
Not complex enough? Add in a requirement to merge in data from a third-party video service to compensate for incomplete data.
Not complex enough yet? Add in that different shows should be available for different periods of time depending on contract requirements. And that shows should be shown to users before they can actually be purchased... sometimes.
Still not complex enough? Add in a four-nines uptime requirement, and speed being of utmost priority.
So, Drupal?
Pulling in data from a web service, curating it, and serving it out with a web service sounds like just the thing for Drupal... 8. In early 2013, of course, Drupal 8 was not yet a serious option. Drupal 7 has some support for web services via the Services and RestWS modules, but both are hamstrung by Drupal 7's very page-centric architecture and generally poor support for working with HTTP. We needed something better.
Fortunately, Drupal is not the only tool in the modern web developer's arsenal. After a little investigation we decided that a decoupled approach was the best bet. Drupal is really good at content management and curation, so let it do that. For handling the web service, we turned to the PHP micro-framework Silex.
A Lightweight Web Service
Silex is Symfony2's little brother and therefore a sibling of Drupal 8. It uses the same core components and pipeline as Symfony2 and Drupal 8: HttpFoundation, HttpKernel, EventDispatcher, and so forth. Unlike Symfony2 or Drupal 8, though, it does little more than wire all of those components together into a “routing system in a box”; all of the application architecture, default behavior – everything – is left up to you to decide. That makes Silex extremely flexible and also extremely fast, at the cost of being on your own to decide what “best practices” you want to use.
In our testing, Silex was able to serve a basic web service request in less than a third of the time Drupal 7 needed. Because it relies on HttpFoundation, it is also far more flexible for controlling and managing non-HTML responses than Drupal 7, including playing nicely with HTTP caching. That makes Silex a good choice for many lightweight use cases, including a headless web service. See http://wdog.it/4/2/silex for more.
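To give a sense of just how small that “box” is, here is a complete, if tiny, Silex front controller. This is a sketch in Silex 1.x syntax; the route and the stub payload are ours for illustration, not the actual service:

```php
<?php
// A complete, minimal Silex application (Silex 1.x syntax).
require_once __DIR__ . '/vendor/autoload.php';

use Silex\Application;
use Symfony\Component\HttpFoundation\JsonResponse;

$app = new Application();

// One route, one controller; everything else is up to you.
$app->get('/programs/{id}', function ($id) {
  // The real service would query Elasticsearch here; this stub just
  // shows the shape of a controller returning a non-HTML response.
  return new JsonResponse(array('id' => $id, 'title' => 'Batman Begins'));
});

$app->run();
```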
That then opened up the question of how to get data from Drupal to Silex. Silex doesn't have a built-in storage system, of course. Pulling data directly from Drupal's SQL tables was an option, but since the data stored there often requires processing by Drupal to be meaningful, that wasn't a good alternative. Additionally, the data structure that was optimal for content editors was not the same as what the client API needed to deliver. And we needed that client API to be as fast as possible, even before we added caching.
The solution was an intermediary data store, built with Elasticsearch. The Drupal side would, when appropriate, prepare its data and push it into Elasticsearch in the format we wanted, to be able to serve out to client applications. Silex would then need only read that data, wrap it up in a proper hypermedia package, and serve it. That kept the Silex runtime as small as possible and allowed us to do all the heavy lifting and business rules in Drupal.
Elasticsearch Versus Apache Solr
Elasticsearch is an open source search server built on the same Lucene engine as Apache Solr. Elasticsearch, however, is much easier to set up than Solr, in part because it is semi-schemaless: defining a schema is optional unless you need specific mapping logic, and mappings can be defined and changed without restarting the server. It also has a very approachable JSON-based REST API, and setting up replication is surprisingly easy.
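That approachability is easy to demonstrate. The sketch below, which uses the Guzzle HTTP client (modern 6+ syntax) and an index layout of our own invention rather than the project's, indexes a document with no pre-existing schema and queries it right back:

```php
<?php
// Illustrative only: the index name and fields are not the project's.
require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;

$es = new Client(array('base_uri' => 'http://localhost:9200'));

// Index a document. No schema has to exist beforehand.
$es->put('/catalog/program/123', array('json' => array(
  'title'    => 'Batman Begins',
  'synopsis' => 'A young Bruce Wayne...',
  'rating'   => 'PG-13',
)));

// Run a full-text query against it via the same JSON REST API.
$response = $es->post('/catalog/program/_search', array('json' => array(
  'query' => array('match' => array('title' => 'batman')),
)));

$results = json_decode((string) $response->getBody(), TRUE);
$hits = $results['hits']['hits'];
```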
While Solr has better turnkey Drupal integration, we found Elasticsearch much easier to use for custom development. As time goes on, we hope to see Elasticsearch become more widely used in the Drupal community, since there is enormous potential for automation and performance benefits from it. (See http://wdog.it/4/2/es for more information.)
The Workflow
With three different data models to deal with (the incoming data, the model in Drupal, and the client API model), we needed one to be definitive. Drupal was the natural choice to be the canonical Owner Of All The Things. Our data model consisted of three key content types:
- Program: An individual record, such as Batman Begins or Episode #3 of Cosmos. Most of the useful metadata is on a Program, such as the title, synopsis, cast list, rating, and so on.
- Offer: A sellable object; users buy Offers, which refer to one or more Programs.
- Asset: A wrapper for the actual video file, which was stored not in Drupal but in the client's digital asset management system.
We also had two types of curated Collection, which were aggregations of Programs created from whole cloth in Drupal by content editors.
Incoming data from the client's external systems is POSTed to Drupal, REST-style, as XML strings. A custom importer takes that data and mutates it into a series of Drupal nodes, typically one each of a Program, Offer, and Asset. We considered the Migrate and Feeds modules, but both assume a Drupal-triggered import and their pipelines were over-engineered for our case. Instead, we built a simple import mapper using PHP 5.3's support for anonymous functions. The end result was a few very short, very straightforward classes that could transform the incoming XML files into multiple Drupal nodes. (For more on how such an importer works, see my presentation from DrupalCon Austin on functional PHP; it's one of the main examples used.)
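To give the flavor of the idea, here is a heavily condensed sketch of a closure-based mapper. The field and function names are illustrative, not the production code:

```php
<?php
// Each target field gets an anonymous function that knows how to
// extract its value from the incoming XML document.
$map = array(
  'title'          => function (SimpleXMLElement $xml) { return (string) $xml->title; },
  'field_synopsis' => function (SimpleXMLElement $xml) { return (string) $xml->description; },
);

function example_map_program(SimpleXMLElement $xml, array $map) {
  $node = new stdClass();
  $node->type = 'program';
  node_object_prepare($node);  // Drupal 7 API: fill in type defaults.

  foreach ($map as $field => $extract) {
    if ($field == 'title') {
      $node->title = $extract($xml);
    }
    else {
      // Drupal 7 fields need the language/delta structure.
      $node->{$field}[LANGUAGE_NONE][0]['value'] = $extract($xml);
    }
  }
  return $node;
}

// Usage: $node = example_map_program($xml, $map); node_save($node);
```

Adding support for a new incoming field is then a one-line change: another closure in the map.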
Once the data is in Drupal, content editing is much as one would expect: a few fields, some entity reference relationships, add salt to taste. As it was only an administrator-facing system, we leveraged the default Seven theme for the whole site. The only significant divergence from “normal” Drupal was that the client wanted to allow editing and saving of only parts of a node, so we split the edit screen into several smaller ones. That was a challenge, but we were able to make it work using Panels’ ability to create custom edit forms and some careful massaging of the fields that didn't play nicely with that approach. This is an area where Drupal 8’s new support for form modes will make such custom interfaces far easier.
Publication rules for content were, as noted, quite complex: content was publicly available only during selected windows, and those windows depended on the relationships between nodes. Offers and Assets had their own separate availability windows, and a Program would be available only if an Offer or Asset said it should be; but if the Offer and the Asset disagreed... it got complicated. In the end, we built most of the publication rules into a series of custom functions fired on cron that would, ultimately, simply publish or unpublish a node.
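Conceptually, it boiled down to something like the following Drupal 7 hook_cron() sketch, where the two helper functions are hypothetical and stand in for the genuinely complicated window logic:

```php
<?php
/**
 * Implements hook_cron().
 *
 * Simplified sketch; example_programs_needing_review() and
 * example_window_is_open() are illustrative placeholders.
 */
function example_cron() {
  foreach (example_programs_needing_review() as $nid) {
    $node = node_load($nid);
    $public = example_window_is_open($node, REQUEST_TIME);

    // Only touch the node if its published state actually changes.
    if ((bool) $node->status != $public) {
      $node->status = $public ? NODE_PUBLISHED : NODE_NOT_PUBLISHED;
      node_save($node);
    }
  }
}
```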
On node save, we either wrote a node to our Elasticsearch server (if it was published) or deleted it from the server (if unpublished); Elasticsearch handles updating an existing record or deleting a non-existent record without issue. Before writing out the node, though, we mangled it a great deal. We needed to clean up a lot of the content, restructure it, merge fields, remove irrelevant fields, and so on. All of that was done on the fly, when writing the nodes out to Elasticsearch.
Actually, that's not entirely true. For performance reasons, and to avoid race conditions around accessing nodes during node save, we deferred the actual processing to Drupal's queue system. That kept the user interface fast and responsive while the heavy lifting happened in the background.
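In outline, the save-and-index pipeline looked something like the following sketch, using Drupal 7's queue API; the module, queue, and helper names are illustrative:

```php
<?php
/**
 * Implements hook_node_update().
 *
 * On save, we only enqueue the node ID; the slow work happens later.
 * (A matching hook_node_insert() would do the same.)
 */
function example_node_update($node) {
  DrupalQueue::get('example_es_index')->createItem($node->nid);
}

/**
 * Implements hook_cron_queue_info().
 */
function example_cron_queue_info() {
  return array(
    'example_es_index' => array(
      'worker callback' => 'example_index_node',
      'time' => 60,
    ),
  );
}

/**
 * Queue worker: push published nodes to Elasticsearch, delete the rest.
 */
function example_index_node($nid) {
  $node = node_load($nid);
  $url = 'http://localhost:9200/catalog/program/' . $nid;

  if ($node && $node->status == NODE_PUBLISHED) {
    // example_prepare_document() stands in for all the cleanup,
    // restructuring, and merging described above.
    $document = example_prepare_document($node);
    drupal_http_request($url, array(
      'method' => 'PUT',
      'data' => drupal_json_encode($document),
      'headers' => array('Content-Type' => 'application/json'),
    ));
  }
  else {
    drupal_http_request($url, array('method' => 'DELETE'));
  }
}
```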
There was one other requirement: since the incoming data was often incomplete, we needed to also import data from RottenTomatoes.com. For that we built a two-layer system: one layer was a generic PHP package using the Guzzle library (also coming in Drupal 8) that expressed Rotten Tomatoes content as PHP objects, while the other bridged that package to create Drupal nodes populated from Rotten Tomatoes data. We then matched up Rotten Tomatoes movies and reviews with the client's source data and allowed editors to elect to use Rotten Tomatoes data in place of their own where appropriate. That data was merged in during the indexing process as well, so once data was in Elasticsearch it didn’t matter where it came from. We also exposed Critic Reviews to Elasticsearch, so that client applications could show users reviews and ratings of movies before buying.
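The layering is the important part: the generic package knows nothing about Drupal. A rough sketch of its shape, with a hypothetical endpoint and hypothetical field names, again in modern Guzzle syntax:

```php
<?php
use GuzzleHttp\Client;

// Plain value object; no Drupal dependency anywhere in this layer.
class Movie {
  public $title;
  public $criticsScore;

  public function __construct($title, $criticsScore) {
    $this->title = $title;
    $this->criticsScore = $criticsScore;
  }
}

class RottenTomatoesClient {
  protected $http;

  public function __construct(Client $http) {
    $this->http = $http;
  }

  public function movie($id) {
    // Hypothetical path and response fields; the real API differed.
    $response = $this->http->get('movies/' . $id . '.json');
    $data = json_decode((string) $response->getBody(), TRUE);
    return new Movie($data['title'], $data['ratings']['critics_score']);
  }
}
```

The Drupal bridge layer then maps Movie objects onto nodes, exactly as the XML importer does for the client's own data.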
Serving Clients
Incoming requests from client applications never hit Drupal: they only ever hit the Silex app server. The Silex app doesn't even have to do much. For the wire format we selected the Hypertext Application Language, or HAL. HAL is a very simple JSON-based hypermedia format used by Drupal 8, Zend's Apigility, and others, and is an IETF draft specification. It also has a very robust PHP library that we were able to use. Since Elasticsearch already stores and returns JSON, it was simple to map objects from Elasticsearch into HAL. The heavy lifting was just in deriving and attaching the appropriate hypermedia links and embedded values. Keyword and other search queries were simply passed through to Elasticsearch, and the results used to load the appropriate records.
Finally, we wrapped the HAL object up in Symfony's Response object, set our HTTP caching parameters and ETags, and sent the message on its way.
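Put together, a read-side controller ends up looking something like this sketch. We build the HAL envelope by hand here to show its structure (in production the HAL library did this); the lookup helper and the link relations are illustrative:

```php
<?php
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

$app->get('/programs/{id}', function (Request $request, $id) {
  $doc = example_es_lookup($id);  // Hypothetical Elasticsearch lookup.

  // HAL is just JSON with reserved "_links" (and "_embedded") keys.
  $hal = array(
    '_links' => array(
      'self'   => array('href' => '/programs/' . $id),
      'offers' => array('href' => '/programs/' . $id . '/offers'),
    ),
  ) + $doc;

  $json = json_encode($hal);

  $response = new Response($json, 200, array(
    'Content-Type' => 'application/hal+json',
  ));
  $response->setEtag(md5($json));
  $response->setPublic();
  $response->setMaxAge(300);

  // Turns the response into an empty 304 if the client's ETag matches.
  $response->isNotModified($request);

  return $response;
});
```

Those last few lines are what make the Varnish layer described below so effective: the cache lifetime and ETag travel with every response.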
A big advantage of the split architecture is that spinning up a new Silex instance is easy. There is no meaningful configuration beyond identifying the Elasticsearch server to use, and most of the code is pulled down via Composer. (See my article, “Composer,” in Drupal Watchdog 3.02, September 2013.) That means spinning up multiple instances of the API server for redundancy, high availability, or performance is virtually no work. And, as a welcome benefit, the API is read-only, so with correct use of HTTP headers and a basic Varnish server in front of it, the API is surprisingly snappy.
Would We Do it Again?
A big part of Drupal's maturity is realizing that Drupal is not the be-all end-all solution to every problem. It can be a great CMS, but that doesn't mean it's the best tool for every part of the platform. In our case it was great for managing content, but not for serving a web API. Fortunately, knowledge of the upcoming Drupal 8 release, and its reliance on the Symfony pipeline, let us pair Drupal with Silex, which is great for serving a web API but not all that hot for managing and curating content. The right tool for the job, and all that...
In the end, the customer had a robust and reliable API that is able to serve client applications we never even touched ourselves; Ooyala was able to serve video streams intelligently and efficiently; end users have a fast and responsive web service powering their media applications; and we had the opportunity to get our hands dirty with another member of the Symfony family, an investment that will pay off long-term with Drupal 8 and the growing popularity of Symfony within the PHP ecosystem.
That's the value of getting off the island.
Image: "Hug Me" by smohundro is licensed under CC BY-NC-SA 2.0