Content Scaling with Drupal

Nov 06 '15

Drupal is a state-of-the-art content management platform, but like any CMS, it runs into problems when pushed to its storage and traffic limits. A good example is a high-traffic news website that publishes 50 to 100 or more stories per day.

A typical modern news website obtains its content from a variety of sources. There are usually staff writers, plus a mix of paid content imported through feeds (AP, Reuters, etc.). Most news agencies have their editorial staff write in in-house software, so staff content is typically imported into Drupal just like the paid feeds. Since news is constantly being revised (journalistic mistakes, redacted information), these importers need to perform the dual role of creator and reviser.

While there are many ways to optimize a Drupal website under stress, here are three big wins that can dramatically reduce server load on a high-octane content website.

  1. Micromanage caching
  2. Divorce feeds/imports from Drupal cron
  3. Remove old revisions from the database

Micromanage caching

The old adage “The newswatch never stops” couldn’t be more apt in our modern day of 24/7 media. “The early bird gets the worm,” and most news agencies take pride in being the first to break stories. Caching, however, even if it’s just a matter of minutes, can drive traffic to competitors who are breaking the same stories elsewhere.

No Drupal website can sustain heavy loads without caching, but not all pages need to expire with the same frequency. The homepage and section fronts need relatively short cache expirations, but article pages can be held in cache much longer. Old articles can be kept for very long periods of time since they rarely change. Setting different max ages by URL, page, or page type is one important way to leverage caching.
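As a rough illustration of the idea, here is a minimal Python sketch that picks a max age by page type and article age. The page type names, the durations, and the 30-day archive threshold are all assumptions made up for this example, not Drupal settings.

```python
from datetime import date, timedelta

# Hypothetical max ages in seconds; tune these to the site's needs.
FRONT_MAX_AGE = 60                # homepage and section fronts
ARTICLE_MAX_AGE = 15 * 60         # recent article pages
ARCHIVE_MAX_AGE = 24 * 60 * 60    # old articles that rarely change

# Assumed threshold after which an article counts as "old".
ARCHIVE_AFTER = timedelta(days=30)


def max_age_for(page_type, published=None, today=None):
    """Return a cache max age (seconds) for a page type and publish date."""
    today = today or date.today()
    if page_type in ("home", "section"):
        return FRONT_MAX_AGE
    if page_type == "article" and published is not None:
        if today - published > ARCHIVE_AFTER:
            return ARCHIVE_MAX_AGE
    # Recent articles and anything unrecognized get the article default.
    return ARTICLE_MAX_AGE


def cache_control_header(max_age):
    """Format the value of a Cache-Control response header."""
    return f"public, max-age={max_age}"
```

The point of keeping the policy in one small function is that editors and engineers can argue about three numbers instead of hunting for cache settings scattered across the site.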

For news websites, the Cache Expiration module is a godsend. It allows engineers to set expiration rules based on editors’ actions (content insert, content update, etc.). It also integrates with many common CDNs to purge edge URLs when content changes.
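To make the idea of action-based expiration concrete, here is a small Python sketch of the kind of rule involved. It is not the Cache Expiration module's API; the function name, the action names, and the choice of which paths to expire are assumptions for illustration only.

```python
def urls_to_purge(action, article_path, section_path):
    """Return the paths whose cached copies should expire after an editor action.

    Hypothetical rule: any insert or update refreshes the homepage and the
    article's section front; an update also refreshes the article page itself.
    """
    if action not in ("insert", "update"):
        return []
    paths = ["/", section_path]
    if action == "update":
        paths.append(article_path)
    return paths
```

For example, updating a story at /politics/budget-vote would expire the homepage, /politics, and the story itself, while everything else stays in cache.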

For sections of the homepage that need to be updated instantly (e.g., a breaking news banner), one good strategy is to load the content via AJAX from a backend endpoint tied to one of the content types that Cache Expiration can purge. Then, when the banner is updated, its content is purged from the edge cache right away. With most CDN providers, this can appear to happen more or less in real time.

Divorce feeds/imports from Drupal cron

Most news websites rely on paid feeds for a significant chunk of their content. What is most expedient from a business perspective creates a technical challenge: news websites have to import third-party content constantly. The required import interval can be as short as 3 minutes.

When Drupal cron runs, every contrib module that implements cron functionality gets a free pass to run its routines. Cron routines tend to be expensive, and most module developers who write them assume they will run no more than once an hour. Generic modules, in other words, are built on the assumption that cron won’t run early and often, and running it every few minutes incurs all of their overhead. Frequent news imports should really run on their own.

One approach is to create a URL endpoint that triggers imports. It works, but developers using it should be aware that it can be a DDoS vulnerability: no one wants to expose a URL that, if hit repeatedly, could tie up server resources. At a minimum the endpoint should require auth headers, but I wouldn’t recommend this approach.

A more secure approach is to use crontab on a remote machine that has Drush installed and can reach the site through a Drush alias. This allows granular, time-based control of imports without triggering every other module’s cron routines, leaving more memory for editors and visitors to the website.
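With imports firing every few minutes, one practical concern is a slow run overlapping the next one. The Python sketch below is one way a crontab entry could wrap the import command so overlapping runs are skipped. The lock path, the script name, and the Drush alias and command in the comment are placeholders, and the locking itself is my addition rather than anything Drush provides.

```python
import fcntl
import subprocess
import sys

# Hypothetical lock file location; any path writable by the cron user works.
LOCK_PATH = "/tmp/news_import.lock"


def run_exclusive(command, lock_path=LOCK_PATH):
    """Run command unless a previous run still holds the lock.

    Returns the command's exit code, or None if the run was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.run(command).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Placeholder usage from crontab (alias and command are hypothetical):
    #   python run_import.py drush @example-alias <import-command>
    exit_code = run_exclusive(sys.argv[1:])
    sys.exit(0 if exit_code is None else exit_code)
```

A crontab entry on the remote machine could then call this wrapper on whatever schedule the feed requires, and a run that finds the lock taken simply exits without doing any work.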

Remove old revisions from the database

Revisions are generally not a concern on websites where content is published and forgotten. On a news website, however, content is under much greater scrutiny. Seldom is it the case that something isn’t in need of adjustment. Maybe an image needs to be cropped differently, or the headline or body text of an article needs to be revised.

In Drupal, revisions are rather brute-force. On content types with revisions, an edit to a single field causes every field on the article to be re-saved. An extra row is created in the revision table for each field. If this happens often, revision tables tend to balloon, and this is a leading cause of Drupal database bloat.

To put this in perspective, let’s imagine a Drupal website that has accumulated a ton of content over many years. Let’s say 75+ articles were published a day over a 5-year period (so around 150,000 nodes). Let’s also assume each node was revised or edited 4-5 times (alterations needed for placement on the website or for content). What we end up with are revision tables with roughly 700,000 rows for each field! With a modest 15 fields on a content type, that is about 10,500,000 revision rows in total (700,000 × 15) for the database to store and sift through. Since loading a Drupal article or node queries these revision tables, we have a MySQL performance bottleneck on our hands.
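For anyone who wants to plug in their own numbers, here is a back-of-the-envelope version of this estimate in Python. The function name is made up for the example, and the model is deliberately crude: one revision row per field per edit.

```python
def revision_rows(nodes, edits_per_node, fields):
    """Estimate revision rows: one row per field for every edit of every node."""
    rows_per_field = nodes * edits_per_node
    total_rows = rows_per_field * fields
    return rows_per_field, total_rows


# 75 articles a day for 5 years, before rounding up for the "+" in "75+".
nodes_exact = 75 * 365 * 5  # 136,875

# Rounded inputs from the scenario: 150,000 nodes, 4-5 edits each, 15 fields.
low = revision_rows(150_000, 4, 15)   # (600000, 9000000)
high = revision_rows(150_000, 5, 15)  # (750000, 11250000)
```

With these rounded inputs the total lands somewhere between 9 and 11.25 million revision rows, which is the order of magnitude that matters here.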

The best solution, though perhaps not an entirely practical one, is to eliminate revisions on import-based news websites. This probably won’t work for most news agencies. A good compromise is to keep revisions only for a certain period of time (e.g., 14 days). This keeps the revisions around for as long as they are relevant. It might be helpful to know every edit to a piece for the two to three weeks that it is actively published, but afterwards stick to the final version and lose the history.
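The retention rule itself is simple enough to sketch in a few lines of Python. This is only the selection logic, under the assumption that each revision is known by an ID and a timestamp; the names are hypothetical, and actually deleting revisions should go through Drupal rather than raw SQL.

```python
from datetime import datetime, timedelta


def revisions_to_delete(revisions, current_vid, now, keep_days=14):
    """Pick revision IDs older than the retention window.

    revisions is a list of (vid, timestamp) pairs for one node.
    The current revision is never selected, no matter how old it is.
    """
    cutoff = now - timedelta(days=keep_days)
    return [
        vid
        for vid, saved_at in revisions
        if vid != current_vid and saved_at < cutoff
    ]
```

Always sparing the current revision matters for old articles: a story last edited a year ago still needs its final version, even though every revision of it is past the cutoff.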

In closing, the three performance bottlenecks covered here were chosen because they are common and among the easiest to fix. They are not necessarily intuitive and are easy to overlook. There are myriad other enhancements to make, but these are very common pitfalls. I appreciate any comments below.