Technical Details On The Recent Civic Platform Outage
Format blatantly lifted from a Mozilla post about why Firefox broke. Questions welcome!
Recently, civicplatform.org had an incident where the entire website stopped working. This was due to an error on our end: we published a version 4.0.0 of the @hackoregon/component-library package, and made related changes to project packages that relied on it, without realizing that our project packages were relying on an earlier version. Now that we’ve fixed the problem, I wanted to walk through the details of what happened, why, and how we repaired it.
Background: civic mono-repo and @hackoregon/component-library
Although you wouldn’t know it when you visit civicplatform.org, there’s a lot of complexity under the hood. The civic mono-repo has 25 packages, all of which live in the same repository. Some of these are project packages, for example 2018-disaster-resilience. Others are convenience packages, such as civic-babel-presets. The 2018 package wraps all of these pieces up into the website that delivers civicplatform.org. One package is special, though: our component-library. It contains our common UI components and data visualizations, is used across all project packages, and is also depended on by external projects such as openelections.
You can check the components in our Storybook!
We published the first version of @hackoregon/component-library to NPM over two years ago, and in April we started publishing a ci version as well (#444). This has mostly been working great, with ci versions published whenever a change to the component-library package is merged to our master branch.
Yarn workspaces and symlinked dependencies
Normally, all packages in the mono-repo are kept up to date with each other using a feature called yarn workspaces.
You can find some relevant details on how this works in yarn’s documentation, quoted below, which describes a simple mono-repo with two workspaces, workspace-a, and workspace-b, where workspace-b has a dependency on workspace-a.
In order to keep packages in sync, workspace dependencies are symlinked, rather than installed via npm:
“Please note the fact that /workspace-a is aliased as /node_modules/workspace-a via a symlink. That’s the trick that allows you to require the package as if it was a normal one!”
However, there is an important detail (cue ominous music) about the versioning of those dependencies:
“In the example above, if workspace-b depends on a different version than the one referenced in workspace-a’s package.json, the dependency will be installed from npm rather than linked from your local filesystem. This is because some packages actually need to use the previous versions in order to build the new ones (Babel is one of them).”
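To make the rule quoted above concrete, here is a minimal sketch of the link-or-install decision. It is a simplified model, not yarn's actual implementation: the function names are made up for illustration, and the range check only understands exact versions and caret ranges with a non-zero major version (no prerelease tags such as ci builds).

```javascript
// Simplified model of the workspace rule quoted above; not yarn's real code.
// Only exact versions ("4.0.0") and caret ranges ("^3.0.0") with a
// non-zero major version are handled.

function parseVersion(version) {
  const [major, minor, patch] = version.split('.').map(Number);
  return { major, minor, patch };
}

// True when `version` falls inside `range`.
function satisfies(version, range) {
  const v = parseVersion(version);
  const isCaret = range.startsWith('^');
  const r = parseVersion(isCaret ? range.slice(1) : range);
  if (!isCaret) {
    return v.major === r.major && v.minor === r.minor && v.patch === r.patch;
  }
  // Caret range: same major version, and at least the stated minor/patch.
  if (v.major !== r.major) return false;
  if (v.minor !== r.minor) return v.minor > r.minor;
  return v.patch >= r.patch;
}

// Decide where a workspace dependency comes from (hypothetical helper).
function resolveWorkspaceDependency(localVersion, requestedRange) {
  return satisfies(localVersion, requestedRange) ? 'symlink' : 'npm';
}

// workspace-a is at 1.0.0 locally and workspace-b asks for ^1.0.0: linked.
console.log(resolveWorkspaceDependency('1.0.0', '^1.0.0')); // symlink
// workspace-a is bumped to 2.0.0 but workspace-b still asks for ^1.0.0.
console.log(resolveWorkspaceDependency('2.0.0', '^1.0.0')); // npm
```

The second call is the case that matters here: once the local version moves outside the requested range, a fresh install pulls the older published package instead of the local code.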
The lead up to the explosion
On August 13th, we noticed that the ci version of the component-library hadn’t been published in a few weeks (#762). It’s still unclear what exactly caused this, but setting our lerna mode to independent resolved it (#764). That fix caused some funkiness in the version number of our ci version (#806): normally version numbers don’t go down. To resolve the version number funkiness, and because there had been breaking changes, we published a version 4.0.0.
At this point, we’re still in the clear. Local development is going great, we’re making lots of fun changes, including implementing theming for the component-library! Things look great in our Storybook.
We’ve been testing these changes in the project packages, and in the 2018 package by running things locally. However, under the cover of fun colors and comic sans, something has gone wrong!
Things go boom
http://civicplatform.org/ <<< TypeError: zS.VisualizationColors is undefined
At 6:25 pm, PST, on August 22nd, we noticed that civicplatform.org was totally down. We don’t know the number of affected users, but it was probably at least in the tens.
Our team of CIVIC heroes rushed to the scene. The proximate cause was a console error -- zS.VisualizationColors is undefined. Because we have automated testing, and our local development environment was unaffected, the underlying issue was unclear.
We started by attempting to replicate the error locally. We tried running the production version of the 2018 package using yarn start:prod. That worked fine. With our recent publish of version 4 of the component library, we had a suspicion that something could be amiss with our dependencies. By running yarn clean to remove all node_modules, then running the steps used in our .travis.yml file, we were able to replicate the error.
The issue, in a nutshell 🥜
When we published 4.0.0, the version in the package.json in component-library was updated. However, our other packages had a dependency on "@hackoregon/component-library": "^3.0.0". Things kept working locally, because yarn maintained the symlinks it had created between the workspace dependencies in node_modules. However, that doesn’t work when you install from scratch. To paraphrase yarn’s documentation: if 2018 depends on a different version (^3.0.0) than the one referenced in component-library’s package.json (4.0.0), the dependency will be installed from npm rather than linked from your local filesystem.
The fix, in a nutshell 🌰
We upgraded all of our internal package dependencies on the component-library to version 4.0.0 (#832) at 7:31 pm, PST. The actual time of the outage is unknown (we should really have some more monitoring…), but it took us about an hour from when we noticed it to being back online.
How to prevent this issue in the future:
When publishing a major version of the component-library, verify that all component-library dependencies in the civic mono-repo are updated before the component-library package.json file is updated ✨. We’ve created an issue to improve our publishing workflow documentation (#835).
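One way this verification could be automated is sketched below. This is a hypothetical check, not something that exists in the civic repo: the function names are invented, the manifests are in-memory stand-ins for each package's package.json, and the range check is simplified (exact versions and caret ranges with a non-zero major version only, and only the dependencies field is inspected).

```javascript
// Hypothetical pre-publish check; the function names and the manifest
// shape are illustrative, not part of the civic repo's tooling.

// True when `range` (exact version or caret range) accepts `version`.
// Simplified: non-zero major versions only, no prerelease tags.
function accepts(range, version) {
  const toParts = (text) => text.split('.').map(Number);
  const isCaret = range.startsWith('^');
  const [rMajor, rMinor, rPatch] = toParts(isCaret ? range.slice(1) : range);
  const [vMajor, vMinor, vPatch] = toParts(version);
  if (!isCaret) {
    return vMajor === rMajor && vMinor === rMinor && vPatch === rPatch;
  }
  if (vMajor !== rMajor) return false;
  if (vMinor !== rMinor) return vMinor > rMinor;
  return vPatch >= rPatch;
}

// Return the names of packages whose declared range on `dependencyName`
// would not accept `nextVersion`.
function findStaleDependents(manifests, dependencyName, nextVersion) {
  return manifests
    .filter((manifest) => {
      const range = (manifest.dependencies || {})[dependencyName];
      return range !== undefined && !accepts(range, nextVersion);
    })
    .map((manifest) => manifest.name);
}

// Example manifests standing in for each package's package.json.
const manifests = [
  { name: '2018', dependencies: { '@hackoregon/component-library': '^3.0.0' } },
  { name: '2018-disaster-resilience', dependencies: { '@hackoregon/component-library': '^4.0.0' } },
  { name: 'civic-babel-presets', dependencies: {} },
];

const stale = findStaleDependents(manifests, '@hackoregon/component-library', '4.0.0');
console.log(stale); // [ '2018' ]
```

A check along these lines could run before the component-library version is bumped and fail the publish if the returned list is not empty.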