How we migrated the Wonder codebase to a monorepo in a snap

Jules Terrien
Published in Wonder Engineering
Jan 7, 2019 · 7 min read

At Wonder, we write code that helps our users get high-quality research on any topic in a matter of hours. Our web app is split between a repo dedicated to our API and workers, and a few repos for frontend code. While this setup worked for several years, it presented challenges which we recently solved by migrating all code into a monorepo.

This migration took about a week to research and test, and was executed in several hours with no impact on our dev teams. Along the way, we learned a few lessons about how to migrate code efficiently and safely, and how to re-wire our Heroku & CircleCI deploy toolset. If you’re considering joining the monorepo bandwagon, read on 🤓

Why a monorepo

There’s a lot of discussion about how code should be organized, from monoliths to multiple repos to monorepos. A simple Google search for “monolith monorepo” returns thousands of results, most of them blog posts debating the pros and cons of each approach. At Wonder, having code split between independent repos caused a specific set of problems.

First, it made sharing code difficult. Over time, this let different conventions spread across the codebase which, in turn, made the code harder to discover and work on.

Git workflows were also unnecessarily complex. For any feature spanning multiple repos, a separate PR had to be opened in each repo. This made PR reviews harder, as developers had to constantly synchronize to figure out what code to pull and, ultimately, what order to follow when merging to master and deploying.

As we started to invest more heavily in end-to-end (e2e) tests, our multi-repo setup highlighted a few other challenges. We wanted our e2e test suite to live close to our backend code so we could leverage a growing set of scripts which helped set up database state.

Putting our e2e test suite inside our backend repo didn’t feel right either, as a developer would also need to refer to client code often while writing tests. To encourage our teams to write these tests, we wanted minimal friction: code from the various apps should be easy to find and refer to, and git workflows should be simplified to one branch per feature.

Increasingly, our ideal end state pointed to having all code in a monorepo.

Migrating Code

http://npmjs.com/package/lerna

The first challenge that came to mind was how to migrate code without losing history. Because all of our code is in JavaScript, we started researching tools built for this purpose and landed on Lerna. Used by the likes of Babel and Jest, Lerna is well tested in the wild and offers a simple API to migrate code, manage package dependencies and publish new package versions. It also helped that a few of our developers had already used the tool.
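
For context, here’s a quick sketch of the core Lerna commands (this post only ends up using init and import; exact flags and behavior depend on your Lerna version):

npx lerna init         # scaffold lerna.json and a packages/ folder
npx lerna bootstrap    # install dependencies and link local packages together
npx lerna publish      # version and publish packages that have changed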

After creating a new repo to serve as the monorepo’s home and initializing it with Lerna (by simply running lerna init), we wrote a simple shell script which uses Lerna to import code from each repo into a package in the monorepo. Running this script took about 10–15 minutes for ~20,000 commits.

#!/usr/bin/env bash
# Echo commands & fail on error
set -e -x

# Repos to import (assumed to be checked out as siblings of the monorepo)
repos=(<repo-1> <repo-2>)

# Import repos as packages
for repo in "${repos[@]}"; do
  # make sure the local copy of each repo is up to date
  cd "../$repo"
  git pull
  cd -
  npx lerna import "../$repo" --yes --flatten --max-buffer 100000000
done

One drawback of this approach is that the --flatten flag produces a “flat” commit history: each merge commit is applied as the single change it introduced, which means the commit count in your new monorepo will be lower than the combined counts of the previous repos. There are other options to merge trees if you don’t want to flatten your history.
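
We were fine with a flattened history, but for reference, a git subtree merge is one way to keep each repo’s full history, joined into the monorepo via a merge commit (the repo and package names below are placeholders):

# placeholder names; we used Lerna's --flatten import instead
git remote add repo1 ../repo1
git fetch repo1
git subtree add --prefix=packages/repo1 repo1 master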

After running this script, we had our monorepo set up and structured:

monorepo/
  package.json
  lerna.json
  node_modules/
  packages/
    repo1/
      package.json
    repo2/
      package.json
    etc.

Patching migrated code

Once code was migrated, we needed to update various files such as Dockerfiles, Procfiles, CircleCI configs and other miscellaneous files such as .gitignore. These had previously lived inside each repo and now had to be edited or re-organized to match the new monorepo structure.

Git provides a useful mechanism to make edits programmatically using .patch files, which are simply the output of a git diff and can be applied quickly using git apply (learn more here).
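
For example (file names here are hypothetical), the flow is simply diff → .patch file → apply:

git diff > patches/update-circleci-config.patch           # capture local edits as a patch
git apply --check patches/update-circleci-config.patch    # dry run: verify it applies cleanly
git apply patches/update-circleci-config.patch            # actually apply it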

It’s worth noting that, to be applied successfully, the metadata in each diff (specifically the file paths and line ranges) needs to match the diffed files at the moment the patch is applied. We ran into a few cases where our patch files had been created before new commits were pushed to the relevant files, which caused the patch application to fail. We then had to either re-create the patch files or edit them to match the newly changed files. You can learn more about how to read a diff output here.
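
Concretely, each hunk in a .patch file carries the paths of the files it touches and the line ranges it expects to find, and git apply refuses the hunk when those ranges no longer line up. The excerpt below is made up, just to show the shape of that metadata:

--- a/packages/repo1/Dockerfile
+++ b/packages/repo1/Dockerfile
@@ -4,6 +4,7 @@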

Once our patch files were ready, applying them was a piece of 🎂:

git apply patches/*
git add .
git commit -m "Apply patches for monorepo setup"
git rm -r patches
git commit -m "Remove monorepo patches"

Migrating open branches

The last lines of code we needed to migrate lived in the various PRs that were still open at the moment we made the transition. Here again, we wanted the process to be as easy as possible for our dev teams and wrote a bash script to that effect, reducing these migrations to a one-liner:

$ bash migrate_branch.sh <repo-name> <branch-name>

This script first ensured that the provided repo name could be mapped to an existing remote:

#!/usr/bin/env bash
repo=$1
targetBranch=$2   # branch to migrate (second CLI argument)
remote=
if [ "$repo" = "<repo-name>" ]; then
  remote=<remote-name>
elif [ "$repo" = "<other-repo-name>" ]; then
  remote=<other-remote-name>
fi
remoteURL=git@github.com:askwonder/$remote.git
git remote add $remote $remoteURL

It then created a patch file by diff’ing the relevant package in the current monorepo with the edited code in the remote branch, and applied that patch:

git fetch $remote
git diff $remote/master...$remote/$targetBranch > PatchFile.patch
echo "Patch file created for branch $targetBranch"
git apply --no-index --directory packages/$repo PatchFile.patch
echo "Patch applied"
rm PatchFile.patch
echo "Your PR has been applied! Please commit and open a new PR."
# clean up
git remote remove $remote

--no-index is undocumented for git apply but you can find a similar explanation for it here.
--directory is what tells git where to apply the relevant code in the new monorepo tree structure.

Getting CircleCI to work for our monorepo

Moving to a monorepo meant we had to change our CircleCI configs: instead of having a .circleci/config.yml per repo, we now needed a single config file which could define builds for each package. Ideally, we also wanted the status of each build to be displayed on Github independently so developers could easily figure out which build had succeeded or failed. Last, we wanted our config to only create a build if a change was detected in the relevant package, to keep our queue of Circle jobs to a minimum.

Thankfully, all of this is possible with CircleCI 2.0, using a combination of documented and undocumented features. 💪

We used Circle’s Workflows feature to define jobs for each package. Each job starts by calling a script which checks whether a change has been made in the relevant package before deciding whether to continue. Ultimately, our config looked something like this:

# .circleci/config.yml
version: 2
jobs:
  <job-1>:
    docker:
      - image: <docker-image>
    steps:
      - checkout
      - run:
          name: Check if build should run
          command: bash ./.circleci/run_tests.sh <package-name>
      - ...steps needed to build & run tests
  <job-2>:
    ...
workflows:
  version: 2
  <workflow-name>:
    jobs:
      - <job-1>
      - <job-2>

And here’s the run_tests.sh script, inspired by this thread:

#!/bin/bash
repo=$1
branch=`git rev-parse --abbrev-ref HEAD`
if [ "$branch" = "master" ]; then
  echo "On branch master. Let's run all tests!"
  eval "npm run test"
elif git diff --name-only origin/master...$branch | grep "^${repo}" ; then
  echo "Changes detected! Adding ${repo} tests to the queue..."
else
  echo "No changes detected. Exiting circle build..."
  circleci step halt
fi

With this config, developers could easily push a single branch to our monorepo remote and have Circle run builds for each package independently.

Deploying the monorepo

Once all code had been migrated and the relevant configs had been re-wired, it was time to deploy. Before the migration, each repo was deployed to its own Heroku app, and we didn’t want this migration to affect our deploy setup.

Fortunately, we found a Heroku buildpack which made deploying a monorepo to separate Heroku apps easy peasy:

#!/usr/bin/env bash
# Echo commands & fail on error
set -e -x

# Map Heroku apps to monorepo packages
declare -A apps
apps=(
  [<Heroku-app-1>]='<packages/package-1>'
  [<Heroku-app-2>]='<packages/package-2>'
)

# Check app connectivity
for app in "${!apps[@]}"; do
  heroku config -a $app > /dev/null
done

# Setup heroku
for app in "${!apps[@]}"; do
  heroku config:set -a $app "BUILDPACK=${apps[$app]}=https://github.com/heroku/heroku-buildpack-nodejs"
  heroku buildpacks:clear -a $app
  heroku buildpacks:add -a $app https://github.com/Pagedraw/heroku-buildpack-select-subdir
  heroku buildpacks:add -a $app heroku/nodejs
  heroku git:remote -a $app -r $app
done

# Deploy
branch=$(git rev-parse --abbrev-ref HEAD)
for app in "${!apps[@]}"; do
  git push -f $app $branch:master &
done
wait

# Push to Github
git push

And just like that the monorepo was live!

Join us

Want to work with the team that made this? Wonder is hiring :)
