Last night we open-sourced HCB, taking a seven-year old codebase from private to public has been long journey. Check it out and star it here: github.com/hackclub/hcb.

Here’s how we did it.

Scrubbing

HCB began as a private repository with the intention of open sourcing one day. With it being private, the repository collected amounts of customer data / personally identifiable data (PII). Over time, with bug reports, etc., we posted highly sensitive information across GitHub issues, pull requests, and commits.

We had nearly 5,000 issues & PRs which could potentially contain this information. We attempted to automate this process with Regex, but it was difficult to cover edge cases such as PII within images. Our solution was to hire a team of “scrubbers” who would go through the repository and flag any sensitive content. We assigned every scrubber a block of hundred issues / PRs; once finished they could claim a new block of issues and PRs. Within fifteen days - the entire repository's Issues and Pull Requests were scrubbed of customer data and PII.

The scrubbers ended up identifying 170 instances of sensitive information in the repository. I then took that list, updated the GitHub comments to remove those instances, and reached out to GitHub support (thank you Fred!) to remove the images from GitHub servers and the edit history from their database.

Credentials w/ Doppler

Our next concern was our credentials. Don’t worry, we hadn’t been storing our .env in the repository. We were using Rails’ encrypted credentials system. It’s decent, however, we were concerned that if our key leaked, everything would be leaked with no option for remediation. With that in mind, we transitioned to storing our credentials with Doppler and built a custom module to load the credentials into the Rails processes' environment. Some benefits of using Doppler: more flexibility with environments (we have developers with different levels of access), audit logs, and convenient changes (without refreshing GitHub branches). The module also supported overriding Doppler variables for specific environments using an ignored .env. For developers without Doppler access, this is also the system for setting environment variables.

Git rewrite

We didn’t commit our .env file but there were some files with secrets exposed in them. To identify these secrets we used git log on a couple of strings we’d identified as critical (eg. bank account numbers):

git log -S [SECRET]

We also used grab/secret-scanner and trufflesecurity/trufflehog to scan for any other potential secrets in our codebase.

After identifying these files / strings, we used git-filter-repo to remove them. We ran these commands in a GitHub codespace to have as clean of a setup as possible.

To remove an entire file we used:

python3 git-filter-repo.py --sensitive-data-removal --path [file_path] --invert-paths

And to redact a string we used:

python3 git-filter-repo.py --sensitive-data-removal --replace-text <(echo 'STRING==>REDACTED') --force

We then force pushed to our repository. For this to have any meaning, we had to unfortunately delete all forks of the repository and hard reset all of our local environments.

Lastly, we worked with GitHub support to remove the cached commits! Thank you Eden.

As a side effect of this many of our PRs look a little funny:

Security discussions

We had a lot of back and forward conversations about security during this process. We’ve implemented a couple of safeguards during this process but the general consensus walking away from conversations with security firms/experts was that we were risking very little by open sourcing, and long-term, our codebase would be more secure.

Documentation

Lastly, we wanted to make sure that our codebase would be accessible to new developers. We worked on making the setup process smoother w/ Codespaces, wrote guides to the key parts of HCB (for example, this guide on the transaction mapping engine), and I gave a talk at SF Ruby with a brief overview of our codebase:

There’s still more work to do! Reach out if you wish a part of HCB was documented better (hcb-engr@hackclub.com). And, if you’re a teenager interested in helping out, join us in the #hcb-dev channel on the Hack Club Slack.

Years in the making - this is a proud moment for the whole team! I’ll leave you with this gource video of the codebase over the years.