Crunchy Postgres via Automation: What’s new in V2

As discussed last week, I’m back today to highlight the major differences between the 1.x and 2.x lines of Crunchy Postgres via Automation (CPA).

CPA 1.x

When we initially designed and launched the 1.x line of CPA, we attempted to leverage the OS components as much as possible to reduce our internal support burden and ship things faster. This meant that CPA would deploy the Crunchy Certified Postgres repository but install packages from both it and the normally configured OS repositories. All the ‘core’ Crunchy Postgres components would come from us, but most other things, like etcd or HAProxy, would come from the system itself.

As you can imagine, this quickly led to issues:

  • Each of our supported OS platforms shipped with different versions of these components
  • Each of our supported OS platforms shipped new releases on different schedules
  • All of our customers have different schedules for updating their OS

Additionally, we ran into issues where:

  • Some customers don’t allow their nodes to access the Internet
  • Some customers use internal mirrors of repositories and their refresh schedule is unknown to us

I won’t bore you with all the details of the pain these issues caused, but it was bad enough that we worked with our internal Build team to create ‘dummy’ packages (.rpm and .deb) that we called ‘crunchylocks’. These packages were empty but had very tight dependencies. For example, the ‘crunchylock’ package for etcd might have had a dependency of 3.4.3 <= etcd <= 3.4.8, which would, in theory, limit etcd to known working versions. These lock packages worked, mostly, but became a maintenance burden in and of themselves. And when they didn’t work, for example when using the ‘--best’ option for dnf while upgrading CPA, they blew up spectacularly.

Speaking of CPA upgrades, they now involved selectively removing or upgrading specific ‘crunchylock’ packages, in the correct order, in lock-step with upgrading the actual components, and re-installing the existing ‘crunchylock’ packages if the upgrade failed. The upgrade code became pretty gnarly pretty quickly. Upgrading was, quite frankly, the single biggest ‘issue’ with CPA 1.x as a whole.

Additionally, the CPA 1.x line was limited to Ansible 2.9.6 compatibility for most of its life. This became a large issue for customers running Ansible Automation Platform (AAP) and a growing issue for our development efforts.

Something needed to change.

CPA 2.x

As I sat down to plan out the 2.x line in my new role as Architect, it quickly became apparent that no one was happy with the above: customers, Support, our Solution Architects, my team, and the Build team all wanted, well, something else.

So I spent time brainstorming with John, our Build team lead, and we devised a strategy to build all the necessary components in-house and ship everything we directly relied upon. We further decided, based on a suggestion from John, to use a Bill of Materials (BoM) to list exactly what we built and shipped. Each version of the BoM would have a 1:1 relationship with a CPA 2.x release, e.g. CPA 2.1.0 would be BoM version x, CPA 2.1.1 would be BoM version y, and so on. This would simplify debugging for both the Support and Solution Architect teams, since knowing the CPA version alone would also tell them the exact versions of all components.

Working with Heath, one of our Build Engineers, I crafted a BoM that contained all the information needed for CPA to download all the components directly from the Crunchy Data software portal, verify their size and checksum, and cache these files on the Ansible controller. We would then push the files out to the target nodes, removing the need for these nodes to download anything.

The BoM itself is essentially a giant YAML file containing a massive dictionary that the CPA roles load as a variable. It currently looks something like this:

crunchy_automation_bom:
    target_role_version: x.y.z
    bom_version: nnnn
    epcp_path: '/epcp/redhat/EL-8/x86_64/base/'
    ha_path: '/cpa/redhat/EL-8/x86_64/base/'
    pg_path: '/postgresql13/redhat/EL-8/x86_64/base/'
    packages:
      - { src_repo: "pg", src: "postgresql13-13.16-1Crunchy.el8.x86_64.rpm", pkg: "postgresql13", version: "13.16-1Crunchy", sha256: "0ecbacc38c691b6cacb95d112e1e3d770ac7ca2f3f822cd6d7ad72a0b1a088c4", size: "1531568" }

Between the path fields and the packages entries, we have all the information needed to construct a URL for retrieving each package from our software portal. Well, all the information except the customer’s portal credentials, which we gather elsewhere in the code 🤐. Obviously, there are many more entries in the packages sub-dictionary (almost 90).

We download these packages on the Ansible controller, verify their checksum to ensure they haven’t been tampered with and that no bits were munged in transit, and store them in a cache directory on the controller. Our code is smart enough to check the cache directory first and will only download files that are missing or that fail their checksum. This means you can even use an Ansible controller that doesn’t have Internet access: simply pre-populate the cache directory with a tarball obtained from us (an oft-requested feature).
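
If you want a feel for how little magic is involved, here’s a minimal sketch of that caching step. The variable names (crunchy_portal_base, cpa_cache_dir, and the credential pair) are placeholders for this post rather than the real ones, and get_url itself handles the ‘skip it if the cached file already matches the checksum’ part:

# Illustrative sketch only; variable names are placeholders.
# get_url skips the download when dest already matches the checksum.
- name: Cache BoM packages on the Ansible controller
  ansible.builtin.get_url:
    url: "{{ crunchy_portal_base }}{{ crunchy_automation_bom[item.src_repo ~ '_path'] }}{{ item.src }}"
    dest: "{{ cpa_cache_dir }}/{{ item.src }}"
    checksum: "sha256:{{ item.sha256 }}"
    url_username: "{{ portal_username }}"
    url_password: "{{ portal_password }}"
  loop: "{{ crunchy_automation_bom.packages }}"
  delegate_to: localhost
  run_once: true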

Once everything has been verified to be in the cache directory, our code synchronizes the cache directory to every target node in the inventory and then creates a software repository (a .repo or .list) that points to the cache on those nodes. At that point, we can freely install our shipped packages without an Internet connection, and because our repo is 100% static, customers can freely update the OS (dnf update/apt upgrade) on these nodes without fear of the CPA components changing. It’s a total win.
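
On the RPM side, the shape of that step is roughly the sketch below; the destination path and repo name are invented for this example, and it assumes createrepo_c is available on the node:

# Illustrative RPM-side sketch; paths and names are placeholders.
- name: Push the cached packages out to the target node
  ansible.builtin.copy:
    src: "{{ cpa_cache_dir }}/"
    dest: /opt/crunchy/cpa-packages/

- name: Generate repository metadata over the copied packages
  ansible.builtin.command: createrepo_c /opt/crunchy/cpa-packages
  changed_when: true

- name: Drop a static .repo file pointing at the local directory
  ansible.builtin.yum_repository:
    name: crunchy-cpa-local
    description: Crunchy CPA packages (local, static)
    baseurl: file:///opt/crunchy/cpa-packages
    gpgcheck: false
    enabled: true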

We also spent a non-trivial amount of time refactoring the code to leave Ansible 2.9.6 behind like the relic it is. The code now works and is tested with Ansible 2.12.10 - 2.16.x (with the upcoming CPA 2.2.0 adding support for Ansible 2.17.x!). Along with the Ansible upgrade, we now work with recent releases of the community.postgresql collection, which has let us refactor even more of our code and optimize things. All in all, things are fairly good on the Ansible front now.
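
I’m not going to claim our guard rails look exactly like this, but if you’re pinning roles to a version range like that yourself, an assert along these lines does the job:

# Illustrative only: fail fast when ansible-core is outside the tested range.
- name: Check for a supported ansible-core version
  ansible.builtin.assert:
    that:
      - ansible_version.full is version('2.12.10', '>=')
      - ansible_version.full is version('2.17.0', '<')
    fail_msg: "Tested with ansible-core 2.12.10 - 2.16.x, found {{ ansible_version.full }}"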

Outro

There are a ton of other changes between the CPA 1.x and 2.x lines, and we’ll cover some of them in future posts, but for now, those are the most significant. We have a lot planned for the 2.x line, and have even started queuing up ideas for an eventual 3.x line.

Next week, we’ll talk about team structure and assigned duties. I think you’ll be surprised what we’re doing with the resources we have.

:wq

