Option to gracefully handle certificate renewal failure #104

Open
danjer2 opened this issue Feb 1, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@danjer2

danjer2 commented Feb 1, 2023

BUSINESS PROBLEM
We are using the Venafi provider to create and refresh certificates. We rely on the expiration_window parameter to renew a certificate automatically as it approaches its expiration time.

We recently had an unfortunate chain of events. During a release, the certificate was in the expiration window, and Terraform attempted to refresh it. At the perfect time (after terraform plan executed), the Venafi API became unavailable. The terraform apply failed when it tried to create the new certificate, leaving our application in a non-functional state.

We thought we could address this in the future with

lifecycle {
  create_before_destroy = true
}

but looking at the plans generated it would not help, because Terraform first unlinks the certificate. So even if the creation stops the run before destruction happens, the app is still left without a certificate.
Moreover, since the apply does not complete, it's possible that other infrastructure changes did not get applied, so the application is left in an inconsistent state.
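
As a rough illustration of the configuration being discussed, the sketch below places the lifecycle block on a certificate resource that uses expiration_window. The resource name, hostname, and window value are placeholders and are not taken from the reporter's actual setup.

```hcl
# Sketch only: the name "app", the hostname, and the window are assumptions.
resource "venafi_certificate" "app" {
  common_name = "app.example.com"

  # Request a replacement once the certificate is this close to expiry
  # (the provider documents this value in hours; 720 is an arbitrary example).
  expiration_window = 720

  lifecycle {
    # Create the replacement certificate before destroying the old one.
    create_before_destroy = true
  }
}
```

Note that create_before_destroy only reorders the create and destroy steps of this one resource. It does not change how dependent resources are updated, which is the limitation described above.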

PROPOSED SOLUTION
Add a configuration parameter to the Venafi Terraform provider to ignore the failure of a preventive (i.e. prior to expiration) certificate refresh.

By adding the lifecycle block above we'd force the new certificate to be created first.
If certificate creation fails and the config param is turned on, the provider could return the current (still valid) certificate and let Terraform complete all other changes. The application would be left in a functional, consistent state.

A warning that the cert refresh failed would be helpful.
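
To make the request concrete, here is one way the option could look in configuration. The argument name ignore_renewal_failure is purely hypothetical: it does not exist in the provider, and the maintainers might prefer a different name, scope (resource vs. provider level), or behavior.

```hcl
resource "venafi_certificate" "app" {
  common_name       = "app.example.com" # placeholder hostname
  expiration_window = 720               # arbitrary example value

  # HYPOTHETICAL argument, not part of the provider today.
  # Intended meaning: if a renewal attempted inside the expiration window
  # fails while the current certificate is still valid, keep the current
  # certificate in state and emit a warning instead of failing the apply.
  ignore_renewal_failure = true

  lifecycle {
    create_before_destroy = true
  }
}
```

Under this sketch, a failed renewal of an already expired certificate would presumably still be an error, since there would be no valid certificate to fall back on.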

CURRENT ALTERNATIVES
The only alternative is to attempt to restore the application configuration manually. Restoring a certificate that is still valid is not too complicated, but if there are other changes that were not applied, the process of identifying and applying them is more complicated and error-prone.

VENAFI EXPERIENCE
I have been using the Venafi Terraform provider for more than a year.

@danjer2 added the enhancement label Feb 1, 2023
@luispresuelVenafi
Contributor

Hi @danjer2 thank you for reaching out.

We are sorry to hear that you experienced that. Which platform were you working with when it became unavailable? Was it TPP or VaaS?

@danjer2
Author

danjer2 commented Feb 1, 2023

I'm not familiar with the service setup, but based on the fact that the values for both url and trust_bundle are internal, I'm guessing we are hosting it on-prem.

@danjer2
Author

danjer2 commented Feb 8, 2023

I got additional information on the event. It wasn't a Venafi API outage. The problem was with the certificate issuer's back end, so the plan worked fine, but the apply failed when it tried to create the new certificate.

@luispresuelVenafi
Contributor

@danjer2 Hi, sorry for the late response.

I'm trying to better understand your situation. When you mentioned this:

but looking at the plans generated it would not help, because Terraform first unlinks the certificate. So even if the creation stops the run before destruction happens, the app is still left without a certificate.
Moreover, since the apply does not complete, it's possible that other infrastructure changes did not get applied, so the application is left in an inconsistent state.

Did you try to re-run the plan at that time, and Terraform didn't let you? I'm asking because I would have expected Terraform not to delete or modify your state at all, since the issuance didn't complete.

@danjer2
Author

danjer2 commented Feb 10, 2023

@luispresuelVenafi - I feel like you're asking multiple questions in one.

  • The Terraform state was probably left valid after the error. Once the certificate issuer's outage was resolved, I believe Terraform would have worked. Because the outage was not resolved quickly, we had to scramble for workarounds.
  • Our tests were trying to see whether we could handle this situation with Terraform functionality alone (like lifecycle{}). We concluded it would not help, because it works at the level of an individual resource and would not re-sequence other actions (like linking/unlinking resources). Moreover, we realized that even if we prevented any such unlinking from happening, the terraform apply would still stop before completing, with unknown impact (depending on which resources were yet to be updated).

I just had a long discussion with the team that encountered this, and we concluded that we would have two ways to make certificate renewal via Terraform reliable:

  1. Some change (like the one I proposed) to make the Venafi provider handle the error gracefully
  2. Separate the certificate creation/renewal from the rest of the infrastructure. This can be either a manual process or a separate Terraform configuration. The separation would ensure that if the cert renewal fails, it does not affect the deployment of the infrastructure.
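
The second option can be sketched with two independent Terraform configurations, where the infrastructure only reads the most recently issued certificate from the certificate configuration's state. The directory layout, the local backend, and the output names below are assumptions made for illustration; a real setup would likely use a shared remote backend.

```hcl
# certs/main.tf (assumed path): applied on its own, separately from the infrastructure.
resource "venafi_certificate" "app" {
  common_name       = "app.example.com" # placeholder hostname
  expiration_window = 720               # arbitrary example value
}

output "certificate_pem" {
  value = venafi_certificate.app.certificate
}

output "private_key_pem" {
  value     = venafi_certificate.app.private_key_pem
  sensitive = true
}

# infra/main.tf (assumed path): reads the last successfully issued certificate.
data "terraform_remote_state" "certs" {
  backend = "local"

  config = {
    path = "../certs/terraform.tfstate"
  }
}

locals {
  app_certificate = data.terraform_remote_state.certs.outputs.certificate_pem
  app_private_key = data.terraform_remote_state.certs.outputs.private_key_pem
}
```

With this split, a failed renewal only fails the apply in certs/. The state there still holds the previous certificate, so an apply in infra/ keeps using it and can complete its other changes.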
