Proactively Checking and Replacing STS Certificate on vSphere 6.x / 7.x

Recently, we’ve been working on a global issue affecting all customers who deployed vCenter Server at version 6.5 Update 2 or later. On these deployments, the Security Token Service (STS) signing certificate may have a validity period of only two years, so depending on when vCenter was deployed, it may be approaching expiry.

Currently, vCenter raises no alert for this certificate. Prior to 6.7u3g, customers also had no way to replace it once it expired (GSS involvement was required to execute internal procedures / scripts). When it expires, it silently generates a production down scenario.

Within the GSS team, we’ve come up with three scripts to help with this situation.

Checksts.py

Checksts.py is a python script that is mentioned in KB https://kb.vmware.com/s/article/79248. This script will proactively check for expiration of the STS certificate. It works on Windows vCenters as well as vCenter Server Appliances.

To use it, download it from the KB mentioned above.

Once it is downloaded, you can copy it to any directory on your vCenter. After that, you will run it like this:

  • Windows: "%VMWARE_PYTHON_BIN%" checksts.py
  • VCSA: python checksts.py

This is an example for VCSA:

If you get the message “You have expired STS certificates” and/or your certificate expires in less than 6 months, we recommend moving on to the next step: replacing the STS certificate! If your certificate expires in more than 6 months, you don’t have to worry about any of this!
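The decision rule above can be sketched in a few lines of Python. This is only an illustration of the thresholds described in this post, not part of checksts.py; the function name `sts_action` and the 182-day approximation of "6 months" are my own assumptions.

```python
from datetime import datetime, timedelta

# Assumption: "6 months" is approximated as 182 days; the post gives no exact day count.
SIX_MONTHS = timedelta(days=182)


def sts_action(not_after, now):
    """Map an STS certificate expiration date to the action suggested above."""
    if not_after <= now:
        return "expired: replace now"
    if not_after - now < SIX_MONTHS:
        return "expires in under 6 months: replace proactively"
    return "ok: nothing to do"


# Hypothetical dates, chosen only to exercise the three outcomes.
today = datetime(2020, 9, 1)
for expiry in (datetime(2020, 8, 1), datetime(2020, 12, 1), datetime(2022, 9, 25)):
    print(expiry.date(), "->", sts_action(expiry, today))
```

In practice you would feed it the expiration date that checksts.py reports for your environment.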

Fixsts.sh (VCSA) / Fixsts.ps1 (Windows)

The fixsts scripts are mentioned in https://kb.vmware.com/s/article/76719 (which I personally wrote) for VCSA and https://kb.vmware.com/s/article/79263 for Windows.

The idea is the same for both: replacing the STS certificate with a new, valid one. This can be done proactively (the cert has not expired yet) as well as reactively (the cert has already expired and you’re in a production down scenario).

The steps for these two KBs are mentioned in the articles. They’re pretty much identical, with minor differences in running the commands due to the Guest OS, and super straightforward to run.

Once the STS is replaced, in case it was done proactively, you will be good to go!

YOU CAN STOP READING FROM THIS POINT ON – hope you liked this blog entry!

However, if this was done reactively, it is likely that you will need to replace more certificates in your vCenter Server, especially if you were using VMCA certs (which could have the same expiration date as the STS certificate if they were never replaced).

Replacing other certificates

How do I know which of my other certificates are expired?

The KBs mentioned above provide two one-liners to check the certificates:

  • Windows: $VCInstallHome = [System.Environment]::ExpandEnvironmentVariables("%VMWARE_CIS_HOME%");foreach ($STORE in & "$VCInstallHome\vmafdd\vecs-cli" store list){Write-host STORE: $STORE;& "$VCInstallHome\vmafdd\vecs-cli" entry list --store $STORE --text | findstr /C:"Alias" /C:"Not After"}

  • VCSA: for i in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list); do echo STORE $i; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $i --text | egrep "Alias|Not After"; done

These commands show the expiration date of every certificate in each of the VECS (VMware Endpoint Certificate Store) stores. If a certificate’s expiration date is earlier than today, it is expired. Expired certificates also cause issues with services: vpxd-svcs, vpxd or vapi-endpoint, for example, will be pretty verbose about the expiration date of the affected certificates.

For example:

root@vcsa1 [ /tmp ]# for i in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list); do echo STORE $i; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $i --text | egrep "Alias|Not After"; done
STORE MACHINE_SSL_CERT
Alias : __MACHINE_CERT
Not After : Apr 6 11:57:19 2029 GMT
STORE TRUSTED_ROOTS
Alias : c96d3301505316ccc1b295276ece31318ad79ec7
Not After : Apr 6 11:57:19 2029 GMT
Alias : 8a11418d5ae2b87b7e8a5cb8646fbfae41503f9d
Not After : Dec 13 21:50:49 2029 GMT
Alias : cb5a495d34f3f2f75d357b47aac3799346665258
Not After : Sep 25 20:32:57 2022 GMT
Alias : 229a64a3dff7417d0b38fb011c692a55b7bee5c2
Not After : May 16 20:21:12 2030 GMT
Alias : 2f0e8e4f1658e61bef5004cb5efd159b90396838
Not After : May 16 20:45:07 2030 GMT
STORE TRUSTED_ROOT_CRLS
Alias : 4504400e4bcbdab5a34a9bc2555abd55327369c1
Alias : 31b2b5a18d89d90dadff901400a60d45ca3356e9
Alias : e7840a7cbbe7fcdd7a13d9159ff97443cc53fb5e
Alias : 985d7e55183635f13e2c6469eee9c72f68334615
STORE machine
Alias : machine
Not After : Apr 6 11:57:19 2029 GMT
STORE vsphere-webclient
Alias : vsphere-webclient
Not After : Apr 6 11:57:19 2029 GMT
STORE vpxd
Alias : vpxd
Not After : Apr 6 11:57:19 2029 GMT
STORE vpxd-extension
Alias : vpxd-extension
Not After : Apr 6 11:57:19 2029 GMT
STORE APPLMGMT_PASSWORD
STORE data-encipherment
Alias : data-encipherment
Not After : Apr 6 11:57:19 2029 GMT
STORE SMS
Alias : sms_self_signed
Not After : Apr 12 12:04:48 2029 GMT
STORE BACKUP_STORE

In this case, none of the certificates are expired. But if we had expired certificates, we would need to replace them!
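If you would rather not eyeball a long listing, the output of the one-liners can be parsed. The following is a rough Python sketch, not something shipped in the KBs: the function names are made up for illustration, and it assumes the dates look exactly like the listing above ("Apr 6 11:57:19 2029 GMT") with English month names.

```python
from datetime import datetime

# Assumption: dates look like "Apr 6 11:57:19 2029 GMT" and month names are English.
DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_vecs_listing(text):
    """Return a list of (store, alias, not_after) tuples from the one-liner output."""
    entries, store, alias = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("STORE"):
            # Accept both "STORE name" (VCSA) and "STORE: name" (Windows).
            store = line[len("STORE"):].lstrip(": ").strip()
            alias = None
        elif line.startswith("Alias"):
            alias = line.partition(":")[2].strip()
        elif line.startswith("Not After") and alias is not None:
            stamp = line.partition(":")[2].strip()
            entries.append((store, alias, datetime.strptime(stamp, DATE_FORMAT)))
            alias = None  # CRL entries have an alias but no "Not After" line
    return entries


def expired_entries(entries, now):
    """Keep only the entries whose expiration date is before 'now'."""
    return [entry for entry in entries if entry[2] < now]


# A shortened copy of the listing above.
SAMPLE = """STORE MACHINE_SSL_CERT
Alias : __MACHINE_CERT
Not After : Apr 6 11:57:19 2029 GMT
STORE TRUSTED_ROOTS
Alias : c96d3301505316ccc1b295276ece31318ad79ec7
Not After : Apr 6 11:57:19 2029 GMT
Alias : cb5a495d34f3f2f75d357b47aac3799346665258
Not After : Sep 25 20:32:57 2022 GMT
STORE TRUSTED_ROOT_CRLS
Alias : 4504400e4bcbdab5a34a9bc2555abd55327369c1
STORE vpxd
Alias : vpxd
Not After : Apr 6 11:57:19 2029 GMT
"""

# Hypothetical "today", picked so that one trusted root counts as expired.
for store, alias, not_after in expired_entries(parse_vecs_listing(SAMPLE), datetime(2023, 1, 1)):
    print("EXPIRED:", store, alias, not_after)
```

The store name of each expired entry is what matters for the next step, since it decides which certificate-manager option applies.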

Let’s split them into three groups. All of them are replaced using the same tool, certificate-manager, detailed in KB https://kb.vmware.com/s/article/2097936, but the option you use depends on the scenario.

  • Group 1: Machine SSL Certificate (Front facing certificate, on port 443)
    • If only Machine SSL is expired, you will run Option 3 (Replace the Machine SSL certificate with a VMCA Generated Certificate) of this KB, with the following caveats:
      • The “comma separated list of hostnames” you will be prompted to complete should contain the PNID of the node as well as any additional hostname or alias you might be using. How do we get the PNID for the node?
        • Windows: "%VMWARE_CIS_HOME%"\vmafdd\vmafd-cli get-pnid --server-name localhost
        • VCSA: /usr/lib/vmware-vmafd/bin/vmafd-cli get-pnid --server-name localhost
      • The value of “VMCA Name” should match the PNID obtained in the prior step
  • Group 2: Root certificate (VMCA root certificate)
    • If any certificate in the TRUSTED_ROOTS store is expired, it is safer to just run Option 8 (Reset all certificates) from the KB mentioned above. This will reset all certificates to VMCA signed. The same caveats mentioned for Option 3 apply.
  • Group 3: Solution Users certificates (vpxd, vpxd-extension, machine, vsphere-webclient)
    • If any certificate in the vpxd, vpxd-extension, machine or vsphere-webclient stores is expired, run Option 6 (Replace Solution User Certificates with VMCA generated Certificates) from the KB mentioned above. The same caveats mentioned for Option 3 apply.
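The three groups above boil down to a small lookup from "which stores contain expired certificates" to "which certificate-manager option to run". Here is a minimal Python sketch of that mapping. The function name is hypothetical, and one detail is my own assumption: this post does not say what to do when both the Machine SSL and Solution User certificates are expired (but no root), so the sketch simply returns both options.

```python
# Stores that hold the Solution User certificates (Group 3).
SOLUTION_USER_STORES = {"machine", "vsphere-webclient", "vpxd", "vpxd-extension"}


def certificate_manager_options(expired_stores):
    """Return the certificate-manager option numbers suggested by the groups above."""
    expired = set(expired_stores)
    if "TRUSTED_ROOTS" in expired:
        # Group 2: an expired root means resetting everything (Option 8).
        return [8]
    options = []
    if "MACHINE_SSL_CERT" in expired:
        options.append(3)  # Group 1: Machine SSL certificate
    if expired & SOLUTION_USER_STORES:
        options.append(6)  # Group 3: Solution User certificates
    # Assumption: if both apply, both options are listed; the post does not cover this case.
    return options


print(certificate_manager_options(["MACHINE_SSL_CERT"]))
print(certificate_manager_options(["vpxd", "TRUSTED_ROOTS"]))
print(certificate_manager_options(["vpxd-extension", "machine"]))
```

An empty list means none of the three groups has an expired certificate.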

Once all this is done, you should be back up and running with regenerated certificates, and out of the production down scenario!

Closing note

This is a pretty concerning issue, so I’m really happy to have been part of the team to help fix so many environments across the globe.

Please use this information to proactively check the STS certificate and replace it before you end up in a production down scenario. You can share this with customers, partners, or whoever you feel might benefit from this information!

Pre-upgrade considerations in Multi-vCenter environments

With vSphere 7.0 being released on April 2nd, 2020 and vSphere 6.0 reaching its end of general support on March 12th, 2020, this is one of those moments in which many environments are in the process of upgrading their vSphere version, either from 6.0 to 6.5/6.7 (to continue having support) or to 7.0 to take advantage of all the new features, such as native Kubernetes integration.

However, we have been getting an increased number of Support Requests with issues after upgrades in Multi-vCenter environments using Enhanced Linked Mode (from now on, ELM), especially if the environment is using more than one Platform Services Controller (from now on, PSC), either embedded or external.

The goal of this article is to help you understand your roadblocks to upgrade PRIOR to actually doing the upgrade, so you don’t incur any downtime and can proactively fix everything that’s needed before upgrading.

For the purposes of this article, I will try to demonstrate everything with a Demo Environment, so everything is clearer.

Demo Environment

Super simple environment!


Two vCenter Server Appliances with Embedded PSC, in a single SSO domain.
I’m going to demonstrate the issues that we could run into if we upgrade an environment that is not in a healthy state of PSC replication.

What’s PSC Replication?

As you know, data replicates between the PSC instances (embedded in this scenario) when Enhanced Linked Mode is configured.

What data is replicated?

  • Users and roles
  • Trusted Roots store certificates
  • Lookup Service service registrations
  • Computer accounts
  • Domain controller accounts

And many, many more things. VMDIR (VMware Directory Service) is a Multi-master LDAP database.

I did mention Lookup Service service registrations… what are those?

Lookup Service

The Lookup Service is a component that registers the location of vSphere components so they can securely find and communicate with each other. This includes every internal service as well as some 2nd Party Tools (such as NSX, vSphere Replication, SRM) and 3rd Party Tools (Storage plugins, for example).

This is the number of Service Registrations per Service Type in our Demo environment:

  2         Service Type: applmgmt
  2         Service Type: certificatemanagement
  2         Service Type: cis.cls
  2         Service Type: cis.vmonapi
  2         Service Type: client
  2         Service Type: com.vmware.vsan.dp
  2         Service Type: com.vmware.vsphere.client
  2         Service Type: cs.authorization
  2         Service Type: cs.componentmanager
  2         Service Type: cs.ds
  2         Service Type: cs.eam
  2         Service Type: cs.identity
  2         Service Type: cs.inventory
  2         Service Type: cs.keyvalue
  2         Service Type: cs.license
  2         Service Type: cs.perfcharts
  2         Service Type: cs.vapi
  2         Service Type: cs.vsm
  2         Service Type: imagebuilder
  2         Service Type: messagebus.config
  2         Service Type: mixed
  2         Service Type: phservice
  2         Service Type: rbd
  2         Service Type: sca
  2         Service Type: sms
  2         Service Type: sso:admin
  2         Service Type: sso:groupcheck
  2         Service Type: sso:sts
  2         Service Type: topologysvc
  2         Service Type: vcenterserver
  2         Service Type: vcha
  2         Service Type: vcIntegrity
  2         Service Type: vsan-dps
  2         Service Type: vsan-health
  2         Service Type: vsphereclient
  2         Service Type: vsphereui

You can see services such as vsphereclient (vSphere Flash Client), vsphereui (vSphere HTML5 Client) and vcenterserver (vCenter Server), among others.

You can also see that there are two of every registration. Every PSC has its own Lookup Service, but they replicate the data through VMDIR, so every registration exists on every PSC.
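A count like the one above can be reproduced from the raw Lookup Service listing with a short Python sketch. The function names are hypothetical, and the expectation that every service type appears once per node is an assumption that holds for this two-node demo; environments with 2nd or 3rd party registrations will legitimately show other counts.

```python
from collections import Counter


def registrations_per_type(listing):
    """Count 'Service Type: ...' lines in a Lookup Service listing."""
    counts = Counter()
    for raw in listing.splitlines():
        # Split on the first colon only, so types like "sso:sts" stay intact.
        key, sep, value = raw.strip().partition(":")
        if sep and key.strip().lower() == "service type":
            counts[value.strip()] += 1
    return counts


def types_with_unexpected_count(counts, expected):
    """List the service types that do not appear 'expected' times."""
    return sorted(name for name, seen in counts.items() if seen != expected)


# Hypothetical, heavily shortened listing: vsphereui is registered only once here.
SAMPLE = """Service Type: vcenterserver
Version: 6.7
Service Type: vcenterserver
Version: 6.7
Service Type: sso:sts
Service Type: sso:sts
Service Type: vsphereui
"""

counts = registrations_per_type(SAMPLE)
print(dict(counts))
print(types_with_unexpected_count(counts, expected=2))
```

A service type that shows up fewer times than expected on one node is a hint that a registration never replicated.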

Let’s take a look at the vCenter Server registrations:

I’m running the following command on one of the vCenter Servers (with Embedded PSC)

/usr/lib/vmidentity/tools/scripts/lstool.py list --url http://localhost:7080/lookupservice/sdk | grep -i "Service type: vCenterServer" -A9 | egrep "Service Type:|Version|URL"

For the purposes of this article, I’m only interested in the Service Type, Version and URL. However, a service registration contains much more data than that, such as the Service Registration ID, the Node ID, and the URLs for all the different endpoints, each with its own SSL certificate, but we’re not going to dive into that.

Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

We can see that every registration has the URL and the version. This is really important! Keep this in the back of your mind, because we’re going to come back to it!

PSC Replication Status

As we mentioned previously, the VMDIR database replicates between the PSCs.

You can check the replication status of any PSC instance with the following command:

/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w SSO_Password

This is the output on our Demo environment:


Partner: vcsa2.gsslabs.org
Host available: Yes
Status available: Yes
My last change number: 10360
Partner has seen my change number: 10360
Partner is 0 changes behind.

This means that outgoing replication for this node is working. However, it does not mean that replication is working correctly in both directions. To confirm that, you need to run the same command on its replication partner:

Partner: vcsa1.gsslabs.org
Host available: Yes
Status available: Yes
My last change number: 10351
Partner has seen my change number: 10351
Partner is 0 changes behind.

This is good: our environment is healthy replication-wise!
But what if it wasn’t?

  • Host available: no, would mean that the replication partner is not reachable through the network
  • Status available: no, would mean that the replication partner is reachable through the network, but its VMDIR is in either read-only or null state (this is bad!)
  • Having a big number of changes behind and not updating could mean that this local node is in read-only or null state (this is also really bad!)
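With more than a couple of nodes, reading this output by hand gets tedious, so here is a hedged Python sketch that applies the three checks above to the text printed by showpartnerstatus. The function names and messages are my own, and the parser assumes the output has exactly the line layout shown in this post.

```python
def _is_yes(line):
    """True when the value after the first colon is 'yes' (case-insensitive)."""
    return line.partition(":")[2].strip().lower() == "yes"


def parse_partner_status(text):
    """Split showpartnerstatus output into one dict per 'Partner:' block."""
    partners = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Partner:"):
            partners.append({"partner": line.partition(":")[2].strip()})
        elif not partners:
            continue  # ignore anything printed before the first partner block
        elif line.startswith("Host available:"):
            partners[-1]["host_available"] = _is_yes(line)
        elif line.startswith("Status available:"):
            partners[-1]["status_available"] = _is_yes(line)
        elif line.startswith("Partner is") and "changes behind" in line:
            partners[-1]["changes_behind"] = int(line.split()[2])
    return partners


def replication_problems(partner):
    """Apply the three checks described above to one parsed partner block."""
    problems = []
    if not partner.get("host_available"):
        problems.append("partner not reachable over the network")
    elif not partner.get("status_available"):
        problems.append("partner VMDIR may be in read-only or null state")
    if partner.get("changes_behind", 0) > 0:
        problems.append("partner is behind; re-run to see whether it catches up")
    return problems


HEALTHY = """Partner: vcsa2.gsslabs.org
Host available: Yes
Status available: Yes
My last change number: 10360
Partner has seen my change number: 10360
Partner is 0 changes behind.
"""

for partner in parse_partner_status(HEALTHY):
    print(partner["partner"], replication_problems(partner) or "healthy")
```

Remember that this only covers the outgoing direction of the node where the command was run, so the same check has to be repeated with the output from every replication partner.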

So how do we check our VMDIR state if the “showpartnerstatus” command shows any of these errors? By running the following command:

echo 6 | /usr/lib/vmware-vmdir/bin/vdcadmintool

You will get an output similar to:

VmDir State is - Normal

This state could also be Null, Read-Only or Standalone. For the purposes of this document, all three are bad!
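If you want to script this check, the state can be pulled out of the command output with a few lines of Python. This is only a sketch with made-up function names; it looks for the "VmDir State is -" line shown above and ignores every other line the tool prints.

```python
MARKER = "VmDir State is -"


def vmdir_state(output):
    """Return the state reported after 'VmDir State is -', or None if it is missing."""
    for line in output.splitlines():
        if MARKER in line:
            return line.split(MARKER, 1)[1].strip()
    return None


def vmdir_is_healthy(state):
    """Only 'Normal' counts as healthy; Null, Read-Only and Standalone do not."""
    return state is not None and state.lower() == "normal"


for sample in ("VmDir State is - Normal", "VmDir State is - Read-Only"):
    state = vmdir_state(sample)
    print(state, "->", "healthy" if vmdir_is_healthy(state) else "needs attention")
```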

But how does a PSC get into this state?

After restoring a PSC (either embedded or external) from a snapshot, image-level backup, file-level backup, or VM-based backup, its Update Sequence Number (USN) value is lower than that of its replication partners. This results in the replication partners being out of synchronization with the restored node.

This is why, when snapshotting a Multi-vCenter environment, you should always do it with all nodes powered off. If you restore one of the nodes to a snapshot, you have to restore all of the involved nodes. This also applies to backups!

What can broken replication affect?

Replication issues are usually called a “Silent Killer” because you don’t notice that replication is broken until you want to make a change in the SSO environment. These changes can be adding a new 2nd or 3rd party tool, creating local users / roles in the SSO domain, installing a new vCenter or PSC, converging from External PSC to Embedded PSC, and the one we’re discussing in this document: upgrading!

So let’s go back to our demo environment…


And we’re now going to upgrade vcsa1.gsslabs.org – The upgrade succeeds…
Remember what we discussed about the versions?
This is what vcsa1.gsslabs.org (the upgraded one) now sees in Lookup Service

Service Type: vcenterserver
Version: 7.0
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

This is only a simple change to demonstrate the issue. The same would happen for every other internal service, and in the case of vSphere 6.7 to 7.0, the upgrade will create, rename and re-register a bunch of other services, since the whole VMDIR structure changed.

This is fine! When we log in to vcsa1.gsslabs.org, we see both vCenter Servers…
But what happens if we log in to vcsa2.gsslabs.org ? We see that vcsa1.gsslabs.org is not showing up!

So we go to check the Lookup Service entries, and we find the following…

Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk


Since replication was not working, vcsa2.gsslabs.org never got the changes that vcsa1.gsslabs.org made during the upgrade… so when vcsa2.gsslabs.org’s Web Client tries to contact the vCenter instance in vcsa1.gsslabs.org, there is a version mismatch, and therefore it does not load it.
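This kind of mismatch can be spotted by comparing what each node sees. The Python sketch below takes the filtered lstool.py output (Service Type, Version, URL) captured on two nodes and reports every URL whose version differs between them. The function names are hypothetical, and keying by URL is an assumption that only makes sense for output filtered down to a single service type, as in the command used earlier.

```python
def parse_registrations(text):
    """Return {url: version} from 'Service Type / Version / URL' triples."""
    versions, version = {}, None
    for raw in text.splitlines():
        # Split on the first colon only, so the URL keeps its own colons.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        if key == "version":
            version = value.strip()
        elif key == "url" and version is not None:
            versions[value.strip()] = version
            version = None
    return versions


def version_disagreements(view_a, view_b):
    """Return {url: (version_in_a, version_in_b)} for every URL the views disagree on."""
    return {
        url: (view_a.get(url), view_b.get(url))
        for url in sorted(set(view_a) | set(view_b))
        if view_a.get(url) != view_b.get(url)
    }


# The two listings from the demo, after upgrading vcsa1 only.
SEEN_BY_VCSA1 = """Service Type: vcenterserver
Version: 7.0
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk
"""

SEEN_BY_VCSA2 = """Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk
"""

diff = version_disagreements(parse_registrations(SEEN_BY_VCSA1), parse_registrations(SEEN_BY_VCSA2))
for url, (on_vcsa1, on_vcsa2) in diff.items():
    print(url, "vcsa1 sees", on_vcsa1, "/ vcsa2 sees", on_vcsa2)
```

With healthy replication both views are identical and the result is empty.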

If you now upgrade vcsa2.gsslabs.org, the same thing is going to happen, and both are going to show something like this…

VCSA1
Service Type: vcenterserver
Version: 7.0
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

VCSA2
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 7.0
URL: https://vcsa2.gsslabs.org:443/sdk


Effective immediately, ELM is officially broken. These vCenters won’t see each other in the Web Client, let alone replicate VMDIR changes.

This state is not easily fixable: it would likely involve cleaning up VMDIR on both sides and then executing a cross-domain repoint between the two. Now imagine that instead of this simple environment you have a 6 vCenter environment, and you run into these issues. Can you imagine the trouble you will get into?

OK, so now what do we do?

Now that the impact of broken PSC replication on upgrades is clear (it will also affect convergence, and many other SSO operations), here is what you can do to avoid being sucker punched by the upgrade process.

  • Check if replication between all your PSC instances is working correctly and showing 0 changes behind across the board. This is done using the vdcrepadmin command that was shown before
  • If you run into any issue such as the ones already mentioned, check the VMDIR status using the vdcadmintool command that was shown before
  • If you get any of the errors detailed in this article, please open a Support Request with VMware -> https://kb.vmware.com/s/article/2006985
    We have a multitude of internal tools that can help you fix the replication issues and get you into a healthy state before attempting any other disruptive process, such as upgrading!

Closing Note

I hope this blog post (my debut blog post!) is helpful for everyone who is running into these situations. The idea was to demonstrate a very real issue you might hit in the process of an upgrade, using something as simple as the version change in the vCenter Server Service Registration.

Hopefully this will help avoid many critical issues in Multi-vCenter Environments.

Regards,