Lessons learned while deploying VCF 4.2 Management Domain

Hello Everyone! It’s me again, trying to maintain a weekly post cadence!

Today I’m going to talk about some roadblocks I hit while doing a 4.2 VCF Deployment in a real, customer environment. Hopefully this will prevent these issues from happening to you or help you to solve them quickly if they do arise!

Getting started with VMware Cloud Foundation (VCF) 4.0 - CormacHogan.com

Password Policy for Cloud Builder

In VCF 4.2, several changes to password strength were made. It seems that using 8 character passwords are hit/miss (you could get a valid deployment and then immediately a non-valid deployment if you deploy another Cloud Builder with a password like “VMw@r3!!” – I haven’t been able to fully grasp the cause for this behaviour.

In addition, VMware is now a dictionary word, so it wont be allowed. So “VMware1!” and “VMware1!VMware1!” will also fail.

The password that i’ve been using successfully for the initial deployment is “VMw@r3!!VMw@r3!!” – That one works 100% – You can go ahead and use that one.

Hostnames in uppercase

This one is really, really strange – If the hostnames of your ESXi hosts are in uppercase, you will get a ‘Failed to connect to lowercase_hostname’ for all of your hosts when running the validation, and the validation will stop and won’t query any of the host configuration

I spent some time trying to figure this out, at first I thought it was DNS records, but then on a different environment, 3 of the 4 hosts had their hostname in upper case and one of them in lower case, and the one in lower case was the only one connecting, so that made me test the change and suddenly the new host in lowercase was also connecting!

To clarify, ESXI1.VSPHERE.LOCAL will fail, esxi1.vsphere.local will work – Make sure your hostnames are in lowercase

Heterogeneous / Unbalanced disk configuration across hosts

This one is really interesting, let’s say you’re doing an all flash VCF and you have 20 disks per host – The best way to configure it would be 4 Disk groups of 1 Cache + 4 Capacity, so that you would use all 20 disks.

Since you can have at a maximum 5 Disk groups of 1 Cache + 7 Capacity, 40 is the maximum number of disks you can have.

However, make sure that you’re following these two rules for your deployment

  • Make sure that the amount of disks follows a multiple of a homogeneous disk group configuration so that all your disks can be used and all the disk groups have the same amount of disks – I.e, if you have 22 disks, there is no way you can use all disks while maintaining all disk groups with the same amount of disks. If you have 22 disks, you can do 3 (1+6) and one won’t be used, or 4(1+4) and two won’t be used.
  • Make sure that all your hosts have the same amount of disks. You can check this before installing – In my scenario, validation was passing but it was setting the cluster as hybrid instead of all flash.
    After checking that all devices were SSD and were marked as SSD I was really confused. Then I checked and two of the hosts had 2 more disks than the rest. Fixing that made the validation pass and marking the cluster as all flash.

EVC Mode

This one almost made me reinstall the whole cluster…

BE REALLY SURE that you’re selecting the correct EVC mode for your CPU family if you’re selecting an EVC mode in the Cloud Builder spreadsheet.

If you select the wrong EVC mode, Cloud Builder will fail in this deployment, and you won’t be able to continue from the GUI at all. The only way around it is via the API. Otherwise, it is wiping the cluster and starting from scratch!

I’m going to show you how to fix this issue but the method applies in case you need to edit the configuration and then re-attempt a deployment.

First of all, you need to get your SDDC Deployment ID, you can get it with this API call (I will be using curl for this example but you can also use something like invoke-restmethod in powershell or even a GUI based REST client such as Postman)

Get your SDDC Deployment ID

curl 'https://cloud_builder_fqdn/v1/sddcs/' -i -u 'admin:your_password' -X GET \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \

You can export the output to a file or to a text viewing tool such as less, and then search for the sddcId value

Editing the JSON File

Once you have the sddcId, you need to edit the JSON file that CB generated from the spreadsheet so you can then use it in the API call. I recommend that you copy the file and edit the copy. The file is located at /opt/vmware/sddc-support/cloud_admin_tools/resources/vcf-public-ems/

cp /opt/vmware/sddc-support/cloud_admin_tools/Resources/vcf-public-e                                                                                                                     ms/vcf-public-ems.json /tmp/newjson.json
sed -i "s/cascadelake/haswell/g" /tmp/newjson.json

You can also edit the file using vi – in this case I used sed because I knew the string will only appear once in the file and it was faster

Restarting the deployment

Now that you have the sddcId and you’ve edited the JSON file, it is time for you to restart the process using another API call

curl 'https://cloud_builder_fqdn/v1/sddcs/your_sddc_id_from_previous_step' -i -u 'admin:your_password' -X PATCH     -H 'Content-Type: application/json'     -H 'Accept: application/json'     -d "@/tmp/newjson.json"  -k

Make sure to add the @ before the location of the file when using curl

Once you run this, you should get something like:

HTTP/1.1 100 Continue
HTTP/1.1 200
Server: nginx
Date: Wed, 07 Apr 2021 20:37:08 GMT

And if you log in to the Cloud Builder web interface, your deployment should be running again! Phew, you saved yourself from reinstalling and preparing 4 nodes! Go grab a beer while the deployment continues 😀

Driver Issue when installing NSX-T VIBs

I ran into this issue after waiting for multiple hours for the NSX-T Host Preparation to finish, and seeing all the hosts on the NSX-T tab being marked as failed.

When checking the debug logs for Cloud Builder, I saw errors like:

2021-04-07T23:06:44.700+0000 [bringup,196c7022580bfc32,5a84] DEBUG [c.v.v.c.f.p.n.p.a.ConfigureNsxtTransportNodeAction,bringup-exec-7] TransportNode esxi1.vsphere.local DeploymentState state is {"details":[{"failureCode":260
80,"failureMessage":"Failed to install software on host. Failed to install software on host. esxi1.vsphere.local : java.rmi.RemoteException:  [DependencyError] VIB QLC_bootbank_qedi_2.19.9.0-1OEM.700.1.0.15843807 requires qe
dentv_ver \u003d X.40.17.0, but the requirement cannot be satisfied within the ImageProfile. VIB QLC_bootbank_qedf_2.2.8.0-1OEM.700.1.0.15843807 requires qedentv_ver \u003d X.40.17.0, but the requirement cannot be satisfied within the Im
ageProfile. Please refer to the log file for more details.","state":"failed","subSystemId":"eeaefa1e-c5a2-4a8a-9623-994b94a803a9","__dynamicStructureFields":{"fields":{},"name":"struct"}}],"state":"failed","__dynamicStructureFields":{"fi

This is related to QLogic drivers that are included in the HP custom image that was being used in this deployment (and was patched to 7.0u1d which is the pre-requisite for VCF 4.2)

Indeed, these drivers were installed

esxcli software vib list | grep qed
qedf                          QLC     VMwareCertified   2021-03-03
qedi                         QLC     VMwareCertified   2021-03-03
qedentv                     VMW     VMwareCertified   2021-03-04
qedrntv                     VMW     VMwareCertified   2021-03-04

None of these drivers were in use, and none of the hosts were using QLogic hardware – So these drivers could be removed without issues, however, it is best to unconfigure the hosts from NSX-T first since that also prompts for a reboot.

Go to the Transport Node tab in NSX-T, select the cluster, and click on “Unprepare” – This will likely fail and prompt you to run a force cleanup – This one will work and the hosts will disappear from the tab.

In my scenario, none of the NSX-T VIBs were installed so no NSX-T VIB cleanup was necessary

Now, it is time to delete the drivers from the hosts and reboot them. You can run this one by one on the hosts (since you already have vCenter, vCLS, and NSX Manager VMs running, you can’t just blindly power-off all your hosts)

esxcli software vib remove --vibname=qedentv --force
esxcli software vib remove --vibname=qedrntv --force
esxcli software vib remove --vibname=qedf --force
esxcli software vib remove --vibname=qedi --force
esxcli system maintenanceMode set --enable true
esxcli system shutdown reboot --reason "Drivers"

Edge TEP to ESXi TEP validation when using Static IP Pool

VCF 4.2 removes the need of having a DHCP server on the ESXi TEP network (as long as you’re not using stretched cluster) which is a lifesaver for many, since setting up the DHCP server was usually a light stopper for customers (the other one being BGP)

However, the validation still attempts to search for a DHCP server (it doesn’t matter that you configured a Static IP Pool on the spreadsheet) and since there isn’t any, you get a 169.254.x.x IP and the validation fails. For example:

VM Kernel ping from IP '' ('NSXT_EDGE_TEP') from host 'esxi1.vsphere.local' to IP '' ('NSXT_HOST_OVERLAY') on host 'esxi2.vsphere.local' failed
You can see the IP is on the 169.254.x.x range

Luckily, this is just a validation bug, it is reported internally, and will likely be fixed in the latest VCF release. The issue will not present itself while actually doing the deployment and the TEP addresses will be set up correctly using the static IP Pool

BGP Route Distribution Failure

If your BGP neighboring is not configured correctly on your upstream routers, you will see the task “Verify BGP Route Distribution fail”

021-04-08T05:09:54.729+0000 [bringup,42ba3b72e2ee4185,395f] ERROR [c.v.v.c.f.p.n.p.a.VerifyBgpRouteDistributionNsxApiAction,pool-3-thread-13] FAILED_TO_VALIDATE_BGP_ROUTE_DISTRIBUTION
com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException: Failed to validate the BGP Route Distribution result for edge node with ID 123b3404-bab6-4013-a9f7-eba3b91b4faf

This means that the BGP configuration on the upstream routers is incorrect, usually, there is a BGP neighbor missing. The easiest way to figure out what’s missing is to check the BGP status on the Edge Nodes

In my case, the Upstream switches only had one neighbor configured per uplink VLAN, so node 1 showed:

BGP neighbor is, remote AS 65211, local AS 65210, external link
BGP version 4, remote router ID, local router ID
BGP state = Established, up for 09:09:51

And node 2 Showed:

BGP neighbor is, remote AS 65211, local AS 65210, external link
BGP version 4, remote router ID, local router ID
BGP state = Connect

You can see that the BGP session for node 2 is not established. After configuring the neighbor correctly on the upstream routers, the issue was resolved!


Deploying VCF 4.2 in this environment has been a rollercoaster but luckily, all the issues were able to be solved.

I hope this helps you either avoid all of these issues (by pre-emptively checking and fixing what could go wrong) or in case it does happen to you, to fix them as quick as possible)

Stay tuned for more VCF 4.2 adventures, next time, with workload domains!

How do I get to vSphere 7.0 without dying in the process?

Hello Everyone,

After a long hiatus, I decided to write a new blog post (and hopefully improve the frequency of them :D) – This will be based on a 2-hour presentation that I did for VMUG (VMware User Group) Argentina last week, which was done in spanish, and I will link it down below

However, for all of the non-spanish Speakers, I will do a breakdown of everything you need to check before attempting a vSphere upgrade from the vCenter & PSC perspective to pass the upgrade wth flying colors! – Buckle up!

Where is our environment currently standing?

First of all, you need to assess the current situation of your vCenters and PSCs – Is replication working correctly for example? This article goes really really deep into checking that:

Pre-upgrade considerations in Multi-vCenter environments

If you have any replication issues, this is the first thing you need to fix, otherwise, as shown in the previous article (and the video) you risk completely destroying your environment.

The 2nd thing you need to check is your current topology – How many PSCs and vCenters are actually in my environment? Am I using PSC HA? Is everything converged? Depending on your current topology, it might be a pretty trivial migration or it would need multiple steps over the course of a weekend.

What happens in the upgrade process?

First of all, the external PSC is deprecated in vSphere 7.0 – That means that, as a part of the upgrade process, any environment with an external PSC is converged. Even though this process might be straightforward, it can cause multiple problems before, during and after the migration. It’s easier and more convenient to break it up in parts

So if we’re good with replication (check and re-check previous article, I can’t stress this enough) then we need to figure out an upgrade and migration plan

Planning the upgrade process based on our topology

Let’s start with something simple:

What would be the correct steps here?

Let’s break it down:

1: Offline snapshots of all three VCs (with embedded PSCs) – offline means with all the SSO domain powered off- this is done from the ESXi nodes that are hosting the VMs.

2: Upgrade vCenter 1

3: Check functionality and replication

4: Offline snapshots of all three VCs (with embedded PSCs)

5: Upgrade vCenter 2

6: Check Functionality and replication

7: Guess what?

8: Upgrade vCenter 3:

9: Check Functionality and replication

10: Delete all snapshots

Why am I taking snapshots at every step? Why don’t I just take a single round of snapshots and then upgrade all at once?

Well, because if you had any issue at any point of the 2nd or 3rd upgrade, you would have to roll back everything and start from scratch. If you do it this way, you have multiple points to go back and avoid having to re-do the upgrade process! This can get even worse if instead of 3 vCenters you have 9 or 10 – If let’s say, you had an issue with upgrade 7, you would have to revert everything!

Now let’s make this a little bit more complicated!

So let’s picture this scenario (which is not too uncommon, i’ve seen this is in the real world)

What do we have?

First of all, blue lines symbolize good replication and red lines symbolize that replication is not working – So, as discussed earlier, this will be the first thing to fix – in the process of fixing this (most likely with a GSS ticket), multiple rounds of offline snapshots will be taken!

Now, onto the topology:

  • 6 External PSCs in a ring topology
  • 3 PSC HA VIPs being used by 2 vCenters each
  • 6 vCenters

So what should we do here? This not only involves the upgrade of the vSphere environment, but also, the re-pointing of 2nd and 3rd party tools to the new converged PSCs – Think of NSX and SRM for example.

The biggest pain point in this scenario, however, is PSC HA – how do we get rid of this prior to the upgrade?

Even though there is a KB for converging PSC HA (https://kb.vmware.com/s/article/65129) in practice, this is not the best approach due to how error prone it is.

What is the best approach? There are two ways to approach this, depending on downtime and operations.

The cleanest approach, would be to deploy 6 new PSCs, then repoint the vCenters to those 6 PSCs, and then decomission all the PSC HA nodes (as well as the VIP) – However, this might be complicated because of lack of IP addresses in the management segment, time, etc.

You could also leverage lsdoctor (https://kb.vmware.com/s/article/80469) to unconfigure PSC HA and then repoint the vCenters to each of the nodes – This introduces a little bit more downtime per vCenter (downtime when unconfiguring PSC HA + downtime until the repoint is complete) but removes the need of deploying new PSCs.

If you ask me, I recommend the first option, to make this as clean as possible.

So in this scenario, what would you do?

  1. Offline snapshots of all vCenters and PSCs
  2. Deploy PSC 7 pointed to PSC 6
  3. Deploy PSC N pointed to PSC N-1 until all PSCs are deployed.
  4. Check replication among the new PSCs

So now we have something like this

You can see that by deploying the PSCs in that order, we have a “semi-ring” already, with way less operational hassle than if we were deploying them pointed to a single PSC and then having to remake the replication agreements

So what’s next?

We need to repoint the vCenters to these new PSCs – Since the repoint is a pretty short process, you can get away with taking a single round of offline snapshots at the beginning and just repoint everything

  1. Offline snapshots of all vCenters and PSCs
  2. Repoint all vCenters to the new PSCs, 1:1
  3. Check correct functioning

End result:

Lovely, right?

Now, we need to get rid of all the PSCs that were forming the PSC HA (nodes and VIPs)

  1. Offline snapshots of all vCenters and PSCs
  2. Decomission all PSCs and PSC HA VIP nodes using: https://kb.vmware.com/s/article/2106736
  3. Check correct functioning

Now we’re here!

So we did all this and we haven’t even started upgrading or converging… but believe me, taking due diligence in doing this as clean as possible will save you from multiple headaches when you actually upgrade!

So what is left?

  1. Form a ring creating an agreement between PSC12 and PSC7
  2. Take a new round of offline snapshots
  3. Converge PSC7
  4. Check correct functioning
  5. Take a new round of offline snapshots
  6. Converge PSC8
  7. ….
  8. ….
  9. Until all PSCs are converged

In case there is any issue with the convergence, you can just go back to the latest functioning snapshot so you don’t have to redo everything!

You should be here now:

And from here, you can finally do the upgrade process – as discussed previously and in the first scenario, you should take a round of offline snapshots per each upgrade, to avoid having to re-do upgrades

Last but not least, you should repoint all 2nd and 3rd party solutions to the new converged (and upgraded) PSCs that are now living inside the vCenter appliance!

Closing note

I hope you enjoyed this post – If you have even limited knowledge of spanish, I encourage you to watch the youtube video in which I go over this in detail, and also I analyze and fix replication issues the same way it would be done if you contacted GSS.

Feel free to share this with peers, customers, partners – If we generate awareness about these processes and a clean and correct way of doing them, we will have way more succesful upgrades!

Proactively Checking and Replacing STS Certificate on vSphere 6.x / 7.x

Recently, we’ve been working on a global issue affecting all customers that had deployed a vCenter Server as version 6.5 Update 2 or later. The Security Token Service (STS) signing certificate may have a two-year validity period. Depending on when vCenter was deployed, this may be approaching expiry.

Since currently there is no alert on vCenter for this certificate, and also it is a certificate that prior to 6.7u3g had no way to be replaced by customers in case of expiration (required GSS involvement to execute internal procedures / scripts) and it generates a production down scenario, silently.

Within the GSS team, we’ve come up with three scripts to help with this situation.


Checksts.py is a python script that is mentioned in KB https://kb.vmware.com/s/article/79248. This script will proactively check for expiration of the STS certificate. It works on Windows vCenters as well as vCenter Server Appliances.

To use it, you can download it from the KB mentioned:

Once it is downloaded, you can copy it to any directory on your vCenter. After that, you will run it like this:

  • Windows: "%VMWARE_PYTHON_BIN%" checksts.py
  • VCSA: python checksts.py

This is an example for VCSA:

If you get the message “You have expired STS certificates” and/or your certificate expiration date is in less than 6 months, we recommend to move onto the next step, replacing the STS certificate! If your expiration date is in more than 6 months, then you don’t have to worry about any of this!

Fixsts.sh (VCSA) / Fixsts.ps1 (Windows)

The fixsts scripts are mentioned in https://kb.vmware.com/s/article/76719 (which I personally wrote) for VCSA and https://kb.vmware.com/s/article/79263 for Windows.

The idea is the same for both, replacing the STS certificate with a new, valid one. This can be done proactively (cert has not expired yet) as well as reactively (cert has already expired and you’re in a production down scenario)

The steps for these two KBs are mentioned in the articles. They’re pretty much identical, with minor differences in running the commands due to the Guest OS, and super straightforward to run.

Once the STS is replaced, in case it was done proactively, you will be good to go!

YOU CAN STOP READING FROM THIS POINT ON – hope you liked this blog entry!

However, if this was done reactively, then it is likely that you will need to replace more certificates in your vCenter Server, especially if you were using VMCA certs (which could have the same expiration date as the STS certificate if they were never replaced)

Replacing other certificates

How do I know if which of my other certificates are expired?

On the KBs mentioned, there are two one-liners provided to check for certificates

  • Windows: $VCInstallHome = [System.Environment]::ExpandEnvironmentVariables("%VMWARE_CIS_HOME%");foreach ($STORE in & "$VCInstallHome\vmafdd\vecs-cli" store list){Write-host STORE: $STORE;& "$VCInstallHome\vmafdd\vecs-cli" entry list --store $STORE --text | findstr /C:"Alias" /C:"Not After"}

  • VCSA: for i in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list); do echo STORE $i; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $i --text | egrep "Alias|Not After"; done

These commands will show, for each of the VECS (VMware Endpoint Certificate Store) stores, the expiration date for all certificates. If the certificates have an expiration date prior to today, then they’re expired. Also, you will have issues with services if certificates are expired. Services such as vpxd-svcs, vpxd or vapi-endpoint will be pretty verbose with expiration date of certain certificates.

For example:

root@vcsa1 [ /tmp ]# for i in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list); do echo STORE $i; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $i --text | egrep "Alias|Not After"; done
Not After : Apr 6 11:57:19 2029 GMT
Alias : c96d3301505316ccc1b295276ece31318ad79ec7
Not After : Apr 6 11:57:19 2029 GMT
Alias : 8a11418d5ae2b87b7e8a5cb8646fbfae41503f9d
Not After : Dec 13 21:50:49 2029 GMT
Alias : cb5a495d34f3f2f75d357b47aac3799346665258
Not After : Sep 25 20:32:57 2022 GMT
Alias : 229a64a3dff7417d0b38fb011c692a55b7bee5c2
Not After : May 16 20:21:12 2030 GMT
Alias : 2f0e8e4f1658e61bef5004cb5efd159b90396838
Not After : May 16 20:45:07 2030 GMT
Alias : 4504400e4bcbdab5a34a9bc2555abd55327369c1
Alias : 31b2b5a18d89d90dadff901400a60d45ca3356e9
Alias : e7840a7cbbe7fcdd7a13d9159ff97443cc53fb5e
Alias : 985d7e55183635f13e2c6469eee9c72f68334615
STORE machine
Alias : machine
Not After : Apr 6 11:57:19 2029 GMT
STORE vsphere-webclient
Alias : vsphere-webclient
Not After : Apr 6 11:57:19 2029 GMT
STORE vpxd
Alias : vpxd
Not After : Apr 6 11:57:19 2029 GMT
STORE vpxd-extension
Alias : vpxd-extension
Not After : Apr 6 11:57:19 2029 GMT
STORE data-encipherment
Alias : data-encipherment
Not After : Apr 6 11:57:19 2029 GMT
Alias : sms_self_signed
Not After : Apr 12 12:04:48 2029 GMT

In this case, none of the certificates are expired. But if we had expired certificates we will need to replace them!

Let’s group them in three groups. All of them are replaced using the same tool, certificate-manager, detailed on KB https://kb.vmware.com/s/article/2097936, but the option you will use will depend on the scenario

  • Group 1: Machine SSL Certificate (Front facing certificate, on port 443)
    • If only Machine SSL is expired, you will run Option 3 (Replace the Machine SSL certificate with a VMCA Generated Certificate) of this KB, with the following caveats
      • The “comma separated list of hostnames” you will be prompt to complete, should contain the PNID of the node as well as any additional hostname or alias you might be using. How do we get the PNID for the node?
        • Windows: "%VMWARE_CIS_HOME%"\vmafdd\vmafd-cli get-pnid --server-name localhost
        • VCSA: /usr/lib/vmware-vmafd/bin/vmafd-cli get-pnid --server-name localhost
      • The value of “VMCA Name” should match the PNID obtained in the prior step
  • Group 2: Root certificate (VMCA root certificate)
    • If there is any certificate expired in the TRUSTED_ROOTS store, it will be safer to just run Option 8 (Reset all certificates) on the KB mentioned above. This will reset all certificates to VMCA signed. The same caveats mentioned for Option 3 apply
  • Group 3: Solution Users certificates(vpxd, vpxd-extension, machine, vsphere-webclient)
    • If there is any certificate expired in the stores vpxd, vpxd-extension, machine or vsphere-webclient, run Option 6 (Replace Solution User Certificates with VMCA generated Certificates) on the KB mentioned above. The same caveats mentioned for Option 3 apply

Once all this is done, you should be back up and running with regenerated certificates, and out of the production down scenario!

Closing note

This is a pretty concerning issue, so I’m really happy to have been part of the team to help fix so many environments across the globe.

Please, use this information to proactively check for the STS certificate, as well as replacing without having to get into a production down scenario. You can share this with customers, partners, or whoever you feel might be benefited from this information!