Using ABX to change DNS Servers for a vRA Deployment at provisioning time

Hello Everyone,

In today’s post, we will go through creating an ABX Action to change the DNS servers for a deployment in vRA 8. This can be needed in scenarios where, even though the network has DNS servers configured, a specific deployment requires different DNS servers while remaining on the same network – for example, to join a different AD domain.

The same idea can be used to edit other fields of the deployment, such as the IP Address, search domains, etc.

The post is divided into five sections:

  • Cloud Template
  • Event Topic
  • ABX Action
  • ABX Subscription
  • Test

Cloud Template

In the template we’re going to need two inputs: dnsServers (a comma-separated list of DNS servers) and an input to control the number of VMs in the deployment, which we can call ‘instances’.

  instances:
    type: integer
    title: Amount of VMs
    default: 1
    maximum: 5
  dnsServers:
    type: string
    description: Comma separated list of DNS Servers
    title: DNS Servers
    default: '1.1.1.1,8.8.8.8'

These two values will be custom properties on the VM Object

properties:
   count: '${input.instances}'
   dnsServers: '${input.dnsServers}'

In addition to this, the network assignment for the VM resource should be set to ‘static’. A customization specification profile is optional, since using a ‘static’ assignment will auto-generate a ‘ghost’ customization specification profile at provisioning time.

networks:
    - network: '${resource.VMNetwork1.id}'
      assignment: static

Event Topic

The event topic we need to use to make changes to the network configuration is the one where the object we need to edit is presented in an editable state – as in, not read-only.

For this specific use case, that topic is Network Configure.

Pay special attention to the ‘Fired once for a cluster of machines’ part

The dnsServers object is a 3D Array, so that is what we need to use in the ABX Action Code

From this we learn that:

  • The action will run once for a cluster of machines, so a multi-VM deployment has to be taken into account – otherwise only a single VM, rather than all of the VMs in the deployment, would get the new DNS servers
  • A 3D array needs to be used to insert the DNS servers into the object at the event topic (see the sketch below)
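
To make that shape concrete, here is a minimal sketch (plain Python, with hypothetical values) of how the payload property is laid out. My reading of the schema is that the dimensions are, from the outside in: machine, NIC, DNS server.

# Hypothetical dnsServers value for a 2-VM deployment with one NIC each,
# indexed as [machine][nic][server]
dnsServers = [
    [["1.1.1.1", "8.8.8.8"]],  # machine 0, NIC 0
    [["1.1.1.1", "8.8.8.8"]],  # machine 1, NIC 0
]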

ABX Action

For this example I will use Python, without any external libraries for array handling (such as numpy), since I wanted to see if it could be done natively. Python’s native support for lists is much better than its support for arrays, but given the schema of the object in the event topic, we’re forced to produce a 3D array.

The first thing we need to do when creating the action is to add the needed inputs. In this one, I will add the custom properties of the resources as an input.

Adding the custom properties

Once we have that as an input, we can use it to get the data we need (amount of instances and DNS servers)

To pass data back to the provisioning state, we will use and return the ‘outputs’ object

This is the code of the action itself – I will explain it below:

def handler(context, inputs):
    outputs = {}
    # 3D array with the shape the Network Configure event topic expects
    dnsServers = [[[]]]
    instances = inputs["customProperties"]["count"]
    # Split the comma-separated custom property; drop empty entries so a blank
    # input form field does not wipe the DNS servers inherited from the network
    inputDnsServers = [s for s in inputs["customProperties"]["dnsServers"].split(",") if s]
    if len(inputDnsServers) > 0:
        outputs["dnsServers"] = dnsServers
        # First VM: place the DNS list in the innermost dimension
        outputs["dnsServers"][0][0] = inputDnsServers
        # Remaining VMs: concatenate one [[...]] entry per additional instance
        for i in range(1, int(instances)):
            outputs["dnsServers"] = outputs["dnsServers"] + [[inputDnsServers]]
    return outputs
  • Define the outputs object
  • Define the 3D array for the DNS servers
  • Assign the inputs to variables in the script
  • Convert the comma-separated string into a list, dropping empty entries
  • If the list is not empty (meaning the user actually entered a value in the DNS Servers field on the input form), add the 3D array to the outputs object.
    • Why check whether it is empty? Because if the user left the field blank, we would overwrite the output with an empty array, and that would wipe the DNS servers that were read from the network in vRA. We only want to overwrite them if we’re actually changing the DNS servers.
  • In the same condition, we add the DNS servers array for each VM, iterating over the number of instances.
    • Doing this without numpy is not elegant, but plain list concatenation does the trick: we initialize the first element and then concatenate one more entry in the same format for each additional VM.
  • Return the outputs object

This can also be done in JavaScript or PowerShell; the idea would be the same.

So what does this object look like in an actual run?

In this example, I changed the DNS for 3 VMs – You can see that we’re using the 3D Array Structure
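
For reference, this is roughly what the returned outputs object looks like for such a run (a sketch with hypothetical values, not a capture of the actual output):

# Hypothetical 'outputs' returned for a 3-VM deployment
outputs = {
    "dnsServers": [
        [["1.2.3.4", "5.6.7.8"]],  # VM 1
        [["1.2.3.4", "5.6.7.8"]],  # VM 2
        [["1.2.3.4", "5.6.7.8"]],  # VM 3
    ]
}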

Lastly, we need to configure a subscription for it to run at this specific state.

ABX Subscription

This is the most straightforward part – we create a blocking subscription on the Network Configure topic and add the action we just created.

The ABX subscription can be filtered via a condition (for example, to run only on specific cloud templates) as well.

So let’s do our deployment now!

The network I’m going to select has the 8.8.8.8 DNS server configured.

This will be overwritten by whatever we put on the input form. I’m going to use 1.2.3.4 and 5.6.7.8 for this example, and there will be 2 VMs in the deployment

We can check the output of the action before the deployment finishes

Action run started by an Event (Subscription)
Action output

In there we can see the actual code that ran, whether it was successful, the payload, and the output the action produced. In this case: our two DNS servers for our two VMs, with a successful result.

Checking the DNS for one of the VMs, we can see the two DNS Servers we selected as inputs!

Success!!!

Summary

I hope you found this post useful! The same idea can be used to change several other properties at provisioning time. It was also a good learning experience in handling arrays natively in Python and in interacting with ABX properly.

More posts will be coming soon!

If you liked this one, please share it and/or leave a comment!

Configuring a Dynamic Multi-NIC Cloud Template in vRA 8.x

Hello Everyone,

In today’s post, I will focus on a Dynamic Multi-NIC configuration for a Cloud Template in vRA 8.x.

This allows customers to reuse the same cloud template for virtual machines with different numbers of NICs, with the number defined at request time. Without a dynamic configuration, a cloud template with three networks will always need three networks configured at request time, which might not be what the application requires.

Using a Dynamic construct allows for less cloud template sprawl, since multiple application configurations can use the same cloud template.

Since this configuration is not trivial, this post will be a step by step guide on how to achieve this result.

Current Environment

For this lab demonstration, we will use a vSphere Cloud Account and 4 NSX-T segments that are part of a Network Profile with a capability tag named “env:production” – by using that constraint tag in the cloud template, we can guarantee our deployment will use that specific network profile.

Each of the 4 NSX-T segments also has a single tag that identifies the type of network it carries. In this scenario, Application, Frontend, Database and Backup are our 4 networks.

NSX-T Segments tagged and defined in the network profile
‘env:production’ tag in the network profile

Creating the Cloud Template

To get the Dynamic Multi-NIC configuration on the Cloud Template to work, we need the following things:

  • Inputs for Network to NIC mapping based on tagging
  • Inputs for NIC existence
  • Network Resources
  • VM Resource and Network Resource assignment

In addition to this, we can customize the request form in Service Broker to change the visibility of the fields. This is done so the requester only chooses a network mapping for a NIC that will actually be used.

Inputs for Network to NIC mapping based on tagging

This cloud template will allow configurations of up to 4 NICs, and since we have 4 networks, we should let the requester select, for each NIC, which network it attaches to.

This is what it looks like

  Network1:
    type: string
    description: Select Network to Attach to
    default: 'net:application'
    title: Network 1
    oneOf:
      - title: Application Network
        const: 'net:application'
      - title: Frontend Network
        const: 'net:frontend'
      - title: Database Network
        const: 'net:database'
      - title: Backup Network
        const: 'net:backup'
  Network2:
    type: string
    description: Select Network to Attach to
    default: 'net:frontend'
    title: Network 2
    oneOf:
      - title: Application Network
        const: 'net:application'
      - title: Frontend Network
        const: 'net:frontend'
      - title: Database Network
        const: 'net:database'
      - title: Backup Network
        const: 'net:backup'
  Network3:
    type: string
    description: Select Network to Attach to
    default: 'net:database'
    title: Network 3
    oneOf:
      - title: Application Network
        const: 'net:application'
      - title: Frontend Network
        const: 'net:frontend'
      - title: Database Network
        const: 'net:database'
      - title: Backup Network
        const: 'net:backup'
  Network4:
    type: string
    description: Select Network to Attach to
    default: 'net:backup'
    title: Network 4
    oneOf:
      - title: Application Network
        const: 'net:application'
      - title: Frontend Network
        const: 'net:frontend'
      - title: Database Network
        const: 'net:database'
      - title: Backup Network
        const: 'net:backup'

We can see that each of the inputs allows for any of the networks to be selected.

Inputs for NIC Existence

Other than the first NIC (which should always exist – otherwise our VM(s) wouldn’t have any network connectivity), we want to be able to deploy VMs with 1, 2, 3 or 4 NICs using the same Cloud Template.

To achieve that, we will create 3 Boolean Inputs that will define if a NIC should be added or not.

  needNIC2:
    type: boolean
    title: Add 2nd NIC?
    default: false
  needNIC3:
    type: boolean
    title: Add 3rd NIC?
    default: false
  needNIC4:
    type: boolean
    title: Add 4th NIC?
    default: false

Network Resources

To manage the configuration of the NICs and networks, the network resources for NICs 2, 3 and 4 will use a count property whose value (0 if the NIC shouldn’t exist, 1 if it should) is derived from the boolean inputs. Network1 keeps a fixed count of 1.

We will also use the deviceIndex property to keep the NIC numbering consistent – so the network resources look like this:

  Network1:
    type: Cloud.vSphere.Network
    count: 1
    properties:
      networkType: existing
      deviceIndex: 0
      constraints:
        - tag: '${input.Network1}'
        - tag: 'env:production'
  Network2:
    type: Cloud.vSphere.Network
    properties:
      networkType: existing
      count: '${input.needNIC2 == true ? 1 : 0}'
      deviceIndex: 1
      constraints:
        - tag: '${input.Network2}'
        - tag: 'env:production'
  Network3:
    type: Cloud.vSphere.Network
    properties:
      networkType: existing
      count: '${input.needNIC3 == true ? 1 : 0}'
      deviceIndex: 2
      constraints:
        - tag: '${input.Network3}'
        - tag: 'env:production'
  Network4:
    type: Cloud.vSphere.Network
    properties:
      networkType: existing
      count: '${input.needNIC4 == true ? 1 : 0}'
      deviceIndex: 3
      constraints:
        - tag: '${input.Network4}'
        - tag: 'env:production'

The constraint tags that are used are the Network Input (to choose a network) and the ‘env:production’ tag to make our deployment use the Network Profile we defined earlier.

VM Resource & Network Resource Assignment

This is the tricky part – since our networks could be non-existent (if the corresponding needNIC input is not selected), we cannot use the regular syntax to add a network, which would be something like:

networks:
        - network: '${resource.Network1.id}'
          assignment: static
          deviceIndex: 0
        - network: '${resource.Network2.id}'
          assignment: static
          deviceIndex: 1
      ...

This would fail Cloud Template validation, because the count for Network2 could be zero; to do the resource assignment, we need to use the map_by syntax.

Several other examples can be seen on the following link: https://docs.vmware.com/en/vRealize-Automation/8.5/Using-and-Managing-Cloud-Assembly/GUID-12F0BC64-6391-4E5F-AA48-C5959024F3EB.html

The VM resource uses a simple Ubuntu image with a Small flavor, so here is what it looks like once the map_by syntax is used for the assignment:

  Cloud_vSphere_Machine_1:
    type: Cloud.vSphere.Machine
    properties:
      image: Ubuntu-TMPL
      flavor: Small
      customizationSpec: Linux
      networks: '${map_by(resource.Network1[*] + resource.Network2[*] + resource.Network3[*] + resource.Network4[*], r => {"network":r.id, "assignment":"static", "deviceIndex":r.deviceIndex})}'
      constraints:
        - tag: 'env:production'

This allows for any combination of NICs from 1 to 4; if the count of one of the resources is 0, that resource simply isn’t picked up by the assignment expression.
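
To illustrate, here is a sketch (with hypothetical IDs, shown as a Python literal) of what the map_by expression would evaluate to when only needNIC3 is selected in addition to the always-present first NIC:

# Hypothetical evaluation result with needNIC2 = false, needNIC3 = true, needNIC4 = false
[
    {"network": "<id-of-Network1>", "assignment": "static", "deviceIndex": 0},
    {"network": "<id-of-Network3>", "assignment": "static", "deviceIndex": 2},
]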

This is what the Cloud Template looks like on the canvas. You can see that Networks 2, 3 and 4 appear as potentially having multiple instances – this is because we’re using the count property.

Canvas view of the Cloud Template

If we were to deploy this Cloud Template as-is, the request form would look like this:

Doesn’t make much sense to select networks that we won’t assign, right?

How do we fix this? We can leverage Service Broker to manage the visibility of the fields based on the boolean input!

Using the inputs as conditional value for the visibility of the network field

So now, from Service Broker, it looks like this:

No extra NICs selected
NICs 2 and 3 selected

So if we deploy this, it should have three networks assigned: the first NIC should use the Application network, the second the Frontend network, and the third the Database network.

Let’s test it!

TA-DA!

We can see that even though the Cloud Template has 4 network resources, only 3 were instantiated for this deployment! And each network was mapped to a specific NSX-T segment, thanks to the constraint tags.

Closing Note

I hope this blog post was useful – The same assignment method can be used for other resources such as Disks or Volumes – the principle is still the same.

Feel free to share this if you found it useful, and leave your feedback in the comments.

Until the next time!

Updating an Onboarded Deployment in vRA 8.x

Hello Everyone!

In today’s post, we will go through the process of updating an onboarded deployment in vRA 8.x.

The onboarding feature allows customers to bring VMs that were not deployed from vRA into the vRA infrastructure. The VMs are added to one or more deployments, and once they exist within vRA, operations such as power cycling, opening a remote console, or resizing CPU/RAM become available.

However, there are scenarios in which customers want to expand these deployments, not with more onboarded VMs, but with newly deployed VMs (or other resources) from vRA! These new resources will use an image and a flavor, and could use a multitude of inputs, tags, networks, etc. So how do we do this?

Onboarding the VMs using an auto-generated Cloud Assembly Template

The first thing we need to do is create an onboarding plan, select a name for our deployment, and select the VMs we’re going to onboard initially.

Creating the Onboarding Plan
Adding two VMs to be onboarded

On the Deployments tab, we can rename the deployment if needed, but the most important part is to select Cloud Template Configuration and change it to Create Cloud Template in Cloud Assembly Format. This gives our deployment a source template that we can edit afterwards to allow for future growth.

Cloud Template in Cloud Assembly format

It is important to note that the imageRef has no image available. Since this is an onboarding rather than a vRA deployment, none of the resources were deployed from an image. We will come back to this item later.

After saving this configuration and clicking on Run, our deployment will be onboarded

Updating the onboarded deployment to add a new VM in a specific network

If we check on the onboarded deployment, we will see that it is mapped to a specific Cloud Template (the one that was auto-generated earlier by the Onboarding Plan)

So if we want to update this deployment, we need to edit that Cloud Template.

I will now add a vSphere Machine resource as well as a vSphere Network:

inputs: {}
resources:
  Cloud_vSphere_Machine_1:
    type: Cloud.vSphere.Machine
    properties:
      image: 'ubuntu'
      cpuCount: 1
      totalMemoryMB: 1024
      networks:
        - network: '${resource.Cloud_vSphere_Network_1.id}'
  Cloud_vSphere_Network_1:
    type: Cloud.vSphere.Network
    properties:
      networkType: existing
      constraints: 
        - tag: env:vsphere  
  DevTools-02a:
    type: Cloud.vSphere.Machine
    properties:
      imageRef: no_image_available
      cpuCount: 1
      totalMemoryMB: 4096
  DevTools-01a:
    type: Cloud.vSphere.Machine
    properties:
      imageRef: no_image_available
      cpuCount: 1
      totalMemoryMB: 4096
  

This is what our template looks like now. So the next thing we should do is click on Update, right?

Update is Greyed out!

The Update action is greyed out because the Cloud Template does not have inputs. Since we don’t have inputs, what we need to do is go to the Cloud Template and, instead of selecting Create a New Deployment, select Update an Existing Deployment, then click on the onboarded deployment.

Updating the Onboarded Deployment

After clicking on Next, the plan is presented.

Notice something wrong here?

The update operation will attempt to re-create the onboarded VMs! That’s not something we want – and in this scenario it would fail anyway, since there is no image mapping to deploy from!

What we want is to leave all the VMs that were previously onboarded, untouched, and only add our new VM and network. So how do we achieve this?

This is achieved by adding the ignoreChanges property with a value of true to every resource in the cloud template that was previously onboarded – in this scenario, our 2 DevTools VMs.

Adding the ignoreChanges parameter

If we retry updating the deployment now, the only tasks that appear should be the ones for the new resources (VM and network).

Update deployment showing the new tasks

After clicking on ‘deploy’ and waiting for it to finish, our deployment will now look like this:

Deployment updated with our new VM and network! Hooray!

Offboarding/Unregistering limitations

It is important to note that vRA’s limitations for unregistering VMs still apply: the only VMs that can be unregistered from vRA are the ones that were previously onboarded. VMs that were deployed from vRA cannot be unregistered without deletion, and the fact that they are part of an onboarded deployment does not change this.

Closing Note

I hope you enjoyed this post! When I started working on this use case I realized it was not as trivial as I had thought, and after some research and testing I arrived at this walkthrough/solution.

Let me know if this was useful in the comments!

Until next time!

Deploying a non-standard VCF 4.2 Workload Domain via API!


Hello Everyone!

In today’s post, as a continuation of the previous one (in which we talked about the VCF management domain), I will give a step-by-step guide to a complete deployment of a VCF workload domain – subject to some specific constraints from a project I was working on – using VCF’s API!

What’s this non-standard architecture like?

In this specific environment, I had to work around the following constraints:

  • 4 hosts with 256GB of RAM using vSAN, check the previous post for information about the MGMT domain!
  • 3 Hosts with 256GB of RAM, using vSAN
  • 3 Hosts with 1.5TB of RAM, using FC SAN storage
  • Hosts using 4×10Gb NICs
  • NIC numbering not being consistent (some hosts had 0,1,2,3 – others had 4,5,6,7 – even though this can be changed by editing files on the ESXi hosts, it is still a constraint, and one that can be worked around using the API)

With this information, the decision was to:

  • Separate the Workload Domain into 2 clusters – one for NSX-T Edges and the other for Compute workloads – since, given the discrepancies in RAM and storage configuration, they could never be part of the same logical cluster.

This looks something like…

It is impossible to deploy this using the GUI, due to the following:

  • Can’t utilize 4 Physical NICs for a Workload Domain
  • Can’t change NIC numbering or NIC-to-DVS uplink mapping

So we have to do this deployment using the API! Let’s go!

Where do we start?

First of all, VCF’s API documentation is public, and this is the link to it: https://code.vmware.com/apis/1077/vmware-cloud-foundation – I will refer to this documentation A LOT over the course of this blog post.

All the API calls require a token, which is generated with the following request (example taken from the documentation):

cURL Request

$ curl 'https://sfo-vcf01.rainpole.io/v1/tokens' -i -X POST \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -d '{
  "username" : "administrator@vsphere.local",
  "password" : "VMware123!"
}'

Once we have the token, we can use it in other API calls until it expires; then we either refresh it or create a new one. All API calls made to SDDC Manager (as opposed to internal API calls) require a bearer token.
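
For those scripting this, here is a minimal Python sketch of the same token request using the requests library (hostname and credentials are the placeholders from the documentation example; certificate verification is disabled on the assumption of a self-signed lab certificate):

import requests

SDDC_MANAGER = "https://sfo-vcf01.rainpole.io"

# Request a token pair from SDDC Manager
resp = requests.post(
    f"{SDDC_MANAGER}/v1/tokens",
    json={"username": "administrator@vsphere.local", "password": "VMware123!"},
    verify=False,
)
resp.raise_for_status()

# Reuse this header on every subsequent SDDC Manager API call
headers = {"Authorization": f"Bearer {resp.json()['accessToken']}"}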

List of steps to create a workload domain

  • Commission all hosts from SDDC Manager and create network profiles appropriately to match the external storage selection – in this scenario, we will have one network profile for the vSAN-based hosts and another for the FC SAN-based hosts. Hosts can also be commissioned via API calls (3.65 in the API reference) instead of via the GUI, but the constraints I had did not prevent me from doing it via the GUI.
  • Get the IDs of all the commissioned hosts – the API call is “2.7.2 Get the Hosts”, a GET call to https://sddc_manager_url/v1/hosts using bearer token authentication
  • Create the Workload Domain with a single cluster (Compute) – The API Call is “2.9.1 Create a Domain”
  • Add the Secondary Cluster (Edge) to the newly-created workload domain – The API Call is “2.10.1 Create a Cluster”
  • Create the NSX-T Edge Cluster on top of the Edge Cluster – The API Call is “2.37.3 – Create Edge Cluster”

For each of these tasks, we should first validate our JSON body before executing the API call. We will discuss this further.

You might ask: why not create a workload domain with two clusters at once, instead of first creating it with a single cluster and then adding the second one?

This is something I hit during the implementation – if we check the clusters object in the API, we can see it is an array, so it should accept multiple cluster values.

"computeSpec": { "clusterSpecs": [

The info on the API call also points to the fact that we should be able to create multiple clusters on the “Create Domain” call.

Even worse, the validation API will validate an API call with multiple clusters

However, I came to learn – after trying multiple times and contacting the VCF engineering team – that this is not the case.

For example, if our body looked something like this (with two clusters), the validation API will work!

"computeSpec": {
      "clusterSpecs": [
        {
          "name": "vsphere-w01-cl-01",
          "hostSpecs": [
            {
              "id": "b818ba18-2960-49ce-a876-ed4e0c07a936",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic0",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic1",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic2",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  },
                  {
                    "id": "vmnic3",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  }
                ]
              }
            },
            {
              "id": "bd152a18-7b31-4cd4-a352-b94a7119bb33",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic0",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic1",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic2",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  },
                  {
                    "id": "vmnic3",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  }
                ]
              }
            },
            {
              "id": "18409da3-fbae-47b2-800f-67d032fe21a0",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic0",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic1",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic2",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  },
                  {
                    "id": "vmnic3",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  }
                ]
              }
            }
          ],
          "datastoreSpec": {
            "vmfsDatastoreSpec" : {
              "fcSpec" : [ {
              "datastoreName" : "vsphere-m01-fc-datastore1"
             } ]
             }
          },
          "networkSpec": {
            "vdsSpecs": [
              {
                "name": "vsphere-w01-cl01-vds01",
                "portGroupSpecs": [
                  {
                    "name": "vsphere-w01-cl01-vds-pg-mgmt",
                    "transportType": "MANAGEMENT"
                  },
                  {
                    "name": "vsphere-w01-cl01-vds-pg-vmotion",
                    "transportType": "VMOTION"
                  }
                ]
              },
              {
                "name": "vsphere-w01-cl01-vds02",
                "isUsedByNsxt": true
              }
            ],
            "nsxClusterSpec" : {
            "nsxTClusterSpec" : {
              "geneveVlanId" : 1214,
              "ipAddressPoolSpec" : {
                "name" : "vsphere-w01-np01",
                "subnets" : [ {
                "ipAddressPoolRanges" : [ {
                  "start" : "172.22.14.100",
                  "end" : "172.22.14.200"
                } 
              ],
                "cidr" : "172.22.14.0/24",
                "gateway" : "172.22.14.254"
                } ]
               }
             }
            }
          }
        },
          {
          "name": "vsphere-w01-cl-edge-01",
          "hostSpecs": [
            {
              "id": "aa699b0d-015f-43e9-83ea-6e941b37e642",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic4",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic5",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic6",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  },
                  {
                    "id": "vmnic7",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  }
                ]
              }
            },
            {
              "id": "1e500b1b-fd33-425c-8c6d-42840cf658db",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic4",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic5",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic6",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  },
                  {
                    "id": "vmnic7",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  }
                ]
              }
            },
            {
              "id": "e138d6a1-6c55-4326-ac6c-ffc0239e15b5",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic4",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic5",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic6",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  },
                  {
                    "id": "vmnic7",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  }
                ]
              }
            }
          ],
          "datastoreSpec": {
            "vsanDatastoreSpec": {
              "failuresToTolerate": 1,
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "datastoreName": "vsphere-w01-ds-vsan-01"
            }
          },
          "networkSpec": {
            "vdsSpecs": [
              {
                "name": "vsphere-w01-cl-edge-01-vds01",
                "portGroupSpecs": [
                  {
                    "name": "vsphere-w01-cl-edge-01-pg-mgmt",
                    "transportType": "MANAGEMENT"
                  },
                  {
                    "name": "vsphere-w01-cl-edge-01-pg-vsan",
                    "transportType": "VSAN"
                  },
                  {
                    "name": "vsphere-w01-cl-edge-01-pg-vmotion",
                    "transportType": "VMOTION"
                  }
                ]
              },
              {
                "name": "vsphere-w01-cl-edge-01-vds02",
                "isUsedByNsxt": true
              }
            ],
            "nsxClusterSpec" : {
                "nsxTClusterSpec" : {
                  "geneveVlanId" : 1214,
                  "ipAddressPoolSpec" : {
                      "name" : "vsphere-w01-np02",
                      "subnets" : [ {
                        "ipAddressPoolRanges" : [ {
                          "start" : "172.22.14.210",
                          "end" : "172.22.14.230"
                        } 
                      ],
                        "cidr" : "172.22.14.0/24",
                        "gateway" : "172.22.14.254"
                        } ]
                    }
                      
                }
            }
           }
        }
      ]
    },

However, when we go ahead and try to create it, it will fail, and we will see the following error in the logs:

ERROR [vcf_dm,02a04e83325703b0,7dc4] [c.v.v.v.c.v1.DomainController,http-nio-127.0.0.1-7200-exec-6]  Failed to create domain
com.vmware.evo.sddc.common.services.error.SddcManagerServicesIsException: Found multiple clusters for add vi domain.
at com.vmware.evo.sddc.common.services.adapters.workflow.options.WorkflowOptionsAdapterImpl.getWorkflowOptionsForAddDomainWithNsxt(WorkflowOptionsAdapterImpl.java:1222)

So, as mentioned earlier, we need to first create our domain (with a single cluster), and then add the 2nd cluster!

1: Create a Workload Domain with a Single Cluster

We will first create our workload domain with the compute cluster, which in this scenario uses external storage and the secondary distributed switch for overlay traffic.

This is my API call body, based on the API reference, to create a workload domain with a single cluster of 3 hosts, using two VDS, 4 physical NICs numbered 0 to 3, and external FC storage, with the host IDs I obtained in the previous step.

{
    "domainName": "vsphere-w01",
    "orgName": "vsphere.local",
    "vcenterSpec": {
      "name": "vsphere-w01-vc01",
      "networkDetailsSpec": {
        "ipAddress": "172.22.11.64",
        "dnsName": "vsphere-w01-vc01.vsphere.local",
        "gateway": "172.22.11.254",
        "subnetMask": "255.255.255.0"
      },
      "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
      "rootPassword": "VMware1!",
      "datacenterName": "vsphere-w01-dc-01"
    },
    "computeSpec": {
      "clusterSpecs": [
        {
          "name": "vsphere-w01-cl-01",
          "hostSpecs": [
            {
              "id": "b818ba18-2960-49ce-a876-ed4e0c07a936",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic0",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic1",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic2",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  },
                  {
                    "id": "vmnic3",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  }
                ]
              }
            },
            {
              "id": "bd152a18-7b31-4cd4-a352-b94a7119bb33",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic0",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic1",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic2",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  },
                  {
                    "id": "vmnic3",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  }
                ]
              }
            },
            {
              "id": "18409da3-fbae-47b2-800f-67d032fe21a0",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic0",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic1",
                    "vdsName": "vsphere-w01-cl01-vds01"
                  },
                  {
                    "id": "vmnic2",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  },
                  {
                    "id": "vmnic3",
                    "vdsName": "vsphere-w01-cl01-vds02"
                  }
                ]
              }
            }
          ],
          "datastoreSpec": {
            "vmfsDatastoreSpec" : {
              "fcSpec" : [ {
              "datastoreName" : "vsphere-m01-fc-datastore1"
             } ]
             }
          },
          "networkSpec": {
            "vdsSpecs": [
              {
                "name": "vsphere-w01-cl01-vds01",
                "portGroupSpecs": [
                  {
                    "name": "vsphere-w01-cl01-vds-pg-mgmt",
                    "transportType": "MANAGEMENT"
                  },
                  {
                    "name": "vsphere-w01-cl01-vds-pg-vmotion",
                    "transportType": "VMOTION"
                  }
                ]
              },
              {
                "name": "vsphere-w01-cl01-vds02",
                "isUsedByNsxt": true
              }
            ],
            "nsxClusterSpec" : {
            "nsxTClusterSpec" : {
              "geneveVlanId" : 1214,
              "ipAddressPoolSpec" : {
                "name" : "vsphere-w01-np01",
                "subnets" : [ {
                "ipAddressPoolRanges" : [ {
                  "start" : "172.22.14.100",
                  "end" : "172.22.14.200"
                } 
              ],
                "cidr" : "172.22.14.0/24",
                "gateway" : "172.22.14.254"
                } ]
               }
             }
            }
          }
        }
      ]
    },
    "nsxTSpec": {
      "nsxManagerSpecs": [
        {
          "name": "vsphere-w01-nsx01a",
          "networkDetailsSpec": {
            "ipAddress": "172.22.11.76",
            "dnsName": "vsphere-w01-nsx01a.vsphere.local",
            "gateway": "172.22.11.254",
            "subnetMask": "255.255.255.0"
          }
        },
        {
          "name": "vsphere-w01-nsx01b",
          "networkDetailsSpec": {
            "ipAddress": "172.22.11.77",
            "dnsName": "vsphere-w01-nsx01b.vsphere.local",
            "gateway": "172.22.11.254",
            "subnetMask": "255.255.255.0"}
        },
        {
          "name": "vsphere-w01-nsx01c",
          "networkDetailsSpec": {
            "ipAddress": "172.22.11.78",
            "dnsName": "vsphere-w01-nsx01c.vsphere.local",
            "gateway": "172.22.11.254",
            "subnetMask": "255.255.255.0"}
        }
      ],
      "vip": "172.22.11.75",
      "vipFqdn": "vsphere-w01-nsx01.vsphere.local",
      "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
      "nsxManagerAdminPassword": "VMware1!VMware1!"
    }
  }

Important!

  • The DVS that is going to be used for overlay traffic must have the isUsedByNsxt flag set to true. In a 4-NIC / 2-VDS deployment such as this one, that DVS shouldn’t carry any of the management, vMotion or vSAN traffic.

With the body ready, we execute the VALIDATE and then the EXECUTE API calls. What follows is a high-level overview, since we can use any REST API tool (Postman, curl, Invoke-RestMethod, or any language wrapper that can execute REST calls).

The list of steps is the same for all the POST API calls – only the URL changes to match each specific call.
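
As a rough Python sketch of that validate-then-execute pattern (reusing SDDC_MANAGER and headers from the token example above, and assuming the JSON body shown earlier has been saved to a local file – depending on the version you may also need to poll the validation task until its executionStatus is COMPLETED):

import json
import requests

# Hypothetical local file holding the JSON body shown above
with open("domain-body.json") as f:
    body = json.load(f)

# 1. Validate the spec first
val = requests.post(f"{SDDC_MANAGER}/v1/domains/validations",
                    headers=headers, json=body, verify=False)
print(val.json().get("resultStatus"))  # iterate on the body until SUCCEEDED

# 2. The same call without /validations actually creates the domain
resp = requests.post(f"{SDDC_MANAGER}/v1/domains",
                     headers=headers, json=body, verify=False)
print(resp.status_code)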

If the validation is successful, we will get a message similar to:

 "description": "Validating Domain Creation Spec",
    "executionStatus": "COMPLETED",
    "resultStatus": "SUCCEEDED",
    "validationChecks": [
        {
            "description": "DomainCreationSpecValidation",
            "resultStatus": "SUCCEEDED"
        }

We should keep editing and retrying in case of errors until the validation passes – do not attempt to execute the API call without validating it first!

Once the validation has passed, we can follow the same steps that are mentioned above but instead of making a POST call to https://sddc_manager_fqdn/v1/domains/validations, we remove the “validations” part, so it would be a call to https://sddc_manager_fqdn/v1/domains.

The deployment will start, and after a couple of minutes we will see in the SDDC Manager console that it was successful.

If it fails for whatever reason, we can troubleshoot by checking where it failed in the SDDC Manager console and in the logs – but as long as the validation passes, the problem should not be the body we’re sending.

2: Adding a 2nd Cluster to the existing workload domain

To add a cluster to an existing domain, the first thing we need is the ID of the domain, which can easily be obtained with a GET call to https://sddc_manager_url/v1/domains – we select the ID of the workload domain we just created.
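
If you’re scripting it, the lookup is a one-liner; here is a sketch reusing SDDC_MANAGER and headers from earlier (the "elements" wrapper in the response is my reading of the API reference):

import requests

# GET /v1/domains and pick the ID of the domain created in step 1
domains = requests.get(f"{SDDC_MANAGER}/v1/domains", headers=headers, verify=False).json()
domain_id = next(d["id"] for d in domains["elements"] if d["name"] == "vsphere-w01")

The same approach works later for GET calls to /v1/clusters, when we need the ID of the new cluster for the edge cluster step.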

Once we get the ID, this is the body (following the API reference) to add a new cluster to an existing domain.

{
    "domainId": "58a6cdcb-f609-49dd-9729-7e27d65440c6",
    "computeSpec": {
      "clusterSpecs": [
          {
          "name": "vsphere-w01-cl-edge-01",
          "hostSpecs": [
            {
              "id": "aa699b0d-015f-43e9-83ea-6e941b37e642",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic4",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic5",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic6",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  },
                  {
                    "id": "vmnic7",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  }
                ]
              }
            },
            {
              "id": "1e500b1b-fd33-425c-8c6d-42840cf658db",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic4",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic5",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic6",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  },
                  {
                    "id": "vmnic7",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  }
                ]
              }
            },
            {
              "id": "e138d6a1-6c55-4326-ac6c-ffc0239e15b5",
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "hostNetworkSpec": {
                "vmNics": [
                  {
                    "id": "vmnic4",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic5",
                    "vdsName": "vsphere-w01-cl-edge-01-vds01"
                  },
                  {
                    "id": "vmnic6",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  },
                  {
                    "id": "vmnic7",
                    "vdsName": "vsphere-w01-cl-edge-01-vds02"
                  }
                ]
              }
            }
          ],
          "datastoreSpec": {
            "vsanDatastoreSpec": {
              "failuresToTolerate": 1,
              "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
              "datastoreName": "vsphere-w01-ds-vsan-01"
            }
          },
          "networkSpec": {
            "vdsSpecs": [
              {
                "name": "vsphere-w01-cl-edge-01-vds01",
                "portGroupSpecs": [
                  {
                    "name": "vsphere-w01-cl-edge-01-pg-mgmt",
                    "transportType": "MANAGEMENT"
                  },
                  {
                    "name": "vsphere-w01-cl-edge-01-pg-vsan",
                    "transportType": "VSAN"
                  },
                  {
                    "name": "vsphere-w01-cl-edge-01-pg-vmotion",
                    "transportType": "VMOTION"
                  }
                ]
              },
              {
                "name": "vsphere-w01-cl-edge-01-vds02",
                "isUsedByNsxt": true
              }
            ],
            "nsxClusterSpec" : {
                "nsxTClusterSpec" : {
                  "geneveVlanId" : 1214,
                  "ipAddressPoolSpec" : {
                      "name" : "vsphere-w01-np02",
                      "subnets" : [ {
                        "ipAddressPoolRanges" : [ {
                          "start" : "172.22.14.210",
                          "end" : "172.22.14.240"
                        } 
                      ],
                        "cidr" : "172.22.14.0/24",
                        "gateway" : "172.22.14.254"
                        } ]
                    }
                      
                }
            }
           }
        }
      ]
    }
  }

Even though we don’t need the cluster to be prepared for NSX-T (since it will only be used for Edges), setting the isUsedByNsxt flag to true makes the secondary VDS carry the uplink port groups once we create a T0 – which is what we want in this scenario; otherwise, we would not be using the 3rd and 4th NICs at all.

As discussed earlier, we should run the validation POST call first – in this case to https://sddc_manager_fqdn/v1/clusters/validations – and once the body is validated, proceed with the creation by removing “validations” from the URL.

Last but not least, we need to create our NSX-T Edge Cluster on top of the 2nd cluster in the domain!

3: Create NSX-T Edge Cluster

The last piece of the puzzle is creating the NSX-T Edge Cluster, which allows this workload domain to leverage overlay networks and communicate with the physical world.

To create the NSX-T Edge Cluster, we first need to get the Cluster ID of the cluster we just created (how many times can you say cluster in the same sentence?)

Following the API reference, ‘Get Clusters’ is a GET call to https://sddc_manager_fqdn/v1/clusters

Now that we have the ID, this is the body to create two Edge Nodes; configure the management, TEP and uplink interfaces; and configure a T0 and a T1 instance, including BGP peering on the T0 instance!

{
    "edgeClusterName" : "vsphere-w01-ec01",
    "edgeClusterType" : "NSX-T",
    "edgeRootPassword" : "VMware1!VMware1!",
    "edgeAdminPassword" : "VMware1!VMware1!",
    "edgeAuditPassword" : "VMware1!VMware1!",
    "edgeFormFactor" : "LARGE",
    "tier0ServicesHighAvailability" : "ACTIVE_ACTIVE",
    "mtu" : 9000,
    "asn" : 65212,
    "edgeNodeSpecs" : [ {
      "edgeNodeName" : "vsphere-w01-en01.vsphere.local",
      "managementIP" : "172.22.11.71/24",
      "managementGateway" : "172.22.11.254",
      "edgeTepGateway" : "172.22.17.254",
      "edgeTep1IP" : "172.22.17.12/24",
      "edgeTep2IP" : "172.22.17.13/24",
      "edgeTepVlan" : 1217,
      "clusterId" : "37c83ee6-2338-40b0-9470-bb6d47922601",
      "interRackCluster" : false,
      "uplinkNetwork" : [ {
        "uplinkVlan" : 1218,
        "uplinkInterfaceIP" : "172.22.18.2/24",
        "peerIP" : "172.22.18.1/24",
        "asnPeer" : 65213,
        "bgpPeerPassword" : "VMware1!"
      }, {
        "uplinkVlan" : 1219,
        "uplinkInterfaceIP" : "172.22.19.2/24",
        "peerIP" : "172.22.19.1/24",
        "asnPeer" : 65213,
        "bgpPeerPassword" : "VMware1!"
      } ]
    }, {
        "edgeNodeName" : "vsphere-w01-en02.vsphere.local",
        "managementIP" : "172.22.11.72/24",
        "managementGateway" : "172.22.11.254",
        "edgeTepGateway" : "172.22.17.254",
        "edgeTep1IP" : "172.22.17.14/24",
        "edgeTep2IP" : "172.22.17.15/24",
        "edgeTepVlan" : 1217,
        "clusterId" : "37c83ee6-2338-40b0-9470-bb6d47922601",
        "interRackCluster" : false,
        "uplinkNetwork" : [ {
          "uplinkVlan" : 1218,
          "uplinkInterfaceIP" : "172.22.18.3/24",
          "peerIP" : "172.22.18.1/24",
          "asnPeer" : 65213,
          "bgpPeerPassword" : "VMware1!"
        }, {
          "uplinkVlan" : 1219,
          "uplinkInterfaceIP" : "172.22.19.3/24",
          "peerIP" : "172.22.19.1/24",
          "asnPeer" : 65213,
          "bgpPeerPassword" : "VMware1!"
      } ]
    } ],
    "tier0RoutingType" : "EBGP",
    "tier0Name" : "vsphere-w01-ec01-t0-gw01",
    "tier1Name" : "vsphere-w01-ec01-t1-gw01",
    "edgeClusterProfileType" : "DEFAULT"
  }

As mentioned before, please run the VALIDATE call first – in this scenario, a POST call to https://sddc_manager_fqdn/v1/edge-clusters/validations – and after the validation passes, execute the call without “validations” in the URL.

After this procedure is finished, we will have our workload domain with two clusters as well as a T0 gateway completely configured and ready to go! Simple and quick, isn’t it?

Closing Note

Leveraging the VCF APIs not only lets us implement architectures or designs that cannot be deployed due to GUI restrictions, but also greatly speeds up the time it takes to do so!

I hope you enjoyed this post, and if you have any concerns, or want to share your experience deploying VCF via API calls, feel free to do so!

See you in the next post!

Lessons learned while deploying VCF 4.2 Management Domain

Hello Everyone! It’s me again, trying to maintain a weekly post cadence!

Today I’m going to talk about some roadblocks I hit while doing a VCF 4.2 deployment in a real customer environment. Hopefully this will prevent these issues from happening to you, or help you solve them quickly if they do arise!


Password Policy for Cloud Builder

In VCF 4.2, several changes to password strength requirements were made. Using 8-character passwords seems to be hit or miss: you could get a valid deployment and then immediately an invalid one if you deploy another Cloud Builder with a password like “VMw@r3!!”. I haven’t been able to fully grasp the cause of this behaviour.

In addition, VMware is now a dictionary word, so it won’t be allowed – “VMware1!” and “VMware1!VMware1!” will also fail.

The password I’ve been using successfully for the initial deployment is “VMw@r3!!VMw@r3!!” – that one works 100% of the time, so feel free to use it.

Hostnames in uppercase

This one is really, really strange – if the hostnames of your ESXi hosts are in uppercase, you will get a ‘Failed to connect to lowercase_hostname’ error for all of your hosts when running the validation, and the validation will stop without querying any of the host configuration.

I spent some time trying to figure this out. At first I thought it was DNS records, but then in a different environment, 3 of the 4 hosts had their hostnames in uppercase and one in lowercase, and the lowercase one was the only one connecting. That made me test the change, and suddenly the newly lowercased host was also connecting!

To clarify: ESXI1.VSPHERE.LOCAL will fail, esxi1.vsphere.local will work – make sure your hostnames are in lowercase.

Heterogeneous / Unbalanced disk configuration across hosts

This one is really interesting. Let’s say you’re doing an all-flash VCF deployment and you have 20 disks per host – the best way to configure them would be 4 disk groups of 1 cache + 4 capacity disks, so that all 20 disks are used.

Since you can have at most 5 disk groups of 1 cache + 7 capacity disks, 40 is the maximum number of disks you can have.

However, make sure you follow these two rules for your deployment:

  • Make sure that the number of disks is a multiple of a homogeneous disk group configuration, so that all your disks are used and all the disk groups have the same number of disks. For example, if you have 22 disks there is no way to use all of them while keeping the disk groups identical: you can do 3×(1+6) and one disk won’t be used, or 4×(1+4) and two won’t be used (see the sketch after this list).
  • Make sure that all your hosts have the same number of disks. You can check this before installing – in my scenario, validation was passing but it was marking the cluster as hybrid instead of all-flash.
    After checking that all devices were SSDs and marked as such, I was really confused. Then I found that two of the hosts had 2 more disks than the rest; fixing that made the validation pass and the cluster be marked as all-flash.
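
As a quick sanity check for the first rule, here is a small Python sketch that lists the homogeneous disk-group layouts (1 cache + c capacity disks per group, at most 5 groups and at most 7 capacity disks per group) that consume every disk:

# (groups, capacity-per-group) combinations where g groups of (1 cache + c capacity)
# use every disk in the host
def layouts(disks_per_host):
    return [(g, c) for g in range(1, 6) for c in range(1, 8) if g * (1 + c) == disks_per_host]

print(layouts(20))  # [(4, 4), (5, 3)] -> e.g. 4 disk groups of 1 cache + 4 capacity
print(layouts(22))  # [] -> no homogeneous layout uses all 22 disks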

EVC Mode

This one almost made me reinstall the whole cluster…

BE REALLY SURE that you’re choosing the correct EVC mode for your CPU family if you’re setting an EVC mode in the Cloud Builder spreadsheet.

If you select the wrong EVC mode, Cloud Builder will fail the deployment, and you won’t be able to continue from the GUI at all. The only way around it is via the API – otherwise, it’s wiping the cluster and starting from scratch!

I’m going to show you how to fix this specific issue, but the method applies any time you need to edit the configuration and then re-attempt a deployment.

First of all, you need to get your SDDC deployment ID, which you can obtain with this API call (I will use curl for this example, but you can also use something like Invoke-RestMethod in PowerShell or a GUI-based REST client such as Postman):

Get your SDDC Deployment ID

curl 'https://cloud_builder_fqdn/v1/sddcs/' -i -u 'admin:your_password' -X GET \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -k

You can export the output to a file or pipe it to a text viewer such as less, then search for the sddcId value.
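
If you’d rather script the lookup, here is a rough Python equivalent of the same call (hostname and credentials are placeholders, and as with curl, certificate verification is skipped for the self-signed certificate):

import requests

# Same call as the curl example; print the JSON and search it for the sddcId value
resp = requests.get("https://cloud_builder_fqdn/v1/sddcs/",
                    auth=("admin", "your_password"), verify=False)
print(resp.json())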

Editing the JSON File

Once you have the sddcId, you need to edit the JSON file that Cloud Builder generated from the spreadsheet so you can use it in the API call. I recommend copying the file and editing the copy. The file is located at /opt/vmware/sddc-support/cloud_admin_tools/resources/vcf-public-ems/

#COPY THE FILE
cp /opt/vmware/sddc-support/cloud_admin_tools/resources/vcf-public-ems/vcf-public-ems.json /tmp/newjson.json
#REPLACE STRING ON FILE
sed -i "s/cascadelake/haswell/g" /tmp/newjson.json

You can also edit the file using vi – in this case I used sed because I knew the string would appear only once in the file, and it was faster.

Restarting the deployment

Now that you have the sddcId and have edited the JSON file, it is time to restart the process using another API call:

curl 'https://cloud_builder_fqdn/v1/sddcs/your_sddc_id_from_previous_step' -i -u 'admin:your_password' -X PATCH     -H 'Content-Type: application/json'     -H 'Accept: application/json'     -d "@/tmp/newjson.json"  -k

Make sure to add the @ before the location of the file when using curl

Once you run this, you should get something like:

HTTP/1.1 100 Continue
HTTP/1.1 200
Server: nginx
Date: Wed, 07 Apr 2021 20:37:08 GMT

And if you log in to the Cloud Builder web interface, your deployment should be running again! Phew, you saved yourself from reinstalling and preparing 4 nodes! Go grab a beer while the deployment continues 😀

Driver Issue when installing NSX-T VIBs

I ran into this issue after waiting multiple hours for the NSX-T host preparation to finish, only to see all the hosts in the NSX-T tab marked as failed.

When checking the Cloud Builder debug logs, I saw errors like:

2021-04-07T23:06:44.700+0000 [bringup,196c7022580bfc32,5a84] DEBUG [c.v.v.c.f.p.n.p.a.ConfigureNsxtTransportNodeAction,bringup-exec-7] TransportNode esxi1.vsphere.local DeploymentState state is {"details":[{"failureCode":26080,"failureMessage":"Failed to install software on host. Failed to install software on host. esxi1.vsphere.local : java.rmi.RemoteException:  [DependencyError] VIB QLC_bootbank_qedi_2.19.9.0-1OEM.700.1.0.15843807 requires qedentv_ver \u003d X.40.17.0, but the requirement cannot be satisfied within the ImageProfile. VIB QLC_bootbank_qedf_2.2.8.0-1OEM.700.1.0.15843807 requires qedentv_ver \u003d X.40.17.0, but the requirement cannot be satisfied within the ImageProfile. Please refer to the log file for more details.","state":"failed","subSystemId":"eeaefa1e-c5a2-4a8a-9623-994b94a803a9","__dynamicStructureFields":{"fields":{},"name":"struct"}}],"state":"failed","__dynamicStructureFields":{"fields":{},"name":"struct"}}

This is related to the QLogic drivers included in the HP custom image used in this deployment (which had been patched to 7.0 U1d, the prerequisite for VCF 4.2).

Indeed, these drivers were installed:

esxcli software vib list | grep qed
qedf                           2.2.8.0-1OEM.700.1.0.15843807         QLC     VMwareCertified   2021-03-03
qedi                           2.19.9.0-1OEM.700.1.0.15843807        QLC     VMwareCertified   2021-03-03
qedentv                        3.40.3.0-12vmw.701.0.0.16850804       VMW     VMwareCertified   2021-03-04
qedrntv                        3.40.4.0-12vmw.701.0.0.16850804       VMW     VMwareCertified   2021-03-04

None of these drivers were in use, and none of the hosts had QLogic hardware, so the drivers could be removed without issues. However, it is best to unconfigure the hosts from NSX-T first, since that also prompts for a reboot.
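If you want to confirm this in your own environment before removing anything, you can check which driver each physical NIC and storage adapter is actually using:

# Driver in use by each physical NIC
esxcli network nic list
# Driver in use by each storage adapter
esxcli storage core adapter list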

Go to the Transport Node tab in NSX-T, select the cluster, and click on “Unprepare” – This will likely fail and prompt you to run a force cleanup, which will work, and the hosts will disappear from the tab.

In my scenario, none of the NSX-T VIBs were installed so no NSX-T VIB cleanup was necessary

Now it is time to delete the drivers from the hosts and reboot them. Run this host by host – since you already have the vCenter, vCLS, and NSX Manager VMs running, you can’t just blindly power off all your hosts:

esxcli software vib remove --vibname=qedentv --force
esxcli software vib remove --vibname=qedrntv --force
esxcli software vib remove --vibname=qedf --force
esxcli software vib remove --vibname=qedi --force
esxcli system maintenanceMode set --enable true
esxcli system shutdown reboot --reason "Drivers"

Edge TEP to ESXi TEP validation when using Static IP Pool

VCF 4.2 removes the need for a DHCP server on the ESXi TEP network (as long as you’re not using a stretched cluster), which is a lifesaver for many, since setting up the DHCP server was usually a common blocker for customers (the other one being BGP).

However, the validation still attempts to look for a DHCP server (regardless of the Static IP Pool you configured in the spreadsheet), and since there isn’t one, the host gets a 169.254.x.x IP and the validation fails. For example:

VM Kernel ping from IP '172.22.17.2' ('NSXT_EDGE_TEP') from host 'esxi1.vsphere.local' to IP '169.254.31.119' ('NSXT_HOST_OVERLAY') on host 'esxi2.vsphere.local' failed
You can see the IP is in the 169.254.x.x range

Luckily, this is just a validation bug; it has been reported internally and will likely be fixed in an upcoming VCF release. The issue will not present itself during the actual deployment, and the TEP addresses will be set up correctly using the static IP Pool.

BGP Route Distribution Failure

If your BGP neighboring is not configured correctly on your upstream routers, you will see the task “Verify BGP Route Distribution” fail:

2021-04-08T05:09:54.729+0000 [bringup,42ba3b72e2ee4185,395f] ERROR [c.v.v.c.f.p.n.p.a.VerifyBgpRouteDistributionNsxApiAction,pool-3-thread-13] FAILED_TO_VALIDATE_BGP_ROUTE_DISTRIBUTION
com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException: Failed to validate the BGP Route Distribution result for edge node with ID 123b3404-bab6-4013-a9f7-eba3b91b4faf

This means that the BGP configuration on the upstream routers is incorrect – usually, a BGP neighbor is missing. The easiest way to figure out what’s missing is to check the BGP status on the Edge Nodes.
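You can do this from the Edge node CLI. The exact commands vary slightly between NSX-T versions, but the flow is roughly this (the VRF number below is just an example):

get logical-routers              # note the VRF of the SERVICE_ROUTER_TIER0
vrf 1                            # enter that VRF context
get bgp neighbor summary         # check the state of each BGP session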

In my case, the Upstream switches only had one neighbor configured per uplink VLAN, so node 1 showed:

BGP neighbor is 172.22.15.1, remote AS 65211, local AS 65210, external link
BGP version 4, remote router ID 172.22.15.1, local router ID 172.22.16.2
BGP state = Established, up for 09:09:51

And node 2 showed:

BGP neighbor is 172.22.15.1, remote AS 65211, local AS 65210, external link
BGP version 4, remote router ID 0.0.0.0, local router ID 172.22.15.3
BGP state = Connect

You can see that the BGP session for node 2 is not established. After configuring the neighbor correctly on the upstream routers, the issue was resolved!

Conclusion

Deploying VCF 4.2 in this environment has been a rollercoaster but luckily, all the issues could be solved.

I hope this helps you either avoid all of these issues (by pre-emptively checking and fixing what could go wrong) or, in case they do happen to you, fix them as quickly as possible.

Stay tuned for more VCF 4.2 adventures, next time, with workload domains!

How do I get to vSphere 7.0 without dying in the process?

Hello Everyone,

After a long hiatus, I decided to write a new blog post (and hopefully improve the frequency of them :D) – This one is based on a 2-hour presentation I did for VMUG (VMware User Group) Argentina last week, which was done in Spanish, and which I will link down below.

However, for all the non-Spanish speakers, I will do a breakdown of everything you need to check before attempting a vSphere upgrade, from the vCenter & PSC perspective, so you pass the upgrade with flying colors! – Buckle up!

Where is our environment currently standing?

First of all, you need to assess the current state of your vCenters and PSCs – Is replication working correctly, for example? This article goes really deep into checking that:

Pre-upgrade considerations in Multi-vCenter environments

If you have any replication issues, this is the first thing you need to fix; otherwise, as shown in the previous article (and the video), you risk completely destroying your environment.

The second thing you need to check is your current topology – How many PSCs and vCenters are actually in the environment? Is PSC HA being used? Is everything converged? Depending on your current topology, it might be a pretty trivial migration, or it might need multiple steps over the course of a weekend.

What happens in the upgrade process?

First of all, the external PSC is deprecated in vSphere 7.0 – That means that, as part of the upgrade process, any environment with an external PSC is converged to an embedded one. Even though this process might be straightforward, it can cause multiple problems before, during and after the migration. It’s easier and more convenient to break it up into parts.

So if we’re good with replication (check and re-check the previous article, I can’t stress this enough), then we need to figure out an upgrade and migration plan.

Planning the upgrade process based on our topology

Let’s start with something simple:

What would be the correct steps here?

Let’s break it down:

1: Offline snapshots of all three VCs (with embedded PSCs) – offline means with the whole SSO domain powered off – this is done from the ESXi hosts that are running the VMs (see the sketch after this list).

2: Upgrade vCenter 1

3: Check functionality and replication

4: Offline snapshots of all three VCs (with embedded PSCs)

5: Upgrade vCenter 2

6: Check Functionality and replication

7: Guess what? Another round of offline snapshots

8: Upgrade vCenter 3

9: Check Functionality and replication

10: Delete all snapshots
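As a reference for the snapshot steps: since the whole SSO domain is powered off, the snapshots have to be taken directly from the ESXi hosts. A minimal sketch using vim-cmd, assuming the VM ID turns out to be 42:

# Find the VM ID of the vCenter VM registered on this host
vim-cmd vmsvc/getallvms | grep -i vcsa
# Create the snapshot (VM is powered off, so no memory or quiesce flags needed)
vim-cmd vmsvc/snapshot.create 42 "pre-upgrade" "offline snapshot before upgrade" 0 0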

Why am I taking snapshots at every step? Why don’t I just take a single round of snapshots and then upgrade all at once?

Well, because if you had any issue at any point during the 2nd or 3rd upgrade, you would have to roll back everything and start from scratch. Done this way, you have multiple points to go back to and can avoid re-doing the whole upgrade process! This gets even worse if instead of 3 vCenters you have 9 or 10 – if, let’s say, you had an issue with upgrade 7, you would have to revert everything!

Now let’s make this a little bit more complicated!

So let’s picture this scenario (which is not too uncommon – I’ve seen it in the real world)

What do we have?

First of all, blue lines symbolize good replication and red lines symbolize that replication is not working – As discussed earlier, this is the first thing to fix. In the process of fixing it (most likely with a GSS ticket), multiple rounds of offline snapshots will be taken!

Now, onto the topology:

  • 6 External PSCs in a ring topology
  • 3 PSC HA VIPs being used by 2 vCenters each
  • 6 vCenters

So what should we do here? This not only involves the upgrade of the vSphere environment, but also the re-pointing of 2nd and 3rd party tools to the new converged PSCs – think of NSX and SRM, for example.

The biggest pain point in this scenario, however, is PSC HA – how do we get rid of this prior to the upgrade?

Even though there is a KB for converging PSC HA (https://kb.vmware.com/s/article/65129), in practice this is not the best approach due to how error-prone it is.

What is the best approach? There are two ways to approach this, depending on downtime and operations.

The cleanest approach would be to deploy 6 new PSCs, repoint the vCenters to those 6 PSCs, and then decommission all the PSC HA nodes (as well as the VIPs) – However, this might be complicated because of a lack of IP addresses in the management segment, time, etc.

You could also leverage lsdoctor (https://kb.vmware.com/s/article/80469) to unconfigure PSC HA and then repoint the vCenters to each of the nodes – This introduces a bit more downtime per vCenter (downtime while unconfiguring PSC HA + downtime until the repoint is complete) but removes the need to deploy new PSCs.

If you ask me, I recommend the first option, to make this as clean as possible.

So in this scenario, what would you do?

  1. Offline snapshots of all vCenters and PSCs
  2. Deploy PSC 7 pointed to PSC 6
  3. Deploy PSC N pointed to PSC N-1 until all PSCs are deployed.
  4. Check replication among the new PSCs

So now we have something like this

You can see that by deploying the PSCs in that order, we already have a “semi-ring”, with far less operational hassle than deploying them all pointed to a single PSC and then having to remake the replication agreements.

So what’s next?

We need to repoint the vCenters to these new PSCs – Since the repoint is a pretty short process, you can get away with taking a single round of offline snapshots at the beginning and then repointing everything (a sketch of the repoint command follows the list):

  1. Offline snapshots of all vCenters and PSCs
  2. Repoint all vCenters to the new PSCs, 1:1
  3. Check correct functioning
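For reference, the repoint itself is done from each vCenter’s appliance shell with cmsso-util – a minimal sketch, using the hypothetical FQDN psc7.domain.local as the new partner:

# Run on each vCenter, pointing it at its new PSC (1:1)
cmsso-util repoint --repoint-psc psc7.domain.local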

End result:

Lovely, right?

Now, we need to get rid of all the PSCs that were forming the PSC HA (nodes and VIPs)

  1. Offline snapshots of all vCenters and PSCs
  2. Decommission all PSCs and PSC HA VIP nodes using: https://kb.vmware.com/s/article/2106736
  3. Check correct functioning

Now we’re here!

So we did all this and we haven’t even started upgrading or converging… but believe me, doing your due diligence and keeping this as clean as possible will save you from multiple headaches when you actually upgrade!

So what is left?

  1. Form a ring by creating a replication agreement between PSC12 and PSC7 (see the sketch after this list)
  2. Take a new round of offline snapshots
  3. Converge PSC7
  4. Check correct functioning
  5. Take a new round of offline snapshots
  6. Converge PSC8
  7. Check correct functioning
  8. Repeat the snapshot / converge / check cycle
  9. Until all PSCs are converged
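Step 1 can be done with vdcrepadmin from any of the PSCs (the tool is documented in https://kb.vmware.com/s/article/2127057) – a sketch with hypothetical FQDNs:

# Create a two-way replication agreement between PSC12 and PSC7 to close the ring
/usr/lib/vmware-vmdir/bin/vdcrepadmin -f createagreement -2 \
    -h psc12.domain.local -H psc7.domain.local \
    -u administrator -w 'SSO_Password'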

In case there is any issue with the convergence, you can just go back to the latest functioning snapshot so you don’t have to redo everything!

You should be here now:

And from here, you can finally do the upgrade process – as discussed previously in the first scenario, take a round of offline snapshots before each upgrade, to avoid having to re-do upgrades.

Last but not least, you should repoint all 2nd and 3rd party solutions to the new converged (and upgraded) PSCs that are now living inside the vCenter appliance!

Closing note

I hope you enjoyed this post – If you have even limited knowledge of Spanish, I encourage you to watch the YouTube video, in which I go over this in detail and also analyze and fix replication issues the same way it would be done if you contacted GSS.

Feel free to share this with peers, customers, and partners – If we generate awareness about these processes and a clean, correct way of doing them, we will have far more successful upgrades!

Quickly create NSX-T Segments using PowerCLI and NSX-T REST API!

Hello Everyone,

In today’s edition, I’m going to share a script I wrote that will do the following:

  • Get all the VMs from your infrastructure
  • For each VM (and each virtual nic that is connected to a portgroup), it will query
    • Portgroup Name
    • Portgroup VLAN Type (because we’re going to skip trunk VLANs)
    • Portgroup VLAN ID
    • Default gateway (here is where it gets tricky…)
    • Gateway network prefix (also tricky)

Once we have all that data, we will proceed to create all of these segments (with the gateway and network prefix) inside NSX-T, using REST API calls.

By creating the segments with the gateway, we serve two great purposes:

  • The segment is already prepared to be connected to a T1 DR and does not need further manual editing
  • Customers may not know all the gateways of all their vSphere networks, and this script will output that for you!

What do I need from you to run it?

  • vCenter & NSX Manager FQDNs and credentials
  • NSX-T Overlay transport zone name (the transport zone we’re going to use to create the segments)

This is the link to the script, where you can take a look at the code: https://github.com/luchodelorenzi/scripts/blob/master/createSegments.ps1

I’m going to explain it bit by bit, mostly focusing on the logic and the problems I encountered while testing it.

The end result would be something like this:

Enter vCenter FQDN: vcsa-01a.corp.local

PowerShell credential request
Enter vCenter Credentials
User: administrator@vsphere.local
Password for user administrator@vsphere.local: ********

Enter NSX Manager FQDN: nsxapp-01a.corp.local

PowerShell credential request
Enter NSX Credentials
User: admin
Password for user admin: ****************

Enter NSX Overlay Transport Zone name: nsx-overlay-transportzone

Name                           Port  User
----                           ----  ----
vcsa-01a.corp.local            443   VSPHERE.LOCAL\Administrator
Querying data for rdsh-01a ...
Querying data for log-01a ...
Querying data for web-01a ...
Querying data for vm-01a ...
Querying data for app-01a ...
Querying data for web-02a ...
Querying data for edgenode-01a ...
############################################################
Found the following possible segments in your infrastructure
Portgroup PG-WEB with VLAN 100 gateway 172.16.10.1 and prefix length 24
Portgroup PG-VM with VLAN 200 gateway 172.16.101.1 and prefix length 24
Portgroup PG-APP with VLAN 300 gateway 172.16.20.1 and prefix length 24
Would you like to Create these segments on NSX-T?
 ( y / n ) : y
Yes, create segments
found transport zone id: 1b3a2f36-bfd1-443e-a0f6-4de01abc963e
Creating Segment PG-WEB-VLAN100-GW-172.16.10.1 on transport zone nsx-overlay-transportzone
Creating Segment PG-VM-VLAN200-GW-172.16.101.1 on transport zone nsx-overlay-transportzone
Creating Segment PG-APP-VLAN300-GW-172.16.20.1 on transport zone nsx-overlay-transportzone
Simple, right?

So let’s start breaking the script up in parts…

Part 1 – Getting FQDNs and Credentials

This is pretty self explanatory, we’re just getting the FQDNs and credentials and saving them into variables.

$vcenter=Read-Host "Enter vCenter FQDN"
$vccredential = Get-Credential -message "Enter vCenter Credentials"
$nsxmanager=Read-Host "Enter NSX Manager FQDN"
$nsxcredential = Get-Credential -message "Enter NSX Credentials"
$overlayTransportZone = Read-Host "Enter NSX Overlay Transport Zone name"
Connect-VIServer -Server $vcenter -credential $vccredential

Part 2 – Exporting Data from VMs

So what are we doing here?

  • We’re iterating through every VM in the infrastructure and getting the IP stack (which is part of the extensiondata.guest object and is therefore read from VMware Tools – it will be empty if the VM does not have VMware Tools running)
  • We’re getting the device (virtual NIC) that has the portgroup with the default gateway (needed when there are multiple virtual NICs and multiple portgroups).
    This matters because VMware Tools only reports a guest-side device index, not the vSphere portgroup, so we need to do the mapping ourselves.
  • We filter stuff out that we don’t need, such as:
    • Any network with fewer than 6 characters (empty IPv6 networks)
    • Any network that does not have “.” in the address (i.e. not IPv4)
    • Any prefix length of 0 or 32 (useless in this scenario – not the gateway network)
    • Any network that starts with 224. (multicast) or 169.254. (link-local)

After having filtered that, we’re going to have our gateway and network prefix, so what’s next?

  • Using the device, we get the portgroup that we’re going to use, and from that portgroup, we get the VLAN configuration and VLAN ID, and we discard it if it is a trunk portgroup
  • We will also discard the portgroup if its name contains “vxw-dvs”, because that means it is an NSX-V portgroup and won’t be VLAN-backed
  • We create a new object that will contain:
    • Portgroup Name
    • Portgroup VLAN ID
    • Gateway
    • Network prefix
  • And we add this object to an array of objects
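# Initialize the results array (in the full script this is declared before the loop)
$PossibleSegments = @()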
foreach ($vm in $vms) {
    $networkObject = "" | Select Portgroups,Gateway,Prefix
    $networkObject.Portgroups = ($vm | Get-NetworkAdapter | Get-VDPortgroup)
    Write-Host Querying data for $vm...
	if ($vm.extensiondata.guest.ipstack){
		$device = ($vm.extensiondata.guest.ipstack[0].iprouteconfig.iproute | where {$_.network -eq "0.0.0.0"}).gateway.device 
		$networkObject.gateway = ($vm.extensiondata.guest.ipstack[0].iprouteconfig.iproute | 
			where {$_.network -eq "0.0.0.0"}).gateway.ipaddress
		$networkObject.Prefix = ($vm.extensiondata.guest.ipstack[0].iprouteconfig.iproute | 
			where {$_.network.length -gt 6} | where {$_.network -like "*.*"} | 
				where {$_.prefixlength -ne 32} | where {$_.network.substring(0,4) -ne "224."}  | 
					where {$_.prefixlength -ne 0} | where {$_.network.substring(0,8) -ne "169.254."} | 
						where {$_.gateway.device -eq $device}).prefixlength
						
		if (($vm | Get-NetworkAdapter)[$device]){
			$pg = ($vm | Get-NetworkAdapter)[$device] | get-vdportgroup
		}
		$PGObject = "" | Select Name, VLAN, Gateway, PrefixLength
		$PGObject.Name = $pg.name
		$PGObject.VLAN = $pg.VlanConfiguration.VlanId
		$PGObject.Gateway = $networkObject.Gateway
		$PGObject.PrefixLength = $networkObject.Prefix
		#Skip Trunk vLAN
		if ($pg.VlanConfiguration.vlantype -ne 'Trunk' -and $pg.name -notlike "*vxw-dvs*" -and $pg.name -ne $null){
			$PossibleSegments += $PGObject
		 }
	}
} # end foreach ($vm in $vms)

Part 3 – Parsing the data

We now have an array of objects with all the data we need, but it will likely contain many repeated entries, since a lot of VMs use the same portgroup and gateway. We could keep a single entry per portgroup, but that would not be ideal:
there is nothing stopping anyone from using multiple networks inside the same portgroup and VLAN, so the ‘uniqueness’ of a segment is given by the combination of the portgroup and the gateway.
That way, we keep all the data we need and don’t discard anything useful!

$UniqueSegments = $PossibleSegments | Where {$_.Gateway -ne $null} | sort-object -Property  @{E="Name"; Descending=$True}, @{E="Gateway"; Descending=$True} -unique

Write-Host "############################################################"
Write-Host "Found the following possible segments in your infrastructure"
$uniqueSegments | % {
	Write-Host Portgroup $_.name with VLAN $_.VLAN, gateway $_.gateway and prefix length $_.prefixlength
}

Part 4 – Pushing the data to NSX-T

So now that we have our array of segment objects fully sorted out and having unique entries, we need to push it to NSX-T

Remember at the beginning I asked for the NSX-T Overlay Transport zone? We’re going to need the transport zone ID

NSX-T 3.0 Rest API – List transport zones
So with the name, we can execute that API call and get the transport zone ID, to use it in the create segment API call!
NSX-T 3.0 Rest API – Create Segment

$getTzUrl = "https://$nsxmanager/api/v1/transport-zones"
$getTzRequest = Invoke-RestMethod -Uri $getTzUrl -Authentication Basic -Credential $nsxcredential -Method get -ContentType "application/json" -SkipCertificateCheck
$getTzRequest.results | % {
	if ($_.display_name -eq $overlayTransportZone){
		$overlayTzId = $_.id
		Write-Host found transport zone id: $overlayTzId
	}
}
foreach ($segment in $uniqueSegments)
{
	$segmentDisplayName = $segment.name + "-VLAN" + $segment.VLAN + "-GW-" + $segment.gateway
	$Body = @{
		display_name = $segmentDisplayName
		subnets = @(
			@{
				gateway_address = $segment.gateway + "/" + $segment.prefixlength
			}
		)
		transport_zone_path = "/infra/sites/default/enforcement-points/default/transport-zones/$overlayTzId"
	}
	$jsonBody = ConvertTo-Json $Body
	Write-Host "Creating Segment $segmentDisplayName on transport zone $overlayTransportZone"
	$patchSegmentUrl = "https://$nsxmanager/policy/api/v1/infra/segments/" + $segmentDisplayName
	$patchRequest = Invoke-RestMethod -Uri $patchSegmentUrl -Authentication Basic -Credential $nsxCredential -Method patch -body $jsonBody -ContentType "application/json" -SkipCertificateCheck
}

Conclusion

I hope you find this script useful – It should GREATLY improve times in fresh NSX-T deployments, not only by quickly creating all the segments automatically, but also since it does all the hard work of exporting all the gateway and prefix configuration, which can be super tedious!

If you have any questions regarding the script, please leave them in the comments below and I’ll address them

If you find any bugs or errors, or better ways to accomplish the same thing, please also leave a comment! That would be super helpful!

Cheers,

Proactively Checking and Replacing STS Certificate on vSphere 6.x / 7.x

Recently, we’ve been working on a global issue affecting all customers that deployed a vCenter Server as version 6.5 Update 2 or later: the Security Token Service (STS) signing certificate may have a two-year validity period, and depending on when vCenter was deployed, it may be approaching expiry.

Currently there is no alert on vCenter for this certificate; prior to 6.7u3g there was no way for customers to replace it in case of expiration (it required GSS involvement to execute internal procedures / scripts); and when it expires, it silently generates a production-down scenario.

Within the GSS team, we’ve come up with three scripts to help with this situation.

Checksts.py

Checksts.py is a Python script that is mentioned in KB https://kb.vmware.com/s/article/79248. It proactively checks the STS certificate for expiration, and works on Windows vCenters as well as vCenter Server Appliances.

To use it, you can download it from the KB mentioned:

Once it is downloaded, you can copy it to any directory on your vCenter. After that, you will run it like this:

  • Windows: "%VMWARE_PYTHON_BIN%" checksts.py
  • VCSA: python checksts.py

This is an example for VCSA:

If you get the message “You have expired STS certificates” and/or your certificate expires in less than 6 months, we recommend moving on to the next step: replacing the STS certificate! If your expiration date is more than 6 months away, then you don’t have to worry about any of this!

Fixsts.sh (VCSA) / Fixsts.ps1 (Windows)

The fixsts scripts are mentioned in https://kb.vmware.com/s/article/76719 (which I personally wrote) for VCSA and https://kb.vmware.com/s/article/79263 for Windows.

The idea is the same for both: replacing the STS certificate with a new, valid one. This can be done proactively (the cert has not expired yet) as well as reactively (the cert has already expired and you’re in a production-down scenario).

The steps are detailed in the articles themselves. They’re pretty much identical, with minor differences in how the commands are run due to the Guest OS, and they are super straightforward.

Once the STS is replaced, in case it was done proactively, you will be good to go!

YOU CAN STOP READING FROM THIS POINT ON – hope you liked this blog entry!

However, if this was done reactively, then it is likely that you will need to replace more certificates in your vCenter Server, especially if you were using VMCA certificates (which could have the same expiration date as the STS certificate if they were never replaced).

Replacing other certificates

How do I know which of my other certificates are expired?

In the KBs mentioned, there are two one-liners provided to check the certificates:

  • Windows: $VCInstallHome = [System.Environment]::ExpandEnvironmentVariables("%VMWARE_CIS_HOME%");foreach ($STORE in & "$VCInstallHome\vmafdd\vecs-cli" store list){Write-host STORE: $STORE;& "$VCInstallHome\vmafdd\vecs-cli" entry list --store $STORE --text | findstr /C:"Alias" /C:"Not After"}

  • VCSA: for i in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list); do echo STORE $i; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $i --text | egrep "Alias|Not After"; done

These commands will show, for each of the VECS (VMware Endpoint Certificate Store) stores, the expiration date of every certificate. Any certificate with a “Not After” date in the past is expired. You will also have issues with services if certificates are expired – services such as vpxd-svcs, vpxd or vapi-endpoint are pretty verbose about the expiration dates of certain certificates.

For example:

root@vcsa1 [ /tmp ]# for i in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list); do echo STORE $i; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $i --text | egrep "Alias|Not After"; done
STORE MACHINE_SSL_CERT
Alias : __MACHINE_CERT
Not After : Apr 6 11:57:19 2029 GMT
STORE TRUSTED_ROOTS
Alias : c96d3301505316ccc1b295276ece31318ad79ec7
Not After : Apr 6 11:57:19 2029 GMT
Alias : 8a11418d5ae2b87b7e8a5cb8646fbfae41503f9d
Not After : Dec 13 21:50:49 2029 GMT
Alias : cb5a495d34f3f2f75d357b47aac3799346665258
Not After : Sep 25 20:32:57 2022 GMT
Alias : 229a64a3dff7417d0b38fb011c692a55b7bee5c2
Not After : May 16 20:21:12 2030 GMT
Alias : 2f0e8e4f1658e61bef5004cb5efd159b90396838
Not After : May 16 20:45:07 2030 GMT
STORE TRUSTED_ROOT_CRLS
Alias : 4504400e4bcbdab5a34a9bc2555abd55327369c1
Alias : 31b2b5a18d89d90dadff901400a60d45ca3356e9
Alias : e7840a7cbbe7fcdd7a13d9159ff97443cc53fb5e
Alias : 985d7e55183635f13e2c6469eee9c72f68334615
STORE machine
Alias : machine
Not After : Apr 6 11:57:19 2029 GMT
STORE vsphere-webclient
Alias : vsphere-webclient
Not After : Apr 6 11:57:19 2029 GMT
STORE vpxd
Alias : vpxd
Not After : Apr 6 11:57:19 2029 GMT
STORE vpxd-extension
Alias : vpxd-extension
Not After : Apr 6 11:57:19 2029 GMT
STORE APPLMGMT_PASSWORD
STORE data-encipherment
Alias : data-encipherment
Not After : Apr 6 11:57:19 2029 GMT
STORE SMS
Alias : sms_self_signed
Not After : Apr 12 12:04:48 2029 GMT
STORE BACKUP_STORE

In this case, none of the certificates are expired. But if we had expired certificates, we would need to replace them!
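If you have a lot of stores and don’t want to eyeball the dates, here is a small sketch for the VCSA (assuming GNU date, which the appliance ships with) that prints only the entries that are already expired:

# Print only expired VECS entries
now=$(date +%s)
for store in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list); do
  /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store "$store" --text | grep "Not After" | while read -r line; do
    exp=$(date -d "${line#*: }" +%s 2>/dev/null) || continue
    [ "$exp" -lt "$now" ] && echo "EXPIRED in store $store: $line"
  done
done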

Let’s group them into three groups. All of them are replaced using the same tool, certificate-manager, detailed in KB https://kb.vmware.com/s/article/2097936, but the option you will use depends on the scenario:

  • Group 1: Machine SSL Certificate (Front facing certificate, on port 443)
    • If only Machine SSL is expired, you will run Option 3 (Replace the Machine SSL certificate with a VMCA Generated Certificate) of this KB, with the following caveats
      • The “comma separated list of hostnames” you will be prompted to complete should contain the PNID of the node as well as any additional hostname or alias you might be using. How do we get the PNID of the node?
        • Windows: "%VMWARE_CIS_HOME%"\vmafdd\vmafd-cli get-pnid --server-name localhost
        • VCSA: /usr/lib/vmware-vmafd/bin/vmafd-cli get-pnid --server-name localhost
      • The value of “VMCA Name” should match the PNID obtained in the prior step
  • Group 2: Root certificate (VMCA root certificate)
    • If there is any expired certificate in the TRUSTED_ROOTS store, it is safer to just run Option 8 (Reset all certificates) from the KB mentioned above. This will reset all certificates to VMCA-signed. The same caveats mentioned for Option 3 apply
  • Group 3: Solution User certificates (vpxd, vpxd-extension, machine, vsphere-webclient)
    • If any certificate in the vpxd, vpxd-extension, machine or vsphere-webclient stores is expired, run Option 6 (Replace Solution User Certificates with VMCA generated Certificates) from the KB mentioned above. The same caveats mentioned for Option 3 apply

Once all this is done, you should be back up and running with regenerated certificates, and out of the production down scenario!

Closing note

This is a pretty concerning issue, so I’m really happy to have been part of the team to help fix so many environments across the globe.

Please use this information to proactively check the STS certificate, and to replace it without getting into a production-down scenario. You can share this with customers, partners, or whoever you feel might benefit from this information!

Pre-upgrade considerations in Multi-vCenter environments

With vSphere 7.0 released on April 2nd, 2020 and vSphere 6.0 having reached end of general support on March 12th, 2020, this is one of those moments when many environments are in the process of upgrading their vSphere version, either from 6.0 to 6.5/6.7 (to stay in support) or to 7.0 to take advantage of all the new features, such as the native Kubernetes integration.

However, we have been getting an increased number of Support Requests about issues after upgrades in Multi-vCenter environments using Enhanced Linked Mode (from now on, ELM), especially if the environment uses more than one Platform Services Controller (from now on, PSC), either embedded or external.

The goal of this article is to help you understand your roadblocks PRIOR to actually doing the upgrade, so you don’t incur any downtime and can proactively fix everything that’s needed before upgrading.

For the purposes of this article, I will demonstrate everything with a Demo Environment, so everything is clearer.

Demo Environment

Super simple environment!


Two vCenter Server Appliances with Embedded PSC, in a single SSO domain.
I’m going to demonstrate the issues that we could get in if we upgrade an environment that is not in a healthy state of PSC replication.

What’s PSC Replication?

As you know, data replicates between the PSC instances (embedded in this scenario) when Enhanced Linked Mode is configured.

What data is replicated?

  • Users and roles
  • Trusted Roots store certificates
  • Lookup Service service registrations
  • Computer accounts
  • Domain controller accounts

And many, many more things. VMDIR (VMware Directory Service) is a Multi-master LDAP database.

I did mention Lookup Service service registrations… what are those?

Lookup Service

The Lookup Service is a component that registers the location of vSphere components so they can securely find and communicate with each other. This includes every internal service as well as some 2nd Party Tools (such as NSX, vSphere Replication, SRM) and 3rd Party Tools (Storage plugins, for example)

This is the number of Service Registrations per Service Type in our Demo environment:

  2         Service Type: applmgmt
  2         Service Type: certificatemanagement
  2         Service Type: cis.cls
  2         Service Type: cis.vmonapi
  2         Service Type: client
  2         Service Type: com.vmware.vsan.dp
  2         Service Type: com.vmware.vsphere.client
  2         Service Type: cs.authorization
  2         Service Type: cs.componentmanager
  2         Service Type: cs.ds
  2         Service Type: cs.eam
  2         Service Type: cs.identity
  2         Service Type: cs.inventory
  2         Service Type: cs.keyvalue
  2         Service Type: cs.license
  2         Service Type: cs.perfcharts
  2         Service Type: cs.vapi
  2         Service Type: cs.vsm
  2         Service Type: imagebuilder
  2         Service Type: messagebus.config
  2         Service Type: mixed
  2         Service Type: phservice
  2         Service Type: rbd
  2         Service Type: sca
  2         Service Type: sms
  2         Service Type: sso:admin
  2         Service Type: sso:groupcheck
  2         Service Type: sso:sts
  2         Service Type: topologysvc
  2         Service Type: vcenterserver
  2         Service Type: vcha
  2         Service Type: vcIntegrity
  2         Service Type: vsan-dps
  2         Service Type: vsan-health
  2         Service Type: vsphereclient
  2         Service Type: vsphereui

You can see services such as vsphereclient (vSphere Flash Client), vsphereui (vSphere HTML5 Client) and vcenterserver (vCenter Server), among others.

You can also see that there are two of every registration. Every PSC has its own Lookup Service, but they replicate the data through VMDIR, so every registration exists on every PSC.

Let’s take a look at the vCenter Server registrations:

I’m running the following command on one of the vCenter Servers (with Embedded PSC)

/usr/lib/vmidentity/tools/scripts/lstool.py list --url http://localhost:7080/lookupservice/sdk | grep -i "Service type: vCenterServer" -A9 | egrep "Service Type:|Version|URL"

For the purposes of this article, I’m only interested in the Service Type, Version and URL. However, a service registration contains much more data than that, such as the Service Registration ID, the Node ID, and all the URLs for the different endpoints with their own SSL certificates, but we’re not going to dive into that.

Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

We can see that every registration has the URL and the version. This is really important! Keep it in the back of your mind, because we’re going to come back to it!

PSC Replication Status

As we mentioned previously, the VMDIR database replicates between the PSCs

You can check the replication status of any PSC instance with the following command:

/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w SSO_Password

This is the output on our Demo environment


Partner: vcsa2.gsslabs.org
Host available: Yes
Status available: Yes
My last change number: 10360
Partner has seen my change number: 10360
Partner is 0 changes behind.

This means that outgoing replication for this node is working; however, it does not mean that replication is working correctly in both directions. For that, you need to run the same command on its replication partner.

Partner: vcsa1.gsslabs.org
Host available: Yes
Status available: Yes
My last change number: 10351
Partner has seen my change number: 10351
Partner is 0 changes behind.

This is good – our environment is healthy, replication-wise!
But what if it wasn’t?

  • Host available: No would mean that the replication partner is not reachable over the network
  • Status available: No would mean that the replication partner is reachable over the network, but its VMDIR state is either read-only or null (this is bad!)
  • Being a large number of changes behind, with the number not going down, could mean that this local node is in a read-only or null state (this is also really bad!)

So how do we check our VMDIR state if the “showpartnerstatus” command shows any of these errors? By running the following command:

echo 6 | /usr/lib/vmware-vmdir/bin/vdcadmintool

You will get an output similar to:

VmDir State is - Normal

This state could also be Null, Read-Only or Standalone – For the purposes of this document, all three are bad!

But how does a PSC get into this state?

After restoring a PSC (either embedded or external) from a snapshot, image-level backup, file-level backup, or VM-based backup, its Update Sequence Number (USN) is lower than the one its replication partners expect. This results in the replication partners being out of synchronization with the restored node.

This is why, when snapshotting a Multi-vCenter environment, you should always do it with all nodes powered off, and if you restore one of the nodes from a snapshot, you have to restore all of the involved nodes. This also applies to backups!

What can broken replication affect?

Replication issues are usually called a “Silent Killer” because you don’t notice anything is wrong until you want to make a change in the SSO environment. These changes can be adding a new 2nd or 3rd party tool, creating local users / roles in the SSO domain, installing a new vCenter or PSC, converging from External PSC to Embedded PSC, and the one we’re discussing in this document: upgrading!

So let’s go back to our demo environment…


And we’re now going to upgrade vcsa1.gsslabs.org – The upgrade succeeds…
Remember what we discussed about the versions?
This is what vcsa1.gsslabs.org (the upgraded one) now sees in the Lookup Service:

Service Type: vcenterserver
Version: 7.0
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

This is only a simple change used to demonstrate the issue. The same would happen for every other internal service, and in the case of vSphere 6.7 to 7.0, the upgrade will create, rename and re-register a bunch of other services, since the whole VMDIR structure changed.

This is fine! When we log in to vcsa1.gsslabs.org, we see both vCenter Servers…
But what happens if we log in to vcsa2.gsslabs.org? We see that vcsa1.gsslabs.org is not showing up!

So we go to check the Lookup Service entries, and we find the following…

Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk


Since replication was not working, vcsa2.gsslabs.org never received the changes that vcsa1.gsslabs.org made during the upgrade… so when vcsa2.gsslabs.org‘s Web Client tries to contact the vCenter instance on vcsa1.gsslabs.org, there is a version mismatch, and therefore it does not load it.

If you now upgrade vcsa2.gsslabs.org, the same thing is going to happen, and both are going to show something like this…

VCSA1
Service Type: vcenterserver
Version: 7.0
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

VCSA2
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 7.0
URL: https://vcsa2.gsslabs.org:443/sdk


Effective immediately, ELM is officially broken – These vCenters won’t see each other in the Web Client, let alone replicate VMDIR changes.

And this state is not easily fixable – it would likely involve cleaning up both sides’ VMDIR and then executing a cross-domain repoint between them. Now imagine if, instead of this simple environment, you had a 6-vCenter environment and ran into these issues – can you imagine the trouble you would be in?

OK, so now what do we do?

Now that the impacts of broken PSC replication on upgrades are clear (it will also affect convergence and many other SSO operations), this is what you can do to avoid being sucker-punched by the upgrade process:

  • Check that replication between all your PSC instances is working correctly and showing 0 changes behind across the board. This is done using the vdcrepadmin command that was shown before (a sketch for enumerating all the nodes you need to check follows this list)
  • If you run into any issue such as the ones already mentioned, check the VMDIR status using the vdcadmintool command that was shown before
  • If you get any of the errors detailed in this article, please open a Support Request with VMware -> https://kb.vmware.com/s/article/2006985
    We have a multitude of internal tools that can help you fix the replication issues and get you into a healthy state before attempting any other disruptive process, such as upgrading!
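To enumerate every PSC (vmdir server) in the SSO domain in the first place, vdcrepadmin can also list them – a sketch:

# List every vmdir server in the SSO domain - run from any PSC or embedded node
/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showservers -h localhost -u administrator -w SSO_Password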

Closing Note

I hope this blog post (my debut blog post!) is helpful for everyone running into these situations. The idea was to demonstrate a very possible issue you might face, using something as simple as the vCenter Server service registration version change during an upgrade.

Hopefully this will help avoid many critical issues in Multi-vCenter environments.

Regards,