Using AWX Container groups for Kerberos authentication of playbooks/templates running against Windows servers/hosts

I have been porting some of my Ansible playbooks for Windows over to AWX, and while they worked in my home lab, they didn’t cooperate when I moved them to my work environment. This is because I was initially testing on stand-alone Windows servers and clients in my home lab, whereas my office environment uses a Windows AD domain. With the Ansible CLI, I would just set up Kerberos authentication on my Ansible host. This is not as easy when AWX is running in Kubernetes pods.

In this situation I will use the stock “AWX EE (latest)” Execution Environment, but you will need to tell AWX how to reach your Kerberos server (AD server). We will configure a Container Group whose pod specification gives the Ansible Execution Environment your Kerberos configuration. If you haven’t already configured your Windows hosts for connections via WinRM, you can read the following documentation. My environment was already set up for this since I have already been controlling/automating my Windows servers via the Ansible CLI.

To prepare Kubernetes for this Container Group, you will need to create a ConfigMap that holds your Kerberos configuration. In your favorite editor (mine is vi), create a file called krb5.conf in your home directory or “/tmp”. In my example below I have two realms listed because my AWX host works with two domains.

[libdefaults]
 default_realm = CONTOSO.COM

[realms]
 CONTOSO.COM = {
  kdc = DC2.CONTOSO.COM
 }
 STUFF.COM = {
  kdc = DOUBLE.STUFF.COM
 }

[domain_realm]
.contoso.com = CONTOSO.COM
contoso.com = CONTOSO.COM
.stuff.com = STUFF.COM
stuff.com = STUFF.COM
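
Before loading the file, a quick sanity check can catch an unbalanced brace in the [realms] section, which is an easy mistake to make when adding a second realm. This is only a minimal sketch: the function name check_krb5_braces is my own (hypothetical), and it only compares brace counts, it does not validate the rest of the file.

```shell
# Hypothetical helper: succeed only if the file contains as many '{' as '}'.
check_krb5_braces() {
  file="$1"
  # grep -oF prints each literal match on its own line; $(( )) strips any
  # padding that wc adds on some platforms.
  opens=$(( $(grep -oF '{' "$file" | wc -l) ))
  closes=$(( $(grep -oF '}' "$file" | wc -l) ))
  [ "$opens" -eq "$closes" ]
}
```

After defining the function, something like: check_krb5_braces krb5.conf && echo "braces balanced" tells you whether the file is worth loading.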

Now we can load this file into Kubernetes as a ConfigMap by running the following:

kubectl -n awx create configmap awx-kerberos-config --from-file=krb5.conf

Now that your krb5.conf is stored in Kubernetes, confirm that the ConfigMap was created by running the following:

kubectl -n awx get configmap awx-kerberos-config -o yaml

You should see output in YAML format that shows your krb5.conf. Now in AWX, in the left column, click on “Instance Groups” under the Administration section.

In the “Instance Groups” menu, click “Add”, then “Add Container group”.

In the new Container group menu, you can name it whatever you want; in my case I am naming it Kerberos. The only other thing you need to do is check “Customize pod specification”.

Now edit the “Custom pod spec” YAML. Mine looks like this:

apiVersion: v1
kind: Pod
metadata:
  namespace: awx
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - image: 'quay.io/ansible/awx-ee:latest'
      name: worker
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        requests:
          cpu: 250m
          memory: 100Mi
      volumeMounts:
        - name: awx-kerberos-volume
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
  volumes:
    - name: awx-kerberos-volume
      configMap:
        name: awx-kerberos-config

Make sure you save when you’re done. Now we need to link this Container group to your template (the equivalent of a playbook in the Ansible CLI). To link the Container group, edit your template and, towards the bottom of the page, you will see “Instance Groups”; from there, select your Container group.

Now you should be able to run your Windows-based playbooks/templates in AWX. For me, the issue was not solved there. I had some extra troubleshooting to do, which turned out to be Kubernetes k3s DNS issues that I will talk about in my next post. If you need assistance troubleshooting, you can refer to the README located here. You can always contact me as well.

Installing AWX on AlmaLinux 9

I ran into some issues installing AWX on AlmaLinux 9 on Proxmox (I had the same issues with Alma 8.7). This also applies to Rocky Linux 9.

I was installing AWX on K3s (Rancher’s lightweight Kubernetes) following https://github.com/ansible/awx-operator#basic-install. I made it all the way to the section where you create the awx-demo.yaml, add it to your kustomization.yaml and build via kustomize build . | kubectl apply -f -. From there I was receiving errors such as "unable to determine if virtual resource","gvk":"apps/v1" and the build would ultimately fail.

To get past that error, I found a few posts which suggested changing the VM’s CPU type in Proxmox from “Default (kvm64)” to “Host”. This sets the VM’s CPU to match the CPU of the host.

***If you are running Hyper-V, there is a similar option, see the final post in this Google Group conversation: https://groups.google.com/g/awx-project/c/4tmP0TlRODU.***

After resetting the CPU type, rebooting the VM and re-running the kustomize build, I was able to make it quite a bit further. The logs looked like there were no issues, but towards the end the build once again failed. This time I was seeing the following error: “awx unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:”. The Pod itself was also down with a CrashLoopBackOff error. From there I found the following link, which got me past all of my installation issues: https://stackoverflow.com/questions/62442679/could-not-get-apiversions-from-kubernetes-unable-to-retrieve-the-complete-list

I ran: kubectl api-resources which listed the resources and metrics.k8s.io/v1beta1 was in fact down.

Next I ran: kubectl delete apiservice/v1beta1.metrics.k8s.io
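
To confirm which aggregated API is unavailable before deleting anything, kubectl get apiservice prints an AVAILABLE column for each entry. The sketch below filters that output down to the entries reporting False. The function name unavailable_apiservices is hypothetical, and it assumes the default column layout (NAME, SERVICE, AVAILABLE, AGE).

```shell
# Hypothetical helper: read `kubectl get apiservice` output on stdin and
# print the names of entries whose AVAILABLE column starts with "False".
unavailable_apiservices() {
  # NR > 1 skips the header row; $3 is the AVAILABLE column.
  awk 'NR > 1 && $3 ~ /^False/ { print $1 }'
}
```

It would be used as: kubectl get apiservice | unavailable_apiservices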

From there I re-ran the kustomize build command and the AWX installation completed successfully. After the installation, I did have to open the firewall ports in Alma to allow my browser to access AWX.
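
For the firewall step, AlmaLinux uses firewalld, so opening a port comes down to two firewall-cmd calls. This is a rough sketch rather than the exact commands I ran: the helper names are hypothetical, and the port number in the usage line is only an example, so use whatever NodePort kubectl -n awx get svc reports for your AWX service.

```shell
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: open one TCP port permanently, then reload firewalld
# so the permanent rule becomes active.
open_tcp_port() {
  port="$1"
  run firewall-cmd --permanent --add-port="${port}/tcp"
  run firewall-cmd --reload
}
```

Run as root, for example: open_tcp_port 30080 (30080 is an assumed example port). Prefixing the call with DRY_RUN=1 shows what would be executed without touching the firewall.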

Steps to Install AWX:

#Install K3s (Rancher's lightweight Kubernetes)
curl -sfL https://get.k3s.io | sh -

#Install Kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash

#Move Kustomize binary
mv kustomize /usr/local/bin/

#Go to the AWX README and follow along from there:
# https://github.com/ansible/awx-operator#basic-install
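
Before moving on to the README, it can help to confirm that the tools installed above are actually on the PATH. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical helper: report any of the named commands missing from PATH.
# Returns non-zero if at least one is missing.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example: require_cmds kubectl kustomize && kubectl get nodes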

Feel free to contact me if you have any comments or questions.

Disabling Inactive Domain User and Computer Accounts in Active Directory with Ansible

In my last article I wrote about having Ansible run several audit requests, including “We need a list of all inactive user accounts” and “We need a list of inactive computer accounts”. Now that we have those lists, we can let Ansible clean the accounts up. I preferred to create a new playbook for these tasks. It first lists the users and computers it will be handling, then disables each account, and finally moves it to a dedicated OU for disabled users or disabled computers (Disabled_Accounts and Disabled_Computers in the scripts below). I never delete the accounts, as we prefer to disable them and then move them.

Below is my Ansible playbook “fix_AD_Inactive-Users-AND-Computers-90days.yml”:

---
- hosts: pdc
  gather_facts: no
  tasks:
     - name: Copy inactive-user script to Windows
       win_copy:
          src: files/fix_inactive_usr.ps1
          dest: c:\it\fix_inactive_usr.ps1

     - name: Copy inactive-computer script to Windows
       win_copy:
          src: files/fix_inactive_pc.ps1
          dest: c:\it\fix_inactive_pc.ps1

     - name: Fix inactive users - 90 days
       win_shell: c:\it\fix_inactive_usr.ps1
       register: inactive_usr

     - debug: var=inactive_usr.stdout_lines

     - name: Fix inactive computers - 90 days
       win_shell: c:\it\fix_inactive_pc.ps1
       register: inactive_computer

     - debug: var=inactive_computer.stdout_lines

Below is the code for “fix_inactive_usr.ps1”

# Anything that last logged on before this date counts as inactive
$date = (Get-Date).AddDays(-90)

# Distinguished names of enabled users whose last logon is older than $date
$USR = (Get-ADUser -Filter {LastLogonDate -lt $date} -Properties Enabled | Where-Object {$_.Enabled -eq $true} | Select-Object DistinguishedName).DistinguishedName
echo $USR
# Disable each account, then move it to the disabled-accounts OU
ForEach ($Item in $USR){
   Disable-ADAccount -Identity $Item
   Move-ADObject -Identity $Item -TargetPath "OU=Disabled_Accounts,DC=contoso,DC=com"
   }

Please note that in the PowerShell scripts above and below, you will need to change “DC=contoso,DC=com” to reflect your actual domain.

Below is the code for “fix_inactive_pc.ps1”

# Specify inactivity range value below
$DaysInactive = 90
# $time is the cutoff date; the -Filter switch compares it against LastLogonTimeStamp

$time = (Get-Date).AddDays(-($DaysInactive))

# Identify inactive computer accounts

$PC = (Get-ADComputer -Filter {LastLogonTimeStamp -lt $time} -Properties Enabled | Where-Object {$_.Enabled -eq $true} | Select-Object DistinguishedName).DistinguishedName
echo $PC
# Disable each computer account, then move it to the disabled-computers OU
ForEach ($Item in $PC){
   Disable-ADAccount -Identity $Item
   Move-ADObject -Identity $Item -TargetPath "OU=Disabled_Computers,DC=contoso,DC=com"
   }

Ansible Automation: Gather list of all services on Windows servers and clients

I had another audit request to gather all services on Windows servers in an environment of 70+ servers. I knew doing this through Ansible would be a lot faster than going to each server individually. In the end it took less than 5 minutes to gather the services on all of them.

When running the playbook I usually tee the output to a text file:

For example: ansible-playbook Audit_win_list_all_services.yml | tee /tmp/audit/Windows_services.txt

Here is my playbook:

Audit_win_list_all_services.yml

Ansible Automation: Gather list of all software installed on Windows servers and clients

I had a request to gather all software installed on Windows servers in an environment of 70+ servers. I knew doing this through Ansible would be a lot faster than going to each server individually. In the end it took less than 5 minutes to gather the installed software on all of them.

I had seen a few playbooks online from other Ansible admins doing this via Win32_Product, but I have seen warnings that querying Win32_Product causes problems (it triggers a Windows Installer consistency check of every installed package).

So after reading this article, I created the following playbook (I initially used a normal debug statement, but the output had a lot of unnecessary info, so I split the output by newline and printed that list):

Below is my playbook:

win_list_all_programs.yml

Automating with Ansible: Adding new Windows server clients to Prometheus/Grafana

I needed a way to install windows_exporter on our Windows systems as well as automate the configuration of each client in Prometheus. I came up with this Ansible playbook to handle the task. I’m sure there are other ways of doing this and I’m always open to suggestions. Here is what I have:

Playbooks (Can be downloaded):

win_install_prometheus.yml which calls install_prometheus_part2.yml

I imported a dashboard from Grafana.com, but at the time it only supported the older wmi_exporter. I was able to edit the dashboard and update it to work with the new exporter. Here is my dashboard (in JSON format for importing):

Veeam Backup Failing (VSS_WS_FAILED_AT_PREPARE_SNAPSHOT) (Resolved)

I had a Veeam backup job that was failing with: Retrying snapshot creation attempt (Writer ‘Microsoft Hyper-V VSS Writer’ is failed at ‘VSS_WS_FAILED_AT_PREPARE_SNAPSHOT’. The writer experienced a non-transient error. If the backup process is retried, the error is likely to reoccur. --tr:Failed to verify writers state. --tr:Failed to perform pre-backup tasks.)

Researching this error online suggested the issue was on the host, but I didn’t believe that, as all of my other VMs were backing up daily without issue.

To play it safe I checked the host by running: vssadmin list writers

I received the following error on the host:

microsoft hyper-v vss writer non-retryable error

Looking further on the host’s event logs for the error I saw this:

At this point I was still convinced the host wasn’t at fault, since all the other VMs still backed up fine, so I logged onto the VM in question and ran: vssadmin list writers

I received the following on the vm:

sqlserverwriter non-retryable error

Looking into the event viewer I saw:

Researching these errors online, I found several solutions saying to delete the old backup software. This server used to use another backup solution prior to Veeam called Altaro, which I was pretty sure I had removed a long time ago. I checked Add/Remove Programs and verified Altaro wasn’t listed. I even checked the VSS writers for any other backup software and found nothing. Running out of ideas, I checked Windows Backup to make sure it wasn’t running and that no backup jobs were listed. I then looked in Task Scheduler and found a few manual backup jobs listed. I disabled and deleted these jobs. I then restarted the SQL VSS writer service and verified that it showed no errors after re-running vssadmin list writers. I retried the Veeam backup and it failed once again.

Re-running vssadmin list writers, I received the same error. I was now convinced this was tied to the old scheduled backup tasks I had removed.

Next, I tried: vssadmin delete shadows /all

After running that command, I received:

Error: Snapshots were found, but they were outside of your allowed context.  Try
 removing them with the backup application which created them.

After much more research, I found an outside-the-box way of deleting the snapshots on another site.

How to Fix “outside of your allowed context” Errors

In order to get rid of these kinds of shadows we need to apply a “trick”. Basically the VSS diff area storage is where VSS keeps these shadows “alive”.

By seriously cutting this limit to the bare minimum we invoke a mechanism in VSS itself that causes it to dump all shadows.

So we proceed by telling VSS to cut the limit down to 401 MB. For some reason the user interface claims the minimum is 300MB, but on several versions of Windows it refuses that value and reports:

Error: Specified number is invalid

The command that works uses 401MB and is (adapt it to your drive letter as needed):

vssadmin resize shadowstorage /for=D: /on=D: /maxsize=401MB  

*****I ran this against the C: and D: drive of my VM*****

Then once you get “success” you can increase the limit once again to the recommended “unbounded” setting, or an actual limit value if you are using shadow copies for other purposes:

vssadmin resize shadowstorage /for=D: /on=D: /maxsize=unbounded

*****I ran this against the C: and D: drive of my VM*****

Then, vssadmin happily reports:

Successfully resized the shadow copy storage association

and a quick check using

vssadmin list shadows

reveals all VSS shadow copies are now gone!

I then re-ran the Veeam backup job against the VM and it ran successfully!

STOP 0x0000007B Resolved on P2V’d Windows SBS 2011

***The following was on a Hyper-V vm, but this also applies to VMware.***

****This should work on most versions of Windows (doesn’t have to be SBS)****

The other week we picked up a new client with an emergency issue. They had an SBS 2011 server on failing hardware. The hardware was so bad that we didn’t think it would last until the replacement server arrived. We had an older server with enough power to run their server virtualized until the new hardware arrived, so I started the virtualization process. This is where the fun began. (There were several minor issues, but I’ll stick to the major problem here.)

After creating the VM without any disk drives, I attached the newly created drives, powered up the VM and was greeted by the BSOD: STOP 0x0000007B.

Luckily there is an easy fix for this and you don’t need to redo the P2V.

  • Boot the VM off any Windows CD/DVD (Windows 7 and up; it doesn’t have to be the same OS as the VM). You could also mount the drive on the host or another VM; if you mount the drive, just run regedit there.
  • After booting off the OS CD, when you reach the language selection, press Shift-F10 for a command prompt
  • At the command prompt, run regedit
  • In regedit, highlight HKEY_LOCAL_MACHINE
  • With HKEY_LOCAL_MACHINE highlighted, go to File, then Load Hive
  • In Load Hive, select the drive letter where the Windows OS was installed (C: in this case), then go to: Windows\System32\config\system
  • Name the hive whatever you want (e.g. recovery)
  • Expand HKEY_LOCAL_MACHINE\recovery\ControlSet001\Services\intelide
  • Change the data for value “Start” from “3” to “0”
  • Now go to File and “Unload Hive” (if the option is unavailable, make sure the hive you loaded, recovery in this example, is highlighted)
  • Exit regedit and reboot the machine and you’re good to go

If you still have issues after reboot, check the following service keys (under the same Services key) and set each one’s “Start” value to:

Aliide = 3
Amdide = 3
Atapi = 0
Cmdide = 3
iaStorV = 3
intelide = 0
msahci = 3
pciide = 3
viaide = 3

Resolving issues after migrating Windows 7 to new hardware (BSOD Stop 7B 0x0000007B)

A while ago, I had a client that had purchased several of the same laptops for training purposes. Since all of the laptops were the same make and model, I set up 1 of the 10 as a master image that I had locked down so the trainees had limited access to the PC. Any changes made are automatically wiped after logout/reboot. For faster deployment of the laptops, I had created an image of the first laptop via Clonezilla (I am a big fan of Open Source).

A few years had gone by and there was an issue with one of the laptops. We checked the warranty status and found it was out of warranty. Rather than pay for repairs, it was cheaper to find a replacement on Amazon. Unfortunately, the one on Amazon had a different processor (not that big of a deal).

The new laptop arrived and I pushed the image out to it, but when it booted we were greeted with the BSOD Stop 7B 0x0000007B. Rather than reload and reconfigure Windows from scratch, I used a tool I had used in the past to help with this exact issue: fix_7hdc.vbs. To resolve this:

  1. Download fix_7hdc.vbs and copy the .vbs to a USB drive.
  2. Restart the PC.
  3. While the PC is restarting, keep tapping the F8 key.
  4. When the Advanced Boot Options menu appears, select “Repair Your Computer”.
  5. In that window, select “Command Prompt”.
  6. Insert your USB drive.
  7. To find the drive letter of your USB drive, type at the command prompt: wmic logicaldisk get name,description
  8. Once you have the drive letter, go to that drive, for example: e:
  9. Run the script via: cscript fix_7hdc.vbs /enable /search
  10. When the script is done, you are safe to reboot.
  11. Windows made it quite a bit further after the reboot, but it still had issues, so I rebooted into safe mode, logged in as the administrator and let Windows find and install the drivers it was able to on its own. When that completed, I rebooted into Windows, downloaded the rest of the needed drivers and installed the latest Windows updates.

Unable to activate BitLocker after imaging Surface Pro or Surface Book

I ran into the following error after pushing an image to a Microsoft Surface Book and configuring the imaged device for a new user. I tried to turn on BitLocker and immediately saw:

This device cannot use a Trusted Platform Module. Your administrator must set the “Allow BitLocker without a compatible TPM” option in the “Require additional authentication at startup” policy for OS volumes.

During the imaging process I had turned off TPM via BIOS, so I rebooted into BIOS and made sure TPM was enabled. Next I saved, exited BIOS and restarted. With TPM enabled in BIOS I did the following:

  1. Opened Device Manager (type Device Manager in the Start Menu).
  2. In Device Manager, look for “Security Devices” (if you don’t see “Security Devices”, click on “View” and “Show hidden devices”).
  3. Under Security Devices you should see “Trusted Platform Module 2.0” or similar.
  4. Right-click on that and select Properties.
  5. Mine showed the device was not detected.
  6. I then clicked Cancel (in the TPM Properties screen).
  7. I then right-clicked on the TPM module and selected “Uninstall device”.
  8. This required a reboot, which I did.
  9. After the reboot I checked Device Manager and the TPM was shown as working properly. I was then able to turn on and configure BitLocker.