[BUG] After a new node joins the cluster, it is upgraded with 100+ apply-hvst-upgrade pods.

Describe the bug
Running Harvester v1.1.1; the newly joined node ended up with 136 apply-hvst-upgrade pods.
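A rough way to count these pods (a sketch, assuming the prepare pods land in the cattle-system namespace, like the job shown further below):

$ kubectl get pods -n cattle-system --no-headers | grep -c apply-hvst-upgrade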

$ kubectl get nodes -o wide
NAME    STATUS   ROLES                       AGE   VERSION          INTERNAL-IP    EXTERNAL-IP   OS-IMAGE           KERNEL-VERSION                 CONTAINER-RUNTIME
hv-n0   Ready    control-plane,etcd,master   45h   v1.24.7+rke2r1   10.16.166.21   <none>        Harvester v1.1.1   5.3.18-150300.59.101-default   containerd://1.6.8-k3s1
hv-n1   Ready    <none>                      42h   v1.24.7+rke2r1   10.16.166.22   <none>        Harvester v1.1.1   5.3.18-150300.59.101-default   containerd://1.6.8-k3s1

$ kubectl get jobs  -A
NAMESPACE                  NAME                                                              COMPLETIONS   DURATION   AGE
cattle-monitoring-system   rancher-monitoring-admission-create                               1/1           2m37s      43h
cattle-system              apply-hvst-upgrade-mzr2w-prepare-on-hv-n1-with-a43953579a-d70ed   0/1           2m21s      2m21s
harvester-system           default-vlan1                                                     1/1           2s         44h
harvester-system           hvst-upgrade-mzr2w-apply-manifests                                1/1           102s       43h
harvester-system           hvst-upgrade-mzr2w-single-node-upgrade-hv-n0                      1/1           10m        43h
harvester-system           hvst-upgrade-xkpcm-apply-manifests                                0/1           43h        43h
harvester-system           hvst-upgrade-xkpcm-single-node-upgrade-hv-n0                      1/1           30m        43h
kube-system                helm-install-rke2-canal                                           1/1           7s         42h
kube-system                helm-install-rke2-coredns                                         1/1           13s        43h
kube-system                helm-install-rke2-ingress-nginx                                   1/1           17s        98m
kube-system                helm-install-rke2-metrics-server                                  1/1           20s        43h
kube-system                helm-install-rke2-multus                                          1/1           13s        43h
longhorn-system            longhorn-post-upgrade                                             1/1           3m7s       43h

$  kubectl logs  jobs/apply-hvst-upgrade-mzr2w-prepare-on-hv-n1-with-a43953579a-d70ed -n cattle-system  -f
Found 14 pods, using pod/apply-hvst-upgrade-mzr2w-prepare-on-hv-n1-with-a43953579a-5rb2x
+++ dirname /usr/local/bin/upgrade_node.sh
++ cd /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/lib.sh
++ UPGRADE_NAMESPACE=harvester-system
++ UPGRADE_REPO_URL=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso
++ UPGRADE_REPO_VM_NAME=upgrade-repo-hvst-upgrade-mzr2w
++ UPGRADE_REPO_RELEASE_FILE=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso/harvester-release.yaml
++ UPGRADE_REPO_SQUASHFS_IMAGE=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso/rootfs.squashfs
++ UPGRADE_REPO_BUNDLE_ROOT=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso/bundle
++ UPGRADE_REPO_BUNDLE_METADATA=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso/bundle/metadata.yaml
++ CACHED_BUNDLE_METADATA=
++ HOST_DIR=/host
+ UPGRADE_TMP_DIR=/host/usr/local/upgrade_tmp
+ mkdir -p /host/usr/local/upgrade_tmp
+ case $1 in
+ command_prepare
+ wait_repo
+ local repo_vm_status
++ kubectl get virtualmachines.kubevirt.io upgrade-repo-hvst-upgrade-mzr2w -n harvester-system '-o=jsonpath={.status.printableStatus}'
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-mzr2w" not found
+ repo_vm_status=
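The VM the prepare script is polling for can also be checked directly; a minimal sketch (the VM name is taken from the log above and is specific to this upgrade):

$ kubectl get virtualmachines.kubevirt.io -n harvester-system
# upgrade-repo-hvst-upgrade-mzr2w does not appear here, matching the NotFound error above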

$ kubectl logs jobs/hvst-upgrade-xkpcm-apply-manifests  -n harvester-system
...
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
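Whether the older hvst-upgrade-xkpcm upgrade object is still hanging around can be checked like this (a sketch, assuming Harvester's upgrades.harvesterhci.io resources live in harvester-system):

$ kubectl get upgrades.harvesterhci.io -n harvester-system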

To Reproduce

  1. Create a new v1.0.2 cluster with a single node (node-0).
  2. Set up the NFS backup target and load all old VM backups.
  3. Restore one VM backup to the test environment; the restore succeeds.
  4. Upgrade the cluster to v1.0.3; success.
  5. Test a restore; success.
  6. Upgrade the cluster to v1.1.1; success.
  7. Install a new node (node-1) with v1.1.1 and join it to the cluster; success.
  8. Restore a VM to node-1; success.

After a day of running, I found node-1 full of error pods. I tried the following:

  • Deleting all the error pods. They were all recreated within a minute.
  • Deleting jobs/apply-hvst-upgrade-mzr2w-prepare-on-hv-n1-with-a43953579a-d70ed. It was recreated as well.

Maybe I will delete jobs/hvst-upgrade-xkpcm-apply-manifests tomorrow.
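The recreation makes sense if the jobs' parent objects still exist: job names of the form apply-<plan>-on-<node>-with-<hash> are normally generated by the system-upgrade-controller from Plan resources, so deleting the job alone is not enough. A hedged way to look at those plans (assuming they sit in cattle-system, where the job itself runs):

$ kubectl get plans.upgrade.cattle.io -n cattle-system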

Expected behavior
node-1 should not be upgraded; it is already running v1.1.1.

Environment

  • Harvester ISO version: v1.1.1
  • Underlying Infrastructure: Dell R750
