[BUG] New node joined to the cluster is repeatedly upgraded, spawning 100+ apply-hvst-upgrade pods
Describe the bug
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
hv-n0 Ready control-plane,etcd,master 45h v1.24.7+rke2r1 10.16.166.21 <none> Harvester v1.1.1 5.3.18-150300.59.101-default containerd://1.6.8-k3s1
hv-n1 Ready <none> 42h v1.24.7+rke2r1 10.16.166.22 <none> Harvester v1.1.1 5.3.18-150300.59.101-default containerd://1.6.8-k3s1
$ kubectl get jobs -A
NAMESPACE NAME COMPLETIONS DURATION AGE
cattle-monitoring-system rancher-monitoring-admission-create 1/1 2m37s 43h
cattle-system apply-hvst-upgrade-mzr2w-prepare-on-hv-n1-with-a43953579a-d70ed 0/1 2m21s 2m21s
harvester-system default-vlan1 1/1 2s 44h
harvester-system hvst-upgrade-mzr2w-apply-manifests 1/1 102s 43h
harvester-system hvst-upgrade-mzr2w-single-node-upgrade-hv-n0 1/1 10m 43h
harvester-system hvst-upgrade-xkpcm-apply-manifests 0/1 43h 43h
harvester-system hvst-upgrade-xkpcm-single-node-upgrade-hv-n0 1/1 30m 43h
kube-system helm-install-rke2-canal 1/1 7s 42h
kube-system helm-install-rke2-coredns 1/1 13s 43h
kube-system helm-install-rke2-ingress-nginx 1/1 17s 98m
kube-system helm-install-rke2-metrics-server 1/1 20s 43h
kube-system helm-install-rke2-multus 1/1 13s 43h
longhorn-system longhorn-post-upgrade 1/1 3m7s 43h
$ kubectl logs jobs/apply-hvst-upgrade-mzr2w-prepare-on-hv-n1-with-a43953579a-d70ed -n cattle-system -f
Found 14 pods, using pod/apply-hvst-upgrade-mzr2w-prepare-on-hv-n1-with-a43953579a-5rb2x
+++ dirname /usr/local/bin/upgrade_node.sh
++ cd /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/lib.sh
++ UPGRADE_NAMESPACE=harvester-system
++ UPGRADE_REPO_URL=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso
++ UPGRADE_REPO_VM_NAME=upgrade-repo-hvst-upgrade-mzr2w
++ UPGRADE_REPO_RELEASE_FILE=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso/harvester-release.yaml
++ UPGRADE_REPO_SQUASHFS_IMAGE=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso/rootfs.squashfs
++ UPGRADE_REPO_BUNDLE_ROOT=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso/bundle
++ UPGRADE_REPO_BUNDLE_METADATA=http://upgrade-repo-hvst-upgrade-mzr2w.harvester-system/harvester-iso/bundle/metadata.yaml
++ CACHED_BUNDLE_METADATA=
++ HOST_DIR=/host
+ UPGRADE_TMP_DIR=/host/usr/local/upgrade_tmp
+ mkdir -p /host/usr/local/upgrade_tmp
+ case $1 in
+ command_prepare
+ wait_repo
+ local repo_vm_status
++ kubectl get virtualmachines.kubevirt.io upgrade-repo-hvst-upgrade-mzr2w -n harvester-system '-o=jsonpath={.status.printableStatus}'
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-mzr2w" not found
+ repo_vm_status=
$ kubectl logs jobs/hvst-upgrade-xkpcm-apply-manifests -n harvester-system
...
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xkpcm" not found
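The job list and logs above reference two different upgrade IDs: the prepare job targeting hv-n1 belongs to hvst-upgrade-mzr2w, while the apply-manifests job stuck at 0/1 belongs to hvst-upgrade-xkpcm. As a rough triage aid, the sketch below pulls the distinct upgrade IDs out of a job listing. The function name `upgrade_ids` is hypothetical, and it only assumes that job names embed an ID of the form `hvst-upgrade-<id>`, as they do in this output.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct hvst-upgrade-<id> found on stdin.
# Assumes names embed the ID as "hvst-upgrade-" followed by lowercase
# letters and digits, as seen in the job list in this report.
upgrade_ids() {
  grep -oE 'hvst-upgrade-[a-z0-9]+' | sort -u
}

# Usage against a live cluster (not run here):
#   kubectl get jobs -A --no-headers | upgrade_ids
```

Run against the job list in this report, it should show both mzr2w and xkpcm, which makes it easier to see that more than one upgrade is still being acted on.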
To Reproduce
- Create a new single-node cluster (node-0) on v1.0.2.
- Set up an NFS backup target and load all existing VM backups.
- Restore one VM backup to the test environment: success.
- Upgrade the cluster to v1.0.3: success.
- Test restore: success.
- Upgrade the cluster to v1.1.1: success.
- Install a new node-1 with v1.1.1 and join it to the cluster: success.
- Restore a VM to node-1: success.
After a day of running, I found node-1 full of pods in the Error state. I tried:
- Deleting all of the error pods. They were all recreated within 1 minute.
- Deleting jobs/apply-hvst-upgrade-mzr2w-prepare-on-hv-n1-with-a43953579a-d70ed. It was recreated.
I may try deleting jobs/hvst-upgrade-xkpcm-apply-manifests tomorrow.
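To quantify how many apply-hvst-upgrade pods pile up before and after each deletion attempt, a small filter like the one below can help. It is only a sketch: the function name `count_upgrade_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods --no-headers` for a single namespace (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Hypothetical helper: count apply-hvst-upgrade-* pods per STATUS.
# Expects `kubectl get pods -n <namespace> --no-headers` output on stdin,
# where column 1 is the pod name and column 3 is the status.
count_upgrade_pods() {
  awk '$1 ~ /^apply-hvst-upgrade-/ { n[$3]++ }
       END { for (s in n) print s, n[s] }' | sort
}

# Usage against a live cluster (not run here):
#   kubectl get pods -n cattle-system --no-headers | count_upgrade_pods
```

Comparing the counts a few minutes apart would show whether the pods are still being recreated.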
Expected behavior
node-1 should not be upgraded, because it is already running v1.1.1.
Environment
- Harvester ISO version: v1.1.1
- Underlying Infrastructure: Dell R750