google cloud platform – Cloud Composer GKEStartPodOperator status is failed for long-running jobs

I have implemented an Airflow DAG built around the GKEStartPodOperator wrapper. This is the implementation of the operator:

# Import path for the apache-airflow-providers-google package
from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

pod1 = GKEStartPodOperator(
    task_id="pod-1",
    name="pod-1",
    project_id="Project1",       # placeholder project
    location="Zone1",            # placeholder zone/region of the cluster
    cluster_name="Cluster1",     # placeholder cluster name
    namespace="default",
    retries=2,
    image_pull_policy="Always",
    image="Image:v1")            # placeholder image

The problem is that for longer-running jobs, Cloud Composer first marks the task as failed and only changes it to complete once the job actually finishes, at which point the run time is recorded. However, for a couple of jobs the actual run time was not captured, even though the status did change from failed to complete.
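
In case it is useful for comparison, the pod's phase as Kubernetes reports it can be checked directly against what Composer shows, for example with the official kubernetes Python client (the pod name and namespace below are placeholders):

from kubernetes import client, config

# Assumes credentials for the GKE cluster are already available locally,
# e.g. after running `gcloud container clusters get-credentials`.
config.load_kube_config()

v1 = client.CoreV1Api()
# "pod-1-abc123" and "default" are placeholders for the actual pod name and namespace.
pod = v1.read_namespaced_pod(name="pod-1-abc123", namespace="default")
print(pod.status.phase)  # Pending / Running / Succeeded / Failed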

I have tried the following, but to no avail:

  1. I have tried removing the retries, since they might have been affecting contact with the PID. However, that just resulted in the pod retrying multiple times whenever the job status was detected as failed (both variants are sketched after this list).
  2. I have tried running the pods in different clusters, set up to connect to clusters across the same or different VPN and namespaces, to see whether the pods behave differently when spread across clusters.
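
For reference, the variants from the two attempts above boiled down to operator definitions like these (a sketch with placeholder names; "Project2", "Zone2", "Cluster2", and "other-namespace" stand in for the actual alternate cluster details):

# Attempt 1: same operator with the retries argument removed
pod1_no_retries = GKEStartPodOperator(
    task_id="pod-1",
    name="pod-1",
    project_id="Project1",
    location="Zone1",
    cluster_name="Cluster1",
    namespace="default",
    image_pull_policy="Always",
    image="Image:v1")

# Attempt 2: same image pointed at a different cluster and namespace
pod1_other_cluster = GKEStartPodOperator(
    task_id="pod-1-other-cluster",
    name="pod-1",
    project_id="Project2",
    location="Zone2",
    cluster_name="Cluster2",
    namespace="other-namespace",
    retries=2,
    image_pull_policy="Always",
    image="Image:v1")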

I am curious whether anyone has insights into this problem:

  1. How is such a situation possible and what could have failed?
  2. Are there any recommended changes to the pod operator definition that could help?
  3. I have considered the KubernetesPodOperator (documentation found here), but there is a requirement to use the GKEStartPodOperator. That said, if a solution does arise from using the KubernetesPodOperator, it can be considered; a rough sketch is included after this list.
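
As mentioned in question 3, an equivalent KubernetesPodOperator definition might look roughly like this (a sketch only: the import path depends on the cncf.kubernetes provider version, and unlike GKEStartPodOperator this operator does not fetch the GKE credentials itself, so the config_file path shown is a placeholder for whatever kubeconfig or connection handling would be used):

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

pod1_kpo = KubernetesPodOperator(
    task_id="pod-1",
    name="pod-1",
    namespace="default",
    retries=2,
    image_pull_policy="Always",
    image="Image:v1",
    config_file="/path/to/kubeconfig")  # placeholder for the cluster credentials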

Thank you all in advance!
