VM jobs run indefinitely #152

magsol · 2018-06-06T19:05:39Z

We're testing out a new autograder image on AutoLab, but after making a submission and the autograder spinning up, the launched job simply runs forever.

Expected Behavior

I would expect the job to eventually halt (particularly around the "Timeout" interval set in the autograder, which is 360 seconds in our case), and a Runtime Trace to be available showing the entire command and its output.
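
For illustration, the behavior expected here (stop the job once the timeout elapses, and still return whatever it printed) can be sketched in Python as below. This is only a minimal sketch, not Tango's actual implementation: run_with_timeout is a hypothetical helper, and it assumes the grading job is a single child process.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; kill it if it exceeds timeout_s seconds.

    Returns (status, output). status is the exit code, or None if the
    command was killed because it hit the timeout.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the same trace
            text=True,
            timeout=timeout_s,
        )
        return done.returncode, done.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may come back as bytes or None, so normalize it.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return None, out
```

With the settings from this report, the call would look like run_with_timeout(cmd, 360), where cmd is whatever command launches the autograder.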

Actual Behavior

Each time I refresh the status page, all the "time elapsed" columns increment, indicating that the jobs are still running. However, when I actually SSH into the running Tango container, I don't see any containers listed when I run docker ps -a.

Are these jobs actually running? Or is it some bug in the database? Either way, how can I stop them?
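
One way to answer the "are these jobs actually running?" question is to list every container with its status and see whether any are still up. The sketch below wraps docker ps -a with its --format option. The helper names are hypothetical, and it assumes the docker CLI is on the PATH of the machine where Tango launches its containers.

```python
import subprocess


def parse_docker_ps(text):
    """Parse 'name<TAB>status' lines into a {name: status} dict."""
    containers = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        containers[name.strip()] = status.strip()
    return containers


def running_names(containers):
    """Names of containers whose status string starts with 'Up'."""
    return sorted(n for n, s in containers.items() if s.startswith("Up"))


def list_containers():
    """Ask the local Docker daemon for all containers, running or not."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_docker_ps(out)
```

If running_names(list_containers()) comes back empty while the status page still shows elapsed time increasing, the job entries are likely stale rather than backed by a live container.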

Steps to Reproduce the Behavior

???

Honestly not sure. The autograder settings have a 360 second timeout set, so why these jobs are still running is beyond me. There's no discernible output for any of the jobs; under the "Runtime Trace" for each job, it has entries for adding the job to the queue and eventually the "Job [x] started", but that's the very last one. Nothing after that; no debugging output.


skoch9 · 2018-12-11T10:21:28Z

We encounter the same problem: the Docker grading container is no longer running, but the job is not deleted from the Tango job queue and therefore blocks the queue.

nitsanshai · 2019-02-04T20:19:20Z

Was this issue resolved? I'm hitting the same problem with the default image (ubuntu).

magsol · 2019-02-04T20:21:36Z

No, it wasn't. Still no answer on our end for what was going on.

nitsanshai · 2019-02-04T20:34:10Z

Were you able to get this to work with any image? What did you have to do for the default image?

devanshk · 2019-02-06T20:54:47Z

Hm, just looking at the symptoms, it looks like:

Autolab successfully sends a job request to Tango
Autolab is waiting for Tango to trigger a callback with the results of the job
Tango must have received the job
Tango either fails to add the job, or completes it but fails to communicate the result back to Autolab.

Can you give me some more context on your setup? Have you had autograders run successfully in the past? Did you change anything recently? Are Autolab and Tango deployed on the same machine?

magsol · 2019-02-07T17:47:46Z

@nitsanshai AutoLab has actually worked fine for me for over two years; the problem only started when we significantly overhauled the Docker image used by the autograder. I unfortunately don't have immediate access to that image anymore (I could probably drum it up, though) but that was the only thing we changed prior to this behavior.

@devanshk Yes, we've had autograders run successfully in the past. It was only when we added an entirely new Docker image for the autograder (one with significant JVM/Scala dependencies) that we observed these problems. That was the only change. Otherwise it's a vanilla one-click install on a single machine, and has otherwise been working fine with no issues.

devanshk · 2019-02-07T18:29:00Z

Could you join our Slack channel? It would be easier to debug this 1-on-1 and post our findings back here.

https://autolab-slack.herokuapp.com

magsol · 2019-02-07T18:35:59Z

Sure, but I can't really debug this in the short-term. I filed this ticket in June of last year when these issues cropped up, but we had to move past it months ago. I'm actively using AutoLab for my course right now and that prevents me from doing any live debugging on this issue until the semester is over.

devanshk · 2019-02-07T20:28:33Z

If you're able to create a test course, we could try things out there. Or, if you send me your Dockerfile, I can replicate your bug on my end and experiment with it.

pratikbin · 2021-07-23T11:37:58Z

Running docker-compose from a fresh setup, I created a course with the hello lab as per the docs, and when I submitted my assignment for autograding it showed me

So I ran docker pull autolabproject/autograding_image, and then it worked and the job went into the queue, but now it sits in the queue forever and autograding never executes, as in the original report:

We're testing out a new autograder image on AutoLab, but after making a submission and the autograder spinning up, the launched job simply runs forever.

pratikbin · 2021-07-23T12:49:27Z

I restarted the whole docker-compose stack, and now it seems to be working, but I got status 125... checking further.

Runtime Trace

    2021-07-23 18:14:06 +0530 | Added job devops-00-02_hello_3_pratik@xxx:1 to queue
    2021-07-23 18:14:08 +0530 | Dispatched job devops-00-02_hello_3_pratik@xxx:1 [try 0]
    2021-07-23 18:14:08 +0530 | Assigned job devops-00-02_hello_3_pratik@xxxx:1 existing VM prod-1001-autograding_image
    2021-07-23 18:14:08 +0530 | Job devops-00-02_hello_3_pratik@xxxx:1 waiting for VM prod-1001-autograding_image
    2021-07-23 18:14:08 +0530 | VM prod-1001-autograding_image ready for job devops-00-02_hello_3_pratik@xxxx:1
    2021-07-23 18:14:08 +0530 | Input copied for job devops-00-02_hello_3_pratik@xxxx:1 [status=0]
    2021-07-23 18:14:12 +0530 | Job devops-00-02_hello_3_pratik@xxxx:1 executed [status=125]
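
To pick the failing step out of a trace like the one above, a small parser over the [status=N] suffixes is enough. This is a rough sketch with hypothetical helper names, and it assumes the trace lines keep the "timestamp | event [status=N]" shape shown here. If the reported status is the raw exit code of docker run (an assumption, not confirmed in this thread), then 125 is the code docker run uses when the Docker daemon itself reports an error, rather than a failure inside the grading script.

```python
import re

# Matches "... | <event text> [status=<int>]" at the end of a trace line.
STATUS_RE = re.compile(
    r"\|\s*(?P<event>.*?)\s*\[status=(?P<status>-?\d+)\]\s*$"
)


def trace_statuses(trace):
    """Return [(event, status)] for every trace line carrying a status."""
    results = []
    for line in trace.splitlines():
        match = STATUS_RE.search(line)
        if match:
            results.append((match.group("event"), int(match.group("status"))))
    return results


def first_failure(trace):
    """Return the first (event, status) with a nonzero status, or None."""
    for event, status in trace_statuses(trace):
        if status != 0:
            return event, status
    return None
```

Run against the trace above, first_failure would point at the "executed" line rather than the "Input copied" line, which narrows the problem to the container launch or run step.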

umar221b · 2023-06-01T06:14:42Z

@pratikbin did you manage to figure out what was going on? I seem to be having the same problem locally.

damianhxy · 2023-06-02T07:54:13Z

Unfortunately, this seems to be a sporadic issue that we've yet to fully resolve. Some related PRs include #227 and #228.

If you could share the lab / image that you're using, I could take a look.

victorhuangwq added the important enhancement label May 16, 2020

victorhuangwq added Priority: High Type: Bug and removed important enhancement labels Jul 8, 2020

victorhuangwq linked a pull request Jul 8, 2020 that will close this issue

Temporary PR to integrate PDL's mass of bug fixes and enhancements #148

Open
