Random jobs in matrix labeled as exceeding execution time, yet show uploaded test results

Dale Phurrough

20 Jul, 2019 01:38 PM

Hello. I am seeing that a random set of jobs are being labeled as failed with "Build execution time has reached the maximum allowed time for your plan (60 minutes)." However, I have other evidence that the jobs actually ran to completion.

The project uses nightly scheduled builds; each job is a compile followed by a ctest run. Jobs normally average 16 minutes. https://ci.appveyor.com/project/gnucashbuilder/gnucash-maint

Two nights ago, one job "failed". I see test results on the test result tab of that job, which indicates that the entire compile and test did run successfully. Yet the console tab of this same job does not have all the logs, and the job was labeled "failed exceeded time".

In last night's build, I see two jobs labeled as failed with "exceeded time". Both have some portion of console logs. One of the two has test results, which again indicates that the job actually ran to successful completion.

  1. 1 Posted by Owen McDonnell on 22 Jul, 2019 06:17 PM

    It looks like the job is not actually 'running to completion' since artifacts are not uploaded for those failed jobs.
    I admit the truncated console output is strange though.

    Have you tried to ssh to the build worker to see if you can find anything that might be holding up the build?

  2. 2 Posted by Dale Phurrough on 22 Jul, 2019 08:41 PM

    Oy vey. Job, build, compile...too many words that are unclear ;-)

    Reworded...
    I am seeing that a random set of Appveyor jobs-created-by-a-matrix within an Appveyor project are being labeled by Appveyor as failing. By failing I mean they are labeled RED in the "Jobs" tab of an Appveyor project, and the Appveyor project itself is labeled RED when any of its jobs-created-by-a-matrix has been labeled RED.

    I have a nightly schedule at the link I provided above. You can see all the data I can see. There is no other data to review. I cannot SSH into anything when this occurs, because there is no way to predict what will occur or when. Only some time after the fact is the job labeled RED, and only then can I see results.

    As you see in the three examples, the failures occur in various ways and at various times, which strongly suggests that they are not related to the scripts I run. Why, you ask? Because each job-created-by-a-matrix runs sequential steps. In summary, they are:
    1. Download a docker image
    2. Run that docker image
    3. Docker image compiles app within the container
    4. Docker image installs compiled app into a directory within the container
    5. Docker image compiles test app within the container
    6. Docker image runs test app within the container

    It is not possible to do steps 5 and 6 unless step 3 successfully completes. Impossible. Why? Because steps 5 and 6 require libraries and compiled binaries from step 3.
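
The dependency argument above can be sketched as a fail-fast sequence. This is only an illustration: the step names paraphrase the list above, and `run_pipeline`, `make_steps`, and the boolean stand-ins for each step are hypothetical, not the project's real scripts.

```python
from typing import Callable, List, Tuple

# Each step is a name plus a callable that returns True on success.
# The callables are hypothetical stand-ins for the real build scripts.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order, stopping at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            break  # later steps never start
        completed.append(name)
    return completed


def make_steps(compile_ok: bool) -> List[Step]:
    """Build the six-step sequence; only the compile outcome varies."""
    return [
        ("download image", lambda: True),
        ("run image", lambda: True),
        ("compile app", lambda: compile_ok),
        ("install app", lambda: True),
        ("compile tests", lambda: True),
        ("run tests", lambda: True),
    ]


print(run_pipeline(make_steps(compile_ok=False)))
print(run_pipeline(make_steps(compile_ok=True)))
```

In this model, "run tests" can only appear in the result when "compile app" is there too, which is the point being made: test results imply a finished compile.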

    Please click on https://ci.appveyor.com/project/gnucashbuilder/gnucash-maint/builds...
    This is one of the three examples. In this job-created-by-a-matrix, scroll down to the bottom of the console. Its last line is "[626/916] Build...".
    If that is truly the last line of the console, that indicates that the compile did not complete. It only did 626 of 916 steps. Also missing is the install (step 4 above). Also missing is the test compile (step 5) and the test run (step 6).
    Now at that same link, click on the Tests tab. There you can see 124 successful tests. That is impossible. Why? Because the console indicates that the compile didn't complete and steps 4-6 didn't run.
    The data within the Test tab is well formatted. This suggests that something somewhere somehow uploaded very well formatted test data.
    Yet the console has missing activity and Appveyor marks it red FAILED.
    This suggests that
    a) the steps 3-6 did actually occur. Step 6 successfully uploaded the 124 test results.
    b) the appveyor infrastructure failed to capture part of the log of step 3 and all log from steps 4-6.
    c) the appveyor infrastructure did not realize the successful end of steps 1-6. Appveyor hit the 60 minute time limit, and therefore marked it RED failed.
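
The mismatch described above (an unfinished compile in the console, yet test results on the Tests tab) can be expressed as a small check. This is a rough sketch, assuming the console text and the test count have already been obtained by some means; the function names are hypothetical, and the `[done/total]` pattern is taken only from the "[626/916] Build..." line quoted above.

```python
import re
from typing import Optional, Tuple

# Matches a progress marker such as "[626/916]" at the start of a line.
PROGRESS = re.compile(r"^\[(\d+)/(\d+)\]")


def last_progress(console_text: str) -> Optional[Tuple[int, int]]:
    """Return (done, total) from the last '[done/total]' line, if any."""
    result = None
    for line in console_text.splitlines():
        match = PROGRESS.match(line.strip())
        if match:
            result = (int(match.group(1)), int(match.group(2)))
    return result


def log_looks_truncated(console_text: str, tests_reported: int) -> bool:
    """True when tests were reported but the log ends mid-compile."""
    progress = last_progress(console_text)
    if progress is None:
        return False
    done, total = progress
    return tests_reported > 0 and done < total


# Values from the example above: last console line and 124 reported tests.
print(log_looks_truncated("[626/916] Build...", tests_reported=124))  # True
```

The check is a heuristic: it only flags logs whose final progress marker is incomplete, and says nothing about why the remaining output is missing.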

    These jobs-created-by-a-matrix all run the same compile and test steps; the only difference is that they run across 14 OS versions. I am not able to imagine a scenario within my codebase that would cause the Appveyor console to stop collecting the log yet continue to run the job-created-by-a-matrix and upload the test results.

    Last, you mention the artifacts. My code doesn't upload artifacts; that is Appveyor's code. When this random Appveyor failure hits the 60 minute time limit, I would expect the Appveyor code to abort and never reach its own code that does the artifact upload.

    I don't think I can help here. This seems an Appveyor infrastructure issue.

  3. 3 Posted by Owen McDonnell on 23 Jul, 2019 11:36 PM

    I think we're both on the same page about what "job" means. I admitted that the lack of console output while tests were obviously running was odd, but something is clearly holding up progress, as evidenced by the fact that the artifact upload stage is never reached. In any case, we'll investigate.

    It's hard to troubleshoot though, as you have variables set in the UI.
    Can you create a simplified repo that doesn't rely on any UI settings or passwords etc. that reproduces this problem and point me to it?

  4. 4 Posted by Dale Phurrough on 27 Jul, 2019 09:09 PM

    I do not have the bandwidth nor access to investigate a service-side issue. I don't have the tools nor access to your services to do it. I would be trying to see if Schrödinger's cat inside the black box is cleaning itself. I don't even know if the cat exists ;-)

    Internal monitoring and diagnostics can do it: monitor the examples I provided and stop when one of them trips. If at any time they time out, you have encountered the service bug. Capture the backend data you need, and then you can begin to isolate the cause.

    And you can also set up a trigger in your infrastructure. Like an assert() in C. It is impossible for this project to both upload test results and hit a build-stage timeout. If you ever encounter that, then it is a failure needing investigation.
    assert( !(gotTestResults && buildStageTimeout) )
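
As a rough sketch of such a trigger, the invariant stated above (a job should never both upload test results and hit the build-stage timeout) could be checked over a batch of job records. Everything here is hypothetical: `JobRecord`, its fields, and the job ids are invented for illustration and do not correspond to any real Appveyor API or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRecord:
    # Hypothetical fields; a real backend would supply its own schema.
    job_id: str
    got_test_results: bool
    build_stage_timeout: bool


def violates_invariant(job: JobRecord) -> bool:
    """True when a job both uploaded test results and timed out."""
    return job.got_test_results and job.build_stage_timeout


def jobs_to_investigate(jobs: List[JobRecord]) -> List[str]:
    """Return the ids of jobs that break the invariant."""
    return [job.job_id for job in jobs if violates_invariant(job)]


jobs = [
    JobRecord("job-a", got_test_results=True, build_stage_timeout=False),
    JobRecord("job-b", got_test_results=True, build_stage_timeout=True),
    JobRecord("job-c", got_test_results=False, build_stage_timeout=True),
]
print(jobs_to_investigate(jobs))  # ['job-b']
```

A timeout without test results ("job-c") is not flagged, since a job that genuinely hangs before the test step is a legitimate timeout rather than the contradiction described here.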

    I can do none of that because I have no access at the infrastructure level.
    FYI. I have not seen the failures in the last few days. You can always see the latest at the public project URL.
