Odd behaviour with a long running part of the build

v.f.baines's Avatar

v.f.baines

03 Oct, 2018 04:18 PM

Hi,

I've been trialing private build cloud as we have a couple of jobs that build tensorflow, which are really long running. I think I've got the local Hyper V setup working OK, maybe a small issue here and there with paths but not bad on the whole.

But, the long running job is still a little strange. If you look at most recent run here:
https://ci.appveyor.com/project/vb216/javacpp-presets/build/536/job/swer6y1qqu19j440

There's a known issue at that point that conv_ops.cc and conv_ops_3d.cc can take a couple of hours to build. So, no output to the terminal for that time while it's busy.

But right now, if I RDP to that VM, there is no CPU activity. Also a bit strange, there is no C:\Projects on that VM. I'm only running a single worker so its definitely the right instance.

Even stranger, if you look at an earlier run:
https://ci.appveyor.com/project/vb216/javacpp-presets/build/533
Tensorflow was timed out at 10hour (the limit I set). But there's no job output available at all?

If I reproduce running that job inside the VM via commandline myself, it compiles OK in around 3hours or so.

Is there something around there not being any output to the console, that could cause some unexpected behaviour? I think we've seen something similar with Travis and had to output a . character to the terminal every 5 minutes as a workaround

Thanks for the support

  1. Support Staff 1 Posted by Ilya Finkelshte... on 04 Oct, 2018 12:31 AM

    Ilya Finkelshteyn's Avatar

    Does this happens intermittently or only started happening after you changed jobs order?

  2. Support Staff 2 Posted by Ilya Finkelshte... on 04 Oct, 2018 01:56 AM

    Ilya Finkelshteyn's Avatar

    Also what you see in AppVeyor event log on both Hyper-V server and build VM when this log-less build starts and runs?

    You can find AppVeyor even log is located under Applications and Services Logs in Event viewer.

  3. 3 Posted by v.f.baines on 04 Oct, 2018 07:52 PM

    v.f.baines's Avatar

    Hmm.. Only thing that caught my eye in event logs is a DNS problem in the build VM instance, at the same time I have run out of free IP's - I'm using Sophos UTM and so limited to 50 active IP addresses.

    Maybe something there disrupted the intended communication flow. Unfortunately I need to wait for a while to come back under the 50 limit.

    Hard to say regarding the job order change - I moved them around since the really long running job was towards the end of the list, and I wanted to test that first. On the first run, I'd left a build time limit still in there so didn't get to see if it would complete in time - but, there was certainly output from the job.

    I'll give it another whirl as soon as I can. Thanks for the suggestion of checking events log.

  4. Support Staff 4 Posted by Ilya Finkelshte... on 04 Oct, 2018 10:08 PM

    Ilya Finkelshteyn's Avatar

    I am not familiar with Sophos UTM, but from what you describe this can be very well the cause of the problem. As you see first build runs OK, and subsequent are very silent. If some networking services (like name resolution) are not provided because of IP limit is exhausted, newly created build VM (with probably new IP) will keep retrying, which you will likely see in AppVeyor log on build VM.

  5. 5 Posted by v.f.baines on 05 Oct, 2018 10:37 AM

    v.f.baines's Avatar

    Ah, sadly either that was a red herring, or there's another problem lurking as well.

    If you take a look at current build https://ci.appveyor.com/project/vb216/javacpp-presets/build/job/aau3tlogj4yq04sg now, I set this off before work this morning, and could see it was outputting as normal. I checked mid morning, and was still going OK - and it was sat on the long compile step that I mentioned in first post.

    I just looked now, and the log output is now no longer visible. Unfortunately I can't login to the VM as I'm not at home now, but this looks a repeat of the same case as before. I actually wonder, on the original run before I changed job order, it probably timed out before it got to this state.

    It's like either something goes wrong because there has been no output from the terminal for more than about 1.5 hours, or the log output has got too large

    Will check event log when I get home + update. I've been running the host with debug level output set in the config file, and there was nothing output there apart from the usual kind of keep-alive checks that seem to go on.

  6. Support Staff 6 Posted by Ilya Finkelshte... on 05 Oct, 2018 06:52 PM

    Ilya Finkelshteyn's Avatar

    Is it expected in your build scenario that it will be no output from terminal for 1.5 hours, e.g. something will build silently without any output?

    Also I am not sure I understood what "host with debug level output set in the config file" you mention? You mean build process should generate debug level output?

    In any case I believe that AppVeyor logs from build VM should help us. Note that VM is being re-created for every build so you need to connect to VM during that silent stuck build.

  7. 7 Posted by v.f.baines on 06 Oct, 2018 01:24 PM

    v.f.baines's Avatar

    So in the current build setup, yeah, the tensorflow project we build against has a couple of steps where just compiling one part does indeed run for a couple of hours with no output. What I've done is just added a while loop in the background to output a . every minute, and the output is here https://ci.appveyor.com/project/vb216/javacpp-presets/build/538/job/3955p6yvs73voxtx

    You can see there, it ran for about 71 minutes after that last compile line. But then things get a little stranger! It just caught my eye, that the original VM was shutdown, even though appveyor is saying the job is still running. And, if you substract the time that the new VM has been running from the total runtime in the screenshots, it looks like the first one ran for exactly 2 hours, then for some reason a new one fired up. That might explain why I can't see any c:\projects dir in the newly spun up VM.

    What I meant about the debug level, was I modified the Appveyor.HostAgent.exe.config file Log level to 4, to see if there was anything more useful output.

    So it's looking to me like something is terminating the VM at the two hour point, which I guess then breaks the rest of the build. Will check more on my side, could be something in the evaluation version of server2012 I guess.

    There might be a better approach we could take too, I think we could compile this long running dependency in a docker image outside of the CI process, and let Appveyor use that, could hugely reduce our build time.

  8. 8 Posted by v.f.baines on 07 Oct, 2018 06:48 PM

    v.f.baines's Avatar

    Ah, cracked it I think! So it wasn't quite 2 hours that the shutdown happened at, it was 1h50 mins - 110 mins which is/was the overall limit on my account. I asked about it a week or so ago, as the first time I tried to build on private cloud, the long running job aborted at 1h50 - the old limit (https://ci.appveyor.com/project/vb216/javacpp-presets/builds/19151759/job/9sif1b4vbd2v1489). So, I updated the project specific value in build limit time for private cloud, and set it to 600 minutes.

    I modified the delete environment power script to just echo it was called, so I could capture what was going on.
    And, after 110 minutes this appears in the log:
     [worker-213-001] Job ioqwmnkmuthim5rc has not finished in 110 minutes. Reverting worker...
    Because of the script modification, it couldn't actually delete the VM, and the build continues OK.

    So, looks like the account build time limit is still taking precedence over the project build limit in some way?

  9. Support Staff 9 Posted by Ilya Finkelshte... on 08 Oct, 2018 06:08 PM

    Ilya Finkelshteyn's Avatar

    Ah, you are correct. Sorry, it appears to be a recent regression with Hyper-V private build cloud does not respect this timeout setting. Just got another report from different customer. Sorry for having you spending a lot of time investigating that.

    Let us continue over email, I will reply to Samuel.

  10. Ilya Finkelshteyn closed this discussion on 08 Dec, 2018 09:00 PM.

Comments are currently closed for this discussion. You can start a new one.

Keyboard shortcuts

Generic

? Show this help
ESC Blurs the current field

Comment Form

r Focus the comment reply box
^ + ↩ Submit the comment

You can use Command ⌘ instead of Control ^ on Mac