Inexplicable timeouts

tony

10 Dec, 2014 09:56 PM

  1. 1 Posted by tony on 10 Dec, 2014 09:57 PM

    by "that we have run into lately," I meant "that we have run into locally"

  2. Support Staff 2 Posted by Feodor Fitsner on 10 Dec, 2014 10:01 PM

    Sure, will check it out.

    -Feodor

  3. 3 Posted by tony on 11 Dec, 2014 01:56 AM

    Here's another one that's running now and looks frozen https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.404...

  4. Support Staff 4 Posted by Feodor Fitsner on 11 Dec, 2014 02:09 AM

    Hm, right now there are only about 4 projects running on that server.

    At first, I thought there was high load on the server and you were switched to Azure, but no - all recent Julia builds were running on the new environment.

    Maybe it's a worker-AppVeyor communication issue (if it looks frozen). Can I look into the VM with a running Julia build to see if there are any SignalR errors?

    -Feodor

  5. 5 Posted by tony on 11 Dec, 2014 02:23 AM

    Feel free to do whatever instrumentation you want. I can contact Stefan as owner of the account if you need input from him.

    It seems to only happen on our 64-bit builds. There is a chance that some change in the Julia codebase could have introduced freezing during build/tests; I'm running locally on a range of commits to check.

    But in case it's some communication issue I do think it's worth looking into if you can.

  6. Support Staff 6 Posted by Feodor Fitsner on 11 Dec, 2014 02:26 AM

    Sure, will take a look.

    Though you could be right that some change affected x64, as the 32-bit builds seem to complete in time.

    -Feodor

  7. Support Staff 7 Posted by Feodor Fitsner on 11 Dec, 2014 04:12 AM

    Next time you see a build getting stuck, drop me a quick message - I'd like to see what's going on there. I can only do that while the build is running, as the VM is restored immediately afterwards.

  8. Support Staff 8 Posted by Feodor Fitsner on 11 Dec, 2014 04:13 AM

    Was watching this one: https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.408...

    By the end of the build, the Julia process was using 820 MB (75%) of RAM. CPU was around 40%. So it's probably neither CPU nor RAM.

  9. 9 Posted by tony on 11 Dec, 2014 04:34 AM

    https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.410... might be stuck? The last line I see is "profile.jl" which should only take a few seconds.

  10. 10 Posted by tony on 11 Dec, 2014 04:36 AM

    Oh, and that build is running on our release branch, which doesn't change anywhere near as dramatically as master.

  11. Support Staff 11 Posted by Feodor Fitsner on 11 Dec, 2014 06:04 PM

    Looking at the recently failing builds, you may notice that the ARCH=x86_64 job ran for a relatively short time (shown on the right) before getting stuck.

    https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.410... - 6 min
    https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.404... - 9 min
    https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.398... - 9 min
    https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.395... - 5 min
    https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.393... - 6 min

    What's interesting is that in all these cases the ARCH=i686 job was successful, with a pretty consistent run time of 13-16 min.

    Did that start when you moved to the new environment, or was it working there and then started to fail?

  12. 12 Posted by tony on 11 Dec, 2014 11:12 PM

    It's freezing either during the "system image build," the list of *.jl files that is the last build step before running the tests, or in one of the first few tests.

    It seems to me as though it had been a bit more reliable at first, but it may have been triggered by a code change. Since it's an intermittent problem it's quite difficult and time-consuming to try running git bisect on it.

  13. 13 Posted by tony on 12 Dec, 2014 04:44 AM

    I'm not so sure what's actually happening here. Some or all of these might be real Julia freezes that will have to be looked into if we can figure out what's causing them.

    In the meantime, maybe an optional mitigation feature of early timeouts when no output is received for, say, 10 minutes? That would at least make these hold up the build queue a bit less.

  14. Support Staff 14 Posted by Feodor Fitsner on 12 Dec, 2014 05:32 PM

    A log inactivity timeout is a great idea. I'll add a new issue for that. I'm not sure about a 10-minute cap though - maybe we should make that configurable. I don't know, but maybe there could be some "heavy" projects silently doing something for longer than 10 minutes :)
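
The log inactivity timeout discussed here could be sketched roughly as follows. This is only a minimal Python illustration of the idea, not AppVeyor's implementation: the function name `run_with_inactivity_timeout` and the `inactivity_timeout` parameter are hypothetical, and the sketch assumes the build is a single child process whose output can be read line by line.

```python
import queue
import subprocess
import threading


def run_with_inactivity_timeout(cmd, inactivity_timeout):
    """Run cmd and kill it if it prints nothing for inactivity_timeout seconds.

    Hypothetical helper for illustration only.
    Returns (exit_code, output_lines, timed_out).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # treat stderr as log output too
        text=True,
    )
    lines = queue.Queue()

    def pump():
        # Reader thread: forward each output line, then a None sentinel at EOF.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    output = []
    timed_out = False
    while True:
        try:
            # Wait at most inactivity_timeout seconds for the next log line.
            line = lines.get(timeout=inactivity_timeout)
        except queue.Empty:
            # No output within the window: assume the build is stuck.
            timed_out = True
            proc.kill()
            break
        if line is None:
            # The process closed its output: normal completion.
            break
        output.append(line.rstrip("\n"))

    exit_code = proc.wait()
    return exit_code, output, timed_out
```

Making `inactivity_timeout` a per-project setting would cover the configurable case: a project that legitimately runs silently for a long time could raise the value, while a build that stops printing early would be cut off well before the overall build timeout.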

  15. 15 Posted by tony on 12 Dec, 2014 07:53 PM

    Yeah, configurable would make sense. Some people like running their builds with logging disabled, so this might be a not-for-everyone option.

  16. 16 Posted by tony on 16 Dec, 2014 11:31 PM

  17. 17 Posted by tony on 17 Dec, 2014 12:20 AM

  18. Support Staff 18 Posted by Feodor Fitsner on 17 Dec, 2014 12:23 AM

    Let me see what's going on there.

  19. Support Staff 19 Posted by Feodor Fitsner on 17 Dec, 2014 12:37 AM

    julia.exe - 50% CPU and 266 MB RAM. Server memory: 1.1/1.7 GB

    Build started at 12:10, stalled at 12:17

    Found these errors in "Application" event log:

    Level | Date and Time | Source | Event ID | Task Category | Message
    Error | 12/17/2014 12:05:44 AM | SideBySide | 33 | None | Activation context generation failed for "c:\program files (x86)\microsoft visual studio 9.0\VC\bin\ia64\pgosweep.exe". Dependent Assembly Microsoft.VC90.CRT,processorArchitecture="ia64",publicKeyToken="1fc8b3b9a1e18e3b",type="win32",version="9.0.21022.8" could not be found. Please use sxstrace.exe for detailed diagnosis.
    Error | 12/17/2014 12:05:44 AM | SideBySide | 33 | None | Activation context generation failed for "c:\program files (x86)\microsoft visual studio 9.0\VC\bin\ia64\pgomgr.exe". Dependent Assembly Microsoft.VC90.CRT,processorArchitecture="ia64",publicKeyToken="1fc8b3b9a1e18e3b",type="win32",version="9.0.21022.8" could not be found. Please use sxstrace.exe for detailed diagnosis.
    Error | 12/17/2014 12:05:44 AM | SideBySide | 33 | None | Activation context generation failed for "c:\program files (x86)\microsoft visual studio 9.0\VC\bin\ia64\pgocvt.exe". Dependent Assembly Microsoft.VC90.CRT,processorArchitecture="ia64",publicKeyToken="1fc8b3b9a1e18e3b",type="win32",version="9.0.21022.8" could not be found. Please use sxstrace.exe for detailed diagnosis.
    
  20. 20 Posted by tony on 17 Dec, 2014 12:39 AM

    Thanks for looking into it!

    Does that happen in a successful build too? Like the i686 builds, or an x86_64 build that didn't freeze? That's really strange since we aren't building with Visual Studio 9 at all, and I don't think we do anything with any files named pgo* either.

  21. Support Staff 21 Posted by Feodor Fitsner on 17 Dec, 2014 12:41 AM

    Will try to catch an i686 build next time - I can only do that while it's running.

  22. 22 Posted by tony on 17 Dec, 2014 12:46 AM

    Okay, builds 644 and 645 should fail quickly: 644 is an already-merged pull request and 645 will be stopped by my code since there are other builds pending for the same PR. 646 will have an i686 build bug that should take about 5-10 minutes to get to.

  23. 23 Posted by tony on 17 Dec, 2014 01:14 AM

    https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.648... should be a normal successful i686 build, taking maybe 15 minutes or so

  24. Support Staff 24 Posted by Feodor Fitsner on 17 Dec, 2014 01:15 AM

    Will take a look.

    -Feodor

  25. Support Staff 25 Posted by Feodor Fitsner on 17 Dec, 2014 01:25 AM

    OK, false alarm. On the worker running i686, those 3 errors in the Windows event log were logged before the build started.

  26. Support Staff 26 Posted by Feodor Fitsner on 17 Dec, 2014 01:28 AM

    I noticed the i686 job runs 2 julia.exe processes while the x64 job runs only one?

  27. 27 Posted by tony on 17 Dec, 2014 01:30 AM

    Sometimes I run 2 julia.exe processes for doing the tests. This is less reliable on win64; sometimes running tests in parallel can freeze even locally. So I've sometimes had win64 running tests in serial, and sometimes tried my luck at parallel. But that type of freeze would look different: we would get "From worker 2:" and "From worker 3:" and one of the workers would get to the end, waiting at the "parallel" test for the other worker to finish.

    The freezing that's most common on AppVeyor is during an early stage, building the system image with the list of *.jl files, which runs on a single process right now.

  28. Ilya Finkelshteyn closed this discussion on 25 Aug, 2018 01:53 AM.
