Did recent changes apply to possibly slow down builds?

acrichton

12 Nov, 2018 03:20 PM

Over in the rust-lang/rust GitHub repository our CI has recently (over the past week or so) started timing out on AppVeyor far more than it used to. Our builds run with a maximum of 3 hours and typically come in around 2 hours, so they're unfortunately somewhat sensitive to timing.

Over the past week or so we've been seeing an elevated number of our builds timing out. Analyzing the build logs and comparing them to successful runs that didn't time out tends to show blanket slowdowns across the board: basically all the granular steps of the build take longer than they previously did, with no single point that accounts for the slowdown. To work around this on our end we typically retry the build (re-enqueue it), and the second time the build happens (with effectively the same code) it often runs much faster, completing in the allotted time.

In investigating this we wanted to reach out to see if y'all know if something may possibly be awry? Were there any underlying changes that happened in the past week or two which might affect this? If not, we'll keep digging!

An example we have of this is here:

The job that previously took 3 hours took under 2 hours afterwards. Although the commits were different, the contents of the two commits that were tested should have been exactly the same.

  1. Posted by Ilya Finkelshteyn on 12 Nov, 2018 06:41 PM

    Hi Alex,

    At the end of last week we did find a node with degraded performance and replaced it over the weekend. It should be back to normal now, but I see that this and this build still failed, and that happened on presumably healthy nodes.

    We do not see a performance degradation on our side, but we will investigate more.

    Meanwhile, we increased your build timeout to 4 hours. If you would rather fail fast, you can decrease it on the General tab of the project settings, under Build timeout, minutes. Also, a new button called RE-RUN INCOMPLETE is coming soon (hopefully this week or so), which will let you re-run only the failed and cancelled jobs in a matrix.

    Also, a build worker image update happened on Friday evening. Do you think any of those changes could affect performance?

    Finally, I would recommend running the jobs that fail often on the Visual Studio 2017 image and letting us know if they behave differently. You can set the image quite granularly with the APPVEYOR_BUILD_WORKER_IMAGE environment variable, as described here. You are already selecting Visual Studio 2017 Preview this way.
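
    As a rough sketch of what per-job image selection could look like in appveyor.yml: the fragment below assumes the jobs are listed under environment.matrix, and the JOB_NAME variable and its values are hypothetical placeholders, not names taken from the rust-lang/rust configuration. Only APPVEYOR_BUILD_WORKER_IMAGE and the two image names come from this thread.

    ```yaml
    # Illustrative fragment only; JOB_NAME and its values are hypothetical.
    environment:
      matrix:
        # A job that times out often, moved to the Visual Studio 2017 image.
        - JOB_NAME: frequently-timing-out-job
          APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017
        # A job left on the Visual Studio 2017 Preview image for comparison.
        - JOB_NAME: comparison-job
          APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017 Preview
    ```

    Setting the variable per matrix entry rather than globally makes it possible to compare the two images on otherwise similar jobs.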

    Please keep in touch and let us know what you find.

    On our side, we are now looking deeply into performance issues and plan some major datacenter upgrades and migrations in the near future.

    Ilya.

  2. Posted by acrichton on 12 Nov, 2018 09:52 PM

    Ok thanks for all the information! The increased timeout will hopefully help for now (thanks!) and we'll keep our eyes peeled on our end.

  3. Posted by acrichton on 13 Nov, 2018 07:14 PM

    Do y'all perhaps have statistics on whether the VMs that we're running on are shared with other, possibly high-CPU, workloads? (or maybe even our own workloads?)

    Comparing this 1h36m build with this 3h17m build, the build got roughly 2x slower with very similar code being tested. Our own analysis shows that building the Rust compiler, a very CPU-intensive workload, was nearly 40% slower in the latter build than in the previous build (compiling the Rust compiler does a little I/O but is almost always bound by CPU/memory).

    A build that is still running is already over an hour slower than the previous build as well :(. If y'all have any data you can share about the hosting environment, and about whether we might be on noisy machines, to help explain this, that'd be much appreciated!

  4. Posted by acrichton on 13 Nov, 2018 07:15 PM

    Er, sorry, I meant to mention this earlier, but the build image update doesn't seem like it'd affect us much; it's mostly compiler toolchain revisions and/or runtime updates which tend to affect our builds the most.

  5. Posted by Ilya Finkelshteyn on 13 Nov, 2018 09:31 PM

    We have been experiencing very high load lately, and this affects I/O (CPU is OK). This obviously should not affect you. We are doing the following things at the moment:

    Short term: we are decreasing build density right now, so that high load and noisy neighbors will not affect you (or at least will affect you much less) at peak hours. Note, however, that as a side effect some builds at peak hours will run on Google Cloud. Performance there is good, but build start time is 3-4 minutes (the time to provision a VM), which is negligible given your build times.

    Long term: we are working on adding a new datacenter for our Hyper-V infrastructure. I cannot give an exact ETA, but it should be added in a couple of weeks.

    We also suggest forcibly running some specific, very heavy jobs on Google Cloud. To do that, set the environment variable appveyor_build_worker_cloud to gce for those jobs in the matrix.
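
    A minimal sketch of that setting in appveyor.yml, assuming the jobs are defined under environment.matrix. The JOB_NAME variable and its values are hypothetical and only mark which entry is which; appveyor_build_worker_cloud and gce are the names given above.

    ```yaml
    # Illustrative fragment only; JOB_NAME and its values are hypothetical.
    environment:
      matrix:
        # A very heavy job, forced onto Google Cloud.
        - JOB_NAME: very-heavy-job
          appveyor_build_worker_cloud: gce
        # A lighter job with no override, left to the default scheduling.
        - JOB_NAME: lighter-job
    ```

    Keeping the override on individual matrix entries limits the extra 3-4 minute VM provisioning time to the jobs that need it.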

    Please let us know how it goes.

  6. Posted by acrichton on 13 Nov, 2018 09:44 PM

    Oh that sounds perfect, thanks for the information! I've sent a PR to switch to GCE, and I also sent a PR to switch to VS2017 preview images. It'll probably take a while for those to land and for us to get a feel for whether we see any more timeouts, but we'll get back to you if anything shows up!

  7. Posted by acrichton on 19 Nov, 2018 04:36 PM

    This appears to have basically solved our issue, thanks so much again for the tip!

  8. Posted by Ilya Finkelshteyn on 21 Nov, 2018 01:53 AM

    Just noticed that your current build is running in our main datacenter and appveyor.yml for the auto branch does not have this setting. Trying to track down related changes in your repo... If you have an idea how this happened, please let me know.

  9. Posted by acrichton on 21 Nov, 2018 03:36 AM

    Oh no worries! That's to be expected. All our CI happens on the auto branch regardless of what the destination branch is, so our master branch has the fix (scheduled on GCE) but our beta branch doesn't have the fix yet (it's not forcibly scheduled on GCE). The build you saw was from the beta branch, so it's just configuration on our end!

  10. acrichton closed this discussion on 15 Jan, 2019 07:49 PM.
