Did recent changes possibly slow down builds?
Over in the rust-lang/rust GitHub repository our CI has recently (over the past week or so) started timing out on AppVeyor far more than it used to. Our builds run with a maximum of 3 hours and typically come in around 2 hours, so they're somewhat sensitive to timing, unfortunately.
Over the past week or so we've been seeing an elevated number of our builds timing out. Analyzing the build logs and comparing them to successful runs that didn't time out tends to show blanket slowdowns across the board, where basically all the granular steps of the build take longer than they previously did (no single step stands out as the cause). To work around this on our end we typically retry the build (re-enqueue it), and the second time the build runs (with effectively the same code) it often goes much faster, completing in the allotted time.
In investigating this we wanted to reach out to see if y'all know whether something may be awry. Were there any underlying changes in the past week or two which might affect this? If not, we'll keep digging!
An example we have of this is here:
- Failed build: https://ci.appveyor.com/project/rust-lang/rust/builds/20187550
- Succeeded build: https://ci.appveyor.com/project/rust-lang/rust/builds/20191704
The job which previously took 3 hours afterwards took under two hours. Although the commit hashes were different, the contents of the two commits that were tested should have been exactly the same.
1 Posted by Ilya Finkelshte... on 12 Nov, 2018 06:41 PM
Hi Alex,
At the end of last week we did indeed find a node with degraded performance and replaced it over the weekend. Now things should be back to normal, but I see that this build and this build are still failing, and this happens on presumably healthy nodes.
We do not see a performance degradation on our side, but we will investigate more.
Meanwhile we have increased your build timeout to 4 hours. If you do not like that and want to fail fast, you can decrease it in the General tab of the project settings under "Build timeout, minutes". Also, a new button called RE-RUN INCOMPLETE is coming soon (hopefully this week or so) which will allow you to re-run only the failed and cancelled jobs in a matrix.
Also, a build worker image update happened on Friday evening. Do you think that any of those changes could affect the performance?
And finally, I would recommend trying to run the jobs which fail often on the Visual Studio 2017 image and letting us know if they behave differently. You can set the image per job with the APPVEYOR_BUILD_WORKER_IMAGE environment variable as described here; you are already selecting Visual Studio 2017 Preview this way now.
Please keep in touch and let us know what you find.
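A minimal sketch of how selecting the image per job could look in appveyor.yml (the matrix entries and the RUST_CHECK_TARGET variable below are illustrative, not taken from the actual rust-lang/rust configuration):

```yaml
# Hypothetical appveyor.yml fragment: pin individual matrix jobs to a
# specific worker image via the APPVEYOR_BUILD_WORKER_IMAGE variable.
environment:
  matrix:
    # This job runs on whatever default image the project is configured with.
    - RUST_CHECK_TARGET: check

    # This job is pinned to the Visual Studio 2017 image instead.
    - RUST_CHECK_TARGET: dist
      APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017
```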
We, from our side, are looking deeply into performance issues now and plan some major datacenter upgrades and migrations in the near future.
Ilya.
2 Posted by acrichton on 12 Nov, 2018 09:52 PM
Ok thanks for all the information! The increased timeout will hopefully help for now (thanks!) and we'll keep our eyes peeled on our end.
3 Posted by acrichton on 13 Nov, 2018 07:14 PM
Do y'all perhaps have statistics on whether the VMs that we're running on are shared with other possibly high-CPU workloads? (or maybe even our own workloads?)
Comparing this 1h36m build with this 3h17m build, the build got nearly 2x slower with very similar code being tested. Our own analysis shows that building the Rust compiler, a very CPU-intensive workload, was nearly 40% slower in the latter build than in the former. (Compiling the Rust compiler does a little I/O but is almost always bound by CPU/memory.)
A still-running build is also executing over an hour slower than the previous build :(. If y'all have any data you can share about the hosting environment, and whether we're perhaps on noisy machines, that would help explain this and would be much appreciated!
4 Posted by acrichton on 13 Nov, 2018 07:15 PM
Er, sorry, I meant to mention this earlier, but the build image update doesn't seem like it'd affect us much; it's mostly compiler toolchain revisions and/or runtime updates which tend to affect our builds the most.
5 Posted by Ilya Finkelshte... on 13 Nov, 2018 09:31 PM
We are experiencing very high load lately and this affects I/O (CPU is OK). This obviously should not affect you. We are doing the following things at the moment:
Short-term solution -- we are decreasing build density at the moment, so high load and noisy neighbors will not affect you (or at least will affect you much less) at peak hours. Note however that a side effect of this is that some builds at peak hours will run on Google cloud. Performance there is good, but the build start time is 3-4 minutes (the time to provision a VM), which is negligible given your build times.
Long term -- we are working on adding a new datacenter for our Hyper-V infrastructure. I cannot give an exact ETA, but it should be added in a couple of weeks.
Also, we propose that you forcibly run some specific, very heavy jobs on Google cloud. For that, please set the environment variable appveyor_build_worker_cloud to gce for those jobs in the matrix.
Please let us know how it goes.
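A minimal sketch of how forcing a heavy matrix job onto Google cloud could look in appveyor.yml (again, the job variables here are illustrative and not taken from the actual rust-lang/rust configuration):

```yaml
# Hypothetical appveyor.yml fragment: force one heavy matrix job onto GCE
# by setting appveyor_build_worker_cloud for that entry only.
environment:
  matrix:
    # Heavy job: provisioned on Google cloud (slower VM start, steadier CPU).
    - RUST_CHECK_TARGET: dist
      appveyor_build_worker_cloud: gce

    # Lighter job: stays on the default Hyper-V cloud.
    - RUST_CHECK_TARGET: check
```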
6 Posted by acrichton on 13 Nov, 2018 09:44 PM
Oh, that sounds perfect, thanks for the information! I've sent a PR to switch to GCE, and I also sent a PR to switch to the VS2017 Preview images. It'll probably take a while for those to land and for us to get a feel for whether we see any more timeouts, but we'll get back to you if anything shows up!
7 Posted by acrichton on 19 Nov, 2018 04:36 PM
This appears to have basically solved our issue, thanks so much again for the tip!
8 Posted by Ilya Finkelshte... on 21 Nov, 2018 01:53 AM
Just noticed that your current build is running in our main datacenter and the appveyor.yml for the auto branch does not have this setting. Trying to track down related changes in your repo... If you have an idea how this happened, please let me know.
9 Posted by acrichton on 21 Nov, 2018 03:36 AM
Oh, no worries! That's to be expected. All our CI happens on the auto branch regardless of what the destination branch is, so our master branch has the fix (scheduled on GCE) but our beta branch doesn't have the fix yet (it's not forcibly scheduled on GCE). That build you saw was from the beta branch, so it's just configuration on our end!
acrichton closed this discussion on 15 Jan, 2019 07:49 PM.