The race condition which prompted STAGING_SLEEP rears its ugly head
again: when pushing a base repo and its dependents, it's possible for
the base repo's new head to take much longer to become visible than
the dependents' (or so it seems?).
In this case, CI might pick up the correct dependent but pick an older /
incorrect revision of the base, leading to a staging failing for no good
reason.
This change uses info/refs to check for the updated staging head to be
visible at the repo level after it's been set / updated via the API. It
assumes repos are in topological order.
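Roughly, the check amounts to polling the repo's ref advertisement
until the expected sha shows up; a minimal sketch (the URL shape,
parsing and timeout here are assumptions, the real check goes through
the project's github session):
    import time
    import requests

    def wait_for_visibility(repo, expected_sha, timeout=60):
        # poll the smart-HTTP ref advertisement until the staging head
        # is listed; only then move on to the dependent repos (which
        # assumes repos are processed in topological order)
        url = 'https://github.com/%s.git/info/refs' % repo
        deadline = time.time() + timeout
        while time.time() < deadline:
            r = requests.get(url, params={'service': 'git-upload-pack'})
            # crude containment check: the advertisement lists
            # "<sha> <refname>" pairs, so seeing the sha means the
            # update is visible at the repo level
            if r.ok and expected_sha in r.text:
                return True
            time.sleep(1)
        return False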
Use the proper / actual "is there any stageable PR" query to check if
a PR is blocked as well, that way they shouldn't be diverging all the
time even if it might make PR.blocked a bit more expensive.
fixes #111
Will comment any time a statuses update resolves to a CI failure on a
reviewed pull request. Might be somewhat spammy, we'll see.
No notification if the PR is not reviewed yet.
fixes #87
Before this, applying a commit's statuses to the relevant PR or
staging would be performed immediately / inline with its
consumption. This, however, is problematic if we want to implement
additional processing like #87 (and possibly though probably not #52):
webhook handlers should be kept short and fast, feeding back into
github would not be acceptable.
- flag commits as needing processing instead of processing them
immediately, this uses a partial index as that looks like the
recommended / proper way to index a boolean column in which one of
the values is searched much more than the other (todo: eventually
check whether it actually does anything; see the sketch below)
- add a new cron for commits processing
- alter tests so they use this new cron (mostly by migrating them to
`run_crons` though not solely as some still need more detailed
management to properly check intermediate steps)
Fix an issue with closing a staged PR while at it (the "merging" tag
would potentially never be removed).
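For reference, the partial index mentioned above boils down to
something like this (a sketch; table and column names are
illustrative, the real DDL lives with the model):
    import psycopg2

    def ensure_to_check_index(dsn):
        # partial index: only rows where to_check is true are indexed,
        # which suits a boolean column where that value is both rare
        # and the only one the cron ever searches for
        with psycopg2.connect(dsn) as conn, conn.cursor() as cr:
            cr.execute("""
                CREATE INDEX IF NOT EXISTS commit_to_check_idx
                    ON runbot_merge_commit (to_check)
                 WHERE to_check
            """)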
Proper RFC5322 makes for much noisier messages, and seems completely
unnecessary as examples of sign-off on the internet don't quote spaces
/ names.
closes #102
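To illustrate what strict RFC 5322 quoting does versus the plain
sign-off form (made-up name and address, using Python's stdlib purely
for the demonstration):
    from email.utils import formataddr

    name, addr = "Jane Doe (R&D)", "jane@example.com"  # made up

    # strict RFC 5322: specials in the display name force quoting
    print(formataddr((name, addr)))
    # -> "Jane Doe (R&D)" <jane@example.com>

    # the plain form everybody actually uses for sign-offs
    print("Signed-off-by: %s <%s>" % (name, addr))
    # -> Signed-off-by: Jane Doe (R&D) <jane@example.com>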
* split out truly awaiting PRs from those waiting on an event of some
sort
* if a staging is active but doesn't have a state yet, it should be
considered pending not cancelled
closes #74
If a PR gets sync'd to a known-valid commit, it should be marked as
valid rather than get in this weird state where it's merely open but
github knows it passes CI.
Fixes #72
This is somewhat less useful with runbot's fail-fast as a runbot
failure (false positive or not) will now very quickly trigger an end
to the current staging.
Still, could be of use.
closes #89
Turns out skipping locks is not very useful when there are no locks
being held because we only touch the PRs *after* the merge has been
applied.
So finally do that, lock all of a staging's PRs before we try to
fast-forward the relevant repositories, so a close command coming back
from github (from having seen the closes #xxx annotation) doesn't
screw us over.
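The locking itself is just row locks taken up-front, roughly (a
sketch; the table name is assumed from the model name):
    def lock_staging_prs(cr, pr_ids):
        # take row locks on every PR of the staging before any
        # repository is fast-forwarded: a "closed" webhook arriving
        # mid-merge then blocks until this transaction is done instead
        # of flipping the PRs under our feet
        cr.execute("""
            SELECT id
              FROM runbot_merge_pull_requests
             WHERE id = ANY(%s)
               FOR UPDATE
        """, [list(pr_ids)])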
No test because I don't understand how / why it's triggered, it's just
that some PRs don't have a label. I assumed the issue occurred when
the source branch or even repo (cross-repo PR) was deleted, but that
doesn't seem to trigger it (or at least not within the timespan of a
test; maybe GH eventually does some vacuuming which causes the
issue?).
Anyway we may eventually want to reclaim these PRs (allowing a lack of
label and treating them like the patch-\d labels: with no semantic
value), however the simplest thing to do for now is to just ignore the
corresponding PR.
closes #101
Github is subject to a fair amount of transient failures, which are
currently ill-logged: an exception is raised and the caller /
responsible code might eventually log something, but it's not really
formalised or centralised, which makes it inconvenient to post-mortem
issues with github's support.
Change this such that *almost* all github API calls get extensively
logged (status, reason, all headers, body) on failure.
Also automatically set debug logging for odoo in local tests, and
alter the fake response constructor thing so it doesn't set a json
mimetype when the body is not valid json.
Closes #98
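The failure logging has roughly this shape (sketch only; the real
helper lives in the project's github module and the session / logger
handling differ):
    import logging
    import requests

    _logger = logging.getLogger(__name__)

    def gh_request(session, method, url, **kw):
        # log everything needed to post-mortem a transient github
        # failure: status, reason, all response headers, raw body
        response = session.request(method, url, **kw)
        if not response.ok:
            _logger.error(
                "%s %s -> %s %s\nheaders: %s\nbody: %s",
                method, url, response.status_code, response.reason,
                dict(response.headers), response.text,
            )
        return response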
When a user changes the merge method via github messages, no feedback is
sent. This could lead to strange behavior, for example when a user
tries to joke with the mergebot like this:
> robodoo are you goin ti merge my PR rogntudju !
This sets the merge method to "merge" and the user is not aware of it.
Before this change, the order of PRs to list in an "unready" feedback
message was whatever the DB returned which could vary. This change
fixes the order by applying model order.
If a transient github failure makes the integration fail but also
makes the following reset fail, the entire staging process would be
cancelled (and operations so far rolled back) except for the failure
comment, which would still be effected, as in odoo/odoo#26380.
By pushing the comment to the feedback queue, if the reset fails the
comment is rolled back and "unqueued".
Hopefully this finally fixes the double commenting
issue (e.g. odoo/odoo#28160). This seems (according to reading the
logs and also logic) like a failure (concurrency of some sort) leading
to a transaction rollback *after* the GH API call, so the cron's DB
actions are rolled back (then performed again on the next run) *but*
the GH API interactions are not rolled back, and are thus performed
twice.
Since the entire purpose of the feedback table is to send comments,
send both "staging failed" and "Linked pull requests" comments via
that.
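As a rough sketch, the feedback queue is just a small model plus a
cron draining it, so comments roll back together with whatever
transaction queued them (model, field and helper names here are
illustrative, not the exact schema):
    from odoo import fields, models

    class Feedback(models.Model):
        _name = 'runbot_merge.pull_requests.feedback'
        _description = "github comments waiting to be sent"

        repository = fields.Many2one('runbot_merge.repository', required=True)
        pull_request = fields.Integer("PR number", required=True)
        message = fields.Char()

        def _cron_send(self):
            # runs in its own transaction: a row only exists (and thus
            # a comment only gets posted) if the transaction which
            # queued it actually committed
            for record in self.search([]):
                # github() is assumed to be the repo's API wrapper
                record.repository.github().comment(
                    record.pull_request, record.message)
                record.unlink()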
Once more unto the breach, with the issue of pushing stagings (with
"closes" annotations) to the target branch making GH close the PR &
send the hook, which makes runbot_merge consider the PR closed and the
staging cancelled.
This probably still doesn't fix the issue, but it reduces the
problematic window: before this, the process first updates the
branches, then marks the PRs, then comments & closes the PRs, and
finally commits the PR update.
This means as runbot_merge is sending a comment & a status update to
each PR in a staging, GH has some time to send the "closed" webhook
behind its back, making the controller immediately cancel the current
staging, especially if the v3 endpoint is a bit slow.
By moving the commenting & closing out of the critical path (to the
feedback queue), this window should be significantly shortened.
Normally, two PRs from the same owner with the same branch
name (source) are batched together, and that's represented by batching
them by label e.g. two PRs labelled
`odoo-dev:12.0-snailmail-address-format-anp` means they probably
should be merged together somehow.
*However* this is an issue when editing via the web interface: if the
editor doesn't have write access to the repository, github
automatically creates a branch called `patch-<n>` under their
ownership, where `<n>` is a sequence (of sorts?) *within the user's
fork*.
This means it's possible (and indeed easy) to create <foo>:patch-1 on
different (non-fork) but related repositories without them actually
being co-dependent, at which point they get blocked on one
another, which can lead to them being blocked (period) due to runbot
not currently handling co-dependencies between PRs.
Currently, if a staging is ongoing or has failed, one has to hunt for
the staging branches on the runbot dashboard in order to find out what
is happening.
This adds a dropdown to the staging box/block providing direct status
and access to all the CI information whether the CI is ongoing or done,
successful or not.
Apparently the split and moving things around caused the _tagstate to
get computed/updated earlier (/differently), and thus the tagging
update relying on it to... not work anymore.
At least restore adding a "merging" tag when a PR is staged.
rebase() can fail after merge(), during set_ref(), having already
updated the target.
Under the pre-rebase model, stage() assumed that on a staging failure
in a given repo it only had to roll back stagings having succeeded. This
assumption fails in a post-rebase model as even a failed staging can
have modified the target, leading to the next staging (if multiple
batches are ready) containing the failed one.
Things can get really strange if the set_ref failure was (as it
probably is) some sort of transient failure, as the following staging
will likely succeed (under the assumption that most PRs/batches pass
staging) as PR1's content gets merged as part of PR2, and PR2 is
merged empty of content later on.
Original issue (staging would get cancelled just as it was being
merged) was not really fixed but traded for a new one: serialization
errors which can lock up the mergebot for a long time, stopping
handling of all incoming signals (possibly/probably because all of
them try to write on the PR which is locked?).
Splitting the tagging cron out should already way improve things as
the status update cron should be way shorter (and thus hold its locks
for a smaller amount of time). This should also avoid the "close"
handler waiting on the extant transaction, and make the "pr update"
transaction be much shorter as each staging gets its own txn.
Send response comments when users mis-interact with robodoo, e.g.
send commands they don't have the right to, or commands which don't make
sense in the PR's state, or tentative interactions with robodoo from
unmanaged PRs.
People get surprised/worried that their ready PR never gets picked up,
but it's because there is a non-ready (either unreviewed or failing CI)
pull request linked to it. This aims to at least warn them of the issue.
As noted in the old comment, the provided commit was the head of the
staging branch. This confused many users as they'd click the link
expecting to see their commit and potentially got something completely
unrelated.
Since we already get backlinks from the commit through the "closes
<pr_number>" added either by the committer or by the bot itself, the
information is already available.
They're completely independent (or should be), and there's no reason
for the tagging cron to extend the "lifetime" of the main cron's
transaction (and thus extend the odds of racy behaviour).
If a staging covers multiple repositories and there's a fast-forward
issue on any but the first repo/target, runbot_merge attempted to
revert the commits it had fast-forwarded on the previous repos.
This doesn't work when branch-protection is active, unless runbot_merge
is a repository administrator (and branch protection is not configured
to apply to those): reverting is done by push-forcing the original head
back onto the ref, which branch-protection unconditionally precludes.
This commit does not entirely fix the race condition (it does not look
like github provides any way to do that), but it should significantly
reduce the race-condition window as it performs a semi-wet run of the
fast-forward process on the tmp branches before actually updating the
targets. That way the only remaining breakage should be when somebody
pushes on repositories 1.. between the test-FF on tmp branches and the
actual fast forward.
While at it, updated the github API thing to *always* dump the JSON body
on an error response, if the content-type is json.
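The semi-wet run is essentially replaying the same ref updates on the
tmp branches before touching the targets, along these lines (a sketch;
set_ref / fast_forward stand for the project's per-repo helpers and
their exact signatures are assumptions):
    def stage_fastforward(repos, heads, target):
        # heads: mapping of repo -> staging head sha; target: branch
        # first pass, "semi-wet" run: push every staging head to the
        # repo's tmp branch, so permission or fast-forward problems
        # surface before any target branch has been modified
        for repo in repos:
            repo.github().set_ref('tmp.%s' % target, heads[repo])
        # second pass: the actual fast-forwards; the remaining race is
        # somebody pushing to repositories 1.. between the two loops
        for repo in repos:
            repo.github().fast_forward(target, heads[repo])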
* [ADD] runbot_merge: more informative states to stagings on error
Currently, when a staging fails for other reasons than a CI failure:
* the staging having been cancelled is known implicitly, because the
staging will be deactivated but will never get a status beyond
pending (because it's not found when looking for it since it's not
`active`)
* the fast-forward having failed is completely silent (logging aside),
it looks for all the world like the staging succeeded
Timeout already fails the PR, but split-on-timeout was not doing so;
fix that one bit.
* [FIX] odoo/odoo#cb2862ad2a60ff4ce66c14e7af2548fdf6fc5961
Closes #41
I just spent 10 minutes trying to find out why staging 28 was
cancelled (a p=0 comment). Add a common prefix to all staging cancels
to make them easier to find.
staging delay was mistakenly commented out in
bb664455ec
Also modified testing fixtures so the staging delay is not enabled when
running tests locally: on my box it increases the local runtime from
~70s to ~1500s (20s/staging, ~1 staging/test, 73 tests)
Before this, the bot would only acknowledge commands of the form
<botname>: <commands>
but since the bot is an actual user, people regularly use `@<botname>`
as it seems like it should work *and* provides for autocompletion.
Support that, as well as the octothorpe in case users want to pound
robodoo.
Related to odoo/runbot#38
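The addressing change boils down to accepting a couple more prefixes
when looking for the bot's name, something like (illustrative regex,
the real parser does more):
    import re

    def extract_commands(botname, comment):
        # accept "<botname>: <commands>" as before, plus "@<botname>"
        # and "#<botname>" since that's what people naturally type
        # (and what github autocompletes to)
        pattern = r'(?:^|\s)[@#]?%s[:,]?\s+(.+)' % re.escape(botname)
        m = re.search(pattern, comment, re.IGNORECASE)
        return m.group(1).split() if m else []

    # extract_commands('robodoo', '@robodoo r+') == ['r+']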
Continuation of fa94b269de which is
apparently not sufficient:
1. log the staging event so we can check that we're staging in the
correct order
2. add a delay after each staging in case there's some sort of race
in the updating of codependent repositories
When creating staging branches from tmp, use the iteration order of
the repos in the project (that way it's easy to see and eventually
configure if we add sequences or whatever, in the short term it's the
order in which the repos were added which is the one we want).
This ensures we stage odoo/odoo before we stage odoo/enterprise
without relying on dict order of iteration, or needing meta to be an
OrderedDict.
The issue is if stagings are created/updated the other way around, the
runbot may pick up staging on odoo/enterprise before odoo/odoo has
been updated, and thus build odoo/enterprise with the wrong odoo/odoo
commit, defeating the entire point of it.
Example: http://runbot.odoo.com/runbot/build/376112 was triggered by
the same staging as http://runbot.odoo.com/runbot/build/376113, but
used the previous staging head.
The creation order of tmp branches should not matter so ignore it.
A limit of 50 commits per PR was put in place to avoid rebasing huge
PRs (as a rebase means 1 merge + 1 commit *per source commit*),
however the way it was done would also limit regular merges, and the
way the limitation was implemented was not clear.
* explicitly check that limit in the rebase case
* make error message on PR sizes (rebase 50 or merge 250) clearer
* remove limit from commits-fetching (check it beforehand)
* add a test to merge PRs of more than 50 commits
* fix the local implementation of pulls/:number/commits to properly
paginate
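Fetching a PR's commits now needs to page through the endpoint instead
of capping it; roughly (a sketch using requests directly, the real
code goes through the project's GH wrapper):
    import requests

    def pr_commits(owner, repo, number, session=None):
        # yield every commit of the PR, following github's Link-header
        # pagination (100 per page) instead of stopping at page one
        session = session or requests.Session()
        url = 'https://api.github.com/repos/%s/%s/pulls/%s/commits' % (
            owner, repo, number)
        params = {'per_page': 100}
        while url:
            r = session.get(url, params=params)
            r.raise_for_status()
            yield from r.json()
            url = r.links.get('next', {}).get('url')
            params = None  # the "next" url already embeds paging params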
a0063f9df0 slightly improved the error
message on non-PR CI failure (e.g. a community PR makes enterprise
break) by adding the failed commit, but that's still not exactly clear,
even for technical users (plus it requires having access to all the
repos which is not the case for everyone).
This commit further improves the situation by storing the target_url
and description fields of the commit statuses, and printing out the
target_url on failure if it's present.
That way the PR comment denoting build failure should now have a link to
the relevant failed build on runbot, as that's the target_url it
provides.
The change is nontrivial as it tries to be compatible with both the old
and new statuses storage formats, such that there is no migration to
perform.
e98a8caffb added dummy commits to the
heads of stagings and fixed most places to distinguish between the
staging head (including the dummy commit) and the actual merge head,
but the difference was missed in the comment closing a PR, which was
still using the staging head and thus pointing to the dummy commit
e.g. https://github.com/odoo/odoo/pull/26821#issuecomment-420244592
If CI fails on a non-PR'd branch of a staging (e.g. given repos A and B
and a PR to A, CI fails on the staging branch to B), the error message
(log and comment on the PR) is unhelpful as it states that the staging
failed for "unknown reason".
Improve this by providing the failed CI context and the commit, which
should allow finding out the branch & CI logs, and understanding the
why of the failure.
Fixes #36
Before this change, when staging batches only affecting one repo (of n)
the unaffected repositories would get a staging branch exactly matching
the target.
As a result, either runbot_merge or runbot would simply return the
result of an unrelated build, potentially providing incorrect
information and either failing a staging which should have succeeded
(e.g. change in repo A broke B, PR is making a change in repo A which
fixes B, but B's state is reported as the previous broken build) or
succeeding a staging which should have failed (change in repo A breaking
B except a previous build of the exact same B succeeded with a different
A and is returned).
To fix this issue, create a dummy commit at the head of each staging
branch. Because commit dates are included in the hash and have
one-second precision, it's pretty unlikely that we'd get duplicate
builds, but
just to be completely sure some random bits are added to the commit
message as well.
Various tests fixed to correctly handle the extra dummy commit on
staging branches.
fixes #35
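Creating the dummy commit is a matter of re-using the staging head's
tree with a randomized message, something like this (sketch;
gh.commit / gh.set_ref are assumed wrappers around the git-data API,
and the staging branch naming is illustrative):
    import secrets

    def add_dummy_head(gh, staging_head, tree, target):
        # empty commit on top of the staging head: same tree, single
        # parent, plus a few random bits in the message so two stagings
        # can't collide even within the same second
        dummy = gh.commit(
            tree=tree,
            parents=[staging_head],
            message='force rebuild\n\nuniquifier: %s' % secrets.token_hex(8),
        )
        gh.set_ref('staging.%s' % target, dummy['sha'])
        return dummy['sha']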
rebase-and-merge (or squash-merge if pr.commits == 1) remains default,
but there are use cases like forward ports (merge branch X into branch
X+1 so that fixes to X are available in X+1) where we really really
don't want to rebase the source.
This commit implements two alternative merge methods:
If the PR and its target are ~disjoint, perform a straight merge (same
as old default mode).
However if the head of the PR has two parents *and* one of these
parents is a commit of the target, assume this is a merge commit to
fix a conflict (common during forward ports as X+1 will have changed
independently from and incompatibly with X in some ways).
In that case, merge by copying the PR's head atop the
target (basically rebase just that commit, only updating the link to
the parent which is part of target so that it points to the head of
target instead of whatever it was previously).
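In DAG terms the decision is: if the PR head has exactly two parents
and one of them is already on the target, treat it as a
conflict-resolution merge and copy just that commit onto the target,
otherwise do a plain merge. A sketch of that check (helper names are
assumptions, not the real API):
    def pick_merge_strategy(gh, pr_head, target):
        # gh.parents / gh.is_ancestor are assumed helpers over git data
        parents = gh.parents(pr_head)
        if len(parents) == 2 and any(gh.is_ancestor(p, target) for p in parents):
            # head is a merge fixing a conflict with the target:
            # recreate it on top of the target's current head, swapping
            # the parent which belonged to the target
            return 'copy-head'
        # histories are ~disjoint: straight merge of the PR into target
        return 'merge'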
After discussion with al & rco, conclusion was default PR merging method
should be rebase-and-merge for cleaner history.
Add test for that scenario (w/ test for final DAG) and implement this
change.
* avoid fetching PRs for un-managed branches when that's known up-front
* avoid processing comments with no commands (avoids fetching the
corresponding PR which we know nothing about yet and which may or
may not be for a managed branch)
The old "sync pr" thing is turning out to be a bust, while it
originally worked fine these days it's a catastrophe as the v4 API
performances seem to have significantly degraded, to the point that
fetching all 15k PRs by pages of 100 simply blows up after a few
hundreds/thousands.
Instead, add a table of PRs to sync: if we get notified of a
"compatible" PR (enabled repo & target) which we don't know of, create
an entry in a "fetch jobs" table, then a cron will handle fetching the
PR then fetching/applying all relevant metadata (statuses,
review-comments and reviews).
Also change indexation of Commit(sha) and PR(head) to hash, as btree
indexes are not really sensible for such content (the ordering is
unhelpful and the index locality is awful by design/definition).
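The index change itself is plain DDL, roughly (a sketch; table names
are assumed from the model names):
    import psycopg2

    def convert_sha_indexes(dsn):
        # shas are only ever looked up by equality and their
        # lexicographic order carries no meaning, so hash indexes fit
        # better than btrees here
        with psycopg2.connect(dsn) as conn, conn.cursor() as cr:
            cr.execute("CREATE INDEX IF NOT EXISTS commit_sha_idx"
                       " ON runbot_merge_commit USING hash (sha)")
            cr.execute("CREATE INDEX IF NOT EXISTS pr_head_idx"
                       " ON runbot_merge_pull_requests USING hash (head)")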
Previously when splitting a staging we'd create two never-staged
stagings. In a system where the stagings get deleted once done
with (succeeded or failed) that's not really important, but now that
we want to keep stagings around inactive things get problematic as
this method gunks up the stagings table, plus the post-split stagings
would "steal" the original's batches, losing information (relation
between stagings and batches).
Replace these empty stagings with dedicated *split* objects. A batch
can belong to both a staging and a split, the split is deleted once a
new staging has been created from it.
Eventually we may want to make batches shared between stagings (so we
can track the entire history of a batch) but currently that's only
PR-level.
If we want a dashboard with a history of stagings, maybe not deleting
them would be a good idea.
A replacement for the headless stagings would probably be a good idea:
currently they're created when splitting a failed staging containing
more than one batch, but their only purpose is as splits of existing
batches to be deactivated/deleted to be re-staged (new batches &
stagings are created then as e.g. some of the batches may not be
merge-able anymore) and that's a bit weird.
AL thinks it's not useful and it's better to always squash/rebase a
single commit & merge multiple. Mark tests as xfail'd instead of
removing them.
Also mark test_edit_retarget_managed as skipped explicitly
* p0 cancel existing stagings in order to be staged as soon as
possible
* p0 PRs should be picked over split batches
* p0 bypass PR-level CI and review requirements
* p0 can be set on any of a batch's PR, matched PRs will be staged
alongside even if their priority is the default