Thank god I have a bunch of tests because once again I forgot / missed
a bunch of edge cases in doing the conversion, which the tests
caught (sadly that means I almost certainly broke a few untested edge
cases).
Important notes:
Handling of parent links
------------------------
Unlike PRs, batches don't lose their parent info ever, the link is
permanent, which is convenient to trawl through a forward port
(currently implemented very inefficiently, maybe we'll optimise that
in the future).
However this means the batch having a parent and the batch's PRs
having parents are slightly different informations, one of the edge
cases I missed is that of conflicting PRs, which are deparented and
have to be merged by hand before being forward ported further, I had
originally replaced the checks on a pr and its sibling having parents
by just the batch.
Batches & targets
-----------------
Batches were originally concepted as being fixed to a target and PRs
having that target, a PR being retargeted would move it from one batch
to an other.
As it turns out this does not work in the case where people retarget
forward-port PRs, which I know they do because #551
(2337bd8518). I could not think of a
good way to handle this issue as is, so scrapped the moving PRs thing,
instead one of the coherence checks of a batch being ready is that all
its PRs have the same target, and a batch only has a target if all its
PRs have the same target.
It's possible for somewhat odd effects to arise, notably if a PR is
closed (removed from batch), the other PRs are retargeted, and the new
PR is reopened, it will now be on a separate batch even if it also
gets retargeted. This is weird. I don't quite know how I should handle
it, maybe batches could merge if they have the same target and label?
however batches don't currently have a label so...
Improve limits
--------------
Keep limits on the PRs rather than lift them on the batchL if we can
add/remove PRs of batches having different limits on different PRs of
the same batch is reasonable.
Also leave limit unset by default: previously, the limit was eagerly
set to the tip (accessible) branch. That doesn't really seem
necessary, so stop doing that.
Also remove completely unnecessary `max` when trying to find a PR's
next target: `root` is either `self` or `self.source_id`, so it should
not be possible for that to have a later target.
And for now ensure the limits are consistent per batch: a PR defaults
to the limit of their batch-mate if they don't have one, and if a
limit is set via command it's set on all PRs of a batch.
This commit does not allow differential limits via commands, they are
allowed via the backend but not really tested. The issue is mostly
that it's not clear what the UX should look like to have clear and not
super error prone interactions. So punt on it for now, and hopefully
there's no hole I missed which will create inconsistent batches.
In case of PRs not being ready, don't just say the PRs are waiting for
CI even though they might be unreviewed, and make a difference
between *waiting* for CI (pending) and having failed CI.
It's a bit weird and inconsistent to have a PR being staged while
unreviewed or unapproved or w/e. If we compute the state based on
skipchecks then everything is consistent.
Also remove the implicit override of all statuses when explicitly
marking the pr as `ready`, it risks creating difficult to understand
states, and it's unnecessary since `skipchecks` gets set.
Also as with setting skipchecks, sets the current user as reviewer on
all PRs of the batch without a reviewer.
Move staging cancellation to the batch, remove its (complicated)
handling from the PRs.
This loses some precision in the cancellation message, but that could
likely be recovered (in part) by adding more precise checks &
diagnostic extractions in the compute.
Because `alone` (formerly p != 2) is selected before split PRs, if a
prioritised PR gets split (or a split PR gets prioritised) it will be
staged once as prioritised, and again because split.
Improve the selection of ready batches to exclude split batches
upstream, such that they don't have to be rechecked over and over, and
their priorities don't cause us issues.
Simplifies the `ready_prs` query a bit and allows it to be converted
to an ORM search, by moving the priority check outside. This also
allows the caller to not need to post-process the records list
anywhere near the previous state of affairs.
`ready_prs` now returns *either* the "alone" batches, or the non-alone
batches, rather than mixing both into a single sequence. This requires
correctly applying the search filters to not retrieve priority of
batches in error or targeting other branches.
Staging readiness is a batch-level concerns, and many of the markers
are already there though a few need to be aggregated from the PRs. As
such, staging has no reason to be performed in terms of PRs anymore,
it should be performed via batches directly.
There is a bit of a mess in order not to completely fuck up when
retargeting PRs (implicitly via freeze wizard, or explicitely) as for
now we're moving PRs between batches in order to keep the
batches *mostly* target-bound.
Some of the side-effects in managing the coherence of the targeting
and moving PRs between batches is... not great. This might need to be
revisited and cleaned up with those scenarios better considered.
- `merge_date` should be common to an entire batch, so move it there
- remove `Batch.active` which should probably have been removed when
batches were made persistent (can eventually re-add as a proxy for
`merge_date` being set maybe, but for now removing it seems a better
way to catch mistakes)
- update various sites to use `Batch.merge_date` instead of
`Batch.active`
This probably has latent bugs, and is only the start of the road to v2
(#789): PR batches are now created up-front (alongside the PR), with
PRs attached and detached as needed, hopefully such that things are
not broken (tests pass but...), this required a fair number of
ajustments to code not taking batches into account, or creating
batches on the fly.
`PullRequests.blocked` has also been updated to rely on the batch to
get its batch-mates, such that it can now be a stored field with the
right dependencies.
The next step is to better leverage this change:
- move cross-PR state up to the batch (e.g. skipchecks, priority, ...)
- add fw info to the batch, perform forward-ports batchwise in order
to avoid redundant batch-selection work, and allow altering batches
during fw (e.g. adding or removing PRs)
- use batches to select stagings
- maybe expose staging history of a batch?
Not sure it's going to be useful but it's hard to know if we can't
test it. The intent is mostly the ability to prioritize throughput (or
attempt to) during high-load events, if we can favour staging N
new batches over a split's N/2 we might be able to merge more crap.
But maybe not, we'll see, either way now it's here and seems to more
or less work.
Fixes#798
Because the mergebot crons are on such a tight scheduling, and just
them finding out they have nothing to do can take a while, disabling
them can be a chore. Disabling staging via the project is much less
likely to cause issues as the projects don't normally (or ever?) get
exclusively locked, so they can generally be written to at any moment.
Furthermore, if we ever get in a situation where we have multiple
active projects (not really the case currently, we have multiple
projects but only one is really active) it's less disruptive to
disable stagings on a single specific project.
Fixes#860
This commit revisits the commands set in order to make it more
regular, and limit inconsistent command-sets, although it includes
pseudo-command aliases for common tasks now removed from the core set.
Hard Errors
===========
The previous iteration of the commands set would ignore any
non-command term in a command line. This has been changed to hard
error (and ignoring the entire thing) if any command is unknown or
invalid.
This fixes inconsistent / unexpected interpretations where a user
sends a command, then writes a novel on the same line some words of
which happen to *also* be commands, leading to merge states they did
not expect. They should now be told to fuck off.
Priority Restructuring
----------------------
The numerical priority system was pretty messy in that it confused
"staging priority" (in ways which were not entirely straightforward)
with overrides to other concerns.
This has now being split along all the axis, with separate command
subsets for:
- staging prioritisation, now separated between `default`, `priority`,
and `alone`,
- `default` means PRs are picked by an unspecified order when
creating a staging, if nothing better is available
- `priority` means PRs are picked first when staging, however if
`priority` PRs don't fill the staging the rest will be filled with
`default`, this mode did not previously exist
- `alone` means the PRs are picked first, before splits, and only
`alone` PRs can be part of the staging (which usually matches the
modename)
- `skipchecks` overrides both statuses and approval checks, for the
batch, something previously implied in `p=0`, but now
independent. Setting `skipchecks` basically makes the entire batch
`ready`.
For consistency this also sets the reviewer implicitly: since
skipchecks overrides both statuses *and approval*, whoever enables
this mode is essentially the reviewer.
- `cancel` cancels any ongoing staging when the marked PR becomes
ready again, previously this was also implied (in a more restricted
form) by setting `p=0`
FWBot removal
=============
While the "forwardport bot" still exists as an API level (to segregate
access rights between tokens) it has been removed as an interaction
point, as part of the modules merge plan. As a result,
fwbot stops responding
----------------------
Feedback messages are now always sent by the mergebot, the
forward-porting bot should not send any message or notification
anymore.
commands moved to the merge bot
-------------------------------
- `ignore`/`up to` simply changes bot
- `close` as well
- `skipci` is now a choice / flag of an `fw` command, which denotes
the forward-port policy,
- `fw=default` is the old `ci` and resets the policy to default,
that is wait for the PR to be merged to create forward ports, and
for the required statuses on each forward port to be received
before creating the next
- `fw=skipci` is the old `skipci`, it waits for the merge of the
base PR but then creates all the forward ports immediately (unless
it gets a conflict)
- `fw=skipmerge` immediately creates all the forward ports, without
even waiting for the PR to be merged
This is a completely new mode, and may be rather broken as until
now the 'bot has always assumed the source PR had been merged.
approval rework
---------------
Because of the previous section, there is no distinguishing feature
between `mergebot r+` = "merge this PR" and `forwardbot r+` = "merge
this PR and all its parent with different access rights".
As a result, the two have been merged under a single `mergebot r+`
with heuristics attempting to provide the best experience:
- if approving a non-forward port, the behavior does not change
- else, with review rights on the source, all ancestors are approved
- else, as author of the original, approves all ancestors which descend
from a merged PR
- else, approves all ancestors up to and including the oldest ancestor
to which we have review rights
Most notably, the source's author is not delegated on the source or
any of its descendants anymore. This might need to be revisited if it
provides too restrictive.
For the very specialized need of approving a forward-port *and none of
its ancestors*, `review=` can now take a comma (`,`) separated list of
pull request numbers (github numbers, not mergebot ids).
Computed State
==============
The `state` field of pull requests is now computed. Hopefully this
makes the status more consistent and predictable in the long run, and
importantly makes status management more reliable (because reference
datum get updated naturally flowing to the state).
For now however it makes things more complicated as some of the states
have to be separately signaled or updated:
- `closed` and `error` are now separate flags
- `merge_date` is pulled down from forwardport and becomes the
transition signal for ready -> merged
- `reviewed_by` becomes the transition signal for approval (might be a
good idea to rename it...)
- `status` is computed from the head's statuses and overrides, and
*that* becomes the validation state
Ideally, batch-level flags like `skipchecks` should be on, well, the
batch, and `state` should have a dependency on the batch. However
currently the batch is not a durable / permanent member of the system,
so it's a PR-level flag and a messy pile.
On notable change is that *forcing* the state to `ready` now does that
but also sets the reviewer, `skipchecks`, and overrides to ensure the
API-mediated readying does not get rolled back by e.g. the runbot
sending a status.
This is useful for a few types of automated / programmatic PRs
e.g. translation exports, where we set the state programmatically to
limit noise.
recursive dependency hack
-------------------------
Given a sequence of PRs with an override of the source, if one of the
PRs is updated its descendants should not have the override
anymore. However if the updated PR gets overridden, its descendants
should have *that* override.
This requires some unholy manipulations via an override of `modified`,
as the ORM supports recursive fields but not recursive
dependencies (on a different field).
unconditional followup scheduling
---------------------------------
Previously scheduling forward-port followup was contigent on the FW
policy, but it's not actually correct if the new PR is *immediately*
validated (which can happen now that the field is computed, if there
are no required statuses *or* all of the required statuses are
overridden by an ancestor) as nothing will trigger the state change
and thus scheduling of the fp followup.
The followup function checks all the properties of the batch to port,
so this should not result on incorrect ports. Although it's a bit more
expensive, and will lead to more spam.
Previously this would not happen because on creation of a PR the
validation task (commit -> PR) would still have to execute.
Misc Changes
============
- If a PR is marked as overriding / canceling stagings, it now does
so on retry not just when setting initially.
This was not handled at all previously, so a PR in P0 going into
error due to e.g. a non-deterministic bug would be retried and still
p=0, but a current staging would not get cancelled. Same when a PR
in p=0 goes into error because something was failed, then is updated
with a fix.
- Add tracking to a bunch of relevant PR fields.
Post-mortem analysis currently generally requires going through the
text logs to see what happened, which is annoying.
There is a nondeterminism / inconsistency in the tracking which
sometimes leads the admin user to trigger tracking before the bot
does, leading to the staging tracking being attributed to them
during tests, shove under the carpet by ignoring the user to whom
that tracking is attributed.
When multiple users update tracked fields in the same transaction
all the changes are attributed to the first one having triggered
tracking (?), I couldn't find why the admin sometimes takes over.
- added and leveraged support for enum-backed selection fields
- moved variuous fields from forwardport to runbot_merge
- fix a migration which had never worked and which never run (because
I forgot to bump the version on the module)
- remove some unnecessary intermediate de/serialisation
fixes#673, fixes#309, fixes#792, fixes#846 (probably)
- move all commands parsing to runbot_merge as part of the long-term
unification effort (#789)
- set up an actual parser-ish structure to parse the commands to
something approaching a sum type (fixes#507)
- this is mostly prep for reworking the commands set (#673), although
*strict command parsing* has been implemented (cf update to
`test_unknown_commands`)
Sadly m2ms don't support tracking, so add a bunch of ad-hoc tracking
to the override rights in order to know who, what, when at least.
Do the same for the review rights although maybe tracking works for
those.
It might not be a huge amount of extra work since we're never actually
retrieving the rows, but it still seems completely unnecessary.
Sadly we can't do something cleaner like an aggregation, because
aggregating requires moving the locking query to a subquery, and
experimentally that seems slower than just ignoring / discarding the
result set.
If an untracked PR is closed, especially on an inactive or untracked
branch, the closer (or author) almost certainly don't care to receive
3 different notifications on the subject.
The fix requires a schema change in order to track that we're fetching
the PR due to a `closed` event, as in other cases we may still want to
notify the user that we received the request (and it just happened to
resolve to a closed PR).
Fixes#857
`gc --prune` can not take a *separate* parameter, it has to be part of
the same arg (the `=` is not optional), otherwise the `gc` call blows
up.
So use the positional form of the git command to generate the correct
invocation, Python-level `foo=bar` generates a split-style option in
two args which does not please git.
Before this, we would check if a repository had a name and run
maintenance on it, leading to repeated (but unnoticed until now
because I didn't monitor it) tracebacks as the maintenance cron would
fail to find the local repo then run maintenance on nowhere anyway.
Also augment the repo-finding process to try and get better
information about what it's doing when it fails, rather than failing
completely silently.
The logging line was copied over from the github-api version, but it
was not correctly fixed up to match, leading to a lot of spam on
stderr when debug is enabled (aka spams journalctl on the production
server).
Splat the logging call out of `rebase` and into the various callers,
so they have access to the pr object to log it.
Forgot to deref the id of the staging we're trying to lock, so the
specific case where we start a freeze with a bump PR and an
outstanding staging in master would instantly blow up.
The low-level APIs used by the staging process don't do any merge
check, so because of the way Git works it's possible for them to merge
commits with content as empty commits, e.g. if something was merged
then backported and the backport was merged on top. This should
trigger a merge failure as we don't really want to merge newly
empty. This is a feature which some high level commands of git
support, kind-of, e.g. by default `git rebase --interactive` will ask
about newly empty commits.
Take care to allow merging already-empty commits, as these do have a
use for signaling, freezes, ....
Fixes#809
Prepares the possibility of either more direct communication with the
CI platform(s) or just assuming CI has gotten reliable enough and
colleagues intelligent enough that this is not an issue anymore
because they've stopped pushing empty branches (which we know is not
the case).
Fixes#806
During the 17.0 freezeathon, the freeze wizard blew up with
MergeError: merge-tree: {oid} - not something we can merge
Turns out when freezes were moved to local
(4d2c0f86e1) I forgot to fetch the heads
of the release and bump PRs into the local repo, so rebasing them atop
their branch would fail because the local repository would just not
find the object being rebased.
I had missed that case in testing as well, but in fairness even if I
had tried testing it I'd likely have missed it: implementation
limitations (shortcuts) of dummy central mean it currently ignores
what objects the client requests and bundles everything it can find
associated with the repository (meaning it sends the entire network).
This is not usually an issue because the test repos are pretty small,
but it means the client can have objects they should not because they
never requested them and might not even be supposed to be aware of
their existence.
Anyway solve by doing the obvious: fetch the heads of the release and
bump PRs at the same time we update the branch being forked off. Also
update the freeze tests to trigger the issue (by creating the release
/ bump PRs in different repos) and running the tests against github
actual to make sure we can actually see them fail (correctly, the
merge error we expect) not via errors in the test), and we do fix
them.
Fixes#821
If a branch `foo` is disabled, then `tmp.foo` and `staging.foo` become
unnecessary (with #247 fixed the tmp refs are not used for creating
stagings anymore, but for now they're still used for the "safety
dance" of merging a successful staging into the corresponding
mainline).
Fixes#605
The github API has gotten a lot more constraining (with rate
restrictions being newly enforced or added somewhat out of nowhere),
and as importantly a lot less reliable. So move the staging process
off of github and locally, similar to the forward porting
process (whose repo cache is being reused for this).
Fixes#247
Probably less necessary than for the regular staging stuff, but might
as well while at it.
Requires updating one of the test to generate a non-ff push, as
O_CREAT doesn't exist at the git level, and the client (and it is
client-side) only protects against force pushes. So there is no way to
trigger an issue with just the creation of the new branch, it needs to
exist *and point to a non-ancestor commit*.
Also remove a sleep in the ref update loop as there are no ref updates
anymore, until the very final sync via git.
NB: maybe it'd be possible to push both bump and release PRs together
for each repo, but getting which update failed in case of failure
seems difficult.
It has been a consideration for a while, but the pain of subtly
interacting with git via the ignominous CLI kept it back. Then ~~the
fire nation attacked~~ github got more and more tight-fisted (and in
some ways less reliable) with their API.
Staging pretty much just interacts with the git database, so it's both
a facultative github operator (it can just interact with git directly)
and a big consumer of API requests (because the git database endpoints
are very low level so it takes quite a bit of work to do anything
especially when high-level operations like rebase have to be
replicated by hand).
Furthermore, an issue has also been noticed which can be attributed to
using the github API (and that API's reliability getting worse): in
some cases github will fail to propagate a ref update / reset, so when
staging 2 PRs it's possible that the second one is merged on top of
the temporary branch of the first one, yielding a kinda broken commit
(in that it's a merge commit with a broken error message) instead of
the rebase / squash commit we expected.
As it turns out it's a very old issue but only happened very early so
was misattributed and not (sufficiently) guarded against:
- 41bd82244bb976bbd4d4be5e7bd792417c7dae6b (October 8th 2018) was
spotted but thought to be a mergebot issue (might have been one of
the opportunities where ref-checks were added though I can't find
any reference to the commit in the runbot repo).
- 2be25052e147b151d1d8a5bc73cceb351586ce03 (October 15th, 2019) was
missed (or ignored).
- 5a9fe7a7d05a9df7186072a7bffd60c6b428fd0e (July 31st, 2023) was
spotted, but happened at a moment where everything kinda broke
because of github rate-limiting ref updates, so the forensics were
difficult and it was attributed to rate limiting issues.
- f10d03bf0f2e8f88f62a5d8356b84f714196130f (August 24th, 2023) broke
the camel's back (and the head block): the logs were not too
interspersed with other garbage and pretty clear that github ack'd a
ref update, returned the correct oid when checking the ref, then
returned the wrong oid when fetching it later on.
No Working Copy
===============
The working copy turns out to not be necessary, the plumbing commands
we *need* work just fine on a bare repository.
Working without a WC means we had to reimplement the high level
operations (rebase) by hand much as we'd done previously, *but* we
needed to do that anyway as git doesn't seem to provide any way to
retrieve the mapping when rebasing/cherrypicking, and cherrypicking by
commit doesn't work well as it can't really find the *merge base* it
needs.
Forward-porting can almost certainly be implemented similarly (with
some overhead), issue #803 has been opened to keep track of the idea.
No TMP
======
The `tmp.` branches are no more, the process of creating stagings is
based entirely around oids, if staging something fails we can just
abandon the oids (they'll be collected by the weekly GC), we only
need to update the staging branches at the very end of the process.
This simplifies things a fair bit.
For now we have stopped checking for visibility / backoff as we're
pushing via git, hopefully it is a more reliable reference than the
API.
Commmit Message Formatting
==========================
There's some unfortunate churn in the test, as the handling of
trailing newlines differs between github's APIs and git itself.
Fixes#247
PS: It might be a good idea to use pygit2 instead of the CLI
eventually, the library is typed which is nice, and it avoids
shelling out although that's really unlikely to be a major cost.
Necessary to create commits *as* the mergebot without going through
the github API. Copy of the improved version from forwardport. *Not*
an override, to avoid unnecessarily triggering one or the other which
is confusing and weird.
Move *almost* all the staging code to free functions, in a separate
module, and extensively typed.
The only bits which didn't move are:
- the entry point (the cron hook), because it has to be a model method
in order to be called
- the `_build_merge_message` method, because it needs to be
overridable
There's also a bit of an import mess, because the cron &
`_build_merge_message` need to call into the new module, but the new
module wants the types they belong to, so it's a bit circular.
If the stagings are going to be created locally (via a git working
copy rather than the github API), the mergebot part needs to have
access to the cache, so move the cache over. Also move the maintenance
cron.
In an extermely minor way, this prefigures the (hopeful) eventual
merging of the ~~planes~~ modules.
- add formatting for a bunch of backend objects
- add cross-links in order to use toplevel navigation between objects
e.g. project -> branch -> staging -> PR with breadcrumbs instead of
shitty dialog boxes
Relates to #802
When I updated the status storage (including `previous_failure`) for
some reason I didn't just migrate from the old to the new format, and
added bridge functions instead.
This is not really necessary (or useful), so convert all the legacy
data and remove the conversion helpers.
Relates to #802
Currently the heads of a staging (both staging heads and merged heads)
are just JSON data on the staging itself. Historically this was
convenient as the heads were mostly of use to the staging process, and
thus accessed directly through the staging essentially exclusively.
However this makes finding stagings from merged commits e.g. for
forensic research almost impossible, because querying based on
the *values* of a JSON map is expensive, and indexing it is difficult.
To make this use case more feasible, split the `heads` field into two
join tables, one for the staging heads and one for the merged heads,
this makes looking for stagings by commits much more
efficient (although the queries may not be trivial). Also add two
utility RPC methods, so that it's possible to query stagings
reasonably easily and efficiently based on a set of commits (branch
heads).
related to #768
Allow filtering stagings by state (success or failure), and provide a
control to explicitly update the staging date limit.
Should make it easier to drill through stagings when looking for
specific information.
Related to #751
Fix outstanding query to make a positive `state` filtering, instead of
negative, matching 3b52b1aace8674259812a76b1566260937dbcacb.
Also manually create a map of stagings (grouped by branch) sharing a
single prefetch set.
For odoo the mergebot home page has 12 branches in the odoo project
and 8 in spreadsheet, 6 stagings each. This means 120 queries to
retrieve all the heads (Odoo stagings have 5 heads and spreadsheet
have 1, but that seems immaterial).
By fixing `_compute_statuses` and creating a single prefetch set for
all stagings of all branches we can fetch all the commits in a single
query instead of 120.
- add support for authorship (not just approval)
- make display counts directly
- fix `state` filter: postgres can't do negative index lookups
- add indexes for author and reviewed_by as we look them up
- ensure we handle the entire source filtering via a single subquery
Closes#778
During the 16.3 freeze an issue was noticed with the concurrency
safety of the freeze wizard (because it blew up, which caused a few
issues): it is possible for the cancelling of an active staging to the
master branch to fail, which causes the mergebot side of the freeze to
fail, but the github state is completed, which puts the entire thing
in a less than ideal state.
Especially with the additional issue that the branch inserter has its
own concurrency issue (which maybe I should fix): if there are
branches *being* forward-ported across the new branch, it's unable to
see them, and thus can not create the now-missing PRs.
Try to make the freeze wizard more resilient:
1. Take a lock on the master staging (if any) early on, this means if
we can acquire it we should be able to cancel it, and it won't
suffer a concurrency error.
2. Add the `process_updated_commits` cron to the set of locked crons,
trying to read the log timeline it looks like the issue was commits
being impacted on that staging while the action had started:
REPEATABLE READ meant the freeze's transaction was unable to see
the update from the commit statuses, therefore creating a diverging
update when it cancelled the staging, which postgres then reported
as a serialization error.
I'd like to relax the locking of the cron (to just FOR SHARE), but I
think it would work, per postgres:
> SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as
> SELECT in terms of searching for target rows: they will only find
> target rows that were committed as of the transaction start
> time. However, such a target row might have already been updated (or
> deleted or locked) by another concurrent transaction by the time it
> is found. In this case, the repeatable read transaction will wait
> for the first updating transaction to commit or roll back (if it is
> still in progress). If the first updater rolls back, then its
> effects are negated and the repeatable read transaction can proceed
> with updating the originally found row. But if the first updater
> commits (and actually updated or deleted the row, not just locked
> it) then the repeatable read transaction will be rolled back with
> the message
This means it would be possible to lock the cron, and then get a
transaction error because the cron modified one of the records we're
going to hit while it was running: as far as the above is concerned
the cron's worker had "just locked" the row so it's fine to
continue. However this makes it more and more likely an error will be
hit when trying to freeze (to no issue, but still). We'll have to see
how that ends up.
Fixes#766 maybe
Currently sentry is only hooked from the outside, which doesn't
necessarily provide sufficiently actionable information.
Add some a few hooks to (try and) report odoo / mergebot metadata:
- add the user to WSGI transactions
- add a transaction (with users) around crons
- add the webhook event info to webhook requests
- add a few spans to the long-running crons, when they cover multiple
units per iteration (e.g. a span per branch being staged)
Closes#544
- move sentry configuration and add exception-based filtering
- clarify and reclassify (e.g. from warning to info) a few messages
- convert assertions in rebase to MergeError so they can be correctly
logged & reported, and ignored by sentry, also clarify them
(especially the consistency one)
Related to #544
Largely informed by sentry,
- Fix an inconsistency in staging ref verification, `set_ref`
internally waits for the observed head to match the requested head,
but then `try_staging` would re-check that and immediately fail if
it didn't.
Because github is *eventually* consistent (hopefully) this second
check can fail (and is also an extra API call), breaking staging
unnecessarily, especially as we're still going to wait for the
update to be visible to git.
Remove this redundant check entirely, as github provides no way to
ensure we have a consistent view of anything, it doesn't have much
value and can do much harm.
- Add github request id to one of the sanity check warnings as that
could be a useful thing to send upstream, missing github request ids
in the future should be noted and added.
- Reworked the GH object's calls to be clearer and more coherent:
consistently log the same thing on all GH errors (if `check`),
rather than just on the one without a `check` entry.
Also remove `raise_for_status` and raise `HTTPError` by hand every
time we hit a status >= 400, so we always forward the response body
no matter what its type is.
- Try again to log the request body (in full as it should be pretty
small), also remove stripping since we specifically wanted to add a
newline at the start, I've no idea what I was thinking.
Fixes#735, #764, #544