From 7348e4d7a4d3565433423cc2d91307d40fd46d35 Mon Sep 17 00:00:00 2001
From: Xavier Morel
Date: Fri, 11 Aug 2023 11:50:38 +0200
Subject: [PATCH] [IMP] runbot_merge: ensure at least 1s between mutating GH calls

Mostly a temporary safety feature after the events of 07-31: it's still
not clear whether that was a one-off issue or a change in policy (I was
not able to reproduce locally even doing several set_refs a second) and
the gh support is not super talkative, but it probably doesn't hurt to
commit the workaround until #247 gets implemented.

On 2023-07-31, around 08:30 UTC, `set_ref` started failing, a lot
(although oddly enough not continuously), with the unhelpful message
that

> 422: Reference cannot be updated

This basically broke all stagings, until a workaround was implemented
by adding a 1s sleep before `set_ref` to ensure no more than one
`set_ref` per second, which kinda sorta has been the github
recommendation forever but had never been an issue before. Contributing
to this suspicion is that in late 2022, the documentation of error 422
on `PATCH git/refs/{ref}` was updated to:

> Validation failed, or the endpoint has been spammed.

Still would be nice if GH was clear about it and sent a 429 instead.

Technically the recommendation is:

> If you're making a large number of POST, PATCH, PUT, or DELETE
> requests for a single user or client ID, wait at least one second
> between each request.

So... actually implement that. On a per-worker basis, as these are for
the most part serial processes (e.g. crons); we can still get above the
rate limit due to concurrent crons, but it should be less likely.

Also take `Retry-After` into account, can't hurt, though we're supposed
to retry just the request rather than abort the entire thing. Maybe a
future update can improve this handling.

Would also be nice to take `X-RateLimit` into account, although that's
supposed to apply to *all* requests so we'd need a second separate
timestamp to track it. Technically that's probably also the case for
`Retry-After`.

And fixing #247 should cut down drastically on the API call traffic as
staging is a very API-intensive process, especially with the sanity
checks we had to add; these days we might be at 4 calls per commit per
PR, and up to 80 PRs/staging (5 repositories and 16 batches per
staging), with 13 live branches (though realistically only 6-7 have
significant traffic, and only 1~2 get close to filling their staging
slots).
---
 runbot_merge/github.py | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/runbot_merge/github.py b/runbot_merge/github.py
index 1461dc82..ed3950b0 100644
--- a/runbot_merge/github.py
+++ b/runbot_merge/github.py
@@ -6,6 +6,7 @@ import logging.handlers
 import os
 import pathlib
 import pprint
+import time
 import unicodedata
 
 import requests
@@ -57,6 +58,7 @@ class GH(object):
     def __init__(self, token, repo):
         self._url = 'https://api.github.com'
         self._repo = repo
+        self._last_update = 0
         session = self._session = requests.Session()
         session.headers['Authorization'] = 'token {}'.format(token)
         session.headers['Accept'] = 'application/vnd.github.symmetra-preview+json'
@@ -103,8 +105,16 @@ class GH(object):
         """
         :type check: bool | dict[int:Exception]
         """
+        if method.casefold() != 'get':
+            to_sleep = 1. - (time.time() - self._last_update)
+            if to_sleep > 0:
+                time.sleep(to_sleep)
+
         path = f'/repos/{self._repo}/{path}'
         r = self._session.request(method, self._url + path, params=params, json=json)
+        if method.casefold() != 'get':
+            self._last_update = time.time() + int(r.headers.get('Retry-After', 0))
+
         self._log_gh(_gh, r)
         if check:
             try:
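The throttling logic the patch adds can be sketched as a standalone helper. This is a minimal illustration, not the actual `GH` class: the `Throttle` name, the `before_request`/`after_response` split, and the injectable `clock`/`sleep` parameters are all hypothetical; only the core arithmetic (1s floor between mutating calls, `Retry-After` pushing the reference timestamp into the future) mirrors the diff.

```python
import time


class Throttle:
    """Per-worker rate limiter: keep at least `min_interval` seconds
    between mutating (non-GET) requests, mirroring the patch's logic."""

    def __init__(self, min_interval=1.0, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so tests don't need real time
        self._sleep = sleep
        self._last_update = 0.0  # timestamp of the last mutating call

    def before_request(self, method):
        # GETs are never throttled, matching the patch.
        if method.casefold() == 'get':
            return
        to_sleep = self.min_interval - (self._clock() - self._last_update)
        if to_sleep > 0:
            self._sleep(to_sleep)

    def after_response(self, method, headers):
        if method.casefold() == 'get':
            return
        # Honoring Retry-After by pushing the reference timestamp into
        # the future delays the *next* mutating call accordingly.
        self._last_update = self._clock() + int(headers.get('Retry-After', 0))
```

Keeping the state per instance matches the patch's per-worker scope: separate workers (separate `GH` instances) throttle independently, so concurrent crons can still exceed the global limit, as the commit message notes.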