Opened 11 years ago
Closed 9 years ago
#11813 closed defect (wontfix)
Stale caches with trac and transparent proxies
Reported by: | vbraun | Owned by: | mvngu, schilly |
---|---|---|---|
Priority: | major | Milestone: | sage-duplicate/invalid/wontfix |
Component: | website/wiki | Keywords: | |
Cc: | mderickx, was, SimonKing, robertwb | Merged in: | |
Authors: | Harald Schilly | Reviewers: | |
Report Upstream: | N/A | Work issues: | |
Branch: | | Commit: | |
Dependencies: | | Stopgaps: | |
Description
Many sites run transparent web proxies. That should be fine, but Simon King and I both recently ran into a bug where an attempt to download a patch from trac returned an old version of the patch. Needless to say, this is very dangerous for development.
To reproduce, you need to have a transparent proxy in front of you, and then
- Upload a patch to trac
- Download the patch (the proxy will cache it)
- Upload a new version of the patch under the same name
- Download the patch again - under some circumstances the old version of the patch is served by the (not so) transparent proxy.
This just happened to me with trac11115-cached_cython.patch. If I download it from boxen (without proxy), I receive the following http headers:
vbraun@boxen:~$ wget -O- -S http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/trac11115-cached_cython.patch | md5sum
--05:39:42--  http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/trac11115-cached_cython.patch
           => `-'
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 Ok
  Date: Sun, 18 Sep 2011 12:39:42 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  ETag: W/"anonymous/Sat, 17 Sep 2011 21:06:12 GMT/False"
  Content-Disposition: attachment
  Content-Length: 151548
  Last-Modified: Sat, 17 Sep 2011 21:06:12 GMT
  Keep-Alive: timeout=15, max=1000
  Connection: Keep-Alive
  Content-Type: text/x-diff; charset=iso-8859-15
Length: 151,548 (148K) [text/x-diff]
100%[=============================================================================>] 151,548 --.--K/s
05:39:42 (161.15 MB/s) - `-' saved [151548/151548]
0dc42d7f8d3ae270eb65927ed942ad24  -
This is the correct patch. But behind my proxy, I receive a stale copy:
wget -O- -S http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/trac11115-cached_cython.patch | md5sum
--2011-09-18 13:37:47--  http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/trac11115-cached_cython.patch
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.0 200 OK
  Date: Sat, 17 Sep 2011 20:37:09 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  ETag: W/"anonymous/Thu, 26 May 2011 07:16:22 GMT/False"
  Content-Disposition: attachment
  Content-Length: 151609
  Last-Modified: Thu, 26 May 2011 07:16:22 GMT
  Content-Type: text/x-diff; charset=iso-8859-15
  Age: 57638
  X-Cache: HIT from fw.stp.dias.ie
  X-Cache-Lookup: HIT from fw.stp.dias.ie:3128
  Via: 1.1 fw.stp.dias.ie:3128 (squid/2.7.STABLE9)
  Connection: keep-alive
Length: 151609 (148K) [text/x-diff]
Saving to: “STDOUT”
100%[============================================================>] 151,609 --.-K/s in 0.002s
2011-09-18 13:37:47 (77.8 MB/s) - written to stdout [151609/151609]
f88ca8ad9090aeacb6dc0c726dcc76b5  -
HTTP provides the ETag header to control cache freshness. The proxy (squid/2.7.STABLE9) should have checked with the trac server to see if the cached ETag W/"anonymous/Thu, 26 May 2011 07:16:22 GMT/False" is still up-to-date. If the resource were still up to date the trac server would reply HTTP 304 Not Modified, but since the ETag changed the trac server should reply with the new version of the patch. I don't have access to the server logs so I can't say what happened for sure, but something is broken.
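The revalidation step described above can be sketched in a few lines of Python. This is only an illustration of the ETag comparison: the function name `revalidate` is hypothetical and it does not model trac's or squid's actual request handling.

```python
def revalidate(if_none_match, current_etag):
    """Return (status, send_body) for a conditional GET.

    Hypothetical helper: it models only the If-None-Match comparison.
    Identical weak ETags (W/"...") are treated as a match.
    """
    if if_none_match is not None and if_none_match == current_etag:
        # The cached copy is still current: answer without a body.
        return 304, False
    # The ETag changed (or no validator was sent): send the full resource.
    return 200, True


# ETag values taken from the two transcripts above.
stale = 'W/"anonymous/Thu, 26 May 2011 07:16:22 GMT/False"'
fresh = 'W/"anonymous/Sat, 17 Sep 2011 21:06:12 GMT/False"'

print(revalidate(stale, fresh))  # (200, True)
print(revalidate(fresh, fresh))  # (304, False)
```

Had the proxy sent the stale ETag back to the server, the first case would have applied and the new patch would have been delivered.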
A workaround is to set the Pragma: no-cache header in the client query (i.e. use wget --no-cache), but then it's easy to forget that.
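For scripted downloads, the same workaround can be expressed with Python's standard library. This is a minimal sketch; the helper name `no_cache_request` is hypothetical, and sending both headers is a choice made here to cover HTTP/1.0 and HTTP/1.1 caches.

```python
import urllib.request


def no_cache_request(url):
    """Build a GET request that asks intermediate caches to revalidate."""
    return urllib.request.Request(
        url,
        headers={
            "Pragma": "no-cache",         # understood by HTTP/1.0 caches
            "Cache-Control": "no-cache",  # understood by HTTP/1.1 caches
        },
    )


req = no_cache_request(
    "http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/"
    "trac11115-cached_cython.patch"
)
# urllib.request.urlopen(req) would then perform the actual download.
```

As with wget, this only helps if every client remembers to do it, which is the weakness of a client-side fix.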
Irrespective of who is precisely at fault, we should configure the trac server to never allow caching of the patches since their integrity is crucial for us and client-side caching doesn't really buy us much. For that, I propose to configure Apache to add the following to the headers for all resources under /sage_trac/raw-attachment:
Cache-Control: no-cache
Expires: Thu, 1 Jan 1970 00:00:00 GMT
hitting both the HTTP/1.0 and 1.1 cache control mechanisms.
See also upstream bug http://trac.edgewall.org/ticket/6367
Change History (22)
comment:1 Changed 11 years ago by
comment:2 follow-ups: 3 12 Changed 11 years ago by
Replying to vbraun:
- Download the patch again - under some circumstances the old version of the patch is served by the (not so) transparent proxy.
This just happened to me with trac11115-cached_cython.patch.
Well, if you name it such...
SCNR
A workaround is to set the Pragma: no-cache in the client query (i.e. use wget --no-cache), but then it's easy to forget that.
You can put it into ~/.wgetrc.
Irrespective of who is precisely at fault, we should configure the trac server to never allow caching of the patches since their integrity is crucial for us and client-side caching doesn't really buy us much. For that, I propose to configure Apache to add the following to the headers for all resources under /sage_trac/raw-attachment:
Cache-Control: no-cache
Expires: Thu, 1 Jan 1970 00:00:00 GMT
hitting both the HTTP/1.0 and 1.1 cache control mechanisms.
Hmmm, I cannot tell how large the impact on the machine running trac will be, but in general I think that's a bad idea.
I don't think it will be large at the moment for ordinary users, as most patches are [hopefully] small, but note that this also inhibits the use of -N (--timestamping), which is more [or also] crucial to the client side, especially for bots (or if you have a very slow, or expensive / volume-taxed connection).
comment:3 Changed 11 years ago by
Replying to leif:
[...] this also inhibits the use of -N (--timestamping) [...]
Thinking more about it, I'm not so sure whether this would be a problem with wget (provided Last-modified is kept as is), but I rather doubt that other programs will be as "smart", i.e., first checking the modification time if a file is not to be cached anyway.
comment:4 follow-up: 5 Changed 11 years ago by
Cc: | robertwb added |
---|
I see your point about bots, though afaik that would only affect the sage patch buildbot. I'll cc Robert, maybe he can tell us how his bot checks freshness.
Ordinary users don't download patches from trac on a regular basis, that's why I don't think there is any bandwidth issue. The problem with the ~/.wgetrc bandaid is that you often don't know whether you are behind a transparent proxy. Sage developers regularly work away from their home institutions / in coffee shops, and who knows what is being cached.
comment:5 follow-up: 7 Changed 11 years ago by
Replying to vbraun:
I see your point about bots, though afaik that would only affect the sage patch buildbot. I'll cc Robert, maybe he can tell us how his bot checks freshness.
Robert [also] stores md5sums (or some hash code) of patches, so I assume he doesn't specifically use HTTP header information, and his bot does scan the ticket's comments, in contrast to my release tool and AFAIK also Jeroen's. (The advantage is that you don't have to fetch and parse the -- potentially large or long -- HTML version of a ticket at all; that's btw. another reason why we want the relevant files, spkgs or patches to be applied, to be referenced in the ticket's description.)
Ordinary users don't download patches from trac on a regular basis, that's why I don't think there is any bandwidth issue. The problem with the ~/.wgetrc bandaid is that you often don't know whether you are behind a transparent proxy.
Well, humans are more likely to read the comments on a ticket, so they actually see that a patch was re-uploaded / modified (though they perhaps don't look at the file modification times of the downloaded files, which one IMHO should do).
But the purpose of ~/.wgetrc in this case would be to always disable caching (by default), such that it wouldn't matter whether you're behind a proxy or not (provided the proxy isn't broken and doesn't refuse to bypass caching).
comment:6 Changed 11 years ago by
P.S.: The best way to bypass proxies or disable caching is to use HTTPS; unfortunately Sage's trac doesn't support it...
comment:7 follow-up: 8 Changed 11 years ago by
Description: | modified (diff) |
---|
Replying to leif:
But the purpose of ~/.wgetrc in this case would be to always disable caching (by default), such that it wouldn't matter whether you're behind a proxy or not (provided the proxy isn't broken and doesn't refuse to bypass caching).
So you are suggesting that every Sage developer puts a particular entry in ~/.wgetrc on all of his laptops, just to be safe if he ever leaves his house with it, while we could just work around it in a few lines of the Apache httpd.conf.
Well, humans are more likely to read the comments on a ticket, so they actually see that a patch was re-uploaded / modified (though they perhaps don't look at the file modification times of the downloaded files, which one IMHO should do).
The html version does not get erroneously cached, the bug manifests only with the raw attachment. Trac dishes out the html version with Cache-control: must-revalidate:
vbraun@boxen:~$ wget -O- -S http://trac.sagemath.org/sage_trac/attachment/ticket/11115/trac11115-cached_cython.patch | md5sum
--10:49:49--  http://trac.sagemath.org/sage_trac/attachment/ticket/11115/trac11115-cached_cython.patch
           => `-'
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 Ok
  Date: Sun, 18 Sep 2011 17:49:49 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  ETag: W/"anonymous/Sat, 17 Sep 2011 21:06:12 GMT/False"
  Cache-control: must-revalidate
  Set-Cookie: trac_form_token=6ee2168bc6a1bd4e46d5ac03; Path=/sage_trac
  Set-Cookie: trac_session=36b6b4eaaf22a880e1451a6a; expires=Sat, 17-Dec-2011 17:49:53 GMT; Path=/sage_trac
  Content-Length: 750922
  Vary: Accept-Encoding
  Keep-Alive: timeout=15, max=1000
  Connection: Keep-Alive
  Content-Type: text/html;charset=utf-8
Length: 750,922 (733K) [text/html]
100%[===========================================================================================================================>] 750,922 --.--K/s
10:49:53 (181.60 MB/s) - `-' saved [750922/750922]
0ee8396915e5be21797f03b88cacd53c  -
Though the Vary: Accept-Encoding header is very wrong. Looking at the trac trac (:-), this seems to be a known bug: http://trac.edgewall.org/ticket/6367. That ticket says: "Also note that Request.send_file() function does not send a Cache-Control header. That should be OK if Vary * is sent". This seems to be the issue: raw attachments have neither a Cache-control nor a Vary: * header.
And I don't check manually that a downloaded file has the right time stamp. I have a computer to do menial tasks for me, not the other way round :-)
comment:8 Changed 11 years ago by
Replying to vbraun:
Replying to leif:
But the purpose of ~/.wgetrc in this case would be to always disable caching (by default), such that it wouldn't matter whether you're behind a proxy or not (provided the proxy isn't broken and doesn't refuse to bypass caching).
So you are suggesting that every Sage developer puts a particular entry in ~/.wgetrc on all of his laptops, just to be safe if he ever leaves his house with it.
You're right, that's of course too much to demand, compared to the minimal effort it takes to continually install a recent version of Sage with all necessary prerequisite patches etc. on every computer one might take with oneself.
Well, humans are more likely to read the comments on a ticket, so they actually see that a patch was re-uploaded / modified (though they perhaps don't look at the file modification times of the downloaded files, which one IMHO should do).
The html version does not get erroneously cached, the bug manifests only with the raw attachment.
So it is safe to read the comments; I didn't say anything else. Bots usually don't do that, which was my point.
And I don't check manually that a downloaded file has the right time stamp. I have a computer to do menial tasks for me, not the other way round :-)
Well, you should notice when a patch that you know was recently re-uploaded on trac suddenly has a modification time of months ago, and in case you have any doubt, you can easily view the HTML version, which also shows the current changeset header, and compare it to what you actually downloaded or have on your disk.
Hopefully this will be fixed by Edgewall, although -- despite the discussion on sage-devel -- we still have the more than two years old 0.11.5 version.
comment:9 Changed 11 years ago by
Well, you should notice when a patch that you know was recently re-uploaded on trac suddenly has a modification time of months ago, and in case you have any doubt, you can easily view the HTML version, which also shows the current changeset header, and compare it to what you actually downloaded or have on your disk.
Not necessarily. I often read the patch on trac (looking at the html version), then download it with hg qimport .../raw-attachment/x.patch, and then test it. Unless there are any test failures it's easy to miss that mercurial got a stale copy.
Since the upstream fix will add a Cache-control: must-revalidate and/or Vary: *, we don't lose anything by having Apache add those headers right now.
comment:10 Changed 11 years ago by
I looked at the squid logs today and they confirm what I said: Squid considers the .../raw-attachment/... resources to be always fresh and returns the cached values without revalidating them.
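One mechanism that could produce this behaviour, offered only as a possibility since nothing in this ticket confirms it: when a response carries neither Cache-Control nor Expires, HTTP permits a cache to estimate freshness heuristically from Last-Modified. The sketch below applies such a heuristic to the stale response quoted in the description. The function name and the 10% factor are assumptions for illustration, not squid's verified configuration.

```python
from email.utils import parsedate_to_datetime


def heuristic_lifetime(date_hdr, last_modified_hdr, factor=0.1):
    """Seconds a cache might treat a response as fresh without revalidating.

    Assumption: the cache uses a fraction (`factor`) of the time elapsed
    between Last-Modified and Date. The 10% default is illustrative only.
    """
    date = parsedate_to_datetime(date_hdr)
    last_modified = parsedate_to_datetime(last_modified_hdr)
    return factor * (date - last_modified).total_seconds()


# Date and Last-Modified from the stale response in the description.
lifetime = heuristic_lifetime(
    "Sat, 17 Sep 2011 20:37:09 GMT",
    "Thu, 26 May 2011 07:16:22 GMT",
)
age = 57638  # the Age header reported for the cached copy
print(age < lifetime)  # True: under this heuristic the copy counts as fresh
```

Under that assumption a patch last modified months ago would be served from cache for days, whereas a file modified moments ago would expire almost immediately.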
comment:11 Changed 11 years ago by
I added the following .htaccess file to my home directory on boxen:
Header set ETag "W/\"anonymous/Mon, 19 Sep 2011 12:41:02 GMT/False\""
Header set Content-Type "text/x-diff; charset=iso-8859-15"
Header set Content-Disposition "attachment"
Header unset Accept-Ranges
AddType "text/x-diff; charset=iso-8859-15" .patch
Now the same headers as from sage_trac/raw-attachment are served when trying to download a file with wget. Still, I can't reproduce the stale cache. In fact, squid now refuses to cache files for more than a few seconds. It seems like squid adds some funky request headers when talking to trac which I don't know about.
Somebody needs to capture the tcp stream between the proxy and trac to make further progress.
comment:12 Changed 11 years ago by
Replying to leif:
Replying to vbraun:
Irrespective of who is precisely at fault, we should configure the trac server to never allow caching of the patches since their integrity is crucial for us and client-side caching doesn't really buy us much. For that, I propose to configure Apache to add the following to the headers for all resources under /sage_trac/raw-attachment:
Cache-Control: no-cache
Expires: Thu, 1 Jan 1970 00:00:00 GMT
hitting both the HTTP/1.0 and 1.1 cache control mechanisms.
Hmmm, I cannot tell how large the impact on the machine running trac will be, but in general I think that's a bad idea.
I don't think it will be large at the moment for ordinary users, as most patches are [hopefully] small, but note that this also inhibits the use of -N (--timestamping), which is more [or also] crucial to the client side, especially for bots (or if you have a very slow, or expensive / volume-taxed connection).
This is the correct solution--it's what you do when you have a URI whose content may change. I don't think it'll be a severe issue for developers to hit trac itself rather than transparent proxies and it's a worthy price to pay for always getting the correct version. The raw patch is certainly cheaper on the server and connection than the html version.
Currently, there's only one patchbot and it's running locally, so no transparent proxy that I'm aware of :). In general, it only downloads patches when a new one was added (it parses the main ticket page (rss feed)). Patches are identified with md5 hashes, so although it could get confused by caches if they were in the way, when it gets a new version of a patch it doesn't get confused.
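Identifying a patch by its hash, as described above, can be sketched as follows. This is not the patchbot's actual code; the function names are hypothetical and the check simply compares the MD5 digest of the downloaded bytes with a digest obtained from a trusted source.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the downloaded bytes."""
    return hashlib.md5(data).hexdigest()


def is_expected_patch(data: bytes, expected_md5: str) -> bool:
    """True if the downloaded bytes match the digest we expect."""
    return md5_hex(data) == expected_md5.lower()


# Example with the well-known digest of empty input.
print(is_expected_patch(b"", "d41d8cd98f00b204e9800998ecf8427e"))  # True
```

A stale copy served by a proxy would fail this check, provided the expected digest was not itself computed from a cached download.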
comment:13 follow-up: 14 Changed 11 years ago by
I propose to add the following to the Apache httpd.conf (untested):
<DirectoryMatch ".*/raw-attachment/.*">
  Header set Cache-control "must-revalidate"
  Header set Vary "*"
</DirectoryMatch>
This should work then for all trac instances running on Sage.
Also, this shouldn't affect wget --timestamping since it only instructs the transparent proxy to always check back with the server, but not necessarily to download anything except the header.
comment:14 Changed 11 years ago by
Replying to vbraun:
I propose to add the following to the Apache httpd.conf (untested):
<DirectoryMatch ".*/raw-attachment/.*">
  Header set Cache-control "must-revalidate"
  Header set Vary "*"
</DirectoryMatch>
This should work then for all trac instances running on Sage.
Also, this shouldn't affect wget --timestamping since it only instructs the transparent proxy to always check back with the server, but not necessarily to download anything except the header.
Then anybody with the appropriate permissions (ticket owners?!) go ahead!
(Or at least give it a try.)
comment:15 Changed 10 years ago by
Authors: | → Harald Schilly |
---|---|
Status: | new → needs_review |
Harald has made the change (with LocationMatch instead of DirectoryMatch) and it did add the headers in question:
vbraun@boxen:~$ wget -S -O /dev/null http://trac.sagemath.org/sage_trac/raw-attachment/ticket/14319/trac_14319-empty.patch
--06:47:04--  http://trac.sagemath.org/sage_trac/raw-attachment/ticket/14319/trac_14319-empty.patch
           => `/dev/null'
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 Ok
  Date: Thu, 18 Apr 2013 13:47:04 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  ETag: W/"anonymous/Thu, 18 Apr 2013 08:47:28 GMT/False"
  Content-Disposition: attachment
  Set-Cookie: trac_session=632bd240e900edb0698c9b97; expires=Wed, 17-Jul-2013 13:47:04 GMT; Path=/sage_trac
  Content-Length: 9819
  Last-Modified: Thu, 18 Apr 2013 08:47:28 GMT
  Cache-control: must-revalidate
  Vary: *
  Keep-Alive: timeout=15, max=1000
  Connection: Keep-Alive
  Content-Type: text/x-diff; charset=iso-8859-15
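A check like the one above can be automated. The following is a rough sketch that parses header lines in the form printed by wget -S and tests for the two proposed headers; the helper names `parse_headers` and `has_proposed_headers` are hypothetical, and the sample is an excerpt of the response shown above.

```python
def parse_headers(text):
    """Parse 'Name: value' lines into a dict keyed by lower-case header name."""
    headers = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        # Skip the status line and anything that is not a header.
        if sep and " " not in name:
            headers[name.lower()] = value.strip()
    return headers


def has_proposed_headers(headers):
    """True if both Cache-control: must-revalidate and Vary: * are present."""
    cache_control = headers.get("cache-control", "").lower()
    directives = {d.strip() for d in cache_control.split(",")}
    return "must-revalidate" in directives and headers.get("vary") == "*"


sample = """\
HTTP/1.1 200 Ok
ETag: W/"anonymous/Thu, 18 Apr 2013 08:47:28 GMT/False"
Last-Modified: Thu, 18 Apr 2013 08:47:28 GMT
Cache-control: must-revalidate
Vary: *
Content-Type: text/x-diff; charset=iso-8859-15
"""
print(has_proposed_headers(parse_headers(sample)))  # True
```

The same function returns False for the 2011 responses in the description, which carry neither header.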
comment:16 Changed 9 years ago by
Milestone: | sage-5.11 → sage-5.12 |
---|
comment:17 Changed 9 years ago by
Milestone: | sage-6.1 → sage-6.2 |
---|
comment:18 Changed 9 years ago by
Milestone: | sage-6.2 → sage-6.3 |
---|
comment:20 Changed 9 years ago by
Milestone: | sage-6.3 → sage-duplicate/invalid/wontfix |
---|---|
Status: | needs_review → positive_review |
I agree, gave it positive review so Volker can close it.
comment:21 Changed 9 years ago by
Well, while the switch to git alleviates the situation, it doesn't mean we no longer attach files to trac tickets. Anyway, those are perhaps rarely updated in place.
comment:22 Changed 9 years ago by
Resolution: | → wontfix |
---|---|
Status: | positive_review → closed |
I just looked at the trac server logs, and it never receives a second request for anything under /sage_trac/raw-attachment from our squid transparent proxy. So the culprit is definitely squid having a problem with ETag, even though the docs say that it should work.
The caching works correctly for other resources, for example if I request http://trac.sagemath.org/sage_trac/chrome/common/download.png from behind the transparent proxy it does send a request up to the trac server, is answered with an HTTP 304, and serves the up-to-date cached version. The only difference I can see is that trac does not include an ETag header in that case, so it is a different code path in squid.