Opened 10 years ago

Closed 7 years ago

#11813 closed defect (wontfix)

Stale caches with trac and transparent proxies

Reported by: vbraun
Owned by: mvngu, schilly
Priority: major
Milestone: sage-duplicate/invalid/wontfix
Component: website/wiki
Keywords:
Cc: mderickx, was, SimonKing, robertwb
Merged in:
Authors: Harald Schilly
Reviewers:
Report Upstream: N/A
Work issues:
Branch:
Commit:
Dependencies:
Stopgaps:

Description (last modified by vbraun)

Many sites run transparent web proxies, which should be fine, but Simon King and I both recently ran into a bug where an attempt to download a patch from trac returned an old version of the patch. Needless to say, this is very dangerous for development.

To reproduce, you need to have a transparent proxy in front of you, and then

  1. Upload a patch to trac
  2. Download the patch (the proxy will cache it)
  3. Upload a new version of the patch under the same name
  4. Download the patch again - under some circumstances the old version of the patch is served by the (not so) transparent proxy.

This just happened to me with trac11115-cached_cython.patch. If I download it from boxen (without proxy), I receive the following http headers:

vbraun@boxen:~$ wget -O- -S http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/trac11115-cached_cython.patch | md5sum
--05:39:42--  http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/trac11115-cached_cython.patch
           => `-'
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 Ok
  Date: Sun, 18 Sep 2011 12:39:42 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  ETag: W/"anonymous/Sat, 17 Sep 2011 21:06:12 GMT/False"
  Content-Disposition: attachment
  Content-Length: 151548
  Last-Modified: Sat, 17 Sep 2011 21:06:12 GMT
  Keep-Alive: timeout=15, max=1000
  Connection: Keep-Alive
  Content-Type: text/x-diff; charset=iso-8859-15
Length: 151,548 (148K) [text/x-diff]

100%[=============================================================================>] 151,548       --.--K/s             

05:39:42 (161.15 MB/s) - `-' saved [151548/151548]

0dc42d7f8d3ae270eb65927ed942ad24  -

This is the correct patch. But behind my proxy, I receive a stale copy:

wget -O- -S http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/trac11115-cached_cython.patch | md5sum 
--2011-09-18 13:37:47--  http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/trac11115-cached_cython.patch
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.0 200 OK
  Date: Sat, 17 Sep 2011 20:37:09 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  ETag: W/"anonymous/Thu, 26 May 2011 07:16:22 GMT/False"
  Content-Disposition: attachment
  Content-Length: 151609
  Last-Modified: Thu, 26 May 2011 07:16:22 GMT
  Content-Type: text/x-diff; charset=iso-8859-15
  Age: 57638
  X-Cache: HIT from fw.stp.dias.ie
  X-Cache-Lookup: HIT from fw.stp.dias.ie:3128
  Via: 1.1 fw.stp.dias.ie:3128 (squid/2.7.STABLE9)
  Connection: keep-alive
Length: 151609 (148K) [text/x-diff]
Saving to: “STDOUT”

100%[============================================================>] 151,609     --.-K/s   in 0.002s  

2011-09-18 13:37:47 (77.8 MB/s) - written to stdout [151609/151609]

f88ca8ad9090aeacb6dc0c726dcc76b5  -
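The Date and Age headers of the proxied response already show how old the cached copy is. The following is only a small worked calculation using the header values printed above; it assumes that Date is the time trac generated the cached response and that Age counts the seconds it has since spent in the proxy.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Values copied from the two wget transcripts above.
date_header = "Sat, 17 Sep 2011 20:37:09 GMT"   # Date of the cached (proxied) response
age_seconds = 57638                              # Age reported by the proxy
new_upload = "Sat, 17 Sep 2011 21:06:12 GMT"     # Last-Modified seen without the proxy

fetched_at = parsedate_to_datetime(date_header)
served_at = fetched_at + timedelta(seconds=age_seconds)
uploaded_at = parsedate_to_datetime(new_upload)

print("cached copy fetched from trac:", fetched_at)
print("cached copy served by proxy:  ", served_at)
print("fetched before the re-upload: ", fetched_at < uploaded_at)
```

Date plus Age comes out at 12:37:47 GMT on 18 Sep 2011, which lines up with the wget timestamp of 13:37:47 if that is local time one hour ahead of GMT (an assumption). The cached copy was therefore fetched roughly half an hour before the new patch was uploaded and was never refreshed afterwards.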

HTTP provides the ETag header to control cache freshness. The proxy (squid/2.7.STABLE9) should have checked with the trac server to see if the cached ETag W/"anonymous/Thu, 26 May 2011 07:16:22 GMT/False" is still up-to-date. If the resource were still up to date the trac server would reply HTTP 304 Not Modified, but since the ETag changed the trac server should reply with the new version of the patch. I don't have access to the server logs so I can't say what happened for sure, but something is broken.
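To make the expected exchange concrete, here is a toy simulation of a cache that revalidates with the origin via the ETag and one that does not. It is a sketch of the mechanism only: the Origin and Cache classes and the ETag values are made up for illustration and are not how squid or trac are implemented.

```python
class Origin:
    """Toy origin server holding one resource with a weak ETag."""

    def __init__(self, body, etag):
        self.body, self.etag = body, etag
        self.requests = 0                      # how many requests reached the origin

    def get(self, if_none_match=None):
        self.requests += 1
        if if_none_match == self.etag:
            return 304, self.etag, None        # Not Modified: cached body is still valid
        return 200, self.etag, self.body       # full response with the current version


class Cache:
    """Toy shared cache; `revalidate` toggles the conditional request."""

    def __init__(self, origin, revalidate):
        self.origin, self.revalidate = origin, revalidate
        self.entry = None                      # (etag, body) once something is stored

    def get(self):
        if self.entry and not self.revalidate:
            return self.entry[1]               # serve from cache without asking the origin
        etag = self.entry[0] if self.entry else None
        status, new_etag, body = self.origin.get(if_none_match=etag)
        if status == 200:
            self.entry = (new_etag, body)      # store or replace the cached copy
        return self.entry[1]


def demo(revalidate):
    origin = Origin("old patch", 'W/"v1"')
    cache = Cache(origin, revalidate)
    cache.get()                                         # first download fills the cache
    origin.body, origin.etag = "new patch", 'W/"v2"'    # re-upload under the same name
    return cache.get(), origin.requests


print(demo(revalidate=True))    # body returned and number of requests seen by the origin
print(demo(revalidate=False))
```

With revalidation the second download reaches the origin and returns the new patch; without it the origin sees only the first request and the old patch is served.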

A workaround is to set the Pragma: no-cache header in the client request (i.e. use wget --no-cache), but it's easy to forget that.
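For scripts that fetch patches without wget, the same request headers can be set by hand. This is a minimal sketch using Python's standard urllib; the helper name no_cache_request is made up here, and sending Cache-Control alongside Pragma is my addition to cover HTTP/1.1 caches.

```python
import urllib.request


def no_cache_request(url):
    """Build a GET request asking intermediaries not to answer from their cache."""
    return urllib.request.Request(url, headers={
        "Pragma": "no-cache",          # HTTP/1.0 request directive
        "Cache-Control": "no-cache",   # HTTP/1.1 request directive
    })


req = no_cache_request(
    "http://trac.sagemath.org/sage_trac/raw-attachment/ticket/11115/"
    "trac11115-cached_cython.patch"
)
print(req.header_items())
# To actually download: data = urllib.request.urlopen(req).read()
```

Like the wget flag, this only helps if every client remembers to do it and the proxy honours the request directive.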

Irrespective of who is precisely at fault, we should configure the trac server to never allow caching of the patches since their integrity is crucial for us and client-side caching doesn't really buy us much. For that, I propose to configure Apache to add the following to the headers for all resources under /sage_trac/raw-attachment:

Cache-Control: no-cache
Expires: Thu, 1 Jan 1970 00:00:00 GMT

hitting both the HTTP/1.0 and 1.1 cache control mechanisms.
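Whether a given response carries such headers can be checked mechanically. The function below is a simplified sketch, not a full implementation of the HTTP caching rules: the name forces_revalidation is hypothetical, and it only looks at the no-cache and no-store directives and at an Expires value that is not later than Date.

```python
from email.utils import parsedate_to_datetime


def forces_revalidation(headers):
    """Return True if the headers tell caches not to reuse the body unchecked.

    Simplified: only `Cache-Control: no-cache`/`no-store` and an `Expires`
    value that is not later than `Date` are considered.
    """
    h = {k.lower(): v for k, v in headers.items()}
    directives = {d.strip().lower() for d in h.get("cache-control", "").split(",")}
    if directives & {"no-cache", "no-store"}:
        return True
    if "expires" in h and "date" in h:
        return parsedate_to_datetime(h["expires"]) <= parsedate_to_datetime(h["date"])
    return False


# Headers as proposed above.
proposed = {
    "Cache-Control": "no-cache",
    "Expires": "Thu, 1 Jan 1970 00:00:00 GMT",
}
# Caching-related headers of the raw attachment as currently served.
observed = {
    "Date": "Sun, 18 Sep 2011 12:39:42 GMT",
    "ETag": 'W/"anonymous/Sat, 17 Sep 2011 21:06:12 GMT/False"',
    "Last-Modified": "Sat, 17 Sep 2011 21:06:12 GMT",
}
print(forces_revalidation(proposed), forces_revalidation(observed))
```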

See also upstream bug http://trac.edgewall.org/ticket/6367

Change History (22)

comment:1 Changed 10 years ago by vbraun

I just looked at the trac server logs, and it never receives a second request for anything under /sage_trac/raw-attachment from our squid transparent proxy. So the culprit is definitely squid having a problem with ETag, even though the docs say that it should work.

The caching works correctly for other resources. For example, if I request http://trac.sagemath.org/sage_trac/chrome/common/download.png from behind the transparent proxy, it does send a request up to the trac server, is answered with an HTTP 304, and serves the up-to-date cached version. The only difference I can see is that trac does not include an ETag header in that case, so it is a different code path in squid.

vbraun@boxen:~$ wget -O- -S trac.sagemath.org/sage_trac/chrome/common/download.png | md5sum
--08:21:23--  http://trac.sagemath.org/sage_trac/chrome/common/download.png
           => `-'
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 Ok
  Date: Sun, 18 Sep 2011 15:21:23 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  Content-Length: 164
  Last-Modified: Fri, 23 Oct 2009 03:56:53 GMT
  Keep-Alive: timeout=15, max=1000
  Connection: Keep-Alive
  Content-Type: image/png
Length: 164 [image/png]

100%[=============================================================================>] 164           --.--K/s             

08:21:24 (10.92 MB/s) - `-' saved [164/164]

750ae6e7739e23934867fb5fe9ee0bee  -

comment:2 follow-ups: Changed 10 years ago by leif

Replying to vbraun:

  4. Download the patch again - under some circumstances the old version of the patch is served by the (not so) transparent proxy.


This just happened to me with trac11115-cached_cython.patch.

Well, if you name it such...

SCNR


A workaround is to set the Pragma: no-cache header in the client request (i.e. use wget --no-cache), but it's easy to forget that.

You can put it into ~/.wgetrc.


Irrespective of who is precisely at fault, we should configure the trac server to never allow caching of the patches since their integrity is crucial for us and client-side caching doesn't really buy us much. For that, I propose to configure Apache to add the following to the headers for all resources under /sage_trac/raw-attachment:

Cache-Control: no-cache
Expires: Thu, 1 Jan 1970 00:00:00 GMT

hitting both the HTTP/1.0 and 1.1 cache control mechanisms.

Hmmm, I cannot tell how large the impact on the machine running trac will be, but in general I think that's a bad idea.

I don't think it will be large at the moment for ordinary users, as most patches are [hopefully] small, but note that this also inhibits the use of -N (--timestamping), which is more [or also] crucial to the client side, especially for bots (or if you have a very slow, or expensive / volume-taxed connection).

comment:3 in reply to: ↑ 2 Changed 10 years ago by leif

Replying to leif:

[...] this also inhibits the use of -N (--timestamping) [...]

Thinking more about it, I'm not so sure whether this would be a problem with wget (provided Last-modified is kept as is), but I rather doubt that other programs will be as "smart", i.e., first checking the modification time if a file is not to be cached anyway.

comment:4 follow-up: Changed 10 years ago by vbraun

  • Cc robertwb added

I see your point about bots, though afaik that would only affect the sage patch buildbot. I'll cc Robert, maybe he can tell us how his bot checks freshness.

Ordinary users don't download patches from trac on a regular basis, that's why I don't think there is any bandwidth issue. The problem with the ~/.wgetrc bandaid is that you often don't know whether you are behind a transparent proxy. Sage developers regularly work away from their home institutions / in coffee shops, and who knows what is being cached.

comment:5 in reply to: ↑ 4 ; follow-up: Changed 10 years ago by leif

Replying to vbraun:

I see your point about bots, though afaik that would only affect the sage patch buildbot. I'll cc Robert, maybe he can tell us how his bot checks freshness.

Robert [also] stores md5sums (or some hash code) of patches, so I assume he doesn't specifically use HTTP header information, and his bot does scan the ticket's comments, in contrast to my release tool and AFAIK also Jeroen's. (The advantage is that you don't have to fetch and parse the -- potentially large or long -- HTML version of a ticket at all; that's btw. another reason why we want the relevant files, spkgs or patches to be applied, to be referenced in the ticket's description.)


Ordinary users don't download patches from trac on a regular basis, that's why I don't think there is any bandwidth issue. The problem with the ~/.wgetrc bandaid is that you often don't know whether you are behind a transparent proxy.

Well, humans are more likely to read the comments on a ticket, so they actually see that a patch was re-uploaded / modified (though they perhaps don't look at the file modification times of the downloaded files, which one IMHO should do).

But the purpose of ~/.wgetrc in this case would be to always disable caching (by default), such that it wouldn't matter whether you're behind a proxy or not (provided the proxy isn't broken and doesn't refuse to bypass caching).

comment:6 Changed 10 years ago by leif

P.S.: The best way to bypass proxies or disable caching is to use HTTPS; unfortunately Sage's trac doesn't support it...

comment:7 in reply to: ↑ 5 ; follow-up: Changed 10 years ago by vbraun

  • Description modified (diff)

Replying to leif:

But the purpose of ~/.wgetrc in this case would be to always disable caching (by default), such that it wouldn't matter whether you're behind a proxy or not (provided the proxy isn't broken and doesn't refuse to bypass caching).

So you are suggesting that every Sage developer puts a particular entry in ~/.wgetrc on all of his laptops, just to be safe if he ever leaves his house with it, while we could just work around it in a few lines of the Apache httpd.conf.

Well, humans are more likely to read the comments on a ticket, so they actually see that a patch was re-uploaded / modified (though they perhaps don't look at the file modification times of the downloaded files, which one IMHO should do).

The html version does not get erroneously cached, the bug manifests only with the raw attachment. Trac dishes out the html version with Cache-control: must-revalidate:

vbraun@boxen:~$ wget -O- -S http://trac.sagemath.org/sage_trac/attachment/ticket/11115/trac11115-cached_cython.patch | md5sum
--10:49:49--  http://trac.sagemath.org/sage_trac/attachment/ticket/11115/trac11115-cached_cython.patch
           => `-'
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 Ok
  Date: Sun, 18 Sep 2011 17:49:49 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  ETag: W/"anonymous/Sat, 17 Sep 2011 21:06:12 GMT/False"
  Cache-control: must-revalidate
  Set-Cookie: trac_form_token=6ee2168bc6a1bd4e46d5ac03; Path=/sage_trac
  Set-Cookie: trac_session=36b6b4eaaf22a880e1451a6a; expires=Sat, 17-Dec-2011 17:49:53 GMT; Path=/sage_trac
  Content-Length: 750922
  Vary: Accept-Encoding
  Keep-Alive: timeout=15, max=1000
  Connection: Keep-Alive
  Content-Type: text/html;charset=utf-8
Length: 750,922 (733K) [text/html]

100%[===========================================================================================================================>] 750,922       --.--K/s             

10:49:53 (181.60 MB/s) - `-' saved [750922/750922]

0ee8396915e5be21797f03b88cacd53c  -

Though the Vary: Accept-Encoding header is very wrong. Looking at the trac trac (:-), this seems to be a known bug: http://trac.edgewall.org/ticket/6367. That ticket says: "Also note that Request.send_file() function does not send a Cache-Control header. That should be OK if Vary * is sent". This seems to be the issue: raw attachments have neither a Cache-control nor a Vary: * header. And I don't check manually that a downloaded file has the right time stamp; I have a computer to do menial tasks for me, not the other way round :-)

comment:8 in reply to: ↑ 7 Changed 10 years ago by leif

Replying to vbraun:

Replying to leif:

But the purpose of ~/.wgetrc in this case would be to always disable caching (by default), such that it wouldn't matter whether you're behind a proxy or not (provided the proxy isn't broken and doesn't refuse to bypass caching).

So you are suggesting that every Sage developer puts a particular entry in ~/.wgetrc on all of his laptops, just to be safe if he ever leaves his house with it.

You're right, that's of course too much to demand, compared to the minimal effort it takes to continually install a recent version of Sage with all necessary prerequisite patches etc. on every computer one might take with oneself.


Well, humans are more likely to read the comments on a ticket, so they actually see that a patch was re-uploaded / modified (though they perhaps don't look at the file modification times of the downloaded files, which one IMHO should do).

The html version does not get erroneously cached, the bug manifests only with the raw attachment.

So it is safe to read the comments; I didn't say anything else. Bots usually don't do that, which was my point.


And I don't check manually that a downloaded file has the right time stamp; I have a computer to do menial tasks for me, not the other way round :-)

Well, you should notice when a patch that you know was recently re-uploaded on trac suddenly has a modification time of months ago, and in case you have any doubt, you can easily view the HTML version, which also shows the current changeset header, and compare it to what you actually downloaded or have on your disk.


Hopefully this will be fixed by Edgewall, although -- despite the discussion on sage-devel -- we still have the more than two years old 0.11.5 version.

comment:9 Changed 10 years ago by vbraun

Well, you should notice when a patch that you know was recently re-uploaded on trac suddenly has a modification time of months ago, and in case you have any doubt, you can easily view the HTML version, which also shows the current changeset header, and compare it to what you actually downloaded or have on your disk.

Not necessarily. I often read the patch on trac (looking at the html version), then download it with hg qimport .../raw-attachment/x.patch, and then test it. Unless there are any test failures, it's easy to miss that mercurial got a stale copy.

Since the upstream fix will add a Cache-control: must-revalidate and/or Vary: *, we don't lose anything by having Apache add those headers right now.

comment:10 Changed 10 years ago by vbraun

I looked at the squid logs today and it confirms what I said: Squid considers the .../raw-attachment/... resources to be always fresh and returns the cached values without revalidating them.

comment:11 Changed 10 years ago by vbraun

I added the following .htaccess file to my home directory on boxen

Header set ETag "W/\"anonymous/Mon, 19 Sep 2011 12:41:02 GMT/False\""
Header set Content-Type "text/x-diff; charset=iso-8859-15"
Header set Content-Disposition "attachment"
Header unset Accept-Ranges
AddType "text/x-diff; charset=iso-8859-15" .patch

now the same headers as from sage_trac/raw-attachment are served when trying to download a file with wget. Still, I can't reproduce the stale cache. In fact, squid now refuses to cache files for more than a few seconds. It seems like squid adds some funky request headers when talking to trac which I don't know about.

Somebody needs to capture the tcp stream between the proxy and trac to make further progress.

comment:12 in reply to: ↑ 2 Changed 10 years ago by robertwb

Replying to leif:

Replying to vbraun:

Irrespective of who is precisely at fault, we should configure the trac server to never allow caching of the patches since their integrity is crucial for us and client-side caching doesn't really buy us much. For that, I propose to configure Apache to add the following to the headers for all resources under /sage_trac/raw-attachment:

Cache-Control: no-cache
Expires: Thu, 1 Jan 1970 00:00:00 GMT

hitting both the HTTP/1.0 and 1.1 cache control mechanisms.

Hmmm, I cannot tell how large the impact on the machine running trac will be, but in general I think that's a bad idea.

I don't think it will be large at the moment for ordinary users, as most patches are [hopefully] small, but note that this also inhibits the use of -N (--timestamping), which is more [or also] crucial to the client side, especially for bots (or if you have a very slow, or expensive / volume-taxed connection).

This is the correct solution--it's what you do when you have a URI whose content may change. I don't think it'll be a severe issue for developers to hit trac itself rather than transparent proxies and it's a worthy price to pay for always getting the correct version. The raw patch is certainly cheaper on the server and connection than the html version.

Currently, there's only one patchbot and it's running locally, so no transparent proxy that I'm aware of :). In general, it only downloads patches when a new one was added (it parses the main ticket page (rss feed)). Patches are identified with md5 hashes, so although it could get confused by caches if they were in the way, when it gets a new version of a patch it doesn't get confused.
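Identifying a patch by its hash can be sketched in a few lines. This is not the patchbot's actual code, just an illustration with Python's hashlib; the helper names are made up, and MD5 is used only because the ticket quotes md5sum output.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the downloaded bytes, comparable to md5sum output."""
    return hashlib.md5(data).hexdigest()


def is_expected_version(data: bytes, expected_md5: str) -> bool:
    """True if the downloaded bytes hash to the digest recorded for the patch."""
    return md5_hex(data) == expected_md5.strip().lower()


# Placeholder content; a real caller would pass the bytes it downloaded.
sample = b"hello"
print(md5_hex(sample))
print(is_expected_version(sample, md5_hex(sample)))
```

Applied to this ticket, a download of trac11115-cached_cython.patch would be compared against 0dc42d7f8d3ae270eb65927ed942ad24, the digest of the current patch; the stale copy hashes to f88ca8ad9090aeacb6dc0c726dcc76b5 and would be rejected.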

comment:13 follow-up: Changed 10 years ago by vbraun

I propose to add the following to the Apache httpd.conf (untested):

<DirectoryMatch ".*/raw-attachment/.*">
  Header set Cache-control "must-revalidate"
  Header set Vary "*"
</DirectoryMatch>

This should work then for all trac instances running on Sage.

Also, this shouldn't affect wget --timestamping since it only instructs the transparent proxy to always check back with the server, but not necessarily to download anything except the header.
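The "check back without downloading the body" step can be illustrated with a toy model of a conditional GET. This is a sketch under the assumption that the client or proxy sends If-Modified-Since; the function conditional_get is hypothetical, and the exact requests wget issues for --timestamping may differ.

```python
from email.utils import parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return (status, sends_body) for a GET of a resource with the given Last-Modified."""
    if if_modified_since is not None:
        unchanged = (parsedate_to_datetime(last_modified)
                     <= parsedate_to_datetime(if_modified_since))
        if unchanged:
            return 304, False          # headers only, no body transferred
    return 200, True                   # resource changed (or unconditional): send the body


current = "Sat, 17 Sep 2011 21:06:12 GMT"   # Last-Modified of the re-uploaded patch
stale = "Thu, 26 May 2011 07:16:22 GMT"     # Last-Modified of the cached copy

print(conditional_get(current, if_modified_since=current))   # copy is up to date
print(conditional_get(current, if_modified_since=stale))     # copy is outdated
print(conditional_get(current))                              # plain GET
```

Only an outdated or missing copy costs a full transfer, which is why forcing revalidation should add little load compared with disabling caching altogether.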

comment:14 in reply to: ↑ 13 Changed 10 years ago by leif

Replying to vbraun:

I propose to add the following to the Apache httpd.conf (untested):

<DirectoryMatch ".*/raw-attachment/.*">
  Header set Cache-control "must-revalidate"
  Header set Vary "*"
</DirectoryMatch>

This should work then for all trac instances running on Sage.

Also, this shouldn't affect wget --timestamping since it only instructs the transparent proxy to always check back with the server, but not necessarily to download anything except the header.

Then anybody with the appropriate permissions (ticket owners?!) go ahead!

(Or at least give it a try.)

comment:15 Changed 8 years ago by vbraun

  • Authors set to Harald Schilly
  • Status changed from new to needs_review

Harald has made the change (with LocationMatch instead of DirectoryMatch) and it did add the headers in question:

vbraun@boxen:~$ wget -S -O /dev/null http://trac.sagemath.org/sage_trac/raw-attachment/ticket/14319/trac_14319-empty.patch                                                                                               
--06:47:04--  http://trac.sagemath.org/sage_trac/raw-attachment/ticket/14319/trac_14319-empty.patch
           => `/dev/null'
Resolving trac.sagemath.org... 128.208.160.197
Connecting to trac.sagemath.org|128.208.160.197|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 Ok
  Date: Thu, 18 Apr 2013 13:47:04 GMT
  Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.5.1 mod_python/3.3.1 Python/2.5.2 PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch mod_ssl/2.2.8 OpenSSL/0.9.8g mod_wsgi/2.0
  ETag: W/"anonymous/Thu, 18 Apr 2013 08:47:28 GMT/False"
  Content-Disposition: attachment
  Set-Cookie: trac_session=632bd240e900edb0698c9b97; expires=Wed, 17-Jul-2013 13:47:04 GMT; Path=/sage_trac
  Content-Length: 9819
  Last-Modified: Thu, 18 Apr 2013 08:47:28 GMT
  Cache-control: must-revalidate
  Vary: *
  Keep-Alive: timeout=15, max=1000
  Connection: Keep-Alive
  Content-Type: text/x-diff; charset=iso-8859-15

comment:16 Changed 8 years ago by jdemeyer

  • Milestone changed from sage-5.11 to sage-5.12

comment:17 Changed 7 years ago by vbraun_spam

  • Milestone changed from sage-6.1 to sage-6.2

comment:18 Changed 7 years ago by vbraun_spam

  • Milestone changed from sage-6.2 to sage-6.3

comment:19 Changed 7 years ago by chapoton

Maybe this can be closed, now that we have switched to git ?

comment:20 Changed 7 years ago by mderickx

  • Milestone changed from sage-6.3 to sage-duplicate/invalid/wontfix
  • Status changed from needs_review to positive_review

I agree, gave it positive review so Volker can close it.

comment:21 Changed 7 years ago by leif

Well, while the switch to git alleviates the situation, it doesn't mean we no longer attach files to trac tickets. Anyway, those are perhaps rarely updated in place.

comment:22 Changed 7 years ago by vbraun

  • Resolution set to wontfix
  • Status changed from positive_review to closed