Bug 818 – Fix problems fetching DTDs from w3.org: increase HTTP timeout, remove 1 DTD


Summary:	Fix problems fetching DTDs from w3.org: increase HTTP timeout, remove 1 DTD

Status:	RESOLVED FIXED

Product:	Validator.nu
Classification:	Unclassified
Component:	General
Version:	HEAD
Hardware:	All All

Importance:	P2 major
Assigned To:	Marcin Cieślak


Reported:	2011-02-12 23:43 CET by Marcin Cieślak
Modified:	2011-05-17 11:43 CEST (History)

Description Marcin Cieślak 2011-02-12 23:43:59 CET

I tried the hg tip today, and www.w3.org seems to have trouble serving
some pages. I modified build/build.py to use urllib2 instead of urllib
and increased the HTTP timeout. The server also sometimes returns an
empty response, so I had to catch the BadStatusLine exception.

A trivial patch is available here:

https://bitbucket.org/saper/validator-build/changeset/b893eb8c0260
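
For readers without access to the changeset, the approach described above (a longer timeout plus a retry on BadStatusLine) can be sketched roughly as follows. This is not the landed patch: it is written for Python 3, where urllib2 became urllib.request and httplib became http.client, and the names fetch_with_retry, max_attempts and opener are hypothetical, added only so the sketch can run without network access.

```python
import http.client
import urllib.request


def fetch_with_retry(url, timeout_seconds, max_attempts=5,
                     opener=urllib.request.urlopen):
    """Fetch url and return the body as bytes.

    Retries when the server answers with an empty or malformed status
    line (http.client.BadStatusLine). max_attempts and opener are
    hypothetical parameters, not part of build.py.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = opener(url, timeout=timeout_seconds)
            try:
                return response.read()
            finally:
                # Close the response even if read() fails.
                response.close()
        except http.client.BadStatusLine as error:
            last_error = error
            print("received error, retrying (attempt %d of %d)"
                  % (attempt, max_attempts))
    # Every attempt failed; surface the last error instead of looping forever.
    raise last_error
```

Unlike the loop in build.py, this version gives up after a fixed number of attempts, so a server that keeps failing cannot hang the build.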

Another issue was that we get 404 fetching

http://www.w3.org/TR/xhtml1/DTD/xhtml11.dtd

(note the single "1" after the first "xhtml").

There never seems to have been anything at this URL (even web.archive.org knows nothing about it).

I have removed this, but I wonder what should be there instead?

Maybe it was supposed to be this?

http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd

A patch is here:

https://bitbucket.org/saper/validator-validator/changeset/55e83b3149fb

After those changes, the validator built successfully on my FreeBSD machine.

--saper

Comment 1 Michael[tm] Smith 2011-04-20 17:07:48 CEST

https://bitbucket.org/validator/build/changeset/b893eb8c0260
https://bitbucket.org/validator/validator/changeset/05607e55502f

Comment 2 Michael[tm] Smith 2011-04-20 17:09:26 CEST

Marcin,

Thanks for the build patch. It's now landed in the upstream repo, along with a fix for the bogus DTD URL.

Sorry for having taken so long to get around to landing the patch.

Comment 3 Michael[tm] Smith 2011-05-17 11:43:18 CEST

Marcin,

I now notice that your patch depends on the 'timeout' argument to urllib2.urlopen(...), and that argument seems to be new in Python 2.6. So it won't work in a Python 2.5 environment.

So if you have time, please take a look at the following refinement and let me know if you see any problems with it.

diff -r b893eb8c0260 build.py
--- a/build.py	Sat Feb 12 20:51:26 2011 +0000
+++ b/build.py	Tue May 17 17:35:45 2011 +0900
@@ -25,6 +25,7 @@
 import shutil
 import httplib
 import urllib2
+import socket
 import re
 try:
   from hashlib import md5
@@ -643,14 +644,18 @@
   # I bet there's a way to do this with more efficient IO and less memory
   print url
   completed = False
+  defaultTimeout = socket.getdefaulttimeout()
   while not completed:
    try:
-    f = urllib2.urlopen(url, timeout=httpTimeoutSeconds)
+    socket.setdefaulttimeout(httpTimeoutSeconds)
+    f = urllib2.urlopen(url)
     data = f.read()
     f.close()
     completed = True
    except httplib.BadStatusLine, e:
     print "received error, retrying"
+   finally:
+    socket.setdefaulttimeout(defaultTimeout)
   if md5sum:
     m = md5(data)
     if md5sum != m.hexdigest():
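
The save/set/restore pattern in the diff above can also be packaged as a context manager, which keeps the restore step next to the set step. The following is only a sketch of that idea, written for Python 3; the name default_socket_timeout is hypothetical and does not appear in build.py.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily change the process-wide default socket timeout.

    Hypothetical helper: it mirrors the getdefaulttimeout() /
    setdefaulttimeout() pair used in the diff above.
    """
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore the earlier value even if the body raised.
        socket.setdefaulttimeout(previous)
```

It would be used as "with default_socket_timeout(httpTimeoutSeconds):" around the urlopen call. Note that socket.setdefaulttimeout() affects every socket created in the process while it is in effect, including sockets created by other threads, so this approach is best suited to a single-threaded script such as a build step.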