[wp-docs] [Fwd: Re: Semantic HTML Tutorial]

Matthew Thomas mpt at myrealbox.com
Mon Apr 26 17:52:07 CDT 2004


On 27 Apr 2004, at 1:01 AM, Scott Merrill wrote:
> ...
>> Obviously

Beware of people saying "obviously". ;-)

>> HTML has grown far beyond anything Tim Berners-Lee ever envisioned, 
>> so the effort is being put forth to preserve as much backward 
>> compatibility as possible; because let's face it, HTML got a lot 
>> "right" and there's value to preserving that.
>
> I don't agree at all.  I don't even think the HTML concept makes 
> things easier for beginners.  Native XML parsing by browsers is the 
> answer now, and should have been 10 years ago.

If the Web had been using specialized varieties of XML for documents 
ten years ago, rather than HTML, the following would not have been 
possible:
*   Google Search (relies on HTML's A element).
*   Google Images (relies on HTML's IMG element).
*   Google Glossary (relies on HTML's DL, DT, DD, DFN, and B
     elements).
*   Google Sets (relies on HTML's UL, OL, and LI elements).

All of these are part of what I called the Serendipitous Web. 
<http://mpt.phrasewise.com/2003/01/27#a446> People were using A HREF 
and IMG and DL and DT and DD and B and UL and OL and LI mainly to 
achieve a particular *appearance* on their pages, without much thought 
as to other benefits. But Google could derive meaning from them, 
because we're all speaking HTML (which has well-known semantics) rather 
than lots of different XML DTDs.
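
To make that concrete, here is a minimal sketch (in Python, using only 
the standard library's html.parser) of how one generic tool can pull 
links and images out of any page, simply because every page uses the 
same A and IMG elements. The class and function names are made up for 
illustration, and this is of course nothing like Google's real crawler.

```python
from html.parser import HTMLParser


class LinkAndImageCollector(HTMLParser):
    """Hypothetical helper: gathers A HREF targets and IMG sources."""

    def __init__(self):
        super().__init__()
        self.links = []   # href values from <a> elements
        self.images = []  # (src, alt) pairs from <img> elements

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so
        # <A HREF=...> and <a href=...> are handled alike.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append((attrs["src"], attrs.get("alt", "")))


def collect(html):
    """Return (links, images) found in an HTML string."""
    collector = LinkAndImageCollector()
    collector.feed(html)
    collector.close()
    return collector.links, collector.images
```

With a separate XML vocabulary per site, a tool like this would need to 
know in advance which element means "link" in each one.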

> ...
> The "semantics" argument, I believe, is a confusion between the data 
> layer and the presentation layer.  The fact that even with XSLT it's 
> nearly impossible to transform data into HTML presentation is the 
> problem that has plagued HTML from the beginning.  XHTML does nothing 
> to help this problem, except make people feel that they are being 
> "more rigorous."

I think this is fighting against human nature. Most people are unable 
to understand the difference between a data layer and a presentation 
layer. Best to make languages (like XHTML) that allow them to *think* 
they're doing just presentational stuff, while they're unintentionally 
creating as much semantic data as we can wangle out of them.

As Sergey Brin famously said 
<http://weblog.infoworld.com/udell/2002/09/19.html#a415>: "Look, 
putting angle brackets around everything is not a technology, by 
itself. I'd rather make progress by having computers understand what 
humans write, than to force humans to write in ways computers can 
understand."

> ...
> I love how many people have taken an awkward, non-extensible markup 
> language then are doctrinaire about its use.  This is good advice, but 
> I'm constantly baffled why so many have replaced "<b>" with the far 
> more expensive "<strong>", etc. Or "<em>" instead of "<i>".

People who just swap one for the other, or build tools that let people 
insert <strong> and <em> when they think they're using <b> and <i> (and 
I'm not referring to anyone in particular, am I, Matt? ;-) are being 
counter-productive.

B and I are important because, as I said, most people are unable to 
understand the difference between a data layer and a presentation 
layer. It's best to let such people keep using the vague B and I, 
rather than pushing them into making mistakes (e.g. using EM when they 
really mean CITE).

Real-world example. Google Glossary understands that B is often used to 
mean DT, so a B word/phrase followed by a non-B sentence is included in 
its database of definitions. But (as far as I can tell) it also knows 
that STRONG *isn't* used to mean DT, so STRONG word/phrases followed by 
non-STRONG sentences can be ignored. If too many people used STRONG 
instead of B, Google Glossary would face a tough choice between 
including STRONG (and ending up with lots of non-definitions) and 
excluding it (and missing out on lots of definitions).
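
A rough sketch of that kind of heuristic, in Python with the standard 
html.parser module. To be clear, this is my guess at the general shape, 
not Google's actual algorithm, and every name in it is hypothetical: a 
B phrase followed directly by plain text becomes a candidate 
(term, definition) pair, while STRONG never starts one.

```python
from html.parser import HTMLParser


class BoldDefinitionFinder(HTMLParser):
    """Hypothetical heuristic: a <b> phrase followed by plain text."""

    def __init__(self):
        super().__init__()
        self.in_bold = False      # currently inside <b>...</b>?
        self.bold_parts = []      # text collected inside the current <b>
        self.pending_term = None  # a finished <b> phrase awaiting its text
        self.candidates = []      # (term, following text) pairs

    def handle_starttag(self, tag, attrs):
        # Only <b> starts a term; <strong> is deliberately not treated as one.
        if tag == "b":
            self.in_bold = True
            self.bold_parts = []

    def handle_endtag(self, tag):
        if tag == "b" and self.in_bold:
            self.in_bold = False
            self.pending_term = "".join(self.bold_parts).strip() or None
        elif not self.in_bold:
            # Closing any other element ends the "B then sentence" pattern.
            self.pending_term = None

    def handle_data(self, data):
        if self.in_bold:
            self.bold_parts.append(data)
        elif self.pending_term and data.strip():
            self.candidates.append((self.pending_term, data.strip()))
            self.pending_term = None


def find_candidates(html):
    """Return candidate (term, definition) pairs from an HTML string."""
    finder = BoldDefinitionFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

If authors switched wholesale from B to STRONG, the only way to keep 
finding their definitions would be to treat STRONG as a term marker 
too, which is exactly the tough choice described above.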

B and I are important in another way. This is the one thing I disagreed 
with in the tutorial:
|
| If you're after italicized text aside from emphasized text
| or citations use CSS (font-style:italic) rather than EM or
| I.
|
But if you do this, browsers that don't apply CSS (like Lynx, or NS4 or 
MSIE/Mac or iCab with style sheets turned off, or any browser on a page 
retrieved with wget) won't render the italics *even when they're 
important*. And there are quite a few important things you might want 
to use bold or italics for that don't have their own semantic elements.

For example, taxonomical names. <i class="taxonomy">Homo sapiens 
sapiens</i> is better than <style>.taxonomy {font-style: 
italic}</style> <span class="taxonomy">Homo sapiens sapiens</span>: 
they have the same amount of semantics (none at all), but the <i> 
version works in more browsers.
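
Here is a toy illustration in Python (standard library only). It is 
not how Lynx or any real browser works; it just assumes a renderer that 
ignores style sheets entirely and marks italic-type elements with 
underscores. The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class NoCssTextRenderer(HTMLParser):
    """Toy renderer that ignores all CSS and marks I/EM/CITE with '_'."""

    ITALIC_TAGS = ("i", "em", "cite")

    def __init__(self):
        super().__init__()
        self.out = []
        self.in_style = False  # inside a <style> element?

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
        elif tag in self.ITALIC_TAGS:
            self.out.append("_")
        # class attributes are never looked at: no CSS is applied.

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif tag in self.ITALIC_TAGS:
            self.out.append("_")

    def handle_data(self, data):
        # Style sheet text is dropped rather than shown or applied.
        if not self.in_style:
            self.out.append(data)


def render(html):
    """Render an HTML string to plain text without applying any CSS."""
    renderer = NoCssTextRenderer()
    renderer.feed(html)
    renderer.close()
    return "".join(renderer.out).strip()
```

Feeding it the two taxonomy snippets above, the <i> version comes out 
with its underscores and the <span> version comes out as bare text, 
because the only thing that made the span italic was the style sheet.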

> I understand why, "semantically," but all this serves to try to make 
> HTML XML, implying an exactitude which it cannot and will not ever 
> have.

Exactitude need not be the goal. Every page that uses semantic markup 
when appropriate, and presentational markup the rest of the time, makes 
aggregating tools (like Google) work better. That's a worthy goal by 
itself.

-- 
Matthew Thomas
http://mpt.net.nz/



