[wp-docs] [Fwd: Re: Semantic HTML Tutorial]
Matthew Thomas
mpt at myrealbox.com
Mon Apr 26 17:52:07 CDT 2004
On 27 Apr 2004, at 1:01 AM, Scott Merrill wrote:
> ...
>> Obviously
Beware of people saying "obviously". ;-)
>> HTML has grown far beyond anything Tim Berners-Lee ever envisioned,
>> so the effort is being put forth to preserve as much backward
>> compatibility as possible; because let's face it, HTML got a lot
>> "right" and there's value to preserving that.
>
> I don't agree at all. I don't even think the HTML concept makes
> things easier for beginners. Native XML parsing by browsers is the
> answer now, and should have been 10 years ago.
If the Web had been using specialized varieties of XML for documents
ten years ago, rather than HTML, the following would not have been
possible.
* Google Search (relies on HTML's A element).
* Google Images (relies on HTML's IMG element).
* Google Glossary (relies on HTML's DL, DT, DD, DFN, and B
  elements).
* Google Sets (relies on HTML's UL, OL, and LI elements).
All of these are part of what I called the Serendipitous Web
<http://mpt.phrasewise.com/2003/01/27#a446>. People were using A HREF
and IMG and DL and DT and DD and B and UL and OL and LI mainly to
achieve a particular *appearance* on their pages, without much thought
as to other benefits. But Google could derive meaning from them,
because we're all speaking HTML (which has well-known semantics) rather
than lots of different XML DTDs.
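
To make the point concrete, here is a minimal Python sketch of how a tool can pull links, images, and definition pairs out of any page simply because everyone uses the same elements. It is an illustration only, not a description of how Google works; the class name, the function name, and the sample markup in the usage note are all hypothetical, and the sketch assumes DT and DD have explicit closing tags.

```python
from html.parser import HTMLParser


class SerendipitousParser(HTMLParser):
    """Collects links, images, and DT/DD pairs from ordinary HTML."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values from A elements
        self.images = []       # src values from IMG elements
        self.definitions = []  # (term, definition) pairs from DT/DD
        self._current = None   # "dt" or "dd" while inside one, else None
        self._term = ""        # text of the most recent DT
        self._text = []        # text collected inside the current DT/DD

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag in ("dt", "dd"):
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace before storing the text.
        text = " ".join("".join(self._text).split())
        if tag == "dt":
            self._term = text
        elif self._term:
            self.definitions.append((self._term, text))
        self._current = None


def extract(html):
    """Return (links, images, definitions) found in an HTML string."""
    parser = SerendipitousParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images, parser.definitions
```

Calling extract() on a made-up fragment such as '<a href="http://example.org/">x</a><dl><dt>HTML</dt><dd>A markup language</dd></dl>' returns the link and the ("HTML", "A markup language") pair without knowing anything about the page's author. With a different XML DTD per site, the parser would need a different set of rules for each one.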
> ...
> The "semantics" argument, I believe, is a confusion between the data
> layer and the presentation layer. The fact that even with XSLT it's
> nearly impossible to transform data into HTML presentation is the
> problem that has plagued HTML from the beginning. XHTML does nothing
> to help this problem, except make people feel that they are being
> "more rigorous."
I think this is fighting against human nature. Most people are unable
to understand the difference between a data layer and a presentation
layer. Best to make languages (like XHTML) that allow them to *think*
they're doing just presentational stuff, while they're unintentionally
creating as much semantic data as we can wangle out of them.
As Sergey Brin famously said
<http://weblog.infoworld.com/udell/2002/09/19.html#a415>: "Look,
putting angle brackets around everything is not a technology, by
itself. I'd rather make progress by having computers understand what
humans write, than to force humans to write in ways computers can
understand."
> ...
> I love how many people have taken an awkward, non-extensible markup
> language and then are doctrinaire about its use. This is good advice,
> but I'm constantly baffled why so many have replaced "<b>" with the
> far more expensive "<strong>", etc. Or "<em>" instead of "<i>".
People who just swap one for the other, or build tools that let people
insert <strong> and <em> when they think they're using <b> and <i> (and
I'm not referring to anyone in particular, am I, Matt? ;-) are being
counter-productive.
B and I are important because, as I said, most people are unable to
understand the difference between a data layer and a presentation
layer. It's best to let such people keep using the vague B and I,
rather than pushing them into making mistakes (e.g. using EM when they
really mean CITE).
Real-world example. Google Glossary understands that B is often used to
mean DT, so a B word/phrase followed by a non-B sentence is included in
its database of definitions. But (as far as I can tell) it also knows
that STRONG *isn't* used to mean DT, so STRONG word/phrases followed by
non-STRONG sentences can be ignored. If too many people used STRONG
instead of B, Google Glossary would have a tough decision on including
STRONG (and ending up with lots of non-definitions) or excluding it
(and missing out on lots of definitions).
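
The heuristic described above can be sketched in Python as follows. This is only my guess at the general shape of such a rule, not Google Glossary's actual implementation; the names BoldTermParser and candidate_definitions, the choice of block elements, and the sample sentences in the usage note are all made up for illustration.

```python
from html.parser import HTMLParser


class BoldTermParser(HTMLParser):
    """Pairs each term-tag phrase with the plain text that follows it."""

    BLOCKS = {"p", "li", "div"}  # assumed block boundaries for this sketch

    def __init__(self, term_tags=("b",)):
        super().__init__()
        self.term_tags = set(term_tags)
        self.candidates = []   # (term, following text) pairs
        self._in_term = False
        self._term = []        # text inside the current term tag
        self._rest = []        # text after the term, up to the block end

    def flush(self):
        """Store the pending pair, if both halves are non-empty."""
        term = " ".join("".join(self._term).split())
        rest = " ".join("".join(self._rest).split()).lstrip(":- ")
        if term and rest:
            self.candidates.append((term, rest))
        self._term, self._rest = [], []

    def handle_starttag(self, tag, attrs):
        if tag in self.term_tags:
            self.flush()
            self._in_term = True
        elif tag in self.BLOCKS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.term_tags:
            self._in_term = False
        elif tag in self.BLOCKS:
            self.flush()

    def handle_data(self, data):
        if self._in_term:
            self._term.append(data)
        elif self._term:
            self._rest.append(data)


def candidate_definitions(html, term_tags=("b",)):
    """Return candidate (term, definition) pairs from an HTML string."""
    parser = BoldTermParser(term_tags)
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up a pair left pending at end of input
    return parser.candidates
```

Given '<p><b>Wangle</b>: to obtain by scheming.</p><p><strong>Warning</strong>: do not feed the cat.</p>', the default call returns only the Wangle pair. Passing term_tags=("b", "strong") also returns ("Warning", "do not feed the cat."), which is exactly the kind of non-definition that would creep in if STRONG had to be treated like B.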
B and I are important in another way. This is the one thing I disagreed
with in the tutorial:
|
| If you're after italicized text aside from emphasized text
| or citations use CSS (font-style:italic) rather than EM or
| I.
|
But if you do this, browsers that don't apply CSS (like Lynx, or NS4 or
MSIE/Mac or iCab with style sheets turned off, or any browser on a page
retrieved with wget) won't render the italics *even when they're
important*. And there are quite a few important things you might want
to use bold or italics for that don't have their own semantic elements.
For example, taxonomical names. <i class="taxonomy">Homo sapiens
sapiens</i> is better than <style>.taxonomy {font-style:
italic}</style> <span class="taxonomy">Homo sapiens sapiens</span>:
they have the same amount of semantics (none at all), but the <i>
version works in more browsers.
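
A rough Python sketch of the difference, assuming a hypothetical text-mode renderer that ignores style sheets and marks I, EM, and CITE with asterisks. Real browsers such as Lynx have their own rendering rules; this only illustrates why the element survives when the CSS does not.

```python
from html.parser import HTMLParser


class NoCssRenderer(HTMLParser):
    """Renders text, marking italic elements and discarding STYLE content."""

    ITALIC = ("i", "em", "cite")

    def __init__(self):
        super().__init__()
        self.out = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True
        elif tag in self.ITALIC:
            self.out.append("*")

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False
        elif tag in self.ITALIC:
            self.out.append("*")

    def handle_data(self, data):
        # Style sheet text is never shown, and never applied either.
        if not self._in_style:
            self.out.append(data)


def render_without_css(html):
    """Return the text a CSS-less renderer of this kind would show."""
    renderer = NoCssRenderer()
    renderer.feed(html)
    renderer.close()
    return " ".join("".join(renderer.out).split())
```

With this sketch, the <i class="taxonomy"> version comes out with asterisks around "Homo sapiens sapiens", while the <style> plus <span> version comes out as plain text with no marking at all.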
> I understand why, "semantically," but all this serves to try to make
> HTML XML, implying an exactitude which it cannot and will not ever
> have.
Exactitude need not be the goal. Every page that uses semantic markup
when appropriate, and presentational markup the rest of the time, makes
aggregating tools (like Google) work better. That's a worthy goal by
itself.
--
Matthew Thomas
http://mpt.net.nz/