The Basics of HTML




Summary

In this article, you will learn the basics of HTML in order to gain some insight into the structure and content of an HTML document.

Introduction

This article summarises the purpose and structure of HTML in a very high-level fashion, including how elements work, and what character references are. The articles that follow will drill down into much more detail on specific parts of the HTML language.

What is HTML

Most desktop applications that read and write files use a special file format. For example, Microsoft Word understands “.doc” files and Microsoft Excel understands “.xls”. These files contain the instructions on how to rebuild the document the next time you open it, the contents of that document, and “metadata” about the document, such as the author, the date it was last modified, and even a list of changes made so that you can go back and forth between versions.

HTML (“HyperText Markup Language”) is a language for describing the contents of web documents. It uses a special syntax containing markers (called “tags”) which are wrapped around the text within the document to indicate how user agents (e.g. web browsers) should interpret that portion of the document.

A user agent is any software that is used to access web pages on behalf of users. There is an important distinction to be made here: all types of desktop browser software (Internet Explorer, Opera, Firefox, Safari, Chrome etc.) and alternative browsers for other devices (such as the Wii Internet Channel, and mobile phone browsers such as Opera Mini and Safari on the iPhone) are user agents, but not all user agents are browser software. The automated programs that Google and Yahoo! use to index the web for their search engines are also user agents, even though no human being controls them directly.

What HTML looks like

HTML is just a plain textual representation of content and its general meaning. For example:

<p id="example">This is a paragraph.</p>

The “<p>” part is a marker (which we refer to as a “tag”) that means “what follows should be treated as a paragraph”. Because it comes at the start of the content it affects, this particular tag is an “opening tag”. The “</p>” is a tag that indicates where the paragraph ends (we refer to this as a “closing tag”). The opening tag, the closing tag and everything in between are together called an “element”. The id="example" part is an attribute; you'll learn more about these later on. Many people use the terms “element” and “tag” interchangeably, but this is not strictly correct: a tag is only the marker, while an element is the whole unit.

In most browsers there is a “Source” or “View Source” option, commonly under the “View” menu. Try this now: go to your favorite website, choose this option, and spend some time looking at the HTML that makes up the structure of the page.

The structure of an HTML document

A typical HTML document looks like this:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Example page</title>
  </head>
  <body>
    <h1>Hello world</h1>
  </body>
</html>

When rendered in a web browser, this looks as follows:

[Figure: HTMLrender.png, the example page rendered in a web browser]

The document starts with a document type declaration, or doctype. Its main job is to make the browser render the HTML in what is called "standards mode", so that the page works as expected. It also lets validation software know what version of HTML to validate your code against. Don't worry too much about what this all means for now; we will come back to it later. What you can see here is the HTML5 doctype.

After this, you can see the opening tag of the html element. This is a wrapper around the entire document. The closing html tag is the last thing in any HTML document. The html element should always have a lang attribute. This specifies the primary language for the page. For example, en means "English", fr means "French". There are tools available to help you find the right language tag, such as Richard Ishida's Language Subtag Lookup tool.

Inside the html element, there is the head element. This is a wrapper to contain information about the document (the metadata). This is described in more detail in The HTML head element. Inside the head is the title element, which defines the “Example page” text shown in the browser's title bar or tab. Your head element should always contain a meta element with a charset attribute that identifies the character encoding of the page. (The one exception is when the page is encoded in UTF-16, but you should avoid that encoding anyway.) You should use UTF-8 whenever possible. Read more about character encodings.

After the head element there is a body element, which is the wrapper for the actual content of the page. In this case that is only a level-one heading (h1) element, which contains the text “Hello world”.

And that’s our document in full.

Elements often contain other elements. The body of the document will invariably end up involving many nested elements. Structural elements such as article, header and div create the overall structure of the document, and will contain subdivisions. These will contain headings, paragraphs, lists and so on. Paragraphs can contain elements that make links to other documents, quotes, emphasis and so on. You will find out more about these elements in later articles.
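As a rough sketch of this kind of nesting, the fragment below places a header and two paragraphs inside an article. The text and the link address are invented purely for illustration.

```html
<body>
  <!-- article: a structural element wrapping one self-contained piece of content -->
  <article>
    <header>
      <h1>Tea brewing notes</h1>
    </header>
    <!-- Paragraphs sit inside the article; emphasis and links sit inside paragraphs -->
    <p>Green tea is <em>easily</em> over-brewed.</p>
    <p>See the <a href="https://example.com/tea">full guide</a> for timings.</p>
  </article>
</body>
```

Reading from the outside in, each level of indentation corresponds to one level of nesting.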

The syntax of HTML elements

A basic element in HTML consists of an opening tag and a closing tag around a block of text, and in almost every case elements can contain sub-elements (such as html containing head and body in the example above). There are some exceptions to the rule: some elements, such as img, contain neither text nor sub-elements. You'll learn more about these later on.
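To make the difference concrete, here is a minimal sketch that puts an ordinary element next to an img element. The file name office.png and the wording are hypothetical.

```html
<!-- p has an opening tag, some content and a closing tag -->
<p>A photo of our office:</p>

<!-- img has no content and no closing tag; everything it needs is in its attributes -->
<img src="office.png" alt="The front of our office building">
```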

Elements can also have attributes, which can modify the behaviour of the element and introduce extra meaning. Let's have a look at another example.

<header>
  <h1>The Basics of 
    <abbr title="Hypertext Markup Language">HTML</abbr>
  </h1>
</header>

When rendered, this looks as follows:

[Figure: htmlrender2.png, the heading example rendered in a web browser]

In this example a header element (used to mark up header sections of documents) contains an h1 element (the first, or most important, level of heading), which in turn contains some text. Part of that text is wrapped in an abbr element (used to specify the expansion of abbreviations), which has a title attribute whose value is set to Hypertext Markup Language.

Many attributes in HTML are common to all elements, though some are specific to a given element or elements. They generally take the form keyword="value". The value is usually surrounded by single or double quotes. HTML5 does not require the quotes, except when the attribute value contains spaces (for example, when it consists of several words) or certain other characters such as quotes, =, < or >. In those cases you need quotes to make it clear that it is a single attribute value, and not several attributes. Even so, I recommend that you stick to quoting values for now, as it is good practice and can make the code easier to read. In addition, some HTML flavours you may work with in the future, such as XHTML 1.0, do require attribute values to be quoted, and quoting does no harm in flavours that don't.
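The quoting rules can be sketched with three short examples. The id value intro is made up for illustration; the abbr element is the one from the earlier example.

```html
<!-- Unquoted: allowed in HTML5 because the value contains no spaces -->
<p id=intro>Welcome.</p>

<!-- Quoted: needed here because the value contains spaces -->
<abbr title="Hypertext Markup Language">HTML</abbr>

<!-- Without quotes, the browser reads title=Hypertext followed by
     two separate, valueless attributes named markup and language -->
<abbr title=Hypertext Markup Language>HTML</abbr>
```

The last form is shown only as a mistake to avoid; quoting every value sidesteps the problem.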

Attributes and their possible values are mostly defined by the HTML specifications. You cannot make up your own attributes without making the HTML invalid, as this can confuse user agents and cause problems interpreting the web page correctly. The only real exceptions, as far as values are concerned, are the id and class attributes: their values are entirely under your control, as they are for defining custom meanings.
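As a small illustration, the fragment below uses id and class values chosen by the author. The names opening-hours and notice are invented for this sketch; any names that suit your document would do.

```html
<!-- id: a name you choose, which must be unique within the page -->
<p id="opening-hours">We are open from 9am to 5pm.</p>

<!-- class: a label you choose, which any number of elements can share -->
<p class="notice">Closed on public holidays.</p>
<p class="notice">Closed on 1 May.</p>
```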

An element within another element is referred to as being a “child” of that element. So in the above example, abbr is a child of the h1, which is itself a child of the header. Conversely, the header element would be referred to as a “parent” of the h1 element. This parent/child concept is important, as it forms the basis of CSS and is heavily used in JavaScript.

Block level and inline elements

There are two general categories of elements in HTML, which correspond to the types of content and structure those elements represent: block level elements and inline elements.

Block level means a higher level element, normally informing the structure of the document. It may help to think of block level elements being those that start on a new line, breaking away from what went before. Some common block level elements include paragraphs, list items, headings and tables.

Inline elements are those that are contained within block level structural elements and surround only small parts of the document’s content, not entire paragraphs and groupings of content. Inline elements do not cause a new line to appear in the document: they are the kind of elements that would appear in a paragraph of text. Some common inline elements include hypertext links, highlighted words or phrases and short quotations.
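The distinction can be sketched with a heading and a paragraph that contains two inline elements. The text and the contact.html address are hypothetical.

```html
<!-- h2 and p are block level: each starts on a new line -->
<h2>Opening hours</h2>
<p>
  <!-- em and a are inline: they flow within the line of text -->
  We are open <em>every</em> weekday. See the
  <a href="contact.html">contact page</a> for details.
</p>
```

In a browser's default styling, the heading and the paragraph each occupy their own block, while the emphasised word and the link stay within the running text.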

Note: HTML5 redefines the element categories in HTML: see Element content categories. While these definitions are more accurate and less ambiguous than the ones that went before, they are a lot more complicated to understand than "block" and "inline". We will therefore stick with "block" and "inline" throughout this course.

Character references

One last topic to cover is how to include special characters in an HTML document. In HTML the characters <, > and & are special. They start and end parts of the HTML document, rather than representing the characters less-than, greater-than and ampersand. For this reason, they must always be used in escaped form in content.

Other than for these characters, you should try to avoid using character references unless you are dealing with an invisible or ambiguous character. If you use the UTF-8 character encoding you can represent any character (other than the three mentioned above) without escaping.

One of the earliest mistakes a web author can make is to use an ampersand in a document and then have something unexpected appear. For example, writing “Imperial units measure weight in stones&pounds” could actually end up appearing as “…stones£s” in some browsers.

This is because the literal string “&pound;” is actually a character reference in HTML, in this case for the pound sign (£). A character reference is a way of including a character in a document that is difficult or impossible to enter using a keyboard, or that cannot be represented in a particular document encoding.

The ampersand (&) introduces the reference and the semicolon (;) ends it. However, many user agents are quite forgiving of HTML mistakes such as leaving out the semicolon, and treat “&pound” as a character reference. References can either be numbers (numeric references) or shorthand words (entity references).

An actual ampersand has to be entered into a document as "&amp;", which is the character entity reference, or as "&#38;" which is the numeric reference. Web Platform Docs includes a Table of Common HTML Entities for reference.
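Putting this together, here is a short sketch of escaped characters in content. The ampersand forms are the ones given above; &lt; and &gt; are not mentioned earlier in this article and are included here as the standard named references for the less-than and greater-than signs.

```html
<!-- Escaped ampersand: displays as "stones & pounds" -->
<p>Imperial units measure weight in stones &amp; pounds.</p>

<!-- The same ampersand written as a numeric reference -->
<p>Imperial units measure weight in stones &#38; pounds.</p>

<!-- Escaped angle brackets: displays as "Use the <p> tag for paragraphs." -->
<p>Use the &lt;p&gt; tag for paragraphs.</p>
```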

For more information about working with character escapes see Using character escapes in markup and CSS.


