Command-line usage

The basic form for invoking HTMLTab from the command line is:

htmltab [OPTIONS] [HTML_DOCUMENT]

The options are optional, but HTMLTab always needs an HTML document as input, either named as an argument or streamed from stdin.

Supplying an input HTML document

You can pass an HTML document to HTMLTab in three different ways:

  • A local file
  • A remote URL
  • A stream from stdin

Local file

Say you have a file named data.html in the current directory. You can pass that to HTMLTab with:

htmltab data.html

All the usual filepath shortcuts are supported: . for the current directory, .. for the parent directory, ~ for your home directory, ~rebecca for Rebecca’s home directory.

Remote URL

If you want HTMLTab to request a remote URL and parse the returned HTML, use a similar invocation:

htmltab https://www.example.com/data.html

HTMLTab supports http:// and https:// URLs. If a URL returns an HTTP 4xx or HTTP 5xx error — for example, HTTP 404 Not Found — HTMLTab will exit with an error. HTMLTab will only make GET requests. If you need to use any other method you can stream from stdin.
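
In a script, that error can be used to avoid keeping a broken CSV file. The following is a minimal sketch, assuming HTMLTab signals the error with a non-zero exit status (the text above says only that it exits with an error). The function name fetch_table is hypothetical.

```shell
# Hypothetical helper: convert the first table at a URL to CSV and
# report failure if htmltab exits with a non-zero status (an assumption
# about how "exit with an error" is signalled).
fetch_table() {
  url=$1
  out=$2
  if htmltab "$url" > "$out"; then
    echo "saved $out"
  else
    echo "htmltab failed for $url" >&2
    rm -f "$out"   # do not leave a partial or empty file behind
    return 1
  fi
}

# Usage:
# fetch_table https://www.example.com/data.html data.csv
```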

Streaming from stdin

If you want to read from standard input (stdin), use - as the HTML document. For example, you can use a Unix pipe to pass the output of a POST request made by curl into HTMLTab:

curl -X POST https://www.example.com/data.html | htmltab -

Or reading from a here string:

htmltab - <<< "<table><tr><td>1</td><td>2</td></tr></table>"

In fact, - is the default value for the input HTML document, so you don’t need to include it explicitly if you’re using stdin. The following two examples are equivalent to the two directly above.

curl -X POST https://www.example.com/data.html | htmltab
htmltab <<< "<table><tr><td>1</td><td>2</td></tr></table>"
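
When the HTML comes from curl rather than from HTMLTab itself, HTMLTab never sees the HTTP status, so a failed request has to be caught on the curl side. The following is a minimal sketch, assuming curl is installed. The helper name post_table is hypothetical; --fail, --silent, and --show-error are standard curl options.

```shell
# Hypothetical helper: POST to a URL and convert the first table in the
# response to CSV, stopping early if the request itself fails.
post_table() {
  url=$1
  tmp=$(mktemp) || return 1
  # --fail makes curl exit non-zero on an HTTP error response rather
  # than writing the error page to the output.
  curl --fail --silent --show-error -X POST "$url" > "$tmp" && htmltab - < "$tmp"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage:
# post_table https://www.example.com/data.html > data.csv
```

Saving the response to a temporary file first means HTMLTab only runs once curl has succeeded, without relying on shell-specific pipeline settings.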

Options

You can use command-line options to modify the operation of the command.

--select

The --select option is used when your input HTML document has multiple tables, and you want to convert a table that isn’t the first table in the document. When that’s the case you can use one of three methods of specifying the table:

  • Integer index
  • CSS selector
  • XPath expression

Integer index

When you know you want the nth table in the HTML document, where n > 0, you can simply pass n to --select. This is called the integer index method. For example, if you have a local file data.html and you want to convert its third table to CSV, you can use:

htmltab --select 3 data.html

The integer index is one-based: 1 means the first table in the HTML document (as opposed to the second table as it would be in zero-based numbering). A zero or a negative value is an error.

CSS selector

Sometimes a table moves around within an HTML document over time. This is especially true when you’re targeting a remote URL. One day the table may be the fourth within a document, the next it may be the fifth. In these cases you want to select the table by referencing the document structure. This is where CSS selectors or XPath expressions come in handy.

Let’s say you’re interested in an HTML document that contains weekly summarised data, and that table appears below other tables containing daily totals. On Mondays it’s the second table in the document, on Tuesdays it’s the third table in the document, and so on. Fortunately, the table has an id attribute with the value weeklydata. Using a CSS selector, you can use that id to target the table wherever it appears in the document:

htmltab --select "#weeklydata" https://www.example.com/data.html

HTMLTab supports almost all CSS3 selectors. For further details see the documentation of the underlying cssselect library.
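
A few other selector forms can be sketched the same way. The class name summary, the wrapper id report, and the function names below are hypothetical, and the sketch assumes the selector should match the table element itself. Class, attribute-prefix, and descendant selectors are all part of CSS3.

```shell
# Hypothetical wrappers: each takes an HTML document (file, URL, or -)
# and selects a table by structure instead of by position.

# A table with class="summary".
table_by_class() { htmltab --select "table.summary" "$1"; }

# A table whose id starts with "weekly", such as id="weeklydata".
table_by_id_prefix() { htmltab --select "table[id^='weekly']" "$1"; }

# A table nested anywhere inside the element with id="report".
table_in_wrapper() { htmltab --select "#report table" "$1"; }

# Usage:
# table_by_class https://www.example.com/data.html
```

What happens when a selector matches more than one table is not covered on this page, so it is safest to keep selectors specific enough to match exactly one.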

XPath expression

CSS selectors will probably be all you need, but in some complex cases you may need something more powerful. If that’s the case you can use an XPath expression as the value for --select. One example would be where you need the last table in an HTML document:

htmltab --select "(//table)[last()]" https://www.example.com/data.html
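
Another case that calls for XPath is picking a table by the heading that precedes it, which CSS selectors cannot express. The following is a minimal sketch, assuming the document marks up section titles as h2 elements. The heading text and the function name table_after_heading are hypothetical.

```shell
# Hypothetical helper: convert the first table that follows an <h2>
# whose text matches the given heading.
# Usage: table_after_heading "Weekly summary" data.html
table_after_heading() {
  heading=$1
  doc=$2
  # normalize-space() trims and collapses whitespace in the heading text;
  # following::table[1] is the nearest table after it in document order.
  # The heading must not contain a double quote, because it is embedded
  # in a double-quoted XPath string.
  htmltab --select "//h2[normalize-space()=\"$heading\"]/following::table[1]" "$doc"
}
```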

Default value and short form

The default value of --select is 1, which means the first table in the HTML document will be converted to CSV.

The short form of the --select option is -s.

--output

Writes the CSV data output by HTMLTab to a file instead of stdout.

htmltab data.html --output data.csv

The short form of this option is -o.

--keep-numbers

Tells HTMLTab to leave any number-like values in the table unchanged (so, for example, currency symbols or percent signs will not be removed). This option turns off the default behaviour of converting number-like values.

$ htmltab --keep-numbers <<< '<table><tr><td>$1,000.00</td></tr></table>'
"$1,000.00"

This is the opposite of the --convert-numbers option. The two options cannot be used together.

The short form of this option is -k.

--convert-numbers

Tells HTMLTab to convert number-like values in the table into integer or float values (for example, removing currency symbols or percent signs). This is the default behaviour and you shouldn’t need to pass this option explicitly.

$ htmltab --convert-numbers <<< '<table><tr><td>$1,000.00</td></tr></table>'
1000.00

This is the opposite of the --keep-numbers option. The two options cannot be used together.

The short form of this option is -c.

--group-symbol

Defines the character the HTML document uses to group digits in numbers (for example the , in 1,000,000).

$ htmltab --group-symbol , <<< '<table><tr><td>1,000,000</td></tr></table>'
1000000

By default , is used as the grouping character. To use a full stop as the grouping character instead, pass a single . as the value (--group-symbol followed by .).

The short form of this option is -g.

--decimal-symbol

Defines the character the HTML document uses as the decimal separator (for example the . in 1000.00).

$ htmltab --decimal-symbol , <<< '<table><tr><td>1000000,00</td></tr></table>'
100000000

By default . is used as the decimal separator. To use a comma as the decimal separator instead, pass a single , as the value (--decimal-symbol followed by ,).

The short form of this option is -d.
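
For documents that write numbers like 1.000.000,00, both symbols differ from the defaults, so --group-symbol and --decimal-symbol are passed together. The following is a minimal sketch. The wrapper name convert_european is hypothetical, and no output is shown because it depends on your data.

```shell
# Hypothetical wrapper: convert a document whose numbers are written
# like 1.000.000,00 (full stop groups digits, comma marks decimals).
# Any extra arguments, such as --select or the document, are passed on.
convert_european() {
  htmltab --group-symbol . --decimal-symbol , "$@"
}

# Usage:
# convert_european data.html
# convert_european --select 2 https://www.example.com/data.html
```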

--currency-symbol

Defines the character to remove when converting number-like strings. You can pass the option multiple times if you have more than one currency symbol.

$ htmltab --currency-symbol ₹ <<< '<table><tr><td>10₹</td></tr></table>'
10

By default $, ¥, £, and € are considered to be currency symbols. These are not used if you pass your own currency symbols (unless you include them explicitly).

The short form of this option is -u.
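
Because passing your own symbols replaces the defaults, a call that adds ₹ usually repeats the defaults it still needs. The following is a minimal sketch that keeps $ and £ alongside ₹. The wrapper name convert_with_currencies is hypothetical.

```shell
# Hypothetical wrapper: treat ₹, $, and £ as currency symbols.
# Extra arguments, such as the document, are passed on unchanged.
convert_with_currencies() {
  # Single quotes stop the shell from treating $ as the start of a
  # variable name.
  htmltab --currency-symbol '₹' --currency-symbol '$' --currency-symbol '£' "$@"
}

# Usage:
# convert_with_currencies data.html
```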

--null-value

Allows you to define a case-sensitive value to convert to an empty cell in the CSV output. You can pass the option multiple times if you have more than one null value.

htmltab data.html --null-value None --null-value NO

By default, NA, N/A, ., and - are considered null values. These are not used if you pass your own null value (unless you include them explicitly).

The short form of this option is -n.
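
The same applies here: once you pass --null-value, any default you still want has to be listed again. The following is a minimal sketch that keeps NA and N/A alongside the two custom values from the example above. The wrapper name convert_with_nulls is hypothetical.

```shell
# Hypothetical wrapper: treat None, NO, NA, and N/A as null values.
# Matching is case-sensitive, so "na" or "none" would be left as-is.
convert_with_nulls() {
  htmltab \
    --null-value None \
    --null-value NO \
    --null-value NA \
    --null-value N/A \
    "$@"
}

# Usage:
# convert_with_nulls data.html
```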

--version

Show the version of HTMLTab you have installed, and exit.

--help

Show a usage message — essentially a short version of this page — and exit.