XProc From Above

Brief tutorial overviews

From a height the broader outlines come into view.

XProc from Orbit

XProc is a processing technology for digital data. Although it is XML-based and uses XML syntax, it can work with many kinds of data, including common text-based formats such as JSON.

As a language, XProc describes pipelines. A pipeline combines a sequence or set of processes and applies them to specified inputs ("sources") to create outputs ("results").

Use XProc to build and support complex workflows in document production, data conversion, and information exchange.

Pipelines and steps

A pipeline is made of steps
A pipeline can also be used as a step, when declared with or imported into another pipeline
XProc defines libraries of standard, reusable steps for all processors, supporting many common operations
You can also design and use new steps, in and with your own XProc

Example src/starter.xpl

A simple pipeline with three steps.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">
  <!-- A pipeline is defined with p:declare-step -->
  
  <!-- First, reads a file from the file system -->
  <p:load href="../xproc-from-orbit.md" content-type="text/plain"/>
  
  <!-- Does its best to produce HTML from Markdown input -->
  <p:markdown-to-html/>

  <!-- Saves the result -->
  <p:store href="../out/xproc-from-orbit.html"/>
  
</p:declare-step>

Example src/producer.xpl

Four steps, including an imported step named with a developer's (project) namespace.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:zone="http://wendellpiez.com/ns/xproc-zone">
  
  <!--  Not a step, this imports another pipeline (step declaration or library) -->
  <p:import href="_make-orbital-markup.xpl"/>
  
  <!-- Step 1: Loads a file as plain text -->
  <p:load href="../xproc-from-orbit.md" content-type="text/plain"/>

  <!-- Makes HTML, if possible (supported in XML Calabash) -->
  <p:markdown-to-html/>

  <!-- Calls an imported step -->
  <zone:make-orbital-markup/>
  
  <!-- Saves the result -->
  <p:store href="../xproc-from-orbit.xml" message="SAVING xproc-from-orbit.xml ..."
    serialization="map { 'indent': true() }"/>
  
</p:declare-step>
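
The imported file is not reproduced here. As a minimal sketch of what an importable step can look like, the pipeline below declares a type in the project namespace, which is what lets another pipeline call it as zone:make-orbital-markup. The body is a placeholder assumed for illustration only; it is not the contents of _make-orbital-markup.xpl.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:zone="http://wendellpiez.com/ns/xproc-zone"
  type="zone:make-orbital-markup">

  <!-- Ports let the step connect to its neighbors when it is called -->
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Hypothetical body: the real step's logic is not shown in this tutorial -->
  <p:add-attribute match="/*" attribute-name="class" attribute-value="orbital"/>

</p:declare-step>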

How to write a pipeline

An XProc pipeline takes the form of an arrangement of steps.

We say 'arrangement' because steps are not limited to a single chain: each can have as many inputs and outputs as needed, and these connect the steps together.

Example src/double-validate.xpl

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0"
  xmlns:p="http://www.w3.org/ns/xproc">
  
  <!-- Reads an XML document -->
  <p:load href="../xproc-from-orbit.xml"/>
  
  <!-- A validation step with two input ports, 'source' and 'schema'
       and two output ports, 'result' and 'report'
       see https://spec.xproc.org/master/head/validation/#c.validate-with-relax-ng -->
  <p:validate-with-relax-ng     assert-valid="false" name="TEI-structures">
    <!-- 'source' port is picked up implicitly -->
    <p:with-input port="schema" href="schemas/orbital-promoted.rnc"/>
  </p:validate-with-relax-ng>
  
  <!-- Another validation step with similar ports -->
  <p:validate-with-schematron   assert-valid="false" name="other-regularities">
    <!-- 'source' comes from the preceding step's 'result' -->
    <p:with-input port="schema" href="schemas/orbital-stability.sch"/>
  </p:validate-with-schematron>
  
  <p:wrap-sequence wrapper="VALIDATION-REPORTS">
    <p:with-input>
      <!-- Reading in 'report' output ports from earlier steps -->
      <p:pipe port="report" step="TEI-structures"/>
      <p:pipe port="report" step="other-regularities"/>
    </p:with-input>
  </p:wrap-sequence>
  
  <!-- Saving the combined report -->
  <p:store href="../out/validation-reports.xml" serialization="map { 'indent': true() }"/>

</p:declare-step>

XProc From 40,000 Feet

I/O, ports and documents

Pipelines and steps can accept one or more designated inputs, or none: they can accept inputs provided at runtime, find and load data sources themselves, do both, or generate their own data
Steps can also expose processing results as outputs, and also interact with your system, writing files on disk or communicating through other channels
The connecting points on an XProc step are called its ports
By means of ports, we pass documents into steps and get documents out - and while XProc calls them documents, they can be pretty much any kind of data
A pipeline can be defined with ports for connecting with other steps, when used as a step in another pipeline
The composability of steps is key to the efficiency and power offered by XProc
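
As a sketch of how a pipeline might declare its own ports (the port layout is an assumption for illustration; the schema path is borrowed from src/double-validate.xpl), the following declares one primary input and two outputs, and wires the secondary output to a step's report port by name.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">

  <!-- Primary input: the document to be checked, supplied by the caller -->
  <p:input port="source" primary="true"/>

  <!-- Primary output: bound implicitly to the last step's primary output -->
  <p:output port="result" primary="true"/>

  <!-- Secondary output: connected by name to the validation step's 'report' port -->
  <p:output port="report" primary="false" sequence="true">
    <p:pipe port="report" step="validating"/>
  </p:output>

  <p:validate-with-relax-ng name="validating" assert-valid="false">
    <p:with-input port="schema" href="schemas/orbital-promoted.rnc"/>
  </p:validate-with-relax-ng>

</p:declare-step>

Used as a step in another pipeline, this would offer the same source, result and report connection points that a built-in step does.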

Example src/double-validate.xpl (shown in full above)

This pipeline validates its input twice and collects validation reports into a single document.

Ports, steps, sources and results

Ports go in only one direction: they are for input or output, never both. At most one of a step's input ports is designated as primary; any others are secondary. Similarly, at most one of its output ports may be designated as primary. Not all steps have secondary ports, but some steps make little sense without them.

The conventional name for the primary input port is source. The conventional name for the primary output port is result. These names correspond with the uses of these terms in XSLT and XQuery.

It is sometimes useful for a step to have output but no input (like p:load), or input with no output (like p:sink). And not all steps produce modifications of inputs among their outputs. For example, p:directory-list has no input but produces XML on its result port listing the contents of a file system directory.

The source and result ports can carry sequences when they are defined as such on their steps - and when permitted to be sequences they may also be empty, with no documents bound to them.

While the primary ports will ordinarily be named source and result, the names of secondary ports may be less generic, to indicate what roles they play for their steps. For example, validation steps all have a schema secondary input port, for their schemas; the p:insert step has an insertion port for the data to be inserted, and so forth.
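
To make the distinction concrete, here is a small sketch (the inserted note element is made up for illustration). p:directory-list has no input port and produces a listing on its result port; p:insert then receives that listing on its primary source port implicitly, while its secondary insertion port has to be connected by name.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">

  <p:output port="result" serialization="map { 'indent': true() }"/>

  <!-- No input port: the listing is produced from the 'path' option alone -->
  <p:directory-list path="."/>

  <!-- 'source' is bound implicitly; 'insertion' is a secondary port and is named explicitly -->
  <p:insert position="first-child">
    <p:with-input port="insertion">
      <note>Listing produced by p:directory-list</note>
    </p:with-input>
  </p:insert>

</p:declare-step>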

Steps with implicit port connections

XProc syntax can be fairly concise and clean - at least, as XML-based formats go - because it has sensible fallback rules and some nice ways of keeping syntax simple.

One important example: as long as steps in your pipeline are to be applied in sequence, their connections do not have to be shown. In effect, the steps snap together.

The XProc feature in play here is called the default readable port. The concept is simple. Any step with a primary input port that is not connected explicitly will be bound to the primary output of the immediately preceding step.

This works well, with the caveat that steps that have no input ports don't connect like this, even when given in sequence - and steps with no output ports can't be connected as inputs at all. Know your steps.

Example src/moresteps.xpl

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">
  
  <!-- Reads a file as plain text -->
  <p:load href="../xproc-from-orbit.md" content-type="text/plain"/>
  
  <!-- If the processor supports the step, Markdown is converted to HTML -->
  <p:markdown-to-html/>

  <!-- Applying an XSLT transformation using the nominated stylesheet -->
  <p:xslt>
    <p:with-input port="stylesheet" href="xslt/html-resuscitate.xsl"/>
  </p:xslt>
  
  <!-- Saves a file in the ../out folder -->
  <p:store href="../out/xproc-from-orbit.html"/>
  
</p:declare-step>

Example src/moresteps-explicit.xpl

The same pipeline, with connections spelled out.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">
  
  <p:load name="loading" href="../xproc-from-orbit.md" content-type="text/plain"/>
  
  <p:markdown-to-html name="making-html">
    <p:with-input port="source">
      <p:pipe port="result" step="loading"/>
    </p:with-input>
  </p:markdown-to-html>

  <p:xslt name="transforming">
    <p:with-input port="source">
      <p:pipe port="result" step="making-html"/>
    </p:with-input>
    <p:with-input port="stylesheet">
      <p:document href="xslt/html-resuscitate.xsl"/>
    </p:with-input>
  </p:xslt>
  
  <p:store href="../out/xproc-from-orbit.html">
    <p:with-input port="source">
      <p:pipe port="result" step="transforming"/>
    </p:with-input>
  </p:store>
  
</p:declare-step>
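
The caveat about steps without input ports can be sketched as follows. The file names one.xml and two.xml are hypothetical. Because p:load has no input port, the second p:load does not consume the first one's result, so a later step that needs both documents has to name them explicitly.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">

  <p:output port="result"/>

  <p:load name="first" href="one.xml"/>

  <!-- p:load has no input port, so nothing from 'first' flows into this step -->
  <p:load name="second" href="two.xml"/>

  <!-- Left implicit, the input here would be the result of 'second' only;
       to read both documents, both are piped in by name -->
  <p:wrap-sequence wrapper="both">
    <p:with-input>
      <p:pipe port="result" step="first"/>
      <p:pipe port="result" step="second"/>
    </p:with-input>
  </p:wrap-sequence>

</p:declare-step>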

Knowing your formats

XProc comes with native support for reading and parsing XML and JSON, for plain text inputs, and for inputs whose syntax is described by a grammar using Invisible XML (ixml).

All of these types of data, and others, can be passed from one step to another, as long as both steps can accommodate the given format or media type.

Example src/read-json.xpl

With its map and array types, XPath 3.1 supports JSON natively.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:array="http://www.w3.org/2005/xpath-functions/array">
  <!-- A pipeline is defined with p:declare-step -->

  <p:output port="result" serialization="map { 'indent': true(), 'omit-xml-declaration': true() }"/>

  <!-- Reads in a JSON literal (this time) -->
  <p:input port="source">
    <p:inline content-type="application/json" expand-text="false" xml:space="preserve">{    
"title": "TEI P4: Guidelines for Electronic Text Encoding and Interchange (XML-compatible edition)",
"date": "2004",
"contributors": [
  { "role": "editor",
    "indexName": "Sperberg-McQueen, C. Michael",
    "displayName": "C. M. Sperberg-McQueen" },
  { "role": "editor, XML conversion",
    "indexName": "Burnard, Lou",
    "displayName": "Lou Burnard" },
  { "role": "XML conversion",
    "indexName": "Bauman, Syd",
    "displayName": "Syd Bauman" },
  { "role": "XML conversion",
    "indexName": "DeRose, Steven J.",
    "displayName": "Steven DeRose" },
  { "role": "XML conversion",
    "indexName": "Rahtz, Sebastian",
    "displayName": "Sebastian Rahtz" } ],
"archive": "https://www.tei-c.org/Vault/P4/doc/html/"
}
    </p:inline>
  </p:input>

  <p:variable name="cit" select="."/>

  <!-- Makes an XDM sequence from the array contained in the map -->
  <!-- New in XPath 3.1, the ? lookup operator returns a value from a map by its key (property name) -->
  <p:variable name="contribs" select="array:flatten($cit?contributors)"/>

  <p:identity>
    <p:with-input>
      <citation>{ head($contribs)?indexName },{ ' and'[empty($contribs[3])] } {
        $contribs[2]?displayName }{ $contribs[3]!', et al' }. "{ $cit?title }", { $cit?date }. {
        $cit?archive ! (. || '.') }</citation>
    </p:with-input>
  </p:identity>

</p:declare-step>
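
Plain text can be handled along the same lines. The sketch below is an illustration rather than part of the project: the summary element and its attributes are invented, and it assumes the Markdown file loaded in the earlier examples. The file is read as text, its content is held as a string, and a small XML result is built from it with XPath.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">

  <p:output port="result" serialization="map { 'omit-xml-declaration': true() }"/>

  <!-- Loaded as plain text: the document holds a single text node, not markup -->
  <p:load href="../xproc-from-orbit.md" content-type="text/plain"/>

  <!-- The text is held in a variable as a string -->
  <p:variable name="text" select="string(.)"/>

  <!-- An XML summary is built from the text, with XPath inside value templates -->
  <p:identity>
    <p:with-input>
      <summary lines="{ count(tokenize($text, '\n')) }"
               headings="{ count(tokenize($text, '\n')[starts-with(., '#')]) }"/>
    </p:with-input>
  </p:identity>

</p:declare-step>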

Powerful embedded languages

XPath, XSLT and XQuery all play well with XProc, which is designed first and foremost to accommodate these kindred technologies.

XProc steps can either invoke or embed instances of these declarative processing languages, or others.
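
As a sketch of the embedding case (the greeting and shout elements are invented for illustration), a stylesheet can be written inline on the stylesheet port of p:xslt instead of being loaded from a file. The stylesheet here avoids curly braces, since XProc would otherwise try to expand them as value templates.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <p:output port="result"/>

  <p:xslt>
    <!-- A literal source document, for the sake of a self-contained example -->
    <p:with-input port="source">
      <greeting>hello</greeting>
    </p:with-input>
    <!-- The stylesheet is embedded rather than referenced with href -->
    <p:with-input port="stylesheet">
      <xsl:stylesheet version="3.0">
        <xsl:template match="greeting">
          <shout><xsl:value-of select="upper-case(.)"/></shout>
        </xsl:template>
      </xsl:stylesheet>
    </p:with-input>
  </p:xslt>

</p:declare-step>

The invoking case is the one already shown in src/moresteps.xpl, where the stylesheet port points to a file with href.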

Options

In addition to inputs and outputs (connection ports), steps can also have options.

These provide runtime configurations when invoking steps and pipelines.

For some steps, certain options are required, for example to designate which nodes to delete on a p:delete step (using the match option).

Values assigned to options can be simple (string value flags) or complex (such as map objects, or expressions to be evaluated).

Together, a step's ports and options provide the interface for using it.

Options can be set on steps using abbreviated syntax (attributes) or long syntax (p:with-option).
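
The two syntaxes can be compared in a short sketch, using an invented input document. Both p:delete steps set the required match option: the first as an attribute, whose value is taken as a string, and the second with p:with-option, whose select attribute holds an XPath expression (hence the extra quotes around the string).

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">

  <!-- A default input, made up for this illustration -->
  <p:input port="source">
    <doc><keep>stays</keep><note>goes</note><draft>goes too</draft></doc>
  </p:input>
  <p:output port="result"/>

  <!-- Abbreviated syntax: the option is given as an attribute -->
  <p:delete match="note"/>

  <!-- Long syntax: the option value is the result of an XPath expression -->
  <p:delete>
    <p:with-option name="match" select="'draft'"/>
  </p:delete>

</p:declare-step>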

Foundations of XProc

The more you know, the better you feel.

XML syntax
XML namespaces
XPath (including XPath 3.0, 3.1) and XDM (XQuery and XPath Data Model)
XSLT, XQuery
REST and the web / URIs
schema technologies, standard vocabularies, validation and workflow
HTML/CSS/SVG, XSL-FO

Minimal XProc

If your skills include XSLT, this might be all you ever need in XProc:

p:declare-step
p:variable
p:load
p:store
p:xslt
@serialization
@message

Take another lesson on Day Two - learn to write a pipeline and use it as a step

p:declare-step/@type
p:input and p:with-input
p:output
p:option and p:with-option
p:library
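
As a rough sketch of how these Day Two pieces fit together, here is a small library and a pipeline that uses it. Everything named here is hypothetical: the ex namespace, the ex:label step, and the file name label-steps.xpl are assumptions made for the example.

<?xml version="1.0" encoding="UTF-8"?>
<p:library version="3.0"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:ex="http://example.com/ns/steps">

  <!-- @type gives the step a name that other pipelines can call -->
  <p:declare-step type="ex:label">
    <p:input port="source"/>
    <p:output port="result"/>
    <!-- An option with a default value, used when the caller sets none -->
    <p:option name="label" select="'unlabeled'"/>

    <p:add-attribute match="/*" attribute-name="label" attribute-value="{$label}"/>
  </p:declare-step>

</p:library>

A pipeline saved next to the library (assumed here to be label-steps.xpl) can then import it and call the step:

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:ex="http://example.com/ns/steps">

  <p:import href="label-steps.xpl"/>

  <p:input port="source">
    <doc>hello</doc>
  </p:input>
  <p:output port="result"/>

  <!-- The imported step is called by its type name, with an option set -->
  <ex:label>
    <p:with-option name="label" select="'day two'"/>
  </ex:label>

</p:declare-step>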

Three everyday utility steps

p:sink
p:cast-content-type
p:identity
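
The three utility steps can be seen together in one short sketch. The JSON literal and the output file name are invented, and the exact XML produced by the conversion is left to the processor; this shows how the steps are used, not a guaranteed result.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step version="3.0" xmlns:p="http://www.w3.org/ns/xproc">

  <!-- p:identity passes its input through unchanged, which makes it a handy way
       to introduce a literal document; expand-text="false" keeps the JSON braces literal -->
  <p:identity>
    <p:with-input>
      <p:inline content-type="application/json" expand-text="false">{ "status": "ok", "count": 3 }</p:inline>
    </p:with-input>
  </p:identity>

  <!-- p:cast-content-type asks the processor to convert the document to another media type -->
  <p:cast-content-type content-type="application/xml"/>

  <p:store href="../out/status.xml"/>

  <!-- p:sink discards whatever it receives, making it explicit that nothing is passed on -->
  <p:sink/>

</p:declare-step>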