Overview
The TransformationFramework provides a way to perform sets of generic
transformation operations on any type of object. For example, a set
of transforms can be configured that transform a filename for an EDI
document into a set of ParsedLine objects, then into XML segments,
then into a structured XML document. By decoupling the pieces, it's
possible to transform into several output formats at once, or use a
smaller set of transforms when XML segments are needed in one place
but structured XML is required in another. Splitting a task
into a set of components not only increases reuse but also makes it
easier to test each piece in isolation.
What do I get?
Getting Started
How do I get the Objects out of a Transform?
JavaDoc
One possible application of this technique is to dynamically figure
out which transforms are required. The following two scenarios could
make use of common transforms:
- A process receives EDI documents and updates a
database.
A Parser transform is defined to transform a filename into
ParsedLine objects. Another transform transforms the ParsedLine
objects into XML segments. A final transform transforms the
XML segments into an Object representation.
- An XPath query engine examines transformed EDI documents to
find usage patterns.
The previous transforms are used to transform a filename into
XML segments. A new transform is defined to transform the
XML segments into a structured XML document. An XPath query can
then be run on this final form.
XPathFind is a utility that can
execute XPath queries.
Wait a second, you said
convert a filename, not an actual file?
That's right. By starting with a filename, it's possible to separate
the finding of the file into its own transform. For example, with this
additional layer of indirection, it would be possible to handle a URL
with a custom
URI scheme
(an example of a URI scheme is "http") or the ability to create a
parsed stream of data on the fly from a database or other source.
There's no requirement that a transform must have a filename as its
input type. However, there are convenience methods available for
transforming filenames into other types, since that is expected to be
the largest use.
In practice, the transform that takes a "Filename" as an input is
responsible for handling encoding issues. The sample ParserTransform
provided (which takes type "Filename" as input) delegates to a
helper class to determine what encoding the file should be loaded with
as well as what delimiter is expected. The helper class checks the
filename against a properties file. In a more realistic example,
the helper class might consult a database to retrieve that information.
In order to pass in a database connection or an object that will
provide a connection without JNDI, the
GlobalProperties and
ThreadProperties
classes can be used. See the second bullet point
in the next section for more details.
What do I get?
The TransformationFramework provides many reusable components.
- Transforms are loaded dynamically via a configuration file,
and are created with reflection. The arguments required for each
constructor are recursively created, so arbitrary objects with
diverse constructor arguments can be dynamically created.
- A
GlobalProperties and a
ThreadProperties
object are provided. These can be used to pass extra data
into transforms
that is only available at runtime. Use of these classes is not
required, but can serve as an example of how extra runtime data
might be passed into a transform.
- The type safety of transforms is guaranteed at compile time.
It is not necessary for the transformation graph to be loaded
before the types can be resolved.
The transformation graph loader uses reflection to look for
a transform(...) method on each input transform which can
take any single argument. Proper use of interfaces to connect
the transforms at compile time is possible (and encouraged).
- A genetic algorithm library written in Java and provided under
the GPL, written by José Galaviz, is included. See
the site
or download the code.
- An optimized implementation to find the optimal Minimum Steiner Tree
on directed graphs with a small vertex subset is included.
This implementation requires the choice of the root of the tree.
Computing the Minimum Steiner Tree
allows for the minimum amount of transformation work to be done
to produce the desired outputs (if there is more than one output).
The Minimum Steiner Tree is the minimum cost tree that connects the
input type with all of the output types.
In general, this problem is NP-Complete, however, a small vertex subset
makes the constants small enough to tackle directly.
On the author's AMD Athalon 1800, finding the Minimum Steiner Tree
on a 60 node complete graph with a vertex subset of 5 nodes (which of
the 5 nodes is the root is chosen by the user) completes in under
200 ms. For k < 5, the entire algorithm runs in
O(2k-2·(k-1)! + (k-1)2·|E|+|V| log |V|) time,
where k is the input type (1) plus the number of output types.
(For 4 output types, k is 5.) In general, the entire
algorithm will run in
O(f(k)·(k-1)! + (k-1)2·|E|+|V| log |V|) time,
where f(k) is the number of mutually non-isomorphic trees of
size k.
k | f(k) |
2 | 1 |
3 | 2 |
4 | 4 |
5 | 9 |
6 | 20 |
7 | 49 |
8 | 119 |
The input graph is pruned before the minimum cost tree is found
with a depth-first search.
Any nodes or edges that are not on one or more paths from the input
node to any output node are (temporarily) removed from the graph.
If the resulting pruned graph is a tree, it is simply returned
as the minimum cost tree (since there is only one).
If the resulting pruned graph is not a tree, then the optimized
Minimum Steiner Tree algorithm described above will be run on the
pruned graph.
If the vertex subset is larger, a near-optimal solution will be
sought using a genetic algorithm on the pruned graph. The
threshold for using the genetic algorithm is configurable.
For a single input and a single output the algorithm collapses to
Dijkstra's algorithm.
There is also code for the following:
- Find and return all mututally non-isomorphic trees of
any given size.
- Find and return all orders of a String array as a List
of Lists.
- Find and return all sets of a given size of positive
integers that add up to a given total.
- A Fibonacci Heap (from the JGraphT library)
-
Dijkstra's algorithm
using a Fibonacci heap, which runs in O(|E|+|V| log |V|) time.
Getting Started
The following example code is provided as a starting point to see how
the TransformationFramework works, and what code that uses it looks
like. First, you'll need to start with a
transformation configuration file.
TransformConfig config = TransformConfig.load("resources/transform-config.xml");
config.applyTransforms("resources/CA-product-availability.edi",
"JDOM Documents");
If the input is a filename, performing a transform to a given output
type only requires the code above. For transforming something that is
not a filename, use the following code:
TransformConfig config = TransformConfig.load("resources/transform-config.xml");
TransformGraphCollection transformGraphs = config.getTransformGraphs();
TransformGraph transformGraph = transformGraphs.getTransformGraph("ediTransforms");
transformGraph.transform("resources/tests/CA-product-availability.edi",
"Filename", "JDOM Documents", null);
It is also possible to ask the
TransformGraph
to return an Object that implements
ITransformer,
then reuse it for multiple transforms.
The methods for doing so are
here
and
here.
If you've noticed that the method on
ITransformer
returns void, and that seems odd, see
the next major section.
The TransformationFramework handles
finding the shortest path from the input type to the output type given
all the transforms. For complicated graphs, an optional "cost" can
be added to each transform. This cost represents the amount of "work"
it takes to get from one type to another. Any positive integer value
can be
specified, and it can represent processing power, memory used, or any
value desired. For the best results, each cost should be correct
relative to the other costs.
Later, if a transform is
written that takes some type to another type faster without going
through intermediate steps, this path will be taken if its cost is
lower than going through the extra steps. If a cost is not provided,
it is assumed to be one. Leaving the original transform(s) in the
file means that they can still be used, if they are still desired input
or output types.
JavaDoc
The JavaDoc is available
here.
How do I get the Objects out of a Transform?
When a transform is run, there is the possibility that many Objects
will be produced. If a large EDI document is being converted to XML,
for example, it can be split into multiple units of work. Each unit
of work can then be worked on independently, reducing the memory
required.
Let's say that there is a transform chain defined in the transform
config that takes an EDI filename as input and eventually produces
"Java objects" as output (through some number of steps, possibly just
1). The desired goal is that the information is stored in a database.
There are two ways of achieving this goal.
The first way of achieving this goal is to set up a database
transform that takes "Java objects" as input,
updates the database, and provides "Database updates" as output.
This transform doesn't even need to produce something that is
consumable by another following transform. When the filename is
transformed with "Database outputs" as the output type, the database
updates will happen as part of the transformation process.
The second way of achieving the goal of database updates is to provide
a transform reciever. A transform receiver acts like a regular
transform, except that it is programmatically defined, and does not
need to be in the transformation config file. In this case, a
transform between the EDI filename and "Java objects" is requested,
but a transform reciever is passed in. See the
getTransformer
and
transform
methods on
TransformGraph
for more details.
The output transform that produces "Java objects" is then connected to
the transform receiver, and each Java object produced will cause the
TransformGraph to call the
transform
method on the
transform receiver. In our example, this transform receiver would then
execute the database updates based on the Java objects passed into
its
transform
method.