Fork me on GitHub

Supergroup.js

Supergroup brings extreme convenience and understandability to the manipulation of Javascript data collections, especially in the context of D3.js visualization programming.

As if in submission to the great programmers commandment–Don’t Repeat Yourself–every time I find myself writing a piece of code that solves basically the same problem I’ve solved a dozen times before, a little piece of my soul dies.

Utilities for grouping record collections into maps or nests abound: d3.nest, d3.map, Underscore.groupBy, Underscore.Nest, to name a few. But after these tools relieve us of a certain amount of repetitive stress, we’re often left with a tangle of hairy details that fill us with a dreadful sense of deja vu. Supergroup may seem like the kind of tacky wonder gadget you’d find on a late-night Ronco ad, but, for the low, low price of free, it makes data-centric Javascript programming fun again. And, when you find yourself in a D3.js callback routine holding a datum object that might have come from anywhere–for instance, with a tooltip callback used on disparate object types–everything you want to know about your object and its associated metadata and records is right there at your fingertips.

Just to be clear about the problem—you start with tabular data from a CSV file, a SQL query, or some AJAX call:

Some very fake hospital data in a CSV file...

    tabulate(d3.select('pre#csv'), data, ['Patient','Patient Age','PatientVisit','Date','Time','Unit','Physician','Charge','Copay','Insurance','Inpatient']); // # run
...turned into canonical array of Objects (using d3.csv, for instance)

    data; // #   render result.replace(/{/g,'\n   {').replace(/]/,'\n]');

Without Supergroup, you’d group the records on the values of one or more fields with a standard grouping function, giving you data like:


d3.nest().key(function(d) { return d.Physician; })
            .key(function(d) { return d.Unit; })
            .map(data);  // # show render indent2

or


d3.nest().key(function(d) { return d.Physician; })
            .key(function(d) { return d.Unit; })
            .entries(data);  // # show render indent2 result.replace(/,\n/g, ", ").replace(/("key".*, )/g,"$1\n").replace(/,   */g, ", ")

To my mind, these are awkward data structures (not to mention the awkwardness of the calling functions.) The map version looks ok in the console, but D3 wants data in arrays, not as objects. The entries version gives us arrays of key/value pairs, but on upper levels values is another array of key/value pairs while on the bottom level values is an array of records. In both entries and map, you can’t tell from a node at any level what dimension was being grouped at that level.

Supergroup gives you almost everything you’d want for every item in your nest (or in your single array if you have a one-level grouping):

Supergroup

Works as an Underscore (or Lo-Dash) mixin:


// # showhtml result.replace(/(.|\n)*/,'')

Basics: A plain Array of Strings, enhanced with children and records

_.supergroup(data, fieldname) returns an array whose elements are the distinct values of <fieldname> in the original data records. These elements, or Values can be String or Number objects (Dates to be implemented eventually). Each Value holds a .records property which is an array containing the subset of original records matching that Value.

In the example below we do a multi-level grouping by Physician and Unit. So sg = _.supergroup(data,['Physician','Unit']) returns a list of physicians (the top-level grouping). The first item in this list, sg[0], is “Adams”, a String object. sg[0].records is an array containing the records where Physician=“Adams”. sg[0].children is a list of the Units (our second-level grouping) in the records where Physician=“Adams”. sg[0].children[0].records would be the subset of records where Physician=“Adams” and Unit=“preop”.

Supergroup on physician and unit
 
sg = _.supergroup(data, ['Physician','Unit']); // # show render
sg[0]  // # show render
sg[0].records // # show render
sg[0].children // # show render

Dimension names and paths

When you’re using D3 for any kind of significant application, you’ll be writing callbacks that could accept datums of different sorts, from different hierarchy levels or whatever. D3 makes it super easy to pass the data values around, but then you spend half your time trying to reattach metadata to the values you’re using. Not with Supergroup:


sg = _.supergroup(data, ["Physician","Unit"]) //#show render
sg[0].children[0] //#show render
sg[0].children[0].dim // # show render
sg[0].children.dim //#show render
sg.dim //#show render
sg[0].children[0].parent // # show render
sg[0].children[0].namePath() // # show render
sg[0].children[0].dimPath() // # show render

Aggregates

You can apply aggregate functions to the records of a single group or to all the groups in a list.

_.each(data, function(rec) {
    rec.Charge = parseFloat(rec.Charge); // make these actual numbers
    rec.Copay = parseFloat(rec.Copay);
});
sg = _.supergroup(data, ['Physician','Unit']); // # run
sg[0].aggregate(d3.sum, "Charge") // # show render
sg[0].aggregate(d3.sum, function(rec) { return rec.Charge - rec.Copay; }) // # show render
sg.aggregates(d3.sum, "Charge") // # show render
sg.aggregates(d3.sum, "Charge", "dict") // # show render

Finding specific values


sg.lookup("Feldman") // # show render
sg.lookup("Feldman").aggregate(d3.sum,"Charge") // # show render
sg.lookup(["Gupta", "pediatrics"]).namePath() // # show render
sg.lookupMany(["Baker", "Doom", "Feldman","A Name With No Match"]) // # show render

Retrieving sets of values


sg.leafNodes()  // all bottom level groups  # show render
sg.flattenTree()  // all groups  # show render
_(sg.leafNodes()).invoke("namePath") // call .namePath() on all bottom level groups using underscore invoke  # show render

Using Supergroup for D3 hierarchy layouts

D3 hierarchy layouts (Cluster, Pack, Partition, Tree, Treemap) require a slightly different data structure than those produced by d3.nest. Underscore.Nest does very close to the right thing, but Supergroup gives you a bunch of added benefits.

I’ll demonstrate using Supergroup in a D3 hierarchy with code from this basic div-based treemap example.

The kind of tree D3 wants for its hierarchy layouts has a single root node and at the leaf level are the raw records. Except for the leaves, every node has a children array. On upper levels, a group node’s children are other group nodes. At the next-to-bottom level, the children are raw records. Supergroup generally considers records and children to be two different things, and the children of a group value are other group values.

So, for D3 hierarchies, we get a root node by calling root = sg.asRootVal(). Then we add a final level of raw records by calling root.addRecordsAsChildrenToLeafNodes(). Now root is ready to be used in a treemap. To see details, inspect code here.


window.root = _.supergroup(data, ['Physician','Unit']).asRootVal('All Physicians'); // # show run
root.addRecordsAsChildrenToLeafNodes();
d3.layout.hierarchy()(root); // # show render

var color = d3.scale.category20c();
var treemap = d3.layout.treemap()
    .size([700, 400])
    .padding([18,3,3,3])
    .value(function(d) { return d.Charge })
var div = d3.select("div#viz");
var node = div.datum(root).selectAll(".treemapnode")
        .data(treemap.nodes)
    .enter().append("div")
        .attr("class", "treemapnode")
        .call(position)
        .style("background", function(d) { return d.children ? color(d) : null; })
        .text(function(d) { 
            return d.children ? d : 
                _.chain(d).pick('Patient', 'Date', 'Charge')
                    .values().value().join(', ');
        }) // # run show

function position() {
    this.style("left", function(d) { return d.x + "px"; })
        .style("top", function(d) { return d.y + "px"; })
        .style("width", function(d) { return Math.max(0, d.dx - 1) + "px"; })
        .style("height", function(d) { return Math.max(0, d.dy - 1) + "px"; });
} // # run

Multi-valued Groups

Sometimes it makes sense to group on multi-valued fields, which leads to the result that records with multiple values in a grouped field end up in more than one group. It doesn’t happen often, but when it does, good luck getting underscore or lodash or d3.nest or anything to help you with the grouping.

One of our fake data records has two values separated by a semicolon in the Insurance field. We turn that field into an array. First we show that by default, Supergroup rejoins the array (with commas) and groups as usual, giving us four Insurance groups. But when we ask for multiValuedGroups, we only get three groups. And that one record will show up in both of them.


_.each(data, function(d) { d.Insurance = d.Insurance.split(';')}) //  make Insurance field an array instead of ;-delimited string  # show run
_.supergroup(data, "Insurance") // supergroup by default just makes the array back into a string, joined with comma. so, 4 Insurance groups  // # show render
_.supergroup(data, 'Insurance', {multiValuedGroup: true}); // now only 3 Insurance groups!  # show render
mvnest = _.supergroup(data, ['Insurance','Patient'], {multiValuedGroups: ['Insurance']});
_.invoke(mvnest.leafNodes(),'namePath') // # show render

(In order to get this to work, I exposed an internal function of lodash. You can see the tiny change in my lodash fork.)

License

MIT: http://sigfried.mit-license.org/