This second blog-post is a somewhat auspicious one: I have just completed a rather extensive overhaul of both the back and front-end of Spinoza's Ethics 2.0. Much of this was not directly related to DataViz or Digital Humanities more generally. But the overhaul does bear on two topics of this blog: data consistency and coherence (in acquisition, storage, interpreting, and visualizing), and data complexity (again, from all sides).
All of this is to say that the framework on and through which data is presented can matter just as much as the methods by which one accrues, stores, or organizes it. All of these are clearly important aspects of digital scholarship. A massive data-set is useless if the people who would most benefit from access face excessively steep learning curves for the software needed to make use of it. There are, of course, parallels here to industry, where working with databases, or at least needing to handle reports sent from or to them (or even simply MS Excel spreadsheets) has been common practice for far longer than it has been to, say, digitize the world's entire library and run big-data analytics on it (by which I mean of course things like Google's Ngram viewer).
In producing a substantial update for Spinoza's Ethics 2.0, my aim was not merely to run it on the latest and greatest platforms, nor to produce anything particularly aesthetically impressive (I am not, primarily, a web-designer). But I do think that usability and aesthetics are both important top-layers for publicly accessible Digital Humanities. The most important layer, the foundation, is, of course, the data set. This, too, required an update, which highlights that data is in some sense not static, cardinal, or anodyne.
The original data I compiled came from a single-pass run through the text by hand. This, of course, meant many errors. In the beginning, I ran multiple checks before putting it up on the website, including a careful cross-check of the Latin and Dutch, but still some errors persisted (and probably still persist).
One side-effect of creating the Elements section of the website is that by separating out each piece and viewing the data as organized by my code (which would also need to be verified and re-verified), it became even easier to find and correct mistakes. In this case, I admit that working with a single text is considerably easier than dealing with massive corpuses that could never be manually checked with the same sort of accuracy. There are tools and techniques designed to account for these things, but the fuzziness of the terrain itself suggests that it may just be a fundamental and persistent problem for this kind of analysis. This leads to a different sort of issue, with which I have also contended in the most recent update: how to represent and maintain inherent complexity and accuracy of a source.
The issue of accuracy seems to arise at both ends of any project: the initial filtering, sorting, and categorization that goes on in the procuring of data will necessarily colour the way in which it is stored, and therefore the way in which it must be accessed and reconstructed. In the case of Spinoza's Ethics, there is a built-in labeling system, which would seemingly make it an excellent case-study in digital analysis (of course I would say this...). However, the limits of Spinoza's own capacity to keep things straight, and his editors' reluctance to interfere too much with the work of a great philosopher, have played a significant role in making the text a bit of a mixed bag.
The Euclidean-esque labels are an excellent idea, and indeed, one of the things that people find most distinctive about Spinoza's philosophical work. However, as it turned out, some strange issues have arisen in the scholarship which otherwise might not have been noticed. Consider, for example, that Spinoza references seemingly non-existent elements in places, presumably because they did exist in earlier drafts. Worse, he somehow managed to label multiple axioms identically in the second part of the text, leading to the needlessly convoluted references to such things as "axiom 1, after Lemma 3 after proposition 13" (which is probably why Spinoza-scholars tend to call that section, simply, the "physical digression").
This raises some relatively minor, but no less frustrating, issues for digitizing and storing such a text as pieces of data. If we adopt Spinoza's labeling system (thus acting as good archivists), we are preserving idiosyncrasies that hinder clarity and make working with the data more difficult. In my own case, I ran into a serious conundrum about how to store the labels for the aforementioned duplicated axioms.
As it happens, I had inadvertently stored these labels in multiple tables in the database using slightly different conventions, and this may have led to some inaccuracy in how things were displayed (I'm fairly certain I caught these things early on and dealt with them, but I didn't get around to re-jigging the data until now.)
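One way to keep identically labelled axioms apart is to store, alongside the label, the element it follows. The following is only a minimal sketch under that assumption, not the scheme actually used in the site's database: the `ElementKey` class, its `follows` field, the `@` separator, and the anchor string `"P13L3"` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ElementKey:
    """Hypothetical composite key for one element of the text."""
    part: int                      # part of the Ethics, 1 to 5
    kind: str                      # major type letter, e.g. 'A' for axiom
    number: int                    # the number Spinoza gives the element
    follows: Optional[str] = None  # assumed anchor: the element it comes after

    def label(self) -> str:
        base = f"{self.part}{self.kind}{self.number}"
        # Only the disambiguated elements carry the extra anchor.
        return base if self.follows is None else f"{base}@{self.follows}"


def register(keys: Iterable[ElementKey]) -> Dict[ElementKey, str]:
    """Build a lookup table, refusing to store the same key twice."""
    registry: Dict[ElementKey, str] = {}
    for key in keys:
        if key in registry:
            raise ValueError(f"duplicate element key: {key.label()}")
        registry[key] = key.label()
    return registry


# Axiom 1 at the head of Part II, and the "axiom 1, after Lemma 3
# after proposition 13" from the physical digression.
opening_axiom = ElementKey(part=2, kind="A", number=1)
digression_axiom = ElementKey(part=2, kind="A", number=1, follows="P13L3")
registry = register([opening_axiom, digression_axiom])
```

Because the anchor is part of the key, both axioms can live in the same table, and any other table that refers to them can use the same key rather than inventing its own convention.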
There is one last point I want to make about the issue of data-organization and its effect on analytics and visualization. As a scholarly exercise, it seems reasonable to expect that there should be a certain amount of difficulty demanded by a sufficiently useful undertaking, and there should additionally be a certain care taken with data so that it is not corrupted or distorted into uselessness or falsehood.
The original tables I constructed preserved complexity by adopting the shorthand of the profession: references using single-letters such as 'D', 'A', 'P', 'C', 'S' are all familiar to Spinoza scholars. However, as it turned out, Spinoza's text is more corrupt and more complex than this. There are Propositions ('P') for which there are demonstrations ('D'), after which follow a scholium ('S'), or two ('S', 'S2', or 'S1', 'S2'), or sometimes there are just divisions within scholia, though these are generally not referred to explicitly by Spinoza in the Euclidean manner.
But sometimes these elements are inconsistently numbered, or not numbered at all, and they occur in strange orders, such as the absolute monstrosity of '3P55scds', which refers to the scholium following the demonstration of the corollary following the scholium of 3P55. I do not know if that reference has ever been made in quite that way by anyone other than myself, but it was my solution to this issue. If applied consistently, simply tacking on additional letters gives their placement in the order, and therefore, hopefully, avoids creating further confusion. This is still not quite the full extent of the madness, however, as there are also definitions of affects. (Are these 'D' or something else? I chose 'DA', though in the literature it is often simply left as 'def. aff'.)
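To make the "tack on another letter" convention concrete, here is a minimal parsing sketch. It assumes a label is a part number, an upper-case major type ('D', 'A', 'P', or 'DA'), a number, and then a chain of lower-case sub-element letters, each optionally numbered (as in 's2'). The regular expression, the `Ref` type, and the function names are illustrative assumptions, not the code behind the site.

```python
import re
from typing import NamedTuple, Tuple

# Major and sub-element types mentioned in this post; a fuller table
# would need more entries (lemmas, for instance).
MAJOR = {"D": "definition", "A": "axiom", "P": "proposition",
         "DA": "definition of the affects"}
MINOR = {"d": "demonstration", "c": "corollary", "s": "scholium",
         "e": "explication", "a": "alternate demonstration"}

LABEL = re.compile(r"^(\d)(DA|[A-Z])(\d+)((?:[a-z]\d*)*)$")
STEP = re.compile(r"[a-z]\d*")


class Ref(NamedTuple):
    part: int
    kind: str
    number: int
    chain: Tuple[str, ...]  # sub-elements in the order they occur


def parse_label(label: str) -> Ref:
    """Split a label such as '3P55scds' into its components."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"not a recognised label: {label!r}")
    part, kind, number, chain = match.groups()
    if kind not in MAJOR:
        raise ValueError(f"unknown major type {kind!r} in {label!r}")
    steps = tuple(STEP.findall(chain))
    for step in steps:
        if step[0] not in MINOR:
            raise ValueError(f"unknown sub-element {step!r} in {label!r}")
    return Ref(int(part), kind, int(number), steps)


def describe(ref: Ref) -> str:
    """Spell out the chain from the major element outwards."""
    names = [f"{MAJOR[ref.kind]} {ref.number} of part {ref.part}"]
    for step in ref.chain:
        names.append(MINOR[step[0]] + (f" {step[1:]}" if step[1:] else ""))
    return " > ".join(names)


monstrosity = parse_label("3P55scds")
```

Here `monstrosity.chain` comes out as `('s', 'c', 'd', 's')`, which reads from left to right as the scholium of 3P55, its corollary, that corollary's demonstration, and the scholium after that.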
There are also sometimes explications ('E'). These are relatively rare everywhere in the text, except the Definitions of the Affects, which is suffused with them. (Incidentally, see the Usage table, using the filter 'e' (without quotation-marks), to see for yourself where these all are.) This isn't so much a problem for data-organization, but it is a strange inconsistency. Perhaps 'explicatio' was reserved by Spinoza specifically for non-propositions, since it is only used for definitions. But then he perhaps might have helped us all out if he had offered an explanation for every definition in the text (especially that pesky first one)!
There is, in addition, the issue of trying to preserve a single-letter nomenclature (aside from 'DA', which was a compromise), when there are multiple types that have the same first-letter token. In Spinoza's case, that would be both 'A' (for axioms and alternate demonstrations) and, more problematically, 'D' for both definition and demonstration.
Both of these conflicts are solved by consistently maintaining an upper-case / lower-case divide between the major elements and their sub-elements. This is what I have tried to do in the database, but of course it is not standard practice among scholars to follow this, and so many secondary sources use lower-case letters for both '1d8' and '1p8d'. I can't think of any cases where this would cause issues of clarity outside of computing, but for databases it can be a major headache if a database stores these parts separately and without reference keys to their parent elements. What's more, the inconsistency of reference in traditional scholarship may cause issues of accuracy for any digital scouring that does not take into account all possible variants. This is, however, not the most difficult of issues to resolve.
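Since the major type always sits between the part number and the element number, and sub-elements always come after the element number, position alone is enough to restore the upper-case / lower-case divide. The sketch below relies on that assumption to normalize variants such as '1d8' and '1p8d', and to derive the parent key a sub-element would point to; `normalize` and `parent_key` are hypothetical helper names, and the pattern does not attempt to cover prose forms such as 'def. aff'.

```python
import re
from typing import Optional

# Case-insensitive on purpose: secondary sources mix the two freely.
VARIANT = re.compile(r"^(\d)(da|[a-z])(\d+)((?:[a-z]\d*)*)$", re.IGNORECASE)


def normalize(ref: str) -> str:
    """Upper-case the major type, lower-case every sub-element."""
    match = VARIANT.match(ref.strip())
    if match is None:
        raise ValueError(f"cannot normalize {ref!r}")
    part, kind, number, chain = match.groups()
    return f"{part}{kind.upper()}{number}{chain.lower()}"


def parent_key(ref: str) -> Optional[str]:
    """Return the label one level up, or None for a major element."""
    part, kind, number, chain = VARIANT.match(normalize(ref)).groups()
    if not chain:
        return None
    # Drop the last sub-element, including its number if it has one.
    trimmed = re.sub(r"[a-z]\d*$", "", chain)
    return f"{part}{kind}{number}{trimmed}"


variants = ["1d8", "1D8", "1p8d", "1P8D"]
canonical = sorted({normalize(v) for v in variants})
```

The four variants collapse to the two canonical labels `1D8` and `1P8d`, and `parent_key("1p8d")` gives `1P8`, which is the kind of reference key a sub-element row would need in order to find its parent.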
It is far more difficult, at least for me, to work out a balance between a number of competing factors when presenting data-analytic humanities scholarship visually: accuracy (in part affected by the issues raised above) is paramount, otherwise we could simply produce artistic renditions or themes; usability is also very important. An incredibly fine-grained and massive dataset does no good if the tools provided to make use of it are insufficiently usable.
Aesthetic concerns are less important, but there is a relationship between aesthetics and these other concerns. A janky, ugly, unappealing presentation might still be highly usable and useful, but there is often undoubtedly some aesthetic value that arises out of a well-organized piece of work. Indeed, that may be one of the reasons why Spinoza's Ethics retains so much of its mystique (not to mention Wittgenstein's Tractatus).
There is simply no straightforward way to produce the appropriately formatted data directly from the database, and the coding required to make the data work for highly complex, useful visualizations (such as interactive graphs with multiple changeable data-points) was for the past few years beyond me. I have produced more accurate visualizations, and they seem to have some improved aesthetic value, but they come at some cost (almost entirely in terms of utility and usability). Consider the radial hierarchical graph of the entire Ethics that includes all sub-element uses. It is undoubtedly beautiful. Yet it is nearly impossible to make out any details that could be of any use. There is just too much going on. So, some finer controls are needed. (Incidentally, that is why the current radial hierarchical graphs on the site use edge-bundling to help clarify connections and simplify the visual complexity to some degree.)
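As a rough illustration of the reshaping involved, the sketch below turns a flat list of "element X cites element Y" rows into the nodes-and-links JSON shape commonly used in D3 force-layout examples. The row format, the `refs` field, and the function names are assumptions made for the example, and the three sample rows are there for illustration only; they are not an export from the site's database.

```python
import json
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def element_type(label: str) -> str:
    """Major type letter(s) of a label such as '1P8d' or '3DA6'."""
    match = re.match(r"\d(DA|[A-Z])", label)
    return match.group(1) if match else "?"


def to_node_link(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[dict]]:
    """Reshape (citing, cited) pairs into a nodes/links structure."""
    rows = list(rows)
    weights = Counter(rows)                      # repeated citations
    cited = Counter(target for _, target in rows)
    labels = sorted({label for row in rows for label in row})
    nodes = [{"id": label,
              "group": element_type(label),      # can drive node colour
              "refs": cited[label]}              # can drive node size
             for label in labels]
    links = [{"source": source, "target": target, "value": count}
             for (source, target), count in sorted(weights.items())]
    return {"nodes": nodes, "links": links}


# Illustrative rows only.
sample_rows = [("1P1", "1D3"), ("1P1", "1D5"), ("1P2", "1D3")]
graph = to_node_link(sample_rows)
graph_json = json.dumps(graph, indent=2)
```

Keeping the type and the reference count on each node means the visualization layer can map them to colour and size without going back to the database.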
I simply have not had the time to work out how to best utilize layout algorithms or present data in ways that would require extensive reorganization. However, I am now able to devote a bit more effort to this, and the tools have also become easier to use.
So, to conclude for now, as a teaser of where I am headed, consider the contrast between the images above. The force-directed graph of Ethics I (see the image above) uses D3.js, and Mike Bostock's code. The colours of the nodes represent different element-types, and it is possible to have the node-size represent the number of references in the text. This is a good example of a nice, clean visualization, which nevertheless retains a high degree of information in a usable form, and which in its interactive form might provide some interesting insights into the text. The radial graph, on the other hand, is effectively useless.
There are, however, many other forms that may be even more useful, and I am hopeful that I'll be adding them to the website soon.