Phyloreferences: Tree-Native, Reproducible, and Machine-Interpretable Taxon Concepts

Evolutionary and organismal biology, similar to other fields in biology, have become inundated with data. At the same rate, we are experiencing a surge in broader evolutionary and ecological syntheses for which tree-thinking is the staple for a variety of post-tree analyses. To fully take advantage of this wealth of data to discover and understand large-scale evolutionary and ecological patterns, computational data integration, i.e., the use of machines to link data at large scale by shared entities, is crucial. The most common shared entity by which evolutionary and ecological data need to be linked is the taxon to which they belong. In this paper, we propose a set of requirements that a system for defining such taxa should meet for computational data science: taxon definitions should maintain conceptual consistency, be reproducible via a known algorithm, be computationally automatable, and be applicable across the tree of life. We argue that Linnaean names based in Linnaean taxonomy, by far the most prevalent means of linking data to taxa, fail to meet these requirements due to fundamental theoretical and practical shortfalls. We argue that for the purposes of data integration we should instead use phylogenetic clade definitions transformed into formal logic expressions. We call such expressions phyloreferences, and argue that, unlike Linnaean names, they meet all requirements for effective data integration.


Introduction
The last two decades have witnessed a vast increase in available digital biodiversity data. This richness in data has been fostered, in part, by a call to mass-digitize museum repositories (Beaman and Cellinese 2012; Page et al. 2015), and is fueled by the emergence of new applications and data sources, analytical methods, faster algorithms, and improved environmental sensors, among others (Philippe et al. 2005; Porter et al. 2009; Michener and Jones 2012; Chan and Ragan 2013; Hampton et al. 2017; Kozlov et al. 2019). Additionally, it has led to a corresponding, increasing need for digital access, sharing, and re-purposing of data, and, consequently, to a need to use machines to link data from different sources to shared entities. The natural framework for such synthesis of biodiversity data is the Tree of Life. Tree-thinking has seized a prominent role in systematics since the advent of phylogenetics (Zimmermann 1931, 1934, 1943; Hennig 1950, 1966). The rapidly increasing knowledge across the Tree of Life has now enabled a synthesis of phylogenetic hypotheses on a Tree of Life scale, to produce an encompassing, and digitally fully reusable, view of Life's evolution, the Open Tree of Life (Hinchliff et al. 2015; McTavish et al. 2017). As a comprehensive and repeatable phylogenetic synthesis, it provides unprecedented opportunities for studying evolutionary patterns across all clades, at large as well as small scales. These clades are the perfect loci at which to integrate the suite of different data types resulting from evolutionary and biodiversity research (e.g., Allen et al. 2018; Eliason et al. 2019; Folk et al. 2019; Howard et al. 2019).
Thus, a system of defining clades is needed to link the vast amount of available biodiversity data in a way that it can be recovered, aggregated, and integrated. However, there is wide disagreement about which system should be used for this purpose. Currently, most biological data and knowledge are directly or indirectly linked to biological taxa via Linnaean taxon names. As we will discuss below, it is well known that in its current shape the Linnaean system leads to numerous problems when applied to data-intensive science that depends on computation. Therefore, an alternative is needed. Broadly speaking, there are two main candidates for such an alternative: to modify the current Linnaean system such that it can fulfill certain requirements (see list below), or, more radically, to abandon the Linnaean system in this context and implement a purely phylogenetic system for clade definitions. The former involves repurposing Linnaean names to refer to clades, and using these names as labels for taxon concepts.¹ In that sense, this option is a hybrid between the Linnaean and a phylogenetic system. The latter, instead, consists in generating purely phylogenetic definitions of clades.
To arbitrate between these alternatives, we propose the following four requirements that any system suitable for data integration should meet: (i) The mapping maintains conceptual consistency, meaning that when mapped to different phylogenies, the semantics of the retrieved clades are consistent.² (ii) The mapping of a given clade concept to a given phylogenetic hypothesis is exactly reproducible via a known algorithm. (iii) The algorithm to (re)produce the mapping is computationally automatable, which is necessary for processing the very large phylogenies and datasets characteristic of modern biology. This means consulting expert opinion cannot be part of the algorithm. (iv) The system is applicable to all lineages in the Tree of Life, including in particular those where Linnaean names are not available (e.g., Archaea, fungi, etc.).
In this paper, we show that it is in principle impossible for the Linnaean system to meet these requirements, and present a purely phylogenetic alternative that does meet them. In section 2 we elaborate on the problems of the Linnaean system, and show that it is beyond repair. In section 3 we introduce the purely phylogenetic approach, and show how it can address the shortcomings of the Linnaean system. In section 4 we then introduce one way in which such a phylogenetic alternative could be implemented, namely, phyloreferences, and in section 5 we argue that this implementation is preferable to other existing implementations. Finally, in section 6 we address various objections to our proposal, and section 7 concludes the paper.
First of all, it is important to emphasize that the issue at stake in this paper is not that of nomenclature. The question of how to define taxon concepts for data integration is independent of the question of whether these taxon concepts are also named, and even whether these names are Linnaean names. While the approach we propose in this paper fits more naturally with a form of phylogenetic nomenclature, it is also compatible with retaining Linnaean names. Related to this, the issue at stake is not that of whether we should recognize certain taxa as species (Mishler and Wilkins 2018). While a phylogenetic approach like the one proposed here denies that there is an ontological difference between taxa at different levels, it is compatible with recognizing some of these taxa as species. Thus, what is at stake is the best way of defining taxa for data integration, and not the names of these taxa or whether they can be listed as species.

The Poverty of Linnaean Names
Many authors before us have pointed to problems caused by Linnaean nomenclature and classification. This section instead discusses two problems of the Linnaean system that make it unsuitable for data integration, and argues that it is not possible to eliminate these problems simply by making small changes to the system.

The Linnaean shortfall limits data discovery
A first problem of the Linnaean system is often referred to as the 'Linnaean shortfall.' This is the significant gap in our current knowledge of described vs. unknown biological diversity (Brown and Lomolino 1998; Hortal et al. 2015), and it highlights our limited ability to first discover and then describe taxa according to the rules of nomenclatural codes. In view of the sixth mass extinction we are currently experiencing (Brook et al. 2008), this represents a true plague in biodiversity science because it implies that we are also losing unknown diversity, and the diversity we do discover is not described (in a Linnaean framework) fast enough. From a computational perspective, the latter point represents a true obstacle to addressing the computable taxon concept challenge because taxa need to be described before they can serve as loci to link data.
Two causes of the Linnaean shortfall are particularly relevant in this context. First, the process of describing diversity is very time consuming and relies on detailed comparative studies of specimens in museum repositories and field observations. Second, there are far more levels of clades in the Tree of Life than there are ranks to name them. As a result, we continue to discover lineages that persist between revisions of the Tree of Life, yet do not have, and may never receive, the kind of names required to facilitate discovery and reuse in a name-based system, let alone formal Linnaean names. Adopted placeholders such as 'phylotype X' or 'clade A' may serve their purpose within a publication, but they are not discoverable and reusable terms beyond it (see also the appendix in de Queiroz and Donoghue 2013). This predicament applies across the Tree of Life, but is particularly prevalent in Archaea and other prokaryotes, and very common even in many eukaryotes. Consequently, such lineages have often been referred to as 'dark taxa' (Parr et al. 2012).
The result is that a large amount of data about taxa cannot yet, and may never, be linked to Linnaean names. In this way, the Linnaean system fails to meet requirement (iv), i.e., to provide the tools to define, communicate, and query these unnamed taxa.

Linnaean names make data discovery difficult to reproduce
One might argue that the rate of species descriptions and formal names could, in principle, increase dramatically and thus alleviate the problem described in the previous subsection. This subsection argues that even if that were the case, Linnaean names would not be suitable for integrating data from different sources. This is because the Linnaean system falls short of the three other requirements as well: (i) it fails to maintain conceptual consistency, (ii) the mapping of a Linnaean name to a phylogeny is not reproducible by a known algorithm, and (iii) the algorithm to do this mapping is not automatable.
To see why the Linnaean system falls short of these requirements, it is helpful to briefly consider its design and history. Prior to Linnaeus, biological knowledge was organized in large, poorly defined categories, and nomenclature was completely unstructured. Linnaeus was a revolutionary for his time, not so much for the system he created (other botanists before him experimented with the ranking system), but for what he enabled. He brought order by formalizing criteria to define logical relationships among abstract classes (categorical ranks) and restructuring the nomenclatural system by assigning a binomen to every organism at the species level and a single name to every higher rank. Outside of the yet-to-be-established unifying context of evolution, taxa were assumed to be static entities, with character similarity providing the best approach to defining groups of organisms. In this context, Linnaean nomenclature served the need of linking names to taxon groups.
Darwinian theory then revolutionized the perspective on biological relationships and taxon group membership, with the notion that it is natural processes that give rise to taxa, while characters can only diagnose, but not define, categories (Darwin 1859). Zimmermann (1931, 1934, 1943) and Hennig (1950, 1966) formalized these theories and provided the criteria to construct phylogenetic trees. In this theoretical framework, in which taxa are no longer seen as static entities, it quickly became clear that the phylogeny-governed hierarchy of Hennig's framework is better suited for defining taxa than the logical relatedness of groups in Linnaeus' hierarchical framework (see also Ereshefsky 2001). Consequently, as a common practice, Linnaean nomenclature has been repurposed to link names to clades. In this hybrid system, Linnaean names are used to label taxon concepts, which are clades rather than fixed entities defined by a set of characters.
However, the Linnaean elements that this hybrid system retains make it impossible for it to be used for effective data integration. There are three reasons for this.
First, repurposed Linnaean names define taxon concepts by means of a type specimen and description (Brzozowski 2020). However, whenever the type is missing from the phylogeny (which is typically the case), there are no agreed rules for mapping type specimens to clades. Instead, this mapping relies on expert judgement. As different experts tend to do this in different ways (see our example of Campanula below), this means that the Linnaean system does not meet requirement (ii) of reproducibility by a single algorithm. In addition, the necessity of expert judgement means that the mapping of names to clades cannot be automated. This means that the Linnaean system also fails to meet requirement (iii).
Second, the lack of reproducibility in the Linnaean system leads, over time, to confusion over the taxon concept to which a name is linked. Through time, different experts often apply the same name in different ways due to different interpretations of the original taxon protologue,³ and consequently, the meaning of this name becomes difficult to track. This problem is further exacerbated by purely nomenclatural issues that notoriously plague taxonomy, such as synonymy, homonymy, and misapplication. And even though these can often be reconciled (albeit not always easily) by taxonomic name resolution services (Boyle et al. 2013; Chamberlain and Szöcs 2013), this provides little relief to the long-standing informatics challenge of reconciling names with taxon concepts. This problem is particularly heightened in names with a long history and legacy of taxonomic literature. Because repurposed Linnaean names still point to traditionally circumscribed groups that are not generated in an evolutionary framework, they inherit these problems. In that sense, repurposed Linnaean names approximate clades, but never exactly match them. This is because traditional groups and the clades we discover are fundamentally two different entities, created by very different criteria (Cellinese et al. 2012). Furthermore, even if the extension of a Linnaean name were to coincide with that of a particular clade, over time this would quickly fall prey to the same problems of interpretation and of taxonomic as well as phylogenetic revision. Due to the above points, the Linnaean system fails requirement (i), i.e., it cannot maintain conceptual consistency.
Third, the hybrid system still links data to a Linnaean name. These names are text strings without computational meaning. Thus, even if we repurpose a Linnaean name to refer to a clade, this name can never express the semantics of that clade. Instead of defining the taxon in a way that would allow machines to identify the taxon, these names link to type specimens and descriptions that, as described above, have been used and interpreted in different ways by different researchers. Thus, as long as Linnaean names are used to point to taxon concepts, it will be impossible for machines to reliably integrate data. This means, again, that the hybrid Linnaean system inevitably fails to meet the requirement of making taxon definitions computationally automatable (iii).
The failure of the Linnaean system to meet these three requirements is easiest to explain by drawing an analogy with geolocation-linked data: like taxa, such location data is incredibly useful for integrating data. Imagine that for geolocation-linked data only place names, not standard latitude/longitude geo-coordinates, were available for computation. Data could not be aggregated by region, users could not draw a bounding box on a map to query a database, species occurrence data could not be queried for "all species within 50 miles of my location", and users querying by place would have to know country, state, and possibly city to make the query less ambiguous. Yet, this is the current situation in computing with taxon-linked data.
Consider, as an example to illustrate the problems of the Linnaean system, the genus Campanula formalized by Linnaeus in 1753, for which Campanula latifolia L. was later selected as a lectotype (Britton and Brown 1913). When discussing Campanula L., Lammers (2007) states that "there is no modern classification which accounts for this large genus in its entirety" and therefore, the exact number of species is unknown, but the current count is at more than 400. The original description applied to Campanula has been so stretched through time that, unsurprisingly, Campanula as a Linnaean taxon concept is highly polyphyletic, scattered across the entire Campanuloideae tree with other polyphyletic genera (Crowl et al. 2016; Fig. 1). The clade including the type specimen (Campanula latifolia) would have to retain the original name, which would imply a cascade of name changes across the tree, not an uncommon repercussion in taxonomic revisions. Even ignoring the nuisance of name changes, all phylogenetic studies to date have analyzed a significantly incomplete taxon sample, which has stalled any formal update in the taxonomy and classification because it would be premature. The most challenging bottleneck is the inability to retrieve taxonomic concepts unambiguously. Aside from its type specimen, what constitutes the traditional taxon Campanula, in view of how the name has been applied across time, is not even easy to verbalize, given an author's subjective taxon description and the lack of informative synapomorphies. Figure 1 illustrates some of the practical consequences of this complex issue, by requesting occurrence data from GBIF (gbif.org) using a query for Campanula as a genus. Integrating data obtained in this way with the known phylogeny will necessarily be very challenging at best, given that Campanula as a clade does not exist. Examples like Campanula are very common across all domains at any taxonomic level, and the harmonization between traditional ideas about life and the phylogenetic approaches we employ to discover natural entities has become a true impediment to progress in querying, communicating, and 'decorating' all of the parts of the Tree of Life in a consistent and reproducible way. In the next section, we discuss an alternative way of defining taxon concepts for data integration that does not suffer from the problems of the Linnaean system.
et al. 
1988). A phylogenetic definition represents a formal statement that describes a clade in a phylogeny. This body of work laid the foundation for phylogenetic taxonomy, later renamed phylogenetic nomenclature, which takes a strictly tree-thinking approach to biological nomenclature (de Queiroz and Gauthier 1990, 1992, 1994). Soon thereafter, the PhyloCode (www.phylocode.org) was drafted as an application of phylogenetic nomenclature's principles.
Many systematics papers (e.g., de Queiroz 1992, 1994, 1997; Rowe and Gauthier 1992; Judd et al. 1993, 1994; Bryant 1996, 1997; Sundberg and Pleijel 1994; Christoffersen 1995; Schander and Thollesson 1995; Lee 1996, 1998, 2001; Wyss and Meng 1996; Brochu 1997; Cantino et al. 1997, 2007; Kron 1997; Baum et al. 1998; Eriksson et al. 1998; Härlin and Sundberg 1998; Hibbett and Donoghue 1998; Alverson et al. 1999; Pleijel 1999; Sereno 1999; Bremer 2000; Brochu and Sumrall 2001) clearly articulated the need to communicate parts of the Tree of Life and demonstrated that Life could be described by using three basic clade types and their associated phylogenetic definitions. These are (1) minimum clade definitions, denoting the smallest clade that includes the most recent common ancestor, and all its descendants, of two or more internal specifiers; (2) maximum clade definitions, denoting the largest clade that includes the first ancestor, and all its descendants, of one or more internal specifiers but excludes one or more external specifiers; and (3) apomorphy-based definitions, denoting the clade that arises from the first ancestor, and includes all its descendants, that possesses a specified character that is synapomorphic with an internal specifier (Fig. 2). Specifiers are reference points in the phylogeny that serve as anchors for the clade definition, and these can be species, specimens, or apomorphies, which would include molecular sequences. Ideally, when using species as specifiers, these would already have a phylogenetic definition available or the Linnaean type present in the phylogeny; likewise, when using apomorphies, ideally every trait used as specifier should be semantically defined.
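To make the first two clade types concrete, the following is a minimal Python sketch of how minimum and maximum clade definitions could be resolved on a rooted tree. It is an illustration under stated assumptions, not the implementation used by any existing tool: the toy topology, the node labels, and the function names (ancestors, tips_under, minimum_clade, maximum_clade) are all hypothetical. Apomorphy-based definitions are left out because they additionally require character-state data on the tree.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy topology stored as child -> parent. Node and tip labels are
# invented for illustration and do not reflect any published phylogeny.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "n1": "root", "D": "root",
    "n2": "n1", "C": "n1",
    "A": "n2", "B": "n2",
}


def ancestors(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from `node` up to the root, starting with `node` itself."""
    path: List[str] = []
    cur: Optional[str] = node
    while cur is not None:
        path.append(cur)
        cur = parent[cur]
    return path


def tips_under(node: str, parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return all tips (nodes without children) at or below `node`."""
    internal = {p for p in parent.values() if p is not None}
    return {t for t in parent
            if t not in internal and node in ancestors(t, parent)}


def minimum_clade(internal_specifiers: List[str],
                  parent: Dict[str, Optional[str]]) -> str:
    """Smallest clade containing all internal specifiers (rooted at their MRCA)."""
    paths = [ancestors(s, parent) for s in internal_specifiers]
    common = set(paths[0]).intersection(*paths[1:])
    # The first shared node on the way up from the first specifier is the MRCA.
    return next(n for n in paths[0] if n in common)


def maximum_clade(internal_specifiers: List[str],
                  external_specifiers: List[str],
                  parent: Dict[str, Optional[str]]) -> Optional[str]:
    """Largest clade containing all internal but no external specifiers.

    Returns None if no such clade exists on this tree.
    """
    best: Optional[str] = None
    start = minimum_clade(internal_specifiers, parent)
    for node in ancestors(start, parent):
        if tips_under(node, parent) & set(external_specifiers):
            break  # growing the clade any further would take in an external specifier
        best = node
    return best
```

On this toy tree, minimum_clade(["A", "C"], PARENT) resolves to node n1, while maximum_clade(["A"], ["C"], PARENT) resolves to node n2. Both results follow mechanically from the topology and the specifiers, with no expert judgement involved.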
While there has been extensive debate in the literature (Benton 2000; Blackwell 2002; Schuh 2003; Polaszek and Wilson 2005; Rieppel 2006; Stevens 2006; de Queiroz and Donoghue 2011; among many others) about possible advantages and disadvantages of the PhyloCode as a nomenclatural system, the PhyloCode is simply one application of phylogenetic nomenclature, in the realm of nomenclatural codes. Our concern here is not arguing the merits of, or issues with, the PhyloCode, or, for that matter, any nomenclatural code. Instead, we posit that phylogenetic definitions have unquestionable benefits as a means to unambiguously label all clades in the Tree of Life, and use these for data integration.
Compared to traditional taxon descriptions, phylogenetic definitions have clear advantages for computing with taxon concepts in a phylogenetic context. They draw unambiguous reference to any part of the Tree of Life and can be expressed in a formal and standardized format. Although when published they refer to a taxon concept (clade) originating from a specific phylogenetic topology, a formal clade concept established by an author is an unambiguous statement and approach to communicate taxa, and thus data for those taxa, regardless of future changes in phylogenetic knowledge. That is, as long as the specifiers used in a clade definition have been matched to a given phylogenetic tree, there is no arguing about the clade identified by the definition.⁴ Obviously, this cannot prevent or resolve disagreements about the actual taxon concept, but it does enable clearly articulating which element(s) of a phylogenetic definition is (are) the point(s) of contention. In other words, disagreement over a concept does not imply ambiguity over what the concept represents. Additionally, a change in phylogenetic knowledge after the original publication of a phylogenetically defined clade concept may result in taxa now included in the clade that the original author did not intend to be included, or for which the community is divided about the merits of their inclusion. Definitions constructed in some ways will prove more robust, in the judgement of the community, than those built in other ways. However, whether judged "robust" and agreed upon or not, phylogenetic definitions will always unambiguously point to the same clade on any tree containing all its specifiers. For example, our definition of Campanulaceae is "the clade originating with the most recent common ancestor of Campanula latifolia Linnaeus and all extant organisms or species that share a more recent common ancestor with Campanula latifolia than with Roussea simplex (Rousseaceae) J. E. Smith, Pentaphragma ellipticum (Pentaphragmataceae) Poulsen, or Stylidium graminifolium (Stylidiaceae) Swartz ex Willdenow" (Fig. 3; Cellinese 2020). Others may disagree with this definition; however, there is no ambiguity about the concept being referred to, and the clade it would identify on a given phylogeny.
⁴ We come back to the problem of matching specifiers in section 6.1.
Phylogenetic definitions are not only beneficial at higher (above species), but also at shallow (species or below-species) taxonomic levels. For example, reconciling Linnaean names with polyphyletic taxa, which are very common across all domains of life, is clearly non-trivial. Often, clades can be diagnosed by interesting morphological or genetic synapomorphies. Traditional taxon names offer little help in referring to such clades, especially if, as is very common, type specimens are missing from the analyses. For example, Crowl et al. (2015) found that Campanula erinus, a widespread taxon in the Mediterranean basin, nested in a clade of narrow Aegean archipelago endemics, is polyphyletic and polyploid. In a more in-depth study, Crowl et al. (2017) discovered cryptic diversity within this species due to hybridization with C. creutzburgii, which revealed a hybrid lineage that is morphologically identical to C. erinus, but differs by having a different ploidy (8× vs. the parental 4×). An apomorphy-based clade definition using the trait octoploidy now allows the semantically unambiguous taxonomic recognition of this otherwise cryptic group (Crowl and Cellinese 2017).
Likewise, in other domains, in particular fungi and bacteria, taxa are often so poorly known that only unnamed "phylotypes" can be identified (e.g., Massana et al. 2000; Kim et al. 2012; Lin et al. 2014; Hibbett 2016). Phylogenetic definitions can address these cases, because specifiers can use any uniquely identifiable object suitable for matching the taxonomic unit represented by nodes in a tree. To illustrate this point, in the above Campanulaceae example, the taxonomic unit identified by having scientific name Campanula latifolia could also be identified by molecular sequence(s) (e.g., "GenBank: EF141027"), or, as in Crowl and Cellinese (2017), using a specific herbarium specimen with a globally unique identifier.
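The idea that a specifier may be matched through any of several kinds of identifier can be sketched as a small data structure. The Python below is an illustrative assumption rather than a published data model: the class TaxonomicUnit, its field names, the function matches, and the specimen identifier string are hypothetical. Only the scientific name and the GenBank accession are taken from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonomicUnit:
    """A specifier or tree tip described by zero or more kinds of identifier."""
    scientific_name: Optional[str] = None
    sequence_accession: Optional[str] = None
    specimen_id: Optional[str] = None


def matches(specifier: TaxonomicUnit, tip: TaxonomicUnit) -> bool:
    """Return True if specifier and tip share at least one non-empty identifier."""
    pairs = [
        (specifier.scientific_name, tip.scientific_name),
        (specifier.sequence_accession, tip.sequence_accession),
        (specifier.specimen_id, tip.specimen_id),
    ]
    return any(a is not None and a == b for a, b in pairs)


# A tip as it might be annotated in a tree file. The specimen identifier is a
# made-up placeholder, not a real catalogue number.
tip = TaxonomicUnit(
    scientific_name="Campanula latifolia",
    sequence_accession="GenBank: EF141027",
    specimen_id="urn:example:specimen:0001",
)

by_name = TaxonomicUnit(scientific_name="Campanula latifolia")
by_sequence = TaxonomicUnit(sequence_accession="GenBank: EF141027")
unrelated = TaxonomicUnit(scientific_name="Roussea simplex")
```

Here matches(by_name, tip) and matches(by_sequence, tip) both hold, whereas matches(unrelated, tip) does not. A unit that carries no name at all, such as an unnamed phylotype known only from a sequence, can serve as a specifier in exactly the same way.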
This potential extends below the species level, for example, to label and query monophyletic entities corresponding to subsets of populations or polyploid derivatives that show interesting evolutionary and/or biogeographic patterns, but are currently unnamed. These entities are not considered 'species' and a clear mechanism to name them is lacking from all of the formal nomenclature codes. For data publishing, aggregation, and retrieval systems built around names instead of meaning, data for such entities cannot be recovered, certainly not computationally.
These advantages of phylogenetic definitions are widely acknowledged, and phylogenetic definitions have been applied across multiple biological domains in numerous recent phylogenetic studies, resulting in the publication of many clade names, some of which were subsequently repurposed in other analyses (Borchiellini et al. 2004; Joyce et al. 2004; Cantino et al. 2007; Conrad et al. 2011; Soltis et al. 2011; Adl et al. 2012; Cárdenas et al. 2012; Hill et al. 2013; Mannion et al. 2013; Schoch 2013; Sterli et al. 2013; Torres-Carvajal and Mafla-Endara 2013; Wojciechowski 2013; Clemens et al. 2014; Hundt et al. 2014; Rabi et al. 2014; Sferco et al. 2015; Madzia and Cau 2015; Spatafora et al. 2016; Crowl and Cellinese 2017; Wright et al. 2017; Hibbett et al. 2018; de Queiroz et al. 2020; among numerous others). Arguably, this constitutes ample evidence that generating and using taxon concepts defined by patterns of ancestry constitutes an increasing need by the community, and that there is a growing consensus on how to define and use names for such concepts.

What Is a Phyloreference?
In the form commonly published by authors, phylogenetic definitions, whether following strict rules of a nomenclatural code (such as the PhyloCode) or not, are natural language text expressions. In this form, the ability to compute with the semantics expressed in the text, as requirement (iii) demands, is severely limited. However, unlike definitions in the Linnaean system, it is possible to transform phylogenetic definitions in natural language text into computable representations and thereby make their semantics accessible to machines. We develop a system for such transformations here, and refer to these computable representations as phyloreferences. Specifically, a phyloreference is a representation of a phylogenetic definition as a formal logic expression that makes its semantics explicit and machine-accessible through the use of terms drawn from ontologies. In this way, phyloreferences are an informatics tool for communicating taxon concepts to machines, as opposed to, for example, a stand-in for Linnaean (or other) nomenclature. As an informatics tool, phyloreferences harness the theoretical, as well as applied, results from a wealth of earlier work in phylogenetic nomenclature to enable machines to integrate and navigate organism-linked data by concepts not afforded by Linnaean taxonomies.
Our proposed approach is based on the Web Ontology Language (OWL 2) (W3C OWL Working Group 2012) Description Logic (DL) framework. OWL has been widely adopted across the life sciences for representing domain knowledge in machine-processable form as ontologies (Mungall et al. 2010, 2011, 2012; Vogt 2009; Jensen and Bork 2010; Deans et al. 2011, 2015; Dahdul et al. 2014; Haendel et al. 2014; Thessen et al. 2015; Senderov et al. 2018). In the context of information science, on which our approach is based, an ontology is a representational model of a knowledge domain, specifically the concepts (represented as classes) comprising the domain, and the relationships that hold between them (represented as relationships between class members). Ontologies have revolutionized our ability to compute with the semantics of natural language expressions. For example, if terms in free-text phenotype descriptions are linked to formal concepts in community ontologies for the relevant knowledge domains, machine reasoners and statistical algorithms can be used to compute quantitative metrics for the semantic similarity of different phenotype descriptions (Pesquita et al. 2009; Washington et al. 2009; Vision et al. 2011; Bauer et al. 2012; Mabee et al. 2012; Manda et al. 2015; Mabee et al. 2018). Enabling machines to understand the semantics of clade definitions for the purposes of computational data integration is a much less complex task. Nevertheless, clades used by researchers to aggregate or communicate data arguably form part of our body of knowledge about the evolution of the Tree of Life, and it would thus seem prudent to render it as computable as other life science knowledge domains.
To afford such capabilities to clade definitions, we propose a model of phyloreferences as defined OWL classes.⁵ In this model, the semantics of a phyloreference, and thus the clade concept it represents, are declared by a so-called OWL class expression, which essentially gives the necessary and sufficient conditions for class membership. For a class defined in this way, software tools called reasoners can (among other things) infer that any individual fulfilling all conditions must necessarily be an instance of the class. We then model the topology of a given phylogeny by declaring its nodes as individuals, and asserting relationships between those that reflect the topological relationships between nodes. This allows a reasoner to infer which nodes in the phylogeny, if any, match a given phyloreference. This class expression-based model also enables other inferences through computational reasoning. For example, aside from inferring class membership of individuals, OWL reasoners can use these to infer which phyloreferences are equivalent, and which are subclasses of another. Where found, such relationships would be implied solely by the semantics of the clade as represented in the OWL class definition, and as such would hold universally. This is in contrast to approaches that attempt to map Linnaean names to clades in a tree by comparing the clade on the tree and the Linnaean taxon concept based on the relationship (inclusion, overlap, etc.) between their respective sets of members (see Section 5, "Other Efforts" below).
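The inference step described above can be imitated in a few lines of Python. This is only a sketch of the idea, not OWL and not a reasoner: the topology is a hypothetical toy tree, and the reading of the exclusion condition used here (a node excludes a unit if one of its sibling nodes includes that unit) is an assumption made for illustration rather than a definition taken from any ontology. Nodes are treated as individuals, the child relationships are the asserted facts, and a class is a predicate stating necessary and sufficient conditions for membership.

```python
from typing import Callable, Dict, List, Set

# Asserted facts: every node is an individual and has_child relates nodes.
# The topology and labels are hypothetical.
HAS_CHILD: Dict[str, List[str]] = {
    "root": ["n1", "D"],
    "n1": ["n2", "C"],
    "n2": ["A", "B"],
    "A": [], "B": [], "C": [], "D": [],
}


def includes(node: str) -> Set[str]:
    """Taxonomic units (tips) at or below `node`, derived from has_child."""
    children = HAS_CHILD[node]
    if not children:
        return {node}
    units: Set[str] = set()
    for child in children:
        units |= includes(child)
    return units


def siblings(node: str) -> List[str]:
    """All other children of the parent of `node` (empty for the root)."""
    return [s for kids in HAS_CHILD.values() if node in kids
            for s in kids if s != node]


def includes_a_excludes_c(node: str) -> bool:
    """Class expression: includes unit A and excludes unit C.

    Assumption: 'excludes C' is read as 'some sibling of this node includes C'.
    """
    return "A" in includes(node) and any("C" in includes(s) for s in siblings(node))


def infer_members(condition: Callable[[str], bool]) -> List[str]:
    """Return every node (individual) that satisfies the class conditions."""
    return sorted(n for n in HAS_CHILD if condition(n))
```

Under these assumptions, infer_members(includes_a_excludes_c) returns the single node n2. An actual OWL reasoner goes further than this filter: it can also derive equivalence and subclass relationships between class expressions that hold for every possible tree, which enumeration over one tree cannot establish.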
As argued in the large body of work on phylogenetic nomenclature on which we have based our approach, our proposed models for phyloreference expressions represent patterns of shared and divergent descent, as included and excluded lineages. To illustrate this, a phyloreference for the clade Campanuloideae might be expressed in OWL like this (OWL Manchester Syntax (Horridge and Patel-Schneider 2012); properties are in italics, and, for readability, ontologies of constituent terms are omitted, and term labels are used in place of identifiers):
    <Campanuloideae> EquivalentTo includes_TU some <Campanula_latifolia> and excludes_TU some <Lobelia_cardinalis>
This expression[6] models a maximum clade definition and asserts that the class Campanuloideae is logically equivalent to the set of nodes that include the taxon concept (TU, for Taxonomic Unit) 'Campanula_latifolia' and exclude the taxon concept 'Lobelia_cardinalis', two necessary and sufficient conditions (called property restrictions in OWL). The properties "includes_TU" and "excludes_TU" are drawn from an ontology, specifically, the Phyloreferencing Ontology, an application ontology that we are developing on top of the Comparative Data Analysis Ontology (CDAO) (Prosdocimi et al. 2009) for defining the semantics of clade definition components. For example, includes_TU as a property is defined such that in the above definition "includes_TU some <Campanula_latifolia>" is true for all nodes that represent an instance of the taxon concept Campanula latifolia, or from which such a node descends. In contrast, in the above definition "excludes_TU some <Lobelia_cardinalis>" is true for nodes that have a sibling node representing an instance of the taxon concept Lobelia cardinalis, or from which such a node descends. The semantics of a definition with these properties are transparent, unambiguous, and readable by machines. As an ontology class, the definition does not pinpoint one particular node in one particular taxonomy or phylogeny, but the set of all nodes that satisfy the definition. Because the definition is a formal logic expression, class membership can be inferred computationally by a reasoner.
[5] By "class" we mean a concept in an ontology, and thus an abstract object (in contrast to individuals or instances, which are concrete objects). Unless stated otherwise, in our use classes have intensional rather than extensional definitions, meaning their descriptions state constraints that must be true for an individual object to be a member of the class. The constraints can be stated in natural language, or as a set of logic conditions. In the latter case, a reasoner can infer class membership. Similarly, we use the term individual in the sense of an individual member of a group. The usage of this term should not be confused with the question of whether taxa are, in a metaphysical sense, classes or individuals. We hold that, depending on the epistemic context, taxa can be construed as both individuals and kinds (see also Brigandt 2009). Hence, the approach we take here is compatible with the view that taxa are, in a metaphysical sense, individuals.
[6] The token "some" in the phyloreference example is from OWL Manchester Syntax and signifies existential quantification. Existential quantification (as opposed to universal quantification) properly represents the semantics of the clade definition: for a taxon concept to be included, some instance of it needs to be included, not every possibly existing one (observed or not). Likewise for exclusion. TU here is the class of entities that are instances of a given taxon concept. <Campanula_latifolia> refers to the TU class; "some <Campanula_latifolia>" is some instance of that class.
Defining phyloreferences as ontology classes makes it possible to promote their adoption, reuse, unambiguous reference, and even community vetting using the same mechanisms as for other widely used community ontologies in the life sciences. Specifically, they can be given a label, allowing reference to them by name; assigned globally unique identifiers, making them unambiguously referenceable; and assembled into an ontology maintained in an infrastructure, such as a GitHub repository, that facilitates version control, releases, and community collaboration.
Ultimately, a phyloreference in our approach bears the following important properties. Foremost, it meets our four requirements. Its semantics are unambiguous and machine interpretable because they are expressed in formal logic with uniquely identified ontology terms. This enables reproducing their mapping to a given phylogeny with a fully computational algorithm (requirements (ii) and (iii)), and enables maintaining semantic consistency when mapped to different (such as updated) phylogenies (requirement (i)). When a phyloreference is applied to a particular phylogeny that lacks a clade with consistent semantics, there will not be a node that "matches" (i.e., can be inferred as an instance). As a logically defined ontology class, a phyloreference can but need not be named. If it is named, the name is only a label to aid human communication, and this label does not carry semantics a machine is expected to recognize. Phyloreferencing can thus be applied to any branch of the Tree of Life, whether useful names exist or not (requirement (iv)). A phyloreference class can be given a globally unique identifier by which to unambiguously reference it for machines, independent of whether it has a label.
Furthermore, in this way phyloreferences are quite similar to terms in other community ontologies, and our system therefore interoperates naturally with the communities of practice and tool ecosystems that have developed around collections of ontologies in different domains, in particular in the life sciences (Smith et al. 2007).
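As a rough illustration of how the includes_TU and excludes_TU conditions described above pick out nodes, the following Python sketch resolves a maximum clade definition on a toy tree by direct traversal. It is not an OWL reasoner and does not reproduce the Phyloreferencing Ontology; the toy tree, the internal node identifiers (n1, n2, n3), the placeholder "Unsampled_taxon", and all function names are assumptions made for this example only.

```python
from typing import Dict, List, Optional

# Hypothetical toy phylogeny, stored as child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "n1": "root",
    "Roussea_simplex": "root",
    "n2": "n1",
    "n3": "n1",
    "Campanula_latifolia": "n2",
    "Wahlenbergia_hederacea": "n2",
    "Lobelia_cardinalis": "n3",
    "Lobelia_inflata": "n3",
}


def includes_tu(node: str, tu: str) -> bool:
    """True if `node` represents `tu`, or if the node for `tu` descends from it."""
    current: Optional[str] = tu if tu in PARENT else None
    while current is not None:
        if current == node:
            return True
        current = PARENT[current]
    return False


def excludes_tu(node: str, tu: str) -> bool:
    """True if some sibling of `node` represents `tu` or is an ancestor of it."""
    parent = PARENT[node]
    if parent is None:  # the root has no siblings
        return False
    siblings = [n for n, p in PARENT.items() if p == parent and n != node]
    return any(includes_tu(s, tu) for s in siblings)


def resolve(include: str, exclude: str) -> List[str]:
    """Return every node that satisfies both conditions of the class expression."""
    return [n for n in PARENT if includes_tu(n, include) and excludes_tu(n, exclude)]


print(resolve("Campanula_latifolia", "Lobelia_cardinalis"))  # ['n2']
print(resolve("Campanula_latifolia", "Unsampled_taxon"))     # []
```

The second call returns an empty list because the excluded specifier is absent from the toy tree, mirroring the case in which no node can be inferred to be an instance of the phyloreference.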

Other Efforts to Improve the Computability of Taxon Concepts
Even though there has been much controversy over the application of phylogenetic nomenclature (Benton 2000; Blackwell 2002; Schuh 2003; Polaszek and Wilson 2005; Rieppel 2006; Stevens 2006; de Queiroz and Donoghue 2013; among many others), its potential to define taxon concept semantics in a logical manner with unambiguously expressible meaning has been recognized before. Hibbett et al. (2005), Keesey (2007), and in part Sereno (2005) and Sereno et al. (2005), already envisioned mechanisms and applications that would leverage computable clade definitions to unambiguously retrieve taxa based on shared descent-based specifications. Keesey (2007) includes a notation and formalism for defining clade names based on mathematical set theory and operators, using the Mathematical Markup Language (MathML), an XML derivative, and extensions to it. Keesey's approach, unlike ours, also supports group concepts that are not monophyletic. However, because MathML is a structured syntax language, not a formal logic, Keesey's approach requires defining custom, bespoke semantics for his notations. It also does not lend itself to publishing clade definitions in the form of ontologies that are readily interoperable with the wealth of other community ontologies increasingly widely used in biology, and the software support even for only reading and interpreting MathML is limited. In practice, Keesey's proposal has not been adopted.
Thau and Ludäscher (2007) and Thau et al. (2008) proposed to use Region Connection Calculus (RCC, specifically RCC-5; Randell et al. 1992) as a formal logic for computationally reconciling different Linnaean taxonomies (or taxonomic checklists derived from such taxonomies) with each other. RCC-5 defines five basic relationships between two entities: equality, proper inclusion, inverse proper inclusion, overlap, and disjointness. In their approach, human experts assert which relationship(s), called articulations, hold between the concepts from different input taxonomies, such as concepts with identical names, or names that exist in only some of the input taxonomies. Experts also assign or relax a number of so-called global (or latent) taxonomic constraints, such as disjointness of sibling taxa, and parent taxon coverage (every member of a parent taxon is a member of some child taxon). Thau et al. (2008) show that certain machine reasoners can prove the consistency (or inconsistency) of different taxonomies under the asserted articulations and constraints, and can infer minimally informative relationships (a disjunction of one or more of the RCC-5 base relationships) between concepts.
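To make the five RCC-5 base relationships concrete, the following Python sketch names the relationship that holds between two concepts when each is represented extensionally, as a set of members. This is only an illustration under that assumption: in the approach described above, articulations are asserted by experts and reasoned over logically rather than computed from member lists, and the function name and the member labels (sp1, sp2, ...) are hypothetical.

```python
from typing import Set


def rcc5(a: Set[str], b: Set[str]) -> str:
    """Name the RCC-5 base relationship between two non-empty member sets."""
    if not a or not b:
        raise ValueError("this sketch only handles non-empty sets")
    if a == b:
        return "equality"
    if a < b:  # every member of a is in b, and b has more
        return "proper inclusion"
    if a > b:  # every member of b is in a, and a has more
        return "inverse proper inclusion"
    if a & b:  # shared members, but neither contains the other
        return "overlap"
    return "disjointness"


print(rcc5({"sp1", "sp2"}, {"sp1", "sp2"}))         # equality
print(rcc5({"sp1"}, {"sp1", "sp2"}))                # proper inclusion
print(rcc5({"sp1", "sp2", "sp3"}, {"sp1", "sp2"}))  # inverse proper inclusion
print(rcc5({"sp1", "sp2"}, {"sp2", "sp3"}))         # overlap
print(rcc5({"sp1"}, {"sp2"}))                       # disjointness
```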
More recently, Franz et al. (2016, 2019) and Cheng et al. (2017) applied this approach to a variety of complex biological use cases, and also extended it to the challenge of reconciling concepts from traditional Linnaean nomenclature with clades in a phylogenetic tree, as well as aligning clade concepts from competing phylogenetic hypotheses. Although this approach is evidently useful for the problem of computationally reconciling taxon concepts, each new input taxonomy or phylogenetic hypothesis to be reconciled requires a considerable amount of effort from trained human experts to create the articulations and constraints, and the resulting assertions still do not disambiguate or make computable the original intensional semantics of a taxon concept. Therefore, it does not make the exercise of repurposing Linnaean names for clades in a phylogenetic tree a less subjective and manual approximation than it necessarily is, because the concepts at hand are fundamentally different in nature.

Challenges and Limitations
Previous proposals to replace the Linnaean system with a purely phylogenetic alternative have proven to be very controversial. As our proposal does not concern taxonomic nomenclature or classification, many of these controversies are not directly relevant. However, there are various ways in which opponents might object to the arguments in this paper. We respond to these briefly, and point to limitations and challenges for our approach.

Specifiers
One of the greater challenges in applying phyloreferences on a larger scale, and across different phylogenetic trees, is that phylogenetic definitions are "anchored" by the specifiers designating the taxon concepts that are to be included or excluded. Therefore, resolving a phyloreference on a tree necessarily requires that the anchoring taxon concepts of a phyloreference, and the taxon concepts linked to (typically terminal) nodes in a phylogeny, can be "matched" by a reasoner. More specifically, these taxon concepts need to be defined such that the reasoner can infer when a taxon concept used in the phyloreference is congruent with, or includes, a taxon concept linked to a tree node. In some cases such a match will be exact and unambiguous, for example, if the specifier and node-linked taxon concept are referenced to the same globally unique identifier.
In practice, matching specifiers between phyloreference and phylogeny is an inherently nontrivial problem, and matches will range from unambiguous to approximate. For example, if taxon concept references are, as will commonly be the case, Linnaean taxon names, even an exact match is not necessarily free of ambiguity, such as when the names are not demonstrably drawn from the same taxonomy. Indeed, this is the taxonomic name resolution problem that arises whenever Linnaean taxon names must be reconciled, and the confidence in name matches will follow the familiar spectrum. Especially for phylogenies with incomplete taxon sampling, a taxon concept used as specifier in a phyloreference may also be altogether absent from a tree. The question is, then, whether or not one of the taxon concepts present on the tree can substitute for the specifier without changing the semantics of the clade definition. Whether this is possible or not will in turn depend on the definition of the clade and the phylogeny at hand on which it is to be recovered, and may require sophisticated algorithms to determine.
Phyloreferences by themselves do not obviate the need to match or reconcile Linnaean taxon names. However, this is due to the prevailing practice of identifying taxon concepts through names, rather than a specific weakness in the phyloreferencing approach; and because phyloreferences are in essence uniquely identifiable ontology terms, this problem and the ambiguity it confers are not re-introduced every time data are linked to a taxon. Furthermore, how and why a taxon concept for a specifier matches one for a node in a tree can be expressed through formal axioms in the same logic framework (i.e., OWL 2 in our case), and thus be documented in a fully reproducible manner. For example, if a target phylogeny lacks a node for Campanula latifolia, but contains a node for Campanula, a "mapping" axiom asserting that the concept Campanula includes Campanula latifolia will allow matching a phyloreference for the Campanuloideae clade that references Campanula latifolia as a specifier that must be included.
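The Campanula example above can be sketched in a few lines of Python. This is a simplified stand-in for OWL axioms, assuming that mapping axioms are recorded as a plain dictionary from a tree-node concept to the concepts it is asserted to include; the names MAPPING_AXIOMS and match_specifier are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical mapping axioms: tree-node concept -> concepts it is asserted to include.
MAPPING_AXIOMS: Dict[str, Set[str]] = {
    "Campanula": {"Campanula_latifolia"},
}


def match_specifier(
    specifier: str,
    node_concepts: List[str],
    axioms: Dict[str, Set[str]],
) -> List[str]:
    """Return the node concepts that match `specifier`, either exactly or
    because a mapping axiom asserts that the node concept includes it."""
    matches = []
    for concept in node_concepts:
        if concept == specifier or specifier in axioms.get(concept, set()):
            matches.append(concept)
    return matches


tree_concepts = ["Campanula", "Lobelia"]
print(match_specifier("Campanula_latifolia", tree_concepts, MAPPING_AXIOMS))  # ['Campanula']
print(match_specifier("Campanula_latifolia", tree_concepts, {}))              # []
```

Because the axioms are explicit data rather than an ad hoc name comparison, the reason for each match stays documented and repeatable, which is the point made above about formal mapping axioms.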
Finally, it is worth emphasizing that the ambiguity inherent in reconciling names by itself does not introduce ambiguity into the semantics of the clade definition, though it does render recovering the clade semantics on phylogenies other than the one used by the original author prone to the same problems that beset taxon name matching in general. Creating mapping axioms in an effective and scalable manner may be non-trivial, but we are confident that solutions to address this challenge can and will be developed. In the meantime, the Open Tree of Life offers a comprehensive, even if synthetic, phylogeny that is continuously updated with evolving phylogenetic knowledge, and with names for terminal nodes sourced from dozens of taxonomies (Rees and Cranston 2017).

Genealogical discordance
It is well known that, due to phenomena such as lateral gene transfer, hybridization, introgression, and others, evolution is often not tree-like across all domains of life, including Archaea, bacteria, and fungi. One might worry then that the phyloreferences proposed here are not suitable for capturing groups whose evolutionary relations are more suitably represented by a network than by a bifurcating pattern. Although phylogenies are hierarchical, with clades that are either nested or mutually exclusive, reticulation due to different biological processes results in partially overlapping clades, with hybrid lineages belonging to both parental clades. Partially overlapping clades can, in fact, be phylogenetically defined, which demonstrates the flexibility of this approach. For example, Crowl and Cellinese (2017) illustrate how phylogenetic definitions apply to lineages derived from hybridization and polyploidy (using ploidy in an apomorphy-based definition), and allow the naming of cryptic diversity.
Phylogenetic reconstructions may generate discordant hypotheses that are best synthesized by networks rather than bifurcating patterns. In considering whether phyloreferences can be meaningfully applied to such networks, note that in principle the key concepts used in our approach for encoding the semantics of a clade definition, namely ancestors and descendants, and taxon concepts included in or excluded from a line of descendants, still fully apply in networks. Hence, there is no theoretical or technical reason that would prevent resolving a phyloreference on a phylogenetic network. Nonetheless, a clade retrieved in this way should be treated with great caution, because at least for now the underlying clade definition will have almost universally been erected based on a phylogenetic tree, not a network. Therefore, the benefit of applying phyloreferences to networks as part of, for example, a data integration project, seems questionable at best.

Adoption cost
One could object that even if phyloreferences are in principle preferable over Linnaean names for integrating data, the cost of adoption would be very high, or high enough to outweigh the benefits. In response, we note but set aside the fact that such an argument would attribute limited value to the problems caused by using the Linnaean system; we disagree that irreproducible science has only limited costs. Nonetheless, we acknowledge that, as for any novel system for indexing data, fully supporting phyloreferencing would likely have a significant engineering cost for a resource such as GBIF, with huge amounts of data that need to be queryable very efficiently by a large user community. This notwithstanding, we find it important to note that phyloreferences can already be taken advantage of right now, including for data integration projects, by tapping into and combining already existing technologies. To sketch out an example, the programming interface (API) to the Open Tree of Life (Rees and Cranston 2017) includes a most recent common ancestor query service that, depending on the input parameters, returns the common ancestor node semantically fully consistent with minimum clade and maximum clade definitions, respectively, that underlie phyloreferences. Additional Open Tree of Life query services can then be used to obtain the species contained by the clade resolved in the previous step, which then in turn allow querying a database indexed by Linnaean names for data associated with the clade. This approach can already be used, for example, to find how phylogenetic vs. Linnaean names can result in different inferences, such as geographical distribution.
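The workflow just outlined (resolve a clade through a most recent common ancestor query, list the species it contains, then query a name-indexed database) can be sketched offline in Python. The sketch below deliberately does not call the Open Tree of Life API; it substitutes a local toy tree and a toy record store, covers only the minimum clade case, and all identifiers, records, and function names are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy phylogeny, stored as child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "n1": "root",
    "Roussea_simplex": "root",
    "n2": "n1",
    "n3": "n1",
    "Campanula_latifolia": "n2",
    "Wahlenbergia_hederacea": "n2",
    "Lobelia_cardinalis": "n3",
    "Lobelia_inflata": "n3",
}

# Hypothetical database indexed by Linnaean names.
RECORDS: Dict[str, List[str]] = {
    "Campanula_latifolia": ["record-A"],
    "Lobelia_cardinalis": ["record-B"],
}


def path_to_root(parent: Dict[str, Optional[str]], node: str) -> List[str]:
    """List `node` followed by all of its ancestors, ending at the root."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def mrca(parent: Dict[str, Optional[str]], tips: List[str]) -> str:
    """Most recent common ancestor of the given tips (minimum clade case)."""
    common = path_to_root(parent, tips[0])
    for tip in tips[1:]:
        other = set(path_to_root(parent, tip))
        common = [n for n in common if n in other]
    return common[0]  # paths run tip-to-root, so the first shared node is the MRCA


def descendant_tips(parent: Dict[str, Optional[str]], node: str) -> List[str]:
    """Terminal nodes that are `node` itself or descend from it."""
    internal = set(parent.values())
    return sorted(
        t for t in parent if t not in internal and node in path_to_root(parent, t)
    )


clade_node = mrca(PARENT, ["Campanula_latifolia", "Wahlenbergia_hederacea"])
species = descendant_tips(PARENT, clade_node)
clade_records = {name: RECORDS[name] for name in species if name in RECORDS}

print(clade_node)     # n2
print(species)        # ['Campanula_latifolia', 'Wahlenbergia_hederacea']
print(clade_records)  # {'Campanula_latifolia': ['record-A']}
```

In a real project the two tree functions would be replaced by calls to the corresponding tree services, and RECORDS by the name-indexed database being queried.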

Final Remarks
We strongly believe we are at a crossroads where the idiosyncratic application of Linnaean nomenclature and taxonomy to the way we discover and name taxa is simply untenable in the age of computationally driven science. Linnaean names suffer from incurable theoretical and practical shortfalls (see Sterner and Franz 2017). We suggest that phyloreferencing lays the foundation for an informatics infrastructure that enables using the Tree of Life to organize, query, and navigate our knowledge of biodiversity. Building this foundation now is timely. Large phylogenies encompassing diverse groups across the Tree of Life are published in increasing numbers (e.g., Smith et al. 2011; Hinchliff et al. 2015; Smith and Brown 2018; Howard et al. 2019). Especially for large tree synthesis projects, the need for phyloreferencing has already arisen, because it is the basis for persistently and reproducibly linking data and metadata to internal nodes (i.e., clades) in the tree. There are also parts of the Tree of Life for which a stunning organismal and trait diversity is only just beginning to be characterized, and for which the traditional fallback of Linnaean names is hardly available, and perhaps never will be (e.g., microbial diversity, and population-level diversity). Yet, the ability to unambiguously refer to these groups is necessary, not least to organize, query, and retrieve our knowledge about any group of interest. In contrast to Linnaean names, phylogenetic definitions can be created using any identifiable object, including specimens, samples, and sequences. If appropriately labeled and distributed in community-vetted ontologies, phyloreferences can provide names and concepts that allow researchers to communicate data and knowledge about their groups, yet also have fully computable and thus reproducible semantics built in.
One of the key goals of phyloreferences is to enable computationally querying, navigating, integrating, and visualizing any data linked to groups of organisms, in a way that is driven by evolutionary relatedness. We have argued that merely repurposing Linnaean names onto trees cannot achieve this goal. Phyloreferences allow us to compare parts of the Tree of Life about which we would otherwise not be able to communicate. Consequently, the number of phylogenetic taxon definitions being published has already increased rapidly in recent years across multiple domains, signifying that phylogenetic approaches to diagnose taxonomic groups and their names are being increasingly widely adopted. Ideally, every clade discovered should bear a definition. When translated into formal phyloreferences, the semantics of these definitions not only become fully accessible to machines but, once curated into a community ontology, also become much more findable and reusable than when buried in the text of publications.
We believe that a phylogenetic data synthesis encompasses far more than a challenging topological synthesis. The approach we propose is native to tree-thinking and completely flexible because phyloreferences adapt seamlessly to changes in phylogenetic knowledge and would therefore apply to small and large topologies and syntheses. In view of the upcoming publication of the PhyloCode and the ever-increasing number of published phylogenetic definitions, now is the time to envision the Tree of Life as a navigable map where clade definitions (taxon concepts) serve as physical addresses and phyloreferences provide the means to achieve a retraceable navigation.

Figure 1: Upper half: phylogeny of Campanuloideae redrawn from Crowl et al. (2016) showing the polyphyly of Campanula (lineages in blue). Lower half: distribution of Campanula as retrieved from a GBIF query.

Figure 3: The phylogeny of Asterales showing the clade Campanulaceae with its five lineages, the sister group Rousseaceae, and other related lineages (adapted from Steven 2017).