My life with the Boost Graph Library

The article, the first part of which is presented here, contains various considerations of the author that have accumulated during the long development of a specialized system for searching for social connections, based on the Boost Graph Library (BGL). This (technical) section summarizes the author’s impressions of working with this library, raises issues of instrumentation when creating graph applications, and touches on some practical problems of metaprogramming in C ++.



BGL and what it is eaten with



The BGL template library is probably known to any developer who has encountered graph tasks. Appearing in Boost 1.18.1 in 2000, she immediately earned approving reviews from such classics of the genre as Alexander Stepanov. The library guide, compiled by Jeremy Sik, Lai-Kwan Lee, and Andrew Lamsdane, was published in Russian in 2006 by Peter (original - Jeremy G. Siek, Lie-Quan Lee and Andrew Lumsdaine, “The Boost Graph Library”, 2001 , Addison-Wesley). The library was intensively updated and developed almost until the end of 2013 (Boost 1.55.0). In particular, in 2005, the announcement of its distributed version (PBGL) appeared, which was included in Boost from version 1.40 in 2009 and to this day remains a kind of de facto standard for graph computing on high-performance clusters, in any case, in the academic world. As far as the history of commits can be judged, until 2005, the main developer of the library was Jeremy Sick, after 2005 - Douglas Gregor, and in general at various times a considerable amount of diverse people worked on the library. The publications devoted to it have repeatedly appeared on habr.com: first of all, a series of articles by Vadim Androsov should be noted: [ 1 , 2 , 3 ]. Thus, in principle, a good and diverse literature is devoted to the library, but its own documentation, also, generally speaking, quite extensive, suffers somewhat due to the fact that:



  1. Its table of contents and root sections, which claim to provide an exhaustive list of key entities, have not changed since 2001. For example, the author of these lines, who naively believed that:
    The BGL currently provides two graph classes and an edge list adapter:



    adjacency_list

    adjacency_matrix

    edge_list
    , after some time I was surprised to find the compressed_sparse_row_graph (sparse matrix) representation implemented in 2005. A similar story took place with the Bron-Kerbosch algorithm. Do not believe the table of contents, use a direct search in the header files;

  2. There is no single commented list of the library’s internal categories (container_category, parallel_edge_traits, iterator_stability, etc., etc.) needed to implement your own views. Problems with understanding what is happening overtake, apparently, all users of the library who want to dig deeper, which leads to the appearance of a “kind of working code”, bringing it to a completely complete state takes a lot of time and effort: see, for example, a typical discussion .


The number of categories and various selectors, including those that are confusingly similar, is so great that the authors themselves sometimes get confused in them. For example, in the constructors of the compressed_sparse_row_graph already mentioned above, in the current version there is a systematic error leading to crashes when trying to copy an undirected adjacency list:







It may be noted here on occasion that the full testing of such a flexible mechanism is a separate problem, since it is accompanied by a combinatorial explosion of the number of possible substitutions.



It should be noted with regret that at present the main developers have apparently lost interest in further work on the library, and for the last six years it, by no means having exhausted its development potential and not even completely freeing itself from internal inconsistencies and direct errors, is in free flight. The plans voiced in the region of 2011 to significantly expand the set of methods and cover new areas of graph theory (including by adding internal graph partitioning support to the ability to read METIS format) remained unfulfilled. It also seems that the library could benefit a lot (at least in terms of readability) from the widespread use of new products that became standard after 2011.



Thus, when looking from 2019, the issues of choosing a reference library for graph applications do not look as unambiguous as we would like, and over the past 5 years, the uncertainty has increased rather than decreased.



This situation causes some sadness, because the creation of a universal mechanism, like BGL, is in itself a kind of intellectual feat, both in terms of the power of the approach and the richness of the arsenal of implemented universal methods (a good one and a half hundred single-threaded and a couple dozen distributed) the library, as far as the author of these lines is known, still has no equal.



At the moment, only this library fundamentally allows, without loss of performance, imposing strict agreements on the presentation of data and loss of control over the internal mechanisms of the library itself, to completely separate graph algorithms and graph representations, additionally making the latter completely independent of the presentation of metadata associated with edges and vertices ( which, in principle, is obviously the most correct way of doing things).



The word “fundamentally” is used here for a reason. Considering the specific situation on the example of the long-suffering compressed_sparse_row_graph class already mentioned above, we can note, for example, the following deviations from high standards:



  1. The [] operator for the adjacency list and the sparse matrix handle Internal and Bundled Properties differently: the first returns only the external properties (the internal ones are accessible only using property_map), the second returns the framing property structure containing the general list of properties.
  2. The get function to get the index of the edge using boost :: property_map <compressed_sparse_row_graph, boost :: edge_index_t> :: type fell into boost :: detail, and not into boost, as in all other cases.


Finally, in the compressed_sparse_row_graph template, the specialization for the undirected graph (boost :: undirectedS) remained unfulfilled.



In this regard, when using the edge_index property (edge ​​serial number), additional difficulties arise due to the fact that for the adjacency list this property must be explicitly set as internal and as such can be changed arbitrarily, but for an undirected graph its value does not depend on the direction where the rib passes. For a sparse matrix (always directional), it is a built-in constant property_map of a special form (calculated as an index in an array of edges). Accordingly, the values ​​for oncoming edges (representing an undirected graph) cannot change and will always be different.



All these discrepancies lead to the impossibility of "simply replacing the graph representation with an equivalent one" when calling algorithmic functions, which significantly undermines the main advantage of the library. In practice, in such cases, either an excessive specialization of the code is required, or its processing to exclude elements with different behavior, or such adjustment of the graph templates so that they “behave identically” with different definitions of attributes, or, finally, removal of individual files from the library and creation "Personal boost version."



Additionally, the following, not so significant, inconveniences can be noted:





BGL and other animals



Taking into account a rather long list of specific complaints, it would not be inappropriate to ask: can the author recommend BGL for new projects in 2019. The answer will be this: the author believes that libraries of this style and applications based on them must have a future . As for the choice of a reference library for a particular project, we should seriously consider the instrumentation, not losing sight of the problems listed above. The answer obviously depends on many circumstances, including but not limited to those listed in the following paragraphs:





Probably, ceteris paribus, the use of BGL can be justified either in the case of a very small one-time use (to extrude or copy a working piece of code and forget), or for a large system for which increased flexibility will pay for heavy entry and other costs over time. In other cases, it makes sense to carefully study other options.



As for the possible alternatives, their list includes at least the following items :

Title Lemon
Type of library C ++ Template Header
URL lemon.cs.elte.hu
Distributed no
Multithreaded no
OS any
Latest version 2014

Distributed by the archive
Stackoverflow mentions ~ 100 (36 in the [lemon-graph-library] section)
Comment According to some reports, in single-threaded mode significantly exceeds BGL in speed .

The authors' attitude toward multithreading is evident from the following dialogue . In view of the above in the section on PBGL, this position is doubtful.
Title SNAP
Type of library C ++
URL github.com/snap-stanford/snap.git
Distributed no
Multithreaded yes (part of the methods)
OS Linux, Mac, Cygwin
Latest version 2018

The repository is being actively updated.
Stackoverflow mentions <50
Comment One of the largest (over 10 MB of code) network analysis libraries (Network Ananlysis), which has been actively developing for many years. In a strange way, it is comparatively ignored by public attention.

See the description of the ideology of the system . The attitude to the implementation of parallel methods expressed on page 12 is close to the author of this article. Under the operating conditions of a typical modern machine park, it is the most natural. The paradigm shift took place in conditional 2011, to which the LEMON declaration above refers.
Title MTGL
Type of library C ++ Template Header
URL software.sandia.gov/svn/public/mtgl/trunk
Distributed ?
Multithreaded Yes
OS any
Latest version ?
Stackoverflow mentions 3
Comment Mysterious member of the meeting. The library was actively developing between 2005 and 2012. Sources were uploaded in 2017. Status unclear, mention of the project from the Sandia website removed. Ideologically inspired by the same BGL, but the code is completely independent. The total volume of source codes (including numerous tests and examples) reaches 17 Mb. The code looks well designed. See description .
Title igraph
Type of library C
URL github.com/igraph/igraph.git
Distributed no
Multithreaded no
OS any?
Latest version 2014

The repository is being actively updated.

Stackoverflow mentions About 100 in the [igraph] [c ++] and [igraph] [c] sections, and more than 500 in total (for all languages)

Comment Another library of network analysis, apparently, is very popular (mainly among pythonists, etc.). Description here .
Title graph-tool
Type of library C ++ python lib
URL git.skewed.de/count0/graph-tool.git
Distributed no
Multithreaded Yes
OS Judging by the use of autoconf - * nix only, but a simple adaptation to other systems is likely
Latest version 2019
Stackoverflow mentions <20
Comment Another actively developing network analysis library with a long history of commits that directly uses BGL (in the local patched version).

See performance comparison table.
Title LEDA
Type of library C ++
URL www.algorithmic-solutions.com/index.php/products/leda-for-c
Distributed no
Multithreaded ?
OS any
Latest version ?
Stackoverflow mentions ~ 10
Comment Commercial license. A large (and, one might say, old) library for scientific and technological computing, including a graph section. Apparently, it relies on its own infrastructure, and not on stl / boost, and in this sense is archaic.


Of particular general interest is the question of the classification of various software products oriented to work with graphs. Their diversity, not to mention the number, is very large. Without pretending to complete (and even formally correct) classification, you can try, nevertheless, to highlight the following important areas in the development of graph applications:

  1. Graph DBMS (neo4j, etc.).



    Systems of this kind are focused on performing transactional operations on large (distributed disk) graphs. Although the API of such a system can be highly developed, the execution speed of graph algorithms as far as one can judge is not the first priority. The system may not even try to load the entire graph into memory. For graph modification and traversal, declarative languages ​​(SPARQL, Cypher, Gremlin) are supported. Great importance is attached to ensuring continuity with traditional SQL systems.
  2. Graph extensions of big data processing systems working in the map / reduce paradigm (GraphX ​​in Spark, Pegasus and Giraph for Hadoop) and independent cluster systems ( MS Trinity / MS Graph Engine , GraphLab). The first to perform operations on the graph implement the Google Pregel model (but not only it) and can be configured for use including massive parallel computing nodes. Both of them can be used, among other things, as a basis for corporate software projects.



    Although the API of such systems can be quite developed (among other things, GraphX ​​supports SPARQL and Cypher), the main focus when working with them is on solving infrastructure problems. GraphX ​​is characterized by data immutability and a bias in the pipelining of all operations. MS Trinity currently does not include high-level methods and provides only a set of primitives for working with nodes and edges. Systems running on top of Hadoop are, in principle, of little use for solving arbitrary graph problems.

  3. Actually universal tool libraries that implement more or less wide sets of methods (BGL / PBGL, LEMON etc.), including massively parallel ones (nvGraph, Gunrock).



    Based on them, application systems can be created that adapt graph algorithms to specific subject areas.
  4. Systems and libraries specializing in certain complex tasks of universal importance (METIS, paraMETIS, Zoltran: graph partitioning, GraphViz, Gephi: visualization, GraphBLAS: algebraic algorithms for working with graphs, etc.).



    Many independent graph applications can be conditionally assigned to this category, a detailed analysis of which would require too much time. The latter contains applications of all conceivable varieties: academic and commercial, single-user and multi-user, recently appeared and existing for more than a decade, etc.


An obscure but significant part of graph applications is focused on the tasks of Network Analysis and, already, Social Network Analysis (Community Detection). Oddly enough, Link Analysis systems (used, as a rule, by various "crime fighters"), which have some similarities with the system we are developing, are much less common. In all cases, without a special check, it is difficult to determine the nature of the data models used by various systems and the associated performance limitations, supported volumes, sets of operations, etc.



Notes



  1. BGL is not a pure header library, but at the moment the only functionality requiring linking is (rather optional) work with GraphViz format DOT files. Therefore, in the vast majority of cases, there is no need for linking and automatic linking with the proper version of libbost-graph to include BGL headers in the Boost configuration is not provided. Thus, for consistency with the libboost-regex library used by non-header BGL functions, it is convenient to simply plug in the boost \ regex.hpp header from the project code, even if the latter does not use regular expressions.
  2. Additional chaos is introduced by the presence of entities whose apparent equivalence encourages the hunt for (possibly absent) black cats in dark rooms .
  3. Before proceeding to its description (using a specific example, where it manifested itself especially strongly and unpleasantly), we note that the author is among the relatively few lucky people working with a loaded project in a powerful Windows operating system and the God-saved line of MSVC compilers. It is possible that the troubles described below are artifacts of this line of compilers: a variety of particular circumstances make it difficult to conduct a comparative experiment with gcc / clang in the * nix environment. If so, you can only congratulate users of other compilers.
  4. To soften which, in some cases, the recently appeared constexpr if will probably help.
  5. In our case, this led to special attention to the state-saving function, which allows debugging with convenience, first bringing the system to the desired initial state in an optimized assembly.
  6. In my practice, for various reasons, there was a need to convert runtime parameters into template arguments, and quite often I had to resort to, very accurately, very elaborate methods (inspired by the now outdated implementations of boost typeof and boost lambda for C ++ 98 that directly hit treat the programming technique in C ++ as a solution to the rebus), among which the star shines the selection of the argument by dividing in half , but, in general, the main problems with such operations were always associated with the inability to export selected type outside the scope, which gave rise to exotic patterns.
  7. The twenty-fold speed loss indicated above (in absolute figures - about 80 seconds versus 4 on a test directional graph with 50 million vertices and 200 million edges represented as an adjacency list) was obtained on an unprepared (randomly broken) graph when compared with the local version delta walks on an eight-core machine. , . , 6-8 — , .
  8. , . ( , - , . , , , , , , , ).
  9. , , – , , «» (-- ..) . ( , ), , «» , — ( ). , , , - . , , : «» ( ) , «» ( ), , . . , - , « », . , « » ? , , , : – , , , , .
  10. . , , .
  11. «LIBNAME C++ graph» , stackoverflow. , BGL 500 [boost-graph].



All Articles