Generating Interpretation of Graphs in Scientific Articles Using Deep Learning

Introduction & Motivation
Data is growing exponentially worldwide: 90 % of it was generated in the last two years alone [1]. This trend is expected to continue at the same pace over the next few years, producing volumes of data that are hard for the human mind to grasp. This inevitably makes us wonder how we will deal with data in the future and what that data will look like.

Balnojan [2], like many other academics and data specialists, predicts that much of the data we will be working with in six years’ time will be in image form. This data will mainly serve to give consumers an easy way to find desired products online through images on their mobile devices. We can make sense of such image data through computer vision and deep learning. These technologies have improved immensely in recent years and even surpass human vision in many cases. This suggests that we are on the right path to handling this vast amount of data in real time with AI technologies.

Computer vision’s most impressive use cases range from cancer detection in medicine and performance assessment in sports to autonomous vehicles [3]. Inspired by the advances in deep learning that improve our lives, I would like to report on an interesting use case of deep learning on image data.

This use case came about as part of my master’s thesis at the Berlin School of Economics and Law. Like the exponential growth of data mentioned above, the number of scientific papers published worldwide is growing rapidly [4]. Academics therefore face the challenge of an exhaustive literature review. In fact, it is no longer just a challenge but has become a nearly impossible task. When working on scientific research, we need to go through all the available scientific work so that we can differentiate our work from what has already been contributed. This becomes even harder when we consider that only approximately 30 % of the results returned by online libraries are relevant to the searched topic.

This is where the CauseMiner software [5] comes in. Developed by Müller and Hüttemann [5], it uses natural language processing, linguistic rules, and text mining to analyse the contents of scientific articles and extract the main ideas behind the research. This way, researchers can quickly decide whether a scientific paper is relevant and thus worth their time.

However, the main hypotheses or ideas behind scientific work are not only expressed through text. Researchers often also provide visual graphs that concisely express the main hypotheses and their relationships through nodes and edges (see figure 1). These graphs are also known as graphical research models. Having the contents of these graphs as an output of the CauseMiner software would be a good extension. This was the goal of the master’s thesis: to take the image data of these graphical research models and extract the information using deep learning models.

Figure 1: Example of a graphical research model, retrieved from [6]

Methodology
In this article, I would like to show step by step how you can use deep learning to extract insightful information from image data.

The main goal of the thesis was to assess the capabilities of deep learning technologies to extract the structure of graphical research models in image form. I would like to point out that we are only interested in the structure of these graphs. The text they contain can be extracted with other technologies, such as the OCR engine Tesseract [7].

I needed image data extracted from research papers. Clark and Divvala [8] have developed a framework that extracts all image data from a corpus of papers. This provides us with a good amount of “real” figures. I had to go through all these images and pick out those that could be classified as graphical research models. These images then became part of the dataset on which I ran my deep learning models.

Unfortunately, my dataset was not yet ready for the deep learning models. During my research, I found that the most suitable deep learning models for extracting information from the graphs in text form were image captioning models and instance segmentation models. Both need labelled training data: captions in one case, annotated masks in the other.

Image captioning models generate a textual description of an image. Instead of writing a description for every image myself, I could generate the needed data using a Python library, Graphviz [9].
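
The data generation step might look roughly like the sketch below. It is not the thesis code: the function name `make_sample`, the caption wording, and the graph sizes are assumptions made for illustration. To stay free of dependencies, it writes Graphviz's DOT language by hand; rendering the resulting text to an image (for example with the `dot` command-line tool) is left out.

```python
import random


def make_sample(num_nodes, num_edges, seed=None):
    """Return (dot_source, caption) for one random directed graph.

    Hypothetical helper: the caption format is an illustrative assumption.
    """
    rng = random.Random(seed)
    names = [f"H{i + 1}" for i in range(num_nodes)]
    # All ordered pairs of distinct nodes are candidate edges.
    candidates = [(a, b) for a in names for b in names if a != b]
    edges = rng.sample(candidates, min(num_edges, len(candidates)))

    # DOT source that Graphviz can render into an image.
    lines = ["digraph G {", "    node [shape=box];"]
    lines += [f"    {name};" for name in names]
    lines += [f"    {a} -> {b};" for a, b in edges]
    lines.append("}")
    dot_source = "\n".join(lines)

    # The caption is the textual description paired with the image.
    caption = " . ".join(f"{a} leads to {b}" for a, b in edges)
    return dot_source, caption


dot_source, caption = make_sample(num_nodes=4, num_edges=3, seed=0)
print(dot_source)
print(caption)
```

Because the graph is generated, its description is known exactly, so every rendered image comes with a matching caption at no annotation cost.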

For the instance segmentation model, on the other hand, I could not generate any data and had to annotate the images myself. I did this with an open-source tool, the VGG Image Annotator [10]. Instance segmentation models are deep learning models that detect objects in images and generate a mask on top of each one. In other words, they assign the pixels of each detected object to that object and its class. I decided to detect only a few classes, as the amount of data available was limited: nodes, edges (lines), arrows, and edge labels.
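
To give an idea of what such annotations look like, the sketch below reads polygon regions from a VIA-style JSON export. The layout (`regions`, `shape_attributes`, `all_points_x`, `all_points_y`) is assumed to follow the VIA 2.x JSON export; the attribute name `class`, the file name, and the coordinates are made up for illustration.

```python
import json

# Made-up example in the shape of a VIA 2.x JSON export (assumed layout).
VIA_EXPORT = """
{
  "model_01.png12345": {
    "filename": "model_01.png",
    "size": 12345,
    "regions": [
      {"shape_attributes": {"name": "polygon",
                            "all_points_x": [10, 110, 110, 10],
                            "all_points_y": [10, 10, 60, 60]},
       "region_attributes": {"class": "node"}},
      {"shape_attributes": {"name": "polygon",
                            "all_points_x": [110, 200, 200, 110],
                            "all_points_y": [30, 30, 40, 40]},
       "region_attributes": {"class": "line"}}
    ],
    "file_attributes": {}
  }
}
"""


def load_polygons(via_json, label_key="class"):
    """Map each file name to a list of (class_name, [(x, y), ...]) tuples."""
    result = {}
    for entry in json.loads(via_json).values():
        polygons = []
        for region in entry["regions"]:
            shape = region["shape_attributes"]
            if shape["name"] != "polygon":
                continue  # this sketch ignores rectangles, circles, etc.
            points = list(zip(shape["all_points_x"], shape["all_points_y"]))
            polygons.append((region["region_attributes"][label_key], points))
        result[entry["filename"]] = polygons
    return result


annotations = load_polygons(VIA_EXPORT)
for class_name, points in annotations["model_01.png"]:
    print(class_name, points)
```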

With my datasets ready, I was finally free to experiment with deep learning models. As an image captioning model, I chose the captioning model with visual attention developed by Xu et al. [11]. This approach did not turn out to be successful with my data. I will therefore not go into further detail on it, other than to point out that in my experiments it was not able to generate the structure of graphical research models.

The instance segmentation model, on the other hand, delivered very promising results. I used Mask R-CNN [12], which is an extension of the object detection model Faster R-CNN [12]. I chose this model because it was one of the state-of-the-art solutions at the time I was working on the thesis and because of the extensive literature available on it.

A short explanation of the Mask R-CNN model architecture:
An image is processed as follows. First, it is passed to a convolutional neural network (CNN), the backbone network for feature extraction. Convolutional neural networks are multi-layer neural networks that are applied to visual imagery and can learn local patterns or features in images, which they can later recognize in different locations [13]. The extracted features are fed to the Region Proposal Network, which scores a set of anchor boxes and proposes the regions that are likely to contain objects. The RoIAlign layer then extracts a fixed-size feature map for each proposed region while preserving its exact spatial alignment. Next, fully connected layers process the proposed regions and feed two output layers, one for object classification and one for bounding box refinement. In parallel, the mask branch of Mask R-CNN processes the same region features and generates the segmentation masks [14].

Figure 2: The architecture of Mask R-CNN [14]
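
One detail worth illustrating is how RoIAlign keeps the spatial alignment: rather than rounding a region's coordinates to whole feature-map cells, it samples the feature map at fractional positions using bilinear interpolation. The sketch below shows only that sampling step on a tiny made-up feature map; the function name `bilinear_sample` is hypothetical, and a real RoIAlign layer additionally pools several such samples per output cell.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Interpolate a 2D feature map (list of rows) at fractional (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so the four neighbours always lie inside the map.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


features = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]
print(bilinear_sample(features, 0.5, 0.5))  # halfway between four cells
print(bilinear_sample(features, 1.0, 1.0))  # exactly on a cell
```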

I trained the model on 245 annotated images of graphs, or 435 images when counting both real and augmented ones. I achieved an overall accuracy of 87 %. This was a very satisfying result considering the limited amount of training data available.

Figure 3 shows the tasks performed by Mask R-CNN: object detection, semantic segmentation, and instance segmentation.

Figure 3: Tasks performed by Mask R-CNN (object detection, semantic segmentation, instance segmentation)

As you might have already noticed, this deep learning model only delivers the objects with their segmentation masks and classes, not the desired graph structure. I could, however, use the information provided by Mask R-CNN to generate the graph structures (see figure 4).

Figure 4: An annotated graph used to generate the graph structure, retrieved from [15]

With the segmentation masks visualized on top of an image, one can tell just by looking that the masks intersect with each other (see figure 4). This information leads us to the structure of the graph: a node intersects with a line, the line with an arrow, and so on. In Python, we can use Shapely [16], a package for “manipulation of planar geometric objects”, to identify whether our generated segmentation masks intersect. After converting the segmentation masks to Shapely polygons, we save them in a dictionary that contains a generated element ID, the class name of the element, the polygon, and, for every node and edge label, the cropped image. We also use Shapely to generate a set of intersecting ID pairs.

Figure 5: Annotated graph with polygons and their IDs, retrieved from [14]
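
The intersection step can be sketched as follows. The element IDs, class names, and coordinates are made-up examples, the dictionary layout is only an assumption about how such a structure might look, and the masks are assumed to have been converted to polygon outlines already.

```python
from itertools import combinations

from shapely.geometry import Polygon

# Made-up elements: ID -> class name and Shapely polygon.
elements = {
    1: {"class": "node", "polygon": Polygon([(0, 0), (40, 0), (40, 20), (0, 20)])},
    2: {"class": "line", "polygon": Polygon([(38, 9), (80, 9), (80, 11), (38, 11)])},
    3: {"class": "arrow", "polygon": Polygon([(78, 5), (90, 10), (78, 15)])},
    4: {"class": "node", "polygon": Polygon([(88, 0), (130, 0), (130, 20), (88, 20)])},
}


def intersecting_pairs(elements):
    """Return the set of ID pairs whose polygons touch or overlap."""
    pairs = set()
    for (id_a, a), (id_b, b) in combinations(elements.items(), 2):
        if a["polygon"].intersects(b["polygon"]):
            pairs.add((id_a, id_b))
    return pairs


print(sorted(intersecting_pairs(elements)))
```

Since predicted masks do not always meet exactly, slightly enlarging each polygon with Shapely's `buffer` method before the test is one possible refinement.
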
We pass the cropped image of every node and edge label through Tesseract, which returns the text it contains, and save this text in the dictionary as well. Now we have all the information we need to generate the structure of our graphical research models: the intersecting ID pairs, the classes, and the text on nodes and edge labels.

Through the intersecting ID pairs, we can identify the structure of a graph by following the path from one element to another. For example, if elements 1 and 2 intersect and elements 2 and 5 intersect as well, we can conclude that element 1 leads to element 2 and element 2 to element 5. This is how we arrive at the full structure of our graphs. However, developing an algorithm that does this properly would be very time-consuming. We therefore use NetworkX [17], a “python package for creation, manipulation, and the analysis of the structure of complex networks”. With the help of NetworkX, we can generate all possible paths in a graphical research model. Together, these paths make up our structure.
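
As a rough illustration of the path generation, the sketch below builds an undirected NetworkX graph from made-up intersecting ID pairs and lists all simple paths between every pair of node elements. The element IDs and classes are hypothetical, and restricting the search to paths that start and end at a node element is an assumption of this sketch.

```python
from itertools import combinations

import networkx as nx

# Made-up elements and intersecting ID pairs.
classes = {1: "node", 2: "line", 3: "arrow", 4: "node",
           5: "line", 6: "arrow", 7: "node"}
id_pairs = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)}

graph = nx.Graph()
graph.add_edges_from(id_pairs)


def node_to_node_paths(graph, classes):
    """All simple paths between pairs of elements classified as nodes."""
    node_ids = sorted(i for i, c in classes.items() if c == "node" and i in graph)
    paths = []
    for source, target in combinations(node_ids, 2):
        paths.extend(nx.all_simple_paths(graph, source, target))
    return paths


for path in node_to_node_paths(graph, classes):
    print(" -> ".join(f"{classes[i]} {i}" for i in path))
```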

Let’s wrap up the main steps we took to arrive at our generated graph structures:

  1. Generating instance segmentation masks with Mask R-CNN.
  2. Converting the generated masks to polygons.
  3. Creating dictionaries that contain the polygons, an identifying key, and their classes (node, arrow, edge label, line).
  4. Using Tesseract to detect the text on every node and edge label and attaching this text to the corresponding element in the dictionary.
  5. Determining which elements (polygons) intersect with each other. If two elements intersect with each other, then they are connected.
  6. Creating ID pairs of intersecting elements and identifying all the possible network paths in the graph. These paths are the generated structure of the graphs.

This is what we have retrieved from the graph shown in Figure 5:

Figure 6: Generated structure

Note that the order of line and arrow defines the direction of the connection, as shown by the red arrows in Figure 6.
Now that we have finally extracted the information behind graphical research models, we can add it to the CauseMiner software with a little refinement.
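
The direction rule can be sketched as below. This is an illustrative reading of the rule, not the thesis implementation: a path is given as a list of `(class, text)` tuples, the helper name `directed_edge` and the node texts are hypothetical, and the arrow head is assumed to sit next to the target node.

```python
def directed_edge(path):
    """Turn one node-to-node path into a (source_text, target_text) edge.

    `path` lists (class_name, text) tuples from one node to the next.
    If the arrow comes after the line, the edge points forward along the
    path; if the arrow comes before the line, it points backward.
    """
    first, last = path[0], path[-1]
    if first[0] != "node" or last[0] != "node":
        raise ValueError("path must start and end at a node")
    kinds = [kind for kind, _ in path[1:-1]]
    if "line" not in kinds or "arrow" not in kinds:
        return None  # direction cannot be determined
    if kinds.index("line") < kinds.index("arrow"):
        return first[1], last[1]
    return last[1], first[1]


forward = [("node", "Trust"), ("line", ""), ("arrow", ""), ("node", "Intention")]
backward = [("node", "Trust"), ("arrow", ""), ("line", ""), ("node", "Intention")]
print(directed_edge(forward))
print(directed_edge(backward))
```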

Final Words
We’ve gone through all the steps of solving a problem with deep learning: from creating an appropriate dataset, to training a deep learning model, to using its output to generate insightful information. I would like to point out how powerful data can be and how much knowledge we can extract from it. With a little imagination, we can probably find many use cases for different datasets, and who knows what we can achieve…

References:
[1] D. Reinsel, J. Gantz, J. Rydning, “The Digitization of the World From Edge to Core,” 2018. Retrieved May 5, 2022.
[2] S. Balnojan, “The Future Of Good Data – What You Should Know Now!” 2020. Retrieved May 5, 2022.
[3] V. Meel, “87 Most Popular Computer Vision Application in 2022,” 2022. Retrieved May 5, 2022.
[4] A. E. Jinha, “Article 50 million: an estimate of the number of scholarly articles in existence,” Learned Publishing, vol. 23, no. 3, pp. 258–263, 2010.
[5] R. M. Mueller and S. Hüttemann, “Extracting Causal Claims from Information Systems Papers with Natural Language Processing for Theory Ontology Learning,” in Proceedings of the 51st Hawaii International Conference on System Sciences, pp. 5295–5304, 2018.
[6] V. Cova, V. Abbas, “The cultural aspect in the relationship customer-place: Proposal and test of an integrated model,” 2018. Retrieved May 5, 2022.
[7] A. Rosebrock, “OpenCV OCR and text recognition with Tesseract – PyImageSearch,” 2018. Retrieved August 8, 2020.
[8] C. Clark and S. Divvala, “PDFFigures 2.0: Mining figures from research papers,” in Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries – JCDL ’16, pp. 143–152, ACM Press, 2016.
[9] “Graphviz – graph visualization software.”
[10] A. Dutta and A. Zisserman, “The VIA annotation software for images, audio and video,” in Proceedings of the 27th ACM International Conference on Multimedia, pp. 2276–2279, ACM.
[11] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proceedings of the 32nd International Conference on Machine Learning – Volume 37, ICML’15, pp. 2048–2057, JMLR.org, 2015.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
[13] F. Chollet, Deep learning with Python. New York: Manning Publications Co, 2018. OCLC: ocn982650571
[14] J. Hui, “Image segmentation with mask R-CNN,” 2019. Retrieved July 29, 2020.
[15] A. Azadegan and G. Kolfschoten, “An assessment framework for practicing facilitator,” Group Decision and Negotiation, pp. 1013–1045, 2012.
[16] “Shapely – shapely 1.8dev documentation.”
[17] “NetworkX – NetworkX documentation.”