Impressum und Nutzungsbedingungen
Cookies & Tracking
Sie sind hier:
DLR Workshop on Quantum Computing for Applications in Science and Industry (QSI)
TiGL Workshop 2018
Dynamic provisioning and execution of HPC workflows using Python
High-performance computing (HPC) workflows over the last several decades have proven to assist in the understanding of scientific phenomena and the production of better products, more quickly, and at reduced cost. However, HPC workflows are difficult to implement and use for a variety of reasons. In this paper, we describe the development of the Python-based cumulus, which addresses many of these barriers. cumulus is a platform for the dynamic provisioning and execution of HPC workflows. cumulus provides the infrastructure needed to build applications that leverage traditional or Cloud-based HPC resources in their workflows. Finally, we demonstrate the use of cumulus in both web and desktop simulation applications, as well as in an Apache Spark-based analysis application.
Migrating legacy Fortran to Python while retaining Fortran-level performance through transpilation and type hints
We propose a method of accelerating Python code by just-in-time compilation leveraging type hints mechanism introduced in Python 3.5. In our approach performance-critical kernels are expected to be written as if Python was a strictly typed language, however without the need to extend Python syntax. This approach can be applied to any Python application, however we focus on a special case when legacy Fortran applications are automatically translated into Python for easier maintenance. We developed a framework implementing two-way transpilation and achieved performance equivalent to that of Python manually translated to Fortran, and better than using other currently available JIT alternatives (up to 5x times faster than Numba in some experiments).
PALLADIO: a parallel framework for robust variable selection in high-dimensional data
The main goal of supervised data analytics is to model a target phenomenon given a limited amount of samples, each represented by an arbitrarily large number of variables. Especially when the number of variables is much larger than the number of available samples, variable selection is a key step as it allows to identify a possibly reduced subset of relevant variables describing the observed phenomenon. Obtaining interpretable and reliable results, in this highly indeterminate scenario, is often a non-trivial task. In this work we present PALLADIO, a framework designed for HPC cluster architectures, that is able to provide robust variable selection in high-dimensional problems. PALLADIO is developed in Python and it integrates CUDA kernels to decrease the computational time needed for several independent element-wise operations. The scalability of the proposed framework is assessed on synthetic data of different sizes, which represent realistic scenarios.
High-performance Python-C++ bindings with PyPy and Cling
The use of Python as a high level productivity language on top of high performance libraries written in C++ requires efficient, highly functional, and easy-to-use cross-language bindings. C++ was standardized in 1998 and up until 2011 it saw only one minor revision. Since then, the pace of revisions has increased considerably, with a lot of improvements made to expressing semantic intent in interface definitions. For automatic Python-C++ bindings generators it is both the worst of times, as parsers need to keep up, and the best of times, as important information such as object ownership and thread safety can now be expressed. We present cppyy, which uses Cling, the Clang/LLVM-based C++ interpreter, to automatically generate Python-C++ bindings for PyPy. Cling provides dynamic access to a modern C++ parser and PyPy brings a full toolbox of dynamic optimizations for high performance. The use of Cling for parsing, provides up-to-date C++ support now and in the foreseeable future. We show that with PyPy the overhead of calls to C++ functions from Python can be reduced by an order of magnitude compared to the equivalent in CPython, making it sufficiently low to be unmeasurable for all but the shortest C++ functions. Similarly, access to data in C++ is reduced by two orders of magnitude over access from CPython. Our approach requires no intermediate language and more pythonistic presentations of the C++ libraries can be written in Python itself, with little performance cost due to inlining by PyPy. This allows for future dynamic optimizations to be fully transparent.
A New Architecture for Optimization Modeling Frameworks
We propose a new architecture for optimization modeling frameworks in which solvers are expressed as computation graphs in a framework like TensorFlow rather than as standalone programs built on a on low-level linear algebra interface. Our new architecture makes it easy for modeling frameworks to support high performance computational platforms like GPUs and distributed clusters, as well as to generate solvers specialized to individual problems. Our approach is particularly well adapted to first-order and indirect optimization algorithms. We introduce cvxflow, an open-source convex optimization modeling framework in Python based on the ideas in this paper, and show that it outperforms the state of the art.
Performance of MPI Codes Written in Python with NumPy and mpi4py
Python is an interpreted language that has become more commonly used within HPC applications. Python benefits from the ability to write extension modules in C, which can further use optimized libraries that have been written in other compiled languages. For HPC users, two of the most common extensions are NumPy and mpi4py. It is possible to write a full computational kernel in a compiled language and then build that kernel into an extension module. However, this process requires not only the kernel be written in the compiled language, but also the interface between the kernel and Python be implemented. If possible, it would be preferable to achieve similar performance by writing the code directly in Python using readily available performant modules. In this work the performance differences between compiled codes and codes written using Python3 and commonly available modules, most notably NumPy and mpi4py, are investigated. Additionally, the performance of an open source Python stack is compared to the recently announced Intel Python3 distribution.
Boosting Python performance on Intel Processors: A case study of optimizing music recognition
We present a case study of optimizing a Pythonbased music recognition application on an selected Intel high performance processor. With support from Numpy and Scipy, Python addresses the requirements of music recognition problem on math library utilization and special structures on data access. However, a general optimized Python application cannot fully utilize the latest high performance multi-core processors. In this study, we survey an existing music recognition application, written in Python, to discover the effect of applying changes to the Scipy and Numpy libraries to achieve full processor resource occupancy and reduce code latency. Instead of comparing across many different architectures, we focus on Intel high performance processors that have multicores and vector registers, and we attempt to preserve both userfriendliness and code scalability so that the revised library functions can be ported to other platforms and require no extra code changes.
ePython: An implementation of Python for the many-core Epiphany coprocessor
The Epiphany is a many-core, low power, low on-chip memory architecture and one can very cheaply gain access to a number of parallel cores which is beneficial for HPC education and prototyping. The very low power nature of these architectures also means that there is potential for their use in future HPC machines, however there is a high barrier to entry in programming them due to the associated complexities and immaturity of supporting tools. In this paper we present our work on ePython, a subset of Python for the Epiphany and similar many-core co-processors. Due to the limited on-chip memory per core we have developed a new Python interpreter and this, combined with additional support for parallelism, has meant that novices can take advantage of Python to very quickly write parallel codes on the Epiphany and explore concepts of HPC using a smaller scale parallel machine. The high level nature of Python opens up new possibilities on the Epiphany, we examine a computationally intensive Gauss-Seidel code from the programmability and performance perspective, discuss running Python hybrid on both the host CPU and Epiphany, and interoperability between a full Python interpreter on the CPU and ePython on the Epiphany. The result of this work is support for developing Python on the Epiphany, which can be applied to other similar architectures, that the community have already started to adopt and use to explore concepts of parallelism and HPC.
Devito: Towards a generic Finite Difference DSL using Symbolic Python
Domain specific languages (DSL) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain scientists to build applications within the comforts of the Python software ecosystem. In this paper we present Devito, a new finite difference DSL that provides optimized stencil computation from high-level problem specifications based on symbolic Python expressions. We demonstrate Devito's symbolic API and performance advantages over traditional Python acceleration methods before highlighting its use in the scientific context of seismic inversion problems.
Mrs: High Performance MapReduce for Iterative and Asynchronous Algorithms in Python
Mrs is a lightweight Python-based MapReduce implementation designed to make MapReduce programs easy to write and quick to run, particularly useful for research and academia. A common set of algorithms that would benefit from Mrs are iterative algorithms, like those frequently found in machine learning; however, iterative algorithms typically perform poorly in the MapReduce framework, meaning potentially poor performance in Mrs as well. Therefore, we propose four modifications to the original Mrs with the intent to improve its ability to perform iterative algorithms. First, we used direct task-to-task communication for most iterations and only occasionally write to a distributed file system to preserve fault tolerance. Second, we combine the reduce and map tasks which span successive iterations to eliminate unnecessary communication and scheduling latency. Third, we propose a generator-callback programming model to allow for greater flexibility in the scheduling of tasks. Finally, some iterative algorithms are naturally expressed in terms of asynchronous message passing, so we propose a fully asynchronous variant of MapReduce. We then demonstrate Mrs’ enhanced performance in the context of two iterative applications: particle swarm optimization (PSO), and expectation maximization.
Call for Papers
PyHPC Workshop Series
PyHPC 2011 (Seattle, USA)
PyHPC 2012 (Salt Lake City, USA)
PyHPC 2013 (Denver, USA)
PyHPC 2014 (New Orleans, USA)
PyHPC 2015 (Austin, USA)
PyHPC 2016 (Salt Lake City, USA)
PyHPC 2017 (Denver, USA)
Copyright © 2022 Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR). Alle Rechte vorbehalten.