Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures

Bibliographic Details
Published in: arXiv.org (Nov 15, 2024), p. n/a
Main author: Satyarth, Ishna
Other authors: Yin, Chao; Xu, RuQing G; Matthews, Devin A
Published: Cornell University Library, arXiv.org
Subjects: Parallel processing; Computer memory; Linear algebra; Matrices (mathematics); Formal method; Algorithms; Numerical stability; Machine learning; Determinants; Electronic structure; Symmetry; Factorization
Links: Citation/Abstract; Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3129864273
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3129864273 
045 0 |b d20241115 
100 1 |a Satyarth, Ishna 
245 1 |a Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures 
260 |b Cornell University Library, arXiv.org  |c Nov 15, 2024 
513 |a Working Paper 
520 3 |a The factorization of skew-symmetric matrices is a critically understudied area of dense linear algebra (DLA), particularly in comparison to that of symmetric matrices. While some algorithms can be adapted from the symmetric case, their cost can be reduced by exploiting skew-symmetry. A motivating example is the factorization \(X=LTL^T\) of a skew-symmetric matrix \(X\), which is used in practical applications as a means of computing the determinant of \(X\) as the square of the (cheaply computed) Pfaffian of the skew-symmetric tridiagonal matrix \(T\), for example in fields such as quantum electronic structure and machine learning. Such applications also often require pivoting in order to improve numerical stability. In this work we explore a combination of known literature algorithms and new algorithms recently derived using formal methods. High-performance parallel CPU implementations are created, leveraging the concept of fusion at multiple levels to reduce memory-traffic overhead, as well as the BLIS framework, which provides high-performance GEMM kernels, hierarchical parallelism, and cache blocking. We find that operation fusion and improved use of available bandwidth via parallelization of bandwidth-bound (level-2 BLAS) operations are essential for obtaining high performance, while a concise C++ implementation provides a clear and close connection to the formal derivation process without sacrificing performance. 
653 |a Parallel processing 
653 |a Computer memory 
653 |a Linear algebra 
653 |a Matrices (mathematics) 
653 |a Formal method 
653 |a Algorithms 
653 |a Numerical stability 
653 |a Machine learning 
653 |a Determinants 
653 |a Electronic structure 
653 |a Symmetry 
653 |a Factorization 
700 1 |a Yin, Chao 
700 1 |a Xu, RuQing G 
700 1 |a Matthews, Devin A 
773 0 |t arXiv.org  |g (Nov 15, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3129864273/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2411.09859
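The determinant-via-Pfaffian identity described in the abstract (MARC 520) can be illustrated with a small numeric sketch. This is a hypothetical NumPy example, not the paper's C++/BLIS implementation: for a skew-symmetric tridiagonal matrix \(T\) of even order, \(\mathrm{Pf}(T)\) is the product of every other superdiagonal entry, and since \(X = LTL^T\) with unit-triangular \(L\), \(\det(X) = \det(T) = \mathrm{Pf}(T)^2\).

```python
import numpy as np

def pfaffian_tridiagonal(T):
    # For a skew-symmetric tridiagonal matrix T of even order n, the
    # Pfaffian is the product of every other superdiagonal entry:
    #   Pf(T) = T[0,1] * T[2,3] * ... * T[n-2,n-1].
    n = T.shape[0]
    assert n % 2 == 0, "the Pfaffian of an odd-order skew-symmetric matrix is 0"
    return np.prod([T[i, i + 1] for i in range(0, n, 2)])

# Build a random skew-symmetric tridiagonal T (illustrative test data).
rng = np.random.default_rng(0)
n = 6
a = rng.standard_normal(n - 1)          # superdiagonal entries
T = np.diag(a, 1) - np.diag(a, -1)      # T^T = -T, tridiagonal

pf = pfaffian_tridiagonal(T)
# det(T) = Pf(T)^2; with X = L T L^T and unit-triangular L, det(X) = det(T).
assert np.isclose(np.linalg.det(T), pf**2)
```

The Pfaffian of the tridiagonal factor costs only O(n) multiplications, which is why the \(LTL^T\) route is attractive for determinant (and sign-of-Pfaffian) evaluation compared with a generic LU-based determinant.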