Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures

Bibliographic Details
Published in: arXiv.org (Nov 15, 2024), p. n/a
Main author: Satyarth, Ishna
Other authors: Yin, Chao; Xu, RuQing G; Matthews, Devin A
Published: Cornell University Library, arXiv.org
Subjects: Parallel processing; Computer memory; Linear algebra; Matrices (mathematics); Formal method; Algorithms; Numerical stability; Machine learning; Determinants; Electronic structure; Symmetry; Factorization
Links: Citation/Abstract; Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3129864273
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3129864273 
045 0 |b d20241115 
100 1 |a Satyarth, Ishna 
245 1 |a Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures 
260 |b Cornell University Library, arXiv.org  |c Nov 15, 2024 
513 |a Working Paper 
520 3 |a The factorization of skew-symmetric matrices is a critically understudied area of dense linear algebra (DLA), particularly in comparison to that of symmetric matrices. While some algorithms can be adapted from the symmetric case, their cost can be reduced by exploiting skew-symmetry. A motivating example is the factorization \(X=LTL^T\) of a skew-symmetric matrix \(X\), which is used in practical applications as a means of computing the determinant of \(X\) as the square of the (cheaply computed) Pfaffian of the skew-symmetric tridiagonal matrix \(T\), for example in fields such as quantum electronic structure and machine learning. Such applications also often require pivoting in order to improve numerical stability. In this work we explore a combination of known literature algorithms and new algorithms recently derived using formal methods. High-performance parallel CPU implementations are created, leveraging the concept of fusion at multiple levels to reduce memory-traffic overhead, as well as the BLIS framework, which provides high-performance GEMM kernels, hierarchical parallelism, and cache blocking. We find that operation fusion and improved use of available bandwidth via parallelization of bandwidth-bound (level-2 BLAS) operations are essential for obtaining high performance, while a concise C++ implementation provides a clear and close connection to the formal derivation process without sacrificing performance. 
653 |a Parallel processing 
653 |a Computer memory 
653 |a Linear algebra 
653 |a Matrices (mathematics) 
653 |a Formal method 
653 |a Algorithms 
653 |a Numerical stability 
653 |a Machine learning 
653 |a Determinants 
653 |a Electronic structure 
653 |a Symmetry 
653 |a Factorization 
700 1 |a Yin, Chao 
700 1 |a Xu, RuQing G 
700 1 |a Matthews, Devin A 
773 0 |t arXiv.org  |g (Nov 15, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3129864273/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2411.09859
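The determinant-via-Pfaffian identity described in the abstract (MARC 520) can be illustrated with a small numeric sketch. This is a hypothetical NumPy example, not the paper's C++/BLIS implementation: for a skew-symmetric tridiagonal matrix \(T\) of even order, \(\mathrm{Pf}(T)\) is the product of every other superdiagonal entry, and since \(X = LTL^T\) with unit-triangular \(L\), \(\det(X) = \det(T) = \mathrm{Pf}(T)^2\).

```python
import numpy as np

def pfaffian_tridiagonal(T):
    # For a skew-symmetric tridiagonal matrix T of even order n, the
    # Pfaffian is the product of every other superdiagonal entry:
    #   Pf(T) = T[0,1] * T[2,3] * ... * T[n-2,n-1].
    n = T.shape[0]
    assert n % 2 == 0, "the Pfaffian of an odd-order skew-symmetric matrix is 0"
    return np.prod([T[i, i + 1] for i in range(0, n, 2)])

# Build a random skew-symmetric tridiagonal T (illustrative test data).
rng = np.random.default_rng(0)
n = 6
a = rng.standard_normal(n - 1)          # superdiagonal entries
T = np.diag(a, 1) - np.diag(a, -1)      # T^T = -T, tridiagonal

pf = pfaffian_tridiagonal(T)
# det(T) = Pf(T)^2; with X = L T L^T and unit-triangular L, det(X) = det(T).
assert np.isclose(np.linalg.det(T), pf**2)
```

The Pfaffian of the tridiagonal factor costs only O(n) multiplications, which is why the \(LTL^T\) route is attractive for determinant (and sign-of-Pfaffian) evaluation compared with a generic LU-based determinant.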