The chemfp project

Bibliographic Details
Published in: Journal of Cheminformatics vol. 11, no. 1 (Dec 2019), p. 1
Author: Dalke, Andrew
Published: Springer Nature B.V.
Subjects:
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 2322055773
003 UK-CbPIL
022 |a 1758-2946 
024 7 |a 10.1186/s13321-019-0398-8  |2 doi 
035 |a 2322055773 
045 2 |b d20191201  |b d20191231 
084 |a 113329  |2 nlm 
100 1 |a Dalke, Andrew  |u Andrew Dalke Scientific AB, Trollhättan, Sweden 
245 1 |a The chemfp project 
260 |b Springer Nature B.V.  |c Dec 2019 
513 |a Journal Article 
520 3 |a The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. Sorting the fingerprints by popcount improves memory coherency, which when combined with 4 OpenMP threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable. Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics. 
653 |a Similarity 
653 |a Computer programs 
653 |a Informatics 
653 |a Source code 
653 |a Organic chemistry 
653 |a Public domain 
653 |a Algorithms 
653 |a Open source software 
653 |a Software 
653 |a Freeware 
653 |a Searching 
653 |a Format 
653 |a Nearest-neighbor 
653 |a Business 
653 |a Fingerprints 
653 |a Latency 
653 |a Funding 
653 |a Benchmarks 
653 |a Economic 
773 0 |t Journal of Cheminformatics  |g vol. 11, no. 1 (Dec 2019), p. 1 
786 0 |d ProQuest  |t Health & Medical Collection 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2322055773/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/2322055773/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
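
Illustrative note on the abstract: the summary mentions Tanimoto similarity, the BitBound algorithm, and sorting fingerprints by popcount. The sketch below shows one way these ideas fit together for a threshold search. It is not chemfp's API or implementation. Fingerprints are modelled as plain Python integers, and every function name here (`popcount`, `tanimoto`, `build_index`, `candidate_range`, `threshold_search`) is hypothetical. The pruning rule relies on the bound Tanimoto(A, B) <= min(a, b) / max(a, b), where a and b are the popcounts, so for a threshold t only targets with popcount between a*t and a/t need to be tested.

```python
from bisect import bisect_left, bisect_right
import math


def popcount(fp: int) -> int:
    """Number of set bits in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: |A & B| / |A | B| (taken as 0.0 when both are empty)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def build_index(fingerprints):
    """Sort fingerprints by popcount; return (popcounts, fingerprints) in that order."""
    ordered = sorted(fingerprints, key=popcount)
    return [popcount(fp) for fp in ordered], ordered


def candidate_range(query_popcount, popcounts, threshold):
    """Slice [start, end) of the sorted index that can still reach the threshold.

    The bounds are widened by a tiny epsilon so floating-point rounding can
    never exclude a valid candidate; every candidate is scored exactly later.
    """
    eps = 1e-9
    lowest = math.ceil(query_popcount * threshold - eps)
    highest = math.floor(query_popcount / threshold + eps)
    return bisect_left(popcounts, lowest), bisect_right(popcounts, highest)


def threshold_search(query, index, threshold):
    """Return (score, fingerprint) pairs with Tanimoto >= threshold (0 < threshold <= 1)."""
    popcounts, fps = index
    q = popcount(query)
    if q == 0:
        return []  # an empty query scores 0.0 against everything
    start, end = candidate_range(q, popcounts, threshold)
    hits = []
    for fp in fps[start:end]:  # only the popcount-compatible slice is scanned
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((score, fp))
    return hits
```

Because the index is sorted by popcount, the candidates form one contiguous slice, which is also why such a layout tends to be friendly to sequential memory access. A production implementation would store fixed-width byte strings and use hardware popcount instructions rather than Python integers.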